Hello,
I have been working on a custom compression algorithm for market trading data. The data is quite large (petabytes), so savings in storage translate directly into visible cost savings. I used ORC as the baseline and extended it with custom encoders for different types of data. These encoders are not meant to replace the standard ORC encoders; they are use-case specific and exploit known redundancies in the data (e.g. predicting the value of one field from the others). For my kind of data I was able to achieve a fairly good improvement over standard ORC, about 48%.

Currently I have to maintain my own fork of the ORC library (Java), which is not ideal: every upstream improvement requires a merge, the fork is hard to integrate into higher-level frameworks such as Spark, and other people cannot easily use my work. My actual target is to be able to use this codec in Databricks.

While working on the implementation, it occurred to me that it would be nice if ORC (Java) had a standard extensibility mechanism: based on a column type, and perhaps some configuration, one could override the standard "Writer" for certain types. For example, I have an improved "Timestamp" writer that exploits patterns in the data (a see-saw pattern) and could be applicable to other data sets as well. It would be nice if I could replace the standard writer for certain fields without modifying the ORC library, and other people could opt in to using my encoder for their data. Ideally, I could simply load my library alongside the default ORC implementation into Spark and have my "plugins" or "extensions" automatically discovered by ORC and integrated. (I put a very rough sketch of what I have in mind at the end of this mail.)

Has anybody thought about something similar? Would it work? Would it be beneficial? What would be the best way to implement something like that, and where would you start?

Thanks,
Denis
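
P.S. Purely to make the idea more concrete, here is a very rough sketch of the kind of plugin interface I imagine, assuming discovery through Java's standard ServiceLoader. All the names here (ColumnEncoderProvider, supports, createEncoder, lookup) are made up by me; nothing like this exists in ORC today, and the return type is just Object so the sketch stands on its own instead of depending on ORC's internal TreeWriter.

import java.util.ServiceLoader;

import org.apache.hadoop.conf.Configuration;
import org.apache.orc.TypeDescription;

/** Hypothetical SPI: a plugin volunteers a custom encoder for a column. */
public interface ColumnEncoderProvider {

  /** Does this provider want to handle the given column with this configuration? */
  boolean supports(TypeDescription column, Configuration conf);

  /**
   * Build the custom encoder. In a real patch this would return ORC's internal
   * writer abstraction (or whatever the community prefers); Object is only a
   * stand-in so the sketch compiles on its own.
   */
  Object createEncoder(TypeDescription column, Configuration conf);

  /**
   * How ORC's writer factory could pick an encoder: providers are discovered
   * from the classpath via ServiceLoader, and the first one that claims the
   * column wins. A null result means "fall back to the built-in writer".
   */
  static Object lookup(TypeDescription column, Configuration conf) {
    for (ColumnEncoderProvider p : ServiceLoader.load(ColumnEncoderProvider.class)) {
      if (p.supports(column, conf)) {
        return p.createEncoder(column, conf);
      }
    }
    return null;
  }
}

With something like this, my timestamp writer would just be a jar on the classpath with a META-INF/services registration for the provider interface, and Spark (or Databricks) would only need to load that jar alongside the standard ORC library.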
