Hi all, While we are working on the new Orc v2 spec, I want to bounce some ideas in this group. If we can get something concrete, I will open JIRAs to follow up. Some of these ideas were mentioned before in various discussion, but I just put them together in a list so people can comment and provide feedback. Thanks.
* Clustered Index In a lot of time, data written into Orc file has sorting property. For example a sort merge join output will result in the data stream being sorted on join key(s). Another example is the DISTRIBUTED BY … SORTED BY … keywords can enforce the sorting property on certain data set. Under such cases, if we can just record the sort key(s) values for the first row of each row group it will help us a lot while doing lookup using key ranges, because we already have row group index which gives us ~O(1) seek time. During query execution, a SQL filter predicate can be easily turned into a key range. For example “WHERE id > 0 and id <= 100” will be translated into range (0, 100], and this key range can be passed down all the way to the Orc reader. Then we only need to load the corresponding row groups that covers this range. * Stripe Footer Location Today stripe footers are stored at the end of each stripe. This design probably come from the Hive world where the implementation tries to align Orc stripe with an HDFS block. It would make sense when you only need to read one HDFS block for both the data and the footer. But the alignment assumption doesn’t hold in other systems that leverage Orc as a columnar data format. Besides even for Hive, often time it’s hard to make sure good alignment due to various reasons - for example, when memory pressure is high stripe needs to be flushed to disk earlier. With this in mind, it will make sense to support saving stripe footer at the end of the file, together with all the other file meta. This would be easier for one sequential IO to load all the meta, and is easier to cache them all together. And we can make this configurable through writer options. * File Level Dictionary Currently Orc builds dictionary at stripe level. Each stripe has its own dictionary. But in most cases, data across stripes share a lot of similarities. Building one file level dictionary is probably more efficient than having one dictionary each stripe. We can reduce storage footprint and also improve read performance since we only have one dictionary per column per file. One challenge with this design is how to do file merge. Two files can have two different dictionary, and we need to be able to merge them without rewriting all the data. To solve this problem, we will need to support multiple dictionaries identified by uuid. Each stripe records the dictionary ID that identifies the dictionary it uses. And the reader loads the particular dictionary based on the ID when it loads a stripe. When you merge two files, dictionary data doesn’t need to be changed, but just to save the dictionaries from both files in the new merged file. * Breaking Compression Block and RLE Runs at Row Group Boundary Owen has mentioned this in previous discussion. We did a prototype and are able to show that there’s only a slight increase of file size (< 1%) with the change. But the benefit is obvious - all the seek to row group operation will not involve unnecessary decoding/decompression, making it really efficient. And this is critical in scenarios such as predicate pushdown or range scan using clustered index (see my first bullet point). The other benefit is doing so will greatly simply the index implementation we have today. We will only need to record a file offset for row group index. * Encoding and Compression The encoding today doesn’t have a lot of flexibility. Sometimes we would need to configure and fine tune encoding when it’s needed. For example, in previous discussions Gang brought up, we found LEB128 causes zStd to perform really bad. We would end up with much better result by just disabling LEB128 under zstd compression. We don’t have flexibility for these kind of things today. And we will need additional meta fields for that.
