Hi all,

While we are working on the new Orc v2 spec, I want to bounce some ideas in 
this group. If we can get something concrete, I will open JIRAs to follow up. 
Some of these ideas were mentioned before in various discussion, but I just put 
them together in a list so people can comment and provide feedback. Thanks.


  *   Clustered Index

In a lot of time, data written into Orc file has sorting property. For example 
a sort merge join output will result in the data stream being sorted on join 
key(s). Another example is the DISTRIBUTED BY … SORTED BY … keywords can 
enforce the sorting property on certain data set. Under such cases, if we can 
just record the sort key(s) values for the first row of each row group it will 
help us a lot while doing lookup using key ranges, because we already have row 
group index which gives us ~O(1) seek time. During query execution, a SQL 
filter predicate can be easily turned into a key range. For example “WHERE id > 
0 and id <= 100” will be translated into range (0, 100], and this key range can 
be passed down all the way to the Orc reader. Then we only need to load the 
corresponding row groups that covers this range.


  *   Stripe Footer Location

Today stripe footers are stored at the end of each stripe. This design probably 
come from the Hive world where the implementation tries to align Orc stripe 
with an HDFS block. It would make sense when you only need to read one HDFS 
block for both the data and the footer. But the alignment assumption doesn’t 
hold in other systems that leverage Orc as a columnar data format. Besides even 
for Hive, often time it’s hard to make sure good alignment due to various 
reasons - for example, when memory pressure is high stripe needs to be flushed 
to disk earlier. With this in mind, it will make sense to support saving stripe 
footer at the end of the file, together with all the other file meta. This 
would be easier for one sequential IO to load all the meta, and is easier to 
cache them all together. And we can make this configurable through writer 
options.


  *   File Level Dictionary

Currently Orc builds dictionary at stripe level. Each stripe has its own 
dictionary. But in most cases, data across stripes share a lot of similarities. 
Building one file level dictionary is probably more efficient than having one 
dictionary each stripe. We can reduce storage footprint and also improve read 
performance since we only have one dictionary per column per file. One 
challenge with this design is how to do file merge. Two files can have two 
different dictionary, and we need to be able to merge them without rewriting 
all the data. To solve this problem, we will need to support multiple 
dictionaries identified by uuid. Each stripe records the dictionary ID that 
identifies the dictionary it uses. And the reader loads the particular 
dictionary based on the ID when it loads a stripe. When you merge two files, 
dictionary data doesn’t need to be changed, but just to save the dictionaries 
from both files in the new merged file.


  *   Breaking Compression Block and RLE Runs at Row Group Boundary

Owen has mentioned this in previous discussion. We did a prototype and are able 
to show that there’s only a slight increase of file size (< 1%) with the 
change. But the benefit is obvious - all the seek to row group operation will 
not involve unnecessary decoding/decompression, making it really efficient. And 
this is critical in scenarios such as predicate pushdown or range scan using 
clustered index (see my first bullet point). The other benefit is doing so will 
greatly simply the index implementation we have today. We will only need to 
record a file offset for row group index.


  *   Encoding and Compression

The encoding today doesn’t have a lot of flexibility. Sometimes we would need 
to configure and fine tune encoding when it’s needed. For example, in previous 
discussions Gang brought up, we found LEB128 causes zStd to perform really bad. 
We would end up with much better result by just disabling LEB128 under zstd 
compression. We don’t have flexibility for these kind of things today. And we 
will need additional meta fields for that.


Reply via email to