> On Sep 28, 2018, at 2:40 PM, Xiening Dai <xndai....@live.com> wrote:
> 
> Hi all,
> 
> While we are working on the new Orc v2 spec, I want to bounce some ideas off 
> this group. If we can get something concrete, I will open JIRAs to follow up. 
> Some of these ideas were mentioned before in various discussions, but I have 
> put them together in a list so people can comment and provide feedback. 
> Thanks.
> 
> 
>  *   Clustered Index
> 
> Much of the time, data written into an Orc file has a sorting property. For 
> example, the output of a sort merge join is sorted on the join key(s). 
> Another example is the DISTRIBUTED BY … SORTED BY … clause, which enforces 
> a sort order on a data set. In such cases, if we record the sort key 
> value(s) for the first row of each row group, lookups by key range become 
> much cheaper, because the existing row group index already gives us ~O(1) 
> seek time. During query execution, a SQL filter predicate can easily be 
> turned into a key range. For example, “WHERE id > 0 and id <= 100” 
> translates into the range (0, 100], and this key range can be passed down 
> all the way to the Orc reader. Then we only need to load the row groups 
> that cover this range.

Don’t we already have min and max for every row group?  If so, isn’t that a 
superset of this feature?
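
For what it’s worth, here is a minimal sketch (in Java, with hypothetical 
types - this is not the actual ORC reader API) of how a predicate like 
“WHERE id > 0 and id <= 100” can be turned into a range and checked against 
per-row-group min/max statistics to decide which row groups to read:

    import java.util.ArrayList;
    import java.util.List;

    final class RowGroupPruning {

        // "WHERE id > 0 AND id <= 100" -> new KeyRange(0, false, 100, true)
        record KeyRange(long low, boolean lowInclusive,
                        long high, boolean highInclusive) {

            // True if any value in [min, max] can satisfy the range.
            boolean overlaps(long min, long max) {
                boolean groupBelowRange = lowInclusive ? max < low : max <= low;
                boolean groupAboveRange = highInclusive ? min > high : min >= high;
                return !(groupBelowRange || groupAboveRange);
            }
        }

        record RowGroupStats(long min, long max) {}

        // Returns the indexes of the row groups that must be read.
        static List<Integer> selectRowGroups(List<RowGroupStats> stats, KeyRange range) {
            List<Integer> selected = new ArrayList<>();
            for (int i = 0; i < stats.size(); i++) {
                if (range.overlaps(stats.get(i).min(), stats.get(i).max())) {
                    selected.add(i);
                }
            }
            return selected;
        }
    }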

>  *   Stripe Footer Location
> 
> Today stripe footers are stored at the end of each stripe. This design 
> probably comes from the Hive world, where the implementation tries to align 
> an Orc stripe with an HDFS block. That makes sense when you only need to 
> read one HDFS block for both the data and the footer. But the alignment 
> assumption doesn’t hold in other systems that leverage Orc as a columnar 
> data format. Besides, even for Hive it’s often hard to maintain good 
> alignment for various reasons - for example, when memory pressure is high, 
> a stripe needs to be flushed to disk early. With this in mind, it would 
> make sense to support saving the stripe footers at the end of the file, 
> together with all the other file metadata. A single sequential IO could 
> then load all the metadata, and it would be easier to cache it all 
> together. And we can make this configurable through writer options.

The end of the file contains a metadata section with a summary of the stripe 
statistics.  I’m curious what information you would like that isn’t already 
present in that data structure.

Also, I’m curious how these other systems "that leverage Orc as a columnar 
data format" are accessing this stripe footer information.  Specifically, is it 
cached or loaded on demand from storage?  Are you using this for planning or 
data skipping?
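
If the footer location does become configurable, I’d imagine the knob looking 
roughly like the sketch below (purely hypothetical - no such option exists in 
the current writer API):

    // Hypothetical writer-side option for stripe footer placement.
    enum StripeFooterLocation {
        AFTER_EACH_STRIPE,   // today's layout: footer at the end of each stripe
        FILE_TAIL            // proposed: stripe footers stored with the file metadata
    }

    final class WriterOptionsSketch {
        private StripeFooterLocation stripeFooterLocation =
                StripeFooterLocation.AFTER_EACH_STRIPE;

        // Builder-style setter, mirroring how other writer options are set.
        WriterOptionsSketch stripeFooterLocation(StripeFooterLocation location) {
            this.stripeFooterLocation = location;
            return this;
        }

        StripeFooterLocation stripeFooterLocation() {
            return stripeFooterLocation;
        }
    }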

>  *   File Level Dictionary
> 
> Currently Orc builds dictionaries at the stripe level; each stripe has its 
> own dictionary. But in most cases, data across stripes shares a lot of 
> similarity. Building one file-level dictionary is probably more efficient 
> than having one dictionary per stripe. We can reduce the storage footprint 
> and also improve read performance, since we only have one dictionary per 
> column per file. One challenge with this design is how to merge files. Two 
> files can have two different dictionaries, and we need to be able to merge 
> them without rewriting all the data. To solve this problem, we need to 
> support multiple dictionaries identified by UUID. Each stripe records the 
> ID of the dictionary it uses, and the reader loads that dictionary when it 
> loads the stripe. When two files are merged, the dictionary data doesn’t 
> need to change; the merged file simply keeps the dictionaries from both 
> files.
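
For concreteness, this is how I read the proposed metadata - hypothetical 
Java types, nothing like this is in the spec today:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.UUID;

    final class FileDictionarySketch {

        // One dictionary per (column, UUID), stored once in the file tail.
        record Dictionary(UUID id, int columnId, byte[] utf8Data, int[] offsets) {}

        // Each stripe only records which dictionary each column uses.
        record StripeDictionaryRef(int columnId, UUID dictionaryId) {}

        // Merging two files: stripes are copied as-is and keep referring to
        // the same UUIDs, so the merged file just takes the union of the
        // dictionaries without rewriting any encoded data.
        static Map<UUID, Dictionary> mergeDictionaries(Map<UUID, Dictionary> a,
                                                       Map<UUID, Dictionary> b) {
            Map<UUID, Dictionary> merged = new HashMap<>(a);
            merged.putAll(b);
            return merged;
        }
    }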

In my experience, this overhead is very small when using the default 128MB 
stripe setting, and I’d guess it is reasonable at 64MB or 32MB.  What we 
typically see is that dictionary columns compress amazingly well, and the 
other columns take up the majority of the space in a table.  Even when you 
include the repeated cost of the dictionary per stripe, the overhead is tiny.

On the other hand, there are cases where you have columns with a small common 
set of values mixed in with fairly distinct values (skewed data), and in 
those cases the dictionary blows up as you add more rows to the stripe.  The 
FB fork of ORC, DWRF, addresses this by supporting a "per row group" 
dictionary.  Another alternative is to support mixed direct and dictionary 
encoding in the same column, but that is pretty complex to implement and 
effectively disables downstream dictionary processing.

BTW, I’m not advocating for either of these dictionary changes, just providing 
my observations.

>  *   Breaking Compression Block and RLE Runs at Row Group Boundary
> 
> Owen has mentioned this in a previous discussion. We did a prototype and 
> were able to show that there is only a slight increase in file size (< 1%) 
> with the change. But the benefit is obvious - seeking to a row group no 
> longer involves unnecessary decoding/decompression, which makes it really 
> efficient. And this is critical in scenarios such as predicate pushdown or 
> a range scan using the clustered index (see my first bullet point). The 
> other benefit is that doing so will greatly simplify the index 
> implementation we have today. We would only need to record a file offset 
> for each row group index entry.

This is one I’d like to see.  The complexity of encodings spanning row groups 
makes skipping super complex. Would you extend this to compression blocks?

BTW, there are datasets that would become much bigger with this change, but I 
think on average the change would be tiny.
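
As an illustration of why the index entries get simpler (field names are 
illustrative, not the exact fields in the spec): today, seeking into a 
compressed stream needs several coordinates, because the target row can sit 
in the middle of both a compression chunk and an RLE run; if chunks and runs 
always break at row group boundaries, a single offset is enough.

    final class RowIndexPositions {

        // Today: several coordinates per stream to reach a row group.
        record CurrentPosition(
                long compressedChunkStart,  // byte offset of the compression chunk
                long decompressedOffset,    // bytes to skip after decompressing it
                long valuesIntoRun) {}      // values to discard inside the RLE run

        // Proposed: decoding starts cleanly at a stream offset.
        record ProposedPosition(long streamOffset) {}
    }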

>  *   Encoding and Compression
> 
> The encodings today don’t have a lot of flexibility. Sometimes we need to 
> configure and fine-tune an encoding. For example, in a previous discussion 
> Gang brought up, we found that LEB128 causes zstd to perform really badly. 
> We ended up with a much better result by just disabling LEB128 under zstd 
> compression. We don’t have flexibility for this kind of thing today, and 
> we would need additional meta fields for it.

I think this brings up a much bigger issue.  Why have so much flexibility in 
the compression algorithms?  The interaction between the block compression and 
the data encodings can have dramatic effects on the format.  Can we consider 
limiting the compression to LZ4 and ZSTD (or maybe just ZSTD), and then design 
encodings that play well with them?  Also, ZSTD can use a pre-trained 
“dictionary” that might help with specific encodings…. Just a thought.

-dain
