On Mon, Oct 1, 2018 at 3:56 PM Dain Sundstrom <d...@iq80.com> wrote:
>
> Interesting idea. This could help some processors of the data. Also, if
> the format has this, it would be good to support "clustered" and
> "unique" as flags for data that isn't strictly sorted, but has all of
> the same values clustered together. Then again, this seems like a
> property for the table/partition.
>

Clustered and unique are easy to measure while we are in dictionary
mode, but very hard while in direct mode. What are you thinking?
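To make the dictionary-mode half of that concrete, here is a rough
sketch of how a writer could track both flags while values stream
through a dictionary encoder (the class and names are hypothetical, not
part of the ORC API):

    import java.util.HashMap;
    import java.util.Map;

    // Tracks "clustered" and "unique" while values pass through a
    // dictionary encoder. Illustrative sketch only.
    class ClusterTracker {
      private final Map<String, Integer> dictionary = new HashMap<>();
      private String previous = null;
      private boolean clustered = true; // equal values are adjacent
      private boolean unique = true;    // no value repeats at all

      void addValue(String value) {
        if (dictionary.containsKey(value)) {
          unique = false;
          // A repeat is only OK if it extends the current run.
          if (!value.equals(previous)) {
            clustered = false;
          }
        } else {
          dictionary.put(value, dictionary.size());
        }
        previous = value;
      }

      boolean isClustered() { return clustered; }
      boolean isUnique() { return unique; }
    }

In direct mode there is no dictionary to answer "have we seen this value
before", which is exactly why the check is cheap in one mode and hard in
the other.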
> >> * Stripe Footer Location
> >>
> >> Today stripe footers are stored at the end of each stripe. This
> >> design probably comes from the Hive world, where the implementation
> >> tries to align an ORC stripe with an HDFS block. That makes sense
> >> when you only need to read one HDFS block for both the data and the
> >> footer. But the alignment assumption doesn't hold in other systems
> >> that leverage ORC as a columnar data format. Besides, even for Hive
> >> it's often hard to guarantee good alignment for various reasons -
> >> for example, when memory pressure is high a stripe needs to be
> >> flushed to disk early. With this in mind, it would make sense to
> >> support saving the stripe footers at the end of the file, together
> >> with all the other file metadata. That would let one sequential IO
> >> load all the metadata, and make it easier to cache it all together.
> >> We can make this configurable through writer options.
> >>
> >
> > Some of the systems, like S3 and Wasabi, have a strong preference for
> > reading forward in files, so I think it is reasonable to put the
> > stripe "footer" first in the file, followed by the indexes, and
> > finally the data. Putting the "footer" first isn't hard for the
> > implementation, since it has to buffer the entire stripe before
> > anything is written.
>
> On a per-stripe basis, the footer, data and index could be in any
> order, because any implementation will need to completely buffer the
> stripe somewhere (it is just a reality of columnar writing). Across the
> whole file, that is a completely different story.

I misread the original proposal. Stripes being co-located is a feature,
not a problem. So -1 to having the stripe footer at the bottom of the
file.

> > One of the ideas that I've been playing with is to make "stripelets"
> > where each one contains 128k rows and flush the streams at that
> > point. That would enable you to read the first stripelet and start
> > processing while you read the next stripelet.
>
> What is the advantage of a "stripelet" over just flushing the stripe
> every 128k rows?

After thinking about it more, I agree. There were some advantages, but
the complexity is probably too high to justify it.

> >> * File Level Dictionary
> >>
> >> Currently ORC builds its dictionaries at the stripe level. Each
> >> stripe has its own dictionary. But in most cases, data across
> >> stripes shares a lot of similarities. Building one file-level
> >> dictionary is probably more efficient than having one dictionary per
> >> stripe. We can reduce the storage footprint and also improve read
> >> performance, since we only have one dictionary per column per file.
> >> One challenge with this design is how to merge files. Two files can
> >> have two different dictionaries, and we need to be able to merge
> >> them without rewriting all the data. To solve this problem, we will
> >> need to support multiple dictionaries identified by UUID. Each
> >> stripe records the dictionary ID that identifies the dictionary it
> >> uses, and the reader loads that particular dictionary when it loads
> >> the stripe. When you merge two files, the dictionary data doesn't
> >> need to change; the merged file just keeps the dictionaries from
> >> both input files.
> >>
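To make sure I'm reading the proposal right, here is the metadata shape
it seems to imply, as a sketch (every name here is hypothetical, not
ORC's actual metadata):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;

    // Hypothetical file tail with shared, UUID-keyed dictionaries.
    class FileTail {
      static class Dictionary {
        UUID id;     // identifies this dictionary across files
        byte[] data; // dictionary bytes for one column
      }
      static class StripeInfo {
        long offset;
        long length;
        // per column: which dictionary this stripe's data refers to
        List<UUID> columnDictionaryIds = new ArrayList<>();
      }
      List<Dictionary> dictionaries = new ArrayList<>();
      List<StripeInfo> stripes = new ArrayList<>();

      // Merging becomes a byte-for-byte copy of the stripes plus a
      // concatenation of the two dictionary lists.
      static FileTail merge(FileTail a, FileTail b) {
        FileTail merged = new FileTail();
        merged.dictionaries.addAll(a.dictionaries);
        merged.dictionaries.addAll(b.dictionaries);
        merged.stripes.addAll(a.stripes);
        merged.stripes.addAll(b.stripes);
        return merged;
      }
    }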
> > I haven't seen the dictionary duplication being a problem. On the
> > other hand, cross-block reads, which a global dictionary would
> > require, are painful. As you point out, stripe merging becomes much
> > more complicated in this case. (You don't need UUID schemes, because
> > the writer would always know the mapping of which stripes & columns
> > are using which dictionary.)
>
> To be honest, I rarely see cross stripe/block reads. In Presto, people
> normally want to be as parallel as possible, so effectively every
> stripe is read by a different thread/machine. Of course there are
> corner cases where a writer went crazy and wrote tiny stripes, but
> these are the exception.

Agreed. We do need to do better about giving guidance and default values
for wide schemas. There are many cases where the default behavior ends
up with 20k rows per stripe, which is really, really bad for read
performance.

> >> * Breaking Compression Block and RLE Runs at Row Group Boundary
> >>
> >> Owen has mentioned this in a previous discussion. We did a prototype
> >> and were able to show that there's only a slight increase in file
> >> size (< 1%) with the change. But the benefit is obvious - seeking to
> >> a row group no longer involves unnecessary decoding/decompression,
> >> making it really efficient. And this is critical in scenarios such
> >> as predicate pushdown or range scans using a clustered index (see my
> >> first bullet point). The other benefit is that doing so will greatly
> >> simplify the index implementation we have today. We would only need
> >> to record a single file offset for each row group index entry.
> >>
> >
> > Yeah, this is the alternative to the stripelets that I discussed
> > above.
>
> >> * Encoding and Compression
> >>
> >> The encoding today doesn't have a lot of flexibility. Sometimes we
> >> need to configure and fine-tune encodings. For example, in a
> >> previous discussion Gang brought up, we found that LEB128 causes
> >> zstd to perform really badly. We would end up with a much better
> >> result by just disabling LEB128 under zstd compression. We don't
> >> have flexibility for these kinds of things today, and we would need
> >> additional meta fields for that.
> >>
> >
> > I certainly have had requests for custom encodings, but I've tended
> > to push back because it makes it hard to ensure the files are
> > readable on all of the platforms. I did just add the option to turn
> > off dictionary encoding for particular columns.
>
> Yep. As someone that maintains a reader/writer implementation, I would
> prefer to keep the variety of encodings down for the same reason :)
>
> As for flexibility, the dictionary encoding flag you mentioned wouldn't
> affect the format, so it seems like a reasonable change to me. One
> format-level flexibility change I'd like to see is the ability to not
> sort dictionaries, because no one is taking advantage of it, and it
> makes it impossible to predict the output size of the stripe (sorting
> can make compression better or worse).

Absolutely. We probably should make that the default for ORC v2.

> I guess that "breaking the compression" at row group boundaries could
> be done without format changes, but I'd prefer to see it required,
> since optional breaks make skipping a pain.
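For anyone skimming, the index simplification is easiest to see side by
side; a sketch with hypothetical field names (not ORC's actual RowIndex
protobuf):

    // Today: a row group can start in the middle of a compression
    // chunk and in the middle of an RLE run, so a position needs
    // several numbers per stream.
    class RowGroupPositionV1 {
      long compressedChunkOffset; // where the compression chunk starts
      long decompressedOffset;    // bytes to skip inside that chunk
      long valuesIntoRun;         // values to skip inside the RLE run
    }

    // If chunks and runs both break at row-group boundaries, one
    // offset is enough: seek there and start decoding.
    class RowGroupPositionV2 {
      long streamOffset;
    }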
> > With respect to zstd, we need to test it under different data sets
> > and build up an understanding of when it works and when it doesn't.
> > It sounds like zstd with the options that you were using was a bad
> > fit for what we need. I would guess that longer windows and
> > pre-loaded dictionaries may help. We need more work to figure out
> > what the right parameters are in general by looking at more data
> > sets.
>
> Totally agree. My guess is it works well for things like timestamps and
> dates, but not great for varchar and binary. Then again, if you are
> writing a lot of data, you could use the data from the previous stripes
> to speed up compression for later stripes. My guess is that would be
> really complex to implement.
>
> If we decided we may want to pursue this path in the future, we could
> provide a "dictionary" section in the stream information.
>
> -dain
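If we get there, here is a minimal sketch of what pre-trained
dictionaries look like, assuming the zstd-jni bindings (ZstdDictTrainer
and ZstdCompressCtx); the trained bytes would be what such a
"dictionary" section stores:

    import com.github.luben.zstd.ZstdCompressCtx;
    import com.github.luben.zstd.ZstdDictTrainer;

    class ZstdDictSketch {
      // Train a shared dictionary from sampled values, e.g. from the
      // first stripe's string column.
      static byte[] train(Iterable<byte[]> samples) {
        // Buffer up to 1 MB of samples; target a 64 KB dictionary.
        ZstdDictTrainer trainer = new ZstdDictTrainer(1 << 20, 64 * 1024);
        for (byte[] sample : samples) {
          trainer.addSample(sample);
        }
        return trainer.trainSamples();
      }

      // Compress one stream's bytes with the shared dictionary. A
      // reader would have to load the same bytes on its side.
      static byte[] compress(byte[] streamBytes, byte[] dict) {
        ZstdCompressCtx ctx = new ZstdCompressCtx(); // free when done
        ctx.setLevel(3);
        ctx.loadDict(dict);
        return ctx.compress(streamBytes);
      }
    }

The catch you point out still applies: the dictionary has to exist
before the streams that use it are written, which is where the
implementation complexity comes from.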