On Mon, Oct 1, 2018 at 3:56 PM Dain Sundstrom <d...@iq80.com> wrote:

>
> Interesting idea.  This could help some processors of the data.  Also, if
> the format has this, it would be good to support "clustered" and "unique"
> as flags for data that isn’t strictly sorted, but has all of the same
> values clustered together.  Then again, this seems like a property for the
> table/partition.
>

Clustered and unique are easy to measure while we are in dictionary mode,
but very hard while in direct mode. What are you thinking?
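For what it's worth, in dictionary mode the check is just a little
bookkeeping on top of the hash table the writer already maintains. A
minimal sketch in C++ (illustrative names, not ORC's actual API):

    #include <string>
    #include <unordered_set>

    // Tracks "clustered" (all equal values are contiguous) and "unique"
    // (no value repeats) while rows stream through the writer.
    class ClusterTracker {
    public:
      void add(const std::string& value) {
        ++rows_;
        // If this value differs from the previous row's value but was
        // already seen in an earlier (now closed) run, equal values are
        // not contiguous, so the column is not clustered.
        if (rows_ > 1 && value != previous_ && seen_.count(value)) {
          clustered_ = false;
        }
        seen_.insert(value);
        previous_ = value;
      }
      bool clustered() const { return clustered_; }
      // Every value occurs exactly once iff dictionary size == row count.
      bool unique() const { return seen_.size() == rows_; }
    private:
      std::unordered_set<std::string> seen_;
      std::string previous_;
      size_t rows_ = 0;
      bool clustered_ = true;
    };

In dictionary mode the writer effectively gets seen_ for free; in direct
mode you would have to build that table just for this check, which is
where the cost comes from.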


> >>
> >>  *   Stripe Footer Location
> >>
> >> Today stripe footers are stored at the end of each stripe. This design
> >> probably comes from the Hive world, where the implementation tries to
> >> align ORC stripes with HDFS blocks. That makes sense when you only
> >> need to read one HDFS block for both the data and the footer. But the
> >> alignment assumption doesn't hold in other systems that leverage ORC
> >> as a columnar data format. Besides, even for Hive it is often hard to
> >> achieve good alignment for various reasons - for example, when memory
> >> pressure is high a stripe needs to be flushed to disk early. With this
> >> in mind, it would make sense to support saving the stripe footers at
> >> the end of the file, together with all the other file metadata. That
> >> would let one sequential IO load all the metadata, and make it easier
> >> to cache it all together. And we can make this configurable through
> >> writer options.
> >>
> >
> > Some of the systems like S3 and Wasabi have a strong preference for
> > reading forward in files, so I think it is reasonable to put the stripe
> > "footer" first in the file, followed by the indexes, and finally the
> > data. Putting the "footer" first isn't hard for the implementation
> > since it has to buffer the entire stripe before anything is
>
> On a per-stripe basis, the footer, data and index could be in any order,
> because any implementation will need to completely buffer the stripe
> somewhere (it is just a reality of columnar writing).  Across the whole
> file, that is a completely different story.
>

I misread the original proposal. Each stripe's footer being co-located
with its data is a feature, not a problem: a stripe stays self-contained,
so a reader can fetch the index, data, and footer with one contiguous
range request. So -1 to moving the stripe footers to the bottom of the
file.


> > One of the ideas that I've been playing with is to make "stripelets",
> > where each one contains 128k rows and flush the streams at that point.
> > That would enable you to read the first stripelet and start processing
> > while you read the next stripelet.
>
> What is the advantage of a "stripelet" over just flushing the stripe every
> 128k rows?
>

After thinking about it more, I agree. There were some advantages, but the
complexity is probably too high to justify it.


> >>  *   File Level Dictionary
> >>
> >> Currently ORC builds dictionaries at the stripe level; each stripe has
> >> its own dictionary. But in most cases data across stripes shares a lot
> >> of similarity, so building one file-level dictionary is probably more
> >> efficient than having one dictionary per stripe. We can reduce the
> >> storage footprint and also improve read performance, since we only
> >> have one dictionary per column per file. One challenge with this
> >> design is how to merge files. Two files can have two different
> >> dictionaries, and we need to be able to merge them without rewriting
> >> all the data. To solve this problem, we would need to support multiple
> >> dictionaries identified by UUID. Each stripe records the dictionary ID
> >> that identifies the dictionary it uses, and the reader loads that
> >> particular dictionary when it loads the stripe. When merging two
> >> files, the dictionary data doesn't need to change; we just save the
> >> dictionaries from both files in the new merged file.
> >>
> >
> > I haven't seen the dictionary duplication being a problem. On the
> > other hand, cross-block reads, which a global dictionary would
> > require, are painful. As you point out, stripe merging becomes much
> > more complicated in this case. (You don't need UUID schemes, because
> > the writer would always know the mapping of which stripes & columns
> > are using which dictionary.)
>
> To be honest, I rarely see cross stripe/block reads.  In Presto, people
> normally want to be as parallel as possible, so effectively every stripe
> is read by a different thread/machine.  Of course there are corner
> cases, where a writer went crazy and wrote tiny stripes, but those are
> the exception.
>

Agreed.

We do need to give better guidance and default values for wide schemas.
There are many cases where the default behavior ends up with 20k rows per
stripe, which is really, really bad for read performance.
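
(To put numbers on it: with the default 64 MiB stripe buffer and a
hypothetical 400-column schema averaging 8 bytes of buffered data per
value, the writer fills the stripe after roughly 64 MiB / (400 x 8 B)
~= 21k rows - long before the per-stripe overhead is amortized.)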


> >>  *   Breaking Compression Blocks and RLE Runs at Row Group Boundaries
> >>
> >> Owen has mentioned this in a previous discussion. We did a prototype
> >> and were able to show that there's only a slight increase in file size
> >> (< 1%) with the change. But the benefit is obvious - seeking to a row
> >> group no longer involves unnecessary decoding/decompression, which
> >> makes it really efficient. And this is critical in scenarios such as
> >> predicate pushdown or range scans using a clustered index (see my
> >> first bullet point). The other benefit is that this greatly simplifies
> >> the index implementation we have today: we would only need to record a
> >> file offset for the row group index.
> >>
> >
> > Yeah, this is the alternative to the stripelets that I discussed above.
> >
> >>  *   Encoding and Compression
> >>
> >> The encoding today doesn't have a lot of flexibility. Sometimes we
> >> need to configure and fine-tune the encoding. For example, in a
> >> previous discussion Gang brought up, we found that LEB128 causes zstd
> >> to perform really badly, and we got a much better result by just
> >> disabling LEB128 under zstd compression. We don't have the flexibility
> >> for this kind of thing today, and we would need additional metadata
> >> fields for it.
> >>
> >
> > I certainly have had requests for custom encodings, but I've tended to
> > push back because it makes it hard to ensure the files are readable on
> > all of the platforms. I did just add the option to turn off dictionary
> > encoding for particular columns.
>
> Yep. As someone who maintains a reader/writer implementation, I would
> prefer to keep the variety of encodings down for the same reason :)
>
> As for flexibility, the dictionary encoding flag you mentioned wouldn't
> affect the format, so it seems like a reasonable change to me.  One
> format-level flexibility change I'd like to see is the ability to not
> sort dictionaries, because no one is taking advantage of it, and it
> makes it impossible to predict the output size of the stripe (sorting
> can make compression better or worse).
>

Absolutely. We probably should make that the default for ORC v2.


> I guess that "breaking the compression" at row group boundaries could be
> done without format changes, but I'd prefer to see it required, since
> leaving it optional makes skipping a pain.
>
> > With respect to zstd, we need to test it under different data sets and
> > build up an understanding of when it works and when it doesn't. It
> > sounds like zstd with the options you were using was a bad fit for
> > what we need. I would guess that longer windows and pre-loaded
> > dictionaries may help. We need more work to figure out what the right
> > parameters are in general by looking at more data sets.
>
> Totally agree.  My guess is it works well for things like timestamps and
> dates, but not great for varchar and binary.  Then again, if you are
> writing a lot of data, you could use the data from the previous stripes
> to speed up compression for later stripes.  My guess is that would be
> really complex to implement.
>
> If we decide we want to pursue this path in the future, we could provide
> a "dictionary" section in the stream information.
>
> -dain
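
For reference, the pre-loaded dictionary idea maps onto zstd's
digested-dictionary API. A minimal sketch at the C API level (not ORC
code; the dictionary bytes could come from ZDICT_trainFromBuffer offline
or, as you suggest, from earlier stripes):

    #include <zstd.h>
    #include <string>
    #include <vector>

    // Compress one stream using a pre-built ("digested") dictionary.
    std::vector<char> compressWithDict(const std::string& input,
                                       const std::vector<char>& dict,
                                       int level) {
      ZSTD_CDict* cdict = ZSTD_createCDict(dict.data(), dict.size(), level);
      ZSTD_CCtx* cctx = ZSTD_createCCtx();
      std::vector<char> out(ZSTD_compressBound(input.size()));
      size_t n = ZSTD_compress_usingCDict(cctx, out.data(), out.size(),
                                          input.data(), input.size(), cdict);
      ZSTD_freeCCtx(cctx);
      ZSTD_freeCDict(cdict);
      if (ZSTD_isError(n)) return {};  // error handling elided
      out.resize(n);
      return out;
    }

The reader needs the same dictionary bytes before it can decompress,
which is exactly why a "dictionary" section in the stream information
would have to be part of the format.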
