Thanks a lot, Jason, this is appreciated.

I hope you get a chance to address the other questions as well.

Drill and Parquet seem to have a great feature set that should make for a
killer combo, and that is the main reason I want to be sure that I'm using
them in the best way possible.

It's obviously preferred that vectorized reader(s) can take full advantage
of all the encoding options in Parquet, so that speed and space efficiency
go hand in hand.

Looking forward to learning more!

Regards,
 -Stefán

On Thu, Nov 5, 2015 at 3:36 PM, Jason Altekruse <altekruseja...@gmail.com>
wrote:

> Hi Stefan,
>
> I can answer a few of these questions; please see below.
>
>  - How efficient is the Drill use of Parquet compared to Presto?
> Unfortunately this is a bit of a complex question; the truth is that they
> are each using Parquet differently right now. Drill has two Parquet
> readers: one optimized for speed on flat data and another for handling
> complex (nested and repeated) data. The "complex" reader leverages the
> regular parquet-mr interfaces, as is the case with the reader in Presto
> (I'm not sure if they support complex data, but they are using the
> standard interfaces).
>
> To get the best performance with Presto, as I understand it, you should use
> their fork of the ORC format, DWRF. I haven't worked with this, but I know
> that the team at Netflix (I've added one of the engineers to this thread)
> has been working to add some of the performance enhancements available for
> DWRF to the Parquet reader in Presto. They have a PR out for their
> vectorization changes; I don't believe they are integrated into the
> standard Presto build yet.
>
> - Can other solutions use Parquet files created by Drill when partition
>    pruning is used? (footer information)
> Yes. The way we store which columns are prunable is just by setting the min
> and max values in the column statistics to the same value (after making
> sure we partition the data so that we only write a single value into each
> file). Any engine doing row group level pruning by looking at these
> statistics should be able to take advantage of it.
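>
> As a rough sketch of how another tool could verify this using the standard
> parquet-mr footer APIs (the file path here is just a made-up example):
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.parquet.hadoop.ParquetFileReader;
>     import org.apache.parquet.hadoop.metadata.BlockMetaData;
>     import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
>     import org.apache.parquet.hadoop.metadata.ParquetMetadata;
>
>     public class PrunableCheck {
>       public static void main(String[] args) throws Exception {
>         // Hypothetical file written by a Drill CTAS with auto-partitioning.
>         Path file = new Path("/data/logs/0_0_0.parquet");
>         ParquetMetadata footer =
>             ParquetFileReader.readFooter(new Configuration(), file);
>         for (BlockMetaData block : footer.getBlocks()) {
>           for (ColumnChunkMetaData col : block.getColumns()) {
>             // A "prunable" column chunk has min == max in its statistics,
>             // i.e. the whole file holds a single value for that column.
>             if (col.getStatistics() != null
>                 && !col.getStatistics().isEmpty()
>                 && col.getStatistics().genericGetMin()
>                        .equals(col.getStatistics().genericGetMax())) {
>               System.out.println(col.getPath() + " holds a single value");
>             }
>           }
>         }
>       }
>     }
>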
>    - What is the status of the Parquet fork and getting Drill "on track" in
>    that regard?
> What we had previously been maintaining in a fork was a large change set to
> allow us to leverage the Hadoop 2.x API for reading directly into a
> ByteBuffer. We manage all of our memory off-heap, so without these changes
> we would have had to make a lot of reads into byte arrays and immediately
> copy them into direct memory. This has just been merged into mainline and
> we no longer have the fork.
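>
> To make the difference concrete, here is a minimal sketch of the read
> pattern the Hadoop 2.x API enables (not the actual Drill code; the path
> and buffer size are made up):
>
>     import java.nio.ByteBuffer;
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FSDataInputStream;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>
>     public class DirectRead {
>       public static void main(String[] args) throws Exception {
>         FileSystem fs = FileSystem.get(new Configuration());
>         try (FSDataInputStream in =
>                 fs.open(new Path("/data/logs/0_0_0.parquet"))) {
>           // Off-heap buffer; with the older API we would have had to read
>           // into a byte[] on the heap and copy it into direct memory.
>           ByteBuffer direct = ByteBuffer.allocateDirect(1024 * 1024);
>           // Hadoop 2.x can fill the ByteBuffer directly (where the
>           // underlying stream supports it).
>           in.read(direct);
>         }
>       }
>     }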
>
> That being said, the optimized flat reader we use in Drill is not designed
> to work with other tools right now. I think it is quite possible that this
> is some of the fastest Java code for reading Parquet files, because we are
> making a direct columnar transformation from the disk format into our
> columnar in-memory format. There are still some optimizations we could add
> there, but generally it has given very good performance on our Parquet
> reads. It generally doesn't do the best against dictionary-encoded files,
> because we just go ahead and materialize all of the dictionary values into
> the full dataset right away at the reader, so we don't currently do any
> dictionary-based filtering.
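>
> In simplified form, that materialization step amounts to something like
> this (an illustration only, not the actual reader code):
>
>     import java.util.Arrays;
>
>     public class DictMaterialize {
>       public static void main(String[] args) {
>         // Dictionary page: each distinct value is stored once.
>         String[] dictionary = {"blue", "green", "yellow"};
>         // Data pages: a small id per row instead of the full value.
>         int[] ids = {0, 0, 2, 1, 0, 2};
>         // Expand the ids back into full values up front; downstream
>         // operators never see the dictionary, which is why no
>         // dictionary-based filtering happens at this stage.
>         String[] materialized = new String[ids.length];
>         for (int i = 0; i < ids.length; i++) {
>           materialized[i] = dictionary[ids[i]];
>         }
>         System.out.println(Arrays.toString(materialized));
>       }
>     }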
>
> Looking back in this thread, it seems like there are a lot of other
> questions; I'll try to come back to this soon and answer more.
>
> On Tue, Nov 3, 2015 at 1:55 AM, Stefán Baxter <ste...@activitystream.com>
> wrote:
>
> > Hi again,
> >
> > Are incremental timestamp values (long) being encoded in Parquet as
> > incremental values?
> > (This is the option in Parquet to refrain from storing complete numbers
> > and store only the delta between numbers, to save space.)
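> >
> > To illustrate what I mean (this is just my understanding of delta
> > encoding in general, not Parquet's actual on-disk layout):
> >
> >     public class DeltaSketch {
> >       public static void main(String[] args) {
> >         // Monotonically increasing timestamps (millis)...
> >         long[] ts = {1446422400000L, 1446422401000L, 1446422403000L};
> >         // ...can be stored as a base value plus small deltas,
> >         // e.g. base = 1446422400000, deltas = {1000, 2000}.
> >         long base = ts[0];
> >         long[] deltas = new long[ts.length - 1];
> >         for (int i = 1; i < ts.length; i++) {
> >           deltas[i - 1] = ts[i] - ts[i - 1];
> >         }
> >         System.out.println(base + " " + java.util.Arrays.toString(deltas));
> >       }
> >     }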
> >
> > Regards,
> >  -Stefan
> >
> > On Mon, Nov 2, 2015 at 5:54 PM, Stefán Baxter <ste...@activitystream.com>
> > wrote:
> >
> > > Hi Aman,
> > >
> > > Thank you for this information.
> > >
> > > I'm not sure I understand this correctly but...
> > >
> > >    1. Preventing scans where values are out of range is a core feature
> > >    in Parquet
> > >    - is this not used/supported in Drill... for sure?
> > >
> > >    2. Who can tell if Dictionary encoding ...
> > >    A) works as expected?
> > >    B) is used in scanning to prevent pointless searches?
> > >
> > > New questions that have emerged:
> > >
> > >    - How efficient is the Drill use of Parquet compared to Presto?
> > >    - Can other solutions use Parquet files created by Drill when
> > >    partition pruning is used? (footer information)
> > >    - What is the status of the Parquet fork and getting Drill "on
> > >    track" in that regard?
> > >
> > > Regards,
> > >  -Stefan
> > >
> > > On Mon, Nov 2, 2015 at 4:44 PM, Aman Sinha <amansi...@apache.org>
> > > wrote:
> > >
> > >> > >    - If I have multiple files containing a day's worth of logging,
> > >> > >    in chronological order, will all the irrelevant files be ignored
> > >> > >    when looking for a date or a date range?
> > >> > >    - AKA - Will the min-max headers in Parquet be used to prevent
> > >> > >    scanning of data outside the range?
> > >>
> > >> If these files were created using CTAS auto-partitioning, Drill will
> > >> apply partition pruning to eliminate files that are not needed.  If
> > >> they were not created this way, currently the files are not
> > >> eliminated...this is something that should be addressed by DRILL-1950;
> > >> this has been on the list of important JIRAs and hopefully will get
> > >> addressed in one of the upcoming releases.
> > >>
> > >> > >    - Is there anything I need to do to make sure that the write
> > >> > >    optimizations in Parquet are used?
> > >> > >    - dictionaries for low cardinality fields
> > >>
> > >> Note that the default value of store.parquet.enable_dictionary_encoding
> > >> is currently false.  There were some issues with dictionary encoding in
> > >> the past with certain data types such as date; I thought they were
> > >> fixed...we should discuss on the Drill dev list whether this can be
> > >> enabled (there is substantial testing effort to make sure the encoding
> > >> works for all data types).
> > >>
> > >> Aman
> > >>
> > >>
> > >>
> > >>
> > >> On Sun, Nov 1, 2015 at 8:25 AM, John Omernik <j...@omernik.com>
> > >> wrote:
> > >>
> > >> > I've read through your post and had similar thoughts on trying to
> > >> > get together information around Parquet files. I feel like it would
> > >> > be really helpful to have a section of the Drill User Docs dedicated
> > >> > to user stories on Parquet files. I know stories sound odd to put
> > >> > into documentation, but I think that the challenge of explaining
> > >> > optimization of something like Parquet is that you can either do it
> > >> > from a dry academic point of view, which can be hard for the user
> > >> > base to really understand, or you can try to provide lots of stories
> > >> > that could be annotated by devs or improved with links to other
> > >> > stories.
> > >> >
> > >> > What I mean by stories is example data, how it is queried, and why
> > >> > it was stored that way (with partitions based on directories,
> > >> > options for "loading" data into directories, using partitions within
> > >> > the files, how Parquet optimizes so folks know where to put extra
> > >> > effort into typing, etc.)
> > >> >
> > >> > As to your specific questions, I can't answer them myself; I've
> > >> > wondered about some of them, but haven't gotten around to asking. My
> > >> > experiences with Parquet have been generally positive, but have
> > >> > involved a good amount of trial and error (as you can see from some
> > >> > of my user posts). (Also, the user group has been great, but to my
> > >> > point about user stories, my education has come from posting stories
> > >> > and getting feedback from the community; it would be neat to see
> > >> > this as a first-class part of the documentation, as I think it could
> > >> > help folks with Parquet, Drill and optimizing their environment.)
> > >> >
> > >> > Wish I could be of more help beyond +1 :)
> > >> >
> > >> >
> > >> >
> > >> > On Sun, Nov 1, 2015 at 1:48 AM, Stefán Baxter <ste...@activitystream.com>
> > >> > wrote:
> > >> >
> > >> > > So we are off to a flying start :)
> > >> > >
> > >> > > On Thu, Oct 29, 2015 at 9:50 PM, Stefán Baxter <ste...@activitystream.com>
> > >> > > wrote:
> > >> > >
> > >> > > > Hi,
> > >> > > >
> > >> > > > We are using Avro, JSON and Parquet for collecting various types
> > >> > > > of data for analytical processing.
> > >> > > >
> > >> > > > I had not used Parquet before we started to play around with
> > >> > > > Drill, and now I'm wondering if we are planning our data
> > >> > > > structures correctly and if we will be able to get the most out
> > >> > > > of Drill+Parquet.
> > >> > > >
> > >> > > > I have some questions and I hope the answers can be turned into
> > >> > > > a Best Practices document.
> > >> > > >
> > >> > > > So here we go:
> > >> > > >
> > >> > > >    - Are there any rules that we must abide by to make scanning
> > >> > > >    of "low-cardinality" columns as effective as it can be?
> > >> > > >    - I understand it so that the Parquet dictionary is scanned
> > >> > > >    for the value(s), and if they are not in the dictionary the
> > >> > > >    section is ignored
> > >> > > >
> > >> > > >    - Can dictionary-based scanning (as described above) work on
> > >> > > >    arrays?
> > >> > > >    - like: {"some":"simple","tags":["blue","green","yellow"]}
> > >> > > >
> > >> > > >    - If I have multiple files containing a day's worth of
> > >> > > >    logging, in chronological order, will all the irrelevant
> > >> > > >    files be ignored when looking for a date or a date range?
> > >> > > >    - AKA - Will the min-max headers in Parquet be used to
> > >> > > >    prevent scanning of data outside the range?
> > >> > > >
> > >> > > >    - Is there anything I need to do to make sure that the write
> > >> > > >    optimizations in Parquet are used?
> > >> > > >    - dictionaries for low-cardinality fields
> > >> > > >    - "number folding" for numerical sequences
> > >> > > >    - compression etc.
> > >> > > >
> > >> > > >    - Are there any Parquet features that are not available in
> > >> > > >    Drill?
> > >> > > >    - I know Drill is using a fork of Parquet and I wonder if any
> > >> > > >    major improvements in Parquet are unavailable
> > >> > > >
> > >> > > >    - Storing dates with timezone information (stored in two
> > >> > > >    separate fields?)
> > >> > > >    - What is the common approach?
> > >> > > >
> > >> > > >    - Are there any caveats in converting Avro to Parquet?
> > >> > > >    - other than converting Unix dates from Avro (only long is
> > >> > > >    available) into timestamp fields in Parquet
> > >> > > >
> > >> > > >
> > >> > > > There will, in all likelihood, be future installments to this
> > >> > > > entry as new questions arise.
> > >> > > >
> > >> > > > All help is appreciated.
> > >> > > >
> > >> > > > Regards,
> > >> > > >  -Stefan
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
