Hello all,

No updates from me yet; I'm just sending out another message for some of the Netflix engineers who were still subscribed only through the Google Group mail. This will allow them to respond directly with their research on the optimized ORC reader for consideration in the design discussion.
-Jason

On Mon, Oct 6, 2014 at 10:51 PM, Jason Altekruse <[email protected]> wrote:

> Hello Parquet team,
>
> I wanted to report the results of a discussion between the Drill team and
> the engineers at Netflix working to make Parquet run faster with Presto.
> As we have said in the last few hangouts, we both want to make
> contributions back to parquet-mr to add features and improve performance.
> We thought it would be good to sit down and speak directly about our real
> goals and the best next steps to get an engineering effort started to
> accomplish them.
>
> Below is a summary of the meeting.
>
> Meeting notes
>
> - Attendees:
>   - Netflix: Eva Tse, Daniel Weeks, Zhenxiao Luo
>   - MapR (Drill team): Jacques Nadeau, Jason Altekruse, Parth Chandra
> - Minutes
>   - Introductions / Background
>     - Netflix
>       - Working on providing interactive SQL querying to users
>       - Have chosen Presto as the query engine and Parquet as the
>         high-performance data storage format
>       - Presto is providing the needed speed in some cases, but other
>         cases are missing optimizations that could avoid reads
>       - Have already started some development and investigation, and have
>         identified key goals
>       - Some initial benchmarks with DWRF, a modified ORC reader written
>         by the Presto team, show that such gains are possible with a
>         different reader implementation
>       - Goals
>         - Filter pushdown
>           - Skipping reads based on filter evaluation on one or more
>             columns
>           - This can happen at several granularities: row group, page,
>             record/value
>         - Late/lazy materialization
>           - For columns not involved in a filter, avoid materializing
>             them entirely until they are known to be needed after
>             evaluating a filter on other columns
>     - Drill
>       - The Drill engine uses an in-memory vectorized representation of
>         records
>       - For scalar and repeated types we have implemented a fast
>         vectorized reader that is optimized to transform between
>         Parquet's on-disk format and our in-memory format
>       - This currently produces performant table scans, but has no
>         facility for filter pushdown
>       - Major goals going forward
>         - Filter pushdown
>           - Decide on the best approach for incorporating filter pushdown
>             into our current implementation, or find a way to leverage
>             existing work in the parquet-mr library to accomplish this
>             goal
>         - Late/lazy materialization
>           - See above
>         - Contribute existing code back to Parquet
>           - The Drill Parquet reader has a very strong emphasis on
>             performance and a clear interface to consume; sufficiently
>             separated from Drill, it could prove very useful for other
>             projects
>   - First steps
>     - The Netflix team will share some of their thoughts and research
>       from working with the DWRF code
>       - We can then have a discussion based on this: which aspects are
>         done well, and any opportunities they may have missed that we can
>         incorporate into our design
>     - Do further investigation and ask the existing community for
>       guidance on existing parquet-mr features or planned APIs that may
>       provide the desired functionality
>     - We will begin a discussion of an API for the new functionality
>   - Some outstanding thoughts for down the road
>     - The Drill team has an interest in very late materialization for
>       data stored in dictionary-encoded pages, such as running a join or
>       filter on the dictionary and then going back to the reader to grab
>       all of the values in the data that match the needed members of the
>       dictionary
>     - This is a later consideration, but it is part of the reason we are
>       opening up the design discussion early: so that the API can be
>       flexible enough to allow this in the future, even if it is not
>       implemented right away
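For anyone on the list less familiar with the two optimizations discussed above, here is a minimal, purely illustrative Python sketch of row-group filter pushdown plus late materialization over a toy columnar layout. None of this is parquet-mr or Drill code; the `RowGroup` and `scan` names, and the min/max statistics layout, are invented for this example.

```python
class RowGroup:
    """A chunk of rows stored column-by-column, with min/max statistics
    kept per column so a filter can skip the whole group without a read."""
    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values
        self.stats = {name: (min(vals), max(vals))
                      for name, vals in columns.items()}

def scan(row_groups, filter_col, lo, hi, project_cols):
    """Return projected rows where lo <= filter_col <= hi.

    1. Filter pushdown: skip any row group whose [min, max] range for
       filter_col cannot intersect [lo, hi] (row-group granularity; the
       same idea applies per page or per value).
    2. Late materialization: only the filter column is read up front;
       projected columns are materialized solely for rows that pass.
    """
    out = []
    for rg in row_groups:
        cmin, cmax = rg.stats[filter_col]
        if cmax < lo or cmin > hi:
            continue  # whole row group skipped on statistics alone
        filter_vals = rg.columns[filter_col]  # materialize filter column only
        hits = [i for i, v in enumerate(filter_vals) if lo <= v <= hi]
        for i in hits:  # materialize remaining columns lazily, per hit
            out.append({c: rg.columns[c][i] for c in project_cols})
    return out

groups = [
    RowGroup({"id": [1, 2, 3], "name": ["a", "b", "c"]}),
    RowGroup({"id": [10, 11, 12], "name": ["x", "y", "z"]}),
]
# Only the first row group intersects id in [2, 3]; the second is
# skipped without touching its data.
print(scan(groups, "id", 2, 3, ["name"]))  # [{'name': 'b'}, {'name': 'c'}]
```

The dictionary-page idea mentioned later in the notes is the same pattern taken one step further: the filter runs once over the (small) dictionary, and only values whose dictionary indices match are ever decoded.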
