Hi Ryan, I just created this JIRA for it: https://issues.apache.org/jira/browse/PARQUET-131
Comments and suggestions are welcome.

Thanks,
Zhenxiao

On Mon, Nov 10, 2014 at 10:59 AM, Ryan Blue <[email protected]> wrote:

Hi everyone,

Is there a JIRA issue tracking the vectorized reader API? Brock and I have been working through how we would integrate this with Hive and have a few questions and comments. Thanks!

rb

On 11/01/2014 01:03 PM, Brock Noland wrote:

Hi,

Great! I will take a look soon.

Cheers!
Brock

On Mon, Oct 27, 2014 at 11:18 PM, Zhenxiao Luo <[email protected]> wrote:

Thanks Jacques.

Here is the gist: https://gist.github.com/zhenxiao/2728ce4fe0a7be2d3b30

Comments and suggestions are appreciated.

Thanks,
Zhenxiao

On Mon, Oct 27, 2014 at 10:55 PM, Jacques Nadeau <[email protected]> wrote:

You can't send attachments. Can you post it as a Google doc or gist?

On Mon, Oct 27, 2014 at 7:41 PM, Zhenxiao Luo <[email protected]> wrote:

Thanks Brock and Jason.

I just drafted a proposed API for a vectorized Parquet reader (attached to this email). Any comments and suggestions are appreciated.

Thanks,
Zhenxiao

On Tue, Oct 7, 2014 at 5:34 PM, Brock Noland <[email protected]> wrote:

Hi,

The Hive + Parquet community is very interested in improving the performance of Hive + Parquet, and of Parquet generally. We are very interested in contributing to the Parquet vectorization and lazy materialization effort. Please add me to any future meetings on this topic.

BTW, here is the JIRA tracking this effort from the Hive side:
https://issues.apache.org/jira/browse/HIVE-8120

Brock

On Tue, Oct 7, 2014 at 2:04 PM, Zhenxiao Luo <[email protected]> wrote:

Thanks Jason.
Yes, Netflix is using Presto and Parquet for our big data platform
(http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html).

The fastest format currently in Presto is ORC, not DWRF (Parquet is fast, but not as fast as ORC). We are referring to ORC, not Facebook's DWRF implementation.

We already have Parquet working in Presto. We would definitely like to get it as fast as ORC.

Facebook did native support for ORC in Presto, which does not use the ORCRecordReader at all. They parse the ORC footer, and do predicate pushdown by skipping row groups, vectorization by introducing type-specific vectors, and lazy materialization by introducing LazyVectors (their code has not been committed yet; I mean their pull request). We are planning to do similar optimizations for Parquet in Presto.

For the ParquetRecordReader, we need additional APIs to read the next batch of values, and to read in a vector of values. For example, here is the related API in the ORC code:

    /**
     * Read the next row batch. The size of the batch to read cannot be
     * controlled by the callers. Callers need to look at
     * VectorizedRowBatch.size of the returned object to know the batch
     * size read.
     * @param previousBatch a row batch object that can be reused by the reader
     * @return the row batch that was read
     * @throws java.io.IOException
     */
    VectorizedRowBatch nextBatch(VectorizedRowBatch previousBatch) throws IOException;

And here is the related API in the Presto code, which is used for ORC support in Presto:

    public void readVector(int columnIndex, Object vector);

For lazy materialization, we may also consider adding LazyVectors or LazyBlocks, so that values are not materialized until they are accessed by the operator.

Any comments and suggestions are appreciated.

Thanks,
Zhenxiao

On Tue, Oct 7, 2014 at 1:05 PM, Jason Altekruse <[email protected]> wrote:

Hello all,

No updates from me yet, just sending out another message for some of the Netflix engineers that were still just subscribed to the Google group mail. This will allow them to respond directly with their research on the optimized ORC reader for consideration in the design discussion.

-Jason

On Mon, Oct 6, 2014 at 10:51 PM, Jason Altekruse <[email protected]> wrote:

Hello Parquet team,

I wanted to report the results of a discussion between the Drill team and the engineers at Netflix working to make Parquet run faster with Presto. As we have said in the last few hangouts, we both want to make contributions back to parquet-mr to add features and performance.
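The two APIs quoted earlier in this thread (ORC's nextBatch and Presto's readVector) could be combined into a reader contract along the following lines. This is only a minimal sketch: VectorizedReader, RowBatch, and InMemoryReader are hypothetical names for illustration, not actual parquet-mr classes.

```java
// Minimal sketch only: VectorizedReader, RowBatch, and InMemoryReader are
// hypothetical names for illustration, not actual parquet-mr classes.
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class VectorizedSketch {
    // Batch holder modeled on ORC's VectorizedRowBatch: callers must check
    // `size` after each read, since they cannot control the batch size.
    static class RowBatch {
        int size;
        long[] col0 = new long[1024];
    }

    // Reader contract modeled on the two quoted APIs: nextBatch() reuses the
    // caller's batch, readVector() fills one column vector.
    interface VectorizedReader {
        RowBatch nextBatch(RowBatch previousBatch) throws IOException;
        void readVector(int columnIndex, long[] vector) throws IOException;
    }

    // Toy implementation backed by an in-memory array, to show the call pattern.
    static class InMemoryReader implements VectorizedReader {
        private final long[] data;
        private int pos = 0;

        InMemoryReader(long[] data) { this.data = data; }

        public RowBatch nextBatch(RowBatch previousBatch) {
            RowBatch batch = (previousBatch != null) ? previousBatch : new RowBatch();
            int n = Math.min(batch.col0.length, data.length - pos);
            System.arraycopy(data, pos, batch.col0, 0, n);
            pos += n;
            batch.size = n;  // size == 0 signals end of stream
            return batch;
        }

        public void readVector(int columnIndex, long[] vector) {
            int n = Math.min(vector.length, data.length - pos);
            System.arraycopy(data, pos, vector, 0, n);
            pos += n;
        }
    }

    // Sums one column the way a scan operator would: loop, reuse the batch,
    // and stop when a zero-sized batch comes back.
    static long sumColumn(VectorizedReader reader) {
        long total = 0;
        RowBatch batch = null;
        try {
            while (true) {
                batch = reader.nextBatch(batch);
                if (batch.size == 0) break;
                for (int i = 0; i < batch.size; i++) total += batch.col0[i];
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] data = new long[3000];
        Arrays.fill(data, 2L);
        System.out.println(sumColumn(new InMemoryReader(data)));  // prints 6000
    }
}
```

As the ORC javadoc quoted above requires, the caller passes its previous batch back in so the reader can reuse the allocation, and inspects the returned batch's size on every call rather than assuming a fixed batch length.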
We thought it would be good to sit down and speak directly about our real goals and the best next steps to get an engineering effort started to accomplish these goals. Below is a summary of the meeting.

- Meeting notes
  - Attendees:
    - Netflix: Eva Tse, Daniel Weeks, Zhenxiao Luo
    - MapR (Drill team): Jacques Nadeau, Jason Altekruse, Parth Chandra
  - Minutes
    - Introductions / background
      - Netflix
        - Working on providing interactive SQL querying to users
        - Have chosen Presto as the query engine and Parquet as the high-performance data storage format
        - Presto is providing the needed speed in some cases, but others are missing optimizations that could be avoiding reads
        - Have already started some development and investigation; have identified key goals
        - Some initial benchmarks with a modified ORC reader (DWRF), written by the Presto team, show that such gains are possible with a different reader implementation
        - Goals
          - Filter pushdown
            - Skipping reads based on filter evaluation on one or more columns
            - This can happen at several granularities: row group, page, record/value
          - Late/lazy materialization
            - For columns not involved in a filter, avoid materializing them entirely until they are known to be needed after evaluating a filter on other columns
      - Drill
        - The Drill engine uses an in-memory vectorized representation of records
        - For scalar and repeated types, we have implemented a fast vectorized reader that is optimized to transform between Parquet's on-disk format and our in-memory format
        - This is currently producing performant table scans, but has no facility for filter pushdown
    - Major goals going forward
      - Filter pushdown
        - Decide the best implementation for incorporating filter pushdown into our current implementation, or figure out a way to leverage existing work in the parquet-mr library to accomplish this goal
      - Late/lazy materialization
        - See above
      - Contribute existing code back to Parquet
        - The Drill Parquet reader has a very strong emphasis on performance and a clear interface to consume; sufficiently separated from Drill, it could prove very useful for other projects
    - First steps
      - The Netflix team will share some of their thoughts and research from working with the DWRF code
        - We can have a discussion based on this: which aspects are done well, and any opportunities they may have missed that we can incorporate into our design
      - Do further investigation and ask the existing community for guidance on existing parquet-mr features or planned APIs that may provide the desired functionality
      - We will begin a discussion of an API for the new functionality
    - Some outstanding thoughts for down the road
      - The Drill team has an interest in very late materialization for data stored in dictionary-encoded pages, such as running a join or filter on the dictionary and then going back to the reader to grab all of the values in the data that match the needed members of the dictionary
        - This is a later consideration, but it is part of the reason we are opening up the design discussion early: so that the API can be flexible enough to allow this in the future, even if it is not implemented right away

--
Ryan Blue
Software Engineer
Cloudera, Inc.
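The row-group-level filter pushdown described in the meeting notes can be illustrated with a small sketch. Assuming per-group min/max statistics like those kept in the Parquet footer, an equality predicate lets the reader skip whole row groups without any I/O; RowGroupStats and groupsToRead are hypothetical names for illustration, not parquet-mr API.

```java
// Hypothetical sketch of statistics-based row-group skipping; RowGroupStats
// and groupsToRead are illustrative names, not parquet-mr's actual classes.
import java.util.ArrayList;
import java.util.List;

public class PushdownSketch {
    // Per-row-group min/max statistics for one column, as a footer would store.
    static class RowGroupStats {
        final long min, max;
        RowGroupStats(long min, long max) { this.min = min; this.max = max; }
    }

    // Returns the indexes of row groups that might contain rows where the
    // column equals `value`; every other group is skipped entirely.
    static List<Integer> groupsToRead(List<RowGroupStats> stats, long value) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < stats.size(); i++) {
            RowGroupStats s = stats.get(i);
            if (value >= s.min && value <= s.max) keep.add(i);
        }
        return keep;
    }

    public static void main(String[] args) {
        List<RowGroupStats> stats = List.of(
            new RowGroupStats(0, 99),     // group 0
            new RowGroupStats(100, 199),  // group 1
            new RowGroupStats(200, 299)); // group 2
        System.out.println(groupsToRead(stats, 150)); // prints [1]
    }
}
```

The same shape extends to the other granularities the notes mention: page-level skipping uses page statistics instead of row-group statistics, and record-level skipping evaluates the predicate per value after decoding the filter column first.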

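The dictionary-page idea raised at the end of the notes (evaluate the filter against the small dictionary first, then go back for the rows whose encoded ids matched) might look roughly like this; matchingRows and the plain-array layout are illustrative assumptions, not any project's actual API.

```java
// Hypothetical sketch of dictionary-based late materialization: filter the
// (small) dictionary once, then scan only the integer ids, never
// materializing a string per row. Names and layout are illustrative only.
import java.util.ArrayList;
import java.util.List;

public class DictionaryFilterSketch {
    // Returns the row indexes whose decoded value starts with `prefix`.
    static List<Integer> matchingRows(String[] dictionary, int[] encodedIds, String prefix) {
        // Step 1: evaluate the predicate once per dictionary entry.
        boolean[] dictMatches = new boolean[dictionary.length];
        for (int i = 0; i < dictionary.length; i++)
            dictMatches[i] = dictionary[i].startsWith(prefix);
        // Step 2: scan the encoded ids, keeping rows whose entry matched.
        List<Integer> rows = new ArrayList<>();
        for (int row = 0; row < encodedIds.length; row++)
            if (dictMatches[encodedIds[row]]) rows.add(row);
        return rows;
    }

    public static void main(String[] args) {
        String[] dict = {"apple", "banana", "apricot"};
        int[] ids = {0, 1, 2, 1, 0};
        System.out.println(matchingRows(dict, ids, "ap")); // prints [0, 2, 4]
    }
}
```

The win is that the predicate runs once per distinct value instead of once per row, which is exactly why the notes flag this as worth keeping the API flexible for, even if it is not implemented right away.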