That is correct.

On Wed, Jan 7, 2015 at 7:57 PM, Adam Gilmore <a...@pharmadata.net.au> wrote:

> That makes a lot of sense.  Just one question regarding the handling of
> complex types - do you mean maps/arrays/etc. (repetitions in Parquet)?  As
> in, if I created a Parquet table from some JSON files with a rather
> complex/nested structure, would it fall back to individual copies?
>
>
> Regards,
>
> Adam Gilmore
>
> On Thu, Jan 8, 2015 at 12:05 PM, Jason Altekruse <altekruseja...@gmail.com> wrote:
>
>> The Parquet library provides an interface for accessing individual values
>> of each column (as well as a record-assembly interface for populating
>> Java objects). Since Parquet is columnar and the Drill in-memory storage
>> format is also columnar, we get much better read performance on queries
>> where most of the data is needed by copying long runs of values at once
>> rather than making a large number of individual copies.
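>>
>> To make that concrete, here is a minimal, self-contained sketch
>> (illustrative only, not Drill's actual reader code; the interface below
>> just models a value-at-a-time column reader) contrasting per-value copies
>> with one bulk copy per contiguous run:
>>
>>   /** Illustrative sketch of bulk vs. per-value column copies. */
>>   public class BulkVsPointCopy {
>>
>>       /** Models the shape of a value-at-a-time column reader. */
>>       interface ColumnValueReader {
>>           int readInt();
>>       }
>>
>>       // Per-value copy: one interface call per value.
>>       static void copyOneByOne(ColumnValueReader reader, int[] dest) {
>>           for (int i = 0; i < dest.length; i++) {
>>               dest[i] = reader.readInt();
>>           }
>>       }
>>
>>       // Bulk copy: a single memory copy for a whole run of
>>       // plain-encoded values.
>>       static void copyBulk(int[] page, int[] dest) {
>>           System.arraycopy(page, 0, dest, 0, dest.length);
>>       }
>>
>>       public static void main(String[] args) {
>>           int[] page = new int[10_000_000];
>>           int[] vector = new int[page.length];
>>           int[] cursor = {0};
>>           ColumnValueReader reader = () -> page[cursor[0]++];
>>
>>           long t0 = System.nanoTime();
>>           copyOneByOne(reader, vector);
>>           long t1 = System.nanoTime();
>>           copyBulk(page, vector);
>>           long t2 = System.nanoTime();
>>           System.out.printf("per-value: %.1f ms, bulk: %.1f ms%n",
>>                   (t1 - t0) / 1e6, (t2 - t1) / 1e6);
>>       }
>>   }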
>>
>> This obviously does not give us great performance for point queries,
>> where only a small subset of the data is needed. Those use cases are
>> prevalent and we are hoping to fix this soon, but when we wrote the
>> original implementation we were focused on stretching the bounds of how
>> fast we could load a large volume of data into the engine.
>>
>> The second reader, written to handle complex types, does use the current
>> 'columnar' interface exposed by the Parquet library, but it still
>> requires us to make an individual copy for each value. Even when we
>> experimented with early versions of the projection pushdown provided by
>> the Parquet codebase, we were unable to match the performance of reading
>> and filtering the data ourselves. That was not fully explored, and a
>> number of enhancements have since been made to the Parquet mainline that
>> may give us the performance we are looking for in these cases. We
>> haven't had time to revisit it so far.
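>>
>> For nested data, the per-value pattern looks roughly like the following
>> (a sketch; the interface mirrors the general shape of parquet-mr's
>> ColumnReader, but the names are approximate, not a guaranteed API). The
>> point is that every value carries repetition and definition levels that
>> must be inspected individually, which is why long-run copies do not
>> apply:
>>
>>   /** Sketch of value-at-a-time assembly for a repeated int32 column. */
>>   public class NestedColumnCopy {
>>
>>       /** Approximates the shape of a Parquet column reader. */
>>       interface ColumnReaderLike {
>>           int getCurrentRepetitionLevel(); // 0 starts a new record
>>           int getCurrentDefinitionLevel(); // below max means null/empty
>>           int getInteger();
>>           void consume();                  // advance to the next value
>>       }
>>
>>       static void assemble(ColumnReaderLike r, int count, int maxDef) {
>>           for (int i = 0; i < count; i++) {
>>               int rep = r.getCurrentRepetitionLevel();
>>               int def = r.getCurrentDefinitionLevel();
>>               if (def == maxDef) {
>>                   // A real reader would write into the right nested slot
>>                   // of a value vector here.
>>                   System.out.printf("rep=%d value=%d%n", rep, r.getInteger());
>>               } else {
>>                   System.out.printf("rep=%d empty (def=%d)%n", rep, def);
>>               }
>>               r.consume();
>>           }
>>       }
>>
>>       public static void main(String[] args) {
>>           // Levels/values for {"a": [1, 2]} then {"a": []} with
>>           // maxRep = 1 and maxDef = 1 (schema: repeated int32 a).
>>           int[] reps = {0, 1, 0};
>>           int[] defs = {1, 1, 0};
>>           int[] vals = {1, 2, 0};
>>           int[] pos = {0};
>>           assemble(new ColumnReaderLike() {
>>               public int getCurrentRepetitionLevel() { return reps[pos[0]]; }
>>               public int getCurrentDefinitionLevel() { return defs[pos[0]]; }
>>               public int getInteger() { return vals[pos[0]]; }
>>               public void consume() { pos[0]++; }
>>           }, reps.length, 1);
>>       }
>>   }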
>>
>> -Jason Altekruse
>>
>> On Wed, Jan 7, 2015 at 4:04 PM, Adam Gilmore <dragoncu...@gmail.com> wrote:
>>
>> > Out of interest, is there a reason Drill effectively implemented its
>> > own Parquet reader as opposed to using the reading classes from the
>> > Parquet project itself?  Were there particular performance reasons for
>> > this?
>> >
>> > On Thu, Jan 8, 2015 at 2:22 AM, Jason Altekruse <altekruseja...@gmail.com> wrote:
>> >
>> > > Just made one; I added some comments from the design discussions we
>> > > have had in the past.
>> > >
>> > > https://issues.apache.org/jira/browse/DRILL-1950
>> > >
>> > > - Jason Altekruse
>> > >
>> > > On Tue, Jan 6, 2015 at 11:04 PM, Adam Gilmore <dragoncu...@gmail.com> wrote:
>> > >
>> > > > Just a quick follow-up on this - is there a JIRA item for
>> > > > implementing predicate pushdown for Parquet scans, or do we need to
>> > > > create one?
>> > > >
>> > > > On Tue, Jan 6, 2015 at 1:56 AM, Jason Altekruse <altekruseja...@gmail.com> wrote:
>> > > >
>> > > > > Hi Adam,
>> > > > >
>> > > > > I have a few thoughts that might explain the difference in query
>> > > > > times. Drill is able to read a subset of the data from a Parquet
>> > > > > file when you select only a few columns out of a large file, so
>> > > > > in terms of read performance Drill will give you faster results
>> > > > > if you ask for 3 columns instead of 10. However, we are still
>> > > > > working on further optimizing the reader to make use of the
>> > > > > statistics contained in the block and page metadata: because the
>> > > > > Parquet writer can store min/max values for blocks of data, those
>> > > > > statistics will let us skip reading parts of a column entirely.
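>> > > > >
>> > > > > As a sketch of what that would look like (the footer API calls
>> > > > > here are from memory and worth checking against your Parquet
>> > > > > version; older releases used the parquet.* package prefix rather
>> > > > > than org.apache.parquet.*, and "customer_id" is just an
>> > > > > illustrative INT64 column):
>> > > > >
>> > > > >   import org.apache.hadoop.conf.Configuration;
>> > > > >   import org.apache.hadoop.fs.Path;
>> > > > >   import org.apache.parquet.column.statistics.Statistics;
>> > > > >   import org.apache.parquet.hadoop.ParquetFileReader;
>> > > > >   import org.apache.parquet.hadoop.metadata.BlockMetaData;
>> > > > >   import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
>> > > > >   import org.apache.parquet.hadoop.metadata.ParquetMetadata;
>> > > > >
>> > > > >   /** Sketch: decide from footer stats which row groups to skip. */
>> > > > >   public class RowGroupSkipSketch {
>> > > > >       public static void main(String[] args) throws Exception {
>> > > > >           ParquetMetadata footer = ParquetFileReader.readFooter(
>> > > > >                   new Configuration(), new Path(args[0]));
>> > > > >           long key = Long.parseLong(args[1]);
>> > > > >
>> > > > >           for (BlockMetaData block : footer.getBlocks()) {
>> > > > >               for (ColumnChunkMetaData col : block.getColumns()) {
>> > > > >                   if (!"customer_id".equals(col.getPath().toDotString()))
>> > > > >                       continue;
>> > > > >                   Statistics<?> stats = col.getStatistics();
>> > > > >                   // If key lies outside [min, max], no page of this
>> > > > >                   // row group needs to be read at all.
>> > > > >                   boolean skip = stats != null
>> > > > >                           && ((Long) stats.genericGetMax() < key
>> > > > >                               || (Long) stats.genericGetMin() > key);
>> > > > >                   System.out.printf("rows=%d skip=%b%n",
>> > > > >                           block.getRowCount(), skip);
>> > > > >               }
>> > > > >           }
>> > > > >       }
>> > > > >   }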
>> > > > >
>> > > > > If you ran a query that was summing over a column, the unfiltered
>> > > > > version was faster because it avoided the large number of
>> > > > > individual value copies made while filtering out the records that
>> > > > > were not needed. That filtering currently takes place in a
>> > > > > separate filter operator; it should be pushed down into the read
>> > > > > operation to make use of the file metadata and eliminate some of
>> > > > > the reads.
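>> > > > >
>> > > > > Purely as an illustration (not Drill's operator code), the
>> > > > > difference between the two plans comes down to something like
>> > > > > this:
>> > > > >
>> > > > >   import java.util.Arrays;
>> > > > >   import java.util.List;
>> > > > >
>> > > > >   /** Illustrative: late filtering vs. skipping blocks by stats. */
>> > > > >   public class PushdownSketch {
>> > > > >       static class Block {
>> > > > >           final long min, max;
>> > > > >           final long[] values;
>> > > > >           Block(long... values) {
>> > > > >               this.values = values;
>> > > > >               this.min = Arrays.stream(values).min().getAsLong();
>> > > > >               this.max = Arrays.stream(values).max().getAsLong();
>> > > > >           }
>> > > > >       }
>> > > > >
>> > > > >       // Today: copy everything, then filter in a later operator.
>> > > > >       static long countLate(List<Block> blocks, long key) {
>> > > > >           return blocks.stream()
>> > > > >                   .flatMapToLong(b -> Arrays.stream(b.values))
>> > > > >                   .filter(v -> v == key).count();
>> > > > >       }
>> > > > >
>> > > > >       // Pushed down: blocks whose [min, max] excludes the key
>> > > > >       // are never decoded or copied at all.
>> > > > >       static long countPushdown(List<Block> blocks, long key) {
>> > > > >           return blocks.stream()
>> > > > >                   .filter(b -> key >= b.min && key <= b.max)
>> > > > >                   .flatMapToLong(b -> Arrays.stream(b.values))
>> > > > >                   .filter(v -> v == key).count();
>> > > > >       }
>> > > > >
>> > > > >       public static void main(String[] args) {
>> > > > >           List<Block> blocks = Arrays.asList(
>> > > > >                   new Block(1, 2, 3),
>> > > > >                   new Block(40, 41, 42),
>> > > > >                   new Block(100, 101));
>> > > > >           // Same answer, but only 1 of 3 blocks is touched below.
>> > > > >           System.out.println(countLate(blocks, 42));     // 1
>> > > > >           System.out.println(countPushdown(blocks, 42)); // 1
>> > > > >       }
>> > > > >   }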
>> > > > >
>> > > > > -Jason
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Mon, Jan 5, 2015 at 8:15 AM, Adam Gilmore <dragoncu...@gmail.com> wrote:
>> > > > >
>> > > > > > Hi guys,
>> > > > > >
>> > > > > > I have a question re Parquet.  I'm not sure if this is a Drill
>> > > > > > question or a Parquet one, but thought I'd start here.
>> > > > > >
>> > > > > > I have a sample dataset of ~100M rows in a Parquet file.  It's
>> > > > > > quick to sum a single column across the whole dataset.
>> > > > > >
>> > > > > > I have a column with approx 100 unique values (e.g. a customer
>> > > > > > ID).  When I filter on that column by one of those values
>> > > > > > (reducing the set to ~1M rows), the query takes longer than the
>> > > > > > unfiltered sum.
>> > > > > >
>> > > > > > This doesn't make a lot of sense to me - I would have expected
>> > > > > > the Parquet format to bring back only the segments matching the
>> > > > > > filter and to sum just those values.  I would expect that to
>> > > > > > make the query orders of magnitude faster, not slower.
>> > > > > >
>> > > > > > Other columnar formats I've used (e.g. ORCFile, SQL Server
>> > > > > > Columnstore) have acted this way, so I can't quite understand
>> > > > > > why Parquet doesn't act the same.
>> > > > > >
>> > > > > > Can anyone suggest what I'm doing wrong?
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
