Ryan Brush created CRUNCH-336:
---------------------------------

             Summary: Optimized filters and joins via Parquet RecordFilters
                 Key: CRUNCH-336
                 URL: https://issues.apache.org/jira/browse/CRUNCH-336
             Project: Crunch
          Issue Type: Improvement
            Reporter: Ryan Brush


Logging this to track some ideas from an offline discussion with [~jwills] and 
[~mkwhitacre]. There's an opportunity to significantly speed up a couple of 
access patterns:

1. Process only a subset of data from a Parquet file, identified by a single 
column.
2. Perform a bloom filter join between two datasets, where the join key is a 
Parquet column in the larger dataset.

Optimizing item 1 simply involves using a RecordFilter to narrow down the data 
loaded by AvroParquetInputFormat.
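
A minimal sketch of item 1 using the parquet.filter API; the "status" column 
name and the literal value are placeholders:

{code:java}
import org.apache.hadoop.mapreduce.Job;

import parquet.avro.AvroParquetInputFormat;
import parquet.column.ColumnReader;
import parquet.filter.ColumnPredicates;
import parquet.filter.ColumnRecordFilter;
import parquet.filter.RecordFilter;
import parquet.filter.UnboundRecordFilter;
import parquet.hadoop.ParquetInputFormat;

// Keeps only records whose "status" column equals "ACTIVE"; everything else
// is skipped before Avro deserialization.
public class StatusFilter implements UnboundRecordFilter {
  @Override
  public RecordFilter bind(Iterable<ColumnReader> readers) {
    return ColumnRecordFilter
        .column("status", ColumnPredicates.equalTo("ACTIVE"))
        .bind(readers);
  }
}

// Wiring it into the input configuration:
//   Job job = Job.getInstance(conf);
//   job.setInputFormatClass(AvroParquetInputFormat.class);
//   ParquetInputFormat.setUnboundRecordFilter(job, StatusFilter.class);
{code}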

Optimizing item 2 is more involved. In a nutshell, we discussed doing a bloom 
filter join, but also using the bloom filter to implement the Parquet 
RecordFilter on the join column. In cases where we join on columns and only 
select a small subset of the larger dataset, this would skip IO and 
deserialization cost for all items that didn't match the join.
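
A rough sketch of what that bloom-filter-backed RecordFilter could look like; 
the "user_id" join column, the side-file path, and the use of Hadoop's 
BloomFilter are all illustrative assumptions:

{code:java}
import java.io.DataInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

import parquet.column.ColumnReader;
import parquet.filter.ColumnPredicates;
import parquet.filter.ColumnRecordFilter;
import parquet.filter.RecordFilter;
import parquet.filter.UnboundRecordFilter;

public class BloomJoinFilter implements UnboundRecordFilter {

  private final BloomFilter bloom;

  public BloomJoinFilter() {
    // Hypothetical: the bloom filter over the smaller dataset's join keys is
    // written to a known side file (e.g. via the distributed cache) before
    // the job runs, and deserialized once per task here.
    bloom = new BloomFilter();
    try {
      Configuration conf = new Configuration();
      Path side = new Path("bloom/join-keys.bloom");
      try (DataInputStream in = FileSystem.get(conf).open(side)) {
        bloom.readFields(in);
      }
    } catch (IOException e) {
      throw new RuntimeException("Unable to load bloom filter side file", e);
    }
  }

  @Override
  public RecordFilter bind(Iterable<ColumnReader> readers) {
    // Only records whose join column probably appears in the smaller dataset
    // survive; everything else is skipped before deserialization.
    return ColumnRecordFilter.column("user_id", new ColumnPredicates.Predicate() {
      public boolean apply(ColumnReader column) {
        return bloom.membershipTest(new Key(column.getBinary().getBytes()));
      }
    }).bind(readers);
  }
}
{code}

Since ParquetInputFormat takes the filter as a class instantiated per task, the 
bloom filter itself would have to be shipped out of band (e.g. via the 
distributed cache), which is part of why wiring this into a join strategy 
cleanly is tricky.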

It's not obvious to me how we'd achieve this cleanly, since it involves 
multiple pieces (configuring inputs in conjunction with a specific join 
strategy). In many cases the bloom filter join alone will achieve sufficient 
performance, but I'm logging this potential optimization for reference.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)