Ryan Brush created CRUNCH-336:
---------------------------------
Summary: Optimized filters and joins via Parquet RecordFilters
Key: CRUNCH-336
URL: https://issues.apache.org/jira/browse/CRUNCH-336
Project: Crunch
Issue Type: Improvement
Reporter: Ryan Brush
Logging this to track some ideas from an offline discussion with [~jwills] and
[~mkwhitacre]. There's an opportunity to significantly speed up a couple of
access patterns:
1. Process only the subset of data in a Parquet file identified by a single
column.
2. Perform a bloom filter join between two datasets, where the join key is a
Parquet column in the larger dataset.
Optimizing item 1 simply involves using a RecordFilter to narrow down the data
loaded from the AvroParquetInputFormat.
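A minimal sketch of what item 1 could look like, assuming the existing parquet.filter API (UnboundRecordFilter / ColumnRecordFilter / ColumnPredicates); the class name and the 'eventType' column are made up for illustration:

{code:java}
import org.apache.hadoop.mapreduce.Job;

import parquet.column.ColumnReader;
import parquet.filter.ColumnPredicates;
import parquet.filter.ColumnRecordFilter;
import parquet.filter.RecordFilter;
import parquet.filter.UnboundRecordFilter;
import parquet.hadoop.ParquetInputFormat;

/** Keeps only records whose (hypothetical) 'eventType' column equals "click". */
public class EventTypeFilter implements UnboundRecordFilter {

  @Override
  public RecordFilter bind(Iterable<ColumnReader> readers) {
    // Delegate to the built-in single-column equality filter.
    return ColumnRecordFilter.column("eventType",
        ColumnPredicates.equalTo("click")).bind(readers);
  }

  /** AvroParquetInputFormat extends ParquetInputFormat, so registering the
   *  filter class on the job should be enough for the narrowed read to apply. */
  public static void configure(Job job) {
    ParquetInputFormat.setUnboundRecordFilter(job, EventTypeFilter.class);
  }
}
{code}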
Optimizing item 2 is more involved. In a nutshell, we discussed doing a bloom
filter join, but using the bloom filter to implement the Parquet RecordFilter
on the specific join column. In cases where we join on a column and only select
a small subset of the larger dataset, this would skip the IO and
deserialization cost for every record that doesn't match the join.
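To make the item 2 idea concrete, here's a rough sketch of what the filter side could look like. The 'customerId' column and the filter sizing are hypothetical, and how the populated bloom filter actually reaches the filter instance (e.g. rehydrating it from the distributed cache) is exactly the wiring question raised below:

{code:java}
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

import parquet.column.ColumnReader;
import parquet.filter.ColumnPredicates;
import parquet.filter.ColumnRecordFilter;
import parquet.filter.RecordFilter;
import parquet.filter.UnboundRecordFilter;

/**
 * Drops records whose join key ('customerId' here, purely for illustration)
 * definitely does not appear on the smaller side of the join, so they are
 * never assembled or handed to the join at all.
 */
public class BloomJoinRecordFilter implements UnboundRecordFilter {

  // In practice this would be rehydrated (it implements Writable) from a side
  // file shipped via the distributed cache; built empty here only so the
  // sketch stands alone.
  private final BloomFilter joinKeys =
      new BloomFilter(8 * 1024 * 1024, 5, Hash.MURMUR_HASH);

  @Override
  public RecordFilter bind(Iterable<ColumnReader> readers) {
    return ColumnRecordFilter.column("customerId",
        new ColumnPredicates.Predicate() {
          @Override
          public boolean apply(ColumnReader input) {
            // Keep the record only if its key might be in the smaller dataset.
            return joinKeys.membershipTest(new Key(input.getBinary().getBytes()));
          }
        }).bind(readers);
  }
}
{code}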
It's not obvious to me how we'd achieve this cleanly, since it involves
multiple pieces (configuring inputs in conjunction with a specific join
strategy). In many cases the bloom filter join alone will achieve sufficient
performance, but I'm logging this potential optimization for reference.