yordan-pavlov opened a new pull request #9064:
URL: https://github.com/apache/arrow/pull/9064


   While profiling a DataFusion query I found that the code spends a lot of 
time in reading data from parquet files. Predicate / filter push-down is a 
commonly used performance optimization, where statistics data stored in parquet 
files (such as min / max values for columns in a parquet row group) is 
evaluated against query filters to determine which row groups could contain 
data requested by a query. In this way, by pushing down query filters all the 
way to the parquet data source, entire row groups or even parquet files can be 
skipped often resulting in significant performance improvements.
   
   I have been working on an implementation for a few weeks and initial results 
look promising - with predicate push-down, DataFusion is now faster than Apache 
Spark (`140ms for DataFusion vs 200ms for Spark`) for the same query against 
the same parquet files.
   
   My work is based on the following key ideas:
   * predicate-push down is implemented by filtering row group metadata entries 
to only those which could contain data which could satisfy query filters
   * it's best to reuse the existing code for evaluating physical expressions 
already implemented in DataFusion
   * filter expressions pushed down to a parquet table are rewritten to use 
parquet statistics (instead of the actual column data), for example `(column / 
2) = 4`  becomes  `(column_min / 2) <= 4 && 4 <= (column_max / 2)` - this is 
done once for all files in a parquet table
   * for each parquet file, a RecordBatch containing all required statistics 
columns ( [`column_min`, `column_max`] in the example above) is produced, and 
the predicate expression from the previous step is evaluated, producing a 
binary array which is finally used to filter the row groups in each parquet file
   
   This is still work in progress - more tests left to write; I am publishing 
this now to gather feedback.
   
   @andygrove let me know what you think


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to