[GitHub] [arrow-datafusion] alamb opened a new pull request, #3822: Expose parquet reader settings as DataFusion config settings

GitBox Thu, 13 Oct 2022 09:03:45 -0700


alamb opened a new pull request, #3822:
URL: https://github.com/apache/arrow-datafusion/pull/3822


   # Which issue does this PR close?
   
   Closes https://github.com/apache/arrow-datafusion/issues/3821
   
   # Rationale for this change
   I want to test parquet filter pushdown on real datasets using `datafusion-cli` so that we can enable it by default (see https://github.com/apache/arrow-datafusion/issues/3463).
   
   I want to be able to do so via `datafusion-cli` like:
   
   ```shell
   $ target/debug/datafusion-cli
   DataFusion CLI v13.0.0
   ❯ show all;
   +-------------------------------------------------+---------+
   | name                                            | setting |
   +-------------------------------------------------+---------+
   | datafusion.execution.time_zone                  | UTC     |
   | datafusion.execution.parquet.pushdown_filters   | false   | <---- Note the option is now visible here
   | datafusion.explain.physical_plan_only           | false   |
   | datafusion.execution.coalesce_target_batch_size | 4096    |
   | datafusion.execution.batch_size                 | 8192    |
   | datafusion.execution.coalesce_batches           | true    |
   | datafusion.explain.logical_plan_only            | false   |
   | datafusion.optimizer.skip_failed_rules          | true    |
   | datafusion.optimizer.filter_null_join_keys      | false   |
   +-------------------------------------------------+---------+
   ```
   
   And then set them via environment variables like:
   
   ```shell
   $ DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS=true target/debug/datafusion-cli
   DataFusion CLI v13.0.0
   ❯ show all;
   +-------------------------------------------------+---------+
   | name                                            | setting |
   +-------------------------------------------------+---------+
   | datafusion.execution.batch_size                 | 8192    |
   | datafusion.execution.coalesce_batches           | true    |
   | datafusion.explain.logical_plan_only            | false   |
   | datafusion.optimizer.filter_null_join_keys      | false   |
   | datafusion.execution.parquet.enable_page_index  | false   |
   | datafusion.optimizer.skip_failed_rules          | true    |
   | datafusion.explain.physical_plan_only           | false   |
   | datafusion.execution.time_zone                  | UTC     |
   | datafusion.execution.coalesce_target_batch_size | 4096    |
   | datafusion.execution.parquet.pushdown_filters   | true    | <---- Note the option is set to true here!!!!
   | datafusion.execution.parquet.reorder_filters    | false   |
   +-------------------------------------------------+---------+
   ```
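
The environment variable in the example above appears to be the config key uppercased with each `.` replaced by `_`. A minimal sketch of that mapping follows, assuming the convention holds for every key. The helper name `env_var_name` is hypothetical and is not part of DataFusion's API.

```rust
/// Hypothetical helper: derive the environment variable name for a config key,
/// assuming the convention seen in the example (uppercase, `.` replaced by `_`).
fn env_var_name(key: &str) -> String {
    key.to_uppercase().replace('.', "_")
}

fn main() {
    // Config keys taken from the `show all` output above.
    let keys = [
        "datafusion.execution.parquet.pushdown_filters",
        "datafusion.execution.parquet.reorder_filters",
        "datafusion.execution.parquet.enable_page_index",
    ];
    for key in keys {
        println!("{} -> {}", key, env_var_name(key));
    }
}
```

If the convention holds, the other two parquet settings could be set the same way as `pushdown_filters`.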
   
   
   # What changes are included in this PR?
   1. Add three new config settings to `ConfigOptions`
   2. Thread `ConfigOptions` down into the `FileScanConfig`
   3. Remove `ParquetScanOptions` in favor of these new configs (will comment on the rationale here)
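
The shape of the change listed above can be sketched roughly as follows: a shared config is carried by the scan description, and the parquet reader looks its settings up there instead of in a separate options struct. This is only an illustration. `SketchConfig` and `SketchScanConfig` are hypothetical stand-ins, not the real `ConfigOptions` or `FileScanConfig` types, and the actual signatures in this PR may differ.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for a shared config store (not DataFusion's `ConfigOptions`).
#[derive(Debug, Default)]
struct SketchConfig {
    bools: HashMap<String, bool>,
}

impl SketchConfig {
    fn set_bool(&mut self, key: &str, value: bool) {
        self.bools.insert(key.to_string(), value);
    }

    // Keys that were never set read as `false`, matching the defaults
    // shown for the parquet settings in the `show all` output.
    fn get_bool(&self, key: &str) -> bool {
        self.bools.get(key).copied().unwrap_or(false)
    }
}

// Hypothetical stand-in for a scan description that carries the shared config
// rather than a dedicated parquet options struct.
struct SketchScanConfig {
    config: Arc<SketchConfig>,
}

impl SketchScanConfig {
    fn pushdown_filters(&self) -> bool {
        self.config
            .get_bool("datafusion.execution.parquet.pushdown_filters")
    }

    fn reorder_filters(&self) -> bool {
        self.config
            .get_bool("datafusion.execution.parquet.reorder_filters")
    }
}

fn main() {
    let mut config = SketchConfig::default();
    config.set_bool("datafusion.execution.parquet.pushdown_filters", true);

    let scan = SketchScanConfig {
        config: Arc::new(config),
    };
    println!("pushdown_filters = {}", scan.pushdown_filters());
    println!("reorder_filters  = {}", scan.reorder_filters());
}
```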
   
   # Are there any user-facing changes?
   YES: if you used `ParquetScanOptions` (which I know @thinkharderdev does), the API has changed.
   

