This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git
The following commit(s) were added to refs/heads/main by this push:
new 482f32caa9 Enable parquet page level skipping (page index pruning) by default (#5099)
482f32caa9 is described below
commit 482f32caa9a05b172e723df41a9a1c50e8447b00
Author: Andrew Lamb <[email protected]>
AuthorDate: Fri May 12 10:40:25 2023 -0400
Enable parquet page level skipping (page index pruning) by default (#5099)
* Enable parquet page level skipping (page index pruning) by default
* update
---
datafusion/common/src/config.rs | 2 +-
datafusion/core/tests/sqllogictests/test_files/information_schema.slt | 2 +-
docs/source/user-guide/configs.md | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/datafusion/common/src/config.rs b/datafusion/common/src/config.rs
index 9193e99e3d..0d86131db1 100644
--- a/datafusion/common/src/config.rs
+++ b/datafusion/common/src/config.rs
@@ -243,7 +243,7 @@ config_namespace! {
pub struct ParquetOptions {
/// If true, uses parquet data page level metadata (Page Index) statistics
/// to reduce the number of rows decoded.
- pub enable_page_index: bool, default = false
+ pub enable_page_index: bool, default = true
/// If true, the parquet reader attempts to skip entire row groups based
/// on the predicate in the query and the metadata (min/max values) stored in
diff --git a/datafusion/core/tests/sqllogictests/test_files/information_schema.slt b/datafusion/core/tests/sqllogictests/test_files/information_schema.slt
index 2f51a56a1b..38f1d2cd05 100644
--- a/datafusion/core/tests/sqllogictests/test_files/information_schema.slt
+++ b/datafusion/core/tests/sqllogictests/test_files/information_schema.slt
@@ -146,7 +146,7 @@ datafusion.execution.aggregate.scalar_update_factor 10
datafusion.execution.batch_size 8192
datafusion.execution.coalesce_batches true
datafusion.execution.collect_statistics false
-datafusion.execution.parquet.enable_page_index false
+datafusion.execution.parquet.enable_page_index true
datafusion.execution.parquet.metadata_size_hint NULL
datafusion.execution.parquet.pruning true
datafusion.execution.parquet.pushdown_filters false
diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md
index 77196299ed..d64f327e06 100644
--- a/docs/source/user-guide/configs.md
+++ b/docs/source/user-guide/configs.md
@@ -49,7 +49,7 @@ Environment variables are read during `SessionConfig` initialisation so they mus
| datafusion.execution.collect_statistics | false | Should DataFusion collect statistics after listing files [...]
| datafusion.execution.target_partitions | 0 | Number of partitions for query execution. Increasing partitions can increase concurrency. Defaults to the number of CPU cores on the system [...]
| datafusion.execution.time_zone | +00:00 | The default time zone Some functions, e.g. `EXTRACT(HOUR from SOME_TIME)`, shift the underlying datetime according to this time zone, and then extract the hour [...]
-| datafusion.execution.parquet.enable_page_index | false | If true, uses parquet data page level metadata (Page Index) statistics to reduce the number of rows decoded. [...]
+| datafusion.execution.parquet.enable_page_index | true | If true, uses parquet data page level metadata (Page Index) statistics to reduce the number of rows decoded. [...]
| datafusion.execution.parquet.pruning | true | If true, the parquet reader attempts to skip entire row groups based on the predicate in the query and the metadata (min/max values) stored in the parquet file [...]
| datafusion.execution.parquet.skip_metadata | true | If true, the parquet reader skip the optional embedded metadata that may be in the file Schema. This setting can help avoid schema conflicts when querying multiple parquet files with schemas containing compatible types but different metadata [...]
| datafusion.execution.parquet.metadata_size_hint | NULL | If specified, the parquet reader will try and fetch the last `size_hint` bytes of the parquet file optimistically. If not specified, two reads are required: One read to fetch the 8-byte parquet footer and another to fetch the metadata length encoded in the footer [...]
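
Illustrative note (not part of the commit): the option flipped above, `datafusion.execution.parquet.enable_page_index`, is described as using page level min/max statistics "to reduce the number of rows decoded". The standalone Rust sketch below is a toy model of that idea only. It does not use DataFusion or the parquet crate, and `PageStats`, `pages_to_decode`, `rows_decoded`, and `example_pages` are hypothetical names invented for this example. It assumes each page records the minimum and maximum of one integer column and that the query filters on a closed range `lo <= value <= hi`.

```rust
/// Toy stand-in for one Page Index entry: min/max of a single
/// integer column plus the number of rows stored in the page.
/// (Hypothetical type, not the parquet crate's representation.)
#[derive(Debug, Clone, PartialEq)]
struct PageStats {
    min: i64,
    max: i64,
    row_count: usize,
}

/// Returns the indices of pages whose [min, max] range overlaps the
/// predicate range [lo, hi]. Pages that cannot contain a matching
/// row are skipped without being decoded.
fn pages_to_decode(pages: &[PageStats], lo: i64, hi: i64) -> Vec<usize> {
    pages
        .iter()
        .enumerate()
        .filter(|(_, p)| p.max >= lo && p.min <= hi)
        .map(|(i, _)| i)
        .collect()
}

/// Total number of rows in the selected pages.
fn rows_decoded(pages: &[PageStats], selected: &[usize]) -> usize {
    selected.iter().map(|&i| pages[i].row_count).sum()
}

/// Made-up sample data: three pages with disjoint value ranges.
fn example_pages() -> Vec<PageStats> {
    vec![
        PageStats { min: 0, max: 99, row_count: 1000 },
        PageStats { min: 100, max: 199, row_count: 1000 },
        PageStats { min: 200, max: 299, row_count: 500 },
    ]
}

fn main() {
    let pages = example_pages();
    // Predicate: 150 <= value <= 180, which only the second page can satisfy.
    let selected = pages_to_decode(&pages, 150, 180);
    let total: usize = pages.iter().map(|p| p.row_count).sum();
    println!(
        "decode pages {:?}: {} of {} rows",
        selected,
        rows_decoded(&pages, &selected),
        total
    );
}
```

The sketch keeps a page whenever its range overlaps the predicate range, so it can skip pages but never drops one that might hold a match. Rows inside a kept page still need the filter applied after decoding.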