alamb commented on issue #8846: URL: https://github.com/apache/arrow-rs/issues/8846#issuecomment-4769814523
Context from @hhhizzz in https://github.com/apache/arrow-rs/pull/9956#issuecomment-4769646397

I think the best model is **multi-level adaptivity**.

DataFusion has more **high-level context**, such as query semantics, file / row-group statistics, projected columns across the whole scan, and cross-file predicate selectivity. That makes it a good place to decide whether a `RowFilter` should be used at all, or to override the default behavior when it has stronger information.

However, the part this PR is trying to handle is **lower-level**. The actual shape of the final `RowSelection` is only known after the Parquet reader has decoded the predicate columns for a specific row group and evaluated the `RowFilter`. For example, whether the result is clustered, fragmented, mostly selected, or sparse is not available to DataFusion at logical / physical planning time without duplicating Parquet reader internals.

So I see the responsibilities as:

- **DataFusion** decides the higher-level scan / filtering strategy.
- **arrow-rs** decides how to execute a row filter once it sees the actual row-group-level selection shape and loaded page ranges.

The explicit policy APIs are important because they still allow DataFusion to override the automatic choice. But I think the default `Auto` behavior belongs in arrow-rs, because the information it uses is produced inside the Parquet reader during execution.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
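To make the split of responsibilities in the comment above concrete, here is a minimal, self-contained sketch. It does not use the real arrow-rs API: `Run`, `Policy`, `Strategy`, `choose_strategy`, and both thresholds are hypothetical names and values invented for illustration. The idea it shows is only that a caller such as DataFusion can force a strategy, while `Auto` looks at the selection shape that exists only after the predicate has been evaluated.

```rust
/// Hypothetical stand-in for one run of a row selection: `row_count`
/// consecutive rows that are either all skipped or all selected.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Run {
    row_count: usize,
    skip: bool,
}

/// Hypothetical ways a reader could apply a selection to the remaining columns.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strategy {
    /// Decode every row, then drop unselected rows with a mask.
    DecodeThenMask,
    /// Skip over unselected runs while decoding.
    SkipRuns,
}

/// Hypothetical policy handed down by the caller (the query-engine level).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Policy {
    /// Let the reader decide from the observed selection shape.
    Auto,
    /// The caller has stronger information and overrides the reader.
    Force(Strategy),
}

/// Pick a strategy for one row group. The thresholds are illustrative
/// assumptions, not values taken from arrow-rs.
fn choose_strategy(policy: Policy, runs: &[Run]) -> Strategy {
    // An explicit override from the caller always wins.
    if let Policy::Force(strategy) = policy {
        return strategy;
    }

    let total: usize = runs.iter().map(|r| r.row_count).sum();
    if total == 0 {
        return Strategy::SkipRuns;
    }
    let selected: usize = runs
        .iter()
        .filter(|r| !r.skip)
        .map(|r| r.row_count)
        .sum();

    let selected_fraction = selected as f64 / total as f64;
    let avg_run_len = total as f64 / runs.len() as f64;

    // Mostly selected or heavily fragmented: skipping buys little, so mask.
    // Sparse and clustered: skipping whole runs avoids decode work.
    if selected_fraction >= 0.8 || avg_run_len < 8.0 {
        Strategy::DecodeThenMask
    } else {
        Strategy::SkipRuns
    }
}
```

Under these assumed thresholds, a sparse, clustered selection (a few long skipped runs around a short selected run) resolves to `SkipRuns` under `Policy::Auto`, while a selection that alternates every row resolves to `DecodeThenMask`. Passing `Policy::Force(..)` bypasses the shape inspection entirely, which corresponds to the higher-level override described above.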
