alamb commented on code in PR #21190:
URL: https://github.com/apache/datafusion/pull/21190#discussion_r3015363551
##########
datafusion/datasource-parquet/src/opener.rs:
##########
@@ -125,15 +133,338 @@ pub(super) struct ParquetOpener {
pub reverse_row_groups: bool,
}
+/// States for [`ParquetOpenFuture`]
+///
+/// These states correspond to the steps required to read and apply various
+/// filter operations.
+///
+/// States whose names begin with `Load` represent waiting on IO to resolve
+///
+/// ```text
+/// Start
+/// |
+/// v
+/// [LoadEncryption]?
+/// |
+/// v
+/// PruneFile
+/// |
+/// v
+/// LoadMetadata
+/// |
+/// v
+/// PrepareFilters
+/// |
+/// v
+/// LoadPageIndex
+/// |
+/// v
+/// PruneWithStatistics
+/// |
+/// v
+/// PruneWithBloomFilters
+/// |
+/// v
+/// BuildStream
+/// |
+/// v
+/// Done
+/// ```
+///
+/// Note: `LoadEncryption` is only present when the `parquet_encryption` feature is
+/// enabled. All other states are always visited in the order shown above,
+/// though any async state may return `Poll::Pending` and then resume later.
+enum ParquetOpenState {
+ Start {
+ prepared: Box<PreparedParquetOpen>,
+ #[cfg(feature = "parquet_encryption")]
+ encryption_context: Arc<EncryptionContext>,
+ },
+ /// Loading encryption footers
+ #[cfg(feature = "parquet_encryption")]
+ LoadEncryption(BoxFuture<'static, Result<Box<PreparedParquetOpen>>>),
+ /// Try to prune file using only file-level statistics and partition
+ /// values before loading any parquet metadata
+ PruneFile(Box<PreparedParquetOpen>),
+ /// Loading Parquet metadata (in footer)
+ LoadMetadata(BoxFuture<'static, Result<MetadataLoadedParquetOpen>>),
+ /// Specialize any filters for the actual file schema (only known after
+ /// metadata is loaded)
+ PrepareFilters(Box<MetadataLoadedParquetOpen>),
+ /// Loading [Parquet Page Index](https://parquet.apache.org/docs/file-format/pageindex/)
+ LoadPageIndex(BoxFuture<'static, Result<FiltersPreparedParquetOpen>>),
+ /// Pruning Row Groups
+ PruneWithStatistics(Box<FiltersPreparedParquetOpen>),
+ /// Pruning with Bloom Filters
+ ///
+ /// TODO: split state as this currently does both I/O and CPU work
Review Comment:
Follow on:
- https://github.com/apache/datafusion/pull/21285
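
For readers following the state diagram quoted above, here is a minimal sketch of the general pattern: a hand-written `Future` driven by an enum, where `Load*` states hold a boxed future and may return `Poll::Pending`, while CPU-only states advance immediately. It uses only the standard library. The names `OpenState`, `OpenFuture`, `FakeIo`, and `run_to_completion` are hypothetical and do not come from the PR; the real `ParquetOpenState` has more states and carries different payloads.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for an I/O future: pending on the first poll,
/// ready with `value` on the second.
struct FakeIo {
    polled: bool,
    value: u32,
}

impl Future for FakeIo {
    type Output = u32;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        if self.polled {
            Poll::Ready(self.value)
        } else {
            self.polled = true;
            cx.waker().wake_by_ref(); // ask to be polled again
            Poll::Pending
        }
    }
}

/// Simplified states (assumption: far fewer than the real enum).
enum OpenState {
    Start(u32),
    /// A `Load*` state: waiting on I/O to resolve.
    LoadMetadata(Pin<Box<dyn Future<Output = u32>>>),
    /// A CPU-only state: never returns `Poll::Pending`.
    Prune(u32),
    Done,
}

struct OpenFuture {
    state: OpenState,
}

impl Future for OpenFuture {
    type Output = u32;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        loop {
            // Take ownership of the current state, leaving `Done` behind.
            let state = std::mem::replace(&mut self.state, OpenState::Done);
            match state {
                OpenState::Start(v) => {
                    let io = FakeIo { polled: false, value: v };
                    self.state = OpenState::LoadMetadata(Box::pin(io));
                }
                OpenState::LoadMetadata(mut fut) => {
                    let polled = fut.as_mut().poll(cx);
                    match polled {
                        Poll::Ready(v) => self.state = OpenState::Prune(v),
                        Poll::Pending => {
                            // Put the future back and resume here next poll.
                            self.state = OpenState::LoadMetadata(fut);
                            return Poll::Pending;
                        }
                    }
                }
                // Final CPU step; the state is already `Done`.
                OpenState::Prune(v) => return Poll::Ready(v + 1),
                OpenState::Done => panic!("polled after completion"),
            }
        }
    }
}

/// Waker that does nothing; enough for a busy-polling test driver.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls `fut` until it is ready; returns the output and the poll count.
fn run_to_completion(mut fut: OpenFuture) -> (u32, usize) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(v) = Pin::new(&mut fut).poll(&mut cx) {
            return (v, polls);
        }
    }
}
```

Driving `OpenFuture { state: OpenState::Start(41) }` through `run_to_completion` takes two polls in this sketch: the first stops at `LoadMetadata` with `Poll::Pending`, and the second resumes there and runs the remaining CPU-only state to completion. A state that mixes I/O and CPU work, as the TODO on `PruneWithBloomFilters` describes, would not fit cleanly into either kind of arm, which is presumably the motivation for splitting it.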