Re: [PR] Improve `ParquetExec` and related documentation [datafusion]
alamb merged PR #10647: URL: https://github.com/apache/datafusion/pull/10647

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve `ParquetExec` and related documentation [datafusion]
comphead commented on code in PR #10647: URL: https://github.com/apache/datafusion/pull/10647#discussion_r1614830679

## datafusion/core/src/datasource/physical_plan/parquet/schema_adapter.rs:
@@ -20,35 +20,38 @@
 use arrow_schema::{Schema, SchemaRef};
 use std::fmt::Debug;
 use std::sync::Arc;

-/// Factory of schema adapters.
+/// Factory for creating [`SchemaAdapter`]
 ///
-/// Provides means to implement custom schema adaptation.
+/// This interface provides a way to implement custom schema adaptation logic
+/// for ParquetExec (for example, to fill missing columns with default value
+/// other than null)
 pub trait SchemaAdapterFactory: Debug + Send + Sync + 'static {
     /// Provides `SchemaAdapter` for the ParquetExec.
     fn create(&self, schema: SchemaRef) -> Box<dyn SchemaAdapter>;
 }

-/// A utility which can adapt file-level record batches to a table schema which may have a schema
+/// Adapt file-level [`RecordBatch`]es to a table schema, which may have a schema
 /// obtained from merging multiple file-level schemas.
 ///
 /// This is useful for enabling schema evolution in partitioned datasets.
 ///
 /// This has to be done in two stages.
 ///
-/// 1. Before reading the file, we have to map projected column indexes from the table schema to
-///    the file schema.
+/// 1. Before reading the file, we have to map projected column indexes from the
+///    table schema to the file schema.
 ///
-/// 2. After reading a record batch we need to map the read columns back to the expected columns
-///    indexes and insert null-valued columns wherever the file schema was missing a colum present
-///    in the table schema.
+/// 2. After reading a record batch map the read columns back to the expected
+///    columns indexes and insert null-valued columns wherever the file schema was
+///    missing a colum present in the table schema.

Review Comment:
```suggestion
///    missing a column present in the table schema.
```
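The two-stage adaptation described in the quoted doc comment can be sketched with toy types: plain `Vec`s stand in for arrow's `Schema` and `RecordBatch`, and the function names (`map_projection`, `adapt_batch`) are illustrative, not DataFusion's actual API. A custom `SchemaAdapter` could fill missing columns with a value other than null, as the doc comment notes:

```rust
/// Stage 1: for each projected table column, find its index in the file
/// schema (`None` if the file does not contain the column).
fn map_projection(table_cols: &[&str], file_cols: &[&str]) -> Vec<Option<usize>> {
    table_cols
        .iter()
        .map(|t| file_cols.iter().position(|f| f == t))
        .collect()
}

/// Stage 2: rearrange the columns read from the file into table-schema
/// order, filling columns missing from the file with a default value
/// (a custom SchemaAdapter could use something other than null here).
fn adapt_batch(
    mapping: &[Option<usize>],
    file_batch: &[Vec<i64>],
    default: i64,
    num_rows: usize,
) -> Vec<Vec<i64>> {
    mapping
        .iter()
        .map(|m| match m {
            Some(i) => file_batch[*i].clone(),
            None => vec![default; num_rows],
        })
        .collect()
}

fn main() {
    // Table schema has (a, b, c); this particular file only has (c, a).
    let mapping = map_projection(&["a", "b", "c"], &["c", "a"]);
    assert_eq!(mapping, vec![Some(1), None, Some(0)]);

    // File columns in file order: c = [7, 8], a = [1, 2].
    let file_batch = vec![vec![7, 8], vec![1, 2]];
    let adapted = adapt_batch(&mapping, &file_batch, 0, 2);
    // Table order (a, b, c); the missing b is filled with the default.
    assert_eq!(adapted, vec![vec![1, 2], vec![0, 0], vec![7, 8]]);
}
```

The real implementation works on arrow arrays and schemas, but the index-mapping-then-fill shape is the same.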
Re: [PR] Improve `ParquetExec` and related documentation [datafusion]
alamb commented on code in PR #10647: URL: https://github.com/apache/datafusion/pull/10647#discussion_r1614666750

## datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
@@ -642,11 +714,22 @@ fn should_enable_page_index(
         .unwrap_or(false)
 }

-/// Factory of parquet file readers.
+/// Interface for creating [`AsyncFileReader`]s to read parquet files.
+///
+/// This interface is used by [`ParquetOpener`] in order to create readers for
+/// parquet files. Implementations of this trait can be used to provide custom

Review Comment: Excellent idea. I did so

## datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
@@ -75,7 +75,79 @@ pub use metrics::ParquetFileMetrics;
 pub use schema_adapter::{SchemaAdapter, SchemaAdapterFactory, SchemaMapper};
 pub use statistics::{RequestedStatistics, StatisticsConverter};

-/// Execution plan for scanning one or more Parquet partitions
+/// Execution plan for reading one or more Parquet files.
+///
+/// ```text
+///             ▲
+///             │
+///             │  Produce a stream of
+///             │  RecordBatches
+///             │
+/// ┌─────────────────┐
+/// │                 │
+/// │   ParquetExec   │
+/// │                 │
+/// └─────────────────┘
+///             ▲
+///             │  Asynchronously read from one
+///             │  or more parquet files via
+///             │  ObjectStore interface
+///             │
+///             │
+///           .───.
+///          │     )
+///          │`───'│
+///          │ObjectStore │
+///          │.───.│
+///          │     )
+///           `───'
+///
+/// ```
+/// # Features
+///
+/// Supports the following optimizations:
+///
+/// * Multi-threaded (aka multi-partition): read from one or more files in
+/// parallel. Can read concurrently from multiple row groups from a single file.
+///
+/// * Predicate push down: skips row groups and pages based on
+/// min/max/null_counts in the row group metadata, the page index and bloom
+/// filters.
+///
+/// * Projection pushdown: reads and decodes only the columns required.
+///
+/// * Limit pushdown: stop execution early after some number of rows are read.
+///
+/// * Custom readers: controls I/O for accessing pages. See

Review Comment: good call -- updated
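The "predicate push down" bullet in the quoted doc can be illustrated with a toy sketch: skip a row group whenever its min/max statistics prove the predicate can never match. The `RowGroupStats` type and `may_contain` function are stand-ins for what DataFusion actually does against parquet row-group metadata, not its real API:

```rust
/// Toy stand-in for a row group's column statistics.
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Can a predicate `col = value` possibly match rows in this row group?
/// If not, the whole row group can be skipped without decoding it.
fn may_contain(stats: &RowGroupStats, value: i64) -> bool {
    stats.min <= value && value <= stats.max
}

fn main() {
    let groups = [
        RowGroupStats { min: 0, max: 99 },
        RowGroupStats { min: 100, max: 199 },
    ];
    // A scan for `col = 150` needs to decode only the second row group.
    let to_read: Vec<usize> = groups
        .iter()
        .enumerate()
        .filter(|(_, g)| may_contain(g, 150))
        .map(|(i, _)| i)
        .collect();
    assert_eq!(to_read, vec![1]);
}
```

The same min/max idea extends to page-level pruning via the page index; bloom filters additionally allow skipping groups whose range happens to cover a value that is not actually present.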
Re: [PR] Improve `ParquetExec` and related documentation [datafusion]
crepererum commented on code in PR #10647: URL: https://github.com/apache/datafusion/pull/10647#discussion_r1613322378

## datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
@@ -75,7 +75,79 @@ pub use metrics::ParquetFileMetrics;
 pub use schema_adapter::{SchemaAdapter, SchemaAdapterFactory, SchemaMapper};
 pub use statistics::{RequestedStatistics, StatisticsConverter};

-/// Execution plan for scanning one or more Parquet partitions
+/// Execution plan for reading one or more Parquet files.
+///
+/// ```text
+///             ▲
+///             │
+///             │  Produce a stream of
+///             │  RecordBatches
+///             │
+/// ┌─────────────────┐
+/// │                 │
+/// │   ParquetExec   │
+/// │                 │
+/// └─────────────────┘
+///             ▲
+///             │  Asynchronously read from one
+///             │  or more parquet files via
+///             │  ObjectStore interface
+///             │
+///             │
+///           .───.
+///          │     )
+///          │`───'│
+///          │ObjectStore │
+///          │.───.│
+///          │     )
+///           `───'
+///
+/// ```
+/// # Features
+///
+/// Supports the following optimizations:
+///
+/// * Multi-threaded (aka multi-partition): read from one or more files in
+/// parallel. Can read concurrently from multiple row groups from a single file.

Review Comment: I would call this "concurrency" instead of "multi-threading". IIRC we don't implement ANY threading in this operator and solely rely on tokio to dispatch concurrent bits for us. I think it's fine to mention that the concurrency in this operator CAN lead to multi-core usage under specific circumstances.

## datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
@@ -75,7 +75,79 @@ pub use metrics::ParquetFileMetrics;
 pub use schema_adapter::{SchemaAdapter, SchemaAdapterFactory, SchemaMapper};
 pub use statistics::{RequestedStatistics, StatisticsConverter};

-/// Execution plan for scanning one or more Parquet partitions
+/// Execution plan for reading one or more Parquet files.
+///
+/// ```text
+///             ▲
+///             │
+///             │  Produce a stream of
+///             │  RecordBatches
+///             │
+/// ┌─────────────────┐
+/// │                 │
+/// │   ParquetExec   │
+/// │                 │
+/// └─────────────────┘
+///             ▲
+///             │  Asynchronously read from one
+///             │  or more parquet files via
+///             │  ObjectStore interface
+///             │
+///             │
+///           .───.
+///          │     )
+///          │`───'│
+///          │ObjectStore │
+///          │.───.│
+///          │     )
+///           `───'
+///
+/// ```
+/// # Features
+///
+/// Supports the following optimizations:
+///
+/// * Multi-threaded (aka multi-partition): read from one or more files in
+/// parallel. Can read concurrently from multiple row groups from a single file.
+///
+/// * Predicate push down: skips row groups and pages based on
+/// min/max/null_counts in the row group metadata, the page index and bloom
+/// filters.
+///
+/// * Projection pushdown: reads and decodes only the columns required.
+///
+/// * Limit pushdown: stop execution early after some number of rows are read.
+///
+/// * Custom readers: controls I/O for accessing pages. See

Review Comment:
```suggestion
/// * Custom readers: implements I/O for accessing byte ranges and the metadata object. See
```
It's not steering the IO process, it's actually responsible for performing (or not performing) it. For example, a custom impl. could totally NOT use an object store (which is esp. interesting for the metadata bit, see other comment below).

## datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
@@ -75,7 +75,79 @@ pub use metrics::ParquetFileMetrics;
 pub use schema_adapter::{SchemaAdapter, SchemaAdapterFactory, SchemaMapper};
 pub use statistics::{RequestedStatistics, StatisticsConverter};

-/// Execution plan for scanning one or more Parquet partitions
+/// Execution plan for reading one or more Parquet files.
+///
+/// ```text
+///             ▲
+///             │
+///             │  Produce a stream of
+///             │  RecordBatches
+///             │
+/// ┌─────────────────┐
+/// │                 │
+/// │   ParquetExec   │
+/// │                 │
+/// └─────────────────┘
+///             ▲
+///             │  Asynchronously read from one
+///             │  or more parquet files via
+///             │  ObjectStore interface
+///             │
+///             │
+///           .───.
+///          │     )
+///          │`───'│
+///          │ObjectStore │
+///          │.───.│
+//
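crepererum's point above is that a custom reader does not merely steer I/O, it performs it, so it can, for example, answer metadata requests from an in-memory cache without touching the object store at all. The sketch below is a simplified, synchronous stand-in for the async reader interface; the names (`FileReader`, `CachedReader`) are hypothetical, not DataFusion's or parquet's real types:

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Simplified stand-in for an async parquet file reader: the impl is
/// responsible for fetching byte ranges and the (footer) metadata.
trait FileReader {
    fn get_bytes(&self, range: Range<usize>) -> Vec<u8>;
    /// Stand-in for the decoded parquet metadata object.
    fn get_metadata(&self) -> String;
}

/// A reader that answers byte-range requests from an in-memory buffer and
/// metadata requests from a cache, never consulting remote storage.
struct CachedReader {
    data: Vec<u8>,
    metadata_cache: HashMap<String, String>,
    path: String,
}

impl FileReader for CachedReader {
    fn get_bytes(&self, range: Range<usize>) -> Vec<u8> {
        self.data[range].to_vec()
    }

    fn get_metadata(&self) -> String {
        // Cache hit: no object store request needed for the footer.
        self.metadata_cache
            .get(&self.path)
            .cloned()
            .unwrap_or_else(|| "decoded-from-footer".to_string())
    }
}

fn main() {
    let mut cache = HashMap::new();
    cache.insert("f.parquet".to_string(), "cached-metadata".to_string());
    let reader = CachedReader {
        data: vec![1, 2, 3, 4],
        metadata_cache: cache,
        path: "f.parquet".to_string(),
    };
    assert_eq!(reader.get_bytes(1..3), vec![2, 3]);
    assert_eq!(reader.get_metadata(), "cached-metadata");
}
```

Because the trait owns the I/O, an implementation is free to skip the object store entirely, which is especially valuable for the metadata fetch on repeated scans of the same file.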
Re: [PR] Improve `ParquetExec` and related documentation [datafusion]
alamb commented on PR #10647: URL: https://github.com/apache/datafusion/pull/10647#issuecomment-2129087302

@thinkharderdev, @tustvold, @Ted-Jiang and @crepererum: if you have time, could you double check that this correctly describes `ParquetExec` to your understanding?