Re: [PR] Improve `ParquetExec` and related documentation [datafusion]

2024-05-27 Thread via GitHub


alamb merged PR #10647:
URL: https://github.com/apache/datafusion/pull/10647



Re: [PR] Improve `ParquetExec` and related documentation [datafusion]

2024-05-25 Thread via GitHub


comphead commented on code in PR #10647:
URL: https://github.com/apache/datafusion/pull/10647#discussion_r1614830679


##
datafusion/core/src/datasource/physical_plan/parquet/schema_adapter.rs:
##
@@ -20,35 +20,38 @@ use arrow_schema::{Schema, SchemaRef};
 use std::fmt::Debug;
 use std::sync::Arc;
 
-/// Factory of schema adapters.
+/// Factory for creating [`SchemaAdapter`]
 ///
-/// Provides means to implement custom schema adaptation.
+/// This interface provides a way to implement custom schema adaptation logic
+/// for ParquetExec (for example, to fill missing columns with default value
+/// other than null)
 pub trait SchemaAdapterFactory: Debug + Send + Sync + 'static {
 /// Provides `SchemaAdapter` for the ParquetExec.
 fn create(&self, schema: SchemaRef) -> Box<dyn SchemaAdapter>;
 }
 
-/// A utility which can adapt file-level record batches to a table schema which may have a schema
+/// Adapt file-level [`RecordBatch`]es to a table schema, which may have a schema
 /// obtained from merging multiple file-level schemas.
 ///
 /// This is useful for enabling schema evolution in partitioned datasets.
 ///
 /// This has to be done in two stages.
 ///
-/// 1. Before reading the file, we have to map projected column indexes from the table schema to
-///    the file schema.
+/// 1. Before reading the file, we have to map projected column indexes from the
+///    table schema to the file schema.
 ///
-/// 2. After reading a record batch we need to map the read columns back to the expected columns
-///    indexes and insert null-valued columns wherever the file schema was missing a colum present
-///    in the table schema.
+/// 2. After reading a record batch map the read columns back to the expected
+///    columns indexes and insert null-valued columns wherever the file schema was
+///    missing a colum present in the table schema.

Review Comment:
   ```suggestion
   ///    missing a column present in the table schema.
   ```
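
For readers following along, a minimal sketch of how these traits compose, assuming the trait shapes in this module at the time of the PR (the `SchemaAdapter` and `SchemaMapper` method signatures are paraphrased and may differ in other versions; the default-filling adapter itself is hypothetical):

```rust
use std::fmt::Debug;
use std::sync::Arc;

use arrow_schema::{Schema, SchemaRef};
// Re-exported from the parquet module, per the `pub use` quoted later in this thread.
use datafusion::datasource::physical_plan::parquet::{
    SchemaAdapter, SchemaAdapterFactory, SchemaMapper,
};
use datafusion::error::Result;

/// Hypothetical factory that wires default-filling adaptation into ParquetExec.
#[derive(Debug)]
struct FillDefaultsAdapterFactory;

impl SchemaAdapterFactory for FillDefaultsAdapterFactory {
    fn create(&self, schema: SchemaRef) -> Box<dyn SchemaAdapter> {
        Box::new(FillDefaultsAdapter { table_schema: schema })
    }
}

struct FillDefaultsAdapter {
    table_schema: SchemaRef,
}

impl SchemaAdapter for FillDefaultsAdapter {
    // Stage 1: map a projected table-schema column index into the file schema.
    fn map_column_index(&self, index: usize, file_schema: &Schema) -> Option<usize> {
        let field = self.table_schema.field(index);
        file_schema.index_of(field.name()).ok()
    }

    // Stage 2 happens in the returned `SchemaMapper`, which would reorder the
    // read columns and insert a default-valued (rather than null-valued)
    // column wherever the file schema is missing one from the table schema.
    fn map_schema(&self, _file_schema: &Schema) -> Result<(Arc<dyn SchemaMapper>, Vec<usize>)> {
        todo!("build the projection and a default-filling SchemaMapper")
    }
}
```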




Re: [PR] Improve `ParquetExec` and related documentation [datafusion]

2024-05-25 Thread via GitHub


alamb commented on code in PR #10647:
URL: https://github.com/apache/datafusion/pull/10647#discussion_r1614666750


##
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##
@@ -642,11 +714,22 @@ fn should_enable_page_index(
 .unwrap_or(false)
 }
 
-/// Factory of parquet file readers.
+/// Interface for creating [`AsyncFileReader`]s to read parquet files.
+///
+/// This interface is used by [`ParquetOpener`] in order to create readers for
+/// parquet files. Implementations of this trait can be used to provide custom

Review Comment:
   Excellent idea. I did so
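
For concreteness, a sketch of such a factory — the `ParquetFileReaderFactory::create_reader` signature and the import paths are paraphrased from this module at the time and may differ between versions; this version just builds the stock object-store-backed reader:

```rust
use std::fmt::Debug;
use std::sync::Arc;

use datafusion::datasource::physical_plan::parquet::ParquetFileReaderFactory;
use datafusion::datasource::physical_plan::FileMeta;
use datafusion::error::Result;
use datafusion::physical_plan::metrics::ExecutionPlanMetricsSet;
use object_store::ObjectStore;
use parquet::arrow::async_reader::{AsyncFileReader, ParquetObjectReader};

#[derive(Debug)]
struct MyReaderFactory {
    store: Arc<dyn ObjectStore>,
}

impl ParquetFileReaderFactory for MyReaderFactory {
    fn create_reader(
        &self,
        _partition_index: usize,
        file_meta: FileMeta,
        metadata_size_hint: Option<usize>,
        _metrics: &ExecutionPlanMetricsSet,
    ) -> Result<Box<dyn AsyncFileReader + Send>> {
        // The stock reader; a custom factory could return anything that
        // implements `AsyncFileReader` (cached, encrypted, local, ...).
        let mut reader =
            ParquetObjectReader::new(Arc::clone(&self.store), file_meta.object_meta);
        if let Some(hint) = metadata_size_hint {
            // Pre-fetch enough of the footer to avoid a second metadata read.
            reader = reader.with_footer_size_hint(hint);
        }
        Ok(Box::new(reader))
    }
}
```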



##
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##
@@ -75,7 +75,79 @@ pub use metrics::ParquetFileMetrics;
 pub use schema_adapter::{SchemaAdapter, SchemaAdapterFactory, SchemaMapper};
 pub use statistics::{RequestedStatistics, StatisticsConverter};
 
-/// Execution plan for scanning one or more Parquet partitions
+/// Execution plan for reading one or more Parquet files.
+///
+/// ```text
+///             ▲
+///             │
+///             │  Produce a stream of
+///             │  RecordBatches
+///             │
+/// ┌───────────────────────┐
+/// │                       │
+/// │      ParquetExec      │
+/// │                       │
+/// └───────────────────────┘
+///             ▲
+///             │  Asynchronously read from one
+///             │  or more parquet files via
+///             │  ObjectStore interface
+///             │
+///             │
+///   .───────────────────.
+///  │                     )
+///  │`───────────────────'│
+///  │     ObjectStore     │
+///  │.───────────────────.│
+///  │                     )
+///   `───────────────────'
+///
+/// ```
+/// # Features
+///
+/// Supports the following optimizations:
+///
+/// * Multi-threaded (aka multi-partition): read from one or more files in
+/// parallel. Can read concurrently from multiple row groups from a single file.
+///
+/// * Predicate push down: skips row groups and pages based on
+/// min/max/null_counts in the row group metadata, the page index and bloom
+/// filters.
+///
+/// * Projection pushdown: reads and decodes only the columns required.
+///
+/// * Limit pushdown: stop execution early after some number of rows are read.
+///
+/// * Custom readers: controls I/O for accessing pages. See

Review Comment:
   good call -- updated
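
To see these features from the user side, a hedged sketch (the file path and column name here are made up): reading Parquet through `SessionContext` plans a `ParquetExec`, and the projection and limit below are pushed down into that scan:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Planning this read produces a ParquetExec under the hood.
    let df = ctx
        .read_parquet("data/example.parquet", ParquetReadOptions::default())
        .await?;

    // Projection pushdown: only column "a" is read and decoded.
    // Limit pushdown: the scan can stop early once 10 rows are produced.
    df.select_columns(&["a"])?
        .limit(0, Some(10))?
        .show()
        .await?;

    Ok(())
}
```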




Re: [PR] Improve `ParquetExec` and related documentation [datafusion]

2024-05-24 Thread via GitHub


crepererum commented on code in PR #10647:
URL: https://github.com/apache/datafusion/pull/10647#discussion_r1613322378


##
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##
@@ -75,7 +75,79 @@ pub use metrics::ParquetFileMetrics;
 pub use schema_adapter::{SchemaAdapter, SchemaAdapterFactory, SchemaMapper};
 pub use statistics::{RequestedStatistics, StatisticsConverter};
 
-/// Execution plan for scanning one or more Parquet partitions
+/// Execution plan for reading one or more Parquet files.
+///
+/// ```text
+///             ▲
+///             │
+///             │  Produce a stream of
+///             │  RecordBatches
+///             │
+/// ┌───────────────────────┐
+/// │                       │
+/// │      ParquetExec      │
+/// │                       │
+/// └───────────────────────┘
+///             ▲
+///             │  Asynchronously read from one
+///             │  or more parquet files via
+///             │  ObjectStore interface
+///             │
+///             │
+///   .───────────────────.
+///  │                     )
+///  │`───────────────────'│
+///  │     ObjectStore     │
+///  │.───────────────────.│
+///  │                     )
+///   `───────────────────'
+///
+/// ```
+/// # Features
+///
+/// Supports the following optimizations:
+///
+/// * Multi-threaded (aka multi-partition): read from one or more files in
+/// parallel. Can read concurrently from multiple row groups from a single file.

Review Comment:
   I would call this "concurrency" instead of "multi-threading". IIRC we don't implement ANY threading in this operator and solely rely on tokio to dispatch concurrent bits for us. I think it's fine to mention that the concurrency in this operator CAN lead to multi-core usage under specific circumstances.
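
For context, the concurrency described here is driven by standard session settings rather than by threads spawned in the operator; a small sketch (configuration key names as of this era of DataFusion):

```rust
use datafusion::prelude::*;

fn make_context() -> SessionContext {
    // Each target partition becomes an independent stream that tokio polls;
    // ParquetExec itself spawns no threads.
    let config = SessionConfig::new()
        .with_target_partitions(8)
        // Allow a single large file to be split across partitions so that
        // multiple row groups can be read concurrently.
        .set_bool("datafusion.optimizer.repartition_file_scans", true);
    SessionContext::new_with_config(config)
}
```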



##
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##
@@ -75,7 +75,79 @@ pub use metrics::ParquetFileMetrics;
 pub use schema_adapter::{SchemaAdapter, SchemaAdapterFactory, SchemaMapper};
 pub use statistics::{RequestedStatistics, StatisticsConverter};
 
-/// Execution plan for scanning one or more Parquet partitions
+/// Execution plan for reading one or more Parquet files.
+///
+/// ```text
+///             ▲
+///             │
+///             │  Produce a stream of
+///             │  RecordBatches
+///             │
+/// ┌───────────────────────┐
+/// │                       │
+/// │      ParquetExec      │
+/// │                       │
+/// └───────────────────────┘
+///             ▲
+///             │  Asynchronously read from one
+///             │  or more parquet files via
+///             │  ObjectStore interface
+///             │
+///             │
+///   .───────────────────.
+///  │                     )
+///  │`───────────────────'│
+///  │     ObjectStore     │
+///  │.───────────────────.│
+///  │                     )
+///   `───────────────────'
+///
+/// ```
+/// # Features
+///
+/// Supports the following optimizations:
+///
+/// * Multi-threaded (aka multi-partition): read from one or more files in
+/// parallel. Can read concurrently from multiple row groups from a single file.
+///
+/// * Predicate push down: skips row groups and pages based on
+/// min/max/null_counts in the row group metadata, the page index and bloom
+/// filters.
+///
+/// * Projection pushdown: reads and decodes only the columns required.
+///
+/// * Limit pushdown: stop execution early after some number of rows are read.
+///
+/// * Custom readers: controls I/O for accessing pages. See

Review Comment:
   ```suggestion
   /// * Custom readers: implements I/O for accessing byte ranges and the metadata object. See
   ```
   
   It's not steering the IO process, it's actually responsible for performing (or not performing) it. For example, a custom impl. could totally NOT use an object store (which is esp. interesting for the metadata bit, see other comment below).
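
To make the metadata point concrete, a minimal sketch of a custom reader that answers the metadata request from a cache instead of performing I/O, assuming the `AsyncFileReader` trait shape from the `parquet` crate of this era (such a reader would be handed to `ParquetExec` through its reader factory):

```rust
use std::ops::Range;
use std::sync::Arc;

use bytes::Bytes;
use futures::future::BoxFuture;
use parquet::arrow::async_reader::AsyncFileReader;
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

/// Delegates byte-range I/O to an inner reader but serves the parquet footer
/// metadata from a pre-fetched copy, so no object store round trip is needed.
struct CachedMetadataReader<R: AsyncFileReader> {
    inner: R,
    metadata: Arc<ParquetMetaData>,
}

impl<R: AsyncFileReader> AsyncFileReader for CachedMetadataReader<R> {
    fn get_bytes(&mut self, range: Range<usize>) -> BoxFuture<'_, Result<Bytes>> {
        // Page reads still go to the underlying reader (object store, disk, ...).
        self.inner.get_bytes(range)
    }

    fn get_metadata(&mut self) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>> {
        // No I/O at all: return the cached footer metadata.
        let metadata = Arc::clone(&self.metadata);
        Box::pin(async move { Ok(metadata) })
    }
}
```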




Re: [PR] Improve `ParquetExec` and related documentation [datafusion]

2024-05-24 Thread via GitHub


alamb commented on PR #10647:
URL: https://github.com/apache/datafusion/pull/10647#issuecomment-2129087302

   @thinkharderdev, @tustvold, @Ted-Jiang and @crepererum: if you have time, could you double check that this correctly describes `ParquetExec` to your understanding?

