This is an automated email from the ASF dual-hosted git repository.
mbutrovich pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/main by this push:
new b8d8fbe04 docs: Update Parquet scan documentation (#3433)
b8d8fbe04 is described below
commit b8d8fbe047adb34c574a7e8a17f28356cb7f9db8
Author: Andy Grove <[email protected]>
AuthorDate: Wed Feb 18 08:01:03 2026 -0700
docs: Update Parquet scan documentation (#3433)
* docs: remove all mentions of native_comet scan
* update
* prettier
* docs: improve parquet_scans.md accuracy and completeness
Fix grammar, add encryption fallback and native_iceberg_compat
hard-coded config limitations, clarify S3 section applies to both
scan implementations, and remove orphaned link references.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
* update config docs
* prettier
* docs: clarify parquet scan limitations and fallback behavior
Clarify which limitations fall back to Spark vs which may produce
incorrect results. Add missing documented limitations for
native_datafusion (DPP, input_file_name, metadata columns). Fix
misleading wording for ignoreCorruptFiles/ignoreMissingFiles. Note
that auto mode currently always selects native_iceberg_compat.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
* docs: remove redundant fallback language in native_datafusion section
The section intro already states all limitations fall back to Spark,
so individual bullet points don't need to repeat it.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
* docs: separate fallback limitations from incorrect-results limitations
Restructure shared and per-scan limitation lists into two clear
categories: features that fall back to Spark (safe) and issues that
may produce incorrect results without falling back. Remove redundant
"Comet falls back to Spark" from individual bullets where the section
intro already states it.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
* fix
* update
* remove encryption from unsupported list, move DPP to common list
* Update docs/source/contributor-guide/parquet_scans.md
Co-authored-by: Oleks V <[email protected]>
* address feedback
* address feedback
---------
Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Matt Butrovich <[email protected]>
Co-authored-by: Oleks V <[email protected]>
---
.../main/scala/org/apache/comet/CometConf.scala | 14 ++-
docs/source/contributor-guide/ffi.md | 7 +-
docs/source/contributor-guide/parquet_scans.md | 120 ++++++++++-----------
docs/source/contributor-guide/roadmap.md | 14 ---
4 files changed, 64 insertions(+), 91 deletions(-)
diff --git a/common/src/main/scala/org/apache/comet/CometConf.scala b/common/src/main/scala/org/apache/comet/CometConf.scala
index 49eb55479..2439f3b58 100644
--- a/common/src/main/scala/org/apache/comet/CometConf.scala
+++ b/common/src/main/scala/org/apache/comet/CometConf.scala
@@ -125,16 +125,14 @@ object CometConf extends ShimCometConf {
val SCAN_AUTO = "auto"
val COMET_NATIVE_SCAN_IMPL: ConfigEntry[String] =
conf("spark.comet.scan.impl")
- .category(CATEGORY_SCAN)
+ .category(CATEGORY_PARQUET)
.doc(
- "The implementation of Comet Native Scan to use. Available modes are " +
+ "The implementation of Comet's Parquet scan to use. Available scans are " +
s"`$SCAN_NATIVE_DATAFUSION`, and `$SCAN_NATIVE_ICEBERG_COMPAT`. " +
- s"`$SCAN_NATIVE_DATAFUSION` is a fully native implementation of scan based on " +
- "DataFusion. " +
- s"`$SCAN_NATIVE_ICEBERG_COMPAT` is the recommended native implementation that " +
- "exposes apis to read parquet columns natively and supports complex types. " +
- s"`$SCAN_AUTO` (default) chooses the best scan.")
- .internal()
+ s"`$SCAN_NATIVE_DATAFUSION` is a fully native implementation, and " +
+ s"`$SCAN_NATIVE_ICEBERG_COMPAT` is a hybrid implementation that supports some " +
+ "additional features, such as row indexes and field ids. " +
+ s"`$SCAN_AUTO` (default) chooses the best available scan based on the scan schema.")
.stringConf
.transform(_.toLowerCase(Locale.ROOT))
.checkValues(Set(SCAN_NATIVE_DATAFUSION, SCAN_NATIVE_ICEBERG_COMPAT, SCAN_AUTO))
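As an illustrative sketch only (not part of the diff above): one way the `spark.comet.scan.impl` property documented in this hunk might be set from application code. The application name and Parquet path are placeholders, and the example assumes the Comet plugin is already on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical usage sketch: force the fully native DataFusion-based scan
// instead of the default `auto` mode. Per the checkValues set above, the
// accepted values are native_datafusion, native_iceberg_compat, and auto.
val spark = SparkSession
  .builder()
  .appName("comet-scan-impl-example") // placeholder application name
  .config("spark.comet.scan.impl", "native_datafusion")
  .getOrCreate()

// Parquet reads now go through the selected Comet scan implementation
// (assuming Comet is enabled for this session).
spark.read.parquet("/path/to/data.parquet").show()
```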
diff --git a/docs/source/contributor-guide/ffi.md b/docs/source/contributor-guide/ffi.md
index b1a51ecb2..c40c189e9 100644
--- a/docs/source/contributor-guide/ffi.md
+++ b/docs/source/contributor-guide/ffi.md
@@ -177,9 +177,10 @@ message Scan {
#### When ownership is NOT transferred to native:
-If the data originates from `native_comet` scan (deprecated, will be removed in a future release) or from
-`native_iceberg_compat` in some cases, then ownership is not transferred to native and the JVM may re-use the
-underlying buffers in the future.
+If the data originates from a scan that uses mutable buffers (such as Iceberg scans using the [hybrid Iceberg reader]),
+then ownership is not transferred to native and the JVM may re-use the underlying buffers in the future.
+
+[hybrid Iceberg reader]: https://datafusion.apache.org/comet/user-guide/latest/iceberg.html#hybrid-reader
It is critical that the native code performs a deep copy of the arrays if the arrays are to be buffered by
operators such as `SortExec` or `ShuffleWriterExec`, otherwise data corruption is likely to occur.
diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md
index bbacff4d9..7df939488 100644
--- a/docs/source/contributor-guide/parquet_scans.md
+++ b/docs/source/contributor-guide/parquet_scans.md
@@ -19,71 +19,60 @@ under the License.
# Comet Parquet Scan Implementations
-Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
-`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, and
-Comet will choose the most appropriate implementation based on the Parquet schema and other Comet configuration
-settings. Most users should not need to change this setting. However, it is possible to force Comet to try and use
-a particular implementation for all scan operations by setting this configuration property to one of the following
-implementations.
-
-| Implementation          | Description |
-| ----------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `native_comet`          | **Deprecated.** This implementation provides strong compatibility with Spark but does not support complex types. This is the original scan implementation in Comet and will be removed in a future release. |
-| `native_iceberg_compat` | This implementation delegates to DataFusion's `DataSourceExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future. |
-| `native_datafusion`     | This experimental implementation delegates to DataFusion's `DataSourceExec` for full native execution. There are known compatibility issues when using this scan. |
-
-The `native_datafusion` and `native_iceberg_compat` scans provide the following benefits over the `native_comet`
-implementation:
-
-- Leverages the DataFusion community's ongoing improvements to `DataSourceExec`
-- Provides support for reading complex types (structs, arrays, and maps)
-- Delegates Parquet decoding to native Rust code rather than JVM-side decoding
-- Improves performance
-
-> **Note on mutable buffers:** Both `native_comet` and `native_iceberg_compat` use reusable mutable buffers
-> when transferring data from JVM to native code via Arrow FFI. The `native_iceberg_compat` implementation uses DataFusion's native Parquet reader for data columns, bypassing Comet's mutable buffer infrastructure entirely. However, partition columns still use `ConstantColumnReader`, which relies on Comet's mutable buffers that are reused across batches. This means native operators that buffer data (such as `SortExec` or `ShuffleWriterExec`) must perform deep copies to avoid data corruption.
-> See the [FFI documentation](ffi.md) for details on the `arrow_ffi_safe` flag and ownership semantics.
-
-The `native_datafusion` and `native_iceberg_compat` scans share the following limitations:
-
-- When reading Parquet files written by systems other than Spark that contain columns with the logical type `UINT_8`
- (unsigned 8-bit integers), Comet may produce different results than Spark. Spark maps `UINT_8` to `ShortType`, but
- Comet's Arrow-based readers respect the unsigned type and read the data as unsigned rather than signed. Since Comet
- cannot distinguish `ShortType` columns that came from `UINT_8` versus signed `INT16`, by default Comet falls back to
- Spark when scanning Parquet files containing `ShortType` columns. This behavior can be disabled by setting
- `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType` columns are always safe because they can
- only come from signed `INT8`, where truncation preserves the signed value.
-- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
-- No support for datetime rebasing detection or the `spark.comet.exceptionOnDatetimeRebase` configuration. When reading
- Parquet files containing dates or timestamps written before Spark 3.0 (which used a hybrid Julian/Gregorian calendar),
- the `native_comet` implementation can detect these legacy values and either throw an exception or read them without
- rebasing. The DataFusion-based implementations do not have this detection capability and will read all dates/timestamps
- as if they were written using the Proleptic Gregorian calendar. This may produce incorrect results for dates before
- October 15, 1582.
-- No support for Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`,
- Spark uses the V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API, so Comet
- will fall back to `native_comet` when V2 is enabled.
-
-The `native_datafusion` scan has some additional limitations:
+Comet currently has two distinct implementations of the Parquet scan operator.
+
+| Scan Implementation | Notes |
+| ----------------------- | ---------------------- |
+| `native_datafusion` | Fully native scan |
+| `native_iceberg_compat` | Hybrid JVM/native scan |
+
+The configuration property
+`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, which
+currently always uses the `native_iceberg_compat` implementation. Most users should not need to change this setting.
+However, it is possible to force Comet to use a particular implementation for all scan operations by setting
+this configuration property to one of the implementations listed above. For example: `--conf spark.comet.scan.impl=native_datafusion`.
+
+The following features are not supported by either scan implementation, and Comet will fall back to Spark in these scenarios:
+
+- `ShortType` columns, by default. When reading Parquet files written by systems other than Spark that contain
+ columns with the logical type `UINT_8` (unsigned 8-bit integers), Comet may produce different results than Spark.
+ Spark maps `UINT_8` to `ShortType`, but Comet's Arrow-based readers respect the unsigned type and read the data as
+ unsigned rather than signed. Since Comet cannot distinguish `ShortType` columns that came from `UINT_8` versus
+ signed `INT16`, by default Comet falls back to Spark when scanning Parquet files containing `ShortType` columns.
+ This behavior can be disabled by setting `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType`
+ columns are always safe because they can only come from signed `INT8`, where truncation preserves the signed value.
+- Default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
+- Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`, Spark uses the
+ V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API.
+- Spark metadata columns (e.g., `_metadata.file_path`)
+- Dynamic Partition Pruning (DPP)
+
+The following shared limitation may produce incorrect results without falling back to Spark:
+
+- No support for datetime rebasing detection or the `spark.comet.exceptionOnDatetimeRebase` configuration. When
+ reading Parquet files containing dates or timestamps written before Spark 3.0 (which used a hybrid
+ Julian/Gregorian calendar), dates/timestamps will be read as if they were written using the Proleptic Gregorian
+ calendar. This may produce incorrect results for dates before October 15, 1582.
+
+The `native_datafusion` scan has some additional limitations, mostly related to Parquet metadata. All of these
+cause Comet to fall back to Spark.
- No support for row indexes
-- `PARQUET_FIELD_ID_READ_ENABLED` is not respected [#1758]
-- There are failures in the Spark SQL test suite [#1545]
-- Setting Spark configs `ignoreMissingFiles` or `ignoreCorruptFiles` to `true` is not compatible with Spark
+- No support for reading Parquet field IDs
+- No support for `input_file_name()`, `input_file_block_start()`, or `input_file_block_length()` SQL functions.
+ The `native_datafusion` scan does not use Spark's `FileScanRDD`, so these functions cannot populate their values.
+- No support for `ignoreMissingFiles` or `ignoreCorruptFiles` being set to `true`
-## S3 Support
-
-There are some differences in S3 support between the scan implementations.
-
-### `native_comet` (Deprecated)
+The `native_iceberg_compat` scan has the following additional limitation that may produce incorrect results
+without falling back to Spark:
-> **Note:** The `native_comet` scan implementation is deprecated and will be removed in a future release.
+- Some Spark configuration values are hard-coded to their defaults rather than respecting user-specified values.
+ This may produce incorrect results when non-default values are set. The affected configurations are
+ `spark.sql.parquet.binaryAsString`, `spark.sql.parquet.int96AsTimestamp`, `spark.sql.caseSensitive`,
+ `spark.sql.parquet.inferTimestampNTZ.enabled`, and `spark.sql.legacy.parquet.nanosAsLong`. See
+ [issue #1816](https://github.com/apache/datafusion-comet/issues/1816) for more details.
-The `native_comet` Parquet scan implementation reads data from S3 using the [Hadoop-AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html), which
-is identical to the approach commonly used with vanilla Spark. AWS credential configuration and other Hadoop S3A
-configurations works the same way as in vanilla Spark.
-
-### `native_datafusion` and `native_iceberg_compat`
+## S3 Support
The `native_datafusion` and `native_iceberg_compat` Parquet scan implementations completely offload data loading
to native code. They use the [`object_store` crate](https://crates.io/crates/object_store) to read data from S3 and
@@ -95,7 +84,8 @@ continue to work as long as the configurations are supported and can be translat
#### Additional S3 Configuration Options
-Beyond credential providers, the `native_datafusion` implementation supports additional S3 configuration options:
+Beyond credential providers, the `native_datafusion` and `native_iceberg_compat` implementations support additional
+S3 configuration options:
| Option                          | Description |
| ------------------------------- | -------------------------------------------------------------------------------------------------- |
@@ -108,7 +98,8 @@ All configuration options support bucket-specific overrides using the pattern `f
#### Examples
-The following examples demonstrate how to configure S3 access with the `native_datafusion` Parquet scan implementation using different authentication methods.
+The following examples demonstrate how to configure S3 access with the `native_datafusion` and `native_iceberg_compat`
+Parquet scan implementations using different authentication methods.
**Example 1: Simple Credentials**
@@ -140,11 +131,8 @@ $SPARK_HOME/bin/spark-shell \
#### Limitations
-The S3 support of `native_datafusion` has the following limitations:
+The S3 support of `native_datafusion` and `native_iceberg_compat` has the following limitations:
1. **Partial Hadoop S3A configuration support**: Not all Hadoop S3A configurations are currently supported. Only the configurations listed in the tables above are translated and applied to the underlying `object_store` crate.
2. **Custom credential providers**: Custom implementations of AWS credential providers are not supported. The implementation only supports the standard credential providers listed in the table above. We are planning to add support for custom credential providers through a JNI-based adapter that will allow calling Java credential providers from native code. See [issue #1829](https://github.com/apache/datafusion-comet/issues/1829) for more details.
-
-[#1545]: https://github.com/apache/datafusion-comet/issues/1545
-[#1758]: https://github.com/apache/datafusion-comet/issues/1758
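As an illustrative sketch only (not part of the diff above): a minimal Scala version of the "Simple Credentials" style of S3 configuration referenced in this file, assuming the standard Hadoop S3A properties `fs.s3a.access.key` and `fs.s3a.secret.key`. The bucket, path, and environment variable names are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: static S3 credentials supplied through the standard
// Hadoop S3A properties, which the native scans translate to the underlying
// object_store configuration (per the documentation above). The bucket and
// path below are placeholders.
val spark = SparkSession
  .builder()
  .appName("comet-s3-scan-example") // placeholder application name
  .config("spark.comet.scan.impl", "native_iceberg_compat")
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

spark.read.parquet("s3a://example-bucket/path/to/table").show()
```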
diff --git a/docs/source/contributor-guide/roadmap.md b/docs/source/contributor-guide/roadmap.md
index ce9c41416..6d99ee545 100644
--- a/docs/source/contributor-guide/roadmap.md
+++ b/docs/source/contributor-guide/roadmap.md
@@ -51,20 +51,6 @@ with benchmarks that benefit from this feature like TPC-DS. This effort can be t
[#3349]: https://github.com/apache/datafusion-comet/pull/3349
[#3510]: https://github.com/apache/datafusion-comet/issues/3510
-### Removing the native_comet scan implementation
-
-The `native_comet` scan implementation is now deprecated and will be removed in a future release ([#2186], [#2177]).
-This is the original scan implementation that uses mutable buffers (which is incompatible with best practices around
-Arrow FFI) and does not support complex types.
-
-Now that the default `auto` scan mode uses `native_iceberg_compat` (which is based on DataFusion's `DataSourceExec`),
-we can proceed with removing the `native_comet` scan implementation, and then improve the efficiency of our use of
-Arrow FFI ([#2171]).
-
-[#2186]: https://github.com/apache/datafusion-comet/issues/2186
-[#2171]: https://github.com/apache/datafusion-comet/issues/2171
-[#2177]: https://github.com/apache/datafusion-comet/issues/2177
-
## Ongoing Improvements
In addition to the major initiatives above, we have the following ongoing areas of work:
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]