[I] Analysis of IgnoreCometNativeScan/IgnoreCometNativeDataFusion tests with native_datafusion (Spark 3.5.7) [datafusion-comet]

via GitHub Wed, 28 Jan 2026 11:09:49 -0800


andygrove opened a new issue, #3305:
URL: https://github.com/apache/datafusion-comet/issues/3305


   # Native DataFusion Scan Test Analysis (Spark 3.5.7)
   
   ## Overview
   
   This analysis covers tests that were previously ignored for `native_datafusion` scan mode
   via `IgnoreCometNativeScan` or `IgnoreCometNativeDataFusion` tags in the Spark 3.5.7 diff.
   Each test was run with `spark.comet.scan.impl=native_datafusion` to determine whether
   the ignore directive is still necessary.
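
For reference, a minimal sketch of how the scan mode above might be selected when launching a Spark job. Only `spark.comet.scan.impl=native_datafusion` comes from this analysis. The other keys (`spark.plugins`, `spark.comet.enabled`) are assumptions about a typical Comet setup, and `conf_flags` is a hypothetical helper that only renders `--conf` arguments.

```python
# Sketch: render Spark `--conf` flags that select the native_datafusion scan.
# Only spark.comet.scan.impl is taken from the analysis above; the remaining
# keys are assumed to match a typical Comet setup and may differ in yours.
COMET_CONF = {
    "spark.plugins": "org.apache.spark.CometPlugin",  # assumption
    "spark.comet.enabled": "true",                    # assumption
    "spark.comet.scan.impl": "native_datafusion",     # from the analysis
}


def conf_flags(conf):
    """Turn a {key: value} mapping into a flat list of --conf arguments."""
    flags = []
    for key, value in conf.items():
        flags.extend(["--conf", f"{key}={value}"])
    return flags


print(" ".join(conf_flags(COMET_CONF)))
```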
   
   ## Summary
   
   - **Total tests run with ignore directives removed:** 8 (across 3 test files)
   - **Tests now passing:** 3 (ParquetEncryptionSuite)
   - **Tests still failing:** 5 (ParquetV1FilterSuite: 4, ParquetV1QuerySuite: 1)
   - **Diff updated:** Yes, removed `IgnoreCometNativeScan` from the 3 passing encryption tests
   
   ## Tests Now Passing (Ignore Removed)
   
   ### ParquetEncryptionSuite (`sql/hive`)
   
   All three encryption tests now pass with `native_datafusion`:
   
   | Test | Previous Ignore Reason |
   |------|------------------------|
   | `SPARK-34990: Write and read an encrypted parquet` | no encryption support yet |
   | `SPARK-37117: Can't read files in Parquet encryption external key material mode` | no encryption support yet |
   | `SPARK-42114: Test of uniform parquet encryption` | no encryption support yet |
   
   These tests verify that Spark can write and read encrypted Parquet files. The native
   DataFusion scan now handles encrypted Parquet correctly, so the ignore directives were
   removed from the diff.
   
   ## Tests Still Failing (Ignore Retained)
   
   ### ParquetV1FilterSuite (`sql/core`)
   
   All four tests fail only in the V1 source path (`ParquetV1FilterSuite`). The corresponding
   V2 tests (`ParquetV2FilterSuite`) pass because V2 sources don't use Comet's native scan.
   
   #### 1. `Filters should be pushed down for vectorized Parquet reader at row group level`
   
   - **Ignore reason:** Native scans do not support the tested accumulator
   - **Failure type:** `TestFailedException` (assertion failure)
   - **Details:** The test checks that Parquet filter pushdown works at the row group level
     by examining a custom accumulator that counts row groups. The native DataFusion scan
     does not support Spark's accumulator mechanism for tracking pushed-down filter statistics,
     so the assertion on the accumulator value fails.
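
The failure mode can be illustrated with a toy model, which is not Spark's or Comet's actual code. `RowGroupCounter`, `jvm_scan`, and `native_scan` are hypothetical names: the first scan reports each row group it reads to a counter, the second returns the same rows without touching it, so an assertion on the counter fails even though the query results are identical.

```python
class RowGroupCounter:
    """Hypothetical stand-in for the custom accumulator the test inspects."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n


def jvm_scan(row_groups, predicate, acc):
    """Models a reader that reports every row group it reads to the accumulator."""
    rows = []
    for group in row_groups:
        acc.add(1)
        rows.extend(r for r in group if predicate(r))
    return rows


def native_scan(row_groups, predicate, acc):
    """Models a reader that returns the same rows but never updates the accumulator."""
    return [r for group in row_groups for r in group if predicate(r)]


row_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
is_large = lambda r: r > 4

jvm_acc, native_acc = RowGroupCounter(), RowGroupCounter()
jvm_rows = jvm_scan(row_groups, is_large, jvm_acc)
native_rows = native_scan(row_groups, is_large, native_acc)

# Both readers return the same rows, but only one leaves a trace in the accumulator.
print(jvm_rows == native_rows)          # True
print(jvm_acc.value, native_acc.value)  # 3 0
```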
   
   #### 2. `filter pushdown - StringPredicate`
   
   - **Ignore reason:** cannot be pushed down
   - **Failure type:** `TestFailedException` (assertion failure)
   - **Details:** Tests that `StartsWith`, `EndsWith`, and `Contains` string predicates are
     pushed down into the Parquet reader. The native DataFusion scan does not push these
     string predicates down in the same way Spark's built-in reader does, causing the
     assertions on pushed filter counts to fail.
   
   #### 3. `SPARK-17091: Convert IN predicate to Parquet filter push-down`
   
   - **Ignore reason:** Comet has different push-down behavior
   - **Failure type:** `CometRuntimeException: CometNativeExec should not be executed directly without a serialized plan`
   - **Details:** The test constructs a DataFrame with specific filters and directly executes
     it in a way that triggers `CometNativeScan` without going through the proper native
     execution plan serialization. This is a fundamental incompatibility with how the native
     DataFusion scan handles standalone execution outside of a full native plan.
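
The exception can be pictured with a toy model. This is not Comet's implementation, and `NativeOperator` and `run_query` are hypothetical names. An operator refuses to run unless a planning step has attached a serialized plan to it, so calling it directly, as the test effectively does, raises an error.

```python
class NativeOperator:
    """Hypothetical operator that can only run once a serialized plan is attached."""

    def __init__(self, name):
        self.name = name
        self.serialized_plan = None  # filled in by the planning step below

    def execute(self):
        if self.serialized_plan is None:
            raise RuntimeError(
                f"{self.name} should not be executed directly without a serialized plan"
            )
        return f"ran {self.serialized_plan.decode()}"


def run_query(op):
    """Models the normal flow: serialize the plan first, then execute."""
    op.serialized_plan = f"plan({op.name})".encode()
    return op.execute()


scan = NativeOperator("scan")
try:
    scan.execute()  # direct execution, bypassing the planning step
    direct_error = None
except RuntimeError as exc:
    direct_error = str(exc)

print(direct_error)
print(run_query(NativeOperator("scan")))
```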
   
   #### 4. `SPARK-34562: Bloom filter push down`
   
   - **Ignore reason:** Native scans do not support the tested accumulator
   - **Failure type:** `TestFailedException` (assertion failure)
   - **Details:** Similar to test #1, this test relies on a custom accumulator to verify that
     bloom filter push-down is working. The native DataFusion scan does not integrate with
     Spark's accumulator framework for this purpose.
   
   ### ParquetV1QuerySuite (`sql/core`)
   
   #### 5. `SPARK-26677: negated null-safe equality comparison should not filter matched row groups`
   
   - **Ignore reason:** Native scans had the filter pushed into DF operator, cannot strip
   - **Failure type:** `CometRuntimeException: CometNativeExec should not be executed directly without a serialized plan`
   - **Details:** The test verifies that a negated null-safe equality filter (`NOT (value <=> 'A')`)
     does not incorrectly filter out row groups. With the native DataFusion scan, the filter
     gets pushed into the DataFusion operator rather than being handled at the Spark level.
     When the test tries to execute the scan directly, it hits the same serialization issue
     as SPARK-17091 above.
   
   ## Root Causes
   
   The 5 still-failing tests fall into two categories:
   
   1. **Push-down verification mismatch (tests #1, #2, #4):** Tests #1 and #4 assert on a
      custom accumulator that the native DataFusion scan never updates, because it bypasses
      Spark's accumulator mechanism for tracking filter pushdown statistics. Test #2 asserts
      that string predicates are pushed down, which the native DataFusion scan does not do
      in the same way as Spark's built-in reader. In all three cases the assertion fails.
   
   2. **Direct execution without serialized plan (tests #3, #5):** The native DataFusion scan
      requires execution through a serialized native plan. When tests construct and execute
      scans directly (outside of the normal query planning flow), they hit a
      `CometRuntimeException` because `CometNativeScan` cannot be executed standalone.
   
   ## Note on V2 Tests
   
   The `ParquetV2FilterSuite` and `ParquetV2QuerySuite` variants of these tests all pass
   because they use `USE_V1_SOURCE_LIST = ""`, which means Spark uses V2 data sources
   instead of V1. Comet's native scan only intercepts V1 Parquet sources, so V2 tests
   effectively run without Comet's native scan and pass trivially.
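
A rough sketch of the selection logic described above, as a toy model rather than Spark's actual code. `DEFAULT_V1_LIST` is an assumed default for the list, and `uses_v1_source` and `scan_path` are hypothetical helpers. The point is only that an empty list routes Parquet to V2, where a V1-only interception never applies.

```python
# Assumed default for the V1 source list; check your Spark version.
DEFAULT_V1_LIST = "avro,csv,json,kafka,orc,parquet,text"


def uses_v1_source(fmt, use_v1_source_list):
    """Return True if `fmt` appears in the comma-separated V1 source list."""
    names = {n.strip().lower() for n in use_v1_source_list.split(",") if n.strip()}
    return fmt.lower() in names


def scan_path(fmt, use_v1_source_list):
    """Toy model: the native scan is only considered for V1 Parquet sources."""
    if not uses_v1_source(fmt, use_v1_source_list):
        return "v2"
    return "v1 (native scan eligible)" if fmt.lower() == "parquet" else "v1"


print(scan_path("parquet", DEFAULT_V1_LIST))  # v1 (native scan eligible)
print(scan_path("parquet", ""))               # v2
```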
   

