andygrove opened a new pull request, #3230:
URL: https://github.com/apache/datafusion-comet/pull/3230
## Summary
This PR introduces an **experimental** optimization that allows the native
shuffle writer to directly execute the child native plan instead of reading
intermediate batches via JNI. This avoids the JNI round-trip for single-source
native plans.
**Current flow:**
```
Native Plan → ColumnarBatch → JNI → ScanExec → ShuffleWriterExec
```
**Optimized flow:**
```
Native Plan → (directly in native) → ShuffleWriterExec
```
### Configuration
The optimization is controlled by a new config option:
- `spark.comet.exec.shuffle.directNative.enabled` (default: `false`)
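
As a rough illustration of turning the option on, the sketch below sets the key when building a session. It assumes Spark and the Comet plugin are already on the classpath and otherwise configured; the object name, app name, and `local[*]` master are placeholders, and whether the option can also be changed on a running session is not stated in this PR.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: assumes Spark and Comet are on the classpath and that
// Comet itself is already enabled through its usual settings.
object DirectNativeShuffleDemo { // hypothetical object name
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("direct-native-shuffle-demo") // placeholder app name
      .master("local[*]")                    // placeholder master
      // The option added by this PR; it defaults to false.
      .config("spark.comet.exec.shuffle.directNative.enabled", "true")
      .getOrCreate()

    // Read the value back to confirm what the session sees.
    println(spark.conf.get("spark.comet.exec.shuffle.directNative.enabled"))
    spark.stop()
  }
}
```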
### Scope
The optimization currently applies when:
- Native shuffle mode is enabled (`spark.comet.shuffle.mode=native`)
- The shuffle's child is a single-source native plan with `CometNativeScanExec`
- The partitioning is not `RangePartitioning` (which requires sampling)
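
The conditions listed above can be read as a single predicate. The sketch below is a simplified, self-contained model and not the detection logic in `CometShuffleExchangeExec.scala`: every type and field name here (`ShuffleInput`, `nativeScanSources`, `DirectNativeShuffle.eligible`, and the local `Partitioning` hierarchy) is hypothetical and exists only to make the rules concrete.

```scala
// Hypothetical stand-ins for the partitioning schemes; these are NOT
// Spark's or Comet's real classes.
sealed trait Partitioning
case object HashPartitioning extends Partitioning
case object SinglePartition extends Partitioning
case object RangePartitioning extends Partitioning

// Hypothetical summary of the facts the eligibility check needs.
final case class ShuffleInput(
    directNativeEnabled: Boolean, // spark.comet.exec.shuffle.directNative.enabled
    shuffleMode: String,          // "native" or "jvm"
    childIsFullyNative: Boolean,  // the shuffle's child is a native plan
    nativeScanSources: Int,       // number of CometNativeScanExec sources in the child
    partitioning: Partitioning)

object DirectNativeShuffle {
  // True only when every condition from the Scope list holds.
  def eligible(in: ShuffleInput): Boolean =
    in.directNativeEnabled &&
      in.shuffleMode == "native" &&
      in.childIsFullyNative &&
      in.nativeScanSources == 1 &&
      in.partitioning != RangePartitioning // range partitioning requires sampling
}
```

In this model, any failed condition means the shuffle falls back to the existing JNI-based flow, which matches the fallback cases listed in the test plan.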
### Changes
| File | Change |
|------|--------|
| `CometShuffleDependency.scala` | Added `childNativePlan` field to pass native plan to writer |
| `CometShuffleExchangeExec.scala` | Added detection logic for single-source native plans |
| `CometShuffleManager.scala` | Pass native plan to shuffle writer |
| `CometNativeShuffleWriter.scala` | Use child native plan directly when available |
| `CometConf.scala` | Added `COMET_SHUFFLE_DIRECT_NATIVE_ENABLED` config option |
| `CometDirectNativeShuffleSuite.scala` | Comprehensive test suite with 15 tests |
## Test plan
- [x] Added `CometDirectNativeShuffleSuite` with 15 tests covering:
  - Basic optimization with single scan source
  - Filter/project operators
  - Single partition and multi-partition cases
  - Config enable/disable behavior
  - Range partitioning fallback
  - JVM shuffle mode fallback
  - Various data types
  - Edge cases (empty tables, filtered results)
  - Correctness comparison between optimized and non-optimized paths
- [x] Verified existing `CometNativeShuffleSuite` tests still pass (16/16)
🤖 Generated with [Claude Code](https://claude.ai/code)