andygrove opened a new issue, #4377:
URL: https://github.com/apache/datafusion-comet/issues/4377
## What is the problem the feature request solves?
After #4328, `CometSparkSessionExtensions.isCometLoaded` now returns `false`
and logs a warning when `spark.comet.exec.shuffle.enabled` is true (the
default) but `spark.shuffle.manager` is not set to `CometShuffleManager`.
This surfaces a long-standing asymmetry in the Spark SQL test diffs:
- `dev/diffs/{3.4.3,3.5.8,4.0.2,4.1.1}.diff` patch
`SparkSession.applyExtensions` to **globally** add
`CometSparkSessionExtensions` to every SparkSession built in the JVM whenever
`ENABLE_COMET=true`.
- The same diffs set `spark.shuffle.manager=CometShuffleManager` **only**
inside `SharedSparkSessionBase.sparkConf` and `TestHiveContext`.
Test suites that build their own `SparkConf`/`SparkSession` outside those
traits therefore get Comet installed but no shuffle manager. Before #4328 these
suites still ran with Comet, just falling back to Spark's default shuffle.
After #4328 Comet disables itself entirely for these suites and logs a warning
per `isCometLoaded` invocation.
The once-per-session log dedup landing alongside this issue keeps the log
volume manageable, but the underlying questions are unanswered:
1. Are these suites supposed to be exercising Comet at all? If yes, the diff
needs to register `CometShuffleManager` for them too (currently a coverage
regression).
2. If no, should we suppress the warning when Comet was auto-injected by the
diff rather than explicitly enabled by the user (i.e. only warn when the user
set `spark.comet.enabled=true` themselves)?
3. Should we revive the `spark.comet.exec.shuffle.required` opt-out that
#4328's description promised but the merged code did not implement?
## Spark 4.1.1 suites observed emitting the warning
Attribution is from the in-progress run
https://github.com/apache/datafusion-comet/actions/runs/26177223270 (six of
seven 4.1.1 jobs complete; sql_core-1 still running at file time). Counts are
pre-dedup.
| Warnings | Suite |
|---|---|
| ~304 | `BroadcastJoinSuiteAE` |
| ~132 | `BroadcastJoinSuite` |
| 58 | `ExecutorSideSQLConfSuite` |
| 30 | `ExpressionInfoSuite` |
| 24 | `DisableUnnecessaryBucketedScanWithoutHiveSupportSuite` |
| 18 | `SparkSessionJobTaggingAndCancellationSuite` |
| 18 | `SparkSessionBuilderSuite` |
| ~12 | `UISeleniumSuite` |
| 6 | `UISeleniumWithRocksDBBackendSuite` |
| 6 | `ParquetCommitterSuite` |
| 4 | `OrcFilterSuite` |
| 2 | `ParquetV1SchemaPruningSuite` |
| 2 | `OrcV2SchemaPruningSuite` |
The Spark 3.5.8, 4.0.2, and 3.4.3 diffs have the same shape, so the same
suites should surface there.
## Describe the potential solution
Investigate and decide whether this matters. Possible directions:
- Accept as-is — the warning is informative and the per-session dedup keeps
log volume manageable. Test coverage for those suites without
CometShuffleManager is small.
- Extend the diffs to register `CometShuffleManager` for the affected suites
(per-suite patches, or a broader hook such as a system property picked up by
every `SparkConf`).
- Add an opt-out config (e.g. `spark.comet.exec.shuffle.required`) so users
/ tests can keep Comet enabled with Spark's default shuffle manager.
- Restrict the disable-on-missing-shuffle-manager logic to sessions where
the user explicitly set `spark.comet.enabled=true`, leaving auto-injected
sessions in the prior "Comet on, default shuffle" mode.
## Additional context
- #4328 introduced the disable-and-warn behavior.
- The PR description originally proposed a
`spark.comet.exec.shuffle.required` config but the merged code gates on
`COMET_EXEC_SHUFFLE_ENABLED` instead.
- A separate change to log this warning at most once per `SQLConf` (per
session) lands alongside this issue.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]