andygrove commented on PR #4234:
URL: https://github.com/apache/datafusion-comet/pull/4234#issuecomment-4426341963
@mbutrovich thanks for the thorough pass — full mapping below.
**Framing.** PR description updated to "drop two `UnsafeProjection` copies
and keep the stage columnar". #4240 still tracks full Arrow-to-Arrow round-trip
elimination.
**Bigger items**
1. Conf + class renamed to `spark.comet.exec.pyarrowUdf.enabled` and
`CometMapInBatchExec`.
2. Shared 4.x bits live in `spark-4.x/.../Spark4xMapInBatchSupport.scala`
(matchers, `RunnerInputs`, the `runnerInputs` helper). Each minor version's
`ShimCometMapInBatch` is then just the per-version `ArrowPythonRunner` factory;
a rough sketch of the split follows this list.
3. The Scala suite is replaced with a per-version `CometMapInBatchSuite`
(`spark/src/test/spark-{3.5,4.x}`) that constructs a `PythonMapInArrowExec` /
`MapInArrowExec` over a stub `CometPlan` leaf and asserts that
`EliminateRedundantTransitions` rewrites it, plus a negative case with the
conf disabled.
4. `CometPlan` mixed into `CometMapInBatchExec`.
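To make item 2 concrete, here is the rough shape of the split. This is a sketch only: `Spark4xMapInBatchSupport`, `ShimCometMapInBatch`, `MapInBatchInfo`, and `ArrowPythonRunner` are the real names, while the field names, method names, and signatures below are illustrative stand-ins.

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.execution.SparkPlan

// Illustrative fields only; this is the case class that replaced the
// 5-tuple (item 5 below), holding whatever the matchers extract.
case class MapInBatchInfo(
    func: Expression,
    output: Seq[Attribute],
    child: SparkPlan,
    isBarrier: Boolean)

// Shared across the 4.x minors: recognize PythonMapInArrowExec /
// MapInArrowExec and pull out the pieces Comet needs.
trait Spark4xMapInBatchSupport {
  // Method name assumed for illustration.
  def matchMapInBatch(plan: SparkPlan): Option[MapInBatchInfo]
}

// Per-minor shim: only the ArrowPythonRunner construction drifts
// between releases, so each version supplies just this factory.
trait ShimCometMapInBatch {
  // Return type widened to AnyRef here; signature assumed.
  def makeArrowPythonRunner(info: MapInBatchInfo): AnyRef
}
```

The point of the split is that the matchers and `RunnerInputs` plumbing compile once against the common 4.x surface, while constructor churn stays isolated in each minor's shim.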
**Smaller things**
5. The 5-tuple replaced with a `MapInBatchInfo` case class (sketched above).
6. Single `match` on the matcher result in `EliminateRedundantTransitions`.
7. Unreachable raw `ColumnarToRowExec` branch removed from
`extractColumnarChild`; reasoning documented in the scaladoc above it.
8. A why-comment added on the `doExecute` fallback explaining why the row
transition gets reintroduced there.
9. `computeArrowPython`'s parameter count was first cut by deriving
`timeZoneId` / `largeVarTypes` / `pythonRunnerConf` / `jobArtifactUUID` inside
the shim, but that introduced the NPE @wForget caught on the executor
(`SQLConf.ConfigReader` is driver-only). The latest commit (`6f5aca3b3`) moves
resolution back to the driver in `doExecuteColumnar` and passes a serializable
`RunnerInputs` case class into `computeArrowPython`, so the param count stays
small without dereferencing `SQLConf` in the task closure; a sketch of the
pattern follows this list.
10. `local*` prefixes dropped.
11. Code-restating comments dropped; the struct-wrap and barrier notes kept.
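The driver-resolution pattern from item 9, again as a sketch: the field names and the `resolve` / `compute` helpers are placeholders of mine; only `RunnerInputs`, `doExecuteColumnar`, `computeArrowPython`, and the four derived values come from the actual change.

```scala
import org.apache.spark.sql.internal.SQLConf

// A case class is serializable by default, so it can ride into the
// task closure. Field names assumed from the four values above.
case class RunnerInputs(
    timeZoneId: String,
    largeVarTypes: Boolean,
    pythonRunnerConf: Map[String, String],
    jobArtifactUUID: Option[String])

object RunnerInputsSketch {
  // Driver side, called from doExecuteColumnar: SQLConf (and its
  // driver-only ConfigReader) is safe to touch here.
  def resolve(conf: SQLConf): RunnerInputs =
    RunnerInputs(
      timeZoneId = conf.sessionLocalTimeZone,
      largeVarTypes = conf.arrowUseLargeVarTypes,
      pythonRunnerConf = Map.empty, // built per version by the shim
      jobArtifactUUID = None)       // placeholder; real lookup differs

  // Executor side: the closure captures only the pre-resolved case
  // class, so nothing dereferences SQLConf in the task and the NPE
  // cannot recur.
  def compute(inputs: RunnerInputs): Unit = {
    val tz = inputs.timeZoneId // everything already resolved on the driver
    // ... hand `inputs` to the shim's ArrowPythonRunner factory ...
  }
}
```

Keeping resolution in `doExecuteColumnar` also keeps `computeArrowPython` down to a single parameter object, which was the original goal of the change.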
**Test infra**
12. Jar resolution consolidated into
`spark/src/test/resources/pyspark/conftest.py`; pytest and the benchmark script
both import `resolve_comet_jar` from there. The workflow no longer needs the
inline `ls | grep`.
13. `pyarrow_udf_test.yml` switched from `paths-ignore` to a feature-file
allowlist; build flavor switched to debug (no `-Prelease`, no `cargo
--release`).
14. The CI workflow is pinned to Spark 4.0, which covers the 4.x shim path.
Spark 3.5 still gets matcher coverage through the existing
`CometMapInBatchSuite` runs in `pr_build_linux` / `pr_build_macos`.
Compile-only 4.1/4.2 isn't wired up here yet; happy to add a matrix if you
want it, though my read is that the existing 4.1 build job already provides
that. Let me know.
15. The benchmark caveat (a passthrough UDF on `local[2]` mostly measures
Python fork/IPC overhead, so real UDFs shrink the delta further) is included
in the PR description.