andygrove commented on PR #4234:
URL: https://github.com/apache/datafusion-comet/pull/4234#issuecomment-4426341963
@mbutrovich thanks for the thorough pass — full mapping below.
**Framing.** PR description updated to "drop two `UnsafeProjection` copies
and keep the stage columnar". #4240 still tracks full Arrow-to-Arrow round-trip
elimination.
**Bigger items**
1. Conf + class renamed to `spark.comet.exec.pyarrowUdf.enabled` and
`CometMapInBatchExec`.
2. Shared 4.x bits live in `spark-4.x/.../Spark4xMapInBatchSupport.scala`
(matchers, `RunnerInputs`, the `runnerInputs` helper). Each minor version's
`ShimCometMapInBatch` is then just the per-version `ArrowPythonRunner` factory;
a rough sketch of the split follows this list.
3. The Scala suite is replaced with a per-version `CometMapInBatchSuite`
(`spark/src/test/spark-{3.5,4.x}`) that constructs a `PythonMapInArrowExec` /
`MapInArrowExec` over a stub `CometPlan` leaf and asserts that
`EliminateRedundantTransitions` rewrites it, plus a negative case with the
conf disabled.
4. `CometPlan` mixed into `CometMapInBatchExec`.
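To make item 2 concrete, here is the rough shape of the split. This is a sketch only: `Spark4xMapInBatchSupport`, `ShimCometMapInBatch`, `MapInBatchInfo`, and `ArrowPythonRunner` are the real names, while the field names, method names, and signatures below are illustrative stand-ins.

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.execution.SparkPlan

// Illustrative fields only; this is the case class that replaced the
// 5-tuple (item 5 below), holding whatever the matchers extract.
case class MapInBatchInfo(
    func: Expression,
    output: Seq[Attribute],
    child: SparkPlan,
    isBarrier: Boolean)

// Shared across the 4.x minors: recognize PythonMapInArrowExec /
// MapInArrowExec and pull out the pieces Comet needs.
trait Spark4xMapInBatchSupport {
  // Method name assumed for illustration.
  def matchMapInBatch(plan: SparkPlan): Option[MapInBatchInfo]
}

// Per-minor shim: only the ArrowPythonRunner construction drifts
// between releases, so each version supplies just this factory.
trait ShimCometMapInBatch {
  // Return type widened to AnyRef here; signature assumed.
  def makeArrowPythonRunner(info: MapInBatchInfo): AnyRef
}
```

The point of the split is that the matchers and `RunnerInputs` plumbing compile once against the common 4.x surface, while constructor churn stays isolated in each minor's shim.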
**Smaller things**
5. The 5-tuple replaced with a `MapInBatchInfo` case class (sketched above).
6. Single `match` on the matcher result in `EliminateRedundantTransitions`.
7. Unreachable raw `ColumnarToRowExec` branch removed from
`extractColumnarChild`; reasoning documented in the scaladoc above it.
8. A why-comment added on the `doExecute` fallback explaining why the row
transition gets reintroduced there.
9. `computeArrowPython`'s parameter count was first cut by deriving
`timeZoneId` / `largeVarTypes` / `pythonRunnerConf` / `jobArtifactUUID` inside
the shim, but that introduced the NPE @wForget caught on the executor
(`SQLConf.ConfigReader` is driver-only). The latest commit (`6f5aca3b3`) moves
resolution back to the driver in `doExecuteColumnar` and passes a serializable
`RunnerInputs` case class into `computeArrowPython`, so the param count stays
small without dereferencing `SQLConf` in the task closure; a sketch of the
pattern follows this list.
10. `local*` prefixes dropped.
11. Code-restating comments dropped; the struct-wrap and barrier notes kept.
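The driver-resolution pattern from item 9, again as a sketch: the field names and the `resolve` / `compute` helpers are placeholders of mine; only `RunnerInputs`, `doExecuteColumnar`, `computeArrowPython`, and the four derived values come from the actual change.

```scala
import org.apache.spark.sql.internal.SQLConf

// A case class is serializable by default, so it can ride into the
// task closure. Field names assumed from the four values above.
case class RunnerInputs(
    timeZoneId: String,
    largeVarTypes: Boolean,
    pythonRunnerConf: Map[String, String],
    jobArtifactUUID: Option[String])

object RunnerInputsSketch {
  // Driver side, called from doExecuteColumnar: SQLConf (and its
  // driver-only ConfigReader) is safe to touch here.
  def resolve(conf: SQLConf): RunnerInputs =
    RunnerInputs(
      timeZoneId = conf.sessionLocalTimeZone,
      largeVarTypes = conf.arrowUseLargeVarTypes,
      pythonRunnerConf = Map.empty, // built per version by the shim
      jobArtifactUUID = None)       // placeholder; real lookup differs

  // Executor side: the closure captures only the pre-resolved case
  // class, so nothing dereferences SQLConf in the task and the NPE
  // cannot recur.
  def compute(inputs: RunnerInputs): Unit = {
    val tz = inputs.timeZoneId // everything already resolved on the driver
    // ... hand `inputs` to the shim's ArrowPythonRunner factory ...
  }
}
```

Keeping resolution in `doExecuteColumnar` also keeps `computeArrowPython` down to a single parameter object, which was the original goal of the change.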
**Test infra**
12. Jar resolution consolidated into
`spark/src/test/resources/pyspark/conftest.py`; pytest and the benchmark script
both import `resolve_comet_jar` from there. The workflow no longer needs the
inline `ls | grep`.
13. `pyarrow_udf_test.yml` switched from `paths-ignore` to a feature-file
allowlist; build flavor switched to debug (no `-Prelease`, no `cargo
--release`).
14. The CI workflow is pinned to Spark 4.0, which covers the 4.x shim path.
Spark 3.5 still gets matcher coverage through the existing
`CometMapInBatchSuite` runs in `pr_build_linux` / `pr_build_macos`.
Compile-only 4.1/4.2 isn't wired up here yet; happy to add a matrix if you
want it, though my read is that the existing 4.1 build job already provides
that. Let me know.
15. The benchmark caveat (a passthrough UDF on `local[2]` mostly measures
Python fork/IPC overhead, so real UDFs shrink the delta further) is included
in the PR description.