Re: [PR] feat: Add experimental support for accelerated PyArrow UDFs [datafusion-comet]

via GitHub Thu, 21 May 2026 08:40:39 -0700


andygrove commented on PR #4234:
URL: 
https://github.com/apache/datafusion-comet/pull/4234#issuecomment-4509884836


   **Item 7 (CometArrowPythonRunner dedupe across 4.1/4.2).** Took a closer 
look. The shared surface ends up smaller than the diff suggests because Spark's 
`BaseArrowPythonRunner` itself changed shape between minors:
   
   - 4.1 constructor: `(funcs, evalType, argOffsets, schema, timeZoneId, 
largeVarTypes, workerConf, pythonMetrics, jobArtifactUUID, sessionUUID)`
   - 4.2 constructor: `(funcs, evalType, argOffsets, schema, timeZoneId, 
largeVarTypes, pythonMetrics, jobArtifactUUID, sessionUUID)` — `workerConf` 
dropped, overridden via `runnerConf` instead
   
   A shared `CometArrowPythonRunnerBase` would have to live in `spark-4.x/` to 
be visible to both minors, but `BaseArrowPythonRunner` isn't on the classpath 
there at a single signature, so the base would still need per-minor 
constructors and `super(...)` calls. Net saving: the two trait mix-ins 
(`CometColumnarPythonInput` + `BasicPythonArrowOutput`) factor out, which is 
roughly 10 lines for one more abstraction layer. I'd rather keep the per-minor 
pattern that mirrors `Spark4xMapInBatchSupport` already in the repo. Open to 
revisiting if 4.3 lands and the two minors converge on a stable signature.
   
   **Item 11 (fuzz harness).** Borrowing the `CometCodegenFuzzSuite` shape into 
`test_pyarrow_udf.py` is the right next step but I'd like to do it in a 
follow-up so this PR doesn't grow further. Filed as #4384.
   
   For this round I've added the targeted gaps in commit a52032176:
   
   - Decimal precision sweep `{1, 9, 17, 18, 19, 28, 38}` × scale `{0, half, 
max}` — covers the long-backed / `FixedSizeBinary` boundary at 18/19
   - Null density sweep `{0, 0.01, 0.5, 0.99, 1.0}` for validity-buffer memcpy
   - Multi-batch per partition (4000 rows, `maxRecordsPerBatch=16`) so the 
persistent IPC root is exercised across ~250 batches
   - 50-column mixed schema for the flattened-tree address walk
   - Mid-stream zero-row batch (validity sizing + `setValueCount(0)`)
   - Transforming array UDF (reverse each list) — catches symmetric 
encode/decode mistakes a passthrough would invert
   
   Also filed #4383 for item 3 (drop the per-batch Comet→Spark buffer copy via 
direct `ArrowRecordBatch` assembly). That one supersedes the bulk-copy framing 
#4294 was built on; the doc rewrite in b69da2ebe explains why.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: Add experimental support for accelerated PyArrow UDFs [datafusion-comet]

Reply via email to