andygrove opened a new issue, #4240:
URL: https://github.com/apache/datafusion-comet/issues/4240

   ### What is the problem the feature request solves?
   
   #4234 introduced `CometPythonMapInArrowExec`, eliminating `ColumnarToRow + 
UnsafeProjection` ahead of `mapInArrow` / `mapInPandas`. End-to-end speedup is 
1.30x-1.32x on narrow workloads but only 1.08x-1.09x on a 50-column workload 
because the input side of `ArrowPythonRunner` still re-encodes rows back into 
Arrow.
   
   The remaining round-trip: Comet feeds `ColumnarBatch.rowIterator()` to the 
existing `ArrowPythonRunner`, whose writer thread reads each row via the row 
API and writes it back into Arrow vectors via `ArrowWriter.write` before 
sending the IPC bytes. As a result, data that already lives in Arrow vectors 
takes an Arrow -> row view -> Arrow round trip inside the JVM.
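
A toy model of that round trip may help show why the cost grows with row width. This is only a sketch: plain Python lists stand in for Arrow vectors, and `row_iterator`, `rebuild_columns`, and `make_batch` are hypothetical helpers, not Comet or Spark APIs. It counts the per-cell writes needed to rebuild columns from a row view.

```python
from typing import Iterable, Iterator, List, Tuple

# One inner list per column; a stand-in for Arrow vectors (assumption).
Columns = List[List[int]]


def row_iterator(columns: Columns) -> Iterator[Tuple[int, ...]]:
    """Row view over columnar data, similar in spirit to a batch row iterator."""
    num_rows = len(columns[0]) if columns else 0
    for r in range(num_rows):
        yield tuple(col[r] for col in columns)


def rebuild_columns(rows: Iterable[Tuple[int, ...]], num_cols: int) -> Tuple[Columns, int]:
    """Re-encode rows into fresh columns, counting every cell that is written."""
    out: Columns = [[] for _ in range(num_cols)]
    cell_writes = 0
    for row in rows:
        for c, value in enumerate(row):
            out[c].append(value)
            cell_writes += 1
    return out, cell_writes


def make_batch(num_rows: int, num_cols: int) -> Columns:
    """Build a synthetic columnar batch with distinct values per cell."""
    return [[r * num_cols + c for r in range(num_rows)] for c in range(num_cols)]


narrow = make_batch(1000, 2)
wide = make_batch(1000, 50)

narrow_out, narrow_writes = rebuild_columns(row_iterator(narrow), 2)
wide_out, wide_writes = rebuild_columns(row_iterator(wide), 50)

print(narrow_writes, wide_writes)
```

The rebuilt columns are identical to the inputs, so every one of those writes is redundant work. In this model the write count is rows times columns, which is at least consistent with the smaller gain reported for the 50-column workload.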
   
   ### Describe the potential solution
   
   Replace the writer side of the runner with one that streams Arrow record 
batches straight to the Python IPC stream:
   
   - **Option A (smaller):** subclass `ArrowPythonRunner` and override the 
writer thread to accept `Iterator[ColumnarBatch]` and write batches via 
`ArrowStreamWriter` over a `VectorSchemaRoot` derived from the Comet vectors. 
This reuses the existing worker management, error handling, and traceback marshalling.
   - **Option B (bigger):** write a `CometArrowPythonRunner` extending 
`BasePythonRunner[Iterator[ColumnarBatch], ColumnarBatch]` directly. Cleaner 
separation, more code.
   
   Either path adds new shim methods to `ShimCometPythonMapInArrow` for each 
Spark version (3.5, 4.0, 4.1, 4.2). Use `benchmark_pyarrow_udf.py` to validate 
the win, especially on wide rows.
   
   ### Additional context
   
   Closing this issue would let us flip 
`spark.comet.exec.pythonMapInArrow.enabled` to default-true and drop the 
experimental marker. Related: #957, #4234.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

