Re: [PR] [SPARK-56642][SQL] Add pipelined JVM-Python UDF data transfer [spark]

via GitHub Wed, 29 Apr 2026 17:43:42 -0700


viirya commented on PR #55552:
URL: https://github.com/apache/spark/pull/55552#issuecomment-4348615713


   Thanks for the suggestion! I agree ASV is the right direction for benchmarks.
   
   However, the current ASV benchmarks in python/benchmarks/ (e.g., 
bench_eval_type.py) work by directly calling worker_main(infile, outfile) with 
a mock protocol, so they bypass the JVM and socket communication entirely. This 
means they can only measure the Python worker's internal processing time, not 
the JVM-to-Python data transfer that this PR changes.
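
   For illustration, an in-process benchmark of that shape looks roughly like 
the sketch below. It is not the code in bench_eval_type.py: fake_worker_main, 
the length-prefixed framing, and the -1 end-of-stream marker are hypothetical 
stand-ins, and only the setup()/time_* layout follows the usual ASV convention. 
The point is that everything runs against in-memory buffers, so no socket or 
JVM is involved.

```python
import io
import struct


def fake_worker_main(infile, outfile):
    """Hypothetical stand-in for the worker entry point.

    Reads length-prefixed frames until a negative length is seen and
    writes each payload back upper-cased with the same framing.
    """
    while True:
        (length,) = struct.unpack("!i", infile.read(4))
        if length < 0:  # assumed end-of-stream marker
            break
        payload = infile.read(length)
        result = payload.upper()
        outfile.write(struct.pack("!i", len(result)) + result)


class InProcessWorkerBench:
    """ASV-style layout: setup() prepares input, time_* methods are timed."""

    def setup(self):
        buf = io.BytesIO()
        for _ in range(1000):
            row = b"spark"
            buf.write(struct.pack("!i", len(row)) + row)
        buf.write(struct.pack("!i", -1))
        self.payload = buf.getvalue()

    def time_worker(self):
        # Both ends are in-memory buffers: only the worker's own
        # processing is measured, never a real socket or the JVM.
        infile = io.BytesIO(self.payload)
        outfile = io.BytesIO()
        fake_worker_main(infile, outfile)
        return outfile.getvalue()
```

   Because the input is fully materialized before the timed call starts, there 
is no writer running concurrently with the worker in this kind of benchmark.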
                                                                                
   The pipelined mode's performance benefit comes from overlapping JVM-side 
socket writes with Python-side computation across a real socket connection 
(full-duplex blocking I/O between a JVM writer thread and the task reader 
thread). To benchmark this end-to-end, we would need the ASV framework to 
support running a full SparkSession and executing actual Spark queries, which 
the current setup doesn't do.                                                   
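
   To make the overlap concrete, here is a minimal Python-only sketch of the 
idea. It is not the PR's implementation: a plain thread stands in for the JVM 
writer thread, socket.socketpair() stands in for the JVM-Python connection, 
time.sleep() stands in for UDF computation, and the framing (4-byte length 
prefix, -1 as end-of-stream) is an assumption made for the example.

```python
import socket
import struct
import threading
import time


def send_batches(sock, batches):
    """Writer side (stands in for the JVM writer thread)."""
    for batch in batches:
        sock.sendall(struct.pack("!i", len(batch)) + batch)
    sock.sendall(struct.pack("!i", -1))  # assumed end-of-stream marker


def recv_exact(sock, n):
    """Read exactly n bytes from a blocking socket."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise EOFError("socket closed early")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def consume_batches(sock, work_seconds):
    """Reader side (stands in for the Python worker)."""
    results = []
    while True:
        (length,) = struct.unpack("!i", recv_exact(sock, 4))
        if length < 0:
            return results
        batch = recv_exact(sock, length)
        time.sleep(work_seconds)  # simulated per-batch computation
        results.append(len(batch))


def run_pipelined(batches, work_seconds=0.001):
    """Send on one thread while the caller's thread reads and computes."""
    writer_sock, reader_sock = socket.socketpair()
    try:
        writer = threading.Thread(target=send_batches, args=(writer_sock, batches))
        start = time.perf_counter()
        writer.start()
        results = consume_batches(reader_sock, work_seconds)
        writer.join()
        elapsed = time.perf_counter() - start
    finally:
        writer_sock.close()
        reader_sock.close()
    return results, elapsed
```

   While the reader is busy with one batch, the writer can keep filling the 
socket buffer with the next ones. That interaction only shows up with a real 
connection between two concurrently running sides, which is what an in-process 
benchmark cannot exercise.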
      
   Would it make sense to open a separate PR to extend the ASV framework with 
end-to-end SparkSession-based benchmark support, and then migrate this 
benchmark? For now, the standalone script (bench_pipelined_udf.py) serves as an 
ad-hoc verification tool for this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

