viirya commented on PR #55552:
URL: https://github.com/apache/spark/pull/55552#issuecomment-4348615713
Thanks for the suggestion! I agree ASV is the right direction for benchmarks.
However, the current ASV benchmarks in python/benchmarks/ (e.g.,
bench_eval_type.py) work by calling worker_main(infile, outfile) directly with
a mock protocol, so they bypass the JVM and socket communication entirely. As a
result, they can only measure the Python worker's internal processing time.
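
To make that limitation concrete, here is a minimal sketch of an in-process, ASV-style benchmark. It is an illustration only: `toy_worker_main`, `ToyWorkerSuite`, and the length-prefixed framing are hypothetical stand-ins, not pyspark's `worker_main` or the actual code in bench_eval_type.py. The only ASV convention assumed is that `setup` runs before timing and that methods named `time_*` are timed.

```python
import io
import struct


def toy_worker_main(infile, outfile):
    """Hypothetical stand-in for a worker entry point (NOT pyspark's worker_main).

    Reads length-prefixed frames until a negative length is seen and writes
    each payload back uppercased, using the same toy framing.
    """
    while True:
        (length,) = struct.unpack(">i", infile.read(4))
        if length < 0:
            break
        out = infile.read(length).upper()
        outfile.write(struct.pack(">i", len(out)) + out)
    outfile.write(struct.pack(">i", -1))  # assumed end-of-stream marker


class ToyWorkerSuite:
    """ASV-style suite: the whole exchange happens on in-memory buffers."""

    def setup(self):
        # Pre-serialize the input once so only the worker loop is timed.
        buf = io.BytesIO()
        for i in range(1000):
            frame = b"row-%d" % i
            buf.write(struct.pack(">i", len(frame)) + frame)
        buf.write(struct.pack(">i", -1))
        self.input_bytes = buf.getvalue()

    def time_worker(self):
        # No JVM, no socket, no second thread: nothing here can overlap.
        toy_worker_main(io.BytesIO(self.input_bytes), io.BytesIO())
```

Because the input is fully materialized before the timed method starts, a benchmark shaped like this cannot observe any gain that comes from a producer and a consumer running concurrently.
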
The pipelined mode's performance benefit comes from overlapping JVM-side
socket writes with Python-side computation across a real socket connection
(full-duplex blocking I/O between a JVM writer thread and the task reader
thread). To benchmark this end-to-end, we would need the ASV framework to
support running a full SparkSession and executing actual Spark queries, which
the current setup doesn't do.
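
As a rough illustration of the overlap described above, the following toy model uses a local socket pair, with a writer thread standing in for the JVM writer and the main thread standing in for the Python worker. Everything in it is an assumption made for illustration: the function names, the length-prefixed framing, and the use of `time.sleep` as a stand-in for serialization and UDF cost. It is not the implementation in this PR and not a substitute for a real SparkSession benchmark.

```python
import socket
import struct
import threading
import time

END = -1  # assumed end-of-stream marker for this toy framing


def read_exact(sock, n):
    """Read exactly n bytes from a blocking socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("socket closed mid-frame")
        buf.extend(chunk)
    return bytes(buf)


def writer(sock, batches, produce_cost):
    """Stand-in for the JVM writer thread: pay a per-batch cost, then send."""
    for payload in batches:
        time.sleep(produce_cost)  # pretend serialization cost
        sock.sendall(struct.pack(">i", len(payload)) + payload)
    sock.sendall(struct.pack(">i", END))


def reader(sock, work):
    """Stand-in for the Python worker: read a frame, then compute on it."""
    results = []
    while True:
        (length,) = struct.unpack(">i", read_exact(sock, 4))
        if length < 0:
            return results
        results.append(work(read_exact(sock, length)))


def run_pipelined(batches, produce_cost, work):
    """Producer and consumer run concurrently across a real socket pair."""
    w_sock, r_sock = socket.socketpair()
    start = time.perf_counter()
    t = threading.Thread(target=writer, args=(w_sock, batches, produce_cost))
    t.start()
    try:
        results = reader(r_sock, work)
    finally:
        t.join()
        w_sock.close()
        r_sock.close()
    return results, time.perf_counter() - start


def run_serial(batches, produce_cost, work):
    """No-overlap baseline: produce and compute strictly one after the other."""
    start = time.perf_counter()
    results = []
    for payload in batches:
        time.sleep(produce_cost)
        results.append(work(payload))
    return results, time.perf_counter() - start


def slow_upper(payload):
    time.sleep(0.01)  # pretend UDF cost
    return payload.upper()


if __name__ == "__main__":
    data = [b"batch-%d" % i for i in range(20)]
    _, serial_s = run_serial(data, 0.01, slow_upper)
    _, piped_s = run_pipelined(data, 0.01, slow_upper)
    print(f"serial: {serial_s:.3f}s  pipelined: {piped_s:.3f}s")
```

In this model the serial run pays the producer and consumer costs back to back for every batch, while the pipelined run can hide one behind the other. The measured gap depends on the machine and on the ratio of the two costs, which is exactly the kind of effect an in-process benchmark cannot capture.
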
Would it make sense to open a separate PR to extend the ASV framework with
end-to-end SparkSession-based benchmark support, and then migrate this
benchmark? For now, the standalone script (bench_pipelined_udf.py) serves as an
ad-hoc verification tool for this PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]