[PR] [SPARK-56629][PYTHON][TESTS] Add ASV microbenchmark for SQL_COGROUPED_MAP_PANDAS_UDF [spark]

via GitHub Sat, 25 Apr 2026 07:20:09 -0700


Yicong-Huang opened a new pull request, #55551:
URL: https://github.com/apache/spark/pull/55551


   ### What changes were proposed in this pull request?
   
   Add ASV microbenchmarks for `SQL_COGROUPED_MAP_PANDAS_UDF` in 
`python/benchmarks/bench_eval_type.py`, mirroring the existing 
`SQL_COGROUPED_MAP_ARROW_UDF` benchmarks. The new section adds:
   
   - `_CogroupedMapPandasBenchMixin` with 6 scenarios (`few_groups_sm/lg`, 
`many_groups_sm/lg`, `wide_values`, `multi_key`) and 4 UDFs (`identity_udf`, 
`concat_udf`, `left_semi_udf`, `key_identity_udf`).
   - `CogroupedMapPandasUDFTimeBench` and `CogroupedMapPandasUDFPeakmemBench` 
driving the worker via the same wire protocol used by cogrouped arrow.
   
   The 3-argument `key_identity_udf` exercises the `(key, left_pdf, right_pdf)` 
UDF signature path. Scenario sizes are scaled down relative to the cogrouped 
Arrow benchmarks because pandas conversion adds per-group Arrow<->pandas 
overhead on both sides.
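
As a rough illustration of the two UDF shapes mentioned above, the sketch below dispatches on positional argument count. It is not the code in `pyspark.worker`: `call_cogrouped_udf` is a hypothetical helper, the UDF bodies are guesses based only on their names, and plain lists of dicts stand in for the pandas DataFrames so the snippet runs without dependencies.

```python
import inspect


# Hypothetical UDF bodies; only the names and signatures come from this PR.
def identity_udf(left, right):
    # 2-arg form: return the left group unchanged.
    return left


def key_identity_udf(key, left, right):
    # 3-arg form: the grouping key tuple is passed first.
    return left


def call_cogrouped_udf(udf, key, left, right):
    """Call udf with or without the key, based on its positional arg count."""
    n_args = len(inspect.getfullargspec(udf).args)
    if n_args == 2:
        return udf(left, right)
    if n_args == 3:
        return udf(key, left, right)
    raise ValueError(f"cogrouped UDF must take 2 or 3 arguments, got {n_args}")


# Lists of dicts stand in for the left/right pandas DataFrames of one cogroup.
left_group = [{"id": 1, "v": 10}, {"id": 1, "v": 11}]
right_group = [{"id": 1, "w": 5}]

out_2arg = call_cogrouped_udf(identity_udf, (1,), left_group, right_group)
out_3arg = call_cogrouped_udf(key_identity_udf, (1,), left_group, right_group)
```

The 3-arg path is benchmarked separately because the key has to be extracted and passed per group, which is extra work the 2-arg path skips.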
   
   ### Why are the changes needed?
   
   This is part of 
[SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724) 
(Micro-benchmark PySpark Eval Types). A baseline microbenchmark for 
`SQL_COGROUPED_MAP_PANDAS_UDF` is required before refactoring its serializer to 
use `ArrowStreamSerializer` so that any regression can be detected.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing tests, plus two stable local ASV runs via `eva-cli bench run` (run 
IDs #29 and #30, ~360s each); the numbers below are from one of those runs. 
Setup constructs the full worker binary protocol in memory, then 
`pyspark.worker.main` is invoked on it.
   
   ```text
   $ COLUMNS=120 asv run --bench CogroupedMapPandasUDFTimeBench --quick --python=same
   ================ ============== ============ =============== ==================
   --                                            udf
   ---------------- --------------------------------------------------------------
       scenario      identity_udf   concat_udf   left_semi_udf   key_identity_udf
   ================ ============== ============ =============== ==================
    few_groups_sm      182+-0ms       204+-0ms        203+-0ms          180+-0ms
    few_groups_lg      449+-0ms       574+-0ms        549+-0ms          450+-0ms
    many_groups_sm     1.39+-0s       1.50+-0s        1.55+-0s          1.42+-0s
    many_groups_lg     840+-0ms       922+-0ms        914+-0ms          787+-0ms
     wide_values       1.10+-0s       1.32+-0s        1.14+-0s          1.09+-0s
      multi_key        448+-0ms       493+-0ms        480+-0ms          442+-0ms
   ================ ============== ============ =============== ==================
   ```
   
   ```text
   $ COLUMNS=120 asv run --bench CogroupedMapPandasUDFPeakmemBench --quick --python=same
   ================ ============== ============ =============== ==================
   --                                            udf
   ---------------- --------------------------------------------------------------
       scenario      identity_udf   concat_udf   left_semi_udf   key_identity_udf
   ================ ============== ============ =============== ==================
    few_groups_sm        470M          473M           471M             470M
    few_groups_lg        509M          521M           509M             509M
    many_groups_sm       478M          480M           478M             478M
    many_groups_lg       499M          505M           498M             499M
     wide_values         501M          506M           502M             501M
      multi_key          479M          481M           480M             479M
   ================ ============== ============ =============== ==================
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


