Yicong-Huang opened a new pull request, #55551: URL: https://github.com/apache/spark/pull/55551
### What changes were proposed in this pull request?

Add ASV microbenchmarks for `SQL_COGROUPED_MAP_PANDAS_UDF` in `python/benchmarks/bench_eval_type.py`, mirroring the existing `SQL_COGROUPED_MAP_ARROW_UDF` benchmarks. The new section adds:

- `_CogroupedMapPandasBenchMixin` with 6 scenarios (`few_groups_sm/lg`, `many_groups_sm/lg`, `wide_values`, `multi_key`) and 4 UDFs (`identity_udf`, `concat_udf`, `left_semi_udf`, `key_identity_udf`).
- `CogroupedMapPandasUDFTimeBench` and `CogroupedMapPandasUDFPeakmemBench` driving the worker via the same wire protocol used by cogrouped arrow.

The 3-arg `key_identity_udf` exercises the `(key, left_pdf, right_pdf)` UDF signature path. Scenario sizes are scaled down vs cogrouped arrow because Pandas conversion adds per-group Arrow<->Pandas overhead on both sides.

### Why are the changes needed?

This is part of [SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724) (Micro-benchmark PySpark Eval Types). A baseline microbenchmark for `SQL_COGROUPED_MAP_PANDAS_UDF` is required before refactoring its serializer to use `ArrowStreamSerializer` so that any regression can be detected.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests. Two stable local ASV runs via `eva-cli bench run` (run ids #29 and #30, ~360s each); numbers below are from one run. Setup constructs the full worker binary protocol in memory, then `pyspark.worker.main` is invoked.
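For readers unfamiliar with the shape of these benchmarks, the following is a rough sketch of the scenario-by-UDF parameter grid and of the 2-arg versus 3-arg UDF signatures described above. It is not the code in `bench_eval_type.py`: only the scenario and UDF names come from this description. The class name `CogroupedGridSketch`, the helper `call_cogrouped_udf`, the UDF bodies, and the use of lists of row dicts in place of pandas DataFrames are all assumptions made to keep the example self-contained, and the arity-based dispatch is an illustration rather than the worker's actual logic.

```python
import inspect
import itertools

# Scenario names as listed in the PR description.
SCENARIOS = [
    "few_groups_sm", "few_groups_lg", "many_groups_sm",
    "many_groups_lg", "wide_values", "multi_key",
]


# Hypothetical UDF bodies. Rows are plain dicts standing in for DataFrame rows.
def identity_udf(left, right):
    # 2-arg form: receives only the two cogrouped frames.
    return left


def concat_udf(left, right):
    # Stack the rows of both sides.
    return left + right


def left_semi_udf(left, right):
    # Keep left rows whose "id" also appears on the right side.
    right_ids = {row["id"] for row in right}
    return [row for row in left if row["id"] in right_ids]


def key_identity_udf(key, left, right):
    # 3-arg form: additionally receives the grouping key.
    return left


UDFS = {f.__name__: f for f in
        (identity_udf, concat_udf, left_semi_udf, key_identity_udf)}


def call_cogrouped_udf(udf, key, left, right):
    """Pass the key only to UDFs that declare three parameters."""
    n_args = len(inspect.signature(udf).parameters)
    if n_args == 2:
        return udf(left, right)
    if n_args == 3:
        return udf(key, left, right)
    raise TypeError(f"expected a 2- or 3-argument UDF, got {n_args}")


class CogroupedGridSketch:
    # ASV runs every time_* method once per combination of these params.
    params = [SCENARIOS, list(UDFS)]
    param_names = ["scenario", "udf"]

    def setup(self, scenario, udf):
        # Tiny stand-in input; the real benchmark builds worker protocol bytes.
        self.key = (1,)
        self.left = [{"id": 1, "v": 1.0}, {"id": 2, "v": 2.0}]
        self.right = [{"id": 1, "w": 10.0}]

    def time_call(self, scenario, udf):
        call_cogrouped_udf(UDFS[udf], self.key, self.left, self.right)


# 6 scenarios x 4 UDFs, matching the layout of the result tables.
grid = list(itertools.product(*CogroupedGridSketch.params))
```

Declaring `params` and `param_names` on the class is what makes ASV print one row per scenario and one column per UDF, which is the layout of the result tables in this description.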
```text
$ COLUMNS=120 asv run --bench CogroupedMapPandasUDFTimeBench --quick --python=same
================ ============== ============ =============== ==================
--                                            udf
---------------- --------------------------------------------------------------
 scenario         identity_udf   concat_udf   left_semi_udf   key_identity_udf
================ ============== ============ =============== ==================
 few_groups_sm    182+-0ms       204+-0ms     203+-0ms        180+-0ms
 few_groups_lg    449+-0ms       574+-0ms     549+-0ms        450+-0ms
 many_groups_sm   1.39+-0s       1.50+-0s     1.55+-0s        1.42+-0s
 many_groups_lg   840+-0ms       922+-0ms     914+-0ms        787+-0ms
 wide_values      1.10+-0s       1.32+-0s     1.14+-0s        1.09+-0s
 multi_key        448+-0ms       493+-0ms     480+-0ms        442+-0ms
================ ============== ============ =============== ==================
```

```text
$ COLUMNS=120 asv run --bench CogroupedMapPandasUDFPeakmemBench --quick --python=same
================ ============== ============ =============== ==================
--                                            udf
---------------- --------------------------------------------------------------
 scenario         identity_udf   concat_udf   left_semi_udf   key_identity_udf
================ ============== ============ =============== ==================
 few_groups_sm    470M           473M         471M            470M
 few_groups_lg    509M           521M         509M            509M
 many_groups_sm   478M           480M         478M            478M
 many_groups_lg   499M           505M         498M            499M
 wide_values      501M           506M         502M            501M
 multi_key        479M           481M         480M            479M
================ ============== ============ =============== ==================
```

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
