Yicong-Huang opened a new pull request, #55600:
URL: https://github.com/apache/spark/pull/55600
### What changes were proposed in this pull request?
Adds ASV microbenchmarks for `SQL_GROUPED_MAP_PANDAS_ITER_UDF` to
`python/benchmarks/bench_eval_type.py`:
- `_GroupedMapPandasIterBenchMixin` reuses the parent
`_GroupedMapPandasBenchMixin` scenario configs and overrides `_udfs` and
`_write_scenario` to dispatch to `SQL_GROUPED_MAP_PANDAS_ITER_UDF`.
- UDFs cover the iterator pattern (`Iterator[pandas.DataFrame] ->
  Iterator[pandas.DataFrame]`):
  - `identity_udf`: yield from input iterator
  - `sort_udf`: sort each DataFrame by first column
  - `key_identity_udf`: 2-arg variant `(key, pdfs)` yielding from input
    iterator
- Two bench classes: `GroupedMapPandasIterUDFTimeBench` and
`GroupedMapPandasIterUDFPeakmemBench`.
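
For readers unfamiliar with the iterator pattern, here is a rough sketch of what the three UDF shapes listed above might look like. It is not the code from this PR. `SketchIterUDFBench`, `_UDFS`, and the toy input data are hypothetical names made up for illustration, and the class calls the UDFs in-process instead of dispatching through the Spark Python worker as the real benchmarks do. It only relies on the ASV conventions of `params`, `param_names`, `setup`, and the `time_` / `peakmem_` method prefixes.

```python
from typing import Iterator, Tuple

import pandas as pd


def identity_udf(pdfs: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Pass every batch of the group through unchanged.
    yield from pdfs


def sort_udf(pdfs: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Sort each batch by its first column.
    for pdf in pdfs:
        yield pdf.sort_values(by=pdf.columns[0])


def key_identity_udf(key: Tuple, pdfs: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Two-argument variant: the grouping key is received but not used.
    yield from pdfs


# Hypothetical lookup table so a benchmark can be parameterized by UDF name.
# The key variant is wrapped so that all entries take a single iterator.
_UDFS = {
    "identity_udf": identity_udf,
    "sort_udf": sort_udf,
    "key_identity_udf": lambda pdfs: key_identity_udf((0,), pdfs),
}


class SketchIterUDFBench:
    # Hypothetical ASV-style class (assumption: not the PR's mixin layout).
    # ASV calls setup() and each time_*/peakmem_* method once per parameter.
    params = [list(_UDFS)]
    param_names = ["udf"]

    def setup(self, udf):
        # Toy input: four small batches standing in for one group's data.
        self.batches = [
            pd.DataFrame({"a": [3, 1, 2], "b": [0.3, 0.1, 0.2]}) for _ in range(4)
        ]

    def _drain(self, udf):
        # Consume the output iterator so the generator body actually runs.
        for _ in _UDFS[udf](iter(self.batches)):
            pass

    def time_udf(self, udf):
        self._drain(udf)

    def peakmem_udf(self, udf):
        self._drain(udf)
```

Since the UDFs are generators, nothing executes until the output iterator is consumed, which is why the sketch drains it explicitly inside the timed method.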
### Why are the changes needed?
Establish a baseline before refactoring `SQL_GROUPED_MAP_PANDAS_ITER_UDF`
(subtask of [SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724)).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran ASV twice with `COLUMNS=120 asv run --bench GroupedMapPandasIterUDF --quick
--python=same`; the numbers were stable across both runs.
```text
[ 0.00%] ·· Benchmarking existing-py_home_yicong.huang_venv312_bin_python
[25.00%] ··· bench_eval_type.GroupedMapPandasIterUDFPeakmemBench.peakmem_worker  ok
[25.00%] ··· ================= ============== ========== ==================
             --                                    udf
             ----------------- --------------------------------------------
                 scenario       identity_udf   sort_udf   key_identity_udf
             ================= ============== ========== ==================
              sm_grp_few_col        477M         478M           472M
              sm_grp_many_col       485M         485M           484M
              lg_grp_few_col        635M         635M           551M
              lg_grp_many_col       767M         768M           766M
              mixed_types           467M         467M           467M
             ================= ============== ========== ==================
[50.00%] ··· bench_eval_type.GroupedMapPandasIterUDFTimeBench.time_worker  ok
[50.00%] ··· ================= ============== ========== ==================
             --                                    udf
             ----------------- --------------------------------------------
                 scenario       identity_udf   sort_udf   key_identity_udf
             ================= ============== ========== ==================
              sm_grp_few_col     460+/-0ms    519+/-0ms      425+/-0ms
              sm_grp_many_col    419+/-0ms    429+/-0ms      410+/-0ms
              lg_grp_few_col     817+/-0ms    1.08+/-0s      689+/-0ms
              lg_grp_many_col    994+/-0ms    1.08+/-0s      1.01+/-0s
              mixed_types        462+/-0ms    566+/-0ms      433+/-0ms
             ================= ============== ========== ==================
```
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]