Yicong-Huang opened a new pull request, #55244:
URL: https://github.com/apache/spark/pull/55244
### What changes were proposed in this pull request?
Add ASV microbenchmarks for `SQL_COGROUPED_MAP_ARROW_UDF` in
`bench_eval_type.py`.
Changes:
- Add `_CogroupedMapArrowBenchMixin` with three UDF variants: `identity_udf`, `concat_udf`, and `left_semi_udf`
- Add the `CogroupedMapArrowUDFTimeBench` and `CogroupedMapArrowUDFPeakmemBench` classes
- Add `MockDataFactory.make_cogrouped_batches()` to generate cogrouped batch pairs (left, right)
- Rename `make_batch_groups` to `make_grouped_batches` for consistency
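
For readers unfamiliar with ASV, the layout described above could look roughly like the following. This is a minimal sketch, not the code in `bench_eval_type.py`: the `*Sketch` class names, the scenario sizes, the dict-of-lists stand-in for Arrow batches, and the UDF bodies are all assumptions made for illustration. Only the `params`, `param_names`, `setup`, `time_*`, and `peakmem_*` conventions are standard ASV.

```python
# Hypothetical sketch of a mixin-based ASV benchmark layout (stdlib only).
# Plain dicts of column lists stand in for Arrow record batches.


def make_cogrouped_batches(num_groups, rows_per_group):
    """Return one (left, right) pair of toy batches per group (illustrative)."""
    pairs = []
    for g in range(num_groups):
        left = {"id": [g] * rows_per_group, "v": list(range(rows_per_group))}
        # The right side keeps every other value so the two sides differ.
        half = list(range(0, rows_per_group, 2))
        right = {"id": [g] * len(half), "v": half}
        pairs.append((left, right))
    return pairs


def identity_udf(left, right):
    # Return the left side unchanged.
    return left


def concat_udf(left, right):
    # Append the right side's rows to the left side, column by column.
    return {col: left[col] + right[col] for col in left}


def left_semi_udf(left, right):
    # Keep only left rows whose "v" value also appears on the right.
    keep = set(right["v"])
    idx = [i for i, v in enumerate(left["v"]) if v in keep]
    return {col: [vals[i] for i in idx] for col, vals in left.items()}


class _CogroupedBenchMixinSketch:
    # Assumed scenario sizes: (num_groups, rows_per_group).
    _scenarios = {"few_groups_sm": (4, 100), "many_groups_sm": (200, 100)}
    _udfs = {
        "identity_udf": identity_udf,
        "concat_udf": concat_udf,
        "left_semi_udf": left_semi_udf,
    }
    # ASV runs each benchmark once per combination of these parameters.
    params = [list(_scenarios), list(_udfs)]
    param_names = ["scenario", "udf"]

    def setup(self, scenario, udf):
        # Build the input outside the measured region.
        self.pairs = make_cogrouped_batches(*self._scenarios[scenario])
        self.fn = self._udfs[udf]

    def _run(self):
        return [self.fn(left, right) for left, right in self.pairs]


class CogroupedTimeBenchSketch(_CogroupedBenchMixinSketch):
    def time_worker(self, scenario, udf):
        self._run()


class CogroupedPeakmemBenchSketch(_CogroupedBenchMixinSketch):
    def peakmem_worker(self, scenario, udf):
        self._run()
```

Sharing `params` and `setup` through a mixin keeps the time and peak-memory classes measuring exactly the same workload, which is presumably why the PR takes that shape.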
### Why are the changes needed?
Part of SPARK-55724 (Micro-benchmark PySpark Eval Types). This provides a
performance baseline for `SQL_COGROUPED_MAP_ARROW_UDF` before refactoring.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran the ASV benchmarks with `repeat=(3, 5, 10.0)`, i.e. `(min_repeat, max_repeat, max_time)`:
```
CogroupedMapArrowUDFTimeBench.time_worker
================ =============== ============
scenario         udf
---------------- --------------- ------------
few_groups_sm    identity_udf    13.4±0.1ms
few_groups_sm    concat_udf      16.5±0.2ms
few_groups_sm    left_semi_udf   70.4±1ms
few_groups_lg    identity_udf    53.8±0.2ms
few_groups_lg    concat_udf      83.3±0.9ms
few_groups_lg    left_semi_udf   222±6ms
many_groups_sm   identity_udf    393±0.7ms
many_groups_sm   concat_udf      513±1ms
many_groups_sm   left_semi_udf   1.67±0.01s
many_groups_lg   identity_udf    200±4ms
many_groups_lg   concat_udf      265±1ms
many_groups_lg   left_semi_udf   997±50ms
wide_values      identity_udf    308±1ms
wide_values      concat_udf      394±2ms
wide_values      left_semi_udf   635±10ms
multi_key        identity_udf    75.1±0.2ms
multi_key        concat_udf      105±0.6ms
multi_key        left_semi_udf   233±2ms
================ =============== ============

CogroupedMapArrowUDFPeakmemBench.peakmem_worker
================ =============== ======
scenario         udf
---------------- --------------- ------
few_groups_sm    identity_udf    483M
few_groups_sm    concat_udf      488M
few_groups_sm    left_semi_udf   482M
few_groups_lg    identity_udf    682M
few_groups_lg    concat_udf      741M
few_groups_lg    left_semi_udf   715M
many_groups_sm   identity_udf    559M
many_groups_sm   concat_udf      579M
many_groups_sm   left_semi_udf   549M
many_groups_lg   identity_udf    845M
many_groups_lg   concat_udf      955M
many_groups_lg   left_semi_udf   870M
wide_values      identity_udf    810M
wide_values      concat_udf      919M
wide_values      left_semi_udf   772M
multi_key        identity_udf    572M
multi_key        concat_udf      593M
multi_key        left_semi_udf   586M
================ =============== ======
```
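
To read the timing table at a glance, the slowdown of `left_semi_udf` relative to `identity_udf` can be computed per scenario. The snippet below only copies the mean timings from the `time_worker` table above (converted to milliseconds, ignoring the ± spread); the variable names are ad hoc and not part of the PR.

```python
# Mean timings in milliseconds, copied from the time_worker table.
timings_ms = {
    "few_groups_sm": {"identity_udf": 13.4, "concat_udf": 16.5, "left_semi_udf": 70.4},
    "few_groups_lg": {"identity_udf": 53.8, "concat_udf": 83.3, "left_semi_udf": 222.0},
    "many_groups_sm": {"identity_udf": 393.0, "concat_udf": 513.0, "left_semi_udf": 1670.0},
    "many_groups_lg": {"identity_udf": 200.0, "concat_udf": 265.0, "left_semi_udf": 997.0},
    "wide_values": {"identity_udf": 308.0, "concat_udf": 394.0, "left_semi_udf": 635.0},
    "multi_key": {"identity_udf": 75.1, "concat_udf": 105.0, "left_semi_udf": 233.0},
}

# Ratio of left_semi_udf to identity_udf, rounded to one decimal place.
ratios = {
    scenario: round(t["left_semi_udf"] / t["identity_udf"], 1)
    for scenario, t in timings_ms.items()
}

for scenario, ratio in ratios.items():
    print(f"{scenario:<16} left_semi / identity = {ratio}x")
```

With these numbers the ratio ranges from about 2x (`wide_values`) to about 5x (`few_groups_sm`, `many_groups_lg`), so `left_semi_udf` is consistently the most expensive variant.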
### Was this patch authored or co-authored using generative AI tooling?
No.