Yicong-Huang opened a new pull request, #55600:
URL: https://github.com/apache/spark/pull/55600

   ### What changes were proposed in this pull request?
   
   Adds ASV microbenchmarks for `SQL_GROUPED_MAP_PANDAS_ITER_UDF` to `python/benchmarks/bench_eval_type.py`:
   
   - `_GroupedMapPandasIterBenchMixin` reuses the parent `_GroupedMapPandasBenchMixin` scenario configs and overrides `_udfs` and `_write_scenario` to dispatch to `SQL_GROUPED_MAP_PANDAS_ITER_UDF`.
   - UDFs cover the iterator pattern (`Iterator[pandas.DataFrame] -> Iterator[pandas.DataFrame]`):
     - `identity_udf`: yield from input iterator
     - `sort_udf`: sort each DataFrame by first column
     - `key_identity_udf`: 2-arg variant `(key, pdfs)` yielding from input iterator
   - Two bench classes: `GroupedMapPandasIterUDFTimeBench` and `GroupedMapPandasIterUDFPeakmemBench`.
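
For readers unfamiliar with the iterator pattern, the three UDF shapes listed above could look roughly like the sketch below. This is an illustration written from the bullet descriptions only; the actual bodies in `bench_eval_type.py` may differ, and the type hints and docstrings here are assumptions.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Any, Iterator, Tuple

if TYPE_CHECKING:  # pandas is only referenced in type hints in this sketch
    import pandas as pd


def identity_udf(pdfs: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Pass every batch of the group through unchanged."""
    yield from pdfs


def sort_udf(pdfs: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Sort each incoming DataFrame by its first column."""
    for pdf in pdfs:
        yield pdf.sort_values(pdf.columns[0])


def key_identity_udf(
    key: Tuple[Any, ...], pdfs: Iterator[pd.DataFrame]
) -> Iterator[pd.DataFrame]:
    """Two-argument variant: receives the grouping key but ignores it."""
    yield from pdfs
```

All three are generators, so each batch is consumed and emitted lazily rather than collected into a list first.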
   
   ### Why are the changes needed?
   
   Establish a baseline before refactoring `SQL_GROUPED_MAP_PANDAS_ITER_UDF` (subtask of [SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724)).
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Ran ASV twice with `COLUMNS=120 asv run --bench GroupedMapPandasIterUDF --quick --python=same`; the numbers were stable across both runs.
   
   ```text
   [ 0.00%] ·· Benchmarking existing-py_home_yicong.huang_venv312_bin_python
   [25.00%] ··· bench_eval_type.GroupedMapPandasIterUDFPeakmemBench.peakmem_worker   ok
   [25.00%] ··· ================= ============== ========== ==================
                --                                    udf
                ----------------- --------------------------------------------
                     scenario      identity_udf   sort_udf   key_identity_udf
                ================= ============== ========== ==================
                  sm_grp_few_col       477M         478M           472M
                 sm_grp_many_col       485M         485M           484M
                  lg_grp_few_col       635M         635M           551M
                 lg_grp_many_col       767M         768M           766M
                   mixed_types         467M         467M           467M
                ================= ============== ========== ==================
   
   [50.00%] ··· bench_eval_type.GroupedMapPandasIterUDFTimeBench.time_worker   ok
   [50.00%] ··· ================= ============== ========== ==================
                --                                    udf
                ----------------- --------------------------------------------
                     scenario      identity_udf   sort_udf   key_identity_udf
                ================= ============== ========== ==================
                  sm_grp_few_col     460+/-0ms    519+/-0ms      425+/-0ms
                 sm_grp_many_col     419+/-0ms    429+/-0ms      410+/-0ms
                  lg_grp_few_col     817+/-0ms    1.08+/-0s      689+/-0ms
                 lg_grp_many_col     994+/-0ms    1.08+/-0s      1.01+/-0s
                   mixed_types       462+/-0ms    566+/-0ms      433+/-0ms
                ================= ============== ========== ==================
   ```
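
The `scenario` x `udf` grid in the output comes from ASV's parameterization: a benchmark class declares `params` and `param_names`, and ASV calls `setup` and each `time_*` / `peakmem_*` method once per combination. The sketch below shows that general shape with a shared mixin and two thin bench classes. Every name, scenario size, and the list-based stand-in for DataFrame batches is hypothetical; it is not the code in `bench_eval_type.py`.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Batch = List[int]  # hypothetical stand-in for a pandas.DataFrame batch


def _identity(batches: Iterator[Batch]) -> Iterator[Batch]:
    yield from batches


def _sort(batches: Iterator[Batch]) -> Iterator[Batch]:
    for batch in batches:
        yield sorted(batch)


_UDFS: Dict[str, Callable[[Iterator[Batch]], Iterator[Batch]]] = {
    "identity_udf": _identity,
    "sort_udf": _sort,
}
# Hypothetical scenarios: (number of groups, rows per group).
_SCENARIOS: Dict[str, Tuple[int, int]] = {"sm_grp": (100, 10), "lg_grp": (10, 1000)}


class _ExampleBenchMixin:
    # ASV runs each benchmark once per (scenario, udf) combination.
    params = [list(_SCENARIOS), list(_UDFS)]
    param_names = ["scenario", "udf"]

    def setup(self, scenario: str, udf: str) -> None:
        num_groups, rows = _SCENARIOS[scenario]
        self.groups = [list(range(rows, 0, -1)) for _ in range(num_groups)]
        self.udf = _UDFS[udf]

    def _run(self) -> int:
        # Feed each group to the UDF as a one-batch iterator and drain the output.
        return sum(len(out) for g in self.groups for out in self.udf(iter([g])))


class ExampleTimeBench(_ExampleBenchMixin):
    def time_worker(self, scenario: str, udf: str) -> None:  # wall-clock time
        self._run()


class ExamplePeakmemBench(_ExampleBenchMixin):
    def peakmem_worker(self, scenario: str, udf: str) -> None:  # peak memory
        self._run()
```

Keeping the workload in a mixin means the time and peak-memory classes measure identical work, which is what makes the two tables above comparable row by row.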
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.



