Yicong-Huang opened a new pull request, #55834: URL: https://github.com/apache/spark/pull/55834
### What changes were proposed in this pull request?

`_ArrowBatchedBenchMixin._write_scenario` in `python/benchmarks/bench_eval_type.py` wrote the `input_type` schema JSON as a length-prefixed UTF-8 string before the UDF payload. This was the old wire-protocol shape. Since [SPARK-56340](https://issues.apache.org/jira/browse/SPARK-56340) (move input_type schema to eval conf), the worker reads `input_type` via `EvalConf` instead, so the extra prefix gets parsed as the UDF count and the worker exits with `UnicodeDecodeError` while reading subsequent UTF-8 fields.

This PR moves the schema to `eval_conf={"input_type": schema.json()}`, matching the pattern already used by `_ArrowTableUDFBenchMixin`.

### Why are the changes needed?

Running any `ArrowBatchedUDFTimeBench` / `ArrowBatchedUDFPeakmemBench` ASV benchmark currently fails with:

```
  File "pyspark/worker.py", line 3581, in main
    init_info = WorkerInitInfo.from_stream(infile)
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 353: invalid start byte
```

The bench file is the only `SQL_ARROW_BATCHED_UDF` mock writer in the tree and was missed when the worker protocol changed.

### Does this PR introduce _any_ user-facing change?

No. Test-only change.

### How was this patch tested?

Running both bench classes locally now succeeds.
Numbers from one run:

```text
=== bench_eval_type.ArrowBatchedUDFTimeBench.time_worker ===

scenario            identity_udf   stringify_udf   nullcheck_udf
sm_batch_few_col    44.3+/-0.3ms   46.9+/-0.3ms    45.0+/-0.4ms
sm_batch_many_col   112+/-0.7ms    113+/-1ms       112+/-0.5ms
lg_batch_few_col    106+/-0.7ms    113+/-2ms       106+/-0.4ms
lg_batch_many_col   448+/-1ms      449+/-0.3ms     447+/-3ms
pure_ints           157+/-1ms      162+/-1ms       156+/-2ms
pure_floats         148+/-0.2ms    170+/-1ms       149+/-2ms
pure_strings        302+/-0.5ms    305+/-3ms       295+/-0.7ms
mixed_types         226+/-0.9ms    230+/-1ms       222+/-0.9ms

=== bench_eval_type.ArrowBatchedUDFPeakmemBench.peakmem_worker ===

scenario            identity_udf   stringify_udf   nullcheck_udf
sm_batch_few_col    464M           464M            464M
sm_batch_many_col   469M           469M            469M
lg_batch_few_col    469M           470M            469M
lg_batch_many_col   509M           510M            509M
pure_ints           469M           470M            469M
pure_floats         469M           470M            469M
pure_strings        473M           473M            473M
mixed_types         471M           471M            470M
```

Run commands:

```bash
COLUMNS=120 asv run --bench ArrowBatchedUDFTimeBench -a repeat=3 --python=same
COLUMNS=120 asv run --bench ArrowBatchedUDFPeakmemBench -a repeat=3 --python=same
```

Smoke-tested all 40 benchmark classes in the file (every other class still passes; only the two ArrowBatched classes were broken).

### Was this patch authored or co-authored using generative AI tooling?

No.
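
For readers unfamiliar with the failure mode described under "What changes were proposed", here is a minimal sketch of how one stale length-prefixed field can desynchronize a stream reader. It uses a toy framing invented for illustration and is not PySpark's actual worker protocol: `write_payload`, `toy_read`, `try_read`, and the payload layout are all hypothetical. Only the general idea (big-endian int32 length prefixes followed by UTF-8 bytes) is assumed to resemble the real wire format.

```python
import io
import json
import struct


def write_int(value, out):
    # Big-endian signed 32-bit integer.
    out.write(struct.pack("!i", value))


def write_utf8(text, out):
    # Length-prefixed UTF-8 string.
    data = text.encode("utf-8")
    write_int(len(data), out)
    out.write(data)


def read_exact(stream, n):
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"wanted {n} bytes, got {len(data)}")
    return data


def read_int(stream):
    return struct.unpack("!i", read_exact(stream, 4))[0]


def read_utf8(stream):
    return read_exact(stream, read_int(stream)).decode("utf-8")


def write_payload(out, udf_names, eval_conf, stale_prefix=None):
    # Hypothetical layout: conf pairs first, then the UDF count and names.
    write_int(len(eval_conf), out)
    for key, value in eval_conf.items():
        write_utf8(key, out)
        write_utf8(value, out)
    if stale_prefix is not None:
        # Old-shape writer: an extra string the reader no longer expects.
        write_utf8(stale_prefix, out)
    write_int(len(udf_names), out)
    for name in udf_names:
        write_utf8(name, out)


def toy_read(stream):
    eval_conf = {}
    for _ in range(read_int(stream)):
        key = read_utf8(stream)
        eval_conf[key] = read_utf8(stream)
    udf_names = [read_utf8(stream) for _ in range(read_int(stream))]
    return eval_conf, udf_names


def try_read(raw):
    # Return the parsed payload, or the exception the reader hit.
    try:
        return toy_read(io.BytesIO(raw))
    except (EOFError, UnicodeDecodeError) as exc:
        return exc


schema_json = json.dumps({"type": "struct", "fields": []})

fixed = io.BytesIO()
write_payload(fixed, ["identity_udf"], {"input_type": schema_json})

stale = io.BytesIO()
write_payload(stale, ["identity_udf"], {}, stale_prefix=schema_json)

print(try_read(fixed.getvalue()))  # parses cleanly
print(repr(try_read(stale.getvalue())))  # reader is out of sync and fails
```

In the stale case the reader takes the schema's length prefix as the UDF count and then interprets the first bytes of the JSON text as the next length. In this toy that surfaces as an `EOFError`; in a real stream, which exception appears (for example the `UnicodeDecodeError` above) depends on the bytes that happen to follow.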
