Yicong-Huang opened a new pull request, #55532:
URL: https://github.com/apache/spark/pull/55532
### What changes were proposed in this pull request?
Replace the `verify_result(expected_type)` iterator factory in `worker.py`
with a unified helper:
```python
verify_return_type(result, expected_type)
```
- When `expected_type` is a concrete type (e.g. `pa.Table`,
`pa.RecordBatch`), it checks `isinstance(result, expected_type)` and returns
`result` unchanged.
- When `expected_type` is `Iterator[T]`, it checks that `result` is an
`Iterator` and returns a lazy iterator that type-checks each element against
`T` on consumption.
The two existing callers are updated:
- `SQL_MAP_ARROW_ITER_UDF`: `verify_result(pa.RecordBatch)(output_batches)`
-> `verify_return_type(output_batches, Iterator[pa.RecordBatch])`
- `SQL_SCALAR_ARROW_ITER_UDF`:
`verify_result(pa.Array)(udf_func(args_iter))` ->
`verify_return_type(udf_func(args_iter), Iterator[pa.Array])`
The runtime now strictly requires an `Iterator` (dropping the prior
`hasattr(result, "__iter__")` fallback that also admitted plain iterables such
as `list`). This aligns with the public UDF type contract
(`ArrowMapIterFunction = Callable[[Iterator[pa.RecordBatch]],
Iterator[pa.RecordBatch]]`) and with the type-hint inference in
`pyspark/sql/pandas/typehints.py` which only recognizes `Iterator`.
### Why are the changes needed?
Part of SPARK-55388. Consolidates the iterator factory and the scattered
container-type `isinstance` checks behind one well-named helper, so downstream
eval-type refactors can uniformly replace inline `isinstance` checks with a
single call.
### Does this PR introduce _any_ user-facing change?
No user-facing API change. One behavior tightening: a UDF returning a plain
`Iterable` that is not an `Iterator` (e.g. a `list`) is now rejected with a
`UDF_RETURN_TYPE` error. The public signatures (`ArrowMapIterFunction`, pandas
iterator UDFs) have always declared `Iterator[...]`, so this aligns runtime
enforcement with the documented contract.
### How was this patch tested?
Existing tests. No behavior change for contract-compliant UDFs.
- Added `python/pyspark/tests/test_verify_return_type.py` with 6 unit tests
covering:
- concrete-type accept / reject
- iterator-mode accept + laziness
- iterator-mode reject of non-iterable, non-Iterator iterable (`list`),
and wrong element
- Existing integration coverage remains:
`python/pyspark/sql/tests/arrow/test_arrow_map.py::test_other_than_recordbatch_iter`
verifies the end-to-end error messages.
### Was this patch authored or co-authored using generative AI tooling?
No
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]