[PR] [SPARK-56612][PYTHON] Unify verify_result and container-type checks into verify_return_type helper [spark]

via GitHub Fri, 24 Apr 2026 01:20:20 -0700


Yicong-Huang opened a new pull request, #55532:
URL: https://github.com/apache/spark/pull/55532


   ### What changes were proposed in this pull request?
   
   Replace the `verify_result(expected_type)` iterator factory in `worker.py` 
with a unified helper:
   
   ```python
   verify_return_type(result, expected_type)
   ```
   
   - When `expected_type` is a concrete type (e.g. `pa.Table`, 
`pa.RecordBatch`), it checks `isinstance(result, expected_type)` and returns 
`result` unchanged.
   - When `expected_type` is `Iterator[T]`, it checks that `result` is an 
`Iterator` and returns a lazy iterator that type-checks each element against 
`T` on consumption.
   
   The two existing callers are updated:
   
   - `SQL_MAP_ARROW_ITER_UDF`: `verify_result(pa.RecordBatch)(output_batches)` 
-> `verify_return_type(output_batches, Iterator[pa.RecordBatch])`
   - `SQL_SCALAR_ARROW_ITER_UDF`: 
`verify_result(pa.Array)(udf_func(args_iter))` -> 
`verify_return_type(udf_func(args_iter), Iterator[pa.Array])`
   
   The runtime now strictly requires an `Iterator` (dropping the prior 
`hasattr(result, "__iter__")` fallback that also admitted plain iterables such 
as `list`). This aligns with the public UDF type contract 
(`ArrowMapIterFunction = Callable[[Iterator[pa.RecordBatch]], 
Iterator[pa.RecordBatch]]`) and with the type-hint inference in 
`pyspark/sql/pandas/typehints.py` which only recognizes `Iterator`.
   
   ### Why are the changes needed?
   
   Part of SPARK-55388. Consolidates the iterator factory and the scattered 
container-type `isinstance` checks behind one well-named helper, so downstream 
eval-type refactors can uniformly replace inline `isinstance` checks with a 
single call.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No user-facing API change. One behavior tightening: a UDF returning a plain 
`Iterable` that is not an `Iterator` (e.g. a `list`) is now rejected with a 
`UDF_RETURN_TYPE` error. The public signatures (`ArrowMapIterFunction`, pandas 
iterator UDFs) have always declared `Iterator[...]`, so this aligns runtime 
enforcement with the documented contract.
   
   ### How was this patch tested?
   
   Existing tests. No behavior change for contract-compliant UDFs.
   
   - Added `python/pyspark/tests/test_verify_return_type.py` with 6 unit tests 
covering:
     - concrete-type accept / reject
     - iterator-mode accept + laziness
     - iterator-mode reject of non-iterable, non-Iterator iterable (`list`), 
and wrong element
   - Existing integration coverage remains: 
`python/pyspark/sql/tests/arrow/test_arrow_map.py::test_other_than_recordbatch_iter`
 verifies the end-to-end error messages.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [SPARK-56612][PYTHON] Unify verify_result and container-type checks into verify_return_type helper [spark]

Reply via email to