This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 08f1d8b6ffe [SPARK-45988][SPARK-45989][PYTHON] Fix typehints to handle `list` GenericAlias in Python 3.11+

08f1d8b6ffe is described below

commit 08f1d8b6ffed2d9a4c0633bd65ac4cef13f5c745
Author: Dongjoon Hyun <dh...@apple.com>
AuthorDate: Mon Nov 20 08:30:42 2023 +0900

[SPARK-45988][SPARK-45989][PYTHON] Fix typehints to handle `list` GenericAlias in Python 3.11+

### What changes were proposed in this pull request?

This PR aims to fix `typehints` to handle `list` GenericAlias in Python 3.11+ for Apache Spark 4.0.0 and 3.5.1.

- https://github.com/apache/spark/actions/workflows/build_python.yml

### Why are the changes needed?

PEP 646 makes `GenericAlias` instances `Iterable` as of Python 3.11.

- https://peps.python.org/pep-0646/

This behavior change introduces the following failure on Python 3.11.

- **Python 3.11.6**

```python
Python 3.11.6 (main, Nov  1 2023, 07:46:30) [Clang 14.0.0 (clang-1400.0.28.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/18 16:34:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Python version 3.11.6 (main, Nov  1 2023 07:46:30)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1700354049391).
SparkSession available as 'spark'.
>>> from pyspark import pandas as ps
>>> from typing import List
>>> ps.DataFrame[float, [int, List[int]]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/frame.py", line 13647, in __class_getitem__
    return create_tuple_for_frame_type(params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/typedef/typehints.py", line 717, in create_tuple_for_frame_type
    return Tuple[_to_type_holders(params)]
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/typedef/typehints.py", line 762, in _to_type_holders
    data_types = _new_type_holders(data_types, NameTypeHolder)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/typedef/typehints.py", line 828, in _new_type_holders
    raise TypeError(
TypeError: Type hints should be specified as one of:
  - DataFrame[type, type, ...]
  - DataFrame[name: type, name: type, ...]
  - DataFrame[dtypes instance]
  - DataFrame[zip(names, types)]
  - DataFrame[index_type, [type, ...]]
  - DataFrame[(index_name, index_type), [(name, type), ...]]
  - DataFrame[dtype instance, dtypes instance]
  - DataFrame[(index_name, index_type), zip(names, types)]
  - DataFrame[[index_type, ...], [type, ...]]
  - DataFrame[[(index_name, index_type), ...], [(name, type), ...]]
  - DataFrame[dtypes instance, dtypes instance]
  - DataFrame[zip(index_names, index_types), zip(names, types)]
However, got (<class 'int'>, typing.List[int]).
```

- **Python 3.10.13**

```python
Python 3.10.13 (main, Sep 29 2023, 16:03:45) [Clang 14.0.0 (clang-1400.0.28.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/18 16:33:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Python version 3.10.13 (main, Sep 29 2023 16:03:45)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1700354002048).
SparkSession available as 'spark'.
>>> from pyspark import pandas as ps
>>> from typing import List
>>> ps.DataFrame[float, [int, List[int]]]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType]
>>>
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs. Manually test with Python 3.11.

```
$ build/sbt -Phadoop-3 -Pkinesis-asl -Pyarn -Pkubernetes -Pdocker-integration-tests -Pconnect -Pspark-ganglia-lgpl -Pvolcano -Phadoop-cloud -Phive-thriftserver -Phive Test/package streaming-kinesis-asl-assembly/assembly connect/assembly
$ python/run-tests --modules pyspark-pandas-slow --python-executables python3.11
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43888 from dongjoon-hyun/SPARK-45988.
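The carve-out for generic aliases that the patch introduces can be sketched as a standalone predicate. This is an illustrative rewrite, not PySpark's actual helper: `is_unnamed_scalar_param` is a made-up name, and it checks the public `types.GenericAlias` (the class of `list[int]`) together with the private `typing._GenericAlias` (the class of `typing.List[int]`), mirroring the two `isinstance` checks in the fix.

```python
import types
import typing
from collections.abc import Iterable
from typing import List


def is_unnamed_scalar_param(param: object) -> bool:
    """Return True if `param` counts as a plain, unnamed type parameter
    (e.g. `int` or `List[int]`) rather than a name/type pair or a list.

    On Python 3.11+, generic-alias objects became iterable (PEP 646), so
    a bare `not isinstance(param, Iterable)` check would wrongly reject
    hints like `List[int]`; generic aliases need an explicit carve-out.
    """
    if isinstance(param, slice):
        # Slices encode named params such as DataFrame["id": int].
        return False
    if isinstance(param, (types.GenericAlias, typing._GenericAlias)):
        # list[int] / typing.List[int] are single type params even when
        # the interpreter reports them as Iterable (Python 3.11+).
        return True
    return not isinstance(param, Iterable)


print(is_unnamed_scalar_param(int))         # plain class -> True
print(is_unnamed_scalar_param(List[int]))   # generic alias -> True
print(is_unnamed_scalar_param([int, str]))  # a real list of types -> False
```

Because the generic-alias test runs before the `Iterable` test, the predicate answers the same on Python 3.10 and 3.11+, which is what the version-gated branch in the patch restores.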
Authored-by: Dongjoon Hyun <dh...@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 python/pyspark/pandas/typedef/typehints.py | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/python/pyspark/pandas/typedef/typehints.py b/python/pyspark/pandas/typedef/typehints.py
index 57bfd7fcd83..bb0f70ee924 100644
--- a/python/pyspark/pandas/typedef/typehints.py
+++ b/python/pyspark/pandas/typedef/typehints.py
@@ -796,9 +796,21 @@ def _new_type_holders(
         isinstance(param, slice) and param.step is None and param.stop is not None
         for param in params
     )
-    is_unnamed_params = all(
-        not isinstance(param, slice) and not isinstance(param, Iterable) for param in params
-    )
+    if sys.version_info < (3, 11):
+        is_unnamed_params = all(
+            not isinstance(param, slice) and not isinstance(param, Iterable) for param in params
+        )
+    else:
+        # PEP 646 changes `GenericAlias` instances into iterable ones at Python 3.11
+        is_unnamed_params = all(
+            not isinstance(param, slice)
+            and (
+                not isinstance(param, Iterable)
+                or isinstance(param, typing.GenericAlias)
+                or isinstance(param, typing._GenericAlias)
+            )
+            for param in params
+        )

     if is_named_params:
         # DataFrame["id": int, "A": int]

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org