[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF URL: https://github.com/apache/spark/pull/23900#discussion_r268465039 ## File path: python/pyspark/serializers.py ## @@ -295,7 +297,34 @@ def create_array(s, t): raise RuntimeError(error_msg % (s.dtype, t), e) return array -arrs = [create_array(s, t) for s, t in series] +arrs = [] +for s, t in series: +if t is not None and pa.types.is_struct(t): +if not isinstance(s, pd.DataFrame): Review comment: Looks good @BryanCutler, and cc @ueshin FYI. Just out of curiosity, WDYT about putting those PySpark specific conversion logics into somewhere together, of course, in a separate PR and JIRA? Looks it's getting difficult to read (to me .. ) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF URL: https://github.com/apache/spark/pull/23900#discussion_r268047078 ## File path: python/pyspark/sql/functions.py ## @@ -2868,6 +2869,15 @@ def pandas_udf(f=None, returnType=None, functionType=None): +--+--++ | 8| JOHN DOE| 22| +--+--++ + >>> @pandas_udf("first string, last string") # doctest: +SKIP Review comment: Yes, something has to be done. I was at the very least tried to document the casting combinations. **Pandas UDF matrix:** https://github.com/apache/spark/blob/b67d36957287c1fbefa1996e6a4a009a75c4c3f8/python/pyspark/sql/functions.py#L3098-L3131 The problem is, this matrix is different from regular PySpark UDF, and also our `TypeCoercions`: **Regular PySpark UDF matrix:** https://github.com/apache/spark/blob/b67d36957287c1fbefa1996e6a4a009a75c4c3f8/python/pyspark/sql/functions.py#L2830-L2860 I lost the last discussion about whether we should allow such type coercions or not. But basically my guts say: If we allow, I think it will need a huge bunch of codes to maintain again (Arrow Type <> Pandas type <> Python type <> SparkSQL type), but if we disallow, it will break many existing apps. One way is that we explicitly document that Pandas's type coercion is dependent on Arrow (apart from regular PySpark UDF), and throw an explicit exception. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF URL: https://github.com/apache/spark/pull/23900#discussion_r268047078 ## File path: python/pyspark/sql/functions.py ## @@ -2868,6 +2869,15 @@ def pandas_udf(f=None, returnType=None, functionType=None): +--+--++ | 8| JOHN DOE| 22| +--+--++ + >>> @pandas_udf("first string, last string") # doctest: +SKIP Review comment: Yes, something has to be done. I was at the very least tried to document the casting combinations. **Pandas UDF matrix:** https://github.com/apache/spark/blob/b67d36957287c1fbefa1996e6a4a009a75c4c3f8/python/pyspark/sql/functions.py#L3098-L3131 The problem is, this matrix is different from regular PySpark UDF, and also our `TypeCoercions`: **Regular PySpark UDF matrix:** https://github.com/apache/spark/blob/b67d36957287c1fbefa1996e6a4a009a75c4c3f8/python/pyspark/sql/functions.py#L2830-L2860 I lost the last discussion about whether we should allow such type coercions or not. But basically my guts say: If we allow, I think it will need a huge bunch of codes to maintain again (Arrow Type <> Pandas type <> Python type <> SparkSQL type), but if we disallow, it will break many existing apps. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF URL: https://github.com/apache/spark/pull/23900#discussion_r268047291 ## File path: python/pyspark/sql/functions.py ## @@ -2868,6 +2869,15 @@ def pandas_udf(f=None, returnType=None, functionType=None): +--+--++ | 8| JOHN DOE| 22| +--+--++ + >>> @pandas_udf("first string, last string") # doctest: +SKIP Review comment: And .. yes I think we should also throw better exceptions at the very least. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF URL: https://github.com/apache/spark/pull/23900#discussion_r268047078 ## File path: python/pyspark/sql/functions.py ## @@ -2868,6 +2869,15 @@ def pandas_udf(f=None, returnType=None, functionType=None): +--+--++ | 8| JOHN DOE| 22| +--+--++ + >>> @pandas_udf("first string, last string") # doctest: +SKIP Review comment: Yes, something has to be done. I was at the very least tried to document the casting combinations. **Pandas UDF matrix:** https://github.com/apache/spark/blob/b67d36957287c1fbefa1996e6a4a009a75c4c3f8/python/pyspark/sql/functions.py#L3098-L3131 The problem is, this matrix is different from regular PySpark UDF, and also our `TypeCoercions`: **Regular PySpark UDF matrix:** https://github.com/apache/spark/blob/b67d36957287c1fbefa1996e6a4a009a75c4c3f8/python/pyspark/sql/functions.py#L2830-L2860 I lost the last discussion about whether we should allow such type coercions or not. I roughly recall: If we allow, I think it will need a huge bunch of codes to maintain again (Arrow Type <> Pandas type <> Python type <> SparkSQL type), but if we disallow, it will break many existing apps. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF URL: https://github.com/apache/spark/pull/23900#discussion_r262365143 ## File path: python/pyspark/sql/udf.py ## @@ -133,6 +133,8 @@ def returnType(self): "UDFs: returnType must be a StructType.") elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF: try: +if isinstance(self._returnType_placeholder, StructType): +raise TypeError Review comment: Yea, either way sounds good to me. I'll leave it to you. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF URL: https://github.com/apache/spark/pull/23900#discussion_r262334277 ## File path: python/pyspark/sql/udf.py ## @@ -133,6 +133,8 @@ def returnType(self): "UDFs: returnType must be a StructType.") elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF: try: +if isinstance(self._returnType_placeholder, StructType): +raise TypeError Review comment: I see. Can you add some message while we're here? If this is going to be fixed soon, I am okay as is as well. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF URL: https://github.com/apache/spark/pull/23900#discussion_r262333893 ## File path: python/pyspark/serializers.py ## @@ -64,6 +64,7 @@ from itertools import izip as zip, imap as map else: import pickle +basestring = unicode = str Review comment: I don't think we'll drop it in Spark 3.0. I will cc you in the related PRs later in the future. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF URL: https://github.com/apache/spark/pull/23900#discussion_r262327034 ## File path: python/pyspark/sql/udf.py ## @@ -133,6 +133,8 @@ def returnType(self): "UDFs: returnType must be a StructType.") elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF: try: +if isinstance(self._returnType_placeholder, StructType): +raise TypeError Review comment: Hm, @BryanCutler, sorry if I missed something but why do we throw a type error here? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org