[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF

2019-03-24 Thread GitBox
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] 
Add support for StructType return in Scalar Pandas UDF
URL: https://github.com/apache/spark/pull/23900#discussion_r268465039
 
 

 ##
 File path: python/pyspark/serializers.py
 ##
 @@ -295,7 +297,34 @@ def create_array(s, t):
 raise RuntimeError(error_msg % (s.dtype, t), e)
 return array
 
-arrs = [create_array(s, t) for s, t in series]
+arrs = []
+for s, t in series:
+if t is not None and pa.types.is_struct(t):
+if not isinstance(s, pd.DataFrame):
 
 Review comment:
   Looks good @BryanCutler, and cc @ueshin FYI. Just out of curiosity, WDYT 
about putting those PySpark specific conversion logics into somewhere together, 
of course, in a separate PR and JIRA? Looks it's getting difficult to read (to 
me .. )


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF

2019-03-22 Thread GitBox
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] 
Add support for StructType return in Scalar Pandas UDF
URL: https://github.com/apache/spark/pull/23900#discussion_r268047078
 
 

 ##
 File path: python/pyspark/sql/functions.py
 ##
 @@ -2868,6 +2869,15 @@ def pandas_udf(f=None, returnType=None, 
functionType=None):
+--+--++
| 8|  JOHN DOE|  22|
+--+--++
+   >>> @pandas_udf("first string, last string")  # doctest: +SKIP
 
 Review comment:
   Yes, something has to be done. I was at the very least tried to document the 
casting 
   combinations.
   
   **Pandas UDF matrix:**
   
   
https://github.com/apache/spark/blob/b67d36957287c1fbefa1996e6a4a009a75c4c3f8/python/pyspark/sql/functions.py#L3098-L3131
   
   The problem is, this matrix is different from regular PySpark UDF, and also 
our `TypeCoercions`:
   
   **Regular PySpark UDF matrix:**
   
   
https://github.com/apache/spark/blob/b67d36957287c1fbefa1996e6a4a009a75c4c3f8/python/pyspark/sql/functions.py#L2830-L2860
   
   I lost the last discussion about whether we should allow such type coercions 
or not. But basically my guts say: If we allow, I think it will need a huge 
bunch of codes to maintain again (Arrow Type <> Pandas type <> Python type <> 
SparkSQL type), but if we disallow, it will break many existing apps.
   
   One way is that we explicitly document that Pandas's type coercion is 
dependent on Arrow (apart from regular PySpark UDF), and throw an explicit 
exception.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF

2019-03-22 Thread GitBox
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] 
Add support for StructType return in Scalar Pandas UDF
URL: https://github.com/apache/spark/pull/23900#discussion_r268047078
 
 

 ##
 File path: python/pyspark/sql/functions.py
 ##
 @@ -2868,6 +2869,15 @@ def pandas_udf(f=None, returnType=None, 
functionType=None):
+--+--++
| 8|  JOHN DOE|  22|
+--+--++
+   >>> @pandas_udf("first string, last string")  # doctest: +SKIP
 
 Review comment:
   Yes, something has to be done. I was at the very least tried to document the 
casting 
   combinations.
   
   **Pandas UDF matrix:**
   
   
https://github.com/apache/spark/blob/b67d36957287c1fbefa1996e6a4a009a75c4c3f8/python/pyspark/sql/functions.py#L3098-L3131
   
   The problem is, this matrix is different from regular PySpark UDF, and also 
our `TypeCoercions`:
   
   **Regular PySpark UDF matrix:**
   
   
https://github.com/apache/spark/blob/b67d36957287c1fbefa1996e6a4a009a75c4c3f8/python/pyspark/sql/functions.py#L2830-L2860
   
   I lost the last discussion about whether we should allow such type coercions 
or not. But basically my guts say: If we allow, I think it will need a huge 
bunch of codes to maintain again (Arrow Type <> Pandas type <> Python type <> 
SparkSQL type), but if we disallow, it will break many existing apps.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF

2019-03-22 Thread GitBox
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] 
Add support for StructType return in Scalar Pandas UDF
URL: https://github.com/apache/spark/pull/23900#discussion_r268047291
 
 

 ##
 File path: python/pyspark/sql/functions.py
 ##
 @@ -2868,6 +2869,15 @@ def pandas_udf(f=None, returnType=None, 
functionType=None):
+--+--++
| 8|  JOHN DOE|  22|
+--+--++
+   >>> @pandas_udf("first string, last string")  # doctest: +SKIP
 
 Review comment:
   And .. yes I think we should also throw better exceptions at the very least.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF

2019-03-22 Thread GitBox
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] 
Add support for StructType return in Scalar Pandas UDF
URL: https://github.com/apache/spark/pull/23900#discussion_r268047078
 
 

 ##
 File path: python/pyspark/sql/functions.py
 ##
 @@ -2868,6 +2869,15 @@ def pandas_udf(f=None, returnType=None, 
functionType=None):
+--+--++
| 8|  JOHN DOE|  22|
+--+--++
+   >>> @pandas_udf("first string, last string")  # doctest: +SKIP
 
 Review comment:
   Yes, something has to be done. I was at the very least tried to document the 
casting 
   combinations.
   
   **Pandas UDF matrix:**
   
   
https://github.com/apache/spark/blob/b67d36957287c1fbefa1996e6a4a009a75c4c3f8/python/pyspark/sql/functions.py#L3098-L3131
   
   The problem is, this matrix is different from regular PySpark UDF, and also 
our `TypeCoercions`:
   
   **Regular PySpark UDF matrix:**
   
   
https://github.com/apache/spark/blob/b67d36957287c1fbefa1996e6a4a009a75c4c3f8/python/pyspark/sql/functions.py#L2830-L2860
   
   I lost the last discussion about whether we should allow such type coercions 
or not. I roughly recall: If we allow, I think it will need a huge bunch of 
codes to maintain again (Arrow Type <> Pandas type <> Python type <> SparkSQL 
type), but if we disallow, it will break many existing apps.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF

2019-03-04 Thread GitBox
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] 
Add support for StructType return in Scalar Pandas UDF
URL: https://github.com/apache/spark/pull/23900#discussion_r262365143
 
 

 ##
 File path: python/pyspark/sql/udf.py
 ##
 @@ -133,6 +133,8 @@ def returnType(self):
 "UDFs: returnType must be a StructType.")
 elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF:
 try:
+if isinstance(self._returnType_placeholder, StructType):
+raise TypeError
 
 Review comment:
   Yea, either way sounds good to me. I'll leave it to you.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF

2019-03-04 Thread GitBox
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] 
Add support for StructType return in Scalar Pandas UDF
URL: https://github.com/apache/spark/pull/23900#discussion_r262334277
 
 

 ##
 File path: python/pyspark/sql/udf.py
 ##
 @@ -133,6 +133,8 @@ def returnType(self):
 "UDFs: returnType must be a StructType.")
 elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF:
 try:
+if isinstance(self._returnType_placeholder, StructType):
+raise TypeError
 
 Review comment:
   I see. Can you add some message while we're here? If this is going to be 
fixed soon, I am okay as is as well.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF

2019-03-04 Thread GitBox
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] 
Add support for StructType return in Scalar Pandas UDF
URL: https://github.com/apache/spark/pull/23900#discussion_r262333893
 
 

 ##
 File path: python/pyspark/serializers.py
 ##
 @@ -64,6 +64,7 @@
 from itertools import izip as zip, imap as map
 else:
 import pickle
+basestring = unicode = str
 
 Review comment:
   I don't think we'll drop it in Spark 3.0. I will cc you in the related PRs 
later in the future.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF

2019-03-04 Thread GitBox
HyukjinKwon commented on a change in pull request #23900: [SPARK-23836][PYTHON] 
Add support for StructType return in Scalar Pandas UDF
URL: https://github.com/apache/spark/pull/23900#discussion_r262327034
 
 

 ##
 File path: python/pyspark/sql/udf.py
 ##
 @@ -133,6 +133,8 @@ def returnType(self):
 "UDFs: returnType must be a StructType.")
 elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF:
 try:
+if isinstance(self._returnType_placeholder, StructType):
+raise TypeError
 
 Review comment:
   Hm, @BryanCutler, sorry if I missed something but why do we throw a type 
error here?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org