[jira] [Commented] (SPARK-24735) Improve exception when mixing up pandas_udf types

2018-08-13 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578719#comment-16578719
 ] 

holdenk commented on SPARK-24735:
---------------------------------

So [~bryanc], what do you think of adding an AggregatePythonUDF and using it for 
grouped_map / grouped_agg, so that the Scala SQL engine treats us the correct 
way?

> Improve exception when mixing up pandas_udf types
> -------------------------------------------------
>
> Key: SPARK-24735
> URL: https://issues.apache.org/jira/browse/SPARK-24735
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> From the discussion here 
> https://github.com/apache/spark/pull/21650#discussion_r199203674, mixing up 
> Pandas UDF types, e.g. using a GROUPED_MAP UDF as a SCALAR one ({{foo = 
> pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)}}) produces an 
> exception that is hard to understand. It should tell the user that the UDF 
> type is wrong. This is the full output:
> {code}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> >>> df.select(foo(df['v'])).show()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/dataframe.py", 
> line 353, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o257.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> (input[0, bigint, false])
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261)
>   at 
> org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
> ...
> {code}
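[Editor's note: a pandas-only sketch, not the PySpark API, of why the two UDF types are not interchangeable. A GROUPED_MAP function maps a whole group DataFrame to a DataFrame, so it only makes sense applied per group under a groupby, never column-wise in a select. The sample frame and the `subtract_mean` helper are hypothetical.]

```python
import pandas as pd

# SCALAR contract: pd.Series -> pd.Series, applied column-wise.
# GROUPED_MAP contract: pd.DataFrame -> pd.DataFrame, applied per group.

df = pd.DataFrame({"id": [1, 1, 2], "v": [1.0, 2.0, 3.0]})

def subtract_mean(pdf):
    # Receives one group's rows as a DataFrame; returns a DataFrame.
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

# Correct use: per group, the split-apply-combine that GROUPED_MAP implies.
result = df.groupby("id", group_keys=False).apply(subtract_mean)
print(result["v"].tolist())  # -> [-0.5, 0.5, 0.0]
```

Passing a frame-to-frame function where a series-to-series one is expected is exactly the type mismatch the quoted traceback surfaces, only much later and much more cryptically.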



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24735) Improve exception when mixing up pandas_udf types

2018-08-13 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578710#comment-16578710
 ] 

holdenk commented on SPARK-24735:
---------------------------------

I think we could do better than just improving the exception. If we look at the 
other aggregates in PySpark, calling them inside a select does the grouping 
for us:

{code:java}
>>> df.select(sumDistinct(df._1)).show()
+----------------+
|sum(DISTINCT _1)|
+----------------+
|            4950|
+----------------+{code}
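[Editor's note: a plain-Python sketch, not Spark, of the sum(DISTINCT ...) semantics behind the output above: deduplicate, then sum. The sample data is hypothetical, chosen so the distinct values are 0..99 and the result matches the 4950 shown.]

```python
# Duplicated values: 0..99 appearing twice each.
values = [x % 100 for x in range(200)]

# DISTINCT aggregation: dedupe first, then sum.
distinct_sum = sum(set(values))
print(distinct_sum)  # -> 4950
```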
