[jira] [Updated] (SPARK-20685) BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument

Xiao Li (JIRA) Thu, 11 May 2017 12:58:25 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-20685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Xiao Li updated SPARK-20685:
----------------------------
    Fix Version/s:     (was: 2.3.0)

> BatchPythonEvaluation UDF evaluator fails for case of single UDF with 
> repeated argument
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-20685
>                 URL: https://issues.apache.org/jira/browse/SPARK-20685
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>             Fix For: 2.1.2, 2.2.0
>
>
> There's a latent corner-case bug in PYSpark UDF evaluation where executing a 
> stage with a single UDF that takes more than one argument _where that 
> argument is repeated_ will crash at execution with a confusing error.
> Here's a repro:
> {code}
> from pyspark.sql.types import *
> spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType())
> spark.sql("SELECT add(1, 1)").first()
> {code}
> This fails with
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 180, in main
>     process()
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 175, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 107, in <lambda>
>     func = lambda _, it: map(mapper, it)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 93, in <lambda>
>     mapper = lambda a: udf(*a)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 71, in <lambda>
>     return lambda *a: f(*a)
> TypeError: <lambda>() takes exactly 2 arguments (1 given)
> {code}
> The problem was introduced by SPARK-14267: there code there has a fast path 
> for handling a "batch UDF evaluation consisting of a single Python UDF, but 
> that branch incorrectly assumes that a single UDF won't have repeated 
> arguments and therefore skips the code for unpacking arguments from the input 
> row (whose schema may not necessarily match the UDF inputs).
> I have a simple fix for this which I'll submit now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-20685) BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument

Reply via email to