[ https://issues.apache.org/jira/browse/SPARK-20685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiao Li updated SPARK-20685: ---------------------------- Fix Version/s: (was: 2.3.0) > BatchPythonEvaluation UDF evaluator fails for case of single UDF with > repeated argument > --------------------------------------------------------------------------------------- > > Key: SPARK-20685 > URL: https://issues.apache.org/jira/browse/SPARK-20685 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.0.0 > Reporter: Josh Rosen > Assignee: Josh Rosen > Fix For: 2.1.2, 2.2.0 > > > There's a latent corner-case bug in PYSpark UDF evaluation where executing a > stage with a single UDF that takes more than one argument _where that > argument is repeated_ will crash at execution with a confusing error. > Here's a repro: > {code} > from pyspark.sql.types import * > spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType()) > spark.sql("SELECT add(1, 1)").first() > {code} > This fails with > {code} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 180, in main > process() > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 175, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 107, in <lambda> > func = lambda _, it: map(mapper, it) > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 93, in <lambda> > mapper = lambda a: udf(*a) > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 71, in <lambda> > return lambda *a: f(*a) > TypeError: <lambda>() takes exactly 2 arguments (1 given) > {code} > The problem was introduced by SPARK-14267: there code there has a fast path > for handling a "batch UDF evaluation consisting of a single Python UDF, but > that branch incorrectly assumes that a single UDF won't have repeated > arguments and therefore skips the code for unpacking arguments from the input > row (whose schema may not necessarily match the UDF inputs). > I have a simple fix for this which I'll submit now. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org