[ https://issues.apache.org/jira/browse/SPARK-37752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465883#comment-17465883 ]
L. C. Hsieh commented on SPARK-37752:
-------------------------------------

This is a known limitation of PySpark UDFs. You can check the documentation of the "udf" function; there is a note:

"The user-defined functions do not support conditional expressions or short circuiting in boolean expressions and it ends up with being executed all internally. If the functions can fail on special rows, the workaround is to incorporate the condition into the functions."

You can also check the physical query plan of the query. As the plan shows, the UDF is evaluated (BatchEvalPython) before the Project that holds the conditional expression; a sketch of the documented workaround follows the quoted description below.

{code:java}
== Physical Plan ==
*(2) Project [CASE WHEN (length(c#17) > 2) THEN pythonUDF0#21 END AS CASE WHEN (length(c) > 2) THEN udf1(c) END#19]
+- BatchEvalPython [udf1(c#17)#18], [pythonUDF0#21]
   +- *(1) Generate explode([123,234,12]), false, [c#17]
      +- *(1) Scan OneRowRelation[]
{code}

> Python UDF fails when it should not get evaluated
> -------------------------------------------------
>
>                 Key: SPARK-37752
>                 URL: https://issues.apache.org/jira/browse/SPARK-37752
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.4
>            Reporter: Ohad Raviv
>            Priority: Minor
>
> Haven't checked on newer versions yet.
> If I define in Python:
> {code:java}
> def udf1(col1):
>     print(col1[2])
>     return "blah"
>
> spark.udf.register("udf1", udf1)
> {code}
> and then use it in SQL:
> {code:java}
> select case when length(c) > 2 then udf1(c) end
> from (
>     select explode(array("123", "234", "12")) as c
> )
> {code}
> it fails with:
> {noformat}
> File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 253, in main
>     process()
> File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 248, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
> File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 155, in <lambda>
>     func = lambda _, it: map(mapper, it)
> File "<string>", line 1, in <lambda>
> File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 76, in <lambda>
>     return lambda *a: f(*a)
> File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 55, in wrapper
>     return f(*args, **kwargs)
> File "<stdin>", line 3, in udf1
> IndexError: string index out of range
> {noformat}
> However, the UDF should not be evaluated at all for the out-of-range row, since the case-when only passes values longer than 2 characters.
> The same scenario works fine when we instead define a Scala UDF.
> Will now check whether it also happens on newer versions.
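For reference, a minimal sketch of the workaround the documentation note describes: replicate the CASE WHEN condition inside the UDF, so that rows the condition would filter out are handled safely even though BatchEvalPython evaluates the UDF for every row. The name udf1_safe and the convention of returning None for guarded rows are illustrative choices, not part of the original report; the sketch assumes a standard SparkSession and the SQL from the description.

{code:python}
# Sketch of the "incorporate the condition into the function" workaround.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def udf1_safe(col1):
    # Guard inside the UDF: Spark may evaluate a Python UDF for every row,
    # regardless of the surrounding CASE WHEN. (Hypothetical name; returning
    # None for filtered-out rows is an illustrative choice.)
    if col1 is None or len(col1) <= 2:
        return None
    print(col1[2])  # now safe: only reached when the string is long enough
    return "blah"

spark.udf.register("udf1_safe", udf1_safe)

df = spark.sql("""
    select case when length(c) > 2 then udf1_safe(c) end
    from (
        select explode(array('123', '234', '12')) as c
    )
""")
df.explain()  # BatchEvalPython still appears below the Project in the plan
df.show()     # no IndexError: the short row returns NULL from inside the UDF
{code}

With the guard inside the function, the per-row behavior no longer depends on whether the engine short-circuits the conditional, which is exactly the advice in the "udf" documentation note quoted above.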