[jira] [Comment Edited] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

Liang-Chi Hsieh (JIRA) Thu, 26 Oct 2017 19:46:01 -0700

    [ https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16221612#comment-16221612 ]


Liang-Chi Hsieh edited comment on SPARK-22347 at 10/27/17 2:44 AM:
-------------------------------------------------------------------

Under the current execution model for Python UDFs, I think it is hard to 
support Python UDFs as branch values or as the else value of a CaseWhen 
expression. Batch/vectorized execution evaluates all the Python UDFs in an 
operator at once, so it is not easy to make that evaluation conditional. I 
would rather disable the use of Python UDFs in CaseWhen. It should be easy to 
move the CaseWhen condition into the Python UDF itself, e.g. for the above 
example:

{code}
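from pyspark.sql import types
from pyspark.sql.functions import udf, when

def Divide10():
    # Guard inside the UDF, since it still runs for rows where x <= 0.
    # `//` keeps the result an int on Python 3 as well.
    def fn(value): return 10 // int(value) if (value > 0) else None
    return udf(fn, types.IntegerType())

# df and x are defined as in the issue description.
df2 = df.select(when((x > 0), Divide10()(x)))
df2.show()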
+--------------------------------+
|CASE WHEN (x > 0) THEN fn(x) END|
+--------------------------------+
|                               2|
|                            null|
+--------------------------------+
{code}
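
A slightly more defensive variant of the same idea, as a rough sketch only. 
The names `safe_divide10` and `select_safe_divide10` and the `result` alias 
are made up for illustration and are not part of Spark. The guard lives in a 
plain Python function, so it can be checked without a Spark session, and it 
also returns None for null input. Assuming the UDF is the only thing that 
needs protecting, the surrounding `when` is then no longer needed for 
correctness:

```python
def safe_divide10(value):
    """Return 10 // value for positive values, otherwise None."""
    # Handle nulls and non-positive values here, because the UDF may be
    # invoked for every row regardless of any enclosing CASE WHEN.
    if value is None or int(value) <= 0:
        return None
    return 10 // int(value)


def select_safe_divide10(df):
    """Apply safe_divide10 to column `x` of a DataFrame (hypothetical helper)."""
    # Imported lazily so safe_divide10 can be tested without Spark installed.
    from pyspark.sql import functions as F, types

    divide10 = F.udf(safe_divide10, types.IntegerType())
    return df.select(divide10(F.col("x")).alias("result"))
```

With the `df` from the issue description, `select_safe_divide10(df).show()` 
would be the equivalent of the `df2.show()` call above.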



was (Author: viirya):
Under the current execution mode of Python UDFs, I think it is hard to support 
Python UDFs as branch values or else value in CaseWhen expression. The 
execution of batch/vectorized Python UDFs evaluates the UDFs in an operator at 
once. It might not be easy to let it support conditional execution. I'd rather 
disable the usage of Python UDFs in CaseWhen. I think it can be very easy to 
incorporate the condition logic of CaseWhen into the Python UDFs, e.g. for the 
above example:

{code}
def Divide10():
    def fn(value): return 10 / int(value) if (value > 0) else None
    return udf(fn, types.IntegerType())

df2 = df.select(when((x > 0), Divide10()(x)))
df2.show()
+--------------------------------+
|CASE WHEN (x > 0) THEN fn(x) END|
+--------------------------------+
|                               2|
|                            null|
+--------------------------------+
{code}


> UDF is evaluated when 'F.when' condition is false
> -------------------------------------------------
>
>                 Key: SPARK-22347
>                 URL: https://issues.apache.org/jira/browse/SPARK-22347
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Nicolas Porter
>            Priority: Minor
>
> Here's a simple example of how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
>     def fn(value): return 10 / int(value)
>     return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division-by-zero error, even though `F.when` is meant to filter 
> out all cases where `x <= 0`. I believe the correct behavior is not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.



