[ https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16221612#comment-16221612 ]
Liang-Chi Hsieh (viirya) edited comment on SPARK-22347 at 10/27/17 2:44 AM:
-------------------------------------------------------------------

Under the current execution mode of Python UDFs, I think it is hard to support Python UDFs as branch values or as the else value of a CaseWhen expression.

Batch/vectorized execution evaluates all Python UDFs in an operator at once, over every input row, so it is not easy to make it support conditional execution. I'd rather disable the usage of Python UDFs in CaseWhen.

I think it is very easy to incorporate the condition logic of CaseWhen into the Python UDF itself, e.g. for the above example:

{code}
from pyspark.sql import types
from pyspark.sql.functions import udf, when

# df and x are defined as in the reproduction quoted below
def Divide10():
    def fn(value):
        # Apply the CaseWhen condition inside the UDF itself
        return 10 / int(value) if (value > 0) else None
    return udf(fn, types.IntegerType())

df2 = df.select(when((x > 0), Divide10()(x)))
df2.show()

+--------------------------------+
|CASE WHEN (x > 0) THEN fn(x) END|
+--------------------------------+
|                               2|
|                            null|
+--------------------------------+
{code}

> UDF is evaluated when 'F.when' condition is false
> -------------------------------------------------
>
>                 Key: SPARK-22347
>                 URL: https://issues.apache.org/jira/browse/SPARK-22347
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Nicolas Porter
>            Priority: Minor
>
> Here's a simple example of how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
>
> def Divide10():
>     def fn(value): return 10 / int(value)
>     return F.udf(fn, types.IntegerType())
>
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even though `F.when` is trying to filter
> out all cases where `x <= 0`. I believe the correct behavior should be not to
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`,
> then the error is not raised and all rows resolve to `null`, which is the
> expected result.
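
For readers who want to see the effect without a Spark cluster, here is a minimal plain-Python sketch of the behavior described in the comment. It is not Spark's implementation: `batch_case_when`, `divide10`, and `divide10_guarded` are hypothetical names, and the sketch simply assumes that the UDF is run over every row before the `when` condition selects a result. It uses `//` so the result stays an integer on both Python 2 and Python 3.

```python
# Plain-Python sketch; no Spark required. All names here are hypothetical.

def divide10(value):
    # Unguarded function, as in the original report.
    return 10 // int(value)

def divide10_guarded(value):
    # Guard moved inside the function, as suggested in the comment.
    return 10 // int(value) if value > 0 else None

def batch_case_when(values, predicate, fn):
    # Assumed model of batch evaluation: fn runs on every row first,
    # and the condition only selects among the finished results.
    results = [fn(v) for v in values]
    return [r if predicate(v) else None for v, r in zip(values, results)]

def is_positive(value):
    return value > 0

rows = [5, 0]

# The guarded function survives being called on the row with x = 0.
guarded = batch_case_when(rows, is_positive, divide10_guarded)
print(guarded)

# The unguarded function fails on that row even though the condition
# would have discarded its result.
try:
    batch_case_when(rows, is_positive, divide10)
    unguarded_error = None
except ZeroDivisionError as exc:
    unguarded_error = exc
print(type(unguarded_error).__name__)
```

Under this model the outer condition cannot protect the function from bad inputs, which is why the guard has to live inside the UDF.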