Franklyn Dsouza created SPARK-19844:
---------------------------------------

             Summary: UDF in when control function is executed before the when 
clause is evaluated.
                 Key: SPARK-19844
                 URL: https://issues.apache.org/jira/browse/SPARK-19844
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 2.1.0, 2.0.1
            Reporter: Franklyn Dsouza


Sometimes we try to guard the argument passed to a UDF using {code}when(clause, 
udf).otherwise(default){code}

but we've noticed that the UDF is sometimes run on rows that should not have 
matched the clause.

Here's some code to reproduce the issue.

{code}
from pyspark.sql import functions as F
from pyspark.sql import types

# `spark` is assumed to be an active SparkSession.
df = spark.createDataFrame([{'r': None}], 
    schema=types.StructType([types.StructField('r', types.StringType())]))

simple_udf = F.udf(lambda ref: ref.strip("/"), types.StringType())

df.withColumn('test', 
               F.when(F.col("r").isNotNull(), simple_udf(F.col("r")))
                .otherwise(F.lit(None))
             ).collect()
{code}

This causes an exception because the UDF is running on null data. I get 
AttributeError: 'NoneType' object has no attribute 'strip'.

So it looks like the UDF is being evaluated before the clause in the when is 
evaluated. Oddly enough, when I change {code}F.col("r").isNotNull(){code} to 
{code}df["r"] != None{code}, it works.
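As a possible stopgap while the evaluation order is unclear, the function behind the UDF can guard against nulls itself, so it no longer matters whether the when clause is applied first. This is a minimal sketch, not a fix for the underlying behaviour; the name strip_slashes is hypothetical and does not appear in the code above.

```python
def strip_slashes(ref):
    """Strip leading and trailing slashes; pass None through unchanged."""
    # Guard against null input, since the UDF may be run on rows that the
    # when() clause was expected to exclude.
    if ref is None:
        return None
    return ref.strip("/")


# Plain-Python check of the guard; no Spark session is needed for this part.
print(strip_slashes(None))     # None
print(strip_slashes("/a/b/"))  # a/b
```

The guarded function could then be wrapped the same way as in the reproduction, e.g. F.udf(strip_slashes, types.StringType()), in place of the lambda.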

This might be related to https://issues.apache.org/jira/browse/SPARK-13773
and https://issues.apache.org/jira/browse/SPARK-15282


