[ https://issues.apache.org/jira/browse/SPARK-22629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303480#comment-16303480 ]

Xiao Li commented on SPARK-22629:
---------------------------------

The reason is that we assume all UDFs are deterministic. The problem this JIRA 
hit is caused by misuse: because the random-generating UDF is treated as 
deterministic, the optimizer is free to collapse the projections and re-evaluate 
it, which is why RAND changes between the two show() calls.

We are thinking about whether we should change the default to non-deterministic.
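
In the meantime, the UDF can be marked explicitly. A minimal sketch of that 
opt-out, assuming the asNondeterministic() method added to PySpark UDFs in 2.3 
and an existing SparkSession bound to spark; the column is still recomputed on 
each action, so caching remains necessary if the value must stay fixed across 
actions:

{code:python}
import random

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Declaring the UDF non-deterministic is meant to stop the optimizer from
# inlining and re-evaluating it, so RAND and RAND_PLUS_TEN agree within one action.
udf_random_col = udf(lambda: int(100 * random.random()),
                     IntegerType()).asNondeterministic()
udf_add_ten = udf(lambda rand: rand + 10, IntegerType())

df_br = spark.createDataFrame([{'name': 'hello'}]).withColumn('RAND', udf_random_col())
df_br.withColumn('RAND_PLUS_TEN', udf_add_ten('RAND')).show()
{code}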

> incorrect handling of calls to random in UDFs
> ---------------------------------------------
>
>                 Key: SPARK-22629
>                 URL: https://issues.apache.org/jira/browse/SPARK-22629
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: Michael H
>
> {code:python}
> import random
>
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType
>
> df_br = spark.createDataFrame([{'name': 'hello'}])
> # udf that returns a random integer in [0, 100)
> udf_random_col = udf(lambda: int(100 * random.random()), IntegerType())
> # add a column to our DF using that udf
> df_br = df_br.withColumn('RAND', udf_random_col())
> df_br.show()
> +-----+----+
> | name|RAND|
> +-----+----+
> |hello|  68|
> +-----+----+
> random.seed(1234)
> # udf that adds 10 to an input column value
> udf_add_ten = udf(lambda rand: rand + 10, IntegerType())
> # unexpected result due to re-evaluation
> df_br.withColumn('RAND_PLUS_TEN', udf_add_ten('RAND')).show()
> +-----+----+-------------+
> | name|RAND|RAND_PLUS_TEN|
> +-----+----+-------------+
> |hello|  72|           87|
> +-----+----+-------------+
> # workaround: cache the result after applying the random-number-generating udf
> df_br.withColumn('RAND', udf_random_col()) \
>     .cache() \
>     .withColumn('RAND_PLUS_TEN', udf_add_ten('RAND')) \
>     .show()
> +-----+----+-------------+
> | name|RAND|RAND_PLUS_TEN|
> +-----+----+-------------+
> |hello|  68|           78|
> +-----+----+-------------+
> {code}


