[ https://issues.apache.org/jira/browse/SPARK-22629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275553#comment-16275553 ]
Sean Owen commented on SPARK-22629:
-----------------------------------

That case is a better question, and I'm actually not sure whether that is the intended semantics or not.

> incorrect handling of calls to random in UDFs
> ---------------------------------------------
>
>                 Key: SPARK-22629
>                 URL: https://issues.apache.org/jira/browse/SPARK-22629
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: Michael H
>
> {code:none}
> # imports needed by the snippet (assumes an existing SparkSession named `spark`)
> import random
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType
>
> df_br = spark.createDataFrame([{'name': 'hello'}])
> # udf creates a random integer
> udf_random_col = udf(lambda: int(100*random.random()), IntegerType())
> # add a column to our DF using that udf
> df_br = df_br.withColumn('RAND', udf_random_col())
> df_br.show()
> +-----+----+
> | name|RAND|
> +-----+----+
> |hello|  68|
> +-----+----+
> # udf that adds 10 to an input column value
> random.seed(1234)
> udf_add_ten = udf(lambda rand: rand + 10, IntegerType())
> # unexpected result due to re-evaluation
> df_br.withColumn('RAND_PLUS_TEN', udf_add_ten('RAND')).show()
> +-----+----+-------------+
> | name|RAND|RAND_PLUS_TEN|
> +-----+----+-------------+
> |hello|  72|           87|
> +-----+----+-------------+
> # workaround: cache the results after using the random number generating udf
> df_br.withColumn('RAND',
>     udf_random_col()).cache().withColumn('RAND_PLUS_TEN',
>     udf_add_ten('RAND')).show()
> +-----+----+-------------+
> | name|RAND|RAND_PLUS_TEN|
> +-----+----+-------------+
> |hello|  68|           78|
> +-----+----+-------------+
> {code}
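
The re-evaluation described in the report can be illustrated without Spark. What follows is only a toy model, not how Spark plans or executes queries: `LazyColumn`, `show_row` and `draw` are hypothetical names made up for this sketch. It assumes that each output column re-runs the expressions it depends on unless a value has been kept, which is one way to read why RAND and RAND_PLUS_TEN disagree in the report and why cache() makes them agree.

```python
import random

_UNSET = object()  # sentinel: no value has been materialized yet


class LazyColumn:
    """Toy lazily evaluated column; a hypothetical stand-in, not a Spark class."""

    def __init__(self, fn, *parents):
        self.fn = fn
        self.parents = parents
        self._keep = False
        self._value = _UNSET

    def cache(self):
        # Ask for the first computed value to be kept and reused.
        self._keep = True
        return self

    def evaluate(self):
        if self._keep and self._value is not _UNSET:
            return self._value
        # Without a kept value, every evaluation re-runs all parent expressions.
        value = self.fn(*(p.evaluate() for p in self.parents))
        if self._keep:
            self._value = value
        return value


def show_row(rand_col):
    """Return (RAND, RAND_PLUS_TEN), evaluating each output column separately."""
    plus_ten = LazyColumn(lambda r: r + 10, rand_col)
    return rand_col.evaluate(), plus_ten.evaluate()


def draw():
    # Same expression as the random-integer UDF in the report.
    return int(100 * random.random())


if __name__ == "__main__":
    # Uncached: draw() runs twice, so the two values are usually inconsistent.
    print("uncached:", show_row(LazyColumn(draw)))
    # Cached: draw() runs once, so RAND_PLUS_TEN == RAND + 10.
    print("cached:  ", show_row(LazyColumn(draw).cache()))
```

In this model the uncached row calls `draw` once per output column, while the cached row calls it once in total. Whether Spark is meant to behave this way for UDFs is exactly the open question in the comment above.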