[ https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655532#comment-17655532 ]
L. C. Hsieh commented on SPARK-41049:
-------------------------------------

For a correctness bug, I think we should backport it, though the patch is a kind of refactoring work.

> Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-41049
>                 URL: https://issues.apache.org/jira/browse/SPARK-41049
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2
>            Reporter: Guy Boo
>            Assignee: Wenchen Fan
>            Priority: Major
>              Labels: correctness
>             Fix For: 3.4.0
>
>
> h2. Expectation
> For a given row, Nondeterministic expressions are expected to have stable values.
> {code:scala}
> import org.apache.spark.sql.functions._
> val df = sparkContext.parallelize(1 to 5).toDF("x")
> val v1 = rand().*(lit(10000)).cast(IntegerType)
> df.select(v1, v1).collect
> {code}
> Returns a set like this:
> |8777|8777|
> |1357|1357|
> |3435|3435|
> |9204|9204|
> |3870|3870|
> where both columns always have the same value, but what that value is changes from row to row. This is different from the following:
> {code:scala}
> df.select(rand(), rand()).collect
> {code}
> In this case, because the rand() calls are distinct, the values in both columns should be different.
>
> h2. Problem
> This expectation does not appear to be stable in the event that any subsequent expression is a CodegenFallback.
> This program:
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> val sparkSession = SparkSession.builder().getOrCreate()
> val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
> val v1 = rand().*(lit(10000)).cast(IntegerType)
> val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
> df.select(v1, v1, v2, v2).collect
> {code}
> produces output like this:
> |8159|8159|8159|{color:#ff0000}2028{color}|
> |8320|8320|8320|{color:#ff0000}1640{color}|
> |7937|7937|7937|{color:#ff0000}769{color}|
> |436|436|436|{color:#ff0000}8924{color}|
> |8924|8924|2827|{color:#ff0000}2731{color}|
> Not sure why the first call via the CodegenFallback path should be correct while subsequent calls aren't.
>
> h2. Workaround
> If the Nondeterministic expression is moved to a separate, earlier select() call, so the CodegenFallback instead only refers to a column reference, then the problem seems to go away. But this workaround may not be reliable if optimization is ever able to restructure adjacent select()s.
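
The workaround described in the quoted issue could look roughly like the sketch below. It is an illustration only, not taken from the issue or from the fix: the object name StableRandWorkaround, the method name workaroundRows, and the local[*] master setting are assumptions made so the example is self-contained. The rand() expression is evaluated in its own select() and given the alias "a", and the later select() (including to_csv) only refers to that column.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// Hypothetical wrapper object; the name is not from the issue.
object StableRandWorkaround {

  // Applies the two-step select() workaround and returns the collected rows.
  def workaroundRows(spark: SparkSession): Array[Row] = {
    import spark.implicits._ // needed for toDF on an RDD

    val df = spark.sparkContext.parallelize(1 to 5).toDF("x")

    // Step 1: evaluate the nondeterministic expression in a separate,
    // earlier select() so it becomes an ordinary column named "a".
    val withRand = df.select(rand().*(lit(10000)).cast(IntegerType).as("a"))

    // Step 2: the CodegenFallback expression (to_csv) now only refers
    // to the column reference, not to rand() itself.
    val a = col("a")
    val csv = to_csv(struct(a))
    withRand.select(a, a, csv, csv).collect()
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running outside a cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-41049-workaround")
      .getOrCreate()
    try workaroundRows(spark).foreach(println)
    finally spark.stop()
  }
}
```

As the reporter notes, this relies on the two select() calls staying separate in the optimized plan, so it is best treated as a stopgap on affected versions rather than a guarantee.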