zhangyt26 opened a new pull request, #47571: URL: https://github.com/apache/spark/pull/47571
This will create a stable hash for the plan when there is more than one projection to break down.

### What changes were proposed in this pull request?
Use a `LinkedHashSet` so the projection order is always stable (see the minimal sketch at the end of this description).

### Why are the changes needed?
The same Spark plan should compare equal even after the analyzer phase.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```
import spark.implicits._
import org.apache.spark.sql.{Row, functions => SparkFuncs}
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._
import scala.collection.immutable.ListMap

val df1 = Seq((1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12))
  .toDF("a", "b", "c", "d")
  .select(SparkFuncs.struct("a", "b", "c", "d").alias("df1_struct"))

val df2 = df1
  .groupBy()
  .agg(SparkFuncs.first("df1_struct").as("df1_struct_first"))
  .withColumn("e", SparkFuncs.lit("100"))

val df3 = spark
  .createDataFrame(List(Row(0)).asJava, StructType(Seq(StructField("EMPTY", IntegerType))))
  .withColumn("num1", SparkFuncs.lit(1000))
  .withColumn("num2", SparkFuncs.lit(2000))
  .withColumn("num3", SparkFuncs.lit(3000))
  .withColumn("num4", SparkFuncs.lit(4000))
  .select(SparkFuncs.struct("num1", "num2", "num3", "num4").alias("df3_struct"))
  .cache()

val df4 = df2
  .join(
    df2
      .join(df3, Seq(), "inner")
      .groupBy()
      .agg(SparkFuncs.first("df3_struct").as("df3_struct_first")),
    Seq(),
    "left")
  .select(
    SparkFuncs.col("*"),
    SparkFuncs
      .posexplode(SparkFuncs.array(SparkFuncs.lit("x"), SparkFuncs.lit("y"), SparkFuncs.lit("z")))
      .as(Seq("pos", "val"))
  )

// Run this multiple times: the resulting hash is unstable on master, but stable after the patch.
df4
  .withColumns(
    ListMap(
      "col1" -> SparkFuncs.lit(100),
      "col2" -> SparkFuncs
        .when(
          SparkFuncs.col("val").eqNullSafe(SparkFuncs.lit("x")),
          SparkFuncs.col("df1_struct_first").getItem("a"))
        .otherwise(
          SparkFuncs
            .when(
              SparkFuncs.col("val").eqNullSafe(SparkFuncs.lit("y")),
              SparkFuncs.col("df1_struct_first").getItem("b"))
            .otherwise(SparkFuncs
              .when(
                SparkFuncs.col("val").eqNullSafe(SparkFuncs.lit("z")),
                SparkFuncs.col("df1_struct_first").getItem("c")))),
      "col3" -> SparkFuncs.col("col2").plus(SparkFuncs.lit(1000)),
      "col4" -> SparkFuncs.col("df3_struct_first").getItem("num1").multiply(SparkFuncs.lit(10)),
      "col5" -> SparkFuncs.col("df3_struct_first").getItem("num2").multiply(SparkFuncs.lit(10)),
      "col6" -> SparkFuncs.col("df3_struct_first").getItem("num3").multiply(SparkFuncs.lit(10)),
      "col7" -> SparkFuncs.col("df3_struct_first").getItem("num4").multiply(SparkFuncs.lit(10)),
      "col8" -> SparkFuncs.col("col4").minus(SparkFuncs.col("col5")),
      "col9" -> SparkFuncs.col("col8").plus(SparkFuncs.lit(1000))
    ))
  .semanticHash()
```

### Was this patch authored or co-authored using generative AI tooling?
No
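For context, here is a minimal plain-Scala sketch (not the patched Spark code itself; the element values are made up for illustration) of why switching the set type stabilizes ordering: a `mutable.HashSet` iterates in hash-bucket order, which need not match insertion order and can vary whenever element hash codes vary (e.g., with auto-generated expression IDs), while a `mutable.LinkedHashSet` always iterates in insertion order.

```
import scala.collection.mutable

// Hypothetical stand-ins for the collected projections; only the ordering matters here.
val items = Seq(10, 1, 7, 3)

val hashSet = mutable.HashSet(items: _*)         // iterates in hash-bucket order
val linkedSet = mutable.LinkedHashSet(items: _*) // iterates in insertion order

println(hashSet.toSeq)   // bucket order, e.g. List(1, 3, 7, 10) -- not insertion order
println(linkedSet.toSeq) // always List(10, 1, 7, 3)
```

Anything derived by folding over the set, such as a semantic hash of the plan, then depends only on the order in which the projections were added, not on their hash codes.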