zhangyt26 opened a new pull request, #47571: URL: https://github.com/apache/spark/pull/47571
This will create a stable hash for the plan when there is more than one projection to break down.

### What changes were proposed in this pull request?
Use a `LinkedHashSet` so the projection order is always stable (see the minimal sketch at the end of this description).

### Why are the changes needed?
The same Spark plan should compare equal even after the analyzer phase.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```
import spark.implicits._
import org.apache.spark.sql.{Row, functions => SparkFuncs}
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._
import scala.collection.immutable.ListMap

val df1 = Seq((1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12))
  .toDF("a", "b", "c", "d")
  .select(SparkFuncs.struct("a", "b", "c", "d").alias("df1_struct"))

val df2 = df1
  .groupBy()
  .agg(SparkFuncs.first("df1_struct").as("df1_struct_first"))
  .withColumn("e", SparkFuncs.lit("100"))

val df3 = spark
  .createDataFrame(List(Row(0)).asJava, StructType(Seq(StructField("EMPTY", IntegerType))))
  .withColumn("num1", SparkFuncs.lit(1000))
  .withColumn("num2", SparkFuncs.lit(2000))
  .withColumn("num3", SparkFuncs.lit(3000))
  .withColumn("num4", SparkFuncs.lit(4000))
  .select(SparkFuncs.struct("num1", "num2", "num3", "num4").alias("df3_struct"))
  .cache()

val df4 = df2
  .join(
    df2
      .join(df3, Seq(), "inner")
      .groupBy()
      .agg(SparkFuncs.first("df3_struct").as("df3_struct_first")),
    Seq(),
    "left")
  .select(
    SparkFuncs.col("*"),
    SparkFuncs
      .posexplode(SparkFuncs.array(SparkFuncs.lit("x"), SparkFuncs.lit("y"), SparkFuncs.lit("z")))
      .as(Seq("pos", "val"))
  )

// Run this multiple times: the resulting hash is unstable on master, but stable after the patch.
df4
  .withColumns(
    ListMap(
      "col1" -> SparkFuncs.lit(100),
      "col2" -> SparkFuncs
        .when(
          SparkFuncs.col("val").eqNullSafe(SparkFuncs.lit("x")),
          SparkFuncs.col("df1_struct_first").getItem("a"))
        .otherwise(
          SparkFuncs
            .when(
              SparkFuncs.col("val").eqNullSafe(SparkFuncs.lit("y")),
              SparkFuncs.col("df1_struct_first").getItem("b"))
            .otherwise(SparkFuncs
              .when(
                SparkFuncs.col("val").eqNullSafe(SparkFuncs.lit("z")),
                SparkFuncs.col("df1_struct_first").getItem("c")))),
      "col3" -> SparkFuncs.col("col2").plus(SparkFuncs.lit(1000)),
      "col4" -> SparkFuncs.col("df3_struct_first").getItem("num1").multiply(SparkFuncs.lit(10)),
      "col5" -> SparkFuncs.col("df3_struct_first").getItem("num2").multiply(SparkFuncs.lit(10)),
      "col6" -> SparkFuncs.col("df3_struct_first").getItem("num3").multiply(SparkFuncs.lit(10)),
      "col7" -> SparkFuncs.col("df3_struct_first").getItem("num4").multiply(SparkFuncs.lit(10)),
      "col8" -> SparkFuncs.col("col4").minus(SparkFuncs.col("col5")),
      "col9" -> SparkFuncs.col("col8").plus(SparkFuncs.lit(1000))
    ))
  .semanticHash()
```

### Was this patch authored or co-authored using generative AI tooling?
No
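For context, here is a minimal plain-Scala sketch (not the patched Spark code itself; the element values are made up for illustration) of why switching the set type stabilizes ordering: a `mutable.HashSet` iterates in hash-bucket order, which need not match insertion order and can vary whenever element hash codes vary (e.g., with auto-generated expression IDs), while a `mutable.LinkedHashSet` always iterates in insertion order.

```
import scala.collection.mutable

// Hypothetical stand-ins for the collected projections; only the ordering matters here.
val items = Seq(10, 1, 7, 3)

val hashSet = mutable.HashSet(items: _*)         // iterates in hash-bucket order
val linkedSet = mutable.LinkedHashSet(items: _*) // iterates in insertion order

println(hashSet.toSeq)   // bucket order, e.g. List(1, 3, 7, 10) -- not insertion order
println(linkedSet.toSeq) // always List(10, 1, 7, 3)
```

Anything derived by folding over the set, such as a semantic hash of the plan, then depends only on the order in which the projections were added, not on their hash codes.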