Joachim Bargsten created SPARK-32728:
----------------------------------------
             Summary: Using groupby with rand creates different values when joining table with itself
                 Key: SPARK-32728
                 URL: https://issues.apache.org/jira/browse/SPARK-32728
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0, 2.4.5
         Environment: Tested on Azure Databricks 7.2 (includes Apache Spark 3.0.0, Scala 2.12), worker type Standard_DS3_v2 (2 workers)
            Reporter: Joachim Bargsten

When running the following query on a cluster with *multiple workers (>1)*, the result is not 0.0, even though I would expect it to be.

{code:java}
import pyspark.sql.functions as F

sdf = spark.range(100)
sdf = (
    sdf.withColumn("a", F.col("id") + 1)
    .withColumn("b", F.col("id") + 2)
    .withColumn("c", F.col("id") + 3)
    .withColumn("d", F.col("id") + 4)
    .withColumn("e", F.col("id") + 5)
)
sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e"))
sdf = sdf.withColumn("x", F.rand() * F.col("e"))

sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
sdf2 = sdf2.withColumn("delta_x", F.abs(F.col("x") - F.col("xx"))).agg(F.sum("delta_x"))

sum_delta_x = sdf2.head()[0]
print(f"{sum_delta_x} should be 0.0")
assert abs(sum_delta_x) < 0.001
{code}

If the groupby statement is commented out, the code works as expected.
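A likely explanation is that {{F.rand()}} is a non-deterministic expression: it is re-evaluated independently on each side of the self-join, and the shuffle introduced by the groupby can assign rows to partitions differently between evaluations, so the two sides of the join see different random values for the same row. One possible workaround (a sketch under that assumption, not a confirmed fix; the checkpoint directory path below is only an example) is to materialize the random column once before the self-join:

{code:java}
import pyspark.sql.functions as F

# A checkpoint location must be configured first (path is an example).
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

sdf = sdf.withColumn("x", F.rand() * F.col("e"))

# checkpoint() writes the data out and cuts the lineage, so "x" is
# computed exactly once and both sides of the self-join read the
# same materialized values.
sdf = sdf.checkpoint()

sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
{code}

Caching ({{sdf.cache()}}) often hides the issue as well, but cached partitions can be evicted and recomputed, in which case the values may change again.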