Swetha Baskaran created SPARK-40588:
---------------------------------------

             Summary: Sorting issue with AQE turned on  
                 Key: SPARK-40588
                 URL: https://issues.apache.org/jira/browse/SPARK-40588
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.1.3
         Environment: Spark v3.1.3
Scala v2.12.13
            Reporter: Swetha Baskaran


We are attempting to partition data by a few columns, sort by a particular 
_sortCol_ and write out one file per partition. 
{code:java}
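import org.apache.spark.sql.functions.{col, monotonically_increasing_id, spark_partition_id}

df
    .repartition(col("day"), col("month"), col("year"))
    // Record the shuffle partition and the row order before sorting.
    .withColumn("partitionId", spark_partition_id())
    .withColumn("monotonicallyIncreasingIdUnsorted", monotonically_increasing_id())
    .sortWithinPartitions("year", "month", "day", "sortCol")
    // Record the row order after sorting.
    .withColumn("monotonicallyIncreasingIdSorted", monotonically_increasing_id())
    .write
    .partitionBy("year", "month", "day")
    .parquet(path){code}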
When inspecting the results, we observe one file per partition; however, some 
files contain an _alternating_ pattern of unsorted rows.
{code:java}
{"sortCol":100000,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344}
{"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586}
{"sortCol":100000,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345}
{"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587}
{"sortCol":100000,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346}
{"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588}
{"sortCol":100000,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347}
{"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589}
{"sortCol":100000,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348}
{"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590}
{"sortCol":100000,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code}
Here is a 
[gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to 
reproduce the issue. 
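
One way to flag the pattern in the sample rows above is to scan the _sortCol_ 
values in the order they appear in a file and report the first position where 
the value decreases. This is only a sketch: the helper name _firstOutOfOrder_ 
is hypothetical, it is not part of the gist, and it assumes the values have 
already been collected in file order.
{code:scala}
// Hypothetical helper: returns the index of the first value that is smaller
// than its predecessor, or None if the sequence is in non-decreasing order.
def firstOutOfOrder(values: Seq[Long]): Option[Int] =
  values
    .sliding(2)        // adjacent pairs (a, b)
    .zipWithIndex      // i is the index of a, so b sits at i + 1
    .collectFirst { case (Seq(a, b), i) if a > b => i + 1 }

// sortCol values from the first four sample rows above.
val sample = Seq(100000L, 1303413L, 100000L, 1303413L)
println(firstOutOfOrder(sample)) // Some(2): the third row breaks the order
{code}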

Turning off AQE with spark.conf.set("spark.sql.adaptive.enabled", false) fixes 
the issue.
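
For reference, a minimal sketch of the workaround, assuming an existing 
SparkSession named _spark_ and that the setting is applied before the write is 
triggered:
{code:scala}
// Assumes an existing SparkSession named `spark`.
// Disable adaptive query execution for this session before running the write.
spark.conf.set("spark.sql.adaptive.enabled", "false")
{code}
The same setting can presumably be passed at submit time with 
--conf spark.sql.adaptive.enabled=false; we have only verified the 
session-level call.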

I'm working on identifying why AQE affects the sort order. Any leads or 
thoughts would be appreciated!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

