[ https://issues.apache.org/jira/browse/SPARK-40588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656705#comment-17656705 ]
Erik Krogen commented on SPARK-40588: ------------------------------------- Labeling with 'correctness' since this breaks correctness of output by breaking the sort ordering. > Sorting issue with partitioned-writing and AQE turned on > -------------------------------------------------------- > > Key: SPARK-40588 > URL: https://issues.apache.org/jira/browse/SPARK-40588 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 3.1.3 > Environment: Spark v3.1.3 > Scala v2.12.13 > Reporter: Swetha Baskaran > Assignee: Enrico Minack > Priority: Major > Labels: correctness > Fix For: 3.2.3, 3.3.2 > > Attachments: image-2022-10-16-22-05-47-159.png > > > We are attempting to partition data by a few columns, sort by a particular > _sortCol_ and write out one file per partition. > {code:java} > df > .repartition(col("day"), col("month"), col("year")) > .withColumn("partitionId",spark_partition_id) > .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId) > .sortWithinPartitions("year", "month", "day", "sortCol") > .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId) > .write > .partitionBy("year", "month", "day") > .parquet(path){code} > When inspecting the results, we observe one file per partition, however we > see an _alternating_ pattern of unsorted rows in some files. > {code:java} > {"sortCol":100000,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586} > {"sortCol":100000,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587} > {"sortCol":100000,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588} > {"sortCol":100000,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589} > {"sortCol":100000,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590} > {"sortCol":100000,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code} > Here is a > [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to > reproduce the issue. > Turning off AQE with spark.conf.set("spark.sql.adaptive.enabled", false) > fixes the issue. > I'm working on identifying why AQE affects the sort order. Any leads or > thoughts would be appreciated! -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org