apapi opened a new pull request #27077: [SPARK-30408][SQL] Should not remove orderBy in sortBy clause in Optimizer
URL: https://github.com/apache/spark/pull/27077

### What changes were proposed in this pull request?

Fix defect [SPARK-30408](https://issues.apache.org/jira/browse/SPARK-30408) in `EliminateSorts`: an `orderBy` that precedes a sortBy clause (`sortWithinPartitions`) was removed by `EliminateSorts`.

Code to reproduce:

```
// Assumes a SparkSession with `import spark.implicits._` in scope (needed for toDF).
val dataset = Seq(
  ("a", 1, 4),
  ("b", 2, 5),
  ("c", 3, 6)
).toDF("a", "b", "c")
val groupData = dataset.orderBy("b")
val sortData = groupData.sortWithinPartitions("c")
```

The content of `groupData` is:

```
partition 0: [a,1,4]
partition 1: [b,2,5]
partition 2: [c,3,6]
```

The content of `sortData` is:

```
partition 0: [a,1,4]
partition 1: [b,2,5], [c,3,6]
```

The content of `sortData` is incorrect because the `orderBy` was removed by `EliminateSorts`. The content of `sortData` should be the same as that of `groupData`.

### Why are the changes needed?

This PR fixes defect [SPARK-30408](https://issues.apache.org/jira/browse/SPARK-30408). Without this fix, the output of `rdd.orderBy("b").sortWithinPartitions("c")` is the same as that of `rdd.sortWithinPartitions("c")`, which is not correct.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

I added a unit test in `EliminateSortsSuite` to cover this patch.
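
To make the reported difference concrete, here is a minimal sketch that models partitions as plain Scala collections. It is not Spark code and does not reflect how Spark implements either operator: `SortModel`, `orderByB`, and `sortWithinPartitionsC` are hypothetical names, and the even range split inside `orderByB` is an assumption made only for illustration. It shows that a global sort changes partition boundaries, so dropping it before a per-partition sort changes the result.

```scala
object SortModel {
  // One row of the example dataset: columns a, b, c.
  type Row = (String, Int, Int)

  // Hypothetical stand-in for orderBy("b"): sort all rows by b, then split
  // them into contiguous ranges (an assumed, simplified range partitioning).
  def orderByB(parts: Seq[Seq[Row]], numParts: Int): Seq[Seq[Row]] = {
    val sorted = parts.flatten.sortBy(_._2)
    val size = math.max(1, math.ceil(sorted.size.toDouble / numParts).toInt)
    sorted.grouped(size).toList
  }

  // Hypothetical stand-in for sortWithinPartitions("c"): sort each partition
  // by c and leave the partition boundaries untouched.
  def sortWithinPartitionsC(parts: Seq[Seq[Row]]): Seq[Seq[Row]] =
    parts.map(_.sortBy(_._3))

  def main(args: Array[String]): Unit = {
    // Assumed initial layout: two partitions, as in the incorrect output above.
    val input: Seq[Seq[Row]] = Seq(
      Seq(("a", 1, 4)),
      Seq(("b", 2, 5), ("c", 3, 6))
    )
    // Global sort kept: rows are redistributed before the local sort.
    println(sortWithinPartitionsC(orderByB(input, 3)))
    // Global sort dropped: the original partition boundaries survive.
    println(sortWithinPartitionsC(input))
  }
}
```

In this model the first call yields three single-row partitions and the second keeps the original two, which mirrors the `groupData` versus `sortData` contents reported in the description.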