[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17785356#comment-17785356 ]
Yiu-Chung Lee edited comment on SPARK-44512 at 11/13/23 3:34 AM:
-----------------------------------------------------------------

[~dongjoon] As I mentioned before, I need to preserve the sort order by _1 (although _1 is not part of the output) before writing to a file. If partitionBy does not have such a contract, what would be your recommendation?

was (Author: JIRAUSER301473):
[~dongjoon] As I mentioned before, I need to preserve the sort order by _1 before writing to a file. If partitionBy does not have such a contract, what would be your recommendation?

> dataset.sort.select.write.partitionBy sorts wrong column
> --------------------------------------------------------
>
>                 Key: SPARK-44512
>                 URL: https://issues.apache.org/jira/browse/SPARK-44512
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, SQL
>    Affects Versions: 3.4.1
>            Reporter: Yiu-Chung Lee
>            Priority: Blocker
>         Attachments: Test-Details-for-Query-0.png, Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3.)
>
> I found -then when AQE is enabled,- that the following code does not produce sorted output (.drop() also has the same problem), unless spark.sql.optimizer.plannedWrite.enabled is set to false.
> After further investigation, Spark actually sorted the wrong column in the following code.
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>
> (The following workaround is no longer necessary.)
> -However, if I insert an identity mapper between select and write, the output would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction<Row, Row>) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{.partitionBy("_2")}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.
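
The report states that the output is sorted as expected when spark.sql.optimizer.plannedWrite.enabled is set to false. As a minimal sketch of that setting only (this is not the reporter's reproduction code, and placing it in spark-defaults.conf is an assumption about the deployment), the entry could look like this:

```properties
# Assumption: set cluster-wide via conf/spark-defaults.conf.
# The key name is taken from the report above; "false" disables planned writes.
spark.sql.optimizer.plannedWrite.enabled  false
```

The same key can instead be passed per job, for example with --conf spark.sql.optimizer.plannedWrite.enabled=false on spark-submit, which limits the change to the job that needs the sort order preserved.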