[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yiu-Chung Lee updated SPARK-44512:
----------------------------------
    Description:
(In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3)

I found -that when AQE is enabled,- that the following code does not produce sorted output (.drop() also has the same problem)
{{dataset.sort("_1")}}
{{.select("_2", "_3")}}
{{.write()}}
{{.partitionBy("_2")}}
{{.text("output");}}

(the following workaround is no longer necessary)
However, if I insert an identity mapper between select and write, the output would be sorted as expected.
{{dataset = dataset.sort("_1")}}
{{.select("_2", "_3");}}
{{dataset.map((MapFunction<Row, Row>) row -> row, dataset.encoder())}}
{{.write()}}
{{.partitionBy("_2")}}
{{.text("output");}}

Below is the complete code that reproduces the problem.

  was:
(In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3)

I found -that when AQE is enabled,- that the following code does not produce sorted output (.drop() also has the same problem)
{{dataset.sort("_1")}}
{{.select("_2", "_3")}}
{{.write()}}
{{.partitionBy("_2")}}
{{.text("output");}}

(the following workaround is no longer necessary)
-However, if I insert an identity mapper between select and write, the output would be sorted as expected.
{{dataset = dataset.sort("_1")}}
{{.select("_2", "_3");}}
{{dataset.map((MapFunction<Row, Row>) row -> row, dataset.encoder())}}
{{.write()}}
{{.partitionBy("_2")}}
{{.text("output");}}

Below is the complete code that reproduces the problem.

> dataset.sort.select.write.partitionBy does not return a sorted output
> ---------------------------------------------------------------------
>
>                 Key: SPARK-44512
>                 URL: https://issues.apache.org/jira/browse/SPARK-44512
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, SQL
>    Affects Versions: 3.4.1
>            Reporter: Yiu-Chung Lee
>            Priority: Major
>              Labels: correctness
>
> (In this example the dataset is of type Tuple3, and the columns are named _1,
> _2 and _3)
>
> I found -that when AQE is enabled,- that the following code does not produce
> sorted output (.drop() also has the same problem)
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>
> (the following workaround is no longer necessary)
> However, if I insert an identity mapper between select and write, the output
> would be sorted as expected.
> {{dataset = dataset.sort("_1")}}
> {{.select("_2", "_3");}}
> {{dataset.map((MapFunction<Row, Row>) row -> row, dataset.encoder())}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
> Below is the complete code that reproduces the problem.
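
The report does not spell out what "sorted output" means for a partitioned write. One reading, which is an assumption here and not something the reporter states, is that the rows written under each `_2=<value>` directory should keep their `_1` order. The plain-Java sketch below has no Spark dependency and is not the reporter's reproduction code; it only models that expected ordering on a small in-memory list. The class `SortedPartitionSketch`, the `Row3` type and the sample values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SortedPartitionSketch {

    // Hypothetical stand-in for one Tuple3 row with columns _1, _2 and _3.
    static final class Row3 {
        final int c1;
        final String c2;
        final String c3;

        Row3(int c1, String c2, String c3) {
            this.c1 = c1;
            this.c2 = c2;
            this.c3 = c3;
        }
    }

    // Models sort("_1") -> select("_2", "_3") -> partitionBy("_2"):
    // rows are ordered by _1, then bucketed by _2 in encounter order,
    // so each bucket lists its _3 values in ascending _1 order.
    static Map<String, List<String>> sortSelectPartition(List<Row3> rows) {
        List<Row3> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingInt((Row3 r) -> r.c1)); // List.sort is stable

        Map<String, List<String>> partitions = new LinkedHashMap<>();
        for (Row3 r : sorted) {
            // _1 is dropped by the select; _2 becomes the partition key.
            partitions.computeIfAbsent(r.c2, k -> new ArrayList<>()).add(r.c3);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Row3> rows = new ArrayList<>();
        rows.add(new Row3(3, "a", "z"));
        rows.add(new Row3(1, "b", "y"));
        rows.add(new Row3(2, "a", "x"));
        rows.add(new Row3(0, "b", "w"));

        // Prints {b=[w, y], a=[x, z]}
        System.out.println(sortSelectPartition(rows));
    }
}
```

Under this reading, the bug would show up as a partition whose `_3` values appear in some other order, for example `z` before `x` under `_2=a`.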