[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column

Yiu-Chung Lee (Jira) Tue, 25 Jul 2023 03:08:58 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746849#comment-17746849
 ]


Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 10:06 AM:
-----------------------------------------------------------------

Bumping to Blocker because I believe this is a potentially very serious issue 
in the query planner: when sort() is followed by select() and the original 
sort column is not in the select(), the query plan sorts by the wrong column. 
This may affect other queries.
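
To make the expected result concrete, here is a minimal plain-Java sketch of what sort("_1") followed by select("_2", "_3") and partitionBy("_2") is expected to produce. It does not use Spark at all; the class SortThenSelectModel, the Row3 type and the expectedOutput method are hypothetical names introduced only for this illustration, and the sample rows are made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SortThenSelectModel {
    // Hypothetical stand-in for the Tuple3 rows (_1, _2, _3) in the report.
    static final class Row3 {
        final int c1;
        final String c2;
        final String c3;

        Row3(int c1, String c2, String c3) {
            this.c1 = c1;
            this.c2 = c2;
            this.c3 = c3;
        }
    }

    // Models the expected semantics only (assumption, not Spark's implementation):
    // sort by _1, drop _1, then bucket the remaining _3 values by _2,
    // keeping the order established by the sort on _1.
    static Map<String, List<String>> expectedOutput(List<Row3> rows) {
        List<Row3> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingInt((Row3 r) -> r.c1)); // stable sort
        Map<String, List<String>> partitions = new LinkedHashMap<>();
        for (Row3 r : sorted) {
            partitions.computeIfAbsent(r.c2, k -> new ArrayList<>()).add(r.c3);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Row3> rows = List.of(
            new Row3(3, "a", "x"),
            new Row3(1, "a", "z"),
            new Row3(2, "b", "y"),
            new Row3(0, "b", "w"));
        System.out.println(expectedOutput(rows));
    }
}
```

With these sample rows, partition "a" should hold "z" before "x", because the order comes from _1 (1 before 3). An output of "x" before "z" would mean the rows were ordered by some column other than _1, which is the symptom described in this issue.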


was (Author: JIRAUSER301473):
bumping to blocker because I believe this is a potentially very serious issue 
in the query planner, which may affect other queries

> dataset.sort.select.write.partitionBy sorts wrong column
> --------------------------------------------------------
>
>                 Key: SPARK-44512
>                 URL: https://issues.apache.org/jira/browse/SPARK-44512
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, SQL
>    Affects Versions: 3.4.1
>            Reporter: Yiu-Chung Lee
>            Priority: Blocker
>              Labels: correctness
>         Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also has the same problem) unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false.
> After further investigation, Spark actually sorts the wrong column in the 
> following code.
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction<Row, Row>) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{.partitionBy("_2")}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
