[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column

2023-07-25 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746833#comment-17746833
 ] 

Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 8:31 AM:


After inspecting the SQL plan, I found the following differences:

spark.sql.optimizer.plannedWrite.enabled=false (correct result)
 !Test-Details-for-Query-0.png! 

spark.sql.optimizer.plannedWrite.enabled=true (incorrect result)
 !Test-Details-for-Query-1.png! 

It appears that Spark sorts the wrong column when 
spark.sql.optimizer.plannedWrite.enabled=true
(it should sort by _1, but it actually sorts by _2 instead).
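
To make the difference between the two plans concrete, here is a minimal plain-Java sketch. It is not Spark code and not the reporter's reproduction; the class name, the Row type and the sample rows are all hypothetical. It models "sort, drop _1, then split by _2" and compares sorting by _1 with sorting by _2, assuming that when only _2 is sorted the rows inside a partition simply stay in input order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model only; no Spark classes are involved.
final class SortThenPartitionModel {

    // Hypothetical stand-in for a Tuple3 row with columns _1, _2 and _3.
    static final class Row {
        final int c1;
        final String c2;
        final String c3;

        Row(int c1, String c2, String c3) {
            this.c1 = c1;
            this.c2 = c2;
            this.c3 = c3;
        }
    }

    // Sorts a copy of the rows with the given comparator, then groups the _3
    // values by _2 while keeping the sorted order inside each group.
    static Map<String, List<String>> write(List<Row> rows, Comparator<Row> sortKey) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(sortKey); // List.sort is stable, so ties keep input order
        Map<String, List<String>> partitions = new TreeMap<>();
        for (Row r : sorted) {
            partitions.computeIfAbsent(r.c2, k -> new ArrayList<>()).add(r.c3);
        }
        return partitions;
    }

    // What the reporter expects: rows ordered by _1 inside every partition.
    static Map<String, List<String>> sortedByFirst(List<Row> rows) {
        return write(rows, Comparator.comparingInt((Row r) -> r.c1));
    }

    // What the reported plan amounts to in this model: ordering by _2 only.
    static Map<String, List<String>> sortedBySecond(List<Row> rows) {
        return write(rows, Comparator.comparing((Row r) -> r.c2));
    }

    static List<Row> sampleRows() {
        return Arrays.asList(
            new Row(3, "a", "x3"),
            new Row(1, "a", "x1"),
            new Row(2, "b", "y2"),
            new Row(0, "b", "y0"));
    }

    public static void main(String[] args) {
        // prints: sorted by _1: {a=[x1, x3], b=[y0, y2]}
        System.out.println("sorted by _1: " + sortedByFirst(sampleRows()));
        // prints: sorted by _2: {a=[x3, x1], b=[y2, y0]}
        System.out.println("sorted by _2: " + sortedBySecond(sampleRows()));
    }
}
```

In this model the two results contain the same values per partition but in a different order, which is the kind of difference one would look for when comparing the output files written with the setting enabled and disabled.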


was (Author: JIRAUSER301473):
After inspecting the SQL plan, bekow the differences

spark.sql.optimizer.plannedWrite.enabled=false (correct result)
 !Test-Details-for-Query-0.png! 

spark.sql.optimizer.plannedWrite.enabled=true (incorrect result)
 !Test-Details-for-Query-1.png! 

It appears spark generates sorted incorrect column if 
spark.sql.optimizer.plannedWrite.enabled=true
(it should sort _1, but it actually sorted _2 instead)

> dataset.sort.select.write.partitionBy sorts wrong column
> 
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also has the same problem) unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false.
> After further investigation, I found that Spark actually sorts the wrong 
> column in the following code.
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{.partitionBy("_2")}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column

2023-07-25 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746833#comment-17746833
 ] 

Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 8:31 AM:


After inspecting the SQL plan, I found the following differences:

spark.sql.optimizer.plannedWrite.enabled=false (correct result)
 !Test-Details-for-Query-0.png! 

spark.sql.optimizer.plannedWrite.enabled=true (incorrect result)
 !Test-Details-for-Query-1.png! 

It appears that Spark sorts the wrong column when 
spark.sql.optimizer.plannedWrite.enabled=true
(it should sort by _1, but it actually sorts by _2 instead).


was (Author: JIRAUSER301473):
After inspecting the SQL plan, below are the differences

spark.sql.optimizer.plannedWrite.enabled=false (correct result)
 !Test-Details-for-Query-0.png! 

spark.sql.optimizer.plannedWrite.enabled=true (incorrect result)
 !Test-Details-for-Query-1.png! 

It appears spark generates sorted incorrect column if 
spark.sql.optimizer.plannedWrite.enabled=true
(it should sort _1, but it actually sorted _2 instead)




[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column

2023-07-25 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746849#comment-17746849
 ] 

Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 10:06 AM:
-

Bumping to Blocker because I believe this is a potentially very serious issue 
in the query planner that may affect other queries: when sort() is followed by 
select() and the sort column is not among the selected columns, the query plan 
uses the wrong column to sort.


was (Author: JIRAUSER301473):
bumping to blocker because I believe this is a potentially very serious issue 
in the query planner (sort().select() and the original sorting column is not in 
select(), then query plan would use the wrong column to sort), which may affect 
other queries

> dataset.sort.select.write.partitionBy sorts wrong column
> 
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Blocker
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png



[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column

2023-07-25 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746849#comment-17746849
 ] 

Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 10:06 AM:
-

Bumping to Blocker because I believe this is a potentially very serious issue 
in the query planner that may affect other queries: with sort().select(), when 
the original sort column is not in select(), the query plan uses the wrong 
column to sort.


was (Author: JIRAUSER301473):
bumping to blocker because I believe this is a potentially very serious issue 
in the query planner, which may affect other queries




[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column

2023-07-25 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746849#comment-17746849
 ] 

Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 11:36 AM:
-

Bumping to Blocker because I believe this is a potentially very serious issue 
in the query planner, which may affect other queries.

(When sort() is followed by select() and the sort column is not in select(), 
the query planner uses the wrong column to sort.)


was (Author: JIRAUSER301473):
bumping to blocker because I believe this is a potentially very serious issue 
in the query planner (sort() then select(), and the sorting column is not in 
select(), then query plan would use the wrong column to sort), which may affect 
other queries




[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column

2023-11-12 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17785356#comment-17785356
 ] 

Yiu-Chung Lee edited comment on SPARK-44512 at 11/13/23 3:34 AM:
-

[~dongjoon] As I mentioned before, I need to preserve the sort order by _1 
(although _1 is not part of the output) before writing to file. If partitionBy 
does not provide such a contract, what would you recommend?


was (Author: JIRAUSER301473):
[~dongjoon] As I mentioned before, I need to preserve the sorting order by _1 
before writing into file. If partitionBy does not have such contract what would 
be your recommendation?

> dataset.sort.select.write.partitionBy sorts wrong column
> 
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Blocker
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png