[jira] [Updated] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-23 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-48956:
---
Affects Version/s: 3.2.3
   3.2.2
   3.1.3

> Spark Repartition Task Field Retry Cause Data Duplication
> -
>
> Key: SPARK-48956
> URL: https://issues.apache.org/jira/browse/SPARK-48956
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.1.3, 3.2.1, 3.2.2, 3.2.3
>Reporter: xuanzhiang
>Priority: Major
> Attachments: image-2024-07-21-18-21-33-888.png, 
> image-2024-07-21-18-22-04-665.png, image-2024-07-22-10-00-45-793.png, 
> image-2024-07-22-14-47-50-773.png
>
>
> This issue seems similar to [SPARK-23207|https://issues.apache.org/jira/browse/SPARK-23207].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-23 Thread xuanzhiang (Jira)


[ https://issues.apache.org/jira/browse/SPARK-48956 ]


xuanzhiang deleted comment on SPARK-48956:


was (Author: JIRAUSER295364):
The metric info was wrong: the actual output should be 35351985 rows, but we got 
duplicate data. I will try to reproduce the problem and provide a test case.




[jira] [Comment Edited] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-21 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867668#comment-17867668
 ] 

xuanzhiang edited comment on SPARK-48956 at 7/22/24 6:51 AM:
-

I found out that the job failed because of a shuffle: the task failed with a 
shuffle data fetch failure, so the previous stage had to be re-run. I think a 
retried task read a partition that was not fixed, which caused the data 
duplication. I think the change in parallelism caused the partitioning to change.


was (Author: JIRAUSER295364):
I found out that the job failed because of a shuffle: the task failed with a 
shuffle data fetch failure, so the previous stage had to be re-run. I think a 
retried task read a partition that was not fixed, which caused the data duplication.
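The parallelism point can be illustrated outside Spark. Under round-robin assignment a row's partition is simply its index modulo the partition count, so if the parallelism differs between attempts, the same rows land in different partitions. A minimal Python sketch for illustration only (not Spark internals):

```python
def round_robin_layout(num_rows, num_partitions):
    """Partition id for each row index under round-robin assignment."""
    return [i % num_partitions for i in range(num_rows)]

# With 10 rows, 4 vs. 5 partitions place most rows differently:
layout_a = round_robin_layout(10, 4)  # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
layout_b = round_robin_layout(10, 5)  # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
assert layout_a != layout_b
```

So if dynamic allocation changes the effective parallelism between an attempt and its retry, the recomputed partitions need not match the originals.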




[jira] [Comment Edited] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-21 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867668#comment-17867668
 ] 

xuanzhiang edited comment on SPARK-48956 at 7/22/24 6:51 AM:
-

I found out that the job failed because of a shuffle: the task failed with a 
shuffle data fetch failure, so the previous stage had to be re-run. I think a 
retried task read a partition that was not fixed, which caused the data 
duplication. I think the change in parallelism caused the partitioning to change. 
Should I turn off dynamic allocation?


was (Author: JIRAUSER295364):
I found out that the job failed because of a shuffle: the task failed with a 
shuffle data fetch failure, so the previous stage had to be re-run. I think a 
retried task read a partition that was not fixed, which caused the data 
duplication. I think the change in parallelism caused the partitioning to change.




[jira] [Comment Edited] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-21 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867668#comment-17867668
 ] 

xuanzhiang edited comment on SPARK-48956 at 7/22/24 6:49 AM:
-

I found out that the job failed because of a shuffle: the task failed with a 
shuffle data fetch failure, so the previous stage had to be re-run. I think a 
retried task read a partition that was not fixed, which caused the data duplication.


was (Author: JIRAUSER295364):
I found out that the job failed because of a shuffle: the task failed with a 
shuffle data fetch failure, so the previous stage had to be re-run. 
!image-2024-07-22-14-47-50-773.png!




[jira] [Commented] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-21 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867668#comment-17867668
 ] 

xuanzhiang commented on SPARK-48956:


I found out that the job failed because of a shuffle: the task failed with a 
shuffle data fetch failure, so the previous stage had to be re-run. 
!image-2024-07-22-14-47-50-773.png!




[jira] [Updated] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-21 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-48956:
---
Attachment: image-2024-07-22-14-47-50-773.png




[jira] [Updated] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-21 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-48956:
---
Attachment: (was: image-2024-07-22-09-59-31-004.png)




[jira] [Commented] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-21 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867631#comment-17867631
 ] 

xuanzhiang commented on SPARK-48956:


!image-2024-07-22-10-00-45-793.png!

The metric info was wrong: the actual output should be 35351985 rows, but we got 
duplicate data. I will try to reproduce the problem and provide a test case.




[jira] [Comment Edited] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-21 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867631#comment-17867631
 ] 

xuanzhiang edited comment on SPARK-48956 at 7/22/24 2:02 AM:
-

The metric info was wrong: the actual output should be 35351985 rows, but we got 
duplicate data. I will try to reproduce the problem and provide a test case.


was (Author: JIRAUSER295364):
!image-2024-07-22-10-00-45-793.png!

The metric info was wrong: the actual output should be 35351985 rows, but we got 
duplicate data. I will try to reproduce the problem and provide a test case.




[jira] [Updated] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-21 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-48956:
---
Attachment: image-2024-07-22-10-00-45-793.png




[jira] [Updated] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-21 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-48956:
---
Attachment: image-2024-07-22-09-59-31-004.png




[jira] [Updated] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-21 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-48956:
---
Attachment: image-2024-07-21-18-22-04-665.png




[jira] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-21 Thread xuanzhiang (Jira)


[ https://issues.apache.org/jira/browse/SPARK-48956 ]


xuanzhiang deleted comment on SPARK-48956:


was (Author: JIRAUSER295364):
!image-2024-07-21-18-22-04-665.png!




[jira] [Commented] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-21 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867595#comment-17867595
 ] 

xuanzhiang commented on SPARK-48956:


!image-2024-07-21-18-22-04-665.png!




[jira] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-21 Thread xuanzhiang (Jira)


[ https://issues.apache.org/jira/browse/SPARK-48956 ]


xuanzhiang deleted comment on SPARK-48956:


was (Author: JIRAUSER295364):
!image-2024-07-21-18-21-33-888.png!




[jira] [Updated] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-21 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-48956:
---
Attachment: image-2024-07-21-18-21-33-888.png




[jira] [Created] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-21 Thread xuanzhiang (Jira)
xuanzhiang created SPARK-48956:
--

 Summary: Spark Repartition Task Field Retry Cause Data Duplication
 Key: SPARK-48956
 URL: https://issues.apache.org/jira/browse/SPARK-48956
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 3.2.1
Reporter: xuanzhiang


This issue seems similar to [SPARK-23207|https://issues.apache.org/jira/browse/SPARK-23207].






[jira] [Comment Edited] (SPARK-23207) Shuffle+Repartition on an DataFrame could lead to incorrect answers

2024-07-21 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867593#comment-17867593
 ] 

xuanzhiang edited comment on SPARK-23207 at 7/21/24 10:01 AM:
--

[~igor.berman] Same here. We are running on YARN with dynamic allocation. Task 
retries generate duplicate data.


was (Author: JIRAUSER295364):
[~igor.berman] Same here. We are running on YARN with dynamic allocation.

> Shuffle+Repartition on an DataFrame could lead to incorrect answers
> ---
>
> Key: SPARK-23207
> URL: https://issues.apache.org/jira/browse/SPARK-23207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.1.4, 2.2.3, 2.3.0
>
>
> Currently, shuffle repartition uses RoundRobinPartitioning; the generated 
> result is nondeterministic since the order of the input rows is not 
> determined.
> The bug can be triggered when a repartition call follows a shuffle 
> (which leads to non-deterministic row ordering), as in the pattern below:
> upstream stage -> repartition stage -> result stage
> (-> indicates a shuffle)
> When one of the executor processes goes down, some tasks of the repartition 
> stage are retried and generate an inconsistent ordering, and some tasks of 
> the result stage are retried and generate different data.
> The following code returns 931532 instead of 1000000:
> {code:java}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
> throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}
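The nondeterminism described above does not need a cluster to observe: round-robin output depends entirely on the order in which rows arrive, so a retried task that receives the same rows in a different order fills the partitions differently. A minimal Python sketch of the idea (not Spark's implementation):

```python
def round_robin(rows, num_partitions):
    """Deal rows into partitions in arrival order, like RoundRobinPartitioning."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

rows = list(range(10))
first_attempt = round_robin(rows, 3)
# A shuffle does not guarantee row order, so a retry may see the rows reversed:
retry_attempt = round_robin(list(reversed(rows)), 3)
assert first_attempt != retry_attempt
# If only some downstream partitions are recomputed from the retry layout,
# rows can be duplicated or lost relative to the first attempt.
```

As far as I know, the fix for SPARK-23207 addresses this by making the row order deterministic (a local sort) before the round-robin repartition.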






[jira] [Commented] (SPARK-23207) Shuffle+Repartition on an DataFrame could lead to incorrect answers

2024-07-21 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867593#comment-17867593
 ] 

xuanzhiang commented on SPARK-23207:


[~igor.berman] Same here. We are running on YARN with dynamic allocation.




[jira] [Commented] (SPARK-42217) Support lateral column alias in queries with Window

2024-01-08 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804201#comment-17804201
 ] 

xuanzhiang commented on SPARK-42217:


Hello, does Spark 3.4.2 support lateral column alias (LCA) in queries with Window?

> Support lateral column alias in queries with Window
> ---
>
> Key: SPARK-42217
> URL: https://issues.apache.org/jira/browse/SPARK-42217
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Xinyi Yu
>Assignee: Xinyi Yu
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Updated] (SPARK-42227) Use approx_percentile function running slower than percentile in spark3

2023-02-15 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-42227:
---
Summary: Use approx_percentile function running slower than percentile in 
spark3   (was: Use approx_percentile function running slower in spark3 than 
spark2)

> Use approx_percentile function running slower than percentile in spark3 
> 
>
> Key: SPARK-42227
> URL: https://issues.apache.org/jira/browse/SPARK-42227
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xuanzhiang
>Priority: Major
> Attachments: percentile+objectHashAggregateExec.png, 
> percentile+objectHashAggregateExec_shuffle_task.png, 
> percentile_approx+objectHashAggregateExec.png, 
> percentile_approx+objectHashAggregateExec_shuffle_task.png
>
>
> approx_percentile(end_ts-start_ts, 0.9) cost_p90
> In Spark 3 this uses ObjectHashAggregateExec, but the shuffle is very slow. 
> When I use percentile instead, it becomes fast. I don't know the reason; I 
> would expect approx_percentile to be faster.






[jira] [Comment Edited] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2

2023-02-13 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687889#comment-17687889
 ] 

xuanzhiang edited comment on SPARK-42227 at 2/13/23 11:16 AM:
--

Spark version: 3.2.1

Hadoop version: 3.0.0

Job info:
!percentile_approx objectHashAggregateExec.png!

!percentile objectHashAggregateExec.png!

Shuffle read task info:
!percentile_approx objectHashAggregateExec_shuffle_task.png!

!percentile objectHashAggregateExec_shuffle_task.png!


was (Author: JIRAUSER295364):
Spark version: 3.2.1

Hadoop version: 3.0.0

Job info:
!percentile_approx objectHashAggregateExec.png!
Shuffle read task info:
!percentile_approx objectHashAggregateExec_shuffle_task.png!




[jira] [Comment Edited] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2

2023-02-13 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687889#comment-17687889
 ] 

xuanzhiang edited comment on SPARK-42227 at 2/13/23 11:16 AM:
--

Spark version: 3.2.1

Hadoop version: 3.0.0

Here are the job info and the shuffle read task info.


was (Author: JIRAUSER295364):
Spark version: 3.2.1

Hadoop version: 3.0.0

Job info:
!percentile_approx objectHashAggregateExec.png!

!percentile objectHashAggregateExec.png!

Shuffle read task info:
!percentile_approx objectHashAggregateExec_shuffle_task.png!

!percentile objectHashAggregateExec_shuffle_task.png!




[jira] [Updated] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2

2023-02-13 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-42227:
---
Attachment: percentile_approx+objectHashAggregateExec_shuffle_task.png
percentile+objectHashAggregateExec_shuffle_task.png




[jira] [Updated] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2

2023-02-13 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-42227:
---
Attachment: percentile_approx+objectHashAggregateExec.png
percentile+objectHashAggregateExec.png




[jira] [Commented] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2

2023-02-13 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687889#comment-17687889
 ] 

xuanzhiang commented on SPARK-42227:


Spark version: 3.2.1

Hadoop version: 3.0.0

Job info:
!percentile_approx+objectHashAggregateExec.png!
Shuffle read task info:
!percentile_approx+objectHashAggregateExec_shuffle_task.png!

 




[jira] [Comment Edited] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2

2023-02-13 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687852#comment-17687852
 ] 

xuanzhiang edited comment on SPARK-42227 at 2/13/23 9:58 AM:
-

[~gurwls223] percentile is thirty percent faster than approx_percentile, which 
doesn't make sense. I see approx_percentile leaves a long-running shuffle read 
task behind, while percentile is normal. I'll reproduce the problem later.


was (Author: JIRAUSER295364):
The percentile is thirty percent faster than the approx_percentile.  It doesn't 
make sense. I see approx_percentile have a long time shuffle read task left. 
But percentile is normal . I'll repeat the problem later.




[jira] [Comment Edited] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2

2023-02-13 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687852#comment-17687852
 ] 

xuanzhiang edited comment on SPARK-42227 at 2/13/23 9:58 AM:
-

percentile is thirty percent faster than approx_percentile, which doesn't make 
sense. I see approx_percentile leaves a long-running shuffle read task behind, 
while percentile is normal. I'll reproduce the problem later.


was (Author: JIRAUSER295364):
The percentile is thirty percent faster than the approx_percentile.  It doesn't 
make sense. I see approx_percentile 

have a long time shuffle read task left. But percentile is normal . I'll repeat 
the problem later.




[jira] [Commented] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2

2023-02-13 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687852#comment-17687852
 ] 

xuanzhiang commented on SPARK-42227:


percentile is thirty percent faster than approx_percentile, which doesn't make 
sense. I see approx_percentile leaves a long-running shuffle read task behind, 
while percentile is normal. I'll reproduce the problem later.




[jira] [Comment Edited] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0

2023-02-02 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683640#comment-17683640
 ] 

xuanzhiang edited comment on SPARK-40499 at 2/3/23 3:05 AM:


For now we just use PERCENTILE; it uses HashAggregate and the shuffle is normal. 


was (Author: JIRAUSER295364):
now when we use PERCENTILE_APPROX, we need disable objHashAggregate. Or we 
choose use PERCENTILE,it use HashAggregate and shuffle normal.

> Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop: 3.0.0 
> spark:  2.4.0 / 3.2.1
> shuffle:spark 2.4.0
>Reporter: xuanzhiang
>Priority: Major
> Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png
>
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used Spark 2.4.0, aggregation adopted ObjectHashAggregate and 
> stage 2 pulled shuffle data very quickly. But when we use Spark 3.2.1 with 
> the old shuffle service, 140 MB of shuffle data costs 3 hours. 
>  * If we upgrade the shuffle service, will we see a performance regression?






[jira] [Commented] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0

2023-02-02 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683640#comment-17683640
 ] 

xuanzhiang commented on SPARK-40499:


For now, when we use PERCENTILE_APPROX we need to disable ObjectHashAggregate. 
Alternatively we use PERCENTILE, which uses HashAggregate and shuffles normally.
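The workaround described above can be sketched as a session-level setting. This is a minimal sketch, not from the thread: it assumes an existing SparkSession named `spark`; the table name `textData` comes from the issue, and the `GROUP BY` column is illustrative.

```scala
// Steer planning away from ObjectHashAggregateExec so PERCENTILE_APPROX
// falls back to the sort-based aggregate instead.
spark.conf.set("spark.sql.execution.useObjectHashAggregateExec", "false")

val p90 = spark.sql(
  """SELECT Info, PERCENTILE_APPROX(cost, 0.9) AS cost_p90
    |FROM textData
    |GROUP BY Info""".stripMargin)

p90.explain() // check that ObjectHashAggregate no longer appears in the plan
```

This only changes the physical aggregate operator chosen at planning time; the query and its results are unchanged.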




[jira] [Resolved] (SPARK-42292) Spark SQL not use hive partition info

2023-02-02 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang resolved SPARK-42292.

Resolution: Fixed

When I set spark.sql.hive.convertMetastoreParquet=true, Spark 3 uses its 
built-in Parquet reader.
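The resolution above can be sketched as follows. This is a minimal sketch, not from the thread: it assumes an existing SparkSession named `spark`; table `a` with partition columns year/month/day is the one from the issue. With the conversion enabled, the metastore Parquet table is read through Spark's built-in file source, which the reporter found resolved the full scan.

```scala
// Convert Hive metastore Parquet tables to Spark's built-in file source,
// so partition values are resolved from the metastore file index instead of
// reading every file under year = '2022'.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")

val parts = spark.sql(
  "SELECT DISTINCT month, day FROM a WHERE year = '2022'")

parts.explain() // the scan should carry partition filters, not a full table scan
```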

> Spark SQL not use hive partition info
> -
>
> Key: SPARK-42292
> URL: https://issues.apache.org/jira/browse/SPARK-42292
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xuanzhiang
>Priority: Major
>
> I use Spark 3 to count partitions, like this:
> Table a is an external Parquet table with 3 partition columns (year, month, 
> day).
> Query SQL: "select distinct month , day from a where year = '2022' "
> I think Spark can find the Hive metadata and use the partition info, but it 
> loads all of the "year = '2022'" partition data.
> Spark 2.4 uses LocalTableScanExec, but Spark 3 uses HiveTableRelation and 
> scans the Hive Parquet files.
>  






[jira] [Created] (SPARK-42292) Spark SQL not use hive partition info

2023-02-02 Thread xuanzhiang (Jira)
xuanzhiang created SPARK-42292:
--

 Summary: Spark SQL not use hive partition info
 Key: SPARK-42292
 URL: https://issues.apache.org/jira/browse/SPARK-42292
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: xuanzhiang


I use Spark 3 to count partitions, like this:

Table a is an external Parquet table with 3 partition columns (year, month, 
day).

Query SQL: "select distinct month , day from a where year = '2022' "

I think Spark can find the Hive metadata and use the partition info, but it 
loads all of the "year = '2022'" partition data.

Spark 2.4 uses LocalTableScanExec, but Spark 3 uses HiveTableRelation and 
scans the Hive Parquet files.
 






[jira] [Created] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2

2023-01-28 Thread xuanzhiang (Jira)
xuanzhiang created SPARK-42227:
--

 Summary: Use approx_percentile function running slower in spark3 
than spark2
 Key: SPARK-42227
 URL: https://issues.apache.org/jira/browse/SPARK-42227
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: xuanzhiang


approx_percentile(end_ts-start_ts,0.9) cost_p90

In Spark 3 it uses the ObjectHashAggregate method, but the shuffle is very 
slow. When I use percentile instead it becomes fast. I don't know the reason; 
I think approx_percentile should be faster.
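The two aggregates being compared can be sketched side by side. This is a minimal sketch, not from the report: the table name `events` is illustrative, while `end_ts`/`start_ts` are the columns from the report; it assumes an existing SparkSession named `spark`. approx_percentile accepts an optional accuracy argument (default 10000); lowering it shrinks the quantile sketch that is serialized and shuffled between partial and final aggregation, trading precision for shuffle size.

```scala
// Approximate percentile with an explicit, smaller accuracy (smaller sketch).
val approx = spark.sql(
  """SELECT approx_percentile(end_ts - start_ts, 0.9, 1000) AS cost_p90
    |FROM events""".stripMargin)

// Exact percentile, which collects all values per group instead of a sketch.
val exact = spark.sql(
  """SELECT percentile(end_ts - start_ts, 0.9) AS cost_p90
    |FROM events""".stripMargin)
```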






[jira] [Commented] (SPARK-41684) spark3 read the one partition data and write to anthor partition cause error

2023-01-28 Thread xuanzhiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17681657#comment-17681657
 ] 

xuanzhiang commented on SPARK-41684:


You can set spark.sql.hive.convertMetastoreParquet=false.
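Two ways around the "Cannot overwrite a path that is also being read from" error can be sketched as follows. This is a minimal sketch, not from the thread: it assumes an existing SparkSession named `spark`, and the table names `t1`, `stage_detail`, and `target_df` are illustrative.

```scala
// (1) The workaround from the comment above: read via the Hive SerDe path so
// the overwrite target is not resolved to the same file-source relation.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

// (2) A common alternative: materialize the input to a staging table first,
// then overwrite the target partition from the staging copy, so the read and
// the overwrite never touch the same path.
spark.sql("CREATE TABLE stage_detail AS SELECT * FROM t1")
spark.sql(
  """INSERT OVERWRITE TABLE target_df PARTITION (dt = '2022-12-21')
    |SELECT * FROM stage_detail DISTRIBUTE BY rand()""".stripMargin)
```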

> spark3 read the one partition data and write to anthor partition cause error
> 
>
> Key: SPARK-41684
> URL: https://issues.apache.org/jira/browse/SPARK-41684
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2
>Reporter: sinlang
>Priority: Major
>
> spark3 read the one partition data and write to anthor partition cause error
> {code:java}
> 1 create temporary view t1 :
>  select * from jt_ods.ods_ebi_stm_retail_settle_detail_full_di 
> where dt = '2022-12-21' 
> union all ( 
> select * from jt_ods.ods_ebi_stm_retail_settle_detail_full_df as i 
> where i.dt = '2022-12-20'  
> and not exists(select 1 from jt_ods.ods_ebi_stm_retail_settle_detail_full_di 
> as d where  d.dt = '2022-12-21' and i.id = d.id))
> 2 insert data :
>  insert sql insert overwrite table 
> jt_ods.ods_ebi_stm_retail_settle_detail_full_df partition(dt = '2022-12-21') 
> select * from t distribute by rand() {code}
> {code:java}
> 2022-12-22 16:29:48 Driver ERROR 
> org.apache.spark.deploy.yarn.ApplicationMaster 
>  User class threw exception: org.apache.spark.sql.AnalysisException: Cannot 
> overwrite a path that is also being read from.
> org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is also 
> being read from.
>     at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.cannotOverwritePathBeingReadFromError(QueryCompilationErrors.scala:1834)
>     at 
> org.apache.spark.sql.execution.command.DDLUtils$.verifyNotReadPath(ddl.scala:980)
>     at 
> org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:221)
>  {code}






[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Priority: Blocker  (was: Major)




[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Priority: Major  (was: Blocker)




[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Priority: Blocker  (was: Minor)




[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Environment: 
hadoop: 3.0.0 

spark:  2.4.0 / 3.2.1

shuffle:spark 2.4.0

  was:
hadoop 3.0.0 

spark2.4.0 / spark3.2.1

shuffle: spark2.4.0





[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Attachment: spark3.2-shuffle-data.png




[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Description: 
spark.sql(
      s"""
         |SELECT
         | Info ,
         | PERCENTILE_APPROX(cost,0.5) cost_p50,
         | PERCENTILE_APPROX(cost,0.9) cost_p90,
         | PERCENTILE_APPROX(cost,0.95) cost_p95,
         | PERCENTILE_APPROX(cost,0.99) cost_p99,
         | PERCENTILE_APPROX(cost,0.999) cost_p999
         |FROM
         | textData
         |""".stripMargin)
 * When we used Spark 2.4.0, aggregation adopted ObjectHashAggregate and stage 
2 pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old 
shuffle service, 140 MB of shuffle data costs 3 hours. 

 * If we upgrade the shuffle service, will we see a performance regression?

  was:
spark.sql(
      s"""
         |SELECT
         | Info ,
         | PERCENTILE_APPROX(cost,0.5) cost_p50,
         | PERCENTILE_APPROX(cost,0.9) cost_p90,
         | PERCENTILE_APPROX(cost,0.95) cost_p95,
         | PERCENTILE_APPROX(cost,0.99) cost_p99,
         | PERCENTILE_APPROX(cost,0.999) cost_p999
         |FROM
         | textData
         |""".stripMargin)
 * When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2 
pull shuffle data very quick . but , when we use spark 3.2.1 and use old 
shuffle , 140M shuffle data cost 3 hours. 

 *  





[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Attachment: spark2.4-shuffle-data.png




[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Environment: 
hadoop 3.0.0 

spark2.4.0 / spark3.2.1

shuffle: spark2.4.0

  was:!image-2022-09-20-16-57-01-881.png!





[jira] [Created] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)
xuanzhiang created SPARK-40499:
--

 Summary: Spark 3.2.1 percentlie_approx query much slower than 
Spark 2.4.0
 Key: SPARK-40499
 URL: https://issues.apache.org/jira/browse/SPARK-40499
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 3.2.1
 Environment: !image-2022-09-20-16-57-01-881.png!
Reporter: xuanzhiang


spark.sql(
      s"""
         |SELECT
         | Info ,
         | PERCENTILE_APPROX(cost,0.5) cost_p50,
         | PERCENTILE_APPROX(cost,0.9) cost_p90,
         | PERCENTILE_APPROX(cost,0.95) cost_p95,
         | PERCENTILE_APPROX(cost,0.99) cost_p99,
         | PERCENTILE_APPROX(cost,0.999) cost_p999
         |FROM
         | textData
         |""".stripMargin)
 * When we used Spark 2.4.0, aggregation adopted ObjectHashAggregate and stage 
2 pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old 
shuffle service, 140 MB of shuffle data costs 3 hours. 


