[jira] [Updated] (SPARK-48580) Add consistency check and fallback for mapIds in push-merged block meta

2024-06-11 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-48580:
---
Parent: SPARK-33235
Issue Type: Sub-task  (was: Bug)

> Add consistency check and fallback for mapIds in push-merged block meta
> ---
>
> Key: SPARK-48580
> URL: https://issues.apache.org/jira/browse/SPARK-48580
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Priority: Major
> Attachments: image-2024-06-11-10-19-57-227.png
>
>
> When push-based shuffle is enabled, 0.03% of the Spark applications in our 
> cluster experienced shuffle data loss. The metrics of the Exchange are as follows:
> !image-2024-06-11-10-19-57-227.png|width=405,height=170!
> We eventually found some WARN logs on the shuffle server:
>  
> {code:java}
> WARN shuffle-server-8-216 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver: Application 
> application_ shuffleId 0 shuffleMergeId 0 reduceId 133 update to 
> index/meta failed{code}
>  
> We analyzed the cause from the code:
> The merge metadata obtained by the reduce side from the driver comes from the 
> {{mapTracker}} held in driver memory, while the actual reading of chunk data 
> is based on the records in the shuffle server's {{metaFile}}. There is no 
> consistency check between the two.
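The issue title proposes adding a consistency check and fallback for mapIds. A minimal sketch of what such a check could look like, using hypothetical names ({{mapIdsFromDriver}} for the bitmap the driver-side tracker reports, {{chunkBitmapsFromMetaFile}} for the per-chunk bitmaps read from the shuffle server's meta file); this is only an illustration, not the actual Spark change:

{code:java}
import org.roaringbitmap.RoaringBitmap;

// Illustrative sketch only; method and parameter names are hypothetical and
// are not the actual Spark APIs touched by this issue.
final class MergedBlockMetaCheck {

  /**
   * Unions the per-chunk mapId bitmaps recorded in the shuffle server's meta
   * file and verifies that they cover every mapId the driver-side tracker
   * reports as merged for this reduce partition.
   */
  static boolean metaCoversDriverMapIds(RoaringBitmap mapIdsFromDriver,
                                        RoaringBitmap[] chunkBitmapsFromMetaFile) {
    RoaringBitmap mapIdsInMeta = new RoaringBitmap();
    for (RoaringBitmap chunkBitmap : chunkBitmapsFromMetaFile) {
      mapIdsInMeta.or(chunkBitmap);
    }
    // mapIds reported by the driver but absent from the meta file indicate
    // missing chunks; andNot(...) is empty only when nothing is missing.
    return RoaringBitmap.andNot(mapIdsFromDriver, mapIdsInMeta).isEmpty();
  }
}
{code}

When the check fails, the reader would fall back to fetching the original (unmerged) shuffle blocks for that partition, the same fallback push-based shuffle already uses for other merged-block fetch failures.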



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48580) Add consistency check and fallback for mapIds in push-merged block meta

2024-06-11 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-48580:
---
Summary: Add consistency check and fallback for mapIds in push-merged block 
meta  (was: MergedBlock read by reduce have missing chunks, leading to 
inconsistent shuffle data)

> Add consistency check and fallback for mapIds in push-merged block meta
> ---
>
> Key: SPARK-48580
> URL: https://issues.apache.org/jira/browse/SPARK-48580
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Priority: Major
> Attachments: image-2024-06-11-10-19-57-227.png
>
>
> When push-based shuffle is enabled, 0.03% of the Spark applications in our 
> cluster experienced shuffle data loss. The metrics of the Exchange are as follows:
> !image-2024-06-11-10-19-57-227.png|width=405,height=170!
> We eventually found some WARN logs on the shuffle server:
>  
> {code:java}
> WARN shuffle-server-8-216 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver: Application 
> application_ shuffleId 0 shuffleMergeId 0 reduceId 133 update to 
> index/meta failed{code}
>  
> We analyzed the cause from the code:
> The merge metadata obtained by the reduce side from the driver comes from the 
> {{mapTracker}} held in driver memory, while the actual reading of chunk data 
> is based on the records in the shuffle server's {{metaFile}}. There is no 
> consistency check between the two.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48580) MergedBlock read by reduce have missing chunks, leading to inconsistent shuffle data

2024-06-10 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-48580:
---
Description: 
When push-based shuffle is enabled, 0.03% of the Spark applications in our cluster 
experienced shuffle data loss. The metrics of the Exchange are as follows:

!image-2024-06-11-10-19-57-227.png|width=405,height=170!

We eventually found some WARN logs on the shuffle server:
 
{code:java}
WARN shuffle-server-8-216 
org.apache.spark.network.shuffle.RemoteBlockPushResolver: Application 
application_ shuffleId 0 shuffleMergeId 0 reduceId 133 update to index/meta 
failed{code}
 

We analyzed the cause from the code:

The merge metadata obtained by the reduce side from the driver comes from the 
{{mapTracker}} held in driver memory, while the actual reading of chunk data 
is based on the records in the shuffle server's {{metaFile}}. There is no 
consistency check between the two.

  was:
When push-based shuffle is enabled, 0.03% of the Spark applications in our cluster 
experienced shuffle data loss. The metrics for the job execution plan's 
Exchange are as follows:

!image-2024-06-11-10-19-57-227.png|width=405,height=170!


We eventually found some WARN logs on the shuffle server:
 
{code:java}
WARN shuffle-server-8-216 
org.apache.spark.network.shuffle.RemoteBlockPushResolver: Application 
application_ shuffleId 0 shuffleMergeId 0 reduceId 133 update to index/meta 
failed{code}
 

We analyzed the cause from the code:

The merge metadata obtained by the reduce side from the driver comes from the 
{{mapTracker}} held in driver memory, while the actual reading of chunk data 
is based on the records in the shuffle server's {{metaFile}}. There is no 
consistency check between the two.


> MergedBlock read by reduce have missing chunks, leading to inconsistent 
> shuffle data
> 
>
> Key: SPARK-48580
> URL: https://issues.apache.org/jira/browse/SPARK-48580
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Priority: Major
> Attachments: image-2024-06-11-10-19-57-227.png
>
>
> When push-based shuffle is enabled, 0.03% of the Spark applications in our 
> cluster experienced shuffle data loss. The metrics of the Exchange are as follows:
> !image-2024-06-11-10-19-57-227.png|width=405,height=170!
> We eventually found some WARN logs on the shuffle server:
>  
> {code:java}
> WARN shuffle-server-8-216 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver: Application 
> application_ shuffleId 0 shuffleMergeId 0 reduceId 133 update to 
> index/meta failed{code}
>  
> We analyzed the cause from the code:
> The merge metadata obtained by the reduce side from the driver comes from the 
> {{mapTracker}} held in driver memory, while the actual reading of chunk data 
> is based on the records in the shuffle server's {{metaFile}}. There is no 
> consistency check between the two.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48580) MergedBlock read by reduce have missing chunks, leading to inconsistent shuffle data

2024-06-10 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-48580:
---
Description: 
When push-based shuffle is enabled, 0.03% of the Spark applications in our cluster 
experienced shuffle data loss. The metrics for the job execution plan's 
Exchange are as follows:

!image-2024-06-11-10-19-57-227.png|width=405,height=170!


We eventually found some WARN logs on the shuffle server:
 
{code:java}
WARN shuffle-server-8-216 
org.apache.spark.network.shuffle.RemoteBlockPushResolver: Application 
application_ shuffleId 0 shuffleMergeId 0 reduceId 133 update to index/meta 
failed{code}
 

We analyzed the cause from the code:

The merge metadata obtained by the reduce side from the driver comes from the 
{{mapTracker}} held in driver memory, while the actual reading of chunk data 
is based on the records in the shuffle server's {{metaFile}}. There is no 
consistency check between the two.

> MergedBlock read by reduce have missing chunks, leading to inconsistent 
> shuffle data
> 
>
> Key: SPARK-48580
> URL: https://issues.apache.org/jira/browse/SPARK-48580
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Priority: Major
> Attachments: image-2024-06-11-10-19-57-227.png
>
>
> When push-based shuffle is enabled, 0.03% of the Spark applications in our 
> cluster experienced shuffle data loss. The metrics for the job execution 
> plan's Exchange are as follows:
> !image-2024-06-11-10-19-57-227.png|width=405,height=170!
> We eventually found some WARN logs on the shuffle server:
>  
> {code:java}
> WARN shuffle-server-8-216 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver: Application 
> application_ shuffleId 0 shuffleMergeId 0 reduceId 133 update to 
> index/meta failed{code}
>  
> We analyzed the cause from the code:
> The merge metadata obtained by the reduce side from the driver comes from the 
> {{mapTracker}} held in driver memory, while the actual reading of chunk data 
> is based on the records in the shuffle server's {{metaFile}}. There is no 
> consistency check between the two.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48580) MergedBlock read by reduce have missing chunks, leading to inconsistent shuffle data

2024-06-10 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-48580:
---
Attachment: (was: image-2024-06-11-10-19-22-284.png)

> MergedBlock read by reduce have missing chunks, leading to inconsistent 
> shuffle data
> 
>
> Key: SPARK-48580
> URL: https://issues.apache.org/jira/browse/SPARK-48580
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Priority: Major
> Attachments: image-2024-06-11-10-19-57-227.png
>
>
> When push-based shuffle is enabled, 0.03% of the Spark applications in our 
> cluster experienced shuffle data loss. The metrics for the job execution 
> plan's Exchange are as follows:
> !image-2024-06-11-10-19-57-227.png|width=405,height=170!
> We eventually found some WARN logs on the shuffle server:
>  
> {code:java}
> WARN shuffle-server-8-216 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver: Application 
> application_ shuffleId 0 shuffleMergeId 0 reduceId 133 update to 
> index/meta failed{code}
>  
> We analyzed the cause from the code:
> The merge metadata obtained by the reduce side from the driver comes from the 
> {{mapTracker}} held in driver memory, while the actual reading of chunk data 
> is based on the records in the shuffle server's {{metaFile}}. There is no 
> consistency check between the two.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48580) MergedBlock read by reduce have missing chunks, leading to inconsistent shuffle data

2024-06-10 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-48580:
---
Attachment: image-2024-06-11-10-19-57-227.png

> MergedBlock read by reduce have missing chunks, leading to inconsistent 
> shuffle data
> 
>
> Key: SPARK-48580
> URL: https://issues.apache.org/jira/browse/SPARK-48580
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Priority: Major
> Attachments: image-2024-06-11-10-19-22-284.png, 
> image-2024-06-11-10-19-57-227.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48580) MergedBlock read by reduce have missing chunks, leading to inconsistent shuffle data

2024-06-10 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-48580:
---
Attachment: image-2024-06-11-10-19-22-284.png

> MergedBlock read by reduce have missing chunks, leading to inconsistent 
> shuffle data
> 
>
> Key: SPARK-48580
> URL: https://issues.apache.org/jira/browse/SPARK-48580
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Priority: Major
> Attachments: image-2024-06-11-10-19-22-284.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48580) MergedBlock read by reduce have missing chunks, leading to inconsistent shuffle data

2024-06-10 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-48580:
---
Summary: MergedBlock read by reduce have missing chunks, leading to 
inconsistent shuffle data  (was: The merge blocks read by reduce have missing 
chunks, leading to inconsistent shuffle data)

> MergedBlock read by reduce have missing chunks, leading to inconsistent 
> shuffle data
> 
>
> Key: SPARK-48580
> URL: https://issues.apache.org/jira/browse/SPARK-48580
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48580) The merge blocks read by reduce have missing chunks, leading to inconsistent shuffle data

2024-06-10 Thread gaoyajun02 (Jira)
gaoyajun02 created SPARK-48580:
--

 Summary: The merge blocks read by reduce have missing chunks, 
leading to inconsistent shuffle data
 Key: SPARK-48580
 URL: https://issues.apache.org/jira/browse/SPARK-48580
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 3.5.0, 3.4.0, 3.3.0, 3.2.0
Reporter: gaoyajun02






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42694) Data duplication and loss occur after executing 'insert overwrite...' in Spark 3.1.1

2024-05-14 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846293#comment-17846293
 ] 

gaoyajun02 commented on SPARK-42694:


Have you enabled push-based shuffle?
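For reference, push-based shuffle is governed by the {{spark.shuffle.push.enabled}} flag (added in Spark 3.2.0) together with the external shuffle service; a minimal, illustrative way to check both from an application:

{code:java}
import org.apache.spark.SparkConf;

public class CheckPushBasedShuffle {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf();
    // Both flags must be set for push-based shuffle to take effect.
    boolean pushEnabled = conf.getBoolean("spark.shuffle.push.enabled", false);
    boolean serviceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false);
    System.out.println("push-based shuffle enabled: " + (pushEnabled && serviceEnabled));
  }
}
{code}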

> Data duplication and loss occur after executing 'insert overwrite...' in 
> Spark 3.1.1
> 
>
> Key: SPARK-42694
> URL: https://issues.apache.org/jira/browse/SPARK-42694
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
> Environment: Spark 3.1.1
> Hadoop 3.2.1
> Hive 3.1.2
>Reporter: FengZhou
>Priority: Critical
>  Labels: shuffle, spark
> Attachments: image-2023-03-07-15-59-08-818.png, 
> image-2023-03-07-15-59-27-665.png
>
>
> We are currently using Spark version 3.1.1 in our production environment. We 
> have noticed that occasionally, after executing 'insert overwrite ... 
> select', the resulting data is inconsistent, with some data being duplicated 
> or lost. This issue does not occur all the time and seems to be more 
> prevalent on large tables with tens of millions of records.
> We compared the execution plans for two runs of the same SQL and found that 
> they were identical. In the case where the SQL was executed successfully, the 
> amount of data being written and read during the shuffle stage was the same. 
> However, in the case where the problem occurred, the amount of data being 
> written and read during the shuffle stage was different. Please see the 
> attached screenshots for the write/read data during shuffle stage.
>  
> Normal SQL:
> !image-2023-03-07-15-59-08-818.png!
> SQL with issues:
> !image-2023-03-07-15-59-27-665.png!
>  
> Is this problem caused by a bug in version 3.1.1, specifically (SPARK-34534): 
> 'New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss 
> or correctness'? Or is it caused by something else? What could be the root 
> cause of this problem?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45134) Data duplication may occur when fallback to origin shuffle block

2023-09-17 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-45134:
---
Priority: Critical  (was: Blocker)

> Data duplication may occur when fallback to origin shuffle block
> 
>
> Key: SPARK-45134
> URL: https://issues.apache.org/jira/browse/SPARK-45134
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Priority: Critical
>
> One possible situation that has been found is that, during the process of 
> requesting mergedBlockMeta, when the channel is closed, the failure callback may 
> be triggered twice, resulting in duplicate data for the original shuffle blocks.
>  # The first time is when the channel becomes inactive and the responseHandler 
> executes the callback for all outstandingRpcs.
>  # The second time is when the listener registered by 
> shuffleClient.writeAndFlush executes the callback after the channel is closed.
> Some Error Logs:
> {code:java}
> 23/09/08 09:22:21 ERROR shuffle-client-7-1 TransportResponseHandler: Still 
> have 1 requests outstanding when connection from host/ip:prot is closed
> 23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to 
> get the meta of push-merged block for (3, 54) from host:port
> java.io.IOException: Connection from host:port closed
>         at 
> org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:147)
>         at 
> org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:117)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>         at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
>         at 
> io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>         at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
>         at 
> org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:225)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>         at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
>         at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
>         at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>         at 
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>         at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(Thread.java:745)
>  
> 23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to 
> get the meta of push-merged block for (3, 54) from host:port
> java.io.IOException: Failed to send RPC RPC 8079698359363123411 to 
> host/ip:port: java.nio.channels.ClosedChannelException
>         at 
> org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.ja

[jira] [Updated] (SPARK-45134) Data duplication may occur when fallback to origin shuffle block

2023-09-17 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-45134:
---
Priority: Blocker  (was: Critical)

> Data duplication may occur when fallback to origin shuffle block
> 
>
> Key: SPARK-45134
> URL: https://issues.apache.org/jira/browse/SPARK-45134
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Priority: Blocker
>
> One possible situation that has been found is that, during the process of 
> requesting mergedBlockMeta, when the channel is closed, the failure callback may 
> be triggered twice, resulting in duplicate data for the original shuffle blocks.
>  # The first time is when the channel becomes inactive and the responseHandler 
> executes the callback for all outstandingRpcs.
>  # The second time is when the listener registered by 
> shuffleClient.writeAndFlush executes the callback after the channel is closed.
> Some Error Logs:
> {code:java}
> 23/09/08 09:22:21 ERROR shuffle-client-7-1 TransportResponseHandler: Still 
> have 1 requests outstanding when connection from host/ip:prot is closed
> 23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to 
> get the meta of push-merged block for (3, 54) from host:port
> java.io.IOException: Connection from host:port closed
>         at 
> org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:147)
>         at 
> org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:117)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>         at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
>         at 
> io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>         at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
>         at 
> org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:225)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>         at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
>         at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
>         at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>         at 
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>         at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(Thread.java:745)
>  
> 23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to 
> get the meta of push-merged block for (3, 54) from host:port
> java.io.IOException: Failed to send RPC RPC 8079698359363123411 to 
> host/ip:port: java.nio.channels.ClosedChannelException
>         at 
> org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.jav

[jira] [Updated] (SPARK-45134) Data duplication may occur when fallback to origin shuffle block

2023-09-17 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-45134:
---
Priority: Critical  (was: Major)

> Data duplication may occur when fallback to origin shuffle block
> 
>
> Key: SPARK-45134
> URL: https://issues.apache.org/jira/browse/SPARK-45134
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Priority: Critical
>
> One possible situation that has been found is that, during the process of 
> requesting mergedBlockMeta, when the channel is closed, the failure callback may 
> be triggered twice, resulting in duplicate data for the original shuffle blocks.
>  # The first time is when the channel becomes inactive and the responseHandler 
> executes the callback for all outstandingRpcs.
>  # The second time is when the listener registered by 
> shuffleClient.writeAndFlush executes the callback after the channel is closed.
> Some Error Logs:
> {code:java}
> 23/09/08 09:22:21 ERROR shuffle-client-7-1 TransportResponseHandler: Still 
> have 1 requests outstanding when connection from host/ip:prot is closed
> 23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to 
> get the meta of push-merged block for (3, 54) from host:port
> java.io.IOException: Connection from host:port closed
>         at 
> org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:147)
>         at 
> org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:117)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>         at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
>         at 
> io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>         at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
>         at 
> org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:225)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>         at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
>         at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
>         at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>         at 
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>         at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(Thread.java:745)
>  
> 23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to 
> get the meta of push-merged block for (3, 54) from host:port
> java.io.IOException: Failed to send RPC RPC 8079698359363123411 to 
> host/ip:port: java.nio.channels.ClosedChannelException
>         at 
> org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java

[jira] [Updated] (SPARK-45134) Data duplication may occur when fallback to origin shuffle block

2023-09-17 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-45134:
---
Affects Version/s: 3.5.0
   3.4.0
   3.3.0

> Data duplication may occur when fallback to origin shuffle block
> 
>
> Key: SPARK-45134
> URL: https://issues.apache.org/jira/browse/SPARK-45134
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Priority: Major
>
> One possible situation that has been found is that, during the process of 
> requesting mergedBlockMeta, when the channel is closed, the failure callback may 
> be triggered twice, resulting in duplicate data for the original shuffle blocks.
>  # The first time is when the channel becomes inactive and the responseHandler 
> executes the callback for all outstandingRpcs.
>  # The second time is when the listener registered by 
> shuffleClient.writeAndFlush executes the callback after the channel is closed.
> Some Error Logs:
> {code:java}
> 23/09/08 09:22:21 ERROR shuffle-client-7-1 TransportResponseHandler: Still 
> have 1 requests outstanding when connection from host/ip:prot is closed
> 23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to 
> get the meta of push-merged block for (3, 54) from host:port
> java.io.IOException: Connection from host:port closed
>         at 
> org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:147)
>         at 
> org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:117)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>         at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
>         at 
> io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>         at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
>         at 
> org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:225)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>         at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
>         at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
>         at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>         at 
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>         at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(Thread.java:745)
>  
> 23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to 
> get the meta of push-merged block for (3, 54) from host:port
> java.io.IOException: Failed to send RPC RPC 8079698359363123411 to 
> host/ip:port: java.nio.channels.ClosedChannelException
>         at 
> org.apache.spark.network.client.TransportClient$RpcCha

[jira] [Commented] (SPARK-45134) Data duplication may occur when fallback to origin shuffle block

2023-09-12 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764130#comment-17764130
 ] 

gaoyajun02 commented on SPARK-45134:


Hi [~csingh] [~vsowrirajan] [~mshen], can you take a look? I would appreciate any 
suggestions.

> Data duplication may occur when fallback to origin shuffle block
> 
>
> Key: SPARK-45134
> URL: https://issues.apache.org/jira/browse/SPARK-45134
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: gaoyajun02
>Priority: Major
>
> One possible situation that has been found is that, during the process of 
> requesting mergedBlockMeta, when the channel is closed, the failure callback may 
> be triggered twice, resulting in duplicate data for the original shuffle blocks.
>  # The first time is when the channel becomes inactive and the responseHandler 
> executes the callback for all outstandingRpcs.
>  # The second time is when the listener registered by 
> shuffleClient.writeAndFlush executes the callback after the channel is closed.
> Some Error Logs:
> {code:java}
> 23/09/08 09:22:21 ERROR shuffle-client-7-1 TransportResponseHandler: Still 
> have 1 requests outstanding when connection from host/ip:prot is closed
> 23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to 
> get the meta of push-merged block for (3, 54) from host:port
> java.io.IOException: Connection from host:port closed
>         at 
> org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:147)
>         at 
> org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:117)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>         at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
>         at 
> io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>         at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
>         at 
> org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:225)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>         at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>         at 
> io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
>         at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
>         at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>         at 
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>         at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(Thread.java:745)
>  
> 23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to 
> get the meta of push-merged block for (3, 54) from host:port
> java.io.IOException: Failed to send RPC RPC 8079698359363123411 to 
> host/ip:port: java.nio.channels.ClosedChannelException
>         at 
> org.apache.spark.ne

[jira] [Updated] (SPARK-45134) Data duplication may occur when fallback to origin shuffle block

2023-09-12 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-45134:
---
Description: 
One possible situation that has been found is that, during the process of 
requesting mergedBlockMeta, when the channel is closed, the failure callback may 
be triggered twice, resulting in duplicate data for the original shuffle blocks.
 # The first time is when the channel becomes inactive and the responseHandler 
executes the callback for all outstandingRpcs.
 # The second time is when the listener registered by 
shuffleClient.writeAndFlush executes the callback after the channel is closed 
(see the sketch below).
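One minimal way to make that failure path idempotent, sketched here with hypothetical names (an illustration under the assumption that both paths share one callback object, not the actual Spark fix):

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical wrapper, not the actual Spark change: it ensures the failure
// callback registered for a mergedBlockMeta RPC runs at most once, whether it
// is triggered by channelInactive draining outstanding RPCs or by the
// writeAndFlush listener firing after the channel is closed.
final class OnceFailureCallback {
  private final AtomicBoolean invoked = new AtomicBoolean(false);
  private final Runnable fallbackToOriginalBlocks;

  OnceFailureCallback(Runnable fallbackToOriginalBlocks) {
    this.fallbackToOriginalBlocks = fallbackToOriginalBlocks;
  }

  void onFailure(Throwable cause) {
    // compareAndSet lets only the first caller perform the fallback, so the
    // original shuffle blocks are not added to the fetch queue twice.
    if (invoked.compareAndSet(false, true)) {
      fallbackToOriginalBlocks.run();
    }
  }
}
{code}

Whichever path fires first wins the compareAndSet, so the fallback to the original shuffle blocks is registered only once.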

Some Error Logs:
{code:java}
23/09/08 09:22:21 ERROR shuffle-client-7-1 TransportResponseHandler: Still have 
1 requests outstanding when connection from host/ip:prot is closed
23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to get 
the meta of push-merged block for (3, 54) from host:port
java.io.IOException: Connection from host:port closed
        at 
org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:147)
        at 
org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:117)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
        at 
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
        at 
io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
        at 
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
        at 
org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:225)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
        at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at 
io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
        at 
io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
        at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:745)
 
23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to get 
the meta of push-merged block for (3, 54) from host:port
java.io.IOException: Failed to send RPC RPC 8079698359363123411 to 
host/ip:port: java.nio.channels.ClosedChannelException
        at 
org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:433)
        at 
org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:409)
        at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
        at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
        at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
        at 
io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
        at 
io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
        at 
io.netty.util.conc

[jira] [Updated] (SPARK-45134) Data duplication may occur when fallback to origin shuffle block

2023-09-12 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-45134:
---
Description: 
One possible situation that has been found is that, during the process of 
requesting mergedBlockMeta, when the channel is closed, the failure callback may 
be triggered twice, resulting in duplicate data for the original shuffle blocks.
 # The first time is when the channel becomes inactive and the responseHandler 
executes the callback for all outstandingRpcs.
 # The second time is when the listener registered by 
shuffleClient.writeAndFlush executes the callback after the channel is closed.

Some Error Logs
{code:java}
// code placeholder
23/09/08 09:22:21 ERROR shuffle-client-7-1 TransportResponseHandler: Still have 
1 requests outstanding when connection from host/ip:prot is closed
23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to get 
the meta of push-merged block for (3, 54) from host:port
java.io.IOException: Connection from host:port closed
        at 
org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:147)
        at 
org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:117)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
        at 
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
        at 
io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
        at 
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
        at 
org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:225)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
        at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at 
io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
        at 
io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
        at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:745)
 
23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to get 
the meta of push-merged block for (3, 54) from host:port
java.io.IOException: Failed to send RPC RPC 8079698359363123411 to 
host/ip:port: java.nio.channels.ClosedChannelException
        at 
org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:433)
        at 
org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:409)
        at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
        at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
        at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
        at 
io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
        at 
io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
        at 

[jira] [Updated] (SPARK-45134) Data duplication may occur when fallback to origin shuffle block

2023-09-12 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-45134:
---
Description: 
One possible situation that has been found is that, during the process of 
requesting mergedBlockMeta, when the channel is closed, the failure callback may 
be triggered twice, resulting in duplicate data for the original shuffle blocks.
 # The first time is when the channel becomes inactive and the responseHandler 
executes the callback for all outstandingRpcs.
 # The second time is when the listener registered by 
shuffleClient.writeAndFlush executes the callback after the channel is closed.

Some Error Logs

 

  was:
One possible situation that has been found is that, during the process of 
requesting mergedBlockMeta, when the channel is closed, the failure callback may 
be triggered twice, resulting in duplicate data for the original shuffle blocks.
 # The first time is when the channel becomes inactive and the responseHandler 
executes the callback for all outstandingRpcs.
 # The second time is when the listener registered by 
shuffleClient.writeAndFlush executes the callback after the channel is closed.

Some Error Logs

```

23/09/08 09:22:21 ERROR shuffle-client-7-1 TransportResponseHandler: Still have 
1 requests outstanding when connection from host/ip:prot is closed
23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to get 
the meta of push-merged block for (3, 54) from host:port
java.io.IOException: Connection from host:port closed
        at 
org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:147)
        at 
org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:117)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
        at 
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
        at 
io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
        at 
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
        at 
org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:225)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
        at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at 
io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
        at 
io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
        at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:745)

 

23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to get 
the meta of push-merged block for (3, 54) from host:port
java.io.IOException: Failed to send RPC RPC 8079698359363123411 to 
host/ip:port: java.nio.channels.ClosedChannelException
        at 
org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:433)
        at 
org.apache.spark.network.client.TransportClient$StdChannelListener.operationCompl

[jira] [Created] (SPARK-45134) Data duplication may occur when fallback to origin shuffle block

2023-09-12 Thread gaoyajun02 (Jira)
gaoyajun02 created SPARK-45134:
--

 Summary: Data duplication may occur when fallback to origin 
shuffle block
 Key: SPARK-45134
 URL: https://issues.apache.org/jira/browse/SPARK-45134
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 3.2.0
Reporter: gaoyajun02


One situation we have found is that, while requesting mergedBlockMeta, a closed 
channel may trigger the same callback twice, resulting in duplicate data from the 
original shuffle blocks (see the sketch after this list).
 # The first invocation happens when the channel becomes inactive: the 
responseHandler executes the callback for all outstandingRpcs.
 # The second invocation happens when the listener registered by 
shuffleClient.writeAndFlush executes the callback after the channel is closed.
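
A minimal sketch of one possible mitigation, assuming a hypothetical wrapper 
class (the names below are illustrative, not the actual Spark patch): make the 
merged-meta failure callback one-shot, so whichever failure path fires first 
performs the fallback and the second invocation is ignored.
{code:java}
// Hypothetical guard, not Spark's actual fix: allow at most one fallback per
// merged-meta request, even if two failure paths both invoke the callback.
import java.util.concurrent.atomic.AtomicBoolean;

final class OneShotMergedMetaCallback {
  private final AtomicBoolean invoked = new AtomicBoolean(false);
  private final Runnable fallbackToOriginalBlocks;

  OneShotMergedMetaCallback(Runnable fallbackToOriginalBlocks) {
    this.fallbackToOriginalBlocks = fallbackToOriginalBlocks;
  }

  void onFailure(Throwable cause) {
    // First caller wins: either channelInactive or the writeAndFlush listener,
    // but never both, triggers the fallback to the original shuffle blocks.
    if (invoked.compareAndSet(false, true)) {
      fallbackToOriginalBlocks.run();
    }
  }
}
{code}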

Some Error Logs

```

23/09/08 09:22:21 ERROR shuffle-client-7-1 TransportResponseHandler: Still have 
1 requests outstanding when connection from host/ip:port is closed
23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to get 
the meta of push-merged block for (3, 54) from host:port
java.io.IOException: Connection from host:port closed
        at 
org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:147)
        at 
org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:117)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
        at 
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
        at 
io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
        at 
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
        at 
org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:225)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
        at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at 
io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
        at 
io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
        at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:745)

 

23/09/08 09:22:21 ERROR shuffle-client-7-1 PushBasedFetchHelper: Failed to get 
the meta of push-merged block for (3, 54) from host:port
java.io.IOException: Failed to send RPC RPC 8079698359363123411 to 
host/ip:port: java.nio.channels.ClosedChannelException
        at 
org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:433)
        at 
org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:409)
        at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
        at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
        at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
        at 
io

[jira] [Commented] (SPARK-42203) JsonProtocol should skip logging of push-based shuffle read metrics when push-based shuffle is disabled

2023-07-23 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746225#comment-17746225
 ] 

gaoyajun02 commented on SPARK-42203:


ping [~thejdeep], when can you resolve it?

> JsonProtocol should skip logging of push-based shuffle read metrics when 
> push-based shuffle is disabled
> ---
>
> Key: SPARK-42203
> URL: https://issues.apache.org/jira/browse/SPARK-42203
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> This is a followup to SPARK-36620:
> When push-based shuffle is disabled (the default), I think that we should 
> skip the logging of the new push-based shuffle read metrics. Because these 
> metrics are logged for every task, they will add significant additional size 
> to Spark event logs. It would be great to avoid this cost in cases where it's 
> not necessary.
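
A rough illustration of the intent, in plain Java with made-up field names 
(Spark's actual JsonProtocol is not reproduced here): only emit the push-based 
shuffle read metrics fields when the feature is enabled, so default-configured 
clusters do not pay the per-task event-log cost.
{code:java}
// Illustrative only: emit optional metrics fields conditionally.
import java.util.LinkedHashMap;
import java.util.Map;

final class ShuffleReadMetricsJson {
  static Map<String, Object> toJsonFields(long recordsRead,
                                          long mergedFetchFallbackCount,
                                          boolean pushBasedShuffleEnabled) {
    Map<String, Object> fields = new LinkedHashMap<>();
    fields.put("Records Read", recordsRead);
    if (pushBasedShuffleEnabled) {
      // Only meaningful (and only worth the event-log bytes) when push-based
      // shuffle is actually on.
      fields.put("Merged Fetch Fallback Count", mergedFetchFallbackCount);
    }
    return fields;
  }
}
{code}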



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43864) Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL

2023-06-02 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728724#comment-17728724
 ] 

gaoyajun02 commented on SPARK-43864:


It looks like a series of test package dependencies needs to be changed. I'm not 
very familiar with these; can you take a look, [~panbingkun]?

> Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 
> 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL
> 
>
> Key: SPARK-43864
> URL: https://issues.apache.org/jira/browse/SPARK-43864
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: gaoyajun02
>Priority: Minor
>
> CVE-2023-26119 Detail: [https://nvd.nist.gov/vuln/detail/CVE-2023-26119]
> It is recommended to replace 'net.sourceforge.htmlunit' with 'org.htmlunit' in 
> Spark:
> {code:java}
>     <dependency>
>       <groupId>org.htmlunit</groupId>
>       <artifactId>htmlunit</artifactId>
>       <scope>test</scope>
>     </dependency>
>     <dependency>
>       <groupId>org.htmlunit</groupId>
>       <artifactId>htmlunit-core-js</artifactId>
>       <scope>test</scope>
>     </dependency>
> {code}
> see: [https://www.htmlunit.org/migration.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43864) Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL

2023-05-30 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17727827#comment-17727827
 ] 

gaoyajun02 commented on SPARK-43864:


[~srowen] Thank you for your reply. It doesn't matter for Spark itself, but it 
will affect some users who use community Spark to package or deploy.
Some companies that pay more attention to security will disable packages with 
known security vulnerabilities.
The implementation checks dependencies during the packaging process: once such a 
dependency is introduced, it prevents release and deployment, regardless of 
whether it is in the test scope.

> Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 
> 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL
> 
>
> Key: SPARK-43864
> URL: https://issues.apache.org/jira/browse/SPARK-43864
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: gaoyajun02
>Priority: Minor
>
> CVE-2023-26119 Detail: [https://nvd.nist.gov/vuln/detail/CVE-2023-26119]
> It is recommended to replace 'net.sourceforge.htmlunit' with 'org.htmlunit' in 
> Spark:
> {code:java}
>     <dependency>
>       <groupId>org.htmlunit</groupId>
>       <artifactId>htmlunit</artifactId>
>       <scope>test</scope>
>     </dependency>
>     <dependency>
>       <groupId>org.htmlunit</groupId>
>       <artifactId>htmlunit-core-js</artifactId>
>       <scope>test</scope>
>     </dependency>
> {code}
> see: [https://www.htmlunit.org/migration.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43864) Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL

2023-05-29 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17727064#comment-17727064
 ] 

gaoyajun02 commented on SPARK-43864:


Can you take a look? [~yangjie01]

> Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 
> 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL
> 
>
> Key: SPARK-43864
> URL: https://issues.apache.org/jira/browse/SPARK-43864
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: gaoyajun02
>Priority: Major
>
> CVE-2023-26119 Detail: [https://nvd.nist.gov/vuln/detail/CVE-2023-26119]
> It is recommended to replace 'net.sourceforge.htmlunit' with 'org.htmlunit' in 
> Spark:
> {code:java}
>     <dependency>
>       <groupId>org.htmlunit</groupId>
>       <artifactId>htmlunit</artifactId>
>       <scope>test</scope>
>     </dependency>
>     <dependency>
>       <groupId>org.htmlunit</groupId>
>       <artifactId>htmlunit-core-js</artifactId>
>       <scope>test</scope>
>     </dependency>
> {code}
> see: [https://www.htmlunit.org/migration.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43864) Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL

2023-05-29 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-43864:
---
Attachment: (was: image-2023-05-29-18-18-44-132.png)

> Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 
> 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL
> 
>
> Key: SPARK-43864
> URL: https://issues.apache.org/jira/browse/SPARK-43864
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: gaoyajun02
>Priority: Major
>
> CVE-2023-26119 Detail: [https://nvd.nist.gov/vuln/detail/CVE-2023-26119]
> It is recommended to replace 'net.sourceforge.htmlunit' with 'org.htmlunit' in 
> Spark:
> {code:java}
>     <dependency>
>       <groupId>org.htmlunit</groupId>
>       <artifactId>htmlunit</artifactId>
>       <scope>test</scope>
>     </dependency>
>     <dependency>
>       <groupId>org.htmlunit</groupId>
>       <artifactId>htmlunit-core-js</artifactId>
>       <scope>test</scope>
>     </dependency>
> {code}
> see: [https://www.htmlunit.org/migration.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43864) Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL

2023-05-29 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-43864:
---
Attachment: image-2023-05-29-18-18-44-132.png

> Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 
> 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL
> 
>
> Key: SPARK-43864
> URL: https://issues.apache.org/jira/browse/SPARK-43864
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: gaoyajun02
>Priority: Major
>
> CVE-2023-26119 Detail: [https://nvd.nist.gov/vuln/detail/CVE-2023-26119]
> It is recommended to replace 'net.sourceforge.htmlunit' with 'org.htmlunit' in 
> Spark:
> {code:java}
>     <dependency>
>       <groupId>org.htmlunit</groupId>
>       <artifactId>htmlunit</artifactId>
>       <scope>test</scope>
>     </dependency>
>     <dependency>
>       <groupId>org.htmlunit</groupId>
>       <artifactId>htmlunit-core-js</artifactId>
>       <scope>test</scope>
>     </dependency>
> {code}
> see: [https://www.htmlunit.org/migration.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43864) Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL

2023-05-29 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-43864:
---
Description: 
CVE-2023-26119 Detail: [https://nvd.nist.gov/vuln/detail/CVE-2023-26119]

It is recommended to replace 'net.sourceforge.htmlunit' with 'org.htmlunit' in 
Spark:
{code:java}
    <dependency>
      <groupId>org.htmlunit</groupId>
      <artifactId>htmlunit</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.htmlunit</groupId>
      <artifactId>htmlunit-core-js</artifactId>
      <scope>test</scope>
    </dependency>
{code}
see: [https://www.htmlunit.org/migration.html]

  was:
CVE-2023-26119 Detail: [https://nvd.nist.gov/vuln/detail/CVE-2023-26119]

It is recommended to replace 'net.sourceforge.htmlunit' with 'org.htmlunit' in 
Spark:

```
    <dependency>
      <groupId>org.htmlunit</groupId>
      <artifactId>htmlunit</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.htmlunit</groupId>
      <artifactId>htmlunit-core-js</artifactId>
      <scope>test</scope>
    </dependency>
```

see: [https://www.htmlunit.org/migration.html]


> Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 
> 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL
> 
>
> Key: SPARK-43864
> URL: https://issues.apache.org/jira/browse/SPARK-43864
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: gaoyajun02
>Priority: Major
>
> CVE-2023-26119 Detail: [https://nvd.nist.gov/vuln/detail/CVE-2023-26119]
> It is recommended to replace 'net.sourceforge.htmlunit' with 'org.htmlunit' in 
> Spark:
> {code:java}
>     <dependency>
>       <groupId>org.htmlunit</groupId>
>       <artifactId>htmlunit</artifactId>
>       <scope>test</scope>
>     </dependency>
>     <dependency>
>       <groupId>org.htmlunit</groupId>
>       <artifactId>htmlunit-core-js</artifactId>
>       <scope>test</scope>
>     </dependency>
> {code}
> see: [https://www.htmlunit.org/migration.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43864) Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL

2023-05-29 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-43864:
---
Description: 
CVE-2023-26119 Detail: [https://nvd.nist.gov/vuln/detail/CVE-2023-26119]

It is recommended to replace 'net.sourceforge.htmlunit' with 'org.htmlunit' in 
Spark:

```
    <dependency>
      <groupId>org.htmlunit</groupId>
      <artifactId>htmlunit</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.htmlunit</groupId>
      <artifactId>htmlunit-core-js</artifactId>
      <scope>test</scope>
    </dependency>
```

see: [https://www.htmlunit.org/migration.html]

> Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 
> 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL
> 
>
> Key: SPARK-43864
> URL: https://issues.apache.org/jira/browse/SPARK-43864
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: gaoyajun02
>Priority: Major
>
> CVE-2023-26119 Detail: [https://nvd.nist.gov/vuln/detail/CVE-2023-26119]
> It is recommended to replace 'net.sourceforge.htmlunit' with 'org.htmlunit' in 
> Spark:
> ```
>     <dependency>
>       <groupId>org.htmlunit</groupId>
>       <artifactId>htmlunit</artifactId>
>       <scope>test</scope>
>     </dependency>
>     <dependency>
>       <groupId>org.htmlunit</groupId>
>       <artifactId>htmlunit-core-js</artifactId>
>       <scope>test</scope>
>     </dependency>
> ```
> see: [https://www.htmlunit.org/migration.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43864) Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL

2023-05-29 Thread gaoyajun02 (Jira)
gaoyajun02 created SPARK-43864:
--

 Summary: Versions of the package net.sourceforge.htmlunit:htmlunit 
from 0 and before 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL
 Key: SPARK-43864
 URL: https://issues.apache.org/jira/browse/SPARK-43864
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: gaoyajun02






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40872) Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-10-21 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-40872:
---
Parent: SPARK-33235
Issue Type: Sub-task  (was: Improvement)

> Fallback to original shuffle block when a push-merged shuffle chunk is 
> zero-size
> 
>
> Key: SPARK-40872
> URL: https://issues.apache.org/jira/browse/SPARK-40872
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.3.0, 3.2.2
>Reporter: gaoyajun02
>Priority: Major
>
> A large number of shuffle tests in our cluster show that, on bad nodes where 
> chunk corruption appears, there is a probability of fetching zero-size 
> shuffleChunks. In this case, we can fall back to the original shuffle blocks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40872) Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-10-21 Thread gaoyajun02 (Jira)
gaoyajun02 created SPARK-40872:
--

 Summary: Fallback to original shuffle block when a push-merged 
shuffle chunk is zero-size
 Key: SPARK-40872
 URL: https://issues.apache.org/jira/browse/SPARK-40872
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.2.2, 3.3.0
Reporter: gaoyajun02


A large number of shuffle tests in our cluster show that, on bad nodes where 
chunk corruption appears, there is a probability of fetching zero-size 
shuffleChunks. In this case, we can fall back to the original shuffle blocks.
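
A minimal sketch of the proposed behavior, with hypothetical names rather than 
the actual Spark implementation: treat a zero-size push-merged chunk as unusable 
and fall back to fetching the original shuffle blocks for that partition instead 
of failing the read.
{code:java}
// Hypothetical sketch (names are illustrative, not the actual Spark patch).
import java.nio.ByteBuffer;

final class ShuffleChunkGuard {
  interface Fallback {
    void fetchOriginalBlocks(int shuffleId, int reduceId);
  }

  // Returns true if the chunk is usable; otherwise triggers the fallback.
  static boolean useChunkOrFallback(ByteBuffer chunk, int shuffleId, int reduceId,
                                    Fallback fallback) {
    if (chunk == null || chunk.remaining() == 0) {
      // A zero-size chunk cannot be a valid merged block; re-fetch the
      // original map outputs instead of failing the stage.
      fallback.fetchOriginalBlocks(shuffleId, reduceId);
      return false;
    }
    return true;
  }
}
{code}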



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38010) Push-based shuffle disabled due to insufficient mergeLocations

2022-01-25 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 resolved SPARK-38010.

Resolution: Fixed

> Push-based shuffle disabled due to insufficient mergeLocations
> --
>
> Key: SPARK-38010
> URL: https://issues.apache.org/jira/browse/SPARK-38010
> Project: Spark
>  Issue Type: Question
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: gaoyajun02
>Priority: Major
>
> The current shuffle merger locations are obtained based on the hosts of the 
> active or dead executors.
> When dynamic executor allocation is enabled and an application submits its 
> first few stages, there are often not enough locations to satisfy push 
> merging, which causes most shuffles to not benefit from push-based shuffle.
> The first few shuffle write stages of Spark applications are generally the 
> stages that read tables or data sources, and they account for a large amount 
> of the shuffled data. Because push-merged shuffle is disabled for them, the 
> end-to-end improvement for Spark applications is very limited.
> I have thought of one possible approach, but I am not sure whether it is 
> feasible:
>  *  Lazily initialize the shuffle merger locations: after the mapper writes 
> the local shuffle data, it obtains the merger locations in the push thread.
> Looking for advice and solutions on this issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38010) Push-based shuffle disabled due to insufficient mergeLocations

2022-01-25 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-38010:
---
Issue Type: Question  (was: Brainstorming)

> Push-based shuffle disabled due to insufficient mergeLocations
> --
>
> Key: SPARK-38010
> URL: https://issues.apache.org/jira/browse/SPARK-38010
> Project: Spark
>  Issue Type: Question
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: gaoyajun02
>Priority: Major
>
> The current shuffle merger locations are obtained based on the hosts of the 
> active or dead executors.
> When dynamic executor allocation is enabled and an application submits its 
> first few stages, there are often not enough locations to satisfy push 
> merging, which causes most shuffles to not benefit from push-based shuffle.
> The first few shuffle write stages of Spark applications are generally the 
> stages that read tables or data sources, and they account for a large amount 
> of the shuffled data. Because push-merged shuffle is disabled for them, the 
> end-to-end improvement for Spark applications is very limited.
> I have thought of one possible approach, but I am not sure whether it is 
> feasible:
>  *  Lazily initialize the shuffle merger locations: after the mapper writes 
> the local shuffle data, it obtains the merger locations in the push thread.
> Looking for advice and solutions on this issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38010) Push-based shuffle disabled due to insufficient mergeLocations

2022-01-24 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481614#comment-17481614
 ] 

gaoyajun02 commented on SPARK-38010:


Can https://issues.apache.org/jira/browse/SPARK-34826 solve it? [~vsowrirajan] 

> Push-based shuffle disabled due to insufficient mergeLocations
> --
>
> Key: SPARK-38010
> URL: https://issues.apache.org/jira/browse/SPARK-38010
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: gaoyajun02
>Priority: Major
>
> The current shuffle merger locations are obtained based on the hosts of the 
> active or dead executors.
> When dynamic executor allocation is enabled and an application submits its 
> first few stages, there are often not enough locations to satisfy push 
> merging, which causes most shuffles to not benefit from push-based shuffle.
> The first few shuffle write stages of Spark applications are generally the 
> stages that read tables or data sources, and they account for a large amount 
> of the shuffled data. Because push-merged shuffle is disabled for them, the 
> end-to-end improvement for Spark applications is very limited.
> I have thought of one possible approach, but I am not sure whether it is 
> feasible:
>  *  Lazily initialize the shuffle merger locations: after the mapper writes 
> the local shuffle data, it obtains the merger locations in the push thread.
> Looking for advice and solutions on this issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38010) Push-based shuffle disabled due to insufficient mergeLocations

2022-01-24 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-38010:
---
Parent: (was: SPARK-33235)
Issue Type: Brainstorming  (was: Technical task)

> Push-based shuffle disabled due to insufficient mergeLocations
> --
>
> Key: SPARK-38010
> URL: https://issues.apache.org/jira/browse/SPARK-38010
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: gaoyajun02
>Priority: Major
>
> The current shuffle merger locations are obtained based on the hosts of the 
> active or dead executors.
> When dynamic executor allocation is enabled and an application submits its 
> first few stages, there are often not enough locations to satisfy push 
> merging, which causes most shuffles to not benefit from push-based shuffle.
> The first few shuffle write stages of Spark applications are generally the 
> stages that read tables or data sources, and they account for a large amount 
> of the shuffled data. Because push-merged shuffle is disabled for them, the 
> end-to-end improvement for Spark applications is very limited.
> I have thought of one possible approach, but I am not sure whether it is 
> feasible:
>  *  Lazily initialize the shuffle merger locations: after the mapper writes 
> the local shuffle data, it obtains the merger locations in the push thread.
> Looking for advice and solutions on this issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38010) Push-based shuffle disabled due to insufficient mergeLocations

2022-01-24 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-38010:
---
Issue Type: Technical task  (was: Sub-task)

> Push-based shuffle disabled due to insufficient mergeLocations
> --
>
> Key: SPARK-38010
> URL: https://issues.apache.org/jira/browse/SPARK-38010
> Project: Spark
>  Issue Type: Technical task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: gaoyajun02
>Priority: Major
>
> The current shuffle merger locations are obtained based on the hosts of the 
> active or dead executors.
> When dynamic executor allocation is enabled and an application submits its 
> first few stages, there are often not enough locations to satisfy push 
> merging, which causes most shuffles to not benefit from push-based shuffle.
> The first few shuffle write stages of Spark applications are generally the 
> stages that read tables or data sources, and they account for a large amount 
> of the shuffled data. Because push-merged shuffle is disabled for them, the 
> end-to-end improvement for Spark applications is very limited.
> I have thought of one possible approach, but I am not sure whether it is 
> feasible:
>  *  Lazily initialize the shuffle merger locations: after the mapper writes 
> the local shuffle data, it obtains the merger locations in the push thread.
> Looking for advice and solutions on this issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38010) Push-based shuffle disabled due to insufficient mergeLocations

2022-01-24 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-38010:
---
Parent: SPARK-33235
Issue Type: Sub-task  (was: Improvement)

> Push-based shuffle disabled due to insufficient mergeLocations
> --
>
> Key: SPARK-38010
> URL: https://issues.apache.org/jira/browse/SPARK-38010
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: gaoyajun02
>Priority: Major
>
> The current shuffle merger locations are obtained based on the hosts of the 
> active or dead executors.
> When dynamic executor allocation is enabled and an application submits its 
> first few stages, there are often not enough locations to satisfy push 
> merging, which causes most shuffles to not benefit from push-based shuffle.
> The first few shuffle write stages of Spark applications are generally the 
> stages that read tables or data sources, and they account for a large amount 
> of the shuffled data. Because push-merged shuffle is disabled for them, the 
> end-to-end improvement for Spark applications is very limited.
> I have thought of one possible approach, but I am not sure whether it is 
> feasible:
>  *  Lazily initialize the shuffle merger locations: after the mapper writes 
> the local shuffle data, it obtains the merger locations in the push thread.
> Looking for advice and solutions on this issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38010) Push-based shuffle disabled due to insufficient mergeLocations

2022-01-24 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-38010:
---
Description: 
The current shuffle merger locations are obtained based on the hosts of the 
active or dead executors.
When dynamic executor allocation is enabled and an application submits its 
first few stages, there are often not enough locations to satisfy push merging, 
which causes most shuffles to not benefit from push-based shuffle.
The first few shuffle write stages of Spark applications are generally the 
stages that read tables or data sources, and they account for a large amount of 
the shuffled data. Because push-merged shuffle is disabled for them, the 
end-to-end improvement for Spark applications is very limited.

I have thought of one possible approach, but I am not sure whether it is feasible:
 *  Lazily initialize the shuffle merger locations: after the mapper writes the 
local shuffle data, it obtains the merger locations in the push thread.

Looking for advice and solutions on this issue.

  was:
The current shuffle merger position is obtained based on the host of the active 
or dead Executor.
When dynamic resource allocation is enabled and the application submits the 
first few stages, there are often not enough locations to satisfy push merging, 
which causes most shuffles to not benefit from push-based shuffle.
The first few shuffle write stages of Spark applications are generally the 
stages that read tables or data sources, which account for a large amount of 
the shuffled data. Because push cannot be used, the end-to-end improvement for 
Spark applications is very limited.

I have thought of one possible approach, but I am not sure whether it is feasible:
 *  Lazily initialize the shuffle merger locations: after the mapper writes the 
local shuffle data, it obtains the merger locations in the push thread.

Looking for advice and solutions on this issue.


> Push-based shuffle disabled due to insufficient mergeLocations
> --
>
> Key: SPARK-38010
> URL: https://issues.apache.org/jira/browse/SPARK-38010
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: gaoyajun02
>Priority: Major
>
> The current shuffle merger locations are obtained based on the hosts of the 
> active or dead executors.
> When dynamic executor allocation is enabled and an application submits its 
> first few stages, there are often not enough locations to satisfy push 
> merging, which causes most shuffles to not benefit from push-based shuffle.
> The first few shuffle write stages of Spark applications are generally the 
> stages that read tables or data sources, and they account for a large amount 
> of the shuffled data. Because push-merged shuffle is disabled for them, the 
> end-to-end improvement for Spark applications is very limited.
> I have thought of one possible approach, but I am not sure whether it is 
> feasible:
>  *  Lazily initialize the shuffle merger locations: after the mapper writes 
> the local shuffle data, it obtains the merger locations in the push thread.
> Looking for advice and solutions on this issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38010) Push-based shuffle disabled due to insufficient mergeLocations

2022-01-24 Thread gaoyajun02 (Jira)
gaoyajun02 created SPARK-38010:
--

 Summary: Push-based shuffle disabled due to insufficient 
mergeLocations
 Key: SPARK-38010
 URL: https://issues.apache.org/jira/browse/SPARK-38010
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 3.1.0
Reporter: gaoyajun02


The current shuffle merger position is obtained based on the host of the active 
or dead Executor.
When dynamic resource allocation is enabled and the application submits the 
first few stages, there are often not enough locations to satisfy push merging, 
which causes most shuffles to not benefit from push-based shuffle.
The first few shuffle write stages of Spark applications are generally the 
stages that read tables or data sources, which account for a large amount of 
the shuffled data. Because push cannot be used, the end-to-end improvement for 
Spark applications is very limited.

I have thought of one possible approach, but I am not sure whether it is feasible:
 *  Lazily initialize the shuffle merger locations: after the mapper writes the 
local shuffle data, it obtains the merger locations in the push thread.

Looking for advice and solutions on this issue.
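
A small sketch of the lazy-initialization idea, assuming a hypothetical helper 
(none of these names exist in Spark): resolve the merger locations on the first 
push attempt, after the map output has been written and more executors have had 
time to register, instead of at stage submission time.
{code:java}
// Hypothetical helper, illustrative only.
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

final class LazyMergerLocations {
  private final AtomicReference<List<String>> cached = new AtomicReference<>();
  private final Supplier<List<String>> resolver;  // e.g. asks the driver

  LazyMergerLocations(Supplier<List<String>> resolver) {
    this.resolver = resolver;
  }

  // Called from the push thread after the map output has been written locally.
  List<String> get() {
    List<String> locations = cached.get();
    if (locations == null) {
      locations = resolver.get();
      // Another push thread may have resolved first; keep whichever won.
      if (!cached.compareAndSet(null, locations)) {
        locations = cached.get();
      }
    }
    return locations;
  }
}
{code}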



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36964) Reuse CachedDNSToSwitchMapping for yarn container requests

2021-10-17 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-36964:
---
Affects Version/s: 3.3.0
   3.2.0

> Reuse CachedDNSToSwitchMapping for yarn  container requests
> ---
>
> Key: SPARK-36964
> URL: https://issues.apache.org/jira/browse/SPARK-36964
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: gaoyajun02
>Priority: Major
>
> Similar to SPARK-13704, in some cases adding container requests with a 
> locality preference in YarnAllocator can be expensive, because it may call the 
> topology script for rack awareness.
> When submitting a very large job in a very large YARN cluster, the topology 
> script may take significant time to run. This blocks receiving 
> YarnSchedulerBackend's RequestExecutors RPC calls. The request comes from the 
> Spark dynamic executor allocation thread, so it may block the 
> ExecutorAllocationListener and then result in an executorManagement queue 
> backlog.
>  
> Some logs:
> {code:java}
> 21/09/29 12:04:35 INFO spark-dynamic-executor-allocation 
> ExecutorAllocationManager: Error reaching cluster manager.21/09/29 12:04:35 
> INFO spark-dynamic-executor-allocation ExecutorAllocationManager: Error 
> reaching cluster manager.org.apache.spark.rpc.RpcTimeoutException: Futures 
> timed out after [120 seconds]. This timeout is controlled by 
> spark.rpc.askTimeout at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
>  at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
>  at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
>  at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) 
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76) at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:839)
>  at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:411)
>  at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:361)
>  at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:316)
>  at 
> org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:227)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)Caused by: 
> java.util.concurrent.TimeoutException: Futures timed out after [120 seconds] 
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259) at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263) at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:294) at 
> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) ... 12 
> more21/09/29 12:04:35 WARN spark-dynamic-executor-allocation 
> ExecutorAllocationManager: Unable to reach the cluster manager to request 
> 1922 total executors!
> 21/09/29 12:04:35 INFO spark-dynamic-executor-allocation 
> ExecutorAllocationManager: Error reaching cluster manager.21/09/29 12:04:35 
> INFO spark-dynamic-executor-allocation ExecutorAllocationManager: Error 
> reaching cluster manager.org.apache.spark.rpc.RpcTimeoutException: Futures 
> timed out after [120 seconds]. This timeout is controlled by 
> spark.rpc.askTimeout at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
>  at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
>  at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
>  at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) 
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76) at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:839)
>  at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(E

[jira] [Updated] (SPARK-36964) Reuse CachedDNSToSwitchMapping for yarn container requests

2021-10-14 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-36964:
---
Description: 
Similar to SPARK-13704, in some cases adding container requests with a locality 
preference in YarnAllocator can be expensive, because it may call the topology 
script for rack awareness.

When submitting a very large job in a very large YARN cluster, the topology 
script may take significant time to run. This blocks receiving 
YarnSchedulerBackend's RequestExecutors RPC calls. The request comes from the 
Spark dynamic executor allocation thread, so it may block the 
ExecutorAllocationListener and then result in an executorManagement queue backlog.

 

Some logs:
{code:java}
21/09/29 12:04:35 INFO spark-dynamic-executor-allocation 
ExecutorAllocationManager: Error reaching cluster manager.21/09/29 12:04:35 
INFO spark-dynamic-executor-allocation ExecutorAllocationManager: Error 
reaching cluster manager.org.apache.spark.rpc.RpcTimeoutException: Futures 
timed out after [120 seconds]. This timeout is controlled by 
spark.rpc.askTimeout at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
 at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
 at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
 at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) 
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76) at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:839)
 at 
org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:411)
 at 
org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:361)
 at 
org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:316)
 at 
org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:227)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745)Caused by: 
java.util.concurrent.TimeoutException: Futures timed out after [120 seconds] at 
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259) at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263) at 
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:294) at 
org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) ... 12 
more21/09/29 12:04:35 WARN spark-dynamic-executor-allocation 
ExecutorAllocationManager: Unable to reach the cluster manager to request 1922 
total executors!

21/09/29 12:04:35 INFO spark-dynamic-executor-allocation 
ExecutorAllocationManager: Error reaching cluster manager.21/09/29 12:04:35 
INFO spark-dynamic-executor-allocation ExecutorAllocationManager: Error 
reaching cluster manager.org.apache.spark.rpc.RpcTimeoutException: Futures 
timed out after [120 seconds]. This timeout is controlled by 
spark.rpc.askTimeout at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
 at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
 at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
 at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) 
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76) at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:839)
 at 
org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:411)
 at 
org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:361)
 at 
org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:316)
 at 
org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:227)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThrea

[jira] [Updated] (SPARK-36964) Reuse CachedDNSToSwitchMapping for yarn container requests

2021-10-14 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-36964:
---
Description: 
Similar to SPARK-13704, in some cases adding or removing container requests in 
YarnAllocator can be expensive, because it may call the topology script for rack 
awareness.

When submitting a very large job in a very large YARN cluster, the topology 
script may take significant time to run. This blocks receiving 
YarnSchedulerBackend's RequestExecutors RPC calls. The request comes from the 
Spark dynamic executor allocation thread, so it may block the 
ExecutorAllocationListener:
{code}
12:04:35 INFO spark-dynamic-executor-allocation ExecutorAllocationManager: 
Error reaching cluster manager.21/09/29 12:04:35 INFO 
spark-dynamic-executor-allocation ExecutorAllocationManager: Error reaching 
cluster manager.org.apache.spark.rpc.RpcTimeoutException: Futures timed out 
after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
 at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
 at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
 at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) 
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76) at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:839)
 at 
org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:411)
 at 
org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:361)
 at 
org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:316)
 at 
org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:227)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745)Caused by: 
java.util.concurrent.TimeoutException: Futures timed out after [120 seconds] at 
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259) at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263) at 
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:294) at 
org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) ... 12 
more21/09/29 12:04:35 WARN spark-dynamic-executor-allocation 
ExecutorAllocationManager: Unable to reach the cluster manager to request 1922 
total executors!{code}
and then result in executorManagement queue backlog. e.g. some log:
{code}
21/09/29 12:02:49 ERROR dag-scheduler-event-loop AsyncEventQueue: Dropping 
event from queue executorManagement. This likely means one of the listeners is 
too slow and cannot keep up with the rate at which tasks are being started by 
the scheduler.
21/09/29 12:02:49 WARN dag-scheduler-event-loop AsyncEventQueue: Dropped 1 
events from executorManagement since the application started.
21/09/29 12:02:55 INFO spark-listener-group-eventLog AsyncEventQueue: Process 
of event 
SparkListenerExecutorAdded(1632888172920,543,org.apache.spark.scheduler.cluster.ExecutorData@8cfab8f5,None)
 by listener EventLoggingListener took 3.037686034s.
21/09/29 12:03:03 INFO spark-listener-group-eventLog AsyncEventQueue: Process 
of event SparkListenerBlockManagerAdded(1632888181779,BlockManagerId(1359, --, 
57233, None),2704696934,Some(2704696934),Some(0)) by listener 
EventLoggingListener took 1.462598355s.
21/09/29 12:03:49 WARN dispatcher-BlockManagerMaster AsyncEventQueue: Dropped 
74388 events from executorManagement since Wed Sep 29 12:02:49 CST 2021.
21/09/29 12:04:35 INFO spark-listener-group-executorManagement AsyncEventQueue: 
Process of event 
SparkListenerStageSubmitted(org.apache.spark.scheduler.StageInfo@52f810ad,{...})
 by listener ExecutorAllocationListener took 116.526408932s.
21/09/29 12:04:49 WARN heartbeat-receiver-event-loop-thread AsyncEventQueue: 
Dropped 18892 events from executorManagement since Wed Sep 29 12:03:49 CST 2021.
21/09/29 12:05:49 WARN dispatcher-BlockManagerMaster AsyncEventQueue: Dropped 
19397 events from executorManagement since Wed Sep 29 12:04:49 CST 2021.
{code}

  was:
Similar to SPARK-13704​, In some cases, YarnAllocator add or r

[jira] [Updated] (SPARK-36964) Reuse CachedDNSToSwitchMapping for yarn container requests

2021-10-14 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-36964:
---
Description: 
Similar to SPARK-13704, in some cases adding or removing container requests in 
YarnAllocator can be expensive, because it may call the topology script for rack 
awareness.

When submitting a very large job in a very large YARN cluster, the topology 
script may take significant time to run. This blocks receiving 
YarnSchedulerBackend's RequestExecutors RPC calls. The request comes from the 
Spark dynamic executor allocation thread, so it may block the 
ExecutorAllocationListener:
{code:text}
21/09/29 12:04:35 INFO spark-dynamic-executor-allocation 
ExecutorAllocationManager: Error reaching cluster manager.21/09/29 12:04:35 
INFO spark-dynamic-executor-allocation ExecutorAllocationManager: Error 
reaching cluster manager.org.apache.spark.rpc.RpcTimeoutException: Futures 
timed out after [120 seconds]. This timeout is controlled by 
spark.rpc.askTimeout at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
 at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
 at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
 at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) 
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76) at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:839)
 at 
org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:411)
 at 
org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:361)
 at 
org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:316)
 at 
org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:227)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745)Caused by: 
java.util.concurrent.TimeoutException: Futures timed out after [120 seconds] at 
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259) at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263) at 
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:294) at 
org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) ... 12 
more21/09/29 12:04:35 WARN spark-dynamic-executor-allocation 
ExecutorAllocationManager: Unable to reach the cluster manager to request 1922 
total executors!{code}
and then result in executorManagement queue backlog. e.g. some log:
{code:text}
21/09/29 12:02:49 ERROR dag-scheduler-event-loop AsyncEventQueue: Dropping 
event from queue executorManagement. This likely means one of the listeners is 
too slow and cannot keep up with the rate at which tasks are being started by 
the scheduler.
21/09/29 12:02:49 WARN dag-scheduler-event-loop AsyncEventQueue: Dropped 1 
events from executorManagement since the application started.
21/09/29 12:02:55 INFO spark-listener-group-eventLog AsyncEventQueue: Process 
of event 
SparkListenerExecutorAdded(1632888172920,543,org.apache.spark.scheduler.cluster.ExecutorData@8cfab8f5,None)
 by listener EventLoggingListener took 3.037686034s.
21/09/29 12:03:03 INFO spark-listener-group-eventLog AsyncEventQueue: Process 
of event SparkListenerBlockManagerAdded(1632888181779,BlockManagerId(1359, --, 
57233, None),2704696934,Some(2704696934),Some(0)) by listener 
EventLoggingListener took 1.462598355s.
21/09/29 12:03:49 WARN dispatcher-BlockManagerMaster AsyncEventQueue: Dropped 
74388 events from executorManagement since Wed Sep 29 12:02:49 CST 2021.
21/09/29 12:04:35 INFO spark-listener-group-executorManagement AsyncEventQueue: 
Process of event 
SparkListenerStageSubmitted(org.apache.spark.scheduler.StageInfo@52f810ad,{...})
 by listener ExecutorAllocationListener took 116.526408932s.
21/09/29 12:04:49 WARN heartbeat-receiver-event-loop-thread AsyncEventQueue: 
Dropped 18892 events from executorManagement since Wed Sep 29 12:03:49 CST 2021.
21/09/29 12:05:49 WARN dispatcher-BlockManagerMaster AsyncEventQueue: Dropped 
19397 events from executorManagement since Wed Sep 29 12:04:49 CST 2021.
{code}



  was:
Similar to SPARK-13704​, In some cases, 

[jira] [Created] (SPARK-36964) Reuse CachedDNSToSwitchMapping for yarn container requests

2021-10-09 Thread gaoyajun02 (Jira)
gaoyajun02 created SPARK-36964:
--

 Summary: Reuse CachedDNSToSwitchMapping for yarn  container 
requests
 Key: SPARK-36964
 URL: https://issues.apache.org/jira/browse/SPARK-36964
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 3.1.2, 3.0.3
Reporter: gaoyajun02


Similar to SPARK-13704, in some cases adding or removing container requests in 
YarnAllocator can be expensive, because it may call the topology script for rack 
awareness.

When submitting a very large job in a very large YARN cluster, the topology 
script may take significant time to run. This blocks receiving 
YarnSchedulerBackend's RequestExecutors RPC calls. The request comes from the 
Spark dynamic executor allocation thread, so it may block the 
ExecutorAllocationListener and then result in an executorManagement queue backlog.
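
A minimal sketch of the reuse idea using Hadoop's public rack-mapping API 
(integration with Spark's YarnAllocator is not shown, and the host names are 
placeholders): keep a single cached mapping instance alive and reuse it for 
every container request, so hosts that have already been resolved do not trigger 
the topology script again.
{code:java}
// Illustrative only: reuse one cached rack resolver across requests.
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.net.DNSToSwitchMapping;
import org.apache.hadoop.net.ScriptBasedMapping;

public class ReusedRackResolver {
  // Shared mapping: ScriptBasedMapping caches results internally, so reusing
  // the same instance avoids repeated topology-script invocations.
  private final DNSToSwitchMapping mapping;

  public ReusedRackResolver(Configuration conf) {
    ScriptBasedMapping scriptMapping = new ScriptBasedMapping();
    scriptMapping.setConf(conf);  // reads net.topology.script.file.name
    this.mapping = scriptMapping;
  }

  public List<String> resolveRacks(List<String> hosts) {
    return mapping.resolve(hosts);
  }

  public static void main(String[] args) {
    ReusedRackResolver resolver = new ReusedRackResolver(new Configuration());
    List<String> hosts = Arrays.asList("host-a.example.com", "host-b.example.com");
    System.out.println(resolver.resolveRacks(hosts));  // may run the script
    System.out.println(resolver.resolveRacks(hosts));  // served from the cache
  }
}
{code}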



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36815) Found duplicate rewrite attributes

2021-09-22 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-36815:
---
Description: 
We are using Spark version 3.0.2 in production and some ETLs contain 
multi-level CTEs and the following error occurs when we join them.
{code:java}
java.lang.AssertionError: assertion failed: Found duplicate rewrite attributes 
at scala.Predef$.assert(Predef.scala:223) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:207) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:405)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:403)
{code}
I reproduced the problem with a simplified SQL as follows:
{code:java}
-- SQL
with
a as ( select name, get_json_object(json, '$.id') id, n from (
select get_json_object(json, '$.name') name, json from values 
('{"name":"a", "id": 1}' ) people(json)
) LATERAL VIEW explode(array(1, 1, 2)) num as n ),
b as ( select a1.name, a1.id, a1.n from a a1 left join (select name, count(1) c 
from a group by name) a2 on a1.name = a2.name)
select b1.name, b1.n, b1.id from b b1 join b b2 on b1.name = b2.name;{code}
In debugging I found that an attribute referenced by the root Project existed in
both subqueries; when `ResolveReferences` resolved the conflict, `rewrite`
occurred in both subqueries, each producing a new attrMapping entry, and both
entries were eventually passed up to the root Project, leading to this error.

plan:
{code:java}
Project [name#218, id#219, n#229]
+- Join LeftOuter, (name#218 = name#232)
   :- SubqueryAlias a1
   :  +- SubqueryAlias a
   : +- Project [name#218, get_json_object(json#225, $.id) AS id#219, n#229]
   :+- Generate explode(array(1, 1, 2)), false, num, [n#229]
   :   +- SubqueryAlias __auto_generated_subquery_name
   :  +- Project [get_json_object(json#225, $.name) AS name#218, 
json#225]
   : +- SubqueryAlias people
   :+- LocalRelation [json#225]
   +- SubqueryAlias a2
  +- Aggregate [name#232], [name#232, count(1) AS c#220L]
 +- SubqueryAlias a
+- Project [name#232, get_json_object(json#226, $.id) AS id#219, 
n#230]
   +- Generate explode(array(1, 1, 2)), false, num, [n#230]
  +- SubqueryAlias __auto_generated_subquery_name
 +- Project [get_json_object(json#226, $.name) AS name#232, 
json#226]
+- SubqueryAlias people
   +- LocalRelation [json#226]

{code}
 newPlan:
{code:java}
!Project [name#218, id#219, n#229]
+- Join LeftOuter, (name#218 = name#232)
   :- SubqueryAlias a1
   :  +- SubqueryAlias a
   : +- Project [name#218, get_json_object(json#225, $.id) AS id#233, n#229]
   :+- Generate explode(array(1, 1, 2)), false, num, [n#229]
   :   +- SubqueryAlias __auto_generated_subquery_name
   :  +- Project [get_json_object(json#225, $.name) AS name#218, 
json#225]
   : +- SubqueryAlias people
   :+- LocalRelation [json#225]
   +- SubqueryAlias a2
  +- Aggregate [name#232], [name#232, count(1) AS c#220L]
 +- SubqueryAlias a
+- Project [name#232, get_json_object(json#226, $.id) AS id#234, 
n#230]
   +- Generate explode(array(1, 1, 2)), false, num, [n#230]
  +- SubqueryAlias __auto_generated_subquery_name
 +- Project [get_json_object(json#226, $.name) AS name#232, 
json#226]
+- SubqueryAlias people
   +- LocalRelation [json#226]

{code}
attrMapping:
{code:java}
attrMapping = {ArrayBuffer@9099} "ArrayBuffer" size = 2
 0 = {Tuple2@17769} "(id#219,id#233)"
 1 = {Tuple2@17770} "(id#219,id#234)"
{code}
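
For illustration only (this is not the Spark source), the two entries above are
what trip the "Found duplicate rewrite attributes" assertion: both map the same
source attribute id#219 to different targets, so the rewrite is ambiguous. A
minimal self-contained sketch of that condition, which fails in the same way:
{code:java}
// Hedged illustration: both entries map the same source attribute (id#219) to
// different targets, so grouping by source finds more than one distinct target.
val attrMapping = Seq("id#219" -> "id#233", "id#219" -> "id#234")
val duplicated = attrMapping.groupBy(_._1).filter { case (_, pairs) =>
  pairs.map(_._2).distinct.size > 1
}
assert(duplicated.isEmpty, s"Found duplicate rewrite attributes: $duplicated")
{code}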
 

 

 

  was:
We are using Spark version 3.0.2 in production and some ETLs contain 
multi-level CETs and the following error occurs when we join them.
{code:java}
java.lang.AssertionError: assertion failed: Found duplicate rewrite attributes 
at scala.Predef$.assert(Predef.scala:223) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:207) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:405)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:403)
{code}
I reproduced the problem with a simplified SQL as follows:
{code:java}
-- SQL
with
a as ( select name, get_json_object(json, '$.id') id, n from (
select get_json_object(json, '$.nam

[jira] [Comment Edited] (SPARK-36815) Found duplicate rewrite attributes

2021-09-21 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418064#comment-17418064
 ] 

gaoyajun02 edited comment on SPARK-36815 at 9/21/21, 2:19 PM:
--

https://issues.apache.org/jira/browse/SPARK-33272 fixes this issue, but the 
Spark 3.0.2 branch does not.

Hi [~cloud_fan], can we open backport PRs for 3.0?


was (Author: gaoyajun02):
https://issues.apache.org/jira/browse/SPARK-33272 fixes this issue, but the 
Spark 3.0.2 branch does not.

Hi [~cloud_fan], can we open backport PRs for 3.0.2?

> Found duplicate rewrite attributes
> --
>
> Key: SPARK-36815
> URL: https://issues.apache.org/jira/browse/SPARK-36815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: gaoyajun02
>Priority: Major
> Fix For: 3.0.2
>
>
> We are using Spark version 3.0.2 in production and some ETLs contain 
> multi-level CETs and the following error occurs when we join them.
> {code:java}
> java.lang.AssertionError: assertion failed: Found duplicate rewrite 
> attributes at scala.Predef$.assert(Predef.scala:223) at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:207) 
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:405)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:403)
> {code}
> I reproduced the problem with a simplified SQL as follows:
> {code:java}
> -- SQL
> with
> a as ( select name, get_json_object(json, '$.id') id, n from (
> select get_json_object(json, '$.name') name, json from values 
> ('{"name":"a", "id": 1}' ) people(json)
> ) LATERAL VIEW explode(array(1, 1, 2)) num as n ),
> b as ( select a1.name, a1.id, a1.n from a a1 left join (select name, count(1) 
> c from a group by name) a2 on a1.name = a2.name)
> select b1.name, b1.n, b1.id from b b1 join b b2 on b1.name = b2.name;{code}
> In debugging I found that a reference to the root Project existed in both 
> subqueries, and when `ResolveReferences` resolved the conflict, `rewrite` 
> occurred in both subqueries, containing two new attrMapping, and they were 
> both eventually passed to the root Project, leading to this error
> plan:
> {code:java}
> Project [name#218, id#219, n#229]
> +- Join LeftOuter, (name#218 = name#232)
>:- SubqueryAlias a1
>:  +- SubqueryAlias a
>: +- Project [name#218, get_json_object(json#225, $.id) AS id#219, 
> n#229]
>:+- Generate explode(array(1, 1, 2)), false, num, [n#229]
>:   +- SubqueryAlias __auto_generated_subquery_name
>:  +- Project [get_json_object(json#225, $.name) AS name#218, 
> json#225]
>: +- SubqueryAlias people
>:+- LocalRelation [json#225]
>+- SubqueryAlias a2
>   +- Aggregate [name#232], [name#232, count(1) AS c#220L]
>  +- SubqueryAlias a
> +- Project [name#232, get_json_object(json#226, $.id) AS id#219, 
> n#230]
>+- Generate explode(array(1, 1, 2)), false, num, [n#230]
>   +- SubqueryAlias __auto_generated_subquery_name
>  +- Project [get_json_object(json#226, $.name) AS 
> name#232, json#226]
> +- SubqueryAlias people
>+- LocalRelation [json#226]
> {code}
>  newPlan:
> {code:java}
> !Project [name#218, id#219, n#229]
> +- Join LeftOuter, (name#218 = name#232)
>:- SubqueryAlias a1
>:  +- SubqueryAlias a
>: +- Project [name#218, get_json_object(json#225, $.id) AS id#233, 
> n#229]
>:+- Generate explode(array(1, 1, 2)), false, num, [n#229]
>:   +- SubqueryAlias __auto_generated_subquery_name
>:  +- Project [get_json_object(json#225, $.name) AS name#218, 
> json#225]
>: +- SubqueryAlias people
>:+- LocalRelation [json#225]
>+- SubqueryAlias a2
>   +- Aggregate [name#232], [name#232, count(1) AS c#220L]
>  +- SubqueryAlias a
> +- Project [name#232, get_json_object(json#226, $.id) AS id#234, 
> n#230]
>+- Generate explode(array(1, 1, 2)), false, num, [n#230]
>   +- SubqueryAlias __auto_generated_subquery_name
>  +- Project [get_json_object(json#226, $.name) AS 
> name#232, json#226]
> +- SubqueryAlias people
>+- LocalRelation [json#226]
> {code}
> attrMapping:
> {code:java}
> attrMapping = {ArrayBuffer@9099} "ArrayBuffer" size = 2
>  0 = {Tupl

[jira] [Comment Edited] (SPARK-36815) Found duplicate rewrite attributes

2021-09-21 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418064#comment-17418064
 ] 

gaoyajun02 edited comment on SPARK-36815 at 9/21/21, 12:17 PM:
---

https://issues.apache.org/jira/browse/SPARK-33272 fixes this issue, but the 
Spark 3.0.2 branch does not.

Hi [~cloud_fan], can we open backport PRs for 3.0.2?


was (Author: gaoyajun02):
https://issues.apache.org/jira/browse/SPARK-33272 fixes this issue, but the 
Spark 3.0.2 branch does not

> Found duplicate rewrite attributes
> --
>
> Key: SPARK-36815
> URL: https://issues.apache.org/jira/browse/SPARK-36815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: gaoyajun02
>Priority: Major
> Fix For: 3.0.2
>
>
> We are using Spark version 3.0.2 in production and some ETLs contain 
> multi-level CETs and the following error occurs when we join them.
> {code:java}
> java.lang.AssertionError: assertion failed: Found duplicate rewrite 
> attributes at scala.Predef$.assert(Predef.scala:223) at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:207) 
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:405)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:403)
> {code}
> I reproduced the problem with a simplified SQL as follows:
> {code:java}
> -- SQL
> with
> a as ( select name, get_json_object(json, '$.id') id, n from (
> select get_json_object(json, '$.name') name, json from values 
> ('{"name":"a", "id": 1}' ) people(json)
> ) LATERAL VIEW explode(array(1, 1, 2)) num as n ),
> b as ( select a1.name, a1.id, a1.n from a a1 left join (select name, count(1) 
> c from a group by name) a2 on a1.name = a2.name)
> select b1.name, b1.n, b1.id from b b1 join b b2 on b1.name = b2.name;{code}
> In debugging I found that a reference to the root Project existed in both 
> subqueries, and when `ResolveReferences` resolved the conflict, `rewrite` 
> occurred in both subqueries, containing two new attrMapping, and they were 
> both eventually passed to the root Project, leading to this error
> plan:
> {code:java}
> Project [name#218, id#219, n#229]
> +- Join LeftOuter, (name#218 = name#232)
>:- SubqueryAlias a1
>:  +- SubqueryAlias a
>: +- Project [name#218, get_json_object(json#225, $.id) AS id#219, 
> n#229]
>:+- Generate explode(array(1, 1, 2)), false, num, [n#229]
>:   +- SubqueryAlias __auto_generated_subquery_name
>:  +- Project [get_json_object(json#225, $.name) AS name#218, 
> json#225]
>: +- SubqueryAlias people
>:+- LocalRelation [json#225]
>+- SubqueryAlias a2
>   +- Aggregate [name#232], [name#232, count(1) AS c#220L]
>  +- SubqueryAlias a
> +- Project [name#232, get_json_object(json#226, $.id) AS id#219, 
> n#230]
>+- Generate explode(array(1, 1, 2)), false, num, [n#230]
>   +- SubqueryAlias __auto_generated_subquery_name
>  +- Project [get_json_object(json#226, $.name) AS 
> name#232, json#226]
> +- SubqueryAlias people
>+- LocalRelation [json#226]
> {code}
>  newPlan:
> {code:java}
> !Project [name#218, id#219, n#229]
> +- Join LeftOuter, (name#218 = name#232)
>:- SubqueryAlias a1
>:  +- SubqueryAlias a
>: +- Project [name#218, get_json_object(json#225, $.id) AS id#233, 
> n#229]
>:+- Generate explode(array(1, 1, 2)), false, num, [n#229]
>:   +- SubqueryAlias __auto_generated_subquery_name
>:  +- Project [get_json_object(json#225, $.name) AS name#218, 
> json#225]
>: +- SubqueryAlias people
>:+- LocalRelation [json#225]
>+- SubqueryAlias a2
>   +- Aggregate [name#232], [name#232, count(1) AS c#220L]
>  +- SubqueryAlias a
> +- Project [name#232, get_json_object(json#226, $.id) AS id#234, 
> n#230]
>+- Generate explode(array(1, 1, 2)), false, num, [n#230]
>   +- SubqueryAlias __auto_generated_subquery_name
>  +- Project [get_json_object(json#226, $.name) AS 
> name#232, json#226]
> +- SubqueryAlias people
>+- LocalRelation [json#226]
> {code}
> attrMapping:
> {code:java}
> attrMapping = {ArrayBuffer@9099} "ArrayBuffer" size = 2
>  0 = {Tuple2@17769} "(id#219,id#233)"
>  1 = {Tuple2@17770} "

[jira] [Commented] (SPARK-36815) Found duplicate rewrite attributes

2021-09-21 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418064#comment-17418064
 ] 

gaoyajun02 commented on SPARK-36815:


https://issues.apache.org/jira/browse/SPARK-33272 fixes this issue, but the 
Spark 3.0.2 branch does not

> Found duplicate rewrite attributes
> --
>
> Key: SPARK-36815
> URL: https://issues.apache.org/jira/browse/SPARK-36815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: gaoyajun02
>Priority: Major
> Fix For: 3.0.2
>
>
> We are using Spark version 3.0.2 in production and some ETLs contain 
> multi-level CETs and the following error occurs when we join them.
> {code:java}
> java.lang.AssertionError: assertion failed: Found duplicate rewrite 
> attributes at scala.Predef$.assert(Predef.scala:223) at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:207) 
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:405)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:403)
> {code}
> I reproduced the problem with a simplified SQL as follows:
> {code:java}
> -- SQL
> with
> a as ( select name, get_json_object(json, '$.id') id, n from (
> select get_json_object(json, '$.name') name, json from values 
> ('{"name":"a", "id": 1}' ) people(json)
> ) LATERAL VIEW explode(array(1, 1, 2)) num as n ),
> b as ( select a1.name, a1.id, a1.n from a a1 left join (select name, count(1) 
> c from a group by name) a2 on a1.name = a2.name)
> select b1.name, b1.n, b1.id from b b1 join b b2 on b1.name = b2.name;{code}
> In debugging I found that a reference to the root Project existed in both 
> subqueries, and when `ResolveReferences` resolved the conflict, `rewrite` 
> occurred in both subqueries, containing two new attrMapping, and they were 
> both eventually passed to the root Project, leading to this error
> plan:
> {code:java}
> Project [name#218, id#219, n#229]
> +- Join LeftOuter, (name#218 = name#232)
>:- SubqueryAlias a1
>:  +- SubqueryAlias a
>: +- Project [name#218, get_json_object(json#225, $.id) AS id#219, 
> n#229]
>:+- Generate explode(array(1, 1, 2)), false, num, [n#229]
>:   +- SubqueryAlias __auto_generated_subquery_name
>:  +- Project [get_json_object(json#225, $.name) AS name#218, 
> json#225]
>: +- SubqueryAlias people
>:+- LocalRelation [json#225]
>+- SubqueryAlias a2
>   +- Aggregate [name#232], [name#232, count(1) AS c#220L]
>  +- SubqueryAlias a
> +- Project [name#232, get_json_object(json#226, $.id) AS id#219, 
> n#230]
>+- Generate explode(array(1, 1, 2)), false, num, [n#230]
>   +- SubqueryAlias __auto_generated_subquery_name
>  +- Project [get_json_object(json#226, $.name) AS 
> name#232, json#226]
> +- SubqueryAlias people
>+- LocalRelation [json#226]
> {code}
>  newPlan:
> {code:java}
> !Project [name#218, id#219, n#229]
> +- Join LeftOuter, (name#218 = name#232)
>:- SubqueryAlias a1
>:  +- SubqueryAlias a
>: +- Project [name#218, get_json_object(json#225, $.id) AS id#233, 
> n#229]
>:+- Generate explode(array(1, 1, 2)), false, num, [n#229]
>:   +- SubqueryAlias __auto_generated_subquery_name
>:  +- Project [get_json_object(json#225, $.name) AS name#218, 
> json#225]
>: +- SubqueryAlias people
>:+- LocalRelation [json#225]
>+- SubqueryAlias a2
>   +- Aggregate [name#232], [name#232, count(1) AS c#220L]
>  +- SubqueryAlias a
> +- Project [name#232, get_json_object(json#226, $.id) AS id#234, 
> n#230]
>+- Generate explode(array(1, 1, 2)), false, num, [n#230]
>   +- SubqueryAlias __auto_generated_subquery_name
>  +- Project [get_json_object(json#226, $.name) AS 
> name#232, json#226]
> +- SubqueryAlias people
>+- LocalRelation [json#226]
> {code}
> attrMapping:
> {code:java}
> attrMapping = {ArrayBuffer@9099} "ArrayBuffer" size = 2
>  0 = {Tuple2@17769} "(id#219,id#233)"
>  1 = {Tuple2@17770} "(id#219,id#234)"
> {code}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additi

[jira] [Updated] (SPARK-36815) Found duplicate rewrite attributes

2021-09-21 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-36815:
---
Description: 
We are using Spark version 3.0.2 in production and some ETLs contain
multi-level CTEs, and the following error occurs when we join them.
{code:java}
java.lang.AssertionError: assertion failed: Found duplicate rewrite attributes 
at scala.Predef$.assert(Predef.scala:223) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:207) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:405)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:403)
{code}
I reproduced the problem with a simplified SQL as follows:
{code:java}
-- SQL
with
a as ( select name, get_json_object(json, '$.id') id, n from (
select get_json_object(json, '$.name') name, json from values 
('{"name":"a", "id": 1}' ) people(json)
) LATERAL VIEW explode(array(1, 1, 2)) num as n ),
b as ( select a1.name, a1.id, a1.n from a a1 left join (select name, count(1) c 
from a group by name) a2 on a1.name = a2.name)
select b1.name, b1.n, b1.id from b b1 join b b2 on b1.name = b2.name;{code}
In debugging I found that an attribute referenced by the root Project existed in
both subqueries; when `ResolveReferences` resolved the conflict, `rewrite`
occurred in both subqueries, each producing a new attrMapping entry, and both
entries were eventually passed up to the root Project, leading to this error.

plan:
{code:java}
Project [name#218, id#219, n#229]
+- Join LeftOuter, (name#218 = name#232)
   :- SubqueryAlias a1
   :  +- SubqueryAlias a
   : +- Project [name#218, get_json_object(json#225, $.id) AS id#219, n#229]
   :+- Generate explode(array(1, 1, 2)), false, num, [n#229]
   :   +- SubqueryAlias __auto_generated_subquery_name
   :  +- Project [get_json_object(json#225, $.name) AS name#218, 
json#225]
   : +- SubqueryAlias people
   :+- LocalRelation [json#225]
   +- SubqueryAlias a2
  +- Aggregate [name#232], [name#232, count(1) AS c#220L]
 +- SubqueryAlias a
+- Project [name#232, get_json_object(json#226, $.id) AS id#219, 
n#230]
   +- Generate explode(array(1, 1, 2)), false, num, [n#230]
  +- SubqueryAlias __auto_generated_subquery_name
 +- Project [get_json_object(json#226, $.name) AS name#232, 
json#226]
+- SubqueryAlias people
   +- LocalRelation [json#226]

{code}
 newPlan:
{code:java}

!Project [name#218, id#219, n#229]
+- Join LeftOuter, (name#218 = name#232)
   :- SubqueryAlias a1
   :  +- SubqueryAlias a
   : +- Project [name#218, get_json_object(json#225, $.id) AS id#233, n#229]
   :+- Generate explode(array(1, 1, 2)), false, num, [n#229]
   :   +- SubqueryAlias __auto_generated_subquery_name
   :  +- Project [get_json_object(json#225, $.name) AS name#218, 
json#225]
   : +- SubqueryAlias people
   :+- LocalRelation [json#225]
   +- SubqueryAlias a2
  +- Aggregate [name#232], [name#232, count(1) AS c#220L]
 +- SubqueryAlias a
+- Project [name#232, get_json_object(json#226, $.id) AS id#234, 
n#230]
   +- Generate explode(array(1, 1, 2)), false, num, [n#230]
  +- SubqueryAlias __auto_generated_subquery_name
 +- Project [get_json_object(json#226, $.name) AS name#232, 
json#226]
+- SubqueryAlias people
   +- LocalRelation [json#226]

{code}
attrMapping:
{code:java}
attrMapping = {ArrayBuffer@9099} "ArrayBuffer" size = 2
 0 = {Tuple2@17769} "(id#219,id#233)"
 1 = {Tuple2@17770} "(id#219,id#234)"
{code}
 

 

 

  was:
We are using Spark version 3.0.2 in production and some ETLs contain 
multi-level CETs and the following error occurs when we join them.
{code:java}
java.lang.AssertionError: assertion failed: Found duplicate rewrite attributes 
at scala.Predef$.assert(Predef.scala:223) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:207) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:405)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:403)
{code}
I reproduced the problem with a simplified SQL as follows:
{code:java}
-- SQL
with
a as ( select name, get_json_object(json, '$.id') id, n from (
select get_json_object(json, '$.na

[jira] [Created] (SPARK-36815) Found duplicate rewrite attributes

2021-09-21 Thread gaoyajun02 (Jira)
gaoyajun02 created SPARK-36815:
--

 Summary: Found duplicate rewrite attributes
 Key: SPARK-36815
 URL: https://issues.apache.org/jira/browse/SPARK-36815
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.2
Reporter: gaoyajun02
 Fix For: 3.0.2


We are using Spark version 3.0.2 in production and some ETLs contain
multi-level CTEs, and the following error occurs when we join them.
{code:java}
java.lang.AssertionError: assertion failed: Found duplicate rewrite attributes 
at scala.Predef$.assert(Predef.scala:223) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:207) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:405)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:403)
{code}
I reproduced the problem with a simplified SQL as follows:
{code:java}
-- SQL
with
a as ( select name, get_json_object(json, '$.id') id, n from (
select get_json_object(json, '$.name') name, json from values 
('{"name":"a", "id": 1}' ) people(json)
) LATERAL VIEW explode(array(1, 1, 2)) num as n ),
b as ( select a1.name, a1.id, a1.n from a a1 left join (select name, count(1) c 
from a group by name) a2 on a1.name = a2.name)
select b1.name, b1.n, b1.id from b b1 join b b2 on b1.name = b2.name;{code}
In debugging I found that an attribute referenced by the root Project existed in
both subqueries; when `ResolveReferences` resolved the conflict, `rewrite`
occurred in both subqueries, each producing a new attrMapping entry, and both
entries were eventually passed up to the root Project, leading to this error.

 

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36630) Add the option to use physical statistics to avoid large tables being broadcast

2021-09-02 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-36630:
---
Parent: (was: SPARK-33828)
Issue Type: Question  (was: Sub-task)

> Add the option to use physical statistics to avoid large tables being 
> broadcast
> ---
>
> Key: SPARK-36630
> URL: https://issues.apache.org/jira/browse/SPARK-36630
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: gaoyajun02
>Priority: Major
>
> Currently when AQE's queryStage is not materialized, it uses the stats of the 
> logical plan to estimate whether the plan can be converted to BHJ, and in 
> some scenarios the estimated value is several orders of magnitude smaller 
> than the actual broadcast data, which can lead to large tables being broadcast



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-36630) Add the option to use physical statistics to avoid large tables being broadcast

2021-09-02 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-36630:
---
Comment: was deleted

(was: close it)

> Add the option to use physical statistics to avoid large tables being 
> broadcast
> ---
>
> Key: SPARK-36630
> URL: https://issues.apache.org/jira/browse/SPARK-36630
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: gaoyajun02
>Priority: Major
>
> Currently when AQE's queryStage is not materialized, it uses the stats of the 
> logical plan to estimate whether the plan can be converted to BHJ, and in 
> some scenarios the estimated value is several orders of magnitude smaller 
> than the actual broadcast data, which can lead to large tables being broadcast



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-36630) Add the option to use physical statistics to avoid large tables being broadcast

2021-09-02 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 closed SPARK-36630.
--

close it

> Add the option to use physical statistics to avoid large tables being 
> broadcast
> ---
>
> Key: SPARK-36630
> URL: https://issues.apache.org/jira/browse/SPARK-36630
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: gaoyajun02
>Priority: Major
>
> Currently when AQE's queryStage is not materialized, it uses the stats of the 
> logical plan to estimate whether the plan can be converted to BHJ, and in 
> some scenarios the estimated value is several orders of magnitude smaller 
> than the actual broadcast data, which can lead to large tables being broadcast



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36630) Add the option to use physical statistics to avoid large tables being broadcast

2021-09-02 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 resolved SPARK-36630.

Resolution: Fixed

> Add the option to use physical statistics to avoid large tables being 
> broadcast
> ---
>
> Key: SPARK-36630
> URL: https://issues.apache.org/jira/browse/SPARK-36630
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: gaoyajun02
>Priority: Major
>
> Currently when AQE's queryStage is not materialized, it uses the stats of the 
> logical plan to estimate whether the plan can be converted to BHJ, and in 
> some scenarios the estimated value is several orders of magnitude smaller 
> than the actual broadcast data, which can lead to large tables being broadcast



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36630) Add the option to use physical statistics to avoid large tables being broadcast

2021-09-02 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408901#comment-17408901
 ] 

gaoyajun02 commented on SPARK-36630:


I found that my issue is similar to
https://issues.apache.org/jira/browse/SPARK-35264.

I can set spark.sql.autoBroadcastJoinThreshold to -1 to disable the estimate
based on logical plan stats, and set
spark.sql.adaptive.autoBroadcastJoinThreshold to control the threshold based on
physical (runtime) stats, as sketched below.
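
A minimal sketch of that configuration in Scala (the 10m value is just an
illustrative threshold; spark.sql.adaptive.autoBroadcastJoinThreshold is
available starting with Spark 3.2.0):
{code:java}
// `spark` is the SparkSession (e.g. the one predefined in spark-shell).
// Disable the broadcast decision based on logical-plan statistics and rely on the
// AQE threshold, which uses runtime (physical) statistics instead.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "10m")
{code}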

> Add the option to use physical statistics to avoid large tables being 
> broadcast
> ---
>
> Key: SPARK-36630
> URL: https://issues.apache.org/jira/browse/SPARK-36630
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: gaoyajun02
>Priority: Major
>
> Currently when AQE's queryStage is not materialized, it uses the stats of the 
> logical plan to estimate whether the plan can be converted to BHJ, and in 
> some scenarios the estimated value is several orders of magnitude smaller 
> than the actual broadcast data, which can lead to large tables being broadcast



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36630) Add the option to use physical statistics to avoid large tables being broadcast

2021-09-02 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-36630:
---
Parent: SPARK-33828
Issue Type: Sub-task  (was: Improvement)

> Add the option to use physical statistics to avoid large tables being 
> broadcast
> ---
>
> Key: SPARK-36630
> URL: https://issues.apache.org/jira/browse/SPARK-36630
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: gaoyajun02
>Priority: Major
>
> Currently when AQE's queryStage is not materialized, it uses the stats of the 
> logical plan to estimate whether the plan can be converted to BHJ, and in 
> some scenarios the estimated value is several orders of magnitude smaller 
> than the actual broadcast data, which can lead to large tables being broadcast



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36630) Add the option to use physical statistics to avoid large tables being broadcast

2021-09-01 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-36630:
---
Description: Currently when AQE's queryStage is not materialized, it uses 
the stats of the logical plan to estimate whether the plan can be converted to 
BHJ, and in some scenarios the estimated value is several orders of magnitude 
smaller than the actual broadcast data, which can lead to large tables being 
broadcast  (was: Currently when AQE's queryStage is not materialized, it uses 
the stats of the logical plan to estimate whether the plan can be converted to 
BHJ, and in some scenarios the estimated value is several orders of magnitude 
larger than the actual broadcast data, which can lead to large tables being 
broadcast)

> Add the option to use physical statistics to avoid large tables being 
> broadcast
> ---
>
> Key: SPARK-36630
> URL: https://issues.apache.org/jira/browse/SPARK-36630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: gaoyajun02
>Priority: Major
>
> Currently when AQE's queryStage is not materialized, it uses the stats of the 
> logical plan to estimate whether the plan can be converted to BHJ, and in 
> some scenarios the estimated value is several orders of magnitude smaller 
> than the actual broadcast data, which can lead to large tables being broadcast



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36630) Add the option to use physical statistics to avoid large tables being broadcast

2021-08-31 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-36630:
---
Description: Currently when AQE's queryStage is not materialized, it uses 
the stats of the logical plan to estimate whether the plan can be converted to 
BHJ, and in some scenarios the estimated value is several orders of magnitude 
larger than the actual broadcast data, which can lead to large tables being 
broadcast  (was: Currently AQE is turned on, when queryStage is not 
materialized, it uses the stats of the logical plan to estimate whether the 
plan can be converted to BHJ, and in some scenarios the estimated value is 
several orders of magnitude larger than the actual broadcast data, which can 
lead to large tables being broadcast)

> Add the option to use physical statistics to avoid large tables being 
> broadcast
> ---
>
> Key: SPARK-36630
> URL: https://issues.apache.org/jira/browse/SPARK-36630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: gaoyajun02
>Priority: Major
>
> Currently when AQE's queryStage is not materialized, it uses the stats of the 
> logical plan to estimate whether the plan can be converted to BHJ, and in 
> some scenarios the estimated value is several orders of magnitude larger than 
> the actual broadcast data, which can lead to large tables being broadcast



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36630) Add the option to use physical statistics to avoid large tables being broadcast

2021-08-31 Thread gaoyajun02 (Jira)
gaoyajun02 created SPARK-36630:
--

 Summary: Add the option to use physical statistics to avoid large 
tables being broadcast
 Key: SPARK-36630
 URL: https://issues.apache.org/jira/browse/SPARK-36630
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: gaoyajun02


Currently AQE is turned on, when queryStage is not materialized, it uses the 
stats of the logical plan to estimate whether the plan can be converted to BHJ, 
and in some scenarios the estimated value is several orders of magnitude larger 
than the actual broadcast data, which can lead to large tables being broadcast



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36339) aggsBuffer should collect AggregateExpression in the map range

2021-07-28 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-36339:
---
Description: 
A demo of this issue:
{code:java}
// SQL without error

SELECT name, count(name) c
FROM VALUES ('Alice'), ('Bob') people(name)
GROUP BY name GROUPING SETS(name);

// An error is reported after exchanging the order of the query columns:

SELECT count(name) c, name
FROM VALUES ('Alice'), ('Bob') people(name)
GROUP BY name GROUPING SETS(name);

{code}
The error message is:
{code:java}
Error in query: expression 'people.`name`' is neither present in the group by, 
nor is it an aggregate function. Add to group by or wrap in first() (or 
first_value) if you don't care which value you get.;;
Aggregate [name#5, spark_grouping_id#3], [count(name#1) AS c#0L, name#1]
+- Expand [List(name#1, name#4, 0)], [name#1, name#5, spark_grouping_id#3]
   +- Project [name#1, name#1 AS name#4]
  +- SubqueryAlias `people`
 +- LocalRelation [name#1]

{code}
So far, I have verified that the problem does not occur before version 2.3.

 

During debugging, I found that the behavior of constructAggregateExprs in 
ResolveGroupingAnalytics has changed.
{code:java}
/*
 * Construct new aggregate expressions by replacing grouping functions.
 */
private def constructAggregateExprs(
groupByExprs: Seq[Expression],
aggregations: Seq[NamedExpression],
groupByAliases: Seq[Alias],
groupingAttrs: Seq[Expression],
gid: Attribute): Seq[NamedExpression] = aggregations.map {
  // collect all the found AggregateExpression, so we can check an 
expression is part of
  // any AggregateExpression or not.
  val aggsBuffer = ArrayBuffer[Expression]()
  // Returns whether the expression belongs to any expressions in 
`aggsBuffer` or not.
  def isPartOfAggregation(e: Expression): Boolean = {
aggsBuffer.exists(a => a.find(_ eq e).isDefined)
  }
  replaceGroupingFunc(_, groupByExprs, gid).transformDown {
// AggregateExpression should be computed on the unmodified value of 
its argument
// expressions, so we should not replace any references to grouping 
expression
// inside it.
case e: AggregateExpression =>
  aggsBuffer += e
  e
case e if isPartOfAggregation(e) => e
case e =>
  // Replace expression by expand output attribute.
  val index = groupByAliases.indexWhere(_.child.semanticEquals(e))
  if (index == -1) {
e
  } else {
groupingAttrs(index)
  }
  }.asInstanceOf[NamedExpression]
}

{code}
When aggregations.map runs, the aggsBuffer here is effectively outside the scope
of the mapped function: the statements in the block are evaluated only once, so
the buffer accumulates the AggregateExpressions of all the elements processed by
the map function. This was not the case before 2.3. A small self-contained
illustration of this scoping behavior is shown below.
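
The following sketch only mirrors the structure of constructAggregateExprs (the
values are made up); it shows that the block passed to map is evaluated once, so
a buffer declared in it is shared by every element, whereas declaring it inside
an explicit per-element lambda gives a fresh buffer each time:
{code:java}
import scala.collection.mutable.ArrayBuffer

// Shared buffer: the statements in this block run once; only the closure at the
// end becomes the mapped function, so `seen` keeps growing across elements.
val shared = Seq(1, 2, 3).map {
  val seen = ArrayBuffer[Int]()
  (x: Int) => { seen += x; seen.size }
}
// shared == List(1, 2, 3)

// Per-element buffer: declaring it inside the lambda gives a fresh buffer each time.
val perElement = Seq(1, 2, 3).map { x =>
  val seen = ArrayBuffer[Int]()
  seen += x
  seen.size
}
// perElement == List(1, 1, 1)
{code}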

  was:
show demo for this ISSUE:

This SQL 

This SQL can be executed normally

 
{code:java}
// SQL without error

SELECT name, count(name) c
FROM VALUES ('Alice'), ('Bob') people(name)
GROUP BY name GROUPING SETS(name);

// An error is reported after exchanging the order of the query columns:

SELECT count(name) c, name
FROM VALUES ('Alice'), ('Bob') people(name)
GROUP BY name GROUPING SETS(name);

{code}
The error message is:

 
{code:java}

Error in query: expression 'people.`name`' is neither present in the group by, 
nor is it an aggregate function. Add to group by or wrap in first() (or 
first_value) if you don't care which value you get.;;
Aggregate [name#5, spark_grouping_id#3], [count(name#1) AS c#0L, name#1]
+- Expand [List(name#1, name#4, 0)], [name#1, name#5, spark_grouping_id#3]
   +- Project [name#1, name#1 AS name#4]
  +- SubqueryAlias `people`
 +- LocalRelation [name#1]

{code}
So far, I have checked that there is no problem before version 2.3. 

 

During debugging, I found that the behavior of constructAggregateExprs in 
ResolveGroupingAnalytics has changed. 
{code:java}
/*
 * Construct new aggregate expressions by replacing grouping functions.
 */
private def constructAggregateExprs(
groupByExprs: Seq[Expression],
aggregations: Seq[NamedExpression],
groupByAliases: Seq[Alias],
groupingAttrs: Seq[Expression],
gid: Attribute): Seq[NamedExpression] = aggregations.map {
  // collect all the found AggregateExpression, so we can check an 
expression is part of
  // any AggregateExpression or not.
  val aggsBuffer = ArrayBuffer[Expression]()
  // Returns whether the expression belongs to any expressions in 
`aggsBuffer` or not.
  def isPartOfAggregation(e: Expression): Boolean = {
aggsBuffer.exists(a => a.find(_ eq e).isDefined)
  }
  replaceGroupingFunc(_, groupByExprs, gid).transformDown {
// AggregateExpression should be computed on the unmodified value of 
its argument
// expressions, so 

[jira] [Created] (SPARK-36339) aggsBuffer should collect AggregateExpression in the map range

2021-07-28 Thread gaoyajun02 (Jira)
gaoyajun02 created SPARK-36339:
--

 Summary: aggsBuffer should collect AggregateExpression in the map 
range
 Key: SPARK-36339
 URL: https://issues.apache.org/jira/browse/SPARK-36339
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2, 3.0.3, 2.4.8
Reporter: gaoyajun02


A demo of this issue. This SQL can be executed normally:

 
{code:java}
// SQL without error

SELECT name, count(name) c
FROM VALUES ('Alice'), ('Bob') people(name)
GROUP BY name GROUPING SETS(name);

// An error is reported after exchanging the order of the query columns:

SELECT count(name) c, name
FROM VALUES ('Alice'), ('Bob') people(name)
GROUP BY name GROUPING SETS(name);

{code}
The error message is:

 
{code:java}

Error in query: expression 'people.`name`' is neither present in the group by, 
nor is it an aggregate function. Add to group by or wrap in first() (or 
first_value) if you don't care which value you get.;;
Aggregate [name#5, spark_grouping_id#3], [count(name#1) AS c#0L, name#1]
+- Expand [List(name#1, name#4, 0)], [name#1, name#5, spark_grouping_id#3]
   +- Project [name#1, name#1 AS name#4]
  +- SubqueryAlias `people`
 +- LocalRelation [name#1]

{code}
So far, I have checked that there is no problem before version 2.3. 

 

During debugging, I found that the behavior of constructAggregateExprs in 
ResolveGroupingAnalytics has changed. 
{code:java}
/*
 * Construct new aggregate expressions by replacing grouping functions.
 */
private def constructAggregateExprs(
groupByExprs: Seq[Expression],
aggregations: Seq[NamedExpression],
groupByAliases: Seq[Alias],
groupingAttrs: Seq[Expression],
gid: Attribute): Seq[NamedExpression] = aggregations.map {
  // collect all the found AggregateExpression, so we can check an 
expression is part of
  // any AggregateExpression or not.
  val aggsBuffer = ArrayBuffer[Expression]()
  // Returns whether the expression belongs to any expressions in 
`aggsBuffer` or not.
  def isPartOfAggregation(e: Expression): Boolean = {
aggsBuffer.exists(a => a.find(_ eq e).isDefined)
  }
  replaceGroupingFunc(_, groupByExprs, gid).transformDown {
// AggregateExpression should be computed on the unmodified value of 
its argument
// expressions, so we should not replace any references to grouping 
expression
// inside it.
case e: AggregateExpression =>
  aggsBuffer += e
  e
case e if isPartOfAggregation(e) => e
case e =>
  // Replace expression by expand output attribute.
  val index = groupByAliases.indexWhere(_.child.semanticEquals(e))
  if (index == -1) {
e
  } else {
groupingAttrs(index)
  }
  }.asInstanceOf[NamedExpression]
}

{code}
When performing aggregations.map, the aggsBuffer here seems to be outside the 
scope of the map. It can store the AggregateExpression of all the elements 
processed by the map function, but this is not before 2.3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36121) Write data loss caused by stage retry when enable v2 FileOutputCommitter

2021-07-14 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380646#comment-17380646
 ] 

gaoyajun02 commented on SPARK-36121:


Hi [~hyukjin.kwon], Thank you very much.

It seems to be related to SPARK-27194. More precisely, my issue shows the same
phenomenon as SPARK-26682; we are using the earlier Spark 2.2.1 version.

I will read and cherry-pick this part of the patch.

> Write data loss caused by stage retry when enable v2 FileOutputCommitter
> 
>
> Key: SPARK-36121
> URL: https://issues.apache.org/jira/browse/SPARK-36121
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1, 3.0.1
>Reporter: gaoyajun02
>Priority: Critical
>
> All our ETL scenarios are configured: 
> mapreduce.fileoutputcommitter.algorithm.version=2, when shuffle fetchFailed 
> occurs, the stage retry is triggered, and then the zombie stage and the retry 
> stage may write tasks of the same part at the same time, and their task 
> directory and file name are exactly the same. This may cause data part loss 
> due to conflicts between delete and rename operations.
> For example, this is also a data loss case I encountered recently: Stage 5.0 
> is a zombie stage caused by shuffle FetchFailed, and stage 5.1 is a retry 
> stage. They have two tasks concurrently writing the same part file: 
> part-00298.
>  # The task of stage 5.1 has preemptively created part file: part-00298 and 
> written data.
>  # At the same time as the task commit of stage 5.1, the task of sage 5.0 is 
> going to create this part file to write data, because the file already 
> exists, it throw an exception and delete the task's temporary directory.
>  # Then stage 5.0 starts commitTask, it will traverse the sub-directories and 
> execute rename. At this time, because the file has been deleted, it finally 
> moves empty without any exception, which causes data loss.
>  
> I read this part of the code, and currently I think of two ideas:
>  # Add stageAttemptNumber to taskAttemptPath to avoid conflicts.
>  # Check the number of files after commitTask, and throw an exception 
> directly when it is found to be missing.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36121) Write data loss caused by stage retry when enable v2 FileOutputCommitter

2021-07-13 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-36121:
---
Description: 
All our ETL scenarios are configured with
mapreduce.fileoutputcommitter.algorithm.version=2. When a shuffle FetchFailed
occurs, a stage retry is triggered, and the zombie stage and the retry stage may
then write tasks for the same part at the same time, with exactly the same task
directory and file name. This can cause part files to be lost due to conflicts
between delete and rename operations.

For example, this is a data loss case I encountered recently: stage 5.0 is a
zombie stage caused by a shuffle FetchFailed, and stage 5.1 is its retry. They
have two tasks concurrently writing the same part file: part-00298.
 # The task of stage 5.1 preemptively created part file part-00298 and wrote its
data.
 # At the same time as the task commit of stage 5.1, the task of stage 5.0 was
about to create the same part file to write its data; because the file already
existed, it threw an exception and deleted the task's temporary directory.
 # Then stage 5.0 started commitTask, which traverses the sub-directories and
executes rename. Because the file had already been deleted, the rename moved
nothing without raising any exception, which causes the data loss.

I read this part of the code, and currently I can think of two ideas:
 # Add the stageAttemptNumber to the taskAttemptPath to avoid conflicts.
 # Check the number of files after commitTask and throw an exception directly
when any are found to be missing (a minimal sketch of this check follows below).
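
A hypothetical sketch of idea 2 (the helper name and parameters are made up for
illustration; this is not an existing Spark or Hadoop API):
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical check run right after commitTask: verify that every part file the
// task reported is actually present in the committed location, and fail loudly
// instead of letting a zombie-stage conflict drop a part file silently.
def verifyCommittedFiles(conf: Configuration, committedDir: Path, expectedFiles: Seq[String]): Unit = {
  val fs = FileSystem.get(committedDir.toUri, conf)
  val missing = expectedFiles.filterNot(name => fs.exists(new Path(committedDir, name)))
  if (missing.nonEmpty) {
    throw new IllegalStateException(
      s"Missing committed output files ${missing.mkString(", ")} under $committedDir; " +
        "possible conflict with a zombie stage attempt")
  }
}
{code}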

 

 

  was:
All our ETL scenarios are configured: 
mapreduce.fileoutputcommitter.algorithm.version=2, when shuffle fetchFailed 
occurs, the stage retry is triggered, and then the zombie stage and the retry 
stage may write tasks of the same part at the same time, and their task 
directory and file name are exactly the same. This may cause data part loss due 
to conflicts between file writing and rename operations.

For example, this is also a data loss case I encountered recently: Stage 5.0 is 
a zombie stage caused by shuffle FetchFailed, and stage 5.1 is a retry stage. 
They have two tasks concurrently writing the same part file: part-00298.
 # The task of stage 5.1 has preemptively created part file: part-00298 and 
written data.
 # At the same time as the task commit of stage 5.1, the task of sage 5.0 is 
going to create this part file to write data, because the file already exists, 
it throw an exception and delete the task's temporary directory.
 # Then stage 5.0 starts commitTask, it will traverse the sub-directories and 
execute rename. At this time, because the file has been deleted, it finally 
moves empty without any exception, which causes data loss.

 

I read this part of the code, and currently I think of two ideas: 
 # Add stageAttemptNumber to taskAttemptPath to avoid conflicts.
 # Check the number of files after commitTask, and throw an exception directly 
when it is found to be missing.

 

 


> Write data loss caused by stage retry when enable v2 FileOutputCommitter
> 
>
> Key: SPARK-36121
> URL: https://issues.apache.org/jira/browse/SPARK-36121
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1, 3.0.1
>Reporter: gaoyajun02
>Priority: Critical
>
> All our ETL scenarios are configured: 
> mapreduce.fileoutputcommitter.algorithm.version=2, when shuffle fetchFailed 
> occurs, the stage retry is triggered, and then the zombie stage and the retry 
> stage may write tasks of the same part at the same time, and their task 
> directory and file name are exactly the same. This may cause data part loss 
> due to conflicts between delete and rename operations.
> For example, this is also a data loss case I encountered recently: Stage 5.0 
> is a zombie stage caused by shuffle FetchFailed, and stage 5.1 is a retry 
> stage. They have two tasks concurrently writing the same part file: 
> part-00298.
>  # The task of stage 5.1 has preemptively created part file: part-00298 and 
> written data.
>  # At the same time as the task commit of stage 5.1, the task of sage 5.0 is 
> going to create this part file to write data, because the file already 
> exists, it throw an exception and delete the task's temporary directory.
>  # Then stage 5.0 starts commitTask, it will traverse the sub-directories and 
> execute rename. At this time, because the file has been deleted, it finally 
> moves empty without any exception, which causes data loss.
>  
> I read this part of the code, and currently I think of two ideas:
>  # Add stageAttemptNumber to taskAttemptPath to avoid conflicts.
>  # Check the number of files after commitTask, and throw an exception 
> directly when it is found to be missing.
>  
>  



--
This message w

[jira] [Updated] (SPARK-36121) Write data loss caused by stage retry when enable v2 FileOutputCommitter

2021-07-13 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-36121:
---
Description: 
All our ETL scenarios are configured: 
mapreduce.fileoutputcommitter.algorithm.version=2, when shuffle fetchFailed 
occurs, the stage retry is triggered, and then the zombie stage and the retry 
stage may write tasks of the same part at the same time, and their task 
directory and file name are exactly the same. This may cause data part loss due 
to conflicts between file writing and rename operations.

For example, this is also a data loss case I encountered recently: Stage 5.0 is 
a zombie stage caused by shuffle FetchFailed, and stage 5.1 is a retry stage. 
They have two tasks concurrently writing the same part file: part-00298.
 # The task of stage 5.1 has preemptively created part file: part-00298 and 
written data.
 # At the same time as the task commit of stage 5.1, the task of sage 5.0 is 
going to create this part file to write data, because the file already exists, 
it throw an exception and delete the task's temporary directory.
 # Then stage 5.0 starts commitTask, it will traverse the sub-directories and 
execute rename. At this time, because the file has been deleted, it finally 
moves empty without any exception, which causes data loss.

 

I read this part of the code, and currently I think of two ideas: 
 # Add stageAttemptNumber to taskAttemptPath to avoid conflicts.
 # Check the number of files after commitTask, and throw an exception directly 
when it is found to be missing.

 

 

  was:
All our ETL scenarios are configured:
mapreduce.fileoutputcommitter.algorithm.version=2, when shuffle fetchFailed 
occurs, the stage retry is triggered, and then the zombie stage and the retry 
stage may write tasks of the same part at the same time, and their task 
directory and file name are exactly the same. This may cause data part loss due 
to conflicts between file writing and rename operations. For example, recently 
encountered a case of data loss:

Stage 5.0 is a zombie stage caused by shuffle FetchFailed, and stage 5.1 is a 
retry stage. They have two tasks concurrently writing the same part file: 
part-00298.
 # The task of stage 5.1 has preemptively created part file: part-00298 and 
written data.
 # At the same time as the task commit of stage 5.1, the task of sage 5.0 is 
going to create this part file to write data, because the file already exists, 
it throw an exception and delete the task's temporary directory.
 # Then stage 5.0 starts commitTask, it will traverse the sub-directories and 
execute rename. At this time, because the file has been deleted, it finally 
moves without any exception, which causes data loss.


> Write data loss caused by stage retry when enable v2 FileOutputCommitter
> 
>
> Key: SPARK-36121
> URL: https://issues.apache.org/jira/browse/SPARK-36121
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1, 3.0.1
>Reporter: gaoyajun02
>Priority: Critical
>
> All our ETL scenarios are configured: 
> mapreduce.fileoutputcommitter.algorithm.version=2, when shuffle fetchFailed 
> occurs, the stage retry is triggered, and then the zombie stage and the retry 
> stage may write tasks of the same part at the same time, and their task 
> directory and file name are exactly the same. This may cause data part loss 
> due to conflicts between file writing and rename operations.
> For example, this is also a data loss case I encountered recently: Stage 5.0 
> is a zombie stage caused by shuffle FetchFailed, and stage 5.1 is a retry 
> stage. They have two tasks concurrently writing the same part file: 
> part-00298.
>  # The task of stage 5.1 has preemptively created part file: part-00298 and 
> written data.
>  # At the same time as the task commit of stage 5.1, the task of sage 5.0 is 
> going to create this part file to write data, because the file already 
> exists, it throw an exception and delete the task's temporary directory.
>  # Then stage 5.0 starts commitTask, it will traverse the sub-directories and 
> execute rename. At this time, because the file has been deleted, it finally 
> moves empty without any exception, which causes data loss.
>  
> I read this part of the code, and currently I think of two ideas: 
>  # Add stageAttemptNumber to taskAttemptPath to avoid conflicts.
>  # Check the number of files after commitTask, and throw an exception 
> directly when it is found to be missing.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36121) Write data loss caused by stage retry when enable v2 FileOutputCommitter

2021-07-13 Thread gaoyajun02 (Jira)
gaoyajun02 created SPARK-36121:
--

 Summary: Write data loss caused by stage retry when enable v2 
FileOutputCommitter
 Key: SPARK-36121
 URL: https://issues.apache.org/jira/browse/SPARK-36121
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 3.0.1, 2.2.1
Reporter: gaoyajun02


All our ETL scenarios are configured:
mapreduce.fileoutputcommitter.algorithm.version=2, when shuffle fetchFailed 
occurs, the stage retry is triggered, and then the zombie stage and the retry 
stage may write tasks of the same part at the same time, and their task 
directory and file name are exactly the same. This may cause data part loss due 
to conflicts between file writing and rename operations. For example, recently 
encountered a case of data loss:

Stage 5.0 is a zombie stage caused by shuffle FetchFailed, and stage 5.1 is a 
retry stage. They have two tasks concurrently writing the same part file: 
part-00298.
 # The task of stage 5.1 has preemptively created part file: part-00298 and 
written data.
 # At the same time as the task commit of stage 5.1, the task of sage 5.0 is 
going to create this part file to write data, because the file already exists, 
it throw an exception and delete the task's temporary directory.
 # Then stage 5.0 starts commitTask, it will traverse the sub-directories and 
execute rename. At this time, because the file has been deleted, it finally 
moves without any exception, which causes data loss.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-35920) Upgrade to Chill 0.10.0

2021-06-29 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371213#comment-17371213
 ] 

gaoyajun02 edited comment on SPARK-35920 at 6/29/21, 9:31 AM:
--

The common/unsafe module lost this dependency in this update, and it needs to be
added back:
{code:java}
<dependency>
  <groupId>com.esotericsoftware</groupId>
  <artifactId>kryo-shaded</artifactId>
  <version>4.0.2</version>
</dependency>
{code}
I submitted a PR to try to fix this problem; please take a look and see if
there is anything else to add.

 


was (Author: gaoyajun02):
The spark core module loses this dependency in this update and needs to be added
{code:java}
<dependency>
  <groupId>com.esotericsoftware</groupId>
  <artifactId>kryo-shaded</artifactId>
  <version>4.0.2</version>
</dependency>
{code}

> Upgrade to Chill 0.10.0
> ---
>
> Key: SPARK-35920
> URL: https://issues.apache.org/jira/browse/SPARK-35920
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35920) Upgrade to Chill 0.10.0

2021-06-29 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371213#comment-17371213
 ] 

gaoyajun02 commented on SPARK-35920:


The spark core module lost this dependency in this update, and it needs to be
added back:
{code:java}
<dependency>
  <groupId>com.esotericsoftware</groupId>
  <artifactId>kryo-shaded</artifactId>
  <version>4.0.2</version>
</dependency>
{code}

> Upgrade to Chill 0.10.0
> ---
>
> Key: SPARK-35920
> URL: https://issues.apache.org/jira/browse/SPARK-35920
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org