[jira] [Updated] (SPARK-47919) credulousTrustStoreManagers should return empty array for accepted issuers

2024-04-19 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-47919:

Priority: Minor  (was: Major)

> credulousTrustStoreManagers should return empty array for accepted issuers
> --
>
> Key: SPARK-47919
> URL: https://issues.apache.org/jira/browse/SPARK-47919
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Mridul Muralidharan
>Priority: Minor
>
> {{TrustManager}} instances returned by {{SSLFactory.credulousTrustStoreManagers}} currently 
> return {{null}} for accepted issuers, but should return an empty array instead, as per 
> the javadoc contract and caller expectations.
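The contract in question is that of {{X509TrustManager.getAcceptedIssuers}}, whose javadoc specifies a (possibly empty) array, never {{null}}. A minimal standalone sketch of a "trust everything" manager illustrating the fix (the class name here is illustrative, not Spark's actual implementation):

```java
import java.security.cert.X509Certificate;
import javax.net.ssl.X509TrustManager;

// Illustrative "credulous" trust manager: accepts any peer certificate.
class CredulousTrustManager implements X509TrustManager {
    @Override
    public void checkClientTrusted(X509Certificate[] chain, String authType) {
        // Intentionally accept everything.
    }

    @Override
    public void checkServerTrusted(X509Certificate[] chain, String authType) {
        // Intentionally accept everything.
    }

    @Override
    public X509Certificate[] getAcceptedIssuers() {
        // Returning null here would violate the javadoc contract and can NPE
        // callers that iterate the result; an empty array is the correct
        // "no specific issuers" answer.
        return new X509Certificate[0];
    }
}
```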



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47919) credulousTrustStoreManagers should return empty array for accepted issuers

2024-04-19 Thread Mridul Muralidharan (Jira)
Mridul Muralidharan created SPARK-47919:
---

 Summary: credulousTrustStoreManagers should return empty array for 
accepted issuers
 Key: SPARK-47919
 URL: https://issues.apache.org/jira/browse/SPARK-47919
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Mridul Muralidharan


{{TrustManager}} instances returned by {{SSLFactory.credulousTrustStoreManagers}} currently 
return {{null}} for accepted issuers, but should return an empty array instead, as per 
the javadoc contract and caller expectations.






[jira] [Comment Edited] (SPARK-47172) Upgrade Transport block cipher mode to GCM

2024-04-17 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838474#comment-17838474
 ] 

Mridul Muralidharan edited comment on SPARK-47172 at 4/18/24 5:47 AM:
--

We do not backport features to released versions, so TLS will be in 4.x, not 
3.x.
Given the security implications of SPARK-47318, it was backported to 3.4 and 
3.5, as it fixed a security issue in existing functionality.

This proposal reads like new feature development, which would typically be 
out of scope for 3.x.
Given TLS, would it be very useful for 4.x either?


was (Author: mridulm80):
We do not backport features to released versions - so TLS will be in 4.x, not 
3.x
Given the security implications for  SPARK-47318, it was backported to 3.4 and 
3.5 - as it was fixing a security issue in existing functionality.

This proposal reads like a new feature development, while would be out of scope 
for 3.x
Given TLS, not very useful for 4.x either ?

> Upgrade Transport block cipher mode to GCM
> --
>
> Key: SPARK-47172
> URL: https://issues.apache.org/jira/browse/SPARK-47172
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Steve Weis
>Priority: Minor
>
> The cipher transformation currently used for encrypting RPC calls is an 
> unauthenticated mode (AES/CTR/NoPadding). This needs to be upgraded to an 
> authenticated mode (AES/GCM/NoPadding) to prevent ciphertext from being 
> modified in transit.
> The relevant line is here: 
> [https://github.com/apache/spark/blob/a939a7d0fd9c6b23c879cbee05275c6fbc939e38/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L220]
> GCM is relatively more computationally expensive than CTR and adds a 16-byte 
> block of authentication tag data to each payload. 
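The trade-off described above can be seen in a small standalone JCE sketch (not Spark's transport code): with a 128-bit tag, GCM ciphertext is exactly 16 bytes longer than the plaintext, and any modification of the ciphertext makes decryption fail, which is precisely the integrity guarantee CTR lacks.

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Minimal AES/GCM sketch: authenticated encryption with a 16-byte tag.
class GcmSketch {
    static byte[] encrypt(SecretKey key, byte[] iv, byte[] plaintext) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        // 128-bit (16-byte) authentication tag, appended to the ciphertext.
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return c.doFinal(plaintext);
    }

    static byte[] decrypt(SecretKey key, byte[] iv, byte[] ciphertext) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        // Throws AEADBadTagException if the ciphertext or tag was modified.
        return c.doFinal(ciphertext);
    }
}
```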






[jira] [Commented] (SPARK-47172) Upgrade Transport block cipher mode to GCM

2024-04-17 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838474#comment-17838474
 ] 

Mridul Muralidharan commented on SPARK-47172:
-

We do not backport features to released versions, so TLS will be in 4.x, not 
3.x.
Given the security implications of SPARK-47318, it was backported to 3.4 and 
3.5, as it fixed a security issue in existing functionality.

This proposal reads like new feature development, which would be out of scope 
for 3.x.
Given TLS, is it very useful for 4.x either?

> Upgrade Transport block cipher mode to GCM
> --
>
> Key: SPARK-47172
> URL: https://issues.apache.org/jira/browse/SPARK-47172
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Steve Weis
>Priority: Minor
>
> The cipher transformation currently used for encrypting RPC calls is an 
> unauthenticated mode (AES/CTR/NoPadding). This needs to be upgraded to an 
> authenticated mode (AES/GCM/NoPadding) to prevent ciphertext from being 
> modified in transit.
> The relevant line is here: 
> [https://github.com/apache/spark/blob/a939a7d0fd9c6b23c879cbee05275c6fbc939e38/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L220]
> GCM is relatively more computationally expensive than CTR and adds a 16-byte 
> block of authentication tag data to each payload. 






[jira] [Commented] (SPARK-47172) Upgrade Transport block cipher mode to GCM

2024-04-16 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837927#comment-17837927
 ] 

Mridul Muralidharan commented on SPARK-47172:
-

Given that we have addressed SPARK-47318, and that Spark supports TLS from 4.0, 
is this still a concern?
Given that it would be backward incompatible, I am hesitant to expand support for 
something which is expected to go away "soon".

> Upgrade Transport block cipher mode to GCM
> --
>
> Key: SPARK-47172
> URL: https://issues.apache.org/jira/browse/SPARK-47172
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Steve Weis
>Priority: Minor
>
> The cipher transformation currently used for encrypting RPC calls is an 
> unauthenticated mode (AES/CTR/NoPadding). This needs to be upgraded to an 
> authenticated mode (AES/GCM/NoPadding) to prevent ciphertext from being 
> modified in transit.
> The relevant line is here: 
> [https://github.com/apache/spark/blob/a939a7d0fd9c6b23c879cbee05275c6fbc939e38/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L220]
> GCM is relatively more computationally expensive than CTR and adds a 16-byte 
> block of authentication tag data to each payload. 






[jira] [Assigned] (SPARK-47253) Allow LiveEventBus to stop without the completely draining of event queue

2024-04-12 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-47253:
---

Assignee: TakawaAkirayo

> Allow LiveEventBus to stop without the completely draining of event queue
> -
>
> Key: SPARK-47253
> URL: https://issues.apache.org/jira/browse/SPARK-47253
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: TakawaAkirayo
>Assignee: TakawaAkirayo
>Priority: Minor
>  Labels: pull-request-available
>
> #Problem statement:
> SparkContext.stop() hangs for a long time in LiveEventBus.stop() when 
> listeners are slow.
> #User scenarios:
> We have a centralized service with multiple instances that regularly executes 
> users' scheduled tasks.
> For each user task within one service instance, the process is as follows:
> 1. Create a Spark session directly within the service process, with an account 
> defined in the task.
> 2. Instantiate listeners by class name and register them with the 
> SparkContext. The JARs containing the listener classes are uploaded to the 
> service by the user.
> 3. Prepare resources.
> 4. Run user logic (Spark SQL).
> 5. Stop the Spark session by invoking SparkSession.stop().
> In step 5, it waits for the LiveEventBus to stop, which requires the 
> remaining events to be completely drained by each listener.
> Since the listeners are implemented by users, we cannot prevent heavy work 
> inside a listener on each event. There are cases where a single heavy job has 
> over 30,000 tasks, and it can take 30 minutes for a listener to process all 
> the remaining events, because the listener takes a coarse-grained global lock 
> and updates its internal status in a remote database.
> This kind of delay affects other user tasks in the queue. Therefore, from the 
> server-side perspective, we need a guarantee that the stop operation 
> finishes quickly.
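The behavior requested above can be sketched as a bounded-wait stop: wait up to a timeout for the queue to drain, then abandon the remaining events instead of blocking indefinitely. This is a hypothetical simplification, not Spark's actual LiveEventBus:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of an event bus whose stop() is bounded by a timeout.
class BoundedEventBus {
    private final LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private volatile boolean stopped = false;
    private final Thread dispatcher = new Thread(() -> {
        try {
            // Keep draining until stopped AND the queue is empty.
            while (!stopped || !queue.isEmpty()) {
                Runnable event = queue.poll(10, TimeUnit.MILLISECONDS);
                if (event != null) event.run();   // user listener; may be slow
            }
        } catch (InterruptedException ignored) {
            // Interrupted by stop(): give up on remaining events.
        }
    });

    void start() { dispatcher.start(); }

    void post(Runnable event) { if (!stopped) queue.add(event); }

    /** Returns true if fully drained, false if events were abandoned. */
    boolean stop(long timeoutMs) throws InterruptedException {
        stopped = true;
        dispatcher.join(timeoutMs);       // bounded wait instead of full drain
        if (dispatcher.isAlive()) {
            dispatcher.interrupt();       // abandon the remaining events
            dispatcher.join();
            return false;
        }
        return true;
    }
}
```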






[jira] [Resolved] (SPARK-47253) Allow LiveEventBus to stop without the completely draining of event queue

2024-04-12 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-47253.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45367
[https://github.com/apache/spark/pull/45367]

> Allow LiveEventBus to stop without the completely draining of event queue
> -
>
> Key: SPARK-47253
> URL: https://issues.apache.org/jira/browse/SPARK-47253
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: TakawaAkirayo
>Assignee: TakawaAkirayo
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> #Problem statement:
> SparkContext.stop() hangs for a long time in LiveEventBus.stop() when 
> listeners are slow.
> #User scenarios:
> We have a centralized service with multiple instances that regularly executes 
> users' scheduled tasks.
> For each user task within one service instance, the process is as follows:
> 1. Create a Spark session directly within the service process, with an account 
> defined in the task.
> 2. Instantiate listeners by class name and register them with the 
> SparkContext. The JARs containing the listener classes are uploaded to the 
> service by the user.
> 3. Prepare resources.
> 4. Run user logic (Spark SQL).
> 5. Stop the Spark session by invoking SparkSession.stop().
> In step 5, it waits for the LiveEventBus to stop, which requires the 
> remaining events to be completely drained by each listener.
> Since the listeners are implemented by users, we cannot prevent heavy work 
> inside a listener on each event. There are cases where a single heavy job has 
> over 30,000 tasks, and it can take 30 minutes for a listener to process all 
> the remaining events, because the listener takes a coarse-grained global lock 
> and updates its internal status in a remote database.
> This kind of delay affects other user tasks in the queue. Therefore, from the 
> server-side perspective, we need a guarantee that the stop operation 
> finishes quickly.






[jira] [Updated] (SPARK-47318) AuthEngine key exchange needs additional KDF round

2024-04-11 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-47318:

Affects Version/s: 3.5.0
   3.4.0

>  AuthEngine key exchange needs additional KDF round
> ---
>
> Key: SPARK-47318
> URL: https://issues.apache.org/jira/browse/SPARK-47318
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Steve Weis
>Assignee: Steve Weis
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> AuthEngine implements a bespoke [key exchange 
> protocol|https://github.com/apache/spark/tree/master/common/network-common/src/main/java/org/apache/spark/network/crypto]
>  based on the NNpsk0 Noise pattern, using X25519.
> The Spark code improperly uses the derived shared secret directly, which is 
> an encoded X coordinate. This should be passed into a KDF rather than used 
> directly.
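The idea behind the fix can be sketched with an HKDF-style extract-and-expand over HMAC-SHA256 (RFC 5869, single output block): the raw X25519 shared secret is an encoded curve coordinate, not a uniformly random byte string, so it is run through a KDF before being used as a symmetric key. This is an illustration of the technique only; Spark's actual AuthEngine change may derive keys differently.

```java
import java.nio.charset.StandardCharsets;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// HKDF sketch: derive a symmetric key from a raw shared secret.
class HkdfSketch {
    static byte[] hmac(byte[] key, byte[] data) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return mac.doFinal(data);
    }

    /** HKDF with a single 32-byte output block (RFC 5869, L <= HashLen). */
    static byte[] deriveKey(byte[] sharedSecret, byte[] salt, String info) throws Exception {
        byte[] prk = hmac(salt, sharedSecret);                  // extract step
        byte[] infoBytes = info.getBytes(StandardCharsets.UTF_8);
        byte[] input = new byte[infoBytes.length + 1];          // info || 0x01
        System.arraycopy(infoBytes, 0, input, 0, infoBytes.length);
        input[input.length - 1] = 0x01;
        return hmac(prk, input);                                // expand, first block
    }
}
```

Distinct `info` labels (e.g. per direction) yield independent keys from the same secret.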






[jira] [Resolved] (SPARK-47318) AuthEngine key exchange needs additional KDF round

2024-04-11 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-47318.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45425
[https://github.com/apache/spark/pull/45425]

>  AuthEngine key exchange needs additional KDF round
> ---
>
> Key: SPARK-47318
> URL: https://issues.apache.org/jira/browse/SPARK-47318
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 4.0.0
>Reporter: Steve Weis
>Assignee: Steve Weis
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> AuthEngine implements a bespoke [key exchange 
> protocol|https://github.com/apache/spark/tree/master/common/network-common/src/main/java/org/apache/spark/network/crypto]
>  based on the NNpsk0 Noise pattern, using X25519.
> The Spark code improperly uses the derived shared secret directly, which is 
> an encoded X coordinate. This should be passed into a KDF rather than used 
> directly.






[jira] [Comment Edited] (SPARK-47759) Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate time string

2024-04-09 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835587#comment-17835587
 ] 

Mridul Muralidharan edited comment on SPARK-47759 at 4/10/24 4:08 AM:
--

In order to validate, I would suggest two things.
Wrap the input str (or its lowercased form) in quotes in the exception message ... 
for example, "120s\u00A0" will look like 120s in the exception message, as it 
ends in a Unicode non-breaking space.
The other would be to include the NumberFormatException 'e' as the cause in the 
exception being thrown.

Once you are able to get a stack trace with these two changes in place, it 
should help us debug this better.
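The two suggestions can be sketched as follows. This is a simplified stand-in for JavaUtils.timeStringAs, not the real parser: quoting the input makes a trailing non-breaking space visible in the message (String.trim() does not strip \u00A0, since it only removes characters up to U+0020), and passing the NumberFormatException as the cause keeps it in the stack trace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified time-string parser demonstrating the diagnostic improvements.
class TimeParseSketch {
    private static final Pattern TIME = Pattern.compile("(-?[0-9]+)([a-z]+)?");

    static long parseSeconds(String str) {
        String lower = str.toLowerCase().trim();  // trim() does NOT remove \u00A0
        try {
            Matcher m = TIME.matcher(lower);
            if (!m.matches()) throw new NumberFormatException("no match");
            return Long.parseLong(m.group(1));
        } catch (NumberFormatException e) {
            // Quoting "lower" exposes hidden characters such as a trailing
            // non-breaking space; passing 'e' preserves the original cause.
            throw new RuntimeException(
                "Failed to parse time string: \"" + lower + "\"", e);
        }
    }
}
```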


was (Author: mridulm80):
In order to validate, I would suggest two things.
Wrap the input str, within quote, in the exception message ... for example, 
"120s\u00A0" will look like 120s in the exception message as it is a unicode 
non breaking space.
The other would be to include the NumberFormatException 'e' as the cause in the 
exception being thrown.

Once you are able to get a stack trace with these two change in place, it 
should help us debug this better.

> Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate 
> time string
> -
>
> Key: SPARK-47759
> URL: https://issues.apache.org/jira/browse/SPARK-47759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, pull-request-available, stuck, threadsafe
> Fix For: 3.5.0, 4.0.0, 3.5.1, 3.5.2
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> It's observed that our Spark apps occasionally got stuck with an unexpected 
> stack trace when reading/parsing a legitimate time string. Note that we 
> manually killed the stuck app instances and the retry goes thru on the same 
> cluster (without requiring any app code change).
>  
> *[Stack Trace 1]* The stack trace doesn't make sense since *120s* is a 
> legitimate time string, where the app runs on emr-7.0.0 with Spark 3.5.0 
> runtime.
> {code:java}
> Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
> must be specified as seconds (s), milliseconds (ms), microseconds (us), 
> minutes (m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
> Failed to parse time string: 120s
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
> at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
> at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
> at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
> at org.apache.spark.rpc.RpcEndpointRef.(RpcEndpointRef.scala:33)
> at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.(NettyRpcEnv.scala:533)
> at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at 
> 

[jira] [Comment Edited] (SPARK-47759) Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate time string

2024-04-09 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835587#comment-17835587
 ] 

Mridul Muralidharan edited comment on SPARK-47759 at 4/10/24 4:08 AM:
--

In order to validate, I would suggest two things.
Wrap the input str in quotes in the exception message ... for example, 
"120s\u00A0" will look like 120s in the exception message, as it ends in a 
Unicode non-breaking space.
The other would be to include the NumberFormatException 'e' as the cause in the 
exception being thrown.

Once you are able to get a stack trace with these two changes in place, it 
should help us debug this better.


was (Author: mridulm80):
In order to validate, I would suggest two things.
Wrap the input str, within quote, to the exception ... for example, 
"120s\u00A0" will look like 120s in the exception message as it is a unicode 
non breaking space.
The other would be to include the NumberFormatException 'e' as the cause in the 
exception being thrown.

Once you are able to get a stack trace with these two change in place, it 
should help us debug this better.

> Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate 
> time string
> -
>
> Key: SPARK-47759
> URL: https://issues.apache.org/jira/browse/SPARK-47759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, pull-request-available, stuck, threadsafe
> Fix For: 3.5.0, 4.0.0, 3.5.1, 3.5.2
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> It's observed that our Spark apps occasionally got stuck with an unexpected 
> stack trace when reading/parsing a legitimate time string. Note that we 
> manually killed the stuck app instances and the retry goes thru on the same 
> cluster (without requiring any app code change).
>  
> *[Stack Trace 1]* The stack trace doesn't make sense since *120s* is a 
> legitimate time string, where the app runs on emr-7.0.0 with Spark 3.5.0 
> runtime.
> {code:java}
> Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
> must be specified as seconds (s), milliseconds (ms), microseconds (us), 
> minutes (m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
> Failed to parse time string: 120s
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
> at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
> at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
> at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
> at org.apache.spark.rpc.RpcEndpointRef.(RpcEndpointRef.scala:33)
> at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.(NettyRpcEnv.scala:533)
> at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at 
> 

[jira] [Commented] (SPARK-47759) Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate time string

2024-04-09 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835587#comment-17835587
 ] 

Mridul Muralidharan commented on SPARK-47759:
-

In order to validate, I would suggest two things.
Wrap the input str in quotes in the exception message ... for example, 
"120s\u00A0" will look like 120s in the exception message, as it ends in a 
Unicode non-breaking space.
The other would be to include the NumberFormatException 'e' as the cause in the 
exception being thrown.

Once you are able to get a stack trace with these two changes in place, it 
should help us debug this better.

> Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate 
> time string
> -
>
> Key: SPARK-47759
> URL: https://issues.apache.org/jira/browse/SPARK-47759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, pull-request-available, stuck, threadsafe
> Fix For: 3.5.0, 4.0.0, 3.5.1, 3.5.2
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> It's observed that our Spark apps occasionally got stuck with an unexpected 
> stack trace when reading/parsing a legitimate time string. Note that we 
> manually killed the stuck app instances and the retry goes thru on the same 
> cluster (without requiring any app code change).
>  
> *[Stack Trace 1]* The stack trace doesn't make sense since *120s* is a 
> legitimate time string, where the app runs on emr-7.0.0 with Spark 3.5.0 
> runtime.
> {code:java}
> Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
> must be specified as seconds (s), milliseconds (ms), microseconds (us), 
> minutes (m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
> Failed to parse time string: 120s
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
> at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
> at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
> at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
> at org.apache.spark.rpc.RpcEndpointRef.(RpcEndpointRef.scala:33)
> at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.(NettyRpcEnv.scala:533)
> at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> 

[jira] [Assigned] (SPARK-45375) [CORE] Mark connection as timedOut in TransportClient.close

2024-03-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45375:
---

Assignee: Hasnain Lakhani

> [CORE] Mark connection as timedOut in TransportClient.close
> ---
>
> Key: SPARK-45375
> URL: https://issues.apache.org/jira/browse/SPARK-45375
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.4.2, 4.0.0, 3.5.1, 3.3.4
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>
> Avoid a race condition where a connection which is in the process of being 
> closed could be returned by the TransportClientFactory, only to be immediately 
> closed and cause errors upon use.
>  
> This doesn't happen much in practice, but it is observed more frequently as 
> part of the efforts to add SSL support.
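The shape of the fix can be sketched as follows (class names here are illustrative, not the actual Spark networking code): the factory only hands out clients that are still active, and close() flips the validity flag *before* tearing down the connection, so a concurrent caller can no longer be handed a client that is about to die.

```java
// Sketch of a pooled client that is marked invalid at the start of close().
class PooledClient {
    private volatile boolean timedOut = false;
    private volatile boolean channelOpen = true;

    boolean isActive() { return !timedOut && channelOpen; }

    void close() {
        timedOut = true;      // mark first: the factory's isActive() check now fails
        channelOpen = false;  // then actually tear down the connection
    }
}

class ClientPoolSketch {
    /** Returns the cached client only if it is still active, else a fresh one. */
    static PooledClient getClient(PooledClient cached) {
        return (cached != null && cached.isActive()) ? cached : new PooledClient();
    }
}
```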






[jira] [Comment Edited] (SPARK-45374) [CORE] Add test keys for SSL functionality

2024-03-18 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828087#comment-17828087
 ] 

Mridul Muralidharan edited comment on SPARK-45374 at 3/18/24 8:28 PM:
--

Missed your query; you can link by:
"More" -> Link -> Web Link ->
* URL == PR URL
* Link text == "GitHub Pull Request #"

I did it for this PR; please let me know if you are unable to do it for the 
others.


was (Author: mridulm80):
Missed your query, you can link by:
"more' -> link -> Web Link ->
* URL  == pr url
* Link test == "GitHub Pull Request #"

I did it for this PR

> [CORE] Add test keys for SSL functionality
> --
>
> Key: SPARK-45374
> URL: https://issues.apache.org/jira/browse/SPARK-45374
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>
> Add test SSL keys which will be used for unit and integration tests of the 
> new SSL RPC functionality






[jira] [Assigned] (SPARK-45374) [CORE] Add test keys for SSL functionality

2024-03-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45374:
---

Assignee: Hasnain Lakhani

> [CORE] Add test keys for SSL functionality
> --
>
> Key: SPARK-45374
> URL: https://issues.apache.org/jira/browse/SPARK-45374
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>
> Add test SSL keys which will be used for unit and integration tests of the 
> new SSL RPC functionality






[jira] [Comment Edited] (SPARK-45374) [CORE] Add test keys for SSL functionality

2024-03-18 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828087#comment-17828087
 ] 

Mridul Muralidharan edited comment on SPARK-45374 at 3/18/24 8:05 PM:
--

Missed your query, you can link by:
"more' -> link -> Web Link ->
* URL  == pr url
* Link test == "GitHub Pull Request #"

I did it for this PR


was (Author: mridulm80):
Missed your query, you can link by:
"more' -> link -> Web Link ->
* URL  == pr url
* Link test == "GitHub Pull Request #"

> [CORE] Add test keys for SSL functionality
> --
>
> Key: SPARK-45374
> URL: https://issues.apache.org/jira/browse/SPARK-45374
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Priority: Major
>
> Add test SSL keys which will be used for unit and integration tests of the 
> new SSL RPC functionality






[jira] [Commented] (SPARK-45374) [CORE] Add test keys for SSL functionality

2024-03-18 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828087#comment-17828087
 ] 

Mridul Muralidharan commented on SPARK-45374:
-

Missed your query, you can link by:
"more' -> link -> Web Link ->
* URL  == pr url
* Link test == "GitHub Pull Request #"

> [CORE] Add test keys for SSL functionality
> --
>
> Key: SPARK-45374
> URL: https://issues.apache.org/jira/browse/SPARK-45374
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Priority: Major
>
> Add test SSL keys which will be used for unit and integration tests of the 
> new SSL RPC functionality






[jira] [Updated] (SPARK-47146) Possible thread leak when doing sort merge join

2024-03-05 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-47146:

Fix Version/s: 3.5.2
   3.4.3

> Possible thread leak when doing sort merge join
> ---
>
> Key: SPARK-47146
> URL: https://issues.apache.org/jira/browse/SPARK-47146
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: JacobZheng
>Assignee: JacobZheng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.3
>
>
> I have a long-running Spark job. I stumbled upon an executor taking up a lot 
> of threads, leaving no threads available on the server. Querying thread 
> details via jstack shows tons of threads named read-ahead. Checking the code 
> confirms that these threads are created by ReadAheadInputStream. This class 
> creates a single-threaded thread pool on initialization:
> {code:java}
> private final ExecutorService executorService =
> ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code}
> This thread pool is closed by ReadAheadInputStream#close(). 
> The call stack for the normal case close() method is
> {code:java}
> ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 
> in stage 71.0 (TID 
> 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230
>     @org.apache.spark.io.ReadAheadInputStream.close()
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87)
>         at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187)
>         at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>         at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:829) {code}
> As shown in UnsafeSorterSpillReader#close, the stream is only closed once all 
> the data in the stream has been read through.
> {code:java}
> @Override
> public void loadNext() throws IOException {
>   // Kill the task in case it has been marked as killed. This logic is from
>   // InterruptibleIterator, but we inline it here instead of wrapping the 
> iterator in order
>   // to avoid performance overhead. This check is 
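The leak pattern described in this issue can be sketched outside Spark with a hypothetical stream that owns its own executor (illustrative names only, not Spark's actual classes); try-with-resources guarantees the pool is shut down even when the caller abandons the stream before reading it through:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical minimal analogue of ReadAheadInputStream: each instance owns a
// single-threaded executor whose thread leaks unless close() is called.
class OwningStream extends FilterInputStream {
    final ExecutorService executor =
        Executors.newSingleThreadExecutor();  // analogue of the "read-ahead" thread

    OwningStream(InputStream in) { super(in); }

    @Override
    public void close() throws IOException {
        super.close();
        executor.shutdownNow();  // must run even if the stream was not fully read
    }
}

public class Main {
    public static void main(String[] args) throws IOException {
        OwningStream s = new OwningStream(new ByteArrayInputStream(new byte[]{1, 2, 3}));
        // try-with-resources runs close() even when only part of the data is read
        // (as happens when a sort merge join short-circuits its input).
        try (OwningStream t = s) {
            t.read();  // read a single byte, then abandon the stream
        }
        System.out.println(s.executor.isShutdown());  // true: no leaked thread
    }
}
```

The underlying point is that whoever abandons the stream early must still call close(); otherwise each spill reader leaks one "read-ahead" thread.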

[jira] [Commented] (SPARK-47146) Possible thread leak when doing sort merge join

2024-03-05 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823832#comment-17823832
 ] 

Mridul Muralidharan commented on SPARK-47146:
-

Backported to 3.5 and 3.4 in PR: https://github.com/apache/spark/pull/45390

> Possible thread leak when doing sort merge join
> ---
>
> Key: SPARK-47146
> URL: https://issues.apache.org/jira/browse/SPARK-47146
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: JacobZheng
>Assignee: JacobZheng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.3
>
>
> I have a long-running Spark job. I stumbled upon an executor taking up a lot 
> of threads, leaving no threads available on the server. Querying thread 
> details via jstack shows tons of threads named read-ahead. Checking the code 
> confirms that these threads are created by ReadAheadInputStream. This class 
> creates a single-threaded thread pool on initialization:
> {code:java}
> private final ExecutorService executorService =
> ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code}
> This thread pool is closed by ReadAheadInputStream#close(). 
> The call stack for the normal case close() method is
> {code:java}
> ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 
> in stage 71.0 (TID 
> 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230
>     @org.apache.spark.io.ReadAheadInputStream.close()
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87)
>         at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187)
>         at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>         at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:829) {code}
> As shown in UnsafeSorterSpillReader#close, the stream is only closed once all 
> the data in the stream has been read through.
> {code:java}
> @Override
> public void loadNext() throws IOException {
>   // Kill the task in case it has been marked as killed. This logic is from
>   // InterruptibleIterator, but we inline it here instead of wrapping the 
> 

[jira] [Assigned] (SPARK-47146) Possible thread leak when doing sort merge join

2024-03-04 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-47146:
---

Assignee: JacobZheng

> Possible thread leak when doing sort merge join
> ---
>
> Key: SPARK-47146
> URL: https://issues.apache.org/jira/browse/SPARK-47146
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: JacobZheng
>Assignee: JacobZheng
>Priority: Critical
>  Labels: pull-request-available
>
> I have a long-running Spark job. I stumbled upon an executor taking up a lot 
> of threads, leaving no threads available on the server. Querying thread 
> details via jstack shows tons of threads named read-ahead. Checking the code 
> confirms that these threads are created by ReadAheadInputStream. This class 
> creates a single-threaded thread pool on initialization:
> {code:java}
> private final ExecutorService executorService =
> ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code}
> This thread pool is closed by ReadAheadInputStream#close(). 
> The call stack for the normal case close() method is
> {code:java}
> ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 
> in stage 71.0 (TID 
> 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230
>     @org.apache.spark.io.ReadAheadInputStream.close()
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87)
>         at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187)
>         at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>         at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:829) {code}
> As shown in UnsafeSorterSpillReader#close, the stream is only closed once all 
> the data in the stream has been read through.
> {code:java}
> @Override
> public void loadNext() throws IOException {
>   // Kill the task in case it has been marked as killed. This logic is from
>   // InterruptibleIterator, but we inline it here instead of wrapping the 
> iterator in order
>   // to avoid performance overhead. This check is added here in `loadNext()` 
> instead of in
>   // `hasNext()` 

[jira] [Resolved] (SPARK-47146) Possible thread leak when doing sort merge join

2024-03-04 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-47146.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45327
[https://github.com/apache/spark/pull/45327]

> Possible thread leak when doing sort merge join
> ---
>
> Key: SPARK-47146
> URL: https://issues.apache.org/jira/browse/SPARK-47146
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: JacobZheng
>Assignee: JacobZheng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> I have a long-running Spark job. I stumbled upon an executor taking up a lot 
> of threads, leaving no threads available on the server. Querying thread 
> details via jstack shows tons of threads named read-ahead. Checking the code 
> confirms that these threads are created by ReadAheadInputStream. This class 
> creates a single-threaded thread pool on initialization:
> {code:java}
> private final ExecutorService executorService =
> ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code}
> This thread pool is closed by ReadAheadInputStream#close(). 
> The call stack for the normal case close() method is
> {code:java}
> ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 
> in stage 71.0 (TID 
> 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230
>     @org.apache.spark.io.ReadAheadInputStream.close()
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87)
>         at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187)
>         at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>         at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:829) {code}
> As shown in UnsafeSorterSpillReader#close, the stream is only closed once all 
> the data in the stream has been read through.
> {code:java}
> @Override
> public void loadNext() throws IOException {
>   // Kill the task in case it has been marked as killed. This logic is from
>   // InterruptibleIterator, but we inline it here instead of wrapping the 
> 

[jira] [Resolved] (SPARK-46512) Optimize shuffle reading when both sort and combine are used.

2024-02-04 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46512.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44512
[https://github.com/apache/spark/pull/44512]

> Optimize shuffle reading when both sort and combine are used.
> -
>
> Key: SPARK-46512
> URL: https://issues.apache.org/jira/browse/SPARK-46512
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 4.0.0
>Reporter: Chenyu Zheng
>Assignee: Chenyu Zheng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> After the shuffle reader obtains the block, it will first perform a combine 
> operation, and then perform a sort operation. It is known that both combine 
> and sort may generate temporary files, so the performance may be poor when 
> both sort and combine are used. In fact, combine operations can be performed 
> during the sort process, and we can avoid the combine spill file.
>  
> I did not find any direct API to construct a shuffle in which both sort and 
> combine are used, but it can be done with the following code: a word count 
> whose output words are sorted.
> {code:java}
> sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
> reduceByKey(_ + _, 1).
> asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String).
> collect().foreach(println) {code}
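The "combine during sort" idea can be illustrated in plain Java (a toy sketch, unrelated to Spark's actual shuffle code): inserting into a sorted map and summing on key collision both combines and orders in a single pass, with no separate combine stage and hence no combine spill:

```java
import java.util.Map;
import java.util.TreeMap;

public class Main {
    public static void main(String[] args) {
        // Word count with sorted output, combining (summing) while inserting
        // into a key-ordered structure instead of combining in a prior pass.
        String[] words = "b a b c a b".split(" ");
        Map<String, Integer> counts = new TreeMap<>();  // keeps keys sorted
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);  // combine during insertion
        }
        System.out.println(counts);  // {a=2, b=3, c=1}
    }
}
```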






[jira] [Assigned] (SPARK-46512) Optimize shuffle reading when both sort and combine are used.

2024-02-04 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-46512:
---

Assignee: Chenyu Zheng

> Optimize shuffle reading when both sort and combine are used.
> -
>
> Key: SPARK-46512
> URL: https://issues.apache.org/jira/browse/SPARK-46512
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 4.0.0
>Reporter: Chenyu Zheng
>Assignee: Chenyu Zheng
>Priority: Minor
>  Labels: pull-request-available
>
> After the shuffle reader obtains the block, it will first perform a combine 
> operation, and then perform a sort operation. It is known that both combine 
> and sort may generate temporary files, so the performance may be poor when 
> both sort and combine are used. In fact, combine operations can be performed 
> during the sort process, and we can avoid the combine spill file.
>  
> I did not find any direct API to construct a shuffle in which both sort and 
> combine are used, but it can be done with the following code: a word count 
> whose output words are sorted.
> {code:java}
> sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
> reduceByKey(_ + _, 1).
> asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String).
> collect().foreach(println) {code}






[jira] [Assigned] (SPARK-46733) Simplify the ContextCleaner|BlockManager by the exit operation only depend on interrupt thread

2024-01-23 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-46733:
---

Assignee: Jiaan Geng

> Simplify the ContextCleaner|BlockManager by the exit operation only depend on 
> interrupt thread
> --
>
> Key: SPARK-46733
> URL: https://issues.apache.org/jira/browse/SPARK-46733
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-46733) Simplify the ContextCleaner|BlockManager by the exit operation only depend on interrupt thread

2024-01-23 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46733.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44732
[https://github.com/apache/spark/pull/44732]

> Simplify the ContextCleaner|BlockManager by the exit operation only depend on 
> interrupt thread
> --
>
> Key: SPARK-46733
> URL: https://issues.apache.org/jira/browse/SPARK-46733
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Commented] (SPARK-46623) Replace SimpleDateFormat with DateTimeFormatter

2024-01-18 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808080#comment-17808080
 ] 

Mridul Muralidharan commented on SPARK-46623:
-

Issue resolved by pull request 44616
https://github.com/apache/spark/pull/44616

> Replace SimpleDateFormat with DateTimeFormatter
> ---
>
> Key: SPARK-46623
> URL: https://issues.apache.org/jira/browse/SPARK-46623
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-46623) Replace SimpleDateFormat with DateTimeFormatter

2024-01-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46623.
-
Fix Version/s: 4.0.0
 Assignee: Jiaan Geng
   Resolution: Fixed

> Replace SimpleDateFormat with DateTimeFormatter
> ---
>
> Key: SPARK-46623
> URL: https://issues.apache.org/jira/browse/SPARK-46623
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-46696) In ResourceProfileManager, function calls should occur after variable declarations.

2024-01-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46696.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44705
[https://github.com/apache/spark/pull/44705]

> In ResourceProfileManager, function calls should occur after variable 
> declarations.
> ---
>
> Key: SPARK-46696
> URL: https://issues.apache.org/jira/browse/SPARK-46696
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: liangyongyuan
>Assignee: liangyongyuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> As the title suggests, in *ResourceProfileManager*, function calls should be 
> made after variable declarations. When determining *isSupport*, all variables 
> are uninitialized, with booleans defaulting to false and objects to null. 
> While the end result is correct, the evaluation process is abnormal.
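The hazard being fixed can be sketched in plain Java (hypothetical field names, not the actual ResourceProfileManager members): instance fields initialize in declaration order, so a method invoked by an earlier initializer observes later fields at their JVM defaults (false for booleans, null for objects):

```java
public class Main {
    static class Manager {
        // check() runs first, before dynamicEnabled has been assigned.
        private final boolean supported = check();
        private final boolean dynamicEnabled = true;

        private boolean check() {
            // dynamicEnabled is still false (the default) at this point
            return dynamicEnabled;
        }

        boolean isSupported() { return supported; }
    }

    public static void main(String[] args) {
        System.out.println(new Manager().isSupported());  // false, not true
    }
}
```

Declaring (and assigning) all variables before any function call that reads them avoids this class of surprise.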






[jira] [Assigned] (SPARK-46696) In ResourceProfileManager, function calls should occur after variable declarations.

2024-01-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-46696:
---

Assignee: liangyongyuan

> In ResourceProfileManager, function calls should occur after variable 
> declarations.
> ---
>
> Key: SPARK-46696
> URL: https://issues.apache.org/jira/browse/SPARK-46696
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: liangyongyuan
>Assignee: liangyongyuan
>Priority: Major
>  Labels: pull-request-available
>
> As the title suggests, in *ResourceProfileManager*, function calls should be 
> made after variable declarations. When determining *isSupport*, all variables 
> are uninitialized, with booleans defaulting to false and objects to null. 
> While the end result is correct, the evaluation process is abnormal.






[jira] [Assigned] (SPARK-46399) Add exit status to the Application End event for the use of Spark Listener

2023-12-20 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-46399:
---

Assignee: Reza Safi

> Add exit status to the Application End event for the use of Spark Listener
> --
>
> Key: SPARK-46399
> URL: https://issues.apache.org/jira/browse/SPARK-46399
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Reza Safi
>Assignee: Reza Safi
>Priority: Minor
>  Labels: pull-request-available
>
> Currently SparkListenerApplicationEnd only has a timestamp value and there is
> no exit status recorded with it.






[jira] [Resolved] (SPARK-46399) Add exit status to the Application End event for the use of Spark Listener

2023-12-20 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46399.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44340
[https://github.com/apache/spark/pull/44340]

> Add exit status to the Application End event for the use of Spark Listener
> --
>
> Key: SPARK-46399
> URL: https://issues.apache.org/jira/browse/SPARK-46399
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Reza Safi
>Assignee: Reza Safi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently SparkListenerApplicationEnd only has a timestamp value and there is
> no exit status recorded with it.






[jira] [Resolved] (SPARK-46132) [CORE] Support key password for JKS keys for RPC SSL

2023-12-12 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46132.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44264
[https://github.com/apache/spark/pull/44264]

> [CORE] Support key password for JKS keys for RPC SSL
> 
>
> Key: SPARK-46132
> URL: https://issues.apache.org/jira/browse/SPARK-46132
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> See thread at 
> https://github.com/apache/spark/pull/43998#discussion_r1406993411






[jira] [Assigned] (SPARK-46132) [CORE] Support key password for JKS keys for RPC SSL

2023-12-12 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-46132:
---

Assignee: Hasnain Lakhani

> [CORE] Support key password for JKS keys for RPC SSL
> 
>
> Key: SPARK-46132
> URL: https://issues.apache.org/jira/browse/SPARK-46132
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> See thread at 
> https://github.com/apache/spark/pull/43998#discussion_r1406993411






[jira] [Updated] (SPARK-46058) [CORE] Add separate flag for privateKeyPassword

2023-12-06 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-46058:

Labels: pull-request-available  (was: pull-request-available releasenotes)

> [CORE] Add separate flag for privateKeyPassword
> ---
>
> Key: SPARK-46058
> URL: https://issues.apache.org/jira/browse/SPARK-46058
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Right now with config inheritance we support:
>  * JKS with password A, PEM with password B
>  * JKS with no password, PEM with password A
>  * JKS and PEM with no password
>  
> But we do not support the case where JKS has a password and PEM does not. If 
> we set keyPassword we will attempt to use it, and cannot set 
> `spark.ssl.rpc.keyPassword` to null. So let's make it a separate flag as the 
> easiest workaround.
>  
> This was noticed while migrating some existing deployments to the RPC SSL 
> support where we use openssl support for RPC and use a key with no password
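The previously unsupported combination (passworded JKS alongside an unencrypted PEM key) can be written out once the flags are separate. This is a hypothetical sketch only: paths and passwords are placeholders, and the property names assume the `spark.ssl.rpc.*` namespace used elsewhere in this feature:

```properties
spark.ssl.rpc.enabled              true
# JKS side: keystore and key both protected by a password
spark.ssl.rpc.keyStore             /etc/spark/ssl/keystore.jks
spark.ssl.rpc.keyStorePassword     changeit
spark.ssl.rpc.keyPassword          changeit
# PEM side: unencrypted key, so the separate privateKeyPassword flag is
# simply left unset instead of inheriting keyPassword
spark.ssl.rpc.privateKey           /etc/spark/ssl/key.pem
spark.ssl.rpc.certChain            /etc/spark/ssl/cert.pem
```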






[jira] [Updated] (SPARK-46058) [CORE] Add separate flag for privateKeyPassword

2023-12-06 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-46058:

Labels: pull-request-available releasenotes  (was: pull-request-available)

> [CORE] Add separate flag for privateKeyPassword
> ---
>
> Key: SPARK-46058
> URL: https://issues.apache.org/jira/browse/SPARK-46058
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available, releasenotes
> Fix For: 4.0.0
>
>
> Right now with config inheritance we support:
>  * JKS with password A, PEM with password B
>  * JKS with no password, PEM with password A
>  * JKS and PEM with no password
>  
> But we do not support the case where JKS has a password and PEM does not. If 
> we set keyPassword we will attempt to use it, and cannot set 
> `spark.ssl.rpc.keyPassword` to null. So let's make it a separate flag as the 
> easiest workaround.
>  
> This was noticed while migrating some existing deployments to the RPC SSL 
> support where we use openssl support for RPC and use a key with no password






[jira] [Resolved] (SPARK-46058) [CORE] Add separate flag for privateKeyPassword

2023-12-06 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46058.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43998
[https://github.com/apache/spark/pull/43998]

> [CORE] Add separate flag for privateKeyPassword
> ---
>
> Key: SPARK-46058
> URL: https://issues.apache.org/jira/browse/SPARK-46058
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Right now with config inheritance we support:
>  * JKS with password A, PEM with password B
>  * JKS with no password, PEM with password A
>  * JKS and PEM with no password
>  
> But we do not support the case where JKS has a password and PEM does not. If 
> we set keyPassword we will attempt to use it, and cannot set 
> `spark.ssl.rpc.keyPassword` to null. So let's make it a separate flag as the 
> easiest workaround.
>  
> This was noticed while migrating some existing deployments to the RPC SSL 
> support where we use openssl support for RPC and use a key with no password






[jira] [Assigned] (SPARK-46058) [CORE] Add separate flag for privateKeyPassword

2023-12-06 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-46058:
---

Assignee: Hasnain Lakhani

> [CORE] Add separate flag for privateKeyPassword
> ---
>
> Key: SPARK-46058
> URL: https://issues.apache.org/jira/browse/SPARK-46058
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> Right now with config inheritance we support:
>  * JKS with password A, PEM with password B
>  * JKS with no password, PEM with password A
>  * JKS and PEM with no password
>  
> But we do not support the case where JKS has a password and PEM does not. If 
> we set keyPassword we will attempt to use it, and cannot set 
> `spark.ssl.rpc.keyPassword` to null. So let's make it a separate flag as the 
> easiest workaround.
>  
> This was noticed while migrating some existing deployments to the RPC SSL 
> support where we use openssl support for RPC and use a key with no password






[jira] [Assigned] (SPARK-45762) Shuffle managers defined in user jars are not available for some launch modes

2023-11-16 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45762:
---

Assignee: Alessandro Bellina

> Shuffle managers defined in user jars are not available for some launch modes
> -
>
> Key: SPARK-45762
> URL: https://issues.apache.org/jira/browse/SPARK-45762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Alessandro Bellina
>Assignee: Alessandro Bellina
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Starting a Spark job in standalone mode with a custom `ShuffleManager`
> provided in a jar via `--jars` does not work. This can also be experienced in 
> local-cluster mode.
> The approach that works consistently is to copy the jar containing the custom 
> `ShuffleManager` to a specific location in each node then add it to 
> `spark.driver.extraClassPath` and `spark.executor.extraClassPath`, but we 
> would like to move away from setting extra configurations unnecessarily.
> Example:
> {code:java}
> $SPARK_HOME/bin/spark-shell \
>   --master spark://127.0.0.1:7077 \
>   --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \
>   --jars user-code.jar
> {code}
> This yields `java.lang.ClassNotFoundException` in the executors.
> {code:java}
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1915)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:436)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:425)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.examples.TestShuffleManager
>   at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
>   at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
>   at java.base/java.lang.Class.forName0(Native Method)
>   at java.base/java.lang.Class.forName(Class.java:467)
>   at 
> org.apache.spark.util.SparkClassUtils.classForName(SparkClassUtils.scala:41)
>   at 
> org.apache.spark.util.SparkClassUtils.classForName$(SparkClassUtils.scala:36)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:95)
>   at 
> org.apache.spark.util.Utils$.instantiateSerializerOrShuffleManager(Utils.scala:2574)
>   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:366)
>   at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:255)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:487)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
>   at 
> java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>   ... 4 more
> {code}
> We can change our command to use `extraClassPath`:
> {code:java}
> $SPARK_HOME/bin/spark-shell \
>   --master spark://127.0.0.1:7077 \
>   --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \
>   --conf spark.driver.extraClassPath=user-code.jar \
>  --conf spark.executor.extraClassPath=user-code.jar
> {code}
> Success after adding the jar to `extraClassPath`:
> {code:java}
> 23/10/26 12:58:26 INFO TransportClientFactory: Successfully created 
> connection to localhost/127.0.0.1:33053 after 7 ms (0 ms spent in bootstraps)
> 23/10/26 12:58:26 WARN TestShuffleManager: Instantiated TestShuffleManager!!
> 23/10/26 12:58:26 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-cb101b05-c4b7-4ba9-8b3d-5b23baa7cb46/executor-5d5335dd-c116-4211-9691-87d8566017fd/blockmgr-2fcb1ab2-d886--8c7f-9dca2c880c2c
> {code}
> We would like to change startup order such that the original command 
> succeeds, without specifying `extraClassPath`:
> {code:java}
> $SPARK_HOME/bin/spark-shell \
>   --master spark://127.0.0.1:7077 \
>   --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \
>   --jars user-code.jar
> {code}
> Proposed changes:
> Refactor code so we initialize the `ShuffleManager` later, after jars have 
> been 
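The exception above is purely a classloader-visibility problem: the ShuffleManager is resolved against a loader that does not yet contain the `--jars` entries. A small standalone Java sketch of the same mechanism (the class name is the hypothetical one from the example above):

```java
public class LoaderVisibilitySketch {
    // True if 'name' is resolvable through 'loader', false on
    // ClassNotFoundException -- the check the executor effectively performs.
    static boolean canLoad(String name, ClassLoader loader) {
        try {
            Class.forName(name, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader base = LoaderVisibilitySketch.class.getClassLoader();
        // Not on the system classpath, so eager resolution fails...
        System.out.println(canLoad("org.apache.spark.examples.TestShuffleManager", base));
        // ...while anything already visible to the loader resolves fine.
        System.out.println(canLoad("java.lang.String", base));
    }
}
```

Deferring the lookup until user jars are visible to the executor's classloader, which is the ordering change this issue proposes, would let the first lookup succeed without `extraClassPath`.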

[jira] [Resolved] (SPARK-45762) Shuffle managers defined in user jars are not available for some launch modes

2023-11-16 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45762.
-
Resolution: Fixed

Issue resolved by pull request 43627
[https://github.com/apache/spark/pull/43627]

> Shuffle managers defined in user jars are not available for some launch modes
> -
>
> Key: SPARK-45762
> URL: https://issues.apache.org/jira/browse/SPARK-45762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Alessandro Bellina
>Assignee: Alessandro Bellina
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Starting a Spark job in standalone mode with a custom `ShuffleManager`
> provided in a jar via `--jars` does not work. This can also be experienced in 
> local-cluster mode.
> The approach that works consistently is to copy the jar containing the custom 
> `ShuffleManager` to a specific location in each node then add it to 
> `spark.driver.extraClassPath` and `spark.executor.extraClassPath`, but we 
> would like to move away from setting extra configurations unnecessarily.
> Example:
> {code:java}
> $SPARK_HOME/bin/spark-shell \
>   --master spark://127.0.0.1:7077 \
>   --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \
>   --jars user-code.jar
> {code}
> This yields `java.lang.ClassNotFoundException` in the executors.
> {code:java}
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1915)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:436)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:425)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.examples.TestShuffleManager
>   at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
>   at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
>   at java.base/java.lang.Class.forName0(Native Method)
>   at java.base/java.lang.Class.forName(Class.java:467)
>   at 
> org.apache.spark.util.SparkClassUtils.classForName(SparkClassUtils.scala:41)
>   at 
> org.apache.spark.util.SparkClassUtils.classForName$(SparkClassUtils.scala:36)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:95)
>   at 
> org.apache.spark.util.Utils$.instantiateSerializerOrShuffleManager(Utils.scala:2574)
>   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:366)
>   at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:255)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:487)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
>   at 
> java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>   ... 4 more
> {code}
> We can change our command to use `extraClassPath`:
> {code:java}
> $SPARK_HOME/bin/spark-shell \
>   --master spark://127.0.0.1:7077 \
>   --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \
>   --conf spark.driver.extraClassPath=user-code.jar \
>  --conf spark.executor.extraClassPath=user-code.jar
> {code}
> Success after adding the jar to `extraClassPath`:
> {code:java}
> 23/10/26 12:58:26 INFO TransportClientFactory: Successfully created 
> connection to localhost/127.0.0.1:33053 after 7 ms (0 ms spent in bootstraps)
> 23/10/26 12:58:26 WARN TestShuffleManager: Instantiated TestShuffleManager!!
> 23/10/26 12:58:26 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-cb101b05-c4b7-4ba9-8b3d-5b23baa7cb46/executor-5d5335dd-c116-4211-9691-87d8566017fd/blockmgr-2fcb1ab2-d886--8c7f-9dca2c880c2c
> {code}
> We would like to change startup order such that the original command 
> succeeds, without specifying `extraClassPath`:
> {code:java}
> $SPARK_HOME/bin/spark-shell \
>   --master spark://127.0.0.1:7077 \
>   --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \
>   --jars user-code.jar
> {code}
> Proposed changes:
> Refactor code so we 

[jira] [Comment Edited] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2023-11-14 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786077#comment-17786077
 ] 

Mridul Muralidharan edited comment on SPARK-30602 at 11/14/23 8:50 PM:
---

[~MasterDDT], there are currently no active plans to add support for it.


was (Author: mridulm80):
@MasterDDT, currently there is no active plans to add support for it.

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Min Shen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.2.0
>
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_magnet_final.pdf
>
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When doing Spark on YARN for a large-scale deployment, people 
> usually enable Spark external shuffle service and store the intermediate 
> shuffle files on HDD. Because the number of blocks generated for a particular 
> shuffle grows quadratically compared to the size of shuffled data (# mappers 
> and reducers grow linearly with the size of shuffled data, but # blocks is #
> mappers * # reducers), one general trend we have observed is that the more 
> data a Spark application processes, the smaller the block size becomes. In a 
> few production clusters we have seen, the average shuffle block size is only 
> 10s of KBs. Because of the inefficiency of performing random reads on HDD for 
> small amounts of data, the overall efficiency of the Spark external shuffle
> services serving the shuffle blocks degrades as we see an increasing # of 
> Spark applications processing an increasing amount of data. In addition, 
> because Spark external shuffle service is a shared service in a multi-tenancy 
> cluster, the inefficiency with one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> the above-mentioned environments with push-based shuffle. With push-based
> shuffle, shuffle is performed at the end of mappers and blocks get pre-merged 
> and move towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark’s existing 
> shuffle netty protocol, and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle in 
> Spark without incurring the dependency or overhead of either specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html]
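The quadratic growth claim can be checked with back-of-the-envelope arithmetic. In the sketch below the tasks-per-gigabyte ratio is an assumed illustrative constant, not a figure from this ticket: when mappers and reducers each scale linearly with the input, block count scales with its square, so a 10x larger shuffle produces blocks that are on average 10x smaller.

```java
public class ShuffleBlockMath {
    // Average shuffle block size when #mappers and #reducers each grow
    // linearly with input size; #blocks = #mappers * #reducers.
    static double avgBlockBytes(double dataBytes, double tasksPerGb) {
        long tasks = Math.max(1, (long) (dataBytes / 1e9 * tasksPerGb));
        long blocks = tasks * tasks;
        return dataBytes / blocks;
    }

    public static void main(String[] args) {
        // 100 GB shuffled by 100 mappers x 100 reducers: ~10 MB blocks
        System.out.println(avgBlockBytes(100e9, 1.0));
        // 1 TB shuffled by 1000 x 1000: ~1 MB blocks; real clusters sit much
        // further along this curve, hence the observed "10s of KBs"
        System.out.println(avgBlockBytes(1000e9, 1.0));
    }
}
```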






[jira] [Commented] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2023-11-14 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786077#comment-17786077
 ] 

Mridul Muralidharan commented on SPARK-30602:
-

@MasterDDT, there are currently no active plans to add support for it.

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Min Shen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.2.0
>
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_magnet_final.pdf
>
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When doing Spark on YARN for a large-scale deployment, people 
> usually enable Spark external shuffle service and store the intermediate 
> shuffle files on HDD. Because the number of blocks generated for a particular 
> shuffle grows quadratically compared to the size of shuffled data (# mappers 
> and reducers grow linearly with the size of shuffled data, but # blocks is #
> mappers * # reducers), one general trend we have observed is that the more 
> data a Spark application processes, the smaller the block size becomes. In a 
> few production clusters we have seen, the average shuffle block size is only 
> 10s of KBs. Because of the inefficiency of performing random reads on HDD for 
> small amounts of data, the overall efficiency of the Spark external shuffle
> services serving the shuffle blocks degrades as we see an increasing # of 
> Spark applications processing an increasing amount of data. In addition, 
> because Spark external shuffle service is a shared service in a multi-tenancy 
> cluster, the inefficiency with one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> the above-mentioned environments with push-based shuffle. With push-based
> shuffle, shuffle is performed at the end of mappers and blocks get pre-merged 
> and move towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark’s existing 
> shuffle netty protocol, and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle in 
> Spark without incurring the dependency or overhead of either specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html]






[jira] [Commented] (SPARK-44937) Add SSL/TLS support for RPC and Shuffle communications

2023-11-14 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786076#comment-17786076
 ] 

Mridul Muralidharan commented on SPARK-44937:
-

Thanks [~dongjoon] ! Missed out on that :-)

> Add SSL/TLS support for RPC and Shuffle communications
> --
>
> Key: SPARK-44937
> URL: https://issues.apache.org/jira/browse/SPARK-44937
> Project: Spark
>  Issue Type: Epic
>  Components: Block Manager, Security, Shuffle, Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: releasenotes
> Fix For: 4.0.0
>
>
> Add support for SSL/TLS based communication for Spark RPCs and block 
> transfers - providing an alternative to the existing encryption / 
> authentication implementation documented at 
> [https://spark.apache.org/docs/latest/security.html#spark-rpc-communication-protocol-between-spark-processes]
> This is a superset of the functionality discussed in 
> https://issues.apache.org/jira/browse/SPARK-6373






[jira] [Assigned] (SPARK-45911) [CORE] Make TLS1.3 the default for RPC SSL support

2023-11-14 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45911:
---

Assignee: Hasnain Lakhani

> [CORE] Make TLS1.3 the default for RPC SSL support
> --
>
> Key: SPARK-45911
> URL: https://issues.apache.org/jira/browse/SPARK-45911
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> Make TLS1.3 the default to improve security
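On JDK 11 and later, TLS 1.3 is available from the default JSSE provider, so defaulting to it amounts to requesting that protocol when the SSLContext is built. A generic JSSE sketch, not Spark's actual SSLFactory code:

```java
import javax.net.ssl.SSLContext;

public class Tls13Sketch {
    // Build an SSLContext pinned to TLS 1.3; throws
    // NoSuchAlgorithmException on JVMs that lack TLS 1.3 support.
    static String protocolOf() throws Exception {
        SSLContext ctx = SSLContext.getInstance("TLSv1.3");
        ctx.init(null, null, null); // default key managers, trust managers, RNG
        return ctx.getProtocol();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(protocolOf()); // "TLSv1.3"
    }
}
```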






[jira] [Resolved] (SPARK-45911) [CORE] Make TLS1.3 the default for RPC SSL support

2023-11-14 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45911.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43803
[https://github.com/apache/spark/pull/43803]

> [CORE] Make TLS1.3 the default for RPC SSL support
> --
>
> Key: SPARK-45911
> URL: https://issues.apache.org/jira/browse/SPARK-45911
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Make TLS1.3 the default to improve security






[jira] [Updated] (SPARK-44937) [umbrella] Add SSL/TLS support for RPC and Shuffle communications

2023-11-07 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-44937:

Epic Name: ssl/tls for rpc and shuffle

> [umbrella] Add SSL/TLS support for RPC and Shuffle communications
> -
>
> Key: SPARK-44937
> URL: https://issues.apache.org/jira/browse/SPARK-44937
> Project: Spark
>  Issue Type: Epic
>  Components: Block Manager, Security, Shuffle, Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add support for SSL/TLS based communication for Spark RPCs and block 
> transfers - providing an alternative to the existing encryption / 
> authentication implementation documented at 
> [https://spark.apache.org/docs/latest/security.html#spark-rpc-communication-protocol-between-spark-processes]
> This is a superset of the functionality discussed in 
> https://issues.apache.org/jira/browse/SPARK-6373






[jira] [Resolved] (SPARK-44937) [umbrella] Add SSL/TLS support for RPC and Shuffle communications

2023-11-07 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-44937.
-
Fix Version/s: (was: 3.4.2)
   (was: 3.5.1)
   (was: 3.3.4)
   Resolution: Fixed

The SSL feature itself landed in 4.0, so marking the feature as available in
4.0 and removing 3.4.2, 3.5.1, and 3.3.4 from the fix versions (some of the
linked PRs were merged to older versions, and unfortunately this umbrella Jira
got reused for them).


> [umbrella] Add SSL/TLS support for RPC and Shuffle communications
> -
>
> Key: SPARK-44937
> URL: https://issues.apache.org/jira/browse/SPARK-44937
> Project: Spark
>  Issue Type: Epic
>  Components: Block Manager, Security, Shuffle, Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add support for SSL/TLS based communication for Spark RPCs and block 
> transfers - providing an alternative to the existing encryption / 
> authentication implementation documented at 
> [https://spark.apache.org/docs/latest/security.html#spark-rpc-communication-protocol-between-spark-processes]
> This is a superset of the functionality discussed in 
> https://issues.apache.org/jira/browse/SPARK-6373






[jira] [Resolved] (SPARK-45431) [DOCS] Document SSL RPC feature

2023-11-07 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45431.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43240
[https://github.com/apache/spark/pull/43240]

> [DOCS] Document SSL RPC feature
> ---
>
> Key: SPARK-45431
> URL: https://issues.apache.org/jira/browse/SPARK-45431
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add documentation for users






[jira] [Assigned] (SPARK-45431) [DOCS] Document SSL RPC feature

2023-11-07 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45431:
---

Assignee: Hasnain Lakhani

> [DOCS] Document SSL RPC feature
> ---
>
> Key: SPARK-45431
> URL: https://issues.apache.org/jira/browse/SPARK-45431
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> Add documentation for users






[jira] [Assigned] (SPARK-45730) [CORE] Make ReloadingX509TrustManagerSuite less flaky

2023-11-02 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45730:
---

Assignee: Hasnain Lakhani

> [CORE] Make ReloadingX509TrustManagerSuite less flaky
> -
>
> Key: SPARK-45730
> URL: https://issues.apache.org/jira/browse/SPARK-45730
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-45730) [CORE] Make ReloadingX509TrustManagerSuite less flaky

2023-11-02 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45730.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43596
[https://github.com/apache/spark/pull/43596]

> [CORE] Make ReloadingX509TrustManagerSuite less flaky
> -
>
> Key: SPARK-45730
> URL: https://issues.apache.org/jira/browse/SPARK-45730
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-45544) [CORE] Integrate SSL support into TransportContext

2023-10-29 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45544.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43541
[https://github.com/apache/spark/pull/43541]

> [CORE] Integrate SSL support into TransportContext
> --
>
> Key: SPARK-45544
> URL: https://issues.apache.org/jira/browse/SPARK-45544
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Integrate the SSL support into TransportContext so that Spark can use RPC SSL 
> support






[jira] [Assigned] (SPARK-45544) [CORE] Integrate SSL support into TransportContext

2023-10-29 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45544:
---

Assignee: Hasnain Lakhani

> [CORE] Integrate SSL support into TransportContext
> --
>
> Key: SPARK-45544
> URL: https://issues.apache.org/jira/browse/SPARK-45544
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> Integrate the SSL support into TransportContext so that Spark can use RPC SSL 
> support






[jira] [Assigned] (SPARK-45545) [CORE] Pass SSLOptions wherever we create a SparkTransportConf

2023-10-25 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45545:
---

Assignee: Hasnain Lakhani

> [CORE] Pass SSLOptions wherever we create a SparkTransportConf
> --
>
> Key: SPARK-45545
> URL: https://issues.apache.org/jira/browse/SPARK-45545
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> This ensures that we can properly inherit SSL options and use them 
> everywhere, with tests to ensure things work.






[jira] [Resolved] (SPARK-45545) [CORE] Pass SSLOptions wherever we create a SparkTransportConf

2023-10-25 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45545.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43387
[https://github.com/apache/spark/pull/43387]

> [CORE] Pass SSLOptions wherever we create a SparkTransportConf
> --
>
> Key: SPARK-45545
> URL: https://issues.apache.org/jira/browse/SPARK-45545
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This ensures that we can properly inherit SSL options and use them 
> everywhere. And tests to ensure things will work.






[jira] [Resolved] (SPARK-45541) [CORE] Add SSLFactory

2023-10-22 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45541.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43386
[https://github.com/apache/spark/pull/43386]

> [CORE] Add SSLFactory
> -
>
> Key: SPARK-45541
> URL: https://issues.apache.org/jira/browse/SPARK-45541
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We need to add a factory to support creating SSL engines which will be used 
> to create client/server connections
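The factory idea described above can be sketched with the JDK's `javax.net.ssl` API: one shared `SSLContext` hands out `SSLEngine` instances configured for either the client or the server handshake role. This is a minimal sketch using default trust material; the class name is illustrative, and Spark's actual `SSLFactory` additionally loads configured key/trust stores and applies protocol and cipher settings.

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

// Illustrative sketch only: a real factory initializes the context from
// configured key managers and trust managers instead of defaults.
class SimpleSslEngineFactory {
    private final SSLContext context;

    SimpleSslEngineFactory() {
        try {
            SSLContext ctx = SSLContext.getInstance("TLS");
            ctx.init(null, null, null);  // default key/trust material
            this.context = ctx;
        } catch (Exception e) {
            throw new IllegalStateException("failed to initialize SSLContext", e);
        }
    }

    // Client and server connections share the context but differ in
    // handshake role, which is all setUseClientMode controls.
    SSLEngine createEngine(boolean clientMode) {
        SSLEngine engine = context.createSSLEngine();
        engine.setUseClientMode(clientMode);
        return engine;
    }
}
```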






[jira] [Created] (SPARK-45613) Expose DeterministicLevel as a DeveloperApi

2023-10-19 Thread Mridul Muralidharan (Jira)
Mridul Muralidharan created SPARK-45613:
---

 Summary: Expose DeterministicLevel as a DeveloperApi
 Key: SPARK-45613
 URL: https://issues.apache.org/jira/browse/SPARK-45613
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.0, 3.4.0, 4.0.0
Reporter: Mridul Muralidharan


{{RDD.getOutputDeterministicLevel}} is a {{DeveloperApi}} which users can 
override to specify the {{DeterministicLevel}} of the {{RDD}}.
Unfortunately, {{DeterministicLevel}} itself is {{private[spark]}}.

Expose {{DeterministicLevel}} to allow users to use this method.
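To make the intended usage concrete, here is a hedged sketch: the enum below is a local stand-in for Spark's {{DeterministicLevel}} (its value names mirror the real enum), and {{SketchRDD}} stands in for {{RDD}} to show the override users would write once the enum is public.

```java
// Local stand-in for Spark's DeterministicLevel (currently private[spark]);
// the value names mirror the real enum.
enum DeterministicLevel { DETERMINATE, UNORDERED, INDETERMINATE }

// Stand-in for RDD, reduced to the one method this issue is about.
abstract class SketchRDD {
    // Default: output is assumed stable across re-computation.
    public DeterministicLevel getOutputDeterministicLevel() {
        return DeterministicLevel.DETERMINATE;
    }
}

// A user-defined RDD whose output differs on re-computation (for example,
// random sampling) overrides the method; doing so requires the enum itself
// to be visible to user code.
class RandomSampleSketchRDD extends SketchRDD {
    @Override
    public DeterministicLevel getOutputDeterministicLevel() {
        return DeterministicLevel.INDETERMINATE;
    }
}
```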






[jira] [Resolved] (SPARK-45534) Use `java.lang.ref.Cleaner` instead of `finalize` for `RemoteBlockPushResolver`

2023-10-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45534.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43371
[https://github.com/apache/spark/pull/43371]

> Use `java.lang.ref.Cleaner` instead of `finalize` for 
> `RemoteBlockPushResolver`
> ---
>
> Key: SPARK-45534
> URL: https://issues.apache.org/jira/browse/SPARK-45534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Min Zhao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-45534) Use `java.lang.ref.Cleaner` instead of `finalize` for `RemoteBlockPushResolver`

2023-10-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45534:
---

Assignee: Min Zhao

> Use `java.lang.ref.Cleaner` instead of `finalize` for 
> `RemoteBlockPushResolver`
> ---
>
> Key: SPARK-45534
> URL: https://issues.apache.org/jira/browse/SPARK-45534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Min Zhao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-45576) [CORE] Remove unnecessary debug logs in ReloadingX509TrustManagerSuite

2023-10-17 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45576:
---

Assignee: Hasnain Lakhani

> [CORE] Remove unnecessary debug logs in ReloadingX509TrustManagerSuite
> --
>
> Key: SPARK-45576
> URL: https://issues.apache.org/jira/browse/SPARK-45576
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> These were added accidentally.






[jira] [Resolved] (SPARK-45576) [CORE] Remove unnecessary debug logs in ReloadingX509TrustManagerSuite

2023-10-17 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45576.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43404
[https://github.com/apache/spark/pull/43404]

> [CORE] Remove unnecessary debug logs in ReloadingX509TrustManagerSuite
> --
>
> Key: SPARK-45576
> URL: https://issues.apache.org/jira/browse/SPARK-45576
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> These were added accidentally.






[jira] [Resolved] (SPARK-45429) [CORE] Add helper classes for RPC SSL communication

2023-10-14 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45429.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43244
[https://github.com/apache/spark/pull/43244]

> [CORE] Add helper classes for RPC SSL communication
> ---
>
> Key: SPARK-45429
> URL: https://issues.apache.org/jira/browse/SPARK-45429
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add helper classes to handle the fact that encryption cannot work with 
> zero-copy transfers.
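The constraint can be made concrete with a sketch (the class name is illustrative, not one of Spark's actual helpers): a zero-copy path hands the kernel a file region to transfer directly, but TLS must see the plaintext bytes in memory to encrypt them, so the SSL path re-buffers the file into bounded chunks instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: Spark's helpers wrap Netty buffers, but the idea is the
// same. With TLS, file bytes must pass through memory so the SSL layer can
// encrypt them, instead of a kernel-level zero-copy (sendfile-style) transfer.
class ChunkedFileReader {
    static List<byte[]> readChunks(Path file, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[chunkSize];
            int n;
            while ((n = in.read(buf)) != -1) {
                // Each bounded chunk would be handed to the SSL layer here.
                chunks.add(Arrays.copyOf(buf, n));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    // Helper for demonstration: materialize bytes in a temp file.
    static Path tempFile(byte[] data) {
        try {
            Path p = Files.createTempFile("chunks", ".bin");
            Files.write(p, data);
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```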






[jira] [Assigned] (SPARK-45429) [CORE] Add helper classes for RPC SSL communication

2023-10-14 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45429:
---

Assignee: Hasnain Lakhani

> [CORE] Add helper classes for RPC SSL communication
> ---
>
> Key: SPARK-45429
> URL: https://issues.apache.org/jira/browse/SPARK-45429
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> Add helper classes to handle the fact that encryption cannot work with 
> zero-copy transfers.






[jira] [Resolved] (SPARK-45427) [CORE] Add RPC SSL settings to SSLOptions and SparkTransportConf

2023-10-14 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45427.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43238
[https://github.com/apache/spark/pull/43238]

> [CORE] Add RPC SSL settings to SSLOptions and SparkTransportConf
> 
>
> Key: SPARK-45427
> URL: https://issues.apache.org/jira/browse/SPARK-45427
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add the options here for RPC SSL so we can handle setting inheritance and 
> propagate the options to the callsites that need it.






[jira] [Assigned] (SPARK-45427) [CORE] Add RPC SSL settings to SSLOptions and SparkTransportConf

2023-10-14 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45427:
---

Assignee: Hasnain Lakhani

> [CORE] Add RPC SSL settings to SSLOptions and SparkTransportConf
> 
>
> Key: SPARK-45427
> URL: https://issues.apache.org/jira/browse/SPARK-45427
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> Add the options here for RPC SSL so we can handle setting inheritance and 
> propagate the options to the callsites that need it.






[jira] [Assigned] (SPARK-45426) [CORE] Add support for a ReloadingTrustManager

2023-10-13 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45426:
---

Assignee: Hasnain Lakhani

> [CORE] Add support for a ReloadingTrustManager
> --
>
> Key: SPARK-45426
> URL: https://issues.apache.org/jira/browse/SPARK-45426
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> For the RPC SSL feature, this allows us to properly reload the trust store if 
> needed at runtime.
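A hedged sketch of the reload-on-change idea (the class and helper names below are illustrative; Spark's ReloadingX509TrustManager reloads from a background polling thread rather than lazily on each call): the wrapper checks the trust store file's modification time and rebuilds its delegate X509TrustManager when it changes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import java.util.function.Supplier;
import javax.net.ssl.X509TrustManager;

// Illustrative sketch; the loader is injected so the reload mechanism can be
// shown without building a real keystore.
class ReloadingTrustManagerSketch implements X509TrustManager {
    private final Path trustStore;
    private final Supplier<X509TrustManager> loader;
    private volatile FileTime lastModified;
    private volatile X509TrustManager delegate;
    volatile int reloadCount = 0;  // exposed for demonstration

    ReloadingTrustManagerSketch(Path trustStore, Supplier<X509TrustManager> loader) {
        this.trustStore = trustStore;
        this.loader = loader;
        reload();
    }

    private synchronized void reload() {
        delegate = loader.get();  // real code rebuilds from the keystore file
        lastModified = mtime();
        reloadCount++;
    }

    private FileTime mtime() {
        try {
            return Files.getLastModifiedTime(trustStore);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reload the delegate if the backing file changed since the last load.
    private X509TrustManager current() {
        if (!mtime().equals(lastModified)) reload();
        return delegate;
    }

    @Override
    public void checkClientTrusted(X509Certificate[] chain, String authType)
            throws CertificateException {
        current().checkClientTrusted(chain, authType);
    }

    @Override
    public void checkServerTrusted(X509Certificate[] chain, String authType)
            throws CertificateException {
        current().checkServerTrusted(chain, authType);
    }

    @Override
    public X509Certificate[] getAcceptedIssuers() {
        return current().getAcceptedIssuers();
    }

    // Helpers for demonstration only.
    static X509TrustManager permissiveDelegate() {
        return new X509TrustManager() {
            public void checkClientTrusted(X509Certificate[] c, String a) {}
            public void checkServerTrusted(X509Certificate[] c, String a) {}
            public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
        };
    }

    static Path touchTempStore() {
        try {
            return Files.createTempFile("truststore", ".jks");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void bumpMtime(Path p) {
        try {
            Files.setLastModifiedTime(p, FileTime.fromMillis(System.currentTimeMillis() + 10_000));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that the delegate's getAcceptedIssuers returns an empty array rather than null, matching the X509TrustManager contract.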






[jira] [Resolved] (SPARK-45426) [CORE] Add support for a ReloadingTrustManager

2023-10-13 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45426.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43249
[https://github.com/apache/spark/pull/43249]

> [CORE] Add support for a ReloadingTrustManager
> --
>
> Key: SPARK-45426
> URL: https://issues.apache.org/jira/browse/SPARK-45426
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> For the RPC SSL feature, this allows us to properly reload the trust store if 
> needed at runtime.






[jira] [Assigned] (SPARK-45408) [CORE] Add RPC SSL settings to TransportConf

2023-10-05 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45408:
---

Assignee: Hasnain Lakhani

> [CORE] Add RPC SSL settings to TransportConf
> 
>
> Key: SPARK-45408
> URL: https://issues.apache.org/jira/browse/SPARK-45408
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> Add the SSL RPC settings to TransportConf, along with associated tests and 
> sample configs used by other tests.






[jira] [Resolved] (SPARK-45408) [CORE] Add RPC SSL settings to TransportConf

2023-10-05 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45408.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43220
[https://github.com/apache/spark/pull/43220]

> [CORE] Add RPC SSL settings to TransportConf
> 
>
> Key: SPARK-45408
> URL: https://issues.apache.org/jira/browse/SPARK-45408
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add the SSL RPC settings to TransportConf, along with associated tests and 
> sample configs used by other tests.






[jira] [Resolved] (SPARK-45283) Make StatusTrackerSuite less fragile

2023-10-03 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45283.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43194
[https://github.com/apache/spark/pull/43194]

> Make StatusTrackerSuite less fragile
> 
>
> Key: SPARK-45283
> URL: https://issues.apache.org/jira/browse/SPARK-45283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It's discovered from [Github 
> Actions|https://github.com/xiongbo-sjtu/spark/actions/runs/6270601155/job/17028788767]
>  that StatusTrackerSuite can run into random failures, as shown by the 
> following stack trace (highlighted in red).  The proposed fix is to update 
> the unit test to remove the nondeterministic behavior.
> {quote}[info] StatusTrackerSuite:
> [info] - basic status API usage (99 milliseconds)
> [info] - getJobIdsForGroup() (56 milliseconds)
> [info] - getJobIdsForGroup() with takeAsync() (48 milliseconds)
> [info] - getJobIdsForGroup() with takeAsync() across multiple partitions (58 
> milliseconds)
> [info] - getJobIdsForTag() *** FAILED *** (10 seconds, 77 milliseconds)
> {color:#ff}[info] The code passed to eventually never returned normally. 
> Attempted 651 times over 10.00505994401 seconds. Last failure message: 
> Set(3, 2, 1) was not equal to Set(1, 2). (StatusTrackerSuite.scala:148){color}
> [info] org.scalatest.exceptions.TestFailedDueToTimeoutException:
> [info] at 
> org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219)
> [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
> [info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:348)
> [info] at 
> org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:347)
> [info] at 
> org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457)
> [info] at 
> org.apache.spark.StatusTrackerSuite.$anonfun$new$21(StatusTrackerSuite.scala:148)
> [info] at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
> [info] at 
> org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)
> [info] at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)
> [info] at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)
> [info] at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
> [info] at 
> org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
> [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:227)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> [info] at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:69)
> [info] at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
> [info] at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
> [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:69)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> [info] at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> [info] at scala.collection.immutable.List.foreach(List.scala:333)
> [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
> [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
> [info] at 

[jira] [Assigned] (SPARK-45283) Make StatusTrackerSuite less fragile

2023-10-03 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45283:
---

Assignee: Bo Xiong

> Make StatusTrackerSuite less fragile
> 
>
> Key: SPARK-45283
> URL: https://issues.apache.org/jira/browse/SPARK-45283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It's discovered from [Github 
> Actions|https://github.com/xiongbo-sjtu/spark/actions/runs/6270601155/job/17028788767]
>  that StatusTrackerSuite can run into random failures, as shown by the 
> following stack trace (highlighted in red).  The proposed fix is to update 
> the unit test to remove the nondeterministic behavior.
> {quote}[info] StatusTrackerSuite:
> [info] - basic status API usage (99 milliseconds)
> [info] - getJobIdsForGroup() (56 milliseconds)
> [info] - getJobIdsForGroup() with takeAsync() (48 milliseconds)
> [info] - getJobIdsForGroup() with takeAsync() across multiple partitions (58 
> milliseconds)
> [info] - getJobIdsForTag() *** FAILED *** (10 seconds, 77 milliseconds)
> {color:#ff}[info] The code passed to eventually never returned normally. 
> Attempted 651 times over 10.00505994401 seconds. Last failure message: 
> Set(3, 2, 1) was not equal to Set(1, 2). (StatusTrackerSuite.scala:148){color}
> [info] org.scalatest.exceptions.TestFailedDueToTimeoutException:
> [info] at 
> org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219)
> [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
> [info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:348)
> [info] at 
> org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:347)
> [info] at 
> org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457)
> [info] at 
> org.apache.spark.StatusTrackerSuite.$anonfun$new$21(StatusTrackerSuite.scala:148)
> [info] at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
> [info] at 
> org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)
> [info] at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)
> [info] at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)
> [info] at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
> [info] at 
> org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
> [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:227)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> [info] at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:69)
> [info] at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
> [info] at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
> [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:69)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> [info] at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> [info] at scala.collection.immutable.List.foreach(List.scala:333)
> [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
> [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
> [info] at org.scalatest.Suite.run(Suite.scala:1114)
> [info] at org.scalatest.Suite.run$(Suite.scala:1096)
> [info] at 
> 

[jira] [Assigned] (SPARK-45376) [CORE] Add netty-tcnative-boringssl-static dependency

2023-10-03 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45376:
---

Assignee: Hasnain Lakhani

> [CORE] Add netty-tcnative-boringssl-static dependency
> -
>
> Key: SPARK-45376
> URL: https://issues.apache.org/jira/browse/SPARK-45376
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> Add the boringssl dependency, which is required for SSL functionality to work, 
> and provide the network-common test helper to other test modules that need 
> to test SSL functionality.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45376) [CORE] Add netty-tcnative-boringssl-static dependency

2023-10-03 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45376.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43164
[https://github.com/apache/spark/pull/43164]

> [CORE] Add netty-tcnative-boringssl-static dependency
> -
>
> Key: SPARK-45376
> URL: https://issues.apache.org/jira/browse/SPARK-45376
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add the boringssl dependency, which is required for SSL functionality to work, 
> and provide the network-common test helper to other test modules that need 
> to test SSL functionality.






[jira] [Assigned] (SPARK-45250) Support stage level task resource profile for yarn cluster when dynamic allocation disabled

2023-10-02 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45250:
---

Assignee: Bobby Wang

> Support stage level task resource profile for yarn cluster when dynamic 
> allocation disabled
> ---
>
> Key: SPARK-45250
> URL: https://issues.apache.org/jira/browse/SPARK-45250
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Bobby Wang
>Assignee: Bobby Wang
>Priority: Major
>  Labels: pull-request-available
>
> [https://github.com/apache/spark/pull/37268] has introduced a new feature 
> that supports stage-level scheduling of task resource profiles for standalone 
> clusters when dynamic allocation is disabled. It's a really cool feature, 
> especially for ML/DL cases; more details can be found in that PR.
>  
> The problem here is that the feature is only available for standalone 
> clusters for now, but most users would also expect it to be usable on other 
> Spark clusters like YARN and Kubernetes.
>  
> So I'm filing this issue to track this task.






[jira] [Resolved] (SPARK-45250) Support stage level task resource profile for yarn cluster when dynamic allocation disabled

2023-10-02 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45250.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43030
[https://github.com/apache/spark/pull/43030]

> Support stage level task resource profile for yarn cluster when dynamic 
> allocation disabled
> ---
>
> Key: SPARK-45250
> URL: https://issues.apache.org/jira/browse/SPARK-45250
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Bobby Wang
>Assignee: Bobby Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [https://github.com/apache/spark/pull/37268] has introduced a new feature 
> that supports stage-level scheduling of task resource profiles for standalone 
> clusters when dynamic allocation is disabled. It's a really cool feature, 
> especially for ML/DL cases; more details can be found in that PR.
>  
> The problem here is that the feature is only available for standalone 
> clusters for now, but most users would also expect it to be usable on other 
> Spark clusters like YARN and Kubernetes.
>  
> So I'm filing this issue to track this task.






[jira] [Resolved] (SPARK-45378) [CORE] Add convertToNettyForSsl to ManagedBuffer

2023-10-02 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45378.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43166
[https://github.com/apache/spark/pull/43166]

> [CORE] Add convertToNettyForSsl to ManagedBuffer
> 
>
> Key: SPARK-45378
> URL: https://issues.apache.org/jira/browse/SPARK-45378
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Since Netty's SSL support does not allow zero-copy transfers, add another 
> API to ManagedBuffer so we can get buffers in a format that works with SSL.
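The zero-copy limitation can be sketched outside Spark with plain JDK channels: a `transferTo` copy moves bytes kernel-side, where a TLS layer never sees them, while an SSL-compatible path must first read the bytes into a user-space buffer that can be encrypted. The class and method names below are illustrative, not Spark's actual API.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SslCopyPathSketch {

    // Zero-copy path: the kernel moves bytes directly between channels, so the
    // payload never enters the JVM and a TLS layer has nothing to encrypt.
    public static void zeroCopy(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE,
                     StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long pos = 0;
            long size = in.size();
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
        }
    }

    // SSL-compatible path: read chunks into a user-space buffer first; this is
    // the point where an SSL engine could encrypt each chunk before writing.
    public static void bufferedCopy(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE,
                     StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.allocate(8 * 1024);
            while (in.read(buf) != -1) {
                buf.flip();        // switch buffer to draining mode
                out.write(buf);    // a TLS layer would encrypt here
                buf.clear();
            }
        }
    }
}
```

Both paths produce identical output; only the buffered one exposes the bytes to user space, which is why an SSL-aware transfer API has to hand out buffer-backed objects instead of file regions.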






[jira] [Assigned] (SPARK-45378) [CORE] Add convertToNettyForSsl to ManagedBuffer

2023-10-02 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45378:
---

Assignee: Hasnain Lakhani

> [CORE] Add convertToNettyForSsl to ManagedBuffer
> 
>
> Key: SPARK-45378
> URL: https://issues.apache.org/jira/browse/SPARK-45378
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> Since Netty's SSL support does not allow zero-copy transfers, add another 
> API to ManagedBuffer so we can get buffers in a format that works with SSL.






[jira] [Updated] (SPARK-45227) Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an executor process randomly gets stuck

2023-09-30 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-45227:

Fix Version/s: 3.3.4

> Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an 
> executor process randomly gets stuck
> 
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, pull-request-available, 
> race-condition, stuck, threadsafe
> Fix For: 3.4.2, 4.0.0, 3.5.1, 3.3.4
>
> Attachments: hashtable1.png, hashtable2.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung for over an hour. After we 
> manually killed the executor process, the app succeeded. Note that the same 
> EMR cluster with two worker nodes was able to run the same app without any 
> issue before and after the incident.
> h2. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
> {quote}$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote} * Another task that's sent to the executor but didn't get launched 
> since the single-threaded dispatcher was stuck (presumably in an "infinite 
> loop" as explained later).
> {quote}$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote}* Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
> {quote}"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at 
> 

[jira] [Closed] (SPARK-45227) Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an executor process randomly gets stuck

2023-09-28 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan closed SPARK-45227.
---

> Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an 
> executor process randomly gets stuck
> 
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, pull-request-available, 
> race-condition, stuck, threadsafe
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
> Attachments: hashtable1.png, hashtable2.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung for over an hour. After we 
> manually killed the executor process, the app succeeded. Note that the same 
> EMR cluster with two worker nodes was able to run the same app without any 
> issue before and after the incident.
> h2. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
> {quote}$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote} * Another task that's sent to the executor but didn't get launched 
> since the single-threaded dispatcher was stuck (presumably in an "infinite 
> loop" as explained later).
> {quote}$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote}* Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
> {quote}"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at 
> 

[jira] [Assigned] (SPARK-45227) Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an executor process randomly gets stuck

2023-09-28 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45227:
---

Assignee: Bo Xiong

> Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an 
> executor process randomly gets stuck
> 
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, pull-request-available, 
> race-condition, stuck, threadsafe
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
> Attachments: hashtable1.png, hashtable2.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung for over an hour. After we 
> manually killed the executor process, the app succeeded. Note that the same 
> EMR cluster with two worker nodes was able to run the same app without any 
> issue before and after the incident.
> h2. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
> {quote}$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote} * Another task that's sent to the executor but didn't get launched 
> since the single-threaded dispatcher was stuck (presumably in an "infinite 
> loop" as explained later).
> {quote}$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote}* Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
> {quote}"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at 
> 

[jira] [Resolved] (SPARK-45227) Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an executor process randomly gets stuck

2023-09-28 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45227.
-
Resolution: Fixed

> Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an 
> executor process randomly gets stuck
> 
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, pull-request-available, 
> race-condition, stuck, threadsafe
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
> Attachments: hashtable1.png, hashtable2.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung for over an hour. After we 
> manually killed the executor process, the app succeeded. Note that the same 
> EMR cluster with two worker nodes was able to run the same app without any 
> issue before and after the incident.
> h2. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
> {quote}$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote} * Another task that's sent to the executor but didn't get launched 
> since the single-threaded dispatcher was stuck (presumably in an "infinite 
> loop" as explained later).
> {quote}$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote}* Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
> {quote}"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at 
> 

[jira] [Updated] (SPARK-45227) Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an executor process randomly gets stuck

2023-09-28 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-45227:

Fix Version/s: 3.4.2
   4.0.0
   3.5.1

> Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an 
> executor process randomly gets stuck
> 
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, pull-request-available, 
> race-condition, stuck, threadsafe
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
> Attachments: hashtable1.png, hashtable2.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung for over an hour. After we 
> manually killed the executor process, the app succeeded. Note that the same 
> EMR cluster with two worker nodes was able to run the same app without any 
> issue before and after the incident.
> h2. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
> {quote}$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote} * Another task that's sent to the executor but didn't get launched 
> since the single-threaded dispatcher was stuck (presumably in an "infinite 
> loop" as explained later).
> {quote}$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote}* Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
> {quote}"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at 
> 
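The RUNNABLE `findEntry0`/`findOrAddEntry` frames in the dump above are the classic signature of a non-thread-safe hash map being mutated concurrently: an insert or resize can corrupt a bucket chain into a cycle, and a later probe then spins forever while staying RUNNABLE. A minimal sketch of the usual remedy follows; the names are hypothetical, not Spark's actual `CoarseGrainedExecutorBackend` code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for per-task bookkeeping that, per the thread dump,
// was updated from the single-threaded RPC dispatcher while other threads
// also read and removed entries.
public class TaskBookkeeping {

    // scala.collection.mutable.HashMap (like java.util.HashMap) gives no
    // guarantees under concurrent mutation; a corrupted bucket chain can make
    // put()/get() loop forever. ConcurrentHashMap removes that failure mode.
    private final Map<Long, String> taskResources = new ConcurrentHashMap<>();

    public void onLaunchTask(long taskId, String resources) {
        taskResources.put(taskId, resources);   // safe from the dispatcher thread
    }

    public String onTaskFinished(long taskId) {
        return taskResources.remove(taskId);    // safe from a task runner thread
    }
}
```

With a concurrent map, interleaved put/remove calls from different threads can at worst race on individual keys; they can no longer corrupt the map's internal structure.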

[jira] [Assigned] (SPARK-45057) Deadlock caused by rdd replication level of 2

2023-09-28 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45057:
---

Assignee: Zhongwei Zhu

> Deadlock caused by rdd replication level of 2
> -
>
> Key: SPARK-45057
> URL: https://issues.apache.org/jira/browse/SPARK-45057
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Major
>  Labels: pull-request-available
>
>  
> When two tasks try to compute the same RDD with a replication level of 2 
> while running on only two executors, a deadlock will happen.
> A task only releases its write lock after writing the block into the local 
> machine and replicating it to the remote executor.
>  
> ||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task 
> Thread T3)||Exe 2 (Shuffle Server Thread T4)||
> |T0|write lock of rdd| | | |
> |T1| | |write lock of rdd| |
> |T2|replicate -> UploadBlockSync (blocked by T4)| | | |
> |T3| | | |Received UploadBlock request from T1 (blocked by T4)|
> |T4| | |replicate -> UploadBlockSync (blocked by T2)| |
> |T5| |Received UploadBlock request from T3 (blocked by T1)| | |
> |T6|Deadlock|Deadlock|Deadlock|Deadlock|



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45057) Deadlock caused by rdd replication level of 2

2023-09-28 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45057.
-
Fix Version/s: 3.3.4
   3.5.1
   4.0.0
   3.4.2
   Resolution: Fixed

Issue resolved by pull request 43067
[https://github.com/apache/spark/pull/43067]

> Deadlock caused by rdd replication level of 2
> -
>
> Key: SPARK-45057
> URL: https://issues.apache.org/jira/browse/SPARK-45057
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.4, 3.5.1, 4.0.0, 3.4.2
>
>
>  
> When two tasks try to compute the same RDD with a replication level of 2 
> while running on only two executors, a deadlock will happen.
> A task only releases its write lock after writing the block into the local 
> machine and replicating it to the remote executor.
>  
> ||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task 
> Thread T3)||Exe 2 (Shuffle Server Thread T4)||
> |T0|write lock of rdd| | | |
> |T1| | |write lock of rdd| |
> |T2|replicate -> UploadBlockSync (blocked by T4)| | | |
> |T3| | | |Received UploadBlock request from T1 (blocked by T4)|
> |T4| | |replicate -> UploadBlockSync (blocked by T2)| |
> |T5| |Received UploadBlock request from T3 (blocked by T1)| | |
> |T6|Deadlock|Deadlock|Deadlock|Deadlock|






[jira] [Resolved] (SPARK-44937) Add SSL/TLS support for RPC and Shuffle communications

2023-09-28 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-44937.
-
Fix Version/s: 3.3.4
   3.5.1
   4.0.0
   3.4.2
   Resolution: Fixed

Issue resolved by pull request 43162
[https://github.com/apache/spark/pull/43162]

> Add SSL/TLS support for RPC and Shuffle communications
> --
>
> Key: SPARK-44937
> URL: https://issues.apache.org/jira/browse/SPARK-44937
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Security, Shuffle, Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.4, 3.5.1, 4.0.0, 3.4.2
>
>
> Add support for SSL/TLS based communication for Spark RPCs and block 
> transfers - providing an alternative to the existing encryption / 
> authentication implementation documented at 
> [https://spark.apache.org/docs/latest/security.html#spark-rpc-communication-protocol-between-spark-processes]
> This is a superset of the functionality discussed in 
> https://issues.apache.org/jira/browse/SPARK-6373






[jira] [Assigned] (SPARK-44937) Add SSL/TLS support for RPC and Shuffle communications

2023-09-28 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-44937:
---

Assignee: Hasnain Lakhani

> Add SSL/TLS support for RPC and Shuffle communications
> --
>
> Key: SPARK-44937
> URL: https://issues.apache.org/jira/browse/SPARK-44937
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Security, Shuffle, Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> Add support for SSL/TLS based communication for Spark RPCs and block 
> transfers - providing an alternative to the existing encryption / 
> authentication implementation documented at 
> [https://spark.apache.org/docs/latest/security.html#spark-rpc-communication-protocol-between-spark-processes]
> This is a superset of the functionality discussed in 
> https://issues.apache.org/jira/browse/SPARK-6373






[jira] [Resolved] (SPARK-44756) Executor hangs when RetryingBlockTransferor fails to initiate retry

2023-09-26 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-44756.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42426
[https://github.com/apache/spark/pull/42426]

> Executor hangs when RetryingBlockTransferor fails to initiate retry
> ---
>
> Key: SPARK-44756
> URL: https://issues.apache.org/jira/browse/SPARK-44756
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.3.1
>Reporter: Harunobu Daikoku
>Assignee: Harunobu Daikoku
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We have observed this issue several times in production, where some 
> executors get stuck at BlockTransferService#fetchBlockSync().
> After some investigation, the issue appears to be caused by an unhandled edge 
> case in RetryingBlockTransferor.
> 1. Shuffle transfer fails for whatever reason
> {noformat}
> java.io.IOException: Cannot allocate memory
>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>   at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>   at sun.nio.ch.IOUtil.write(IOUtil.java:51)
>   at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
>   at 
> org.apache.spark.network.shuffle.SimpleDownloadFile$SimpleDownloadWritableChannel.write(SimpleDownloadFile.java:78)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onData(OneForOneBlockFetcher.java:340)
>   at 
> org.apache.spark.network.client.StreamInterceptor.handle(StreamInterceptor.java:79)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.feedInterceptor(TransportFrameDecoder.java:263)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:87)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
> {noformat}
> 2. The above exception is caught by 
> [AbstractChannelHandlerContext#invokeChannelRead()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L381],
>  and propagated to the exception handler
> 3. Exception reaches 
> [RetryingBlockTransferor#initiateRetry()|https://github.com/apache/spark/blob/v3.3.1/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java#L178-L180],
>  and it tries to initiate retry
> {noformat}
> 23/08/09 16:58:37 shuffle-client-4-2 INFO RetryingBlockTransferor: Retrying 
> fetch (1/3) for 1 outstanding blocks after 5000 ms
> {noformat}
> 4. Retry initiation fails (in our case, it fails to create a new thread)
> 5. Exception is caught by 
> [AbstractChannelHandlerContext#invokeExceptionCaught()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L305-L309],
>  and not further processed
> {noformat}
> 23/08/09 16:58:53 shuffle-client-4-2 DEBUG AbstractChannelHandlerContext: An 
> exception java.lang.OutOfMemoryError: unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:719)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor.initiateRetry(RetryingBlockTransferor.java:182)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor.access$500(RetryingBlockTransferor.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.handleBlockTransferFailure(RetryingBlockTransferor.java:230)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.onBlockFetchFailure(RetryingBlockTransferor.java:260)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher.failRemainingBlocks(OneForOneBlockFetcher.java:318)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher.access$300(OneForOneBlockFetcher.java:55)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onFailure(OneForOneBlockFetcher.java:357)
>   at 
> org.apache.spark.network.client.StreamInterceptor.exceptionCaught(StreamInterceptor.java:56)
>   at 
> 
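A minimal sketch of the kind of guard this failure mode calls for (illustrative names, not Spark's actual RetryingBlockTransferor API): if scheduling the retry task itself throws, fail the outstanding blocks so callers blocked in fetchBlockSync() are woken up instead of hanging forever.

```python
class FailureCollectingListener:
    """Stand-in for a block-fetch listener that records failures."""
    def __init__(self):
        self.failures = []

    def on_block_fetch_failure(self, block_id, error):
        self.failures.append((block_id, error))

def initiate_retry(submit, retry_task, listener, outstanding_blocks):
    """Schedule a retry; if scheduling itself fails, fail the blocks
    rather than silently swallowing the error and hanging the fetch."""
    try:
        submit(retry_task)
        return True
    except BaseException as e:
        for block_id in outstanding_blocks:
            listener.on_block_fetch_failure(block_id, e)
        return False

# Simulate an executor that cannot start a new thread (step 4 in the report).
def broken_submit(task):
    raise RuntimeError("unable to create new native thread")

listener = FailureCollectingListener()
ok = initiate_retry(broken_submit, lambda: None, listener, ["shuffle_0_1_2"])
print(ok, [b for b, _ in listener.failures])  # False ['shuffle_0_1_2']
```

The key point is that the scheduling failure is propagated to the listener, so the waiting fetch completes (with an error) instead of blocking indefinitely.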

[jira] [Assigned] (SPARK-44756) Executor hangs when RetryingBlockTransferor fails to initiate retry

2023-09-26 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-44756:
---

Assignee: Harunobu Daikoku

> Executor hangs when RetryingBlockTransferor fails to initiate retry
> ---
>
> Key: SPARK-44756
> URL: https://issues.apache.org/jira/browse/SPARK-44756
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.3.1
>Reporter: Harunobu Daikoku
>Assignee: Harunobu Daikoku
>Priority: Minor
>  Labels: pull-request-available
>
> We have observed this issue several times in production, where some 
> executors get stuck at BlockTransferService#fetchBlockSync().
> After some investigation, the issue appears to be caused by an unhandled edge 
> case in RetryingBlockTransferor.
> 1. Shuffle transfer fails for whatever reason
> {noformat}
> java.io.IOException: Cannot allocate memory
>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>   at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>   at sun.nio.ch.IOUtil.write(IOUtil.java:51)
>   at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
>   at 
> org.apache.spark.network.shuffle.SimpleDownloadFile$SimpleDownloadWritableChannel.write(SimpleDownloadFile.java:78)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onData(OneForOneBlockFetcher.java:340)
>   at 
> org.apache.spark.network.client.StreamInterceptor.handle(StreamInterceptor.java:79)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.feedInterceptor(TransportFrameDecoder.java:263)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:87)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
> {noformat}
> 2. The above exception is caught by 
> [AbstractChannelHandlerContext#invokeChannelRead()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L381],
>  and propagated to the exception handler
> 3. Exception reaches 
> [RetryingBlockTransferor#initiateRetry()|https://github.com/apache/spark/blob/v3.3.1/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java#L178-L180],
>  and it tries to initiate retry
> {noformat}
> 23/08/09 16:58:37 shuffle-client-4-2 INFO RetryingBlockTransferor: Retrying 
> fetch (1/3) for 1 outstanding blocks after 5000 ms
> {noformat}
> 4. Retry initiation fails (in our case, it fails to create a new thread)
> 5. Exception is caught by 
> [AbstractChannelHandlerContext#invokeExceptionCaught()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L305-L309],
>  and not further processed
> {noformat}
> 23/08/09 16:58:53 shuffle-client-4-2 DEBUG AbstractChannelHandlerContext: An 
> exception java.lang.OutOfMemoryError: unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:719)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor.initiateRetry(RetryingBlockTransferor.java:182)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor.access$500(RetryingBlockTransferor.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.handleBlockTransferFailure(RetryingBlockTransferor.java:230)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.onBlockFetchFailure(RetryingBlockTransferor.java:260)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher.failRemainingBlocks(OneForOneBlockFetcher.java:318)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher.access$300(OneForOneBlockFetcher.java:55)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onFailure(OneForOneBlockFetcher.java:357)
>   at 
> org.apache.spark.network.client.StreamInterceptor.exceptionCaught(StreamInterceptor.java:56)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.exceptionCaught(TransportFrameDecoder.java:231)
>   at 
> 

[jira] [Resolved] (SPARK-44306) Group FileStatus with few RPC calls within Yarn Client

2023-09-19 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-44306.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42357
[https://github.com/apache/spark/pull/42357]

> Group FileStatus with few RPC calls within Yarn Client
> --
>
> Key: SPARK-44306
> URL: https://issues.apache.org/jira/browse/SPARK-44306
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Submit
>Affects Versions: 0.9.2, 2.3.0, 3.5.0
>Reporter: SHU WANG
>Assignee: SHU WANG
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> It is inefficient to obtain the *FileStatus* for each resource [one by 
> one|https://github.com/apache/spark/blob/531ec8bddc8dd22ca39486dbdd31e62e989ddc15/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala#L71C1].
>  In our company setting, we run Spark with Hadoop Yarn and HDFS. We 
> noticed the current behavior has two major drawbacks:
>  # Since each *getFileStatus* call involves network delays, the overall delay 
> can be *large* and adds *uncertainty* to the overall Spark job runtime. 
> Specifically, we quantified this overhead within our cluster: the p50 
> overhead is around 10s, p80 is 1 min, and p100 is up to 15 mins. When HDFS is 
> overloaded, the delays become more severe.
>  # In our cluster, we make nearly 100 million *getFileStatus* calls to HDFS 
> daily. We noticed that most resources come from the same HDFS 
> directory for each user (see our [engineering blog 
> post|https://engineering.linkedin.com/blog/2023/reducing-apache-spark-application-dependencies-upload-by-99-]
>  about why we took this approach). Therefore, we can reduce the nearly 
> 100 million *getFileStatus* calls to 0.1 million *listStatus* calls daily. 
> This further reduces the load on the HDFS side.
> All in all, a more efficient way to fetch the *FileStatus* for each resource 
> is needed.
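The batching idea can be sketched as follows (illustrative names, not Spark's or HDFS's actual API): group resources by parent directory and issue one listStatus-style call per directory instead of one getFileStatus-style call per file.

```python
import os
from collections import defaultdict

def batched_statuses(paths, list_status):
    """Return {path: status} using one list_status RPC per parent directory
    instead of one get_file_status RPC per path."""
    by_dir = defaultdict(list)
    for p in paths:
        by_dir[os.path.dirname(p)].append(p)
    statuses = {}
    for d, members in by_dir.items():
        # One listing call covers every requested file under this directory.
        listing = {s["path"]: s for s in list_status(d)}
        for p in members:
            statuses[p] = listing[p]
    return statuses

# Fake filesystem: record how many listing RPCs are issued.
calls = []
def fake_list_status(d):
    calls.append(d)
    return [{"path": f"{d}/jar{i}", "len": i} for i in range(3)]

out = batched_statuses(["/user/a/jar0", "/user/a/jar1", "/user/b/jar2"],
                       fake_list_status)
print(len(calls), len(out))  # 2 3 -> two directory listings for three files
```

With many files per directory (the common case described above), the RPC count drops by roughly the average directory fan-out.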






[jira] [Assigned] (SPARK-44306) Group FileStatus with few RPC calls within Yarn Client

2023-09-19 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-44306:
---

Assignee: SHU WANG

> Group FileStatus with few RPC calls within Yarn Client
> --
>
> Key: SPARK-44306
> URL: https://issues.apache.org/jira/browse/SPARK-44306
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Submit
>Affects Versions: 0.9.2, 2.3.0, 3.5.0
>Reporter: SHU WANG
>Assignee: SHU WANG
>Priority: Major
>  Labels: pull-request-available
>
> It is inefficient to obtain the *FileStatus* for each resource [one by 
> one|https://github.com/apache/spark/blob/531ec8bddc8dd22ca39486dbdd31e62e989ddc15/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala#L71C1].
>  In our company setting, we run Spark with Hadoop Yarn and HDFS. We 
> noticed the current behavior has two major drawbacks:
>  # Since each *getFileStatus* call involves network delays, the overall delay 
> can be *large* and adds *uncertainty* to the overall Spark job runtime. 
> Specifically, we quantified this overhead within our cluster: the p50 
> overhead is around 10s, p80 is 1 min, and p100 is up to 15 mins. When HDFS is 
> overloaded, the delays become more severe.
>  # In our cluster, we make nearly 100 million *getFileStatus* calls to HDFS 
> daily. We noticed that most resources come from the same HDFS 
> directory for each user (see our [engineering blog 
> post|https://engineering.linkedin.com/blog/2023/reducing-apache-spark-application-dependencies-upload-by-99-]
>  about why we took this approach). Therefore, we can reduce the nearly 
> 100 million *getFileStatus* calls to 0.1 million *listStatus* calls daily. 
> This further reduces the load on the HDFS side.
> All in all, a more efficient way to fetch the *FileStatus* for each resource 
> is needed.






[jira] [Resolved] (SPARK-44845) spark job copies jars repeatedly if fs.defaultFS and application jar are same url

2023-09-06 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-44845.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42529
[https://github.com/apache/spark/pull/42529]

> spark job copies jars repeatedly if fs.defaultFS and application jar are same 
> url
> -
>
> Key: SPARK-44845
> URL: https://issues.apache.org/jira/browse/SPARK-44845
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.4.1
>Reporter: zheju_he
>Assignee: zheju_he
>Priority: Major
> Fix For: 4.0.0
>
>
> In the org.apache.spark.deploy.yarn.Client#compareUri method, 
> hdfs://hadoop81:8020 and hdfs://192.168.0.81:8020 are treated as different 
> file systems (hadoop81 resolves to 192.168.0.81). The root cause is that, 
> since the earlier PR, URIs with different user information are also treated 
> as different file systems. URI.getAuthority is used to compare the user 
> information, but the authority also contains the host, so the two URIs above 
> always compare as different. To decide whether the user authentication 
> information differs, it is sufficient to compare URI.getUserInfo.
>  
> Earlier PR and issue:
> https://issues.apache.org/jira/browse/SPARK-22587
> https://github.com/apache/spark/pull/19885
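The proposed comparison can be sketched like this (illustrative, not the actual Client#compareUri code): compare only the user-info portion of the two URIs, so a hostname and its resolved IP with identical credentials no longer differ on that check. Host/port equivalence is handled separately and is out of scope here.

```python
from urllib.parse import urlparse

def same_user_info(a, b):
    """Compare only the user-info (username/password) portion of two URIs,
    rather than the whole authority, which also embeds host and port."""
    ua, ub = urlparse(a), urlparse(b)
    return (ua.username, ua.password) == (ub.username, ub.password)

# Neither URI carries credentials, so user info matches even though the
# authorities (hostname vs. resolved IP) differ.
print(same_user_info("hdfs://hadoop81:8020", "hdfs://192.168.0.81:8020"))      # True
# Different embedded users are still detected as different.
print(same_user_info("hdfs://alice@hadoop81:8020", "hdfs://bob@hadoop81:8020"))  # False
```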






[jira] [Assigned] (SPARK-44845) spark job copies jars repeatedly if fs.defaultFS and application jar are same url

2023-09-06 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-44845:
---

Assignee: zheju_he

> spark job copies jars repeatedly if fs.defaultFS and application jar are same 
> url
> -
>
> Key: SPARK-44845
> URL: https://issues.apache.org/jira/browse/SPARK-44845
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.4.1
>Reporter: zheju_he
>Assignee: zheju_he
>Priority: Major
>
> In the org.apache.spark.deploy.yarn.Client#compareUri method, 
> hdfs://hadoop81:8020 and hdfs://192.168.0.81:8020 are treated as different 
> file systems (hadoop81 resolves to 192.168.0.81). The root cause is that, 
> since the earlier PR, URIs with different user information are also treated 
> as different file systems. URI.getAuthority is used to compare the user 
> information, but the authority also contains the host, so the two URIs above 
> always compare as different. To decide whether the user authentication 
> information differs, it is sufficient to compare URI.getUserInfo.
>  
> Earlier PR and issue:
> https://issues.apache.org/jira/browse/SPARK-22587
> https://github.com/apache/spark/pull/19885






[jira] [Assigned] (SPARK-44238) Introduce a new readFrom method with byte array input for BloomFilter

2023-08-31 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-44238:
---

Assignee: Yang Jie

> Introduce a new readFrom method with byte array input for BloomFilter
> -
>
> Key: SPARK-44238
> URL: https://issues.apache.org/jira/browse/SPARK-44238
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>







[jira] [Resolved] (SPARK-44238) Introduce a new readFrom method with byte array input for BloomFilter

2023-08-31 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-44238.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 41781
[https://github.com/apache/spark/pull/41781]

> Introduce a new readFrom method with byte array input for BloomFilter
> -
>
> Key: SPARK-44238
> URL: https://issues.apache.org/jira/browse/SPARK-44238
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-44162) Support G1GC in `spark.eventLog.gcMetrics.*` without warning

2023-08-31 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-44162:
---

Assignee: Jia Fan

> Support G1GC in `spark.eventLog.gcMetrics.*` without warning
> 
>
> Key: SPARK-44162
> URL: https://issues.apache.org/jira/browse/SPARK-44162
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Jia Fan
>Priority: Major
>
> {code}
> >>> 23/06/23 14:26:53 WARN GarbageCollectionMetrics: To enable non-built-in 
> >>> garbage collector(s) List(G1 Concurrent GC), users should configure 
> >>> it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or 
> >>> spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
> {code}






[jira] [Resolved] (SPARK-44162) Support G1GC in `spark.eventLog.gcMetrics.*` without warning

2023-08-31 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-44162.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 41808
[https://github.com/apache/spark/pull/41808]

> Support G1GC in `spark.eventLog.gcMetrics.*` without warning
> 
>
> Key: SPARK-44162
> URL: https://issues.apache.org/jira/browse/SPARK-44162
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Jia Fan
>Priority: Major
> Fix For: 4.0.0
>
>
> {code}
> >>> 23/06/23 14:26:53 WARN GarbageCollectionMetrics: To enable non-built-in 
> >>> garbage collector(s) List(G1 Concurrent GC), users should configure 
> >>> it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or 
> >>> spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
> {code}






[jira] [Resolved] (SPARK-44242) Spark job submission failed because Xmx string is available on one parameter provided into spark.driver.extraJavaOptions

2023-08-11 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-44242.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 41806
[https://github.com/apache/spark/pull/41806]

> Spark job submission failed because Xmx string is available on one parameter 
> provided into spark.driver.extraJavaOptions
> 
>
> Key: SPARK-44242
> URL: https://issues.apache.org/jira/browse/SPARK-44242
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.3.2, 3.4.1
>Reporter: Nicolas Fraison
>Assignee: Nicolas Fraison
>Priority: Major
> Fix For: 4.0.0
>
>
> The spark-submit command fails if the string Xmx is found in any parameter 
> provided to spark.driver.extraJavaOptions.
> For example, running this spark-submit command line
> {code:java}
> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --conf 
> "spark.driver.extraJavaOptions=-Dtest=Xmx"  
> examples/jars/spark-examples_2.12-3.4.1.jar 100{code}
> fails due to
> {code:java}
> Error: Not allowed to specify max heap(Xmx) memory settings through java 
> options (was -Dtest=Xmx). Use the corresponding --driver-memory or 
> spark.driver.memory configuration instead.{code}
> The check performed in 
> [https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L314]
>  seems too broad.
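A narrower check might look like this (illustrative, not the actual SparkSubmitCommandBuilder logic): flag an option only when it is itself a JVM max-heap flag (-Xmx...), rather than whenever the substring Xmx appears anywhere in a value.

```python
import re

# A JVM max-heap setting is a token that *starts with* -Xmx, e.g. -Xmx4g.
JVM_XMX = re.compile(r"^-Xmx")

def contains_xmx_setting(extra_java_options):
    """True only if one of the whitespace-separated options is a real
    -Xmx flag; a value that merely mentions 'Xmx' is not flagged."""
    return any(JVM_XMX.match(opt) for opt in extra_java_options.split())

print(contains_xmx_setting("-Dtest=Xmx"))       # False: only mentions Xmx
print(contains_xmx_setting("-Xmx4g -Dtest=1"))  # True: an actual heap flag
```

Splitting on whitespace is a simplification here; real option strings may contain quoted values that need proper tokenization.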






[jira] [Assigned] (SPARK-44242) Spark job submission failed because Xmx string is available on one parameter provided into spark.driver.extraJavaOptions

2023-08-11 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-44242:
---

Assignee: Nicolas Fraison

> Spark job submission failed because Xmx string is available on one parameter 
> provided into spark.driver.extraJavaOptions
> 
>
> Key: SPARK-44242
> URL: https://issues.apache.org/jira/browse/SPARK-44242
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.3.2, 3.4.1
>Reporter: Nicolas Fraison
>Assignee: Nicolas Fraison
>Priority: Major
>
> The spark-submit command fails if the string Xmx is found in any parameter 
> provided to spark.driver.extraJavaOptions.
> For example, running this spark-submit command line
> {code:java}
> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --conf 
> "spark.driver.extraJavaOptions=-Dtest=Xmx"  
> examples/jars/spark-examples_2.12-3.4.1.jar 100{code}
> fails due to
> {code:java}
> Error: Not allowed to specify max heap(Xmx) memory settings through java 
> options (was -Dtest=Xmx). Use the corresponding --driver-memory or 
> spark.driver.memory configuration instead.{code}
> The check performed in 
> [https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L314]
>  seems too broad.






[jira] [Resolved] (SPARK-43987) Separate finalizeShuffleMerge Processing to Dedicated Thread Pools

2023-08-11 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-43987.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 41489
[https://github.com/apache/spark/pull/41489]

> Separate finalizeShuffleMerge Processing to Dedicated Thread Pools
> --
>
> Key: SPARK-43987
> URL: https://issues.apache.org/jira/browse/SPARK-43987
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.4.0
>Reporter: SHU WANG
>Assignee: SHU WANG
>Priority: Critical
> Fix For: 4.0.0
>
>
> In our production environment, _finalizeShuffleMerge_ processing took longer 
> (p90 around 20 seconds) than other RPC requests, because _finalizeShuffleMerge_ 
> performs IO operations such as truncating and opening/closing files.
> More importantly, processing _finalizeShuffleMerge_ can block other critical 
> lightweight messages such as authentication requests, which can cause 
> authentication timeouts as well as fetch failures. Those timeouts and fetch 
> failures affect the stability of Spark job executions. 
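The fix described above, routing the slow finalize path to its own pool, could look roughly like the following sketch. The class and method names are illustrative assumptions, not the actual Spark shuffle service API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RpcDispatchSketch {
    // Slow, I/O-bound finalizeShuffleMerge work goes to a dedicated pool so
    // lightweight messages (authentication, heartbeats) never queue behind it.
    private final ExecutorService finalizePool = Executors.newFixedThreadPool(4);

    public void dispatch(Runnable handler, boolean isFinalizeShuffleMerge) {
        if (isFinalizeShuffleMerge) {
            finalizePool.submit(handler);   // off the shared RPC threads
        } else {
            handler.run();                  // fast path stays responsive
        }
    }

    public void shutdown() {
        finalizePool.shutdown();
    }
}
```

The key design point is isolation: even if every finalize request takes tens of seconds, the shared RPC threads remain free to answer authentication promptly.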






[jira] [Assigned] (SPARK-43987) Separate finalizeShuffleMerge Processing to Dedicated Thread Pools

2023-08-11 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-43987:
---

Assignee: SHU WANG

> Separate finalizeShuffleMerge Processing to Dedicated Thread Pools
> --
>
> Key: SPARK-43987
> URL: https://issues.apache.org/jira/browse/SPARK-43987
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.4.0
>Reporter: SHU WANG
>Assignee: SHU WANG
>Priority: Critical
>
> In our production environment, _finalizeShuffleMerge_ processing took longer 
> (p90 around 20 seconds) than other RPC requests, because _finalizeShuffleMerge_ 
> performs IO operations such as truncating and opening/closing files.
> More importantly, processing _finalizeShuffleMerge_ can block other critical 
> lightweight messages such as authentication requests, which can cause 
> authentication timeouts as well as fetch failures. Those timeouts and fetch 
> failures affect the stability of Spark job executions. 






[jira] [Updated] (SPARK-44272) Path Inconsistency when Operating statCache within Yarn Client

2023-07-19 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-44272:

Affects Version/s: (was: 0.9.1)
   (was: 2.3.0)

> Path Inconsistency when Operating statCache within Yarn Client
> --
>
> Key: SPARK-44272
> URL: https://issues.apache.org/jira/browse/SPARK-44272
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.4.0, 3.5.0
>Reporter: SHU WANG
>Assignee: SHU WANG
>Priority: Critical
> Fix For: 3.5.0, 4.0.0
>
>
> The *addResource* method of *ClientDistributedCacheManager* can add a 
> *FileStatus* to *statCache* when it is not yet cached. There is also a subtle 
> bug in *isPublic*, which is called from the *getVisibility* method: 
> *uri.getPath()* does not retain URI information such as the scheme and host, 
> so the *uri* passed to *checkPermissionOfOther* will differ from the original 
> {*}uri{*}.
> For example, if uri is "file:/foo.invalid.com:8080/tmp/testing", then 
> {code:java}
> uri.getPath -> /foo.invalid.com:8080/tmp/testing
> uri.toString -> file:/foo.invalid.com:8080/tmp/testing{code}
> The consequence of this bug is that we make *double the RPC calls* when the 
> resources are remote, which is unnecessary. We see nontrivial overhead when 
> checking those resources on our HDFS, especially when HDFS is overloaded. 
>  
> Ref: related code within *ClientDistributedCacheManager*
> {code:java}
> def addResource(...) {
>   val destStatus = statCache.getOrElse(destPath.toUri(), fs.getFileStatus(destPath))
>   val visibility = getVisibility(conf, destPath.toUri(), statCache)
> }
> 
> private[yarn] def getVisibility() {
>   isPublic(conf, uri, statCache)
> }
> 
> private def isPublic(conf: Configuration, uri: URI,
>     statCache: Map[URI, FileStatus]): Boolean = {
>   val current = new Path(uri.getPath()) // Should not use getPath
>   checkPermissionOfOther(fs, uri, FsAction.READ, statCache)
> }
> {code}
>  
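The getPath-versus-toString behavior from the description can be verified directly with `java.net.URI`, independent of the Spark code quoted above:

```java
import java.net.URI;

public class UriPathDemo {
    public static void main(String[] args) {
        URI uri = URI.create("file:/foo.invalid.com:8080/tmp/testing");
        // getPath() keeps only the path component: the "file:" scheme is
        // dropped, so a cache keyed by the full URI will not match this value.
        System.out.println(uri.getPath());  // /foo.invalid.com:8080/tmp/testing
        System.out.println(uri);            // file:/foo.invalid.com:8080/tmp/testing
    }
}
```

This is why keying the statCache lookup on `uri.getPath()` misses the cached entry and triggers a second filesystem RPC.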





