This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 0d997e5  [SPARK-31219][YARN] Enable closeIdleConnections in YarnShuffleService
0d997e5 is described below

commit 0d997e5156a751c99cd6f8be1528ed088a585d1f
Author: manuzhang <owenzhang1...@gmail.com>
AuthorDate: Mon Mar 30 12:44:46 2020 -0500

    [SPARK-31219][YARN] Enable closeIdleConnections in YarnShuffleService
    
    ### What changes were proposed in this pull request?
    Close idle connections on the shuffle server side when an `IdleStateEvent` is triggered after the `spark.shuffle.io.connectionTimeout` or `spark.network.timeout` interval elapses. The change is based on the following investigation.
    
    1. We found connections on our clusters building up continuously (> 10k on some nodes). Is that normal? We don't think so.
    2. We looked into the connections on one node and found many half-open connections (connections that existed on only one node).
    3. We also checked that those connections were very old (> 21 hours). (FYI, https://superuser.com/questions/565991/how-to-determine-the-socket-connection-up-time-on-linux)
    4. Looking at the code, `TransportContext` registers an `IdleStateHandler`, which should fire an `IdleStateEvent` on timeout. We took a heap dump of the YarnShuffleService and checked the attributes of the `IdleStateHandler` instances. It turned out that `firstAllIdleEvent` was already false for many of them, so their `IdleStateEvent`s had already been fired.
    5. Finally, we realized that the `IdleStateEvent` would not be handled, since `closeIdleConnections` is hardcoded to false for YarnShuffleService.
    
    ### Why are the changes needed?
    Idle connections to YarnShuffleService are never closed, so they accumulate and take up memory and file descriptors.
    
    ### Does this PR introduce any user-facing change?
    No.
    
    ### How was this patch tested?
    Existing tests.
    
    Closes #27998 from manuzhang/spark-31219.
    
    Authored-by: manuzhang <owenzhang1...@gmail.com>
    Signed-off-by: Thomas Graves <tgra...@apache.org>
---
 .../src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java b/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
index 815a56d..c41efba 100644
--- a/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
+++ b/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
@@ -188,7 +188,7 @@ public class YarnShuffleService extends AuxiliaryService {
 
       int port = conf.getInt(
         SPARK_SHUFFLE_SERVICE_PORT_KEY, DEFAULT_SPARK_SHUFFLE_SERVICE_PORT);
-      transportContext = new TransportContext(transportConf, blockHandler);
+      transportContext = new TransportContext(transportConf, blockHandler, true);
       shuffleServer = transportContext.createServer(port, bootstraps);
       // the port should normally be fixed, but for tests its useful to find an open port
       port = shuffleServer.getPort();

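The effect of the third constructor argument can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `FakeChannel`, `IdleHandler`, and `openAfterIdle` are hypothetical names invented for illustration, not Spark or Netty classes, and the model ignores anything else the real handler may check before closing a channel. It shows only the behavior described in the commit message: when the flag is false, idle events are dropped and the connections stay open.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of idle-connection handling; not Spark's actual classes. */
public class IdleCloseSketch {

    /** Stand-in for a network channel that tracks only whether it is open. */
    static final class FakeChannel {
        private boolean open = true;

        boolean isOpen() {
            return open;
        }

        void close() {
            open = false;
        }
    }

    /** Stand-in for a server-side handler that receives idle notifications. */
    static final class IdleHandler {
        private final boolean closeIdleConnections;

        IdleHandler(boolean closeIdleConnections) {
            this.closeIdleConnections = closeIdleConnections;
        }

        /** Called when a channel has been idle for the whole timeout period. */
        void onIdle(FakeChannel channel) {
            if (closeIdleConnections) {
                channel.close();
            }
            // With the flag off, the event is dropped and the channel stays open.
        }
    }

    /** Returns how many channels are still open after each one has reported idle. */
    public static int openAfterIdle(boolean closeIdleConnections, int channelCount) {
        IdleHandler handler = new IdleHandler(closeIdleConnections);
        List<FakeChannel> channels = new ArrayList<>();
        for (int i = 0; i < channelCount; i++) {
            channels.add(new FakeChannel());
        }
        for (FakeChannel channel : channels) {
            handler.onIdle(channel);
        }
        int stillOpen = 0;
        for (FakeChannel channel : channels) {
            if (channel.isOpen()) {
                stillOpen++;
            }
        }
        return stillOpen;
    }

    public static void main(String[] args) {
        System.out.println("flag=false, still open: " + openAfterIdle(false, 5));
        System.out.println("flag=true, still open: " + openAfterIdle(true, 5));
    }
}
```

In this model, five idle channels all remain open with the flag set to false and none remain open with it set to true, which mirrors the accumulation reported in the commit message and the intent of the change.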

