[ https://issues.apache.org/jira/browse/SPARK-31219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Manu Zhang updated SPARK-31219:
-------------------------------
    Description:
Recently, we found that our YarnShuffleService has a lot of [half-open connections|https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html], where the shuffle server's connections are still active although the clients have already closed them.

For example, the server's `ss -nt sport = :7337` output contains
{code:java}
ESTAB 0 0 server:7337 client:port
{code}
However, on the client, `ss -nt dport = :7337 | grep server` returns nothing.

Looking at the code, `YarnShuffleService` creates a `TransportContext` with `closeIdleConnections` set to false.
{code:java}
public class YarnShuffleService extends AuxiliaryService {
  ...
  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    ...
    transportContext = new TransportContext(transportConf, blockHandler);
    ...
  }
  ...
}

public class TransportContext implements Closeable {
  ...
  public TransportContext(TransportConf conf, RpcHandler rpcHandler) {
    this(conf, rpcHandler, false, false);
  }

  public TransportContext(TransportConf conf, RpcHandler rpcHandler, boolean closeIdleConnections) {
    this(conf, rpcHandler, closeIdleConnections, false);
  }
  ...
}{code}
Hence, the channel may never be closed on the server side if the server misses the event that the client has closed it.

I find that the parameter is true for `ExternalShuffleService`.

Is there any reason for the difference here? Can we enable closeIdleConnections in YarnShuffleService, or at least add a configuration to enable it?

  was:
Recently, we found that our YarnShuffleService has a lot of [half-open connections|https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html], where the shuffle server's connections are still active although the clients have already closed them.

For example, the server's `ss -nt sport = :7337` output contains
{code:java}
ESTAB 0 0 server:7337 client:port
{code}
However, on the client, `ss -nt dport = :7337 | grep server` returns nothing.

Looking at the code, `YarnShuffleService` creates a `TransportContext` with `closeIdleConnections` set to false.
{code:java}
public class YarnShuffleService extends AuxiliaryService {
  ...
  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    ...
    transportContext = new TransportContext(transportConf, blockHandler);
    ...
  }
  ...
}

public class TransportContext implements Closeable {
  ...
  public TransportContext(TransportConf conf, RpcHandler rpcHandler) {
    this(conf, rpcHandler, false, false);
  }

  public TransportContext(TransportConf conf, RpcHandler rpcHandler, boolean closeIdleConnections) {
    this(conf, rpcHandler, closeIdleConnections, false);
  }
  ...
}{code}
Hence, the channel may never be closed on the server side if the server misses the event that the client has closed it.

I find that the parameter is true for `ExternalShuffleService`.

Is there any reason for the difference here? Would it be valuable to add a configuration to enable closeIdleConnections?

> YarnShuffleService doesn't close idle netty channel
> ---------------------------------------------------
>
>                 Key: SPARK-31219
>                 URL: https://issues.apache.org/jira/browse/SPARK-31219
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.4.5, 3.0.0
>            Reporter: Manu Zhang
>            Priority: Major
>
> Recently, we found that our YarnShuffleService has a lot of [half-open
> connections|https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html],
> where the shuffle server's connections are still active although the clients
> have already closed them.
> For example, the server's `ss -nt sport = :7337` output contains
> {code:java}
> ESTAB 0 0 server:7337 client:port
> {code}
> However, on the client, `ss -nt dport = :7337 | grep server` returns nothing.
> Looking at the code, `YarnShuffleService` creates a `TransportContext` with
> `closeIdleConnections` set to false.
> {code:java}
> public class YarnShuffleService extends AuxiliaryService {
>   ...
>   @Override
>   protected void serviceInit(Configuration conf) throws Exception {
>     ...
>     transportContext = new TransportContext(transportConf, blockHandler);
>     ...
>   }
>   ...
> }
> public class TransportContext implements Closeable {
>   ...
>   public TransportContext(TransportConf conf, RpcHandler rpcHandler) {
>     this(conf, rpcHandler, false, false);
>   }
>   public TransportContext(TransportConf conf, RpcHandler rpcHandler, boolean closeIdleConnections) {
>     this(conf, rpcHandler, closeIdleConnections, false);
>   }
>   ...
> }{code}
> Hence, the channel may never be closed on the server side if the server
> misses the event that the client has closed it.
> I find that the parameter is true for `ExternalShuffleService`.
> Is there any reason for the difference here? Can we enable
> closeIdleConnections in YarnShuffleService, or at least add a configuration
> to enable it?
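
The constructor delegation quoted in the description, and the configuration switch the issue asks about, can be sketched in isolation roughly as follows. This is a minimal, self-contained model and not Spark code: `IdleCloseSketch`, its nested `Context` class, the `createCurrent` and `createConfigurable` helpers, and the `hypothetical.shuffle.closeIdleConnections` key are all invented for illustration, and defaulting the flag to false is an assumption chosen only to preserve the behaviour described above.

```java
import java.util.Map;

/** Illustrative stand-ins only; none of these are Spark's real classes. */
public class IdleCloseSketch {

    // Hypothetical configuration key, invented for this sketch.
    static final String CLOSE_IDLE_KEY = "hypothetical.shuffle.closeIdleConnections";

    /** Simplified model of the constructor delegation quoted in the issue. */
    static final class Context {
        final boolean closeIdleConnections;

        // Two-argument form: idle-connection closing silently defaults to false.
        Context(Map<String, String> conf, String handler) {
            this(conf, handler, false);
        }

        // Three-argument form: the caller decides explicitly.
        Context(Map<String, String> conf, String handler, boolean closeIdleConnections) {
            this.closeIdleConnections = closeIdleConnections;
        }
    }

    /** Mirrors the call pattern in the issue: the flag is never passed, so it stays false. */
    static Context createCurrent(Map<String, String> conf) {
        return new Context(conf, "blockHandler");
    }

    /** Possible variant: read the flag from configuration, defaulting to false. */
    static Context createConfigurable(Map<String, String> conf) {
        boolean closeIdle =
            Boolean.parseBoolean(conf.getOrDefault(CLOSE_IDLE_KEY, "false"));
        return new Context(conf, "blockHandler", closeIdle);
    }

    public static void main(String[] args) {
        Map<String, String> enabled = Map.of(CLOSE_IDLE_KEY, "true");
        // The two-argument path ignores the setting entirely.
        System.out.println(createCurrent(enabled).closeIdleConnections);
        // The configurable path honours it, and falls back to false when unset.
        System.out.println(createConfigurable(enabled).closeIdleConnections);
        System.out.println(createConfigurable(Map.of()).closeIdleConnections);
    }
}
```

Whether the real service should default such a switch to true, as the description reports for `ExternalShuffleService`, is exactly the open question raised in the issue; the sketch only shows that the two-argument constructor leaves no way to opt in.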