[ https://issues.apache.org/jira/browse/SPARK-12583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Or updated SPARK-12583:
------------------------------
    Assignee: Bertrand Bossy

> spark shuffle fails with mesos after 2mins
> ------------------------------------------
>
>                 Key: SPARK-12583
>                 URL: https://issues.apache.org/jira/browse/SPARK-12583
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.6.0
>            Reporter: Adrian Bridgett
>            Assignee: Bertrand Bossy
>             Fix For: 2.0.0
>
>
> See user mailing list "Executor deregistered after 2mins" for more details.
> As of 1.6, the driver registers with each shuffle manager via
> MesosExternalShuffleClient. Once this disconnects, the shuffle manager
> automatically cleans up the data associated with that driver.
> However, the connection is terminated early because it is idle.
> Looking at a packet trace, after 120 seconds the shuffle manager sends a FIN
> packet to the driver. The only way to delay this is to increase
> spark.shuffle.io.connectionTimeout (e.g. to 3600s) on the shuffle manager.
> I patched the MesosExternalShuffleClient (and ExternalShuffleClient) with
> newbie Scala skills to construct the TransportContext with
> closeIdleConnections set to "false", and this didn't help (I hadn't done the
> network trace first).
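
The workaround mentioned in the description could be applied roughly as follows. This is a minimal sketch, assuming the Mesos external shuffle service reads its settings from conf/spark-defaults.conf on each host where it runs. The property name and the 3600s value come from the report; the file location is an assumption.

```
# conf/spark-defaults.conf on each host running the external shuffle service
# (assumed location; adjust to your deployment)

# Raise the idle-connection timeout well above the ~120 seconds at which
# the shuffle manager was observed sending a FIN to the driver.
spark.shuffle.io.connectionTimeout   3600s
```

The report states that the setting has to be applied on the shuffle manager side. The shuffle service would most likely need a restart to pick up the new value.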