[ https://issues.apache.org/jira/browse/SPARK-16505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373654#comment-15373654 ]
Apache Spark commented on SPARK-16505:
--------------------------------------

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14162

> YARN shuffle service should throw errors when it fails to start
> ---------------------------------------------------------------
>
>                 Key: SPARK-16505
>                 URL: https://issues.apache.org/jira/browse/SPARK-16505
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.0.0
>            Reporter: Marcelo Vanzin
>
> Right now the YARN shuffle service will swallow errors that happen during
> startup and just log them:
> {code}
>     try {
>       blockHandler = new ExternalShuffleBlockHandler(transportConf,
>         registeredExecutorFile);
>     } catch (Exception e) {
>       logger.error("Failed to initialize external shuffle service", e);
>     }
> {code}
> This causes two undesirable things to happen:
> - because {{blockHandler}} will remain {{null}} when an error happens, every
>   request to the shuffle service will cause an NPE
> - because the NM is running, containers may be assigned to that host, only to
>   fail to register with the shuffle service.
> Example of the first:
> {noformat}
> 2016-05-25 15:01:12,198 ERROR org.apache.spark.network.TransportContext: Error while initializing Netty pipeline
> java.lang.NullPointerException
>         at org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
>         at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
>         at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
>         at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
>         at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
>         at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> {noformat}
> Example of the second:
> {noformat}
> 16/05/25 15:01:12 INFO storage.BlockManager: Registering executor with local external shuffle service.
> 16/05/25 15:01:12 ERROR client.TransportClient: Failed to send RPC 5736508221708472525 to qxhddn01.ascap.com/10.6.41.31:7337: java.nio.channels.ClosedChannelException
> java.nio.channels.ClosedChannelException
> 16/05/25 15:01:12 ERROR storage.BlockManager: Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
> java.lang.RuntimeException: java.io.IOException: Failed to send RPC 5736508221708472525 to qxhddn01.ascap.com/10.6.41.31:7337: java.nio.channels.ClosedChannelException
>         at org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>         at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:272)
> {noformat}
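
To make the difference between swallowing and propagating a startup error concrete, here is a minimal, self-contained sketch. It does not use Spark or YARN classes and is not necessarily what the linked pull request does: `DemoBlockHandler`, `DemoShuffleService`, `initSwallowing`, `initFailFast`, and the file path are all hypothetical names chosen for illustration. The stand-in handler fails when the parent directory of its file does not exist, which plays the role of the initialization failure described in the issue.

```java
import java.io.File;
import java.io.IOException;

/** Entry point for the sketch; every class name here is hypothetical. */
class FailFastInitDemo {
    public static void main(String[] args) {
        // A path whose parent directory is assumed not to exist.
        File bad = new File("/no/such/dir/registeredExecutors.ldb");

        DemoShuffleService swallowing = new DemoShuffleService();
        swallowing.initSwallowing(bad);
        System.out.println("swallowing init, handler ready: " + swallowing.isReady());

        try {
            new DemoShuffleService().initFailFast(bad);
        } catch (IllegalStateException e) {
            System.out.println("fail-fast init threw: " + e.getCause().getMessage());
        }
    }
}

/** Hypothetical stand-in for the block handler; fails if it cannot use its file. */
final class DemoBlockHandler {
    DemoBlockHandler(File registeredExecutorFile) throws IOException {
        File parent = registeredExecutorFile.getAbsoluteFile().getParentFile();
        if (parent == null || !parent.isDirectory()) {
            throw new IOException("Cannot open " + registeredExecutorFile);
        }
    }
}

/** Hypothetical stand-in for the service that owns the handler. */
final class DemoShuffleService {
    private DemoBlockHandler blockHandler;

    /** Mirrors the pattern quoted in the issue: log the error and carry on. */
    void initSwallowing(File registeredExecutorFile) {
        try {
            blockHandler = new DemoBlockHandler(registeredExecutorFile);
        } catch (IOException e) {
            // blockHandler stays null, so later requests would hit an NPE.
            System.err.println("Failed to initialize external shuffle service: " + e);
        }
    }

    /** Alternative: rethrow so the caller sees that startup failed. */
    void initFailFast(File registeredExecutorFile) {
        try {
            blockHandler = new DemoBlockHandler(registeredExecutorFile);
        } catch (IOException e) {
            throw new IllegalStateException(
                "Failed to initialize external shuffle service", e);
        }
    }

    boolean isReady() {
        return blockHandler != null;
    }
}
```

With the swallowing variant the service object exists but has no handler, which matches the first symptom in the issue. With the fail-fast variant the caller receives the exception and can decide whether the host process should stop; whether stopping the NM is desirable is a deployment decision that this sketch does not address.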