Matthew Sharp created SPARK-15519:
-------------------------------------

             Summary: Shuffle Service fails to start if first yarn.nodemanager.local-dirs is bad
                 Key: SPARK-15519
                 URL: https://issues.apache.org/jira/browse/SPARK-15519
             Project: Spark
          Issue Type: Bug
          Components: Shuffle, YARN
    Affects Versions: 1.6.1
         Environment: Ubuntu 14.04 LTS, MapR 5.1, hadoop-2.7.0
            Reporter: Matthew Sharp
{{yarn.nodemanager.local-dirs}} is set to {{/mnt/data0,/mnt/data1,/mnt/data2,/mnt/data3,/mnt/data4}}

/mnt/data0 was not mounted due to a disk failure, so it was an empty directory which users were not allowed to write to.

Starting up the node manager, we get this in the logs:

{quote}
2016-05-24 15:41:56,456 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing YARN shuffle service for Spark
2016-05-24 15:41:56,456 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Adding auxiliary service spark_shuffle, "spark_shuffle"
2016-05-24 15:41:56,609 ERROR org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: error opening leveldb file /mnt/data0/registeredExecutors.ldb. Creating new file, will not be able to recover state for existing applications
org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /mnt/data0/registeredExecutors.ldb/LOCK: Permission denied
	at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
	at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
	at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:100)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:81)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.<init>(ExternalShuffleBlockHandler.java:56)
	at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:128)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:157)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:250)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:256)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:476)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:524)
2016-05-24 15:41:56,611 WARN org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: error deleting /mnt/data0/registeredExecutors.ldb
2016-05-24 15:41:56,611 ERROR org.apache.spark.network.yarn.YarnShuffleService: Failed to initialize external shuffle service
java.io.IOException: Unable to create state store
	at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:129)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:81)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.<init>(ExternalShuffleBlockHandler.java:56)
	at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:128)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:157)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:250)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:256)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:476)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:524)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /mnt/data0/registeredExecutors.ldb/LOCK: Permission denied
	at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
	at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
	at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:127)
	... 14 more
2016-05-24 15:41:56,723 INFO org.apache.spark.network.yarn.YarnShuffleService: Started YARN shuffle service for Spark on port 7337. Authentication is not enabled. Registered executor file is /mnt/data0/registeredExecutors.ldb
{quote}

Later on, when jobs run on that node, we see many occurrences of this message:

{quote}
2016-05-24 15:39:57,171 ERROR org.apache.spark.network.TransportContext: Error while initializing Netty pipeline
java.lang.NullPointerException
	at org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
	at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
	at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
	at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
	at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
	at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:158)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:144)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:732)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:442)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:374)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:418)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at java.lang.Thread.run(Thread.java:745)
{quote}

We would expect the shuffle service to fail over to the next local-dir when the first one is unusable; a sketch of that behavior follows.
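For illustration only, here is a minimal sketch of the fallback behavior we have in mind (this is not the actual YarnShuffleService code): try each configured local-dir in order and place the recovery DB in the first writable one, rather than failing outright when the first entry is bad. The {{LocalDirFallback}} class and the {{findUsableLocalDir}} helper are hypothetical names used only for this example.

{code:java}
import java.io.File;
import java.io.IOException;

// Illustrative sketch only: walk the configured yarn.nodemanager.local-dirs in
// order and place the registeredExecutors.ldb recovery file in the first
// directory we can actually write to, instead of always using the first entry.
public class LocalDirFallback {

  // Hypothetical helper: returns the recovery file path under the first usable
  // directory, or throws if none of the configured directories are writable.
  static File findUsableLocalDir(String[] localDirs, String dbName) throws IOException {
    for (String dir : localDirs) {
      File candidate = new File(dir);
      // A directory is usable if it exists (or can be created) and is writable.
      if ((candidate.isDirectory() || candidate.mkdirs()) && candidate.canWrite()) {
        return new File(candidate, dbName);
      }
    }
    throw new IOException("No usable directory found in yarn.nodemanager.local-dirs");
  }

  public static void main(String[] args) throws IOException {
    // Same layout as in this report: /mnt/data0 is unmounted and not writable.
    String[] localDirs = {"/mnt/data0", "/mnt/data1", "/mnt/data2", "/mnt/data3", "/mnt/data4"};
    File registeredExecutorFile = findUsableLocalDir(localDirs, "registeredExecutors.ldb");
    // With /mnt/data0 unwritable, this would resolve to /mnt/data1/registeredExecutors.ldb.
    System.out.println("Registered executor file: " + registeredExecutorFile);
  }
}
{code}

With something along these lines, the service would still start with a usable state store on the remaining healthy disks instead of silently losing recovery state and later failing container shuffles on that node.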