[ https://issues.apache.org/jira/browse/SPARK-21444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen reassigned SPARK-21444: ---------------------------------- Assignee: Josh Rosen > Fetch failure due to node reboot causes job failure > --------------------------------------------------- > > Key: SPARK-21444 > URL: https://issues.apache.org/jira/browse/SPARK-21444 > Project: Spark > Issue Type: Bug > Components: Scheduler > Affects Versions: 2.3.0 > Reporter: Sital Kedia > Assignee: Josh Rosen > > We started seeing this issue after merging the PR - > https://github.com/apache/spark/pull/17955. > This PR introduced a change to keep the map-output statuses only in the map > output tracker and whenever there is a fetch failure, the scheduler tries to > invalidate the map-output statuses by talking all the the block manager slave > end point. However, in case the fetch failure is because node reboot, the > block manager slave end point would not be reachable and so the driver fails > the job as a result. See exception below - > {code} > 2017-07-15 05:20:50,255 WARN storage.BlockManagerMaster > (Logging.scala:logWarning(87)) - Failed to remove broadcast 10 with > removeFromMaster = true - Connection reset by peer > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313) > at > io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) > at > io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > 2017-07-15 05:20:50,275 ERROR scheduler.DAGSchedulerEventProcessLoop > (Logging.scala:logError(91)) - DAGSchedulerEventProcessLoop failed; shutting > down SparkContext > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:143) > at > org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:271) > at > org.apache.spark.broadcast.TorrentBroadcast.doDestroy(TorrentBroadcast.scala:165) > at org.apache.spark.broadcast.Broadcast.destroy(Broadcast.scala:111) > at org.apache.spark.broadcast.Broadcast.destroy(Broadcast.scala:98) > at > org.apache.spark.ShuffleStatus.invalidateSerializedMapOutputStatusCache(MapOutputTracker.scala:197) > at > org.apache.spark.ShuffleStatus.removeMapOutput(MapOutputTracker.scala:105) > at > org.apache.spark.MapOutputTrackerMaster.unregisterMapOutput(MapOutputTracker.scala:420) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1324) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1715) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1673) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1662) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > Caused by: java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313) > at > io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) > at > io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org