We hit a broadcast issue in some of our applications. It does not happen on every run, and it usually goes away when we rerun the application. In the exception log I see the following two types of exception:

Exception 1:

10:09:20.295 [shuffle-server-6-2] ERROR org.apache.spark.network.server.TransportRequestHandler - Error opening block StreamChunkId{streamId=365584526097, chunkIndex=0} for request from /10.33.46.33:19866
org.apache.spark.storage.BlockNotFoundException: Block broadcast_334_piece0 not found
    at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:361) ~[spark-core_2.11-2.2.1.jar:2.2.1]
    at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:61) ~[spark-core_2.11-2.2.1.jar:2.2.1]
    at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:60) ~[spark-core_2.11-2.2.1.jar:2.2.1]
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:363) ~[scala-library-2.11.0.jar:?]
    at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31) ~[scala-library-2.11.0.jar:?]
    at org.apache.spark.network.server.OneForOneStreamManager.getChunk(OneForOneStreamManager.java:87) ~[spark-network-common_2.11-2.2.1.jar:2.2.1]
    at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:125) [spark-network-common_2.11-2.2.1.jar:2.2.1]
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103) [spark-network-common_2.11-2.2.1.jar:2.2.1]
    at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118) [spark-network-common_2.11-2.2.1.jar:2.2.1]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) [netty-all-4.0.23.Final.jar:4.0.23.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) [netty-all-4.0.23.Final.jar:4.0.23.Final]

Exception 2:

10:14:37.906 [Executor task launch worker for task 430478] ERROR org.apache.spark.util.Utils - Exception encountered
org.apache.spark.SparkException: Failed to get broadcast_696_piece0 of broadcast_696
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:178) ~[spark-core_2.11-2.2.1.jar:2.2.1]
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150) ~[spark-core_2.11-2.2.1.jar:2.2.1]
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150) ~[spark-core_2.11-2.2.1.jar:2.2.1]
    at scala.collection.immutable.List.foreach(List.scala:383) ~[scala-library-2.11.0.jar:?]
    at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:150) ~[spark-core_2.11-2.2.1.jar:2.2.1]
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:222) ~[spark-core_2.11-2.2.1.jar:2.2.1]
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303) [spark-core_2.11-2.2.1.jar:2.2.1]
    at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) [spark-core_2.11-2.2.1.jar:2.2.1]
    at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) [spark-core_2.11-2.2.1.jar:2.2.1]
    at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) [spark-core_2.11-2.2.1.jar:2.2.1]
    at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) [spark-core_2.11-2.2.1.jar:2.2.1]
    at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) [spark-core_2.11-2.2.1.jar:2.2.1]

I think exception 2 is caused by exception 1: when executor A tries to fetch a broadcast piece from executor B, executor B cannot find the block locally. This is strange, because the broadcast is stored in memory and on disk, so it should not be removed unless the driver asks for it, and the driver only removes a broadcast once the broadcast variable is no longer used.

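
One way to test the guess that exception 2 follows from exception 1 might be to compare the broadcast block ids that appear in the two kinds of log line. The sketch below is only an illustration: the class name `BroadcastLogCheck`, the helpers `collect` and `inBoth`, and both regular expressions are hypothetical and are based solely on the two message formats quoted in this post, not on any Spark API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BroadcastLogCheck {
    // Serving side: "Block broadcast_334_piece0 not found" (format assumed from exception 1).
    static final Pattern NOT_FOUND =
        Pattern.compile("Block (broadcast_\\d+_piece\\d+) not found");
    // Fetching side: "Failed to get broadcast_696_piece0 of broadcast_696" (format assumed from exception 2).
    static final Pattern FAILED_GET =
        Pattern.compile("Failed to get (broadcast_\\d+_piece\\d+) of broadcast_\\d+");

    /** Collects every block id captured by the given pattern, sorted and de-duplicated. */
    static Set<String> collect(Pattern pattern, List<String> lines) {
        Set<String> ids = new TreeSet<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            while (m.find()) {
                ids.add(m.group(1));
            }
        }
        return ids;
    }

    /** Block ids reported both as "not found" and as "failed to get". */
    static Set<String> inBoth(List<String> lines) {
        Set<String> common = collect(NOT_FOUND, lines);
        common.retainAll(collect(FAILED_GET, lines));
        return common;
    }

    public static void main(String[] args) {
        // The two message lines quoted in this post; real use would read the full executor logs.
        List<String> sample = Arrays.asList(
            "org.apache.spark.storage.BlockNotFoundException: Block broadcast_334_piece0 not found",
            "org.apache.spark.SparkException: Failed to get broadcast_696_piece0 of broadcast_696");
        System.out.println("not found on server: " + collect(NOT_FOUND, sample));
        System.out.println("failed to fetch:     " + collect(FAILED_GET, sample));
        System.out.println("reported by both:    " + inBoth(sample));
    }
}
```

Note that the two excerpts quoted above name different broadcasts (broadcast_334 and broadcast_696), so these two lines alone do not confirm the link; the comparison would need to run over the complete logs of the executors involved.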
Could anyone give me some clues on how to find the root cause of this issue? Thanks a lot!