ASF GitHub Bot updated SPARK-32086:
-----------------------------------
    Labels: pull-request-available  (was: )

> RemoveBroadcast RPC failed after executor is shutdown
> -----------------------------------------------------
>
>                 Key: SPARK-32086
>                 URL: https://issues.apache.org/jira/browse/SPARK-32086
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 3.0.0
>            Reporter: Wan Kun
>            Priority: Major
>              Labels: pull-request-available
>
> When I run a job with Spark on YARN and Dynamic Resource Allocation enabled,
> some *RemoveBroadcast* RPCs from the *BlockManagerMaster* to the
> *BlockManagers* on the executors fail.
>
> It seems that *blockManagerInfo.values* in *BlockManagerMaster* is not
> updated after the executor is shut down.
>
> {code:java}
> /**
>  * Delegate RemoveBroadcast messages to each BlockManager because the master
>  * may not be notified of all broadcast blocks. If removeFromDriver is false,
>  * broadcast blocks are only removed from the executors, but not from the driver.
>  */
> private def removeBroadcast(broadcastId: Long, removeFromDriver: Boolean): Future[Seq[Int]] = {
>   val removeMsg = RemoveBroadcast(broadcastId, removeFromDriver)
>   // blockManagerInfo.values is not updated after the executor is shut down
>   val requiredBlockManagers = blockManagerInfo.values.filter { info =>
>     removeFromDriver || !info.blockManagerId.isDriver
>   }
>   val futures = requiredBlockManagers.map { bm =>
>     bm.slaveEndpoint.ask[Int](removeMsg).recover {
>       case e: IOException =>
>         logWarning(s"Error trying to remove broadcast $broadcastId from block manager " +
>           s"${bm.blockManagerId}", e)
>         0 // zero blocks were removed
>     }
>   }.toSeq
>   Future.sequence(futures)
> }
> {code}
>
> Driver log:
>
> {code:java}
> 13:53:47.167 dispatcher-BlockManagerMaster INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Registering block manager nta-offline-010.nta.leyantech.com:34187 with 912.3 MiB RAM, BlockManagerId(93, nta-offline-010.nta.leyantech.com, 34187, None)
> ...
> 13:55:25.530 block-manager-ask-thread-pool-11 WARN org.apache.spark.storage.BlockManagerMasterEndpoint: Error trying to remove broadcast 123 from block manager BlockManagerId(93, nta-offline-010.nta.leyantech.com, 34187, None)
> java.io.IOException: Failed to send RPC RPC 5410545105668177846 to /192.168.127.10:42478: java.nio.channels.ClosedChannelException
>   at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:363)
>   at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:340)
>   at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>   ...
>   at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.nio.channels.ClosedChannelException
>   at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
> ...
> 22 more
> 13:55:25.548 dispatcher-BlockManagerMaster INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Trying to remove executor 93 from BlockManagerMaster.
> 13:55:25.548 dispatcher-BlockManagerMaster INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(93, nta-offline-010.nta.leyantech.com, 34187, None)
> {code}
>
> NodeManager log:
>
> {code:java}
> 2020-06-23 13:55:25,363 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e27_1589526283340_7859_01_000112 transitioned from RUNNING to KILLING
> 2020-06-23 13:55:25,503 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_e27_1589526283340_7859_01_000112
> 2020-06-23 13:55:25,511 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_e27_1589526283340_7859_01_000112 is : 143
> 2020-06-23 13:55:25,637 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e27_1589526283340_7859_01_000112 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
> 2020-06-23 13:55:25,643 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /data1/yarn/nm/usercache/schedule/appcache/application_1589526283340_7859/container_e27_1589526283340_7859_01_000112
> {code}
>
> Container log:
>
> {code:java}
> 13:54:24.192 Executor task launch worker for task 1857 INFO org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file.
> allocated memory: 3779
> 13:55:25.509 SIGTERM handler ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
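For illustration, the fan-out-and-recover pattern in the removeBroadcast snippet above can be sketched outside of Spark. This is a minimal sketch, not Spark code: the BlockManagerRef interface, askRemoveBroadcast method, and the in-memory registry map are hypothetical stand-ins for BlockManagerInfo, slaveEndpoint.ask, and blockManagerInfo. It shows why a stale registry entry for a shut-down executor surfaces as a recovered RPC failure ("0 blocks removed") rather than being filtered out before the ask:

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class RemoveBroadcastSketch {

    // Hypothetical stand-in for a reference to a BlockManager's RPC endpoint.
    public interface BlockManagerRef {
        CompletableFuture<Integer> askRemoveBroadcast(long broadcastId);
        boolean isDriver();
    }

    // Mirrors removeBroadcast: fan out to every *registered* block manager and
    // recover each RPC failure (e.g. channel to a dead executor already closed)
    // as "0 blocks removed", like the Scala recover { case e: IOException => 0 }.
    public static CompletableFuture<List<Integer>> removeBroadcast(
            Map<String, BlockManagerRef> blockManagerInfo,
            long broadcastId,
            boolean removeFromDriver) {
        List<CompletableFuture<Integer>> futures = blockManagerInfo.values().stream()
                .filter(bm -> removeFromDriver || !bm.isDriver())
                .map(bm -> bm.askRemoveBroadcast(broadcastId)
                        .exceptionally(e -> 0)) // recover RPC failure as 0
                .collect(Collectors.toList());
        // Equivalent of Scala's Future.sequence(futures).
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(v -> futures.stream()
                        .map(CompletableFuture::join)
                        .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        // One live executor, plus one stale registry entry whose channel is closed
        // (the executor-93 situation from the driver log above).
        BlockManagerRef live = new BlockManagerRef() {
            public CompletableFuture<Integer> askRemoveBroadcast(long id) {
                return CompletableFuture.completedFuture(1);
            }
            public boolean isDriver() { return false; }
        };
        BlockManagerRef stale = new BlockManagerRef() {
            public CompletableFuture<Integer> askRemoveBroadcast(long id) {
                CompletableFuture<Integer> f = new CompletableFuture<>();
                f.completeExceptionally(new IOException("ClosedChannelException"));
                return f;
            }
            public boolean isDriver() { return false; }
        };
        List<Integer> removed = removeBroadcast(
                Map.of("exec-1", live, "exec-93", stale), 123L, false).join();
        // The stale entry is still asked and recovers to 0; total removed is 1.
        System.out.println(removed.stream().mapToInt(Integer::intValue).sum());
    }
}
```

Because the stale entry is only recovered after the ask fails, the master logs a warning for every in-flight RemoveBroadcast until the executor-removal message is processed and the entry is dropped from the registry.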