[ https://issues.apache.org/jira/browse/SPARK-32086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-32086:
-----------------------------------
    Labels: pull-request-available  (was: )

> RemoveBroadcast RPC failed after executor is shutdown
> -----------------------------------------------------
>
>                 Key: SPARK-32086
>                 URL: https://issues.apache.org/jira/browse/SPARK-32086
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 3.0.0
>            Reporter: Wan Kun
>            Priority: Major
>              Labels: pull-request-available
>
> When I run a job with Spark on YARN and Dynamic Resource Allocation enabled, 
> some exceptions are raised when the *RemoveBroadcast* RPC is sent from the 
> *BlockManagerMaster* to the *BlockManagers* on executors.
>  
> It seems that *blockManagerInfo.values* in *BlockManagerMaster* has not been 
> updated after the executor is shut down.
>  
> {code:java}
> /**
>  * Delegate RemoveBroadcast messages to each BlockManager because the master may not be notified
>  * of all broadcast blocks. If removeFromDriver is false, broadcast blocks are only removed
>  * from the executors, but not from the driver.
>  */
> private def removeBroadcast(broadcastId: Long, removeFromDriver: Boolean): Future[Seq[Int]] = {
>   val removeMsg = RemoveBroadcast(broadcastId, removeFromDriver)
>   // blockManagerInfo.values is not updated after the executor is shut down
>   val requiredBlockManagers = blockManagerInfo.values.filter { info =>
>     removeFromDriver || !info.blockManagerId.isDriver
>   }
>   val futures = requiredBlockManagers.map { bm =>
>     bm.slaveEndpoint.ask[Int](removeMsg).recover {
>       case e: IOException =>
>         logWarning(s"Error trying to remove broadcast $broadcastId from block manager " +
>           s"${bm.blockManagerId}", e)
>         0 // zero blocks were removed
>     }
>   }.toSeq
>   Future.sequence(futures)
> }
> {code}
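>  
> The warning is already caught by the recover block, but the RPC is still sent to a block manager whose executor has gone away, because *blockManagerInfo* still holds the stale entry at that point. As a rough illustration only (not the actual Spark code or fix), here is a minimal, self-contained Scala sketch of the idea of skipping peers whose executor is already known to be dead before fanning out the remove message; *Peer*, *liveExecutors*, *RemoveBroadcastSketch* and the simulated *ask* are made-up stand-ins for the block manager entries, the set of live executors and the RPC call:
>  
> {code:java}
> import java.io.IOException
> 
> import scala.concurrent.{Await, Future}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration._
> 
> // Hypothetical stand-in for a block manager entry; the names are made up for illustration.
> final case class Peer(executorId: String, isDriver: Boolean) {
>   // Simulate the RemoveBroadcast RPC: peers whose executor is gone fail with an IOException,
>   // like the ClosedChannelException seen in the driver log below.
>   def ask(liveExecutors: Set[String]): Future[Int] =
>     if (isDriver || liveExecutors.contains(executorId)) Future.successful(1)
>     else Future.failed(new IOException(s"connection to executor $executorId is closed"))
> }
> 
> object RemoveBroadcastSketch {
> 
>   def removeBroadcast(
>       peers: Seq[Peer],
>       liveExecutors: Set[String],
>       removeFromDriver: Boolean): Future[Seq[Int]] = {
>     // Skip stale entries for executors that have already been shut down,
>     // in addition to the existing driver check.
>     val targets = peers.filter { p =>
>       (removeFromDriver || !p.isDriver) && (p.isDriver || liveExecutors.contains(p.executorId))
>     }
>     val futures = targets.map { p =>
>       p.ask(liveExecutors).recover {
>         case _: IOException => 0 // zero blocks were removed, mirroring the existing fallback
>       }
>     }
>     Future.sequence(futures)
>   }
> 
>   def main(args: Array[String]): Unit = {
>     // Executor 93 has been killed by YARN and is no longer in the live set.
>     val peers = Seq(
>       Peer("driver", isDriver = true),
>       Peer("42", isDriver = false),
>       Peer("93", isDriver = false))
>     val counts = Await.result(
>       removeBroadcast(peers, liveExecutors = Set("42"), removeFromDriver = false), 10.seconds)
>     println(counts) // List(1): only the live executor was asked
>   }
> }
> {code}
> Whether this is better handled by removing the entry from *blockManagerInfo* earlier in the executor-shutdown path is left open; the sketch only shows the shape of the check.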
>  
> Driver LOG
>  
>  
> {code:java}
> 13:53:47.167 dispatcher-BlockManagerMaster INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Registering block manager nta-offline-010.nta.leyantech.com:34187 with 912.3 MiB RAM, BlockManagerId(93, nta-offline-010.nta.leyantech.com, 34187, None)
> ...
> 13:55:25.530 block-manager-ask-thread-pool-11 WARN org.apache.spark.storage.BlockManagerMasterEndpoint: Error trying to remove broadcast 123 from block manager BlockManagerId(93, nta-offline-010.nta.leyantech.com, 34187, None)
> java.io.IOException: Failed to send RPC RPC 5410545105668177846 to /192.168.127.10:42478: java.nio.channels.ClosedChannelException
> at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:363)
> at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:340)
> at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
> ....
> at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.nio.channels.ClosedChannelException
> at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
> ... 22 more
> 13:55:25.548 dispatcher-BlockManagerMaster INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Trying to remove executor 93 from BlockManagerMaster.
> 13:55:25.548 dispatcher-BlockManagerMaster INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(93, nta-offline-010.nta.leyantech.com, 34187, None)
> {code}
> NodeManager LOG
>  
> {code:java}
> 2020-06-23 13:55:25,363 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e27_1589526283340_7859_01_000112 transitioned from RUNNING to KILLING
> 2020-06-23 13:55:25,503 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_e27_1589526283340_7859_01_000112
> 2020-06-23 13:55:25,511 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_e27_1589526283340_7859_01_000112 is : 143
> 2020-06-23 13:55:25,637 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e27_1589526283340_7859_01_000112 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
> 2020-06-23 13:55:25,643 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /data1/yarn/nm/usercache/schedule/appcache/application_1589526283340_7859/container_e27_1589526283340_7859_01_000112
> {code}
> Container LOG
> {code:java}
> 13:54:24.192 Executor task launch worker for task 1857 INFO org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 3779
> 13:55:25.509 SIGTERM handler ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
