[ https://issues.apache.org/jira/browse/SPARK-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169901#comment-14169901 ]

Jianshi Huang commented on SPARK-3923:
--------------------------------------

I have a similar problem in YARN-client mode. Setting spark.akka.heartbeat.interval to 100 fixes the problem.

This is a critical bug.

Jianshi

> All Standalone Mode services time out with each other
> -----------------------------------------------------
>
>                 Key: SPARK-3923
>                 URL: https://issues.apache.org/jira/browse/SPARK-3923
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 1.2.0
>            Reporter: Aaron Davidson
>            Priority: Blocker
>
> I'm seeing an issue where it seems that components in Standalone Mode
> (Worker, Master, Driver, and Executor) all seem to time out with each other
> after around 1000 seconds. Here is an example log:
> {code}
> 14/10/13 06:43:55 INFO Master: Registering worker
> ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM
> 14/10/13 06:43:55 INFO Master: Registering worker
> ip-10-0-175-214.us-west-2.compute.internal:42918 with 4 cores, 59.0 GB RAM
> 14/10/13 06:43:56 INFO Master: Registering app Databricks Shell
> 14/10/13 06:43:56 INFO Master: Registered app Databricks Shell with ID
> app-20141013064356-0000
> ... precisely 1000 seconds later ...
> 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote
> system
> [akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922] has
> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/10/13 07:00:35 INFO Master:
> akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922 got
> disassociated, removing it.
> 14/10/13 07:00:35 INFO LocalActorRef: Message
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
> Actor[akka://sparkMaster/deadLetters] to
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.147.189%3A54956-1#1529980245]
> was not delivered. [2] dead letters encountered. This logging can be turned
> off or adjusted with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:35 INFO Master:
> akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got
> disassociated, removing it.
> 14/10/13 07:00:35 INFO Master: Removing worker
> worker-20141013064354-ip-10-0-175-214.us-west-2.compute.internal-42918 on
> ip-10-0-175-214.us-west-2.compute.internal:42918
> 14/10/13 07:00:35 INFO Master: Telling app of lost executor: 1
> 14/10/13 07:00:35 INFO Master:
> akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got
> disassociated, removing it.
> 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote
> system
> [akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918] has
> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/10/13 07:00:35 INFO LocalActorRef: Message
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
> Actor[akka://sparkMaster/deadLetters] to
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324]
> was not delivered. [3] dead letters encountered. This logging can be turned
> off or adjusted with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:35 INFO LocalActorRef: Message
> [akka.remote.transport.AssociationHandle$Disassociated] from
> Actor[akka://sparkMaster/deadLetters] to
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324]
> was not delivered. [4] dead letters encountered. This logging can be turned
> off or adjusted with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO ProtocolStateActor: No response from remote. Handshake
> timed out or transport failure detector triggered.
> 14/10/13 07:00:36 INFO Master:
> akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259 got
> disassociated, removing it.
> 14/10/13 07:00:36 INFO LocalActorRef: Message
> [akka.remote.transport.AssociationHandle$InboundPayload] from
> Actor[akka://sparkMaster/deadLetters] to
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249]
> was not delivered. [5] dead letters encountered. This logging can be turned
> off or adjusted with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO Master: Removing app app-20141013064356-0000
> 14/10/13 07:00:36 WARN ReliableDeliverySupervisor: Association with remote
> system
> [akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259] has
> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/10/13 07:00:36 INFO LocalActorRef: Message
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
> Actor[akka://sparkMaster/deadLetters] to
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249]
> was not delivered. [6] dead letters encountered. This logging can be turned
> off or adjusted with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO LocalActorRef: Message
> [akka.remote.transport.AssociationHandle$Disassociated] from
> Actor[akka://sparkMaster/deadLetters] to
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249]
> was not delivered. [7] dead letters encountered. This logging can be turned
> off or adjusted with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO Master:
> akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259 got
> disassociated, removing it.
> {code}
> Note that the driver and master are living on the same machine, and there is
> no load to speak of at the time (so no GC). Also everything disconnecting
> exactly 1000 seconds after initial connection is pretty suspicious.
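
The "precisely 1000 seconds later" observation in the quoted log can be checked directly from the timestamps. The following is only a minimal sketch: the helper name `seconds_between` is hypothetical, and it assumes the log prefix uses a two-digit year, month, day layout (yy/MM/dd HH:mm:ss), which is what the quoted lines appear to use.

```python
from datetime import datetime

# Assumed layout of the timestamp prefix in the quoted log lines
# (two-digit year / month / day, then 24-hour time).
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"


def seconds_between(start: str, end: str) -> float:
    """Return the number of seconds elapsed between two log timestamps."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    ended = datetime.strptime(end, LOG_TIME_FORMAT)
    return (ended - started).total_seconds()


# First worker registration and first disassociation, copied from the log.
worker_elapsed = seconds_between("14/10/13 06:43:55", "14/10/13 07:00:35")

# App registration and driver disassociation, copied from the log.
driver_elapsed = seconds_between("14/10/13 06:43:56", "14/10/13 07:00:36")

print(worker_elapsed, driver_elapsed)
```

Both gaps come out to 1000 seconds, which is consistent with a fixed timeout or interval rather than load or GC pauses.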