I tried making the scenario a bit simpler: - start 4 nodes - start client, observe no timeouts - kill 1 node - observe node is downed - observe timeouts on client continue indefinitely - kill client - start new client - observe timeouts on newly started client continue indefinitely
The fact that I start a new client means that ClusterClient state longer plays any role. So the problem seems to be with the state of Receptionist on the server-side. I have copied the bits of the logs that I thought most relevant below, they are easily correlated by timestamp. NODE1 (seed, port 2551) ----------------- 10:51:01.315 [VolArbService-akka.actor.default-dispatcher-13] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:2551] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:2552] 10:51:01.315 [VolArbService-akka.actor.default-dispatcher-13] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:2551] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:37209] 10:51:01.315 [VolArbService-akka.actor.default-dispatcher-13] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:2551] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:34165] ... 10:51:17.666 [VolArbService-akka.actor.default-dispatcher-15] DEBUG akka.remote.EndpointWriter - Disassociated [akka.tcp://VolArbService@127.0.0.1:2551] -> [akka.tcp://VolArbService@127.0.0.1:34165] 10:51:17.666 [VolArbService-akka.actor.default-dispatcher-15] INFO a.r.RemoteActorRefProvider$RemoteDeadLetterActorRef - Message [com.imc.medusa.server.messages.Update] from Actor[akka://VolArbService/user/receptionist/akka.tcp%3A%2F%2FABC%40127.0.0.1%3A48362%2Ftemp%2F%24Ed#-1362311011] to Actor[akka://VolArbService/deadLetters] was not delivered. [6] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. ... 10:51:22.279 [VolArbService-akka.actor.default-dispatcher-26] INFO Cluster(akka://VolArbService) - Cluster Node [akka.tcp://VolArbService@127.0.0.1:2551] - Leader is marking unreachable node [akka.tcp://VolArbService@127.0.0.1:34165] as [Down] ... 10:51:22.315 [VolArbService-akka.actor.default-dispatcher-14] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:2551] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:2552] 10:51:22.315 [VolArbService-akka.actor.default-dispatcher-14] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:2551] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:37209] 10:51:22.315 [VolArbService-akka.actor.default-dispatcher-14] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:2551] - EndHeartbeat to [akka.tcp://VolArbService@127.0.0.1:34165] .. 10:51:24.276 [VolArbService-akka.actor.default-dispatcher-24] INFO Cluster(akka://VolArbService) - Cluster Node [akka.tcp://VolArbService@127.0.0.1:2551] - Leader is removing unreachable node [akka.tcp://VolArbService@127.0.0.1:34165] 10:51:24.277 [VolArbService-akka.actor.default-dispatcher-29] INFO Remoting - Address [akka.tcp://VolArbService@127.0.0.1:34165] is now quarantined, all messages to this address will be delivered to dead letters. NODE2 (seed, port 2552) ----------------- 10:51:00.860 [VolArbService-akka.actor.default-dispatcher-16] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:2552] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:2551] 10:51:00.861 [VolArbService-akka.actor.default-dispatcher-16] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:2552] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:37209] 10:51:00.861 [VolArbService-akka.actor.default-dispatcher-16] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:2552] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:34165] ... 10:51:17.729 [VolArbService-akka.actor.default-dispatcher-2] DEBUG akka.remote.EndpointWriter - Disassociated [akka.tcp://VolArbService@127.0.0.1:2552] -> [akka.tcp://VolArbService@127.0.0.1:34165] 10:51:17.730 [VolArbService-akka.actor.default-dispatcher-2] INFO a.r.RemoteActorRefProvider$RemoteDeadLetterActorRef - Message [com.imc.medusa.server.messages.Update] from Actor[akka.tcp://VolArbService@127.0.0.1:2551/user/receptionist/akka.tcp%3A%2F%2FABC%40127.0.0.1%3A48362%2Ftemp%2F%24Hd#-1292428908] to Actor[akka://VolArbService/deadLetters] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. ... 10:51:24.646 [VolArbService-akka.actor.default-dispatcher-16] INFO Remoting - Address [akka.tcp://VolArbService@127.0.0.1:34165] is now quarantined, all messages to this address will be delivered to dead letters. 10:51:24.860 [VolArbService-akka.actor.default-dispatcher-4] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:2552] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:2551] 10:51:24.860 [VolArbService-akka.actor.default-dispatcher-4] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:2552] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:37209] 10:51:24.860 [VolArbService-akka.actor.default-dispatcher-4] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:2552] - EndHeartbeat to [akka.tcp://VolArbService@127.0.0.1:34165] ... NODE3 (port 34165) --------------- 10:51:17.500 [VolArbService-akka.actor.default-dispatcher-24] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:34165] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:2551] 10:51:17.500 [VolArbService-akka.actor.default-dispatcher-24] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:34165] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:2552] 10:51:17.501 [VolArbService-akka.actor.default-dispatcher-24] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:34165] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:37209] ... ***KILLED by me at 10:51:17:592*** NODE4 (port 37209) -------------- 10:51:01.317 [VolArbService-akka.actor.default-dispatcher-14] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:37209] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:2551] 10:51:01.317 [VolArbService-akka.actor.default-dispatcher-14] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:37209] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:2552] 10:51:01.317 [VolArbService-akka.actor.default-dispatcher-14] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:37209] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:34165] ... 10:51:17.666 [VolArbService-akka.actor.default-dispatcher-3] DEBUG akka.remote.EndpointWriter - Disassociated [akka.tcp://VolArbService@127.0.0.1:37209] -> [akka.tcp://VolArbService@127.0.0.1:34165] 10:51:17.666 [VolArbService-akka.actor.default-dispatcher-3] INFO a.r.RemoteActorRefProvider$RemoteDeadLetterActorRef - Message [com.imc.medusa.server.messages.Update] from Actor[akka.tcp://VolArbService@127.0.0.1:2551/user/receptionist/akka.tcp%3A%2F%2FABC%40127.0.0.1%3A48362%2Ftemp%2F%24Dd#868424619] to Actor[akka://VolArbService/deadLetters] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. ... 10:51:25.175 [VolArbService-akka.actor.default-dispatcher-16] INFO Remoting - Address [akka.tcp://VolArbService@127.0.0.1:34165] is now quarantined, all messages to this address will be delivered to dead letters. 10:51:25.317 [VolArbService-akka.actor.default-dispatcher-18] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:37209] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:2551] 10:51:25.317 [VolArbService-akka.actor.default-dispatcher-18] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:37209] - Heartbeat to [akka.tcp://VolArbService@127.0.0.1:2552] 10:51:25.317 [VolArbService-akka.actor.default-dispatcher-18] DEBUG akka.cluster.ClusterHeartbeatSender - Cluster Node [akka.tcp://VolArbService@127.0.0.1:37209] - EndHeartbeat to [akka.tcp://VolArbService@127.0.0.1:34165] Client ------- *** STARTED new client after NODE3 has been downed *** [DEBUG] [02/06/2014 10:52:02.646] [main] [EventStream] StandardOutLogger started 10:52:02.882 [ABC-akka.actor.default-dispatcher-3] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started [DEBUG] [02/06/2014 10:52:02.888] [main] [EventStream(akka://ABC)] logger log1-Slf4jLogger started [DEBUG] [02/06/2014 10:52:02.888] [main] [EventStream(akka://ABC)] Default Loggers started 10:52:02.889 [ABC-akka.actor.default-dispatcher-3] DEBUG akka.event.EventStream - logger log1-Slf4jLogger started 10:52:02.889 [ABC-akka.actor.default-dispatcher-3] DEBUG akka.event.EventStream - Default Loggers started 10:52:02.930 [ABC-akka.actor.default-dispatcher-3] INFO Remoting - Starting remoting 10:52:03.063 [ABC-akka.actor.default-dispatcher-2] INFO Remoting - Remoting started; listening on addresses :[akka.tcp://ABC@127.0.0.1:42625] 10:52:03.189 [ABC-akka.actor.default-dispatcher-15] DEBUG akka.remote.EndpointWriter - Associated [akka.tcp://ABC@127.0.0.1:42625] -> [akka.tcp://VolArbService@127.0.0.1:2552] 10:52:03.190 [ABC-akka.actor.default-dispatcher-15] DEBUG akka.remote.EndpointWriter - Associated [akka.tcp://ABC@127.0.0.1:42625] -> [akka.tcp://VolArbService@127.0.0.1:2551] 10:52:03.191 [ABC-akka.actor.default-dispatcher-15] DEBUG a.s.Serialization(akka://ABC) - Using serializer[akka.remote.serialization.MessageContainerSerializer] for message [akka.actor.SelectChildName] 10:52:03.197 [ABC-akka.actor.default-dispatcher-4] DEBUG a.s.Serialization(akka://ABC) - Using serializer[akka.serialization.JavaSerializer] for message [akka.contrib.pattern.ClusterReceptionist$Internal$GetContacts$] 10:52:03.238 [ABC-akka.actor.default-dispatcher-3] DEBUG a.s.Serialization(akka://ABC) - Using serializer[akka.serialization.JavaSerializer] for message [akka.actor.Identify] 10:52:03.248 [ABC-akka.actor.default-dispatcher-12] DEBUG akka.remote.EndpointWriter - Associated [akka.tcp://ABC@127.0.0.1:42625] -> [akka.tcp://VolArbService@127.0.0.1:37209] 10:52:03.249 [ABC-akka.actor.default-dispatcher-2] INFO akka.contrib.pattern.ClusterClient - Connected to [akka.tcp://VolArbService@127.0.0.1:2551/user/receptionist] 10:52:03.249 [ABC-akka.actor.default-dispatcher-2] DEBUG a.s.Serialization(akka://ABC) - Using serializer[akka.serialization.JavaSerializer] for message [akka.dispatch.sysmsg.Watch] 10:52:03.250 [ABC-akka.actor.default-dispatcher-2] DEBUG akka.remote.RemoteWatcher - Watching: [akka://ABC/user/$a -> akka.tcp://VolArbService@127.0.0.1:2551/user/receptionist] 10:52:04.085 [ABC-akka.actor.default-dispatcher-2] DEBUG akka.remote.RemoteWatcher - Sending first Heartbeat to [akka.tcp://VolArbService@127.0.0.1:2551] 10:52:04.086 [ABC-akka.actor.default-dispatcher-3] DEBUG a.s.Serialization(akka://ABC) - Using serializer[akka.serialization.JavaSerializer] for message [akka.remote.RemoteWatcher$Heartbeat$] 10:52:04.090 [ABC-akka.actor.default-dispatcher-2] DEBUG akka.remote.RemoteWatcher - Received first heartbeat rsp from [akka.tcp://VolArbService@127.0.0.1:2551] 10:52:04.090 [ABC-akka.actor.default-dispatcher-2] DEBUG akka.remote.RemoteWatcher - Re-watch [Actor[akka://ABC/user/$a#-1761251245] -> Actor[akka.tcp://VolArbService@127.0.0.1:2551/user/receptionist#-1054121998]] 10:52:04.091 [ABC-akka.actor.default-dispatcher-2] DEBUG akka.remote.RemoteWatcher - Re-watch [Actor[akka://ABC/system/remote-watcher#449313331] -> Actor[akka.tcp://VolArbService@127.0.0.1:2551/user/receptionist#-1054121998]] 10:52:04.092 [ABC-akka.actor.default-dispatcher-4] DEBUG akka.remote.RemoteWatcher - Watching: [akka://ABC/user/$a -> akka.tcp://VolArbService@127.0.0.1:2551/user/receptionist] 10:52:05.085 [ABC-akka.actor.default-dispatcher-3] DEBUG akka.remote.RemoteWatcher - Sending Heartbeat to [akka.tcp://VolArbService@127.0.0.1:2551] 10:52:05.087 [ABC-akka.actor.default-dispatcher-12] DEBUG akka.remote.RemoteWatcher - Received heartbeat rsp from [akka.tcp://VolArbService@127.0.0.1:2551] 10:52:06.085 [ABC-akka.actor.default-dispatcher-12] DEBUG akka.remote.RemoteWatcher - Sending Heartbeat to [akka.tcp://VolArbService@127.0.0.1:2551] 10:52:06.089 [ABC-akka.actor.default-dispatcher-12] DEBUG akka.remote.RemoteWatcher - Received heartbeat rsp from [akka.tcp://VolArbService@127.0.0.1:2551] 10:52:06.125 [ABC-akka.actor.default-dispatcher-3] DEBUG a.s.Serialization(akka://ABC) - Using serializer[akka.serialization.JavaSerializer] for message [akka.contrib.pattern.ClusterReceptionist$Internal$Ping$] 10:52:07.085 [ABC-akka.actor.default-dispatcher-4] DEBUG akka.remote.RemoteWatcher - Sending Heartbeat to [akka.tcp://VolArbService@127.0.0.1:2551] 10:52:07.088 [ABC-akka.actor.default-dispatcher-3] DEBUG akka.remote.RemoteWatcher - Received heartbeat rsp from [akka.tcp://VolArbService@127.0.0.1:2551] 10:52:08.085 [ABC-akka.actor.default-dispatcher-3] DEBUG akka.remote.RemoteWatcher - Sending Heartbeat to [akka.tcp://VolArbService@127.0.0.1:2551] 10:52:08.087 [ABC-akka.actor.default-dispatcher-3] DEBUG akka.remote.RemoteWatcher - Received heartbeat rsp from [akka.tcp://VolArbService@127.0.0.1:2551] 10:52:08.107 [ABC-akka.actor.default-dispatcher-3] DEBUG a.s.Serialization(akka://ABC) - Using serializer[akka.serialization.JavaSerializer] for message [akka.contrib.pattern.DistributedPubSubMediator$Send] 10:52:09.085 [ABC-akka.actor.default-dispatcher-2] DEBUG akka.remote.RemoteWatcher - Sending Heartbeat to [akka.tcp://VolArbService@127.0.0.1:2551] 10:52:09.087 [ABC-akka.actor.default-dispatcher-2] DEBUG akka.remote.RemoteWatcher - Received heartbeat rsp from [akka.tcp://VolArbService@127.0.0.1:2551] *** first message is sent and times out **** T. On Thursday, February 6, 2014 10:53:44 AM UTC+1, Patrik Nordwall wrote: > > > > > On Thu, Feb 6, 2014 at 10:31 AM, Tycho Lamerigts > <tycho.l...@gmail.com<javascript:> > > wrote: > >> In the meantime, I have looked at the logs a bit more. If I do my test >> with 3 nodes, then kill 1 node, the ClusterClient seems to always recover >> correctly, i.e. it continues using the 2 surviving nodes and there are no >> timeouts. If I start with 4 nodes, then kill 1 node, it nearly always >> causes problems. The logs of the 3 surviving nodes show the following error >> forever repeating >> >> 10:27:35.862 [VolArbService-akka.actor.default-dispatcher-23] ERROR >> akka.remote.EndpointWriter - AssociationError [akka.tcp:// >> VolArbService@127.0.0.1:2552] -> [akka.tcp:// >> VolArbService@127.0.0.1:51801]: Error [Association failed with >> [akka.tcp://VolArbService@127.0.0.1:51801]] [ >> akka.remote.EndpointAssociationException: Association failed with >> [akka.tcp://VolArbService@127.0.0.1:51801] >> Caused by: >> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: >> Connection refused: /127.0.0.1:51801 >> ] >> >> where port 51801 is the killed node. Despite the fact that the error >> continues forever, I can join a 2 new nodes to the 3 surviving nodes and >> the new nodes correctly report the new cluster size as 4 and 5 using >> Cluster.readView().members().size(). In other words, it seems the killed >> node was correctly downed and yet the associationerrors never cease. >> > > Do you see "Leader is removing unreachable node" in the cluster logs? > > On the client side you should see the following if it is the connected > receptionist node that is killed. > "Lost contact with" > "Connected to" > > /Patrik > > >> >> T. >> >> >> >> On Wednesday, February 5, 2014 9:40:41 PM UTC+1, Björn Antonsson wrote: >> >>> Hi, >>> >>> On 5 February 2014 at 16:42:48, Tycho Lamerigts (tycho.l...@gmail.com) >>> wrote: >>> >>> I have a client that fires many requests to my cluster using >>> ClusterClient's Send(), one request at a time, with a 1 second timeout >>> waiting for a response. While all the cluster nodes are up, the requests >>> are correctly (randomly) distributed across the nodes and promptly receive >>> a response, no timeouts. If I kill a cluster node, then I expect one or two >>> of the requests to timeout because they end up being sent to the now-dead >>> node. After that, I expect ClusterClient to realize that the node has died >>> and I expect to no longer get timeouts (the workload can easily be handled >>> by the remaining nodes). Sometimes it works. Unfortunately, more often than >>> not it doesn't work and I continue getting timed out requests until I >>> restart every node in the cluster and the client. >>> >>> Any idea what causes this behavior? >>> >>> >>> Which version of akka are you using? Is the failing node corrcetly >>> downed? Have you enabled any debug logging to diagnose the behavior? >>> >>> B/ >>> >>> >>> My ClusterClient is initialized with two receptionist addresses. My >>> cluster actually has more than 2 nodes and each node has a receptionist >>> with a registered destination actor. I tried playing with >>> contrib.cluster.receptionist.number-of-contacts but it did not seem to >>> make any difference. >>> -- >>> >>>>>>>>>> Read the docs: http://akka.io/docs/ >>> >>>>>>>>>> Check the FAQ: http://akka.io/faq/ >>> >>>>>>>>>> Search the archives: https://groups.google.com/ >>> group/akka-user >>> --- >>> You received this message because you are subscribed to the Google >>> Groups "Akka User List" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to akka-user+...@googlegroups.com. >>> To post to this group, send email to akka...@googlegroups.com. >>> >>> Visit this group at http://groups.google.com/group/akka-user. >>> For more options, visit https://groups.google.com/groups/opt_out. >>> >>> >>> -- >>> Björn Antonsson >>> Typesafe <http://typesafe.com/> – Reactive Apps on the JVM >>> twitter: @bantonsson <http://twitter.com/#!/bantonsson> >>> >>> -- >> >>>>>>>>>> Read the docs: http://akka.io/docs/ >> >>>>>>>>>> Check the FAQ: http://akka.io/faq/ >> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user >> --- >> You received this message because you are subscribed to the Google Groups >> "Akka User List" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to akka-user+...@googlegroups.com <javascript:>. >> To post to this group, send email to akka...@googlegroups.com<javascript:> >> . >> Visit this group at http://groups.google.com/group/akka-user. >> For more options, visit https://groups.google.com/groups/opt_out. >> > > > > -- > > Patrik Nordwall > Typesafe <http://typesafe.com/> - Reactive apps on the JVM > Twitter: @patriknw > > -- >>>>>>>>>> Read the docs: http://akka.io/docs/ >>>>>>>>>> Check the FAQ: http://akka.io/faq/ >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user --- You received this message because you are subscribed to the Google Groups "Akka User List" group. To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+unsubscr...@googlegroups.com. To post to this group, send email to akka-user@googlegroups.com. Visit this group at http://groups.google.com/group/akka-user. For more options, visit https://groups.google.com/groups/opt_out.