That is an excellent analysis, Jordan. The verbose-heartbeat-logging setting is
useful for exactly this kind of debugging. You need to find out why NODE-1 was
"paused". You said that you might be doing some blocking activity in your
actors. I strongly recommend that you eliminate such blocking or assign a
dedicated dispatcher to the actors that are blocking. Blocking must not be
done on the default-dispatcher, since it might starve Akka internal
tasks. It is normally not enough to configure akka.cluster.use-dispatcher.
It's better to use dedicated dispatchers for the things in the application
that are blocking, because there might always be some other thing that will
be starved on the default-dispatcher.

Here is how to configure a dispatcher for blocking:
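
A minimal sketch of such a configuration, assuming a dispatcher named
my-blocking-dispatcher and an actor deployed at /my-blocking-actor (both names
are hypothetical, and the pool size of 16 is only a placeholder to tune for
your workload):

```
# Hypothetical dedicated dispatcher for actors that block
my-blocking-dispatcher {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor {
    # Assumed size; set it to the number of concurrent blocking calls you expect
    fixed-pool-size = 16
  }
  # Process one message per actor before moving on to the next actor
  throughput = 1
}

# Assign the (hypothetical) blocking actor to that dispatcher
akka.actor.deployment {
  /my-blocking-actor {
    dispatcher = my-blocking-dispatcher
  }
}
```

The dispatcher can also be selected in code when creating the actor, with
Props.withDispatcher("my-blocking-dispatcher"), instead of through the
deployment section.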

On Thu, Jan 5, 2017 at 12:25 PM, 'Francesco laTorre' via Akka User List wrote:

> Hi Jordan,
> It looks very related to the issue we are facing, with the difference we
> are not able to recover from the UNREACHABLE mark, probably because the
> cluster specs are different : in our scenario we have 3 cluster singletons
> and <outrageous> we use auto-downing </outrageous>.
> Cheers,
> Francesco
> On 4 January 2017 at 21:01, Jordan Messec wrote:
>> Here is an update:
>> I moved to Akka 2.4.16 and still encountered the problem.
>> Therefore, I turned on "akka.cluster.debug.verbose-heartbeat-logging =
>> on".
>> This allowed me to notice that when nodes started entering UNREACHABLE
>> status from each other, that *outgoing* heartbeat messages (the initial
>> message not the response) were suddenly failing to send from one node, let's
>> call it NODE-1. Some significant time later, NODE-1 would get a handful of:
>> [INFO] [12/30/2016 01:36:06.681] [akka.remote.transport.ProtocolStateActor]
>> [$X{akkaSource:-*}] No response from remote. Transport
>> failure detector triggered. (internal state was Open)
>> messages. Which I assume means that a heartbeat response was not
>> received within the 'acceptable-heartbeat-response' parameter.
>> Looking at the logs of the other nodes, I can see that around the time
>> that NODE-1 stopped sending outgoing heartbeats, the logs from a peer,
>> NODE-2 stopped receiving responses from its outgoing heartbeats to NODE-1.
>> The first thing NODE-2 does is mark NODE-1 as UNREACHABLE, and then a short
>> bit later, outputs one of the above 'No response from remote' messages.
>> At the same time that NODE-1 resumed outgoing heartbeat messages, NODE-2
>> gets flooded with heartbeat responses from NODE-1 and then shortly moves
>> NODE-1 back to REACHABLE and the cluster heals.
>> It seems that something is causing a hiccup in my nodes which derails the
>> cluster monitoring threads. I am using the "-XX:MaxGCPauseMillis=300"
>> option in my startup script, however looking at the GC logs this doesn't
>> seem to be getting honored. However none of the GC pauses are lasting
>> anywhere near as long as the hiccup in NODE-1. It could be that I am doing
>> some blocking activity in my actors which is conflicting with the heartbeat
>> monitor actor. I have now added the 'akka.cluster.use-dispatcher' lines to
>> my configuration.
>> I'll keep monitoring and report back as I get more information.
>> On Tuesday, December 27, 2016 at 9:04:32 AM UTC-8, Serg wrote:
>>> Hello Jordan,
>>> I also would like to hear from you if updating to the latest version has
>>> fixed the problem.
>>> We have a similar issue when cluster nodes suddenly become unreachable
>>> (though they are running on the same host, no cpu/memory/GC spikes) and
>>> then shut down themselves for no reason (auto-shutdown is disabled for all
>>> nodes). We are running on Akka 2.4.10, old Netty transport.
>>> On Thursday, December 22, 2016 at 11:59:22 PM UTC+2, Jordan Messec wrote:
>>>> Thank you for your response and time. I have updated to version 2.4.16
>>>> and have Akka debug logging enabled. I will keep a further eye on this and
>>>> update as appropriate.
>>>> On Saturday, December 17, 2016 at 3:28:22 AM UTC-8, √ wrote:
>>>>> Hi!
>>>>> Update to most recent version and report back.
>>>>> --
>>>>> Cheers,
>>>>> √
>>>>> On Dec 17, 2016 08:20, "Jordan Messec" wrote:
>>>>>> Hello, I am struggling with a problem I have spent days trying to
>>>>>> resolve. I was hoping someone here may have some input that could help me
>>>>>> look in the right direction.
>>>>>> I am running a small cluster with 3 nodes. Two nodes reside on one
>>>>>> machine, while the third resides on a separate machine. This cluster is
>>>>>> formed between two applications. Call them Web and DataDig. DataDig and
>>>>>> Web co-reside on Machine1 and Web is duplicated on machine two.
>>>>>> Both use Akka 2.4.4, with Web's dependencies being transitive through
>>>>>> Play 2.5.4
>>>>>> My problem is that after some time of running without issue, the nodes
>>>>>> start having trouble communicating with each other. Within 24 hours of
>>>>>> bringing the cluster members online, the logs start to display the
>>>>>> following:
>>>>>> [WARN] [12/16/2016 21:07:32.645] [a.r.ReliableDeliverySupervisor]
>>>>>> [akka.tcp://application@host1:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fapplication%40host2%3A2552-588]
>>>>>> Association with remote system [akka.tcp://host2:2552] has failed,
>>>>>> address is now gated for [5000] ms. Reason: [Disassociated]
>>>>>> A service is used to monitor cluster health; at this time it starts
>>>>>> to report that cluster members are unreachable from each other.
>>>>>> Obviously this starts to cause problems with cluster behavior, and
>>>>>> also results in messages stating the Leader can currently not perform its
>>>>>> duties:
>>>>>> [INFO] [12/16/2016 21:05:48.440] [a.c.Cluster(akka://application)]
>>>>>> [akka.cluster.Cluster(akka://application)] Cluster Node
>>>>>> [akka.tcp://application@host1:2552] - Leader can currently not
>>>>>> perform its duties, reachability status: [akka.tcp://application@host1:2552
>>>>>> -> akka.tcp://application@host2:2552: Unreachable
>>>>>> [Unreachable] (328), akka.tcp://application@host1:37770 ->
>>>>>> akka.tcp://application@host2:2552: Unreachable [Unreachable] (1),
>>>>>> akka.tcp://application@host2:2552 -> akka.tcp://application@host1:2552:
>>>>>> Reachable [Reachable] (616), akka.tcp://application@host2:2552 ->
>>>>>> akka.tcp://application@host1:37770: Unreachable [Unreachable]
>>>>>> (617)], member status: [akka.tcp://application@host1:2552 Up
>>>>>> seen=true, akka.tcp://application@host1:37770 Up seen=false,
>>>>>> akka.tcp://application@host2:2552 Leaving seen=false]
>>>>>> I have turned on Akka debug logging but the only further messages
>>>>>> around the time of Disassociation I see are:
>>>>>> [DEBUG] [12/16/2016 21:42:58.893] [dispatcher-24]
>>>>>> [akka.tcp://application@host1:37770/system/cluster/core/daemon] Cluster Node
>>>>>> [akka.tcp://application@host1:37770] - Receiving gossip from
>>>>>> [UniqueAddress(akka.tcp://application@host2:2552,921200398)]
>>>>>> and
>>>>>> [DEBUG] [12/16/2016 21:06:03.310] [a.r.EndpointWriter]
>>>>>> [akka.tcp://application@host1:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fapplication%40host2%3A2552-588/endpointWriter]
>>>>>> Drained buffer with maxWriteCount: 50, fullBackoffCount: 1,
>>>>>> smallBackoffCount: 0, noBackoffCount: 0 , adaptiveBackoff: 1000
>>>>>> Here is the configuration being used for Web:
>>>>>> akka {
>>>>>>   actor {
>>>>>>     provider = "akka.cluster.ClusterActorRefProvider"
>>>>>>   }
>>>>>>   remote {
>>>>>>     secure-cookie = "9C7BBB890AB2C39691FC7B2A34F616C1D87FCC5B"
>>>>>>     require-cookie = on
>>>>>>     netty.tcp {
>>>>>>       hostname = "localhost"
>>>>>>       hostname = ${?HOSTNAME}
>>>>>>       port = 2552
>>>>>>     }
>>>>>>     log-remote-lifecycle-events = off
>>>>>>   }
>>>>>>   cluster {
>>>>>>     failure-detector.threshold = 10
>>>>>>     pub-sub {
>>>>>>       name = distributedPubSubMediator
>>>>>>       routing-logic = round-robin
>>>>>>       gossip-interval = 1s
>>>>>>       removed-time-to-live = 60s
>>>>>>       max-delta-elements = 3000
>>>>>>     }
>>>>>>     roles = ["Web"]
>>>>>>     seed-nodes = "akka.tcp://application@host1:2552"
>>>>>>   }
>>>>>>   loglevel = "DEBUG"
>>>>>>   log-dead-letters-during-shutdown = off
>>>>>>   log-dead-letters = off
>>>>>>   extensions = ["akka.cluster.pubsub.DistributedPubSub"]
>>>>>> }
>>>>>> with JAVA_OPTS="
>>>>>> ...
>>>>>> -XX:HeapDumpPath=$HOME/log/ \
>>>>>> -XX:+UseG1GC \
>>>>>> -XX:MaxGCPauseMillis=300 \
>>>>>> -XX:G1HeapWastePercent=20 \
>>>>>> -XX:InitiatingHeapOccupancyPercent=75 \
>>>>>> -XX:ConcGCThreads=32 \
>>>>>> -XX:ParallelGCThreads=48 \
>>>>>> -XX:NewRatio=1 \
>>>>>> -verbose:gc \
>>>>>> -XX:+UseGCLogFileRotation \
>>>>>> -XX:NumberOfGCLogFiles=1 \
>>>>>> -XX:GCLogFileSize=512M \
>>>>>> -XX:+PrintGCDetails \
>>>>>> -XX:+PrintGCTimeStamps \
>>>>>> -Xloggc:$HOME/log/services_web_gc.log
>>>>>> "
>>>>>> With *very* similar config for DataDig.
>>>>>> These hosts are very powerful machines that are not running any other
>>>>>> resource heavy processes (in fact they're barely running anything else at
>>>>>> all). There are a few GC pauses that are longer than I would expect.
>>>>>> Any help is appreciated, and I can provide any further
>>>>>> context/information.
