Thanks for the tip on what to look for. My logs are huge, so it's a bit of
a jungle, but I found this:

10:34:23.701UTC ERROR[system-akka.actor.default-dispatcher-2] Remoting - Association to [akka.tcp://system@ip2:port2] with UID [-1637388952] irrecoverably failed. Quarantining address.
akka.remote.ResendBufferCapacityReachedException: Resend buffer capacity of [1000] has been reached.
    at akka.remote.AckedSendBuffer.buffer(AckedDelivery.scala:121) ~[akka-remote_2.11-2.3.9.jar:na]
    at akka.remote.ReliableDeliverySupervisor.akka$remote$ReliableDeliverySupervisor$$tryBuffer(Endpoint.scala:388) ~[akka-remote_2.11-2.3.9.jar:na]
    at akka.remote.ReliableDeliverySupervisor.akka$remote$ReliableDeliverySupervisor$$handleSend(Endpoint.scala:372) ~[akka-remote_2.11-2.3.9.jar:na]
    at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:279) ~[akka-remote_2.11-2.3.9.jar:na]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465) ~[akka-actor_2.11-2.3.9.jar:na]
    at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:188) ~[akka-remote_2.11-2.3.9.jar:na]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) ~[akka-actor_2.11-2.3.9.jar:na]
    at akka.actor.ActorCell.invoke(ActorCell.scala:487) ~[akka-actor_2.11-2.3.9.jar:na]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) ~[akka-actor_2.11-2.3.9.jar:na]
    at akka.dispatch.Mailbox.run(Mailbox.scala:221) ~[akka-actor_2.11-2.3.9.jar:na]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231) ~[akka-actor_2.11-2.3.9.jar:na]
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) ~[scala-library-2.11.5.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) ~[scala-library-2.11.5.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) ~[scala-library-2.11.5.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library-2.11.5.jar:na]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library-2.11.5.jar:na]

Here ip2 and port2 are the same as in my previous post, and this happened
on a node (ip3:port3) that was also under high load.
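
To check my own understanding of the first trace, here is a minimal Python sketch of a bounded resend buffer. It is a toy model, not Akka's AckedSendBuffer: the class and function names are made up for illustration, and the only things taken from the log are the capacity of 1000 and the exception text. It shows how a sender that keeps buffering unacknowledged messages runs into the limit when the peer is too busy to acknowledge them.

```python
class ResendBufferCapacityReached(Exception):
    """Raised when the toy buffer cannot hold another unacknowledged message."""


class BoundedResendBuffer:
    """Toy model of a sender-side buffer of unacknowledged messages."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.unacked = []      # sequence numbers still waiting for an ACK
        self.next_seq = 0

    def buffer(self, payload):
        # Refuse to grow past the capacity instead of using unbounded memory.
        if len(self.unacked) >= self.capacity:
            raise ResendBufferCapacityReached(
                f"Resend buffer capacity of [{self.capacity}] has been reached."
            )
        seq = self.next_seq
        self.unacked.append(seq)
        self.next_seq += 1
        return seq

    def acknowledge(self, cumulative_ack):
        # A cumulative ACK confirms every sequence number up to and including it.
        self.unacked = [s for s in self.unacked if s > cumulative_ack]


def send_without_acks(buffer, count):
    """Buffer `count` messages while the peer never acknowledges.

    Returns how many messages fit before the buffer refused more.
    """
    accepted = 0
    try:
        for i in range(count):
            buffer.buffer(f"msg-{i}")
            accepted += 1
    except ResendBufferCapacityReached:
        pass
    return accepted
```

In this model, sending 1500 messages with no acknowledgements in between stops after 1000 of them, which would match the point where my node gave up on the association.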

After this the ip2:port2 node started to print:
10:34:24.234UTC WARN [system-akka.actor.default-dispatcher-2] Remoting - Tried to associate with unreachable remote address [akka.tcp://system@ip3:port3]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.
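
As I read this warning, the gating part works roughly like the Python sketch below. This is only my mental model built from the log text (5000 ms gate, messages going to dead letters); the class and its methods are hypothetical and not part of Akka. In my case the gate presumably closes again on every retry, because the other side has quarantined this system.

```python
class GatedAddress:
    """Toy model of an outbound address that can be temporarily gated."""

    def __init__(self, gate_ms=5000):
        self.gate_ms = gate_ms
        self.gated_until_ms = 0
        self.delivered = []
        self.dead_letters = []

    def gate(self, now_ms):
        # Called after a failed association attempt.
        self.gated_until_ms = now_ms + self.gate_ms

    def send(self, message, now_ms):
        # While the gate is closed nothing is sent; messages are dropped
        # into dead letters instead of being queued for later.
        if now_ms < self.gated_until_ms:
            self.dead_letters.append(message)
            return False
        self.delivered.append(message)
        return True
```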

On ip3:port3 I also later see:
10:34:25.180UTC ERROR[system-akka.actor.default-dispatcher-2] Remoting - Association to [akka.tcp://system@ip2:port2] with UID [-1637388952] irrecoverably failed. Quarantining address.
java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [3 {0, 1, 2, 3}] ack: ACK[2114, {}]
    at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:287) ~[akka-remote_2.11-2.3.9.jar:na]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465) ~[akka-actor_2.11-2.3.9.jar:na]
    at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:188) ~[akka-remote_2.11-2.3.9.jar:na]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) ~[akka-actor_2.11-2.3.9.jar:na]
    at akka.actor.ActorCell.invoke(ActorCell.scala:487) ~[akka-actor_2.11-2.3.9.jar:na]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) ~[akka-actor_2.11-2.3.9.jar:na]
    at akka.dispatch.Mailbox.run(Mailbox.scala:221) ~[akka-actor_2.11-2.3.9.jar:na]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231) ~[akka-actor_2.11-2.3.9.jar:na]
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) ~[scala-library-2.11.5.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) ~[scala-library-2.11.5.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) ~[scala-library-2.11.5.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library-2.11.5.jar:na]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library-2.11.5.jar:na]
Caused by: java.lang.IllegalArgumentException: Highest SEQ so far was 3 but cumulative ACK is 2114
    at akka.remote.AckedSendBuffer.acknowledge(AckedDelivery.scala:103) ~[akka-remote_2.11-2.3.9.jar:na]
    at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:283) ~[akka-remote_2.11-2.3.9.jar:na]
    ... 12 common frames omitted
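
If I read this second trace correctly, the sending side only knows sequence numbers 0 to 3, yet the peer acknowledges 2114. The Python sketch below is a simplified stand-in, with hypothetical names, for the kind of sanity check that would produce such a message. Why the two sides disagree (for example, one side having started its numbering over) is a guess on my part, not something the log proves.

```python
class AckedBuffer:
    """Toy sender-side buffer that validates cumulative ACKs (hypothetical names)."""

    def __init__(self):
        self.max_seq = -1          # highest sequence number handed out so far
        self.unacked = set()

    def buffer(self):
        self.max_seq += 1
        self.unacked.add(self.max_seq)
        return self.max_seq

    def acknowledge(self, cumulative_ack):
        # An ACK for a sequence number that was never sent means the two sides
        # disagree about the conversation, so fail loudly instead of guessing.
        if cumulative_ack > self.max_seq:
            raise ValueError(
                f"Highest SEQ so far was {self.max_seq} "
                f"but cumulative ACK is {cumulative_ack}"
            )
        self.unacked = {s for s in self.unacked if s > cumulative_ack}


def reproduce_mismatch():
    """Buffer sequence numbers 0..3, then apply the ACK value seen in the log."""
    buf = AckedBuffer()
    for _ in range(4):
        buf.buffer()
    try:
        buf.acknowledge(2114)
    except ValueError as err:
        return str(err)
    return None
```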


Maybe this explains something? What should I do about this?
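
In case it helps anyone else dig through large logs, this is roughly how such entries can be pulled out with Python. It is a rough sketch that assumes each log entry sits on a single line in the log file, in the format of the ERROR entries I pasted; the function name and the layout of the returned tuples are my own invention.

```python
import re

# Matches the Remoting quarantine error; address and UID are captured.
QUARANTINE_RE = re.compile(
    r"^(?P<time>\S+)\s+ERROR\s*\[.*?\]\s+Remoting - "
    r"Association to \[(?P<address>[^\]]+)\] with UID \[(?P<uid>-?\d+)\] "
    r"irrecoverably failed\. Quarantining address\."
)


def find_quarantine_events(lines):
    """Return (time, address, uid, cause) for every quarantine entry.

    `cause` is the line right after the entry (the exception message),
    or an empty string if the entry is the last line.
    """
    lines = list(lines)
    events = []
    for i, line in enumerate(lines):
        match = QUARANTINE_RE.match(line)
        if match:
            cause = lines[i + 1].strip() if i + 1 < len(lines) else ""
            events.append(
                (match["time"], match["address"], int(match["uid"]), cause)
            )
    return events
```

Feeding it an open log file (for example `find_quarantine_events(open(path))`, where `path` is whatever your log file is called) gives one tuple per quarantine event, which makes it easier to line up the timestamps across nodes.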

On Thursday, January 22, 2015 at 4:05:52 PM UTC+2, Patrik Nordwall wrote:
>
> If it's quarantined, it will be removed from the cluster. Please include
> the log entry that says it is quarantined, if any.
>
> /Patrik
>
> On 22 Jan 2015 at 14:56, Viktor Klang <viktor...@gmail.com> wrote:
>
> Endre, could it be due to pending-to-send system message overflow?
>
> On Thu, Jan 22, 2015 at 11:45 AM, Johannes Berg <jber...@gmail.com> wrote:
>
>> Okay, I increased the load further and now I see the same problem again. 
>> It seems to just have gotten a bit better in that it doesn't happen as 
>> fast, but with enough load it happens.
>>
>> To reiterate: I have Akka 2.3.9 on all 8 nodes, with
>> auto-down-unreachable-after = off on all of them, and I don't do any
>> manual downing anywhere. Still, the leader log prints this:
>>
>> 2015-01-22 10:35:37 +0000 - [INFO] - from Cluster(akka://system) in 
>> system-akka.actor.default-dispatcher-2 
>> Cluster Node [akka.tcp://system@ip1:port1] - Leader is removing 
>> unreachable node [akka.tcp://system@ip2:port2]
>>
>> and the node(s) under load is(are) removed from the cluster 
>> (quarantined). How is this possible?
>>
>> On Wednesday, January 21, 2015 at 5:53:06 PM UTC+2, drewhk wrote:
>>>
>>> Hi Johannes,
>>>
>>> See the milestone here:
>>> https://github.com/akka/akka/issues?q=milestone%3A2.3.9+is%3Aclosed
>>>
>>> The tickets cross-reference the PRs, too, so you can look at the code
>>> changes. The issue that probably hit you is
>>> https://github.com/akka/akka/issues/16623, which manifested as system
>>> message delivery errors on some systems but was actually caused by
>>> accidentally duplicated internal actors (a regression).
>>>
>>> -Endre
>>>
>>> On Wed, Jan 21, 2015 at 4:47 PM, Johannes Berg <jber...@gmail.com> 
>>> wrote:
>>>
>>>> Upgrading to 2.3.9 does indeed seem to solve my problem. At least I
>>>> haven't seen it since.
>>>>
>>>> Now I'm curious what the fixes were. Is there a change summary between
>>>> versions, or a list of which bugs have been fixed in which versions?
>>>>
>>>> On Wednesday, January 21, 2015 at 11:31:02 AM UTC+2, drewhk wrote:
>>>>>
>>>>> Hi Johannes,
>>>>>
>>>>> We just released 2.3.9 with important bugfixes. I recommend updating
>>>>> and checking whether the problem persists.
>>>>>
>>>>> -Endre
>>>>>
>>>>> On Wed, Jan 21, 2015 at 10:29 AM, Johannes Berg <jber...@gmail.com> 
>>>>> wrote:
>>>>>
>>>>>> Many connections seem to be formed when the node has been marked down
>>>>>> for unreachability, even though it's still alive, and it tries to
>>>>>> connect back into the cluster. The removed node prints:
>>>>>>
>>>>>> "Address is now gated for 5000 ms, all messages to this address will 
>>>>>> be delivered to dead letters. Reason: The remote system has quarantined 
>>>>>> this system. No further associations to the remote system are possible 
>>>>>> until this system is restarted."
>>>>>>
>>>>>> It doesn't seem to close the connections properly even though it
>>>>>> opens new ones continuously.
>>>>>>
>>>>>> Anyway, that's a separate issue that I'm not that concerned about
>>>>>> right now. I've realized I don't want to use automatic downing.
>>>>>> Instead I would like to allow nodes to go unreachable and come back
>>>>>> to reachable, even if it takes quite some time, and to stop the
>>>>>> process and down the node manually in case of an actual crash.
>>>>>>
>>>>>> Consequently I've put
>>>>>>
>>>>>> auto-down-unreachable-after = off
>>>>>>
>>>>>> in the config. Now I have the problem that nodes are still removed.
>>>>>> This is from the leader node log:
>>>>>>
>>>>>> 08:50:14.087UTC INFO [system-akka.actor.default-dispatcher-4] 
>>>>>> Cluster(akka://system) - Cluster Node [akka.tcp://system@ip1:port1] - 
>>>>>> Leader is removing unreachable node [akka.tcp://system@ip2:port2]
>>>>>>
>>>>>> I can understand that my node is marked unreachable because it's under
>>>>>> heavy load, but I don't understand what could cause it to be removed.
>>>>>> I'm not doing any manual downing and have auto-down set to off. What
>>>>>> else could trigger the removal?
>>>>>>
>>>>>> Using the akka-cluster script I can see that the node has most other
>>>>>> nodes marked as unreachable (including the leader) and that it sees a
>>>>>> different leader than the other nodes do.
>>>>>>
>>>>>> My test system consists of 8 nodes.
>>>>>>
>>>>>> About the unreachability: I'm not having long GC pauses and I'm not
>>>>>> sending large blobs, but I am sending very many smaller messages as
>>>>>> fast as I can. If I hammer it fast enough it ends up unreachable,
>>>>>> which I can accept, but I need to get it back to reachable.
>>>>>>
>>>>>> On Thursday, December 11, 2014 at 11:22:41 AM UTC+2, Björn Antonsson 
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Johannes,
>>>>>>>
>>>>>>> On 9 December 2014 at 15:29:53, Johannes Berg (jber...@gmail.com) 
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi! I'm running some load tests on our system and some of my nodes
>>>>>>> get marked as unreachable even though the processes are up. A node
>>>>>>> goes from reachable to unreachable and back a few times before
>>>>>>> staying unreachable, logging that the connection is gated for
>>>>>>> 5000 ms and then silently staying that way.
>>>>>>>
>>>>>>> Looking at the connections made to one of the seed nodes, I see
>>>>>>> several hundred connections from other nodes, except the failing
>>>>>>> ones. Is this normal? There are several hundred just between two
>>>>>>> nodes. When are connections formed between cluster nodes and when
>>>>>>> are they taken down?
>>>>>>>
>>>>>>>
>>>>>>> Several hundred connections between two nodes seems very wrong.
>>>>>>> There should only be one connection between two nodes that
>>>>>>> communicate over akka remoting or are part of a cluster. How many
>>>>>>> nodes do you have in your cluster?
>>>>>>>
>>>>>>> If you are using cluster aware routers then there should be one
>>>>>>> connection between the router node and the routee nodes (it can be
>>>>>>> the same connection that is used for the cluster communication).
>>>>>>>
>>>>>>> The connections between the nodes don't get torn down, they stay 
>>>>>>> open, but they are reused for all remoting communication between the 
>>>>>>> nodes.
>>>>>>>
>>>>>>> Also is there some limit on how many connections a node with default 
>>>>>>> settings will accept?
>>>>>>>
>>>>>>> We have auto-down-unreachable-after = 10s set in our config. Does
>>>>>>> this mean that if the node is busy and doesn't respond within 10
>>>>>>> seconds it becomes unreachable?
>>>>>>>
>>>>>>> Is there any reason why it would stay unreachable and not re-try to 
>>>>>>> join the cluster?
>>>>>>>
>>>>>>>
>>>>>>> The auto-down setting does just what it says. If the node is
>>>>>>> considered unreachable for 10 seconds, it will be moved to DOWN and
>>>>>>> won't be able to come back into the cluster. The different states of
>>>>>>> the cluster and the settings are explained in the documentation.
>>>>>>>
>>>>>>> http://doc.akka.io/docs/akka/2.3.7/common/cluster.html
>>>>>>> http://doc.akka.io/docs/akka/2.3.7/scala/cluster-usage.html
>>>>>>>
>>>>>>> If you are having problems with nodes becoming unreachable then you
>>>>>>> could check if you are doing one of these things:
>>>>>>> 1) sending too large blobs as messages, which effectively block out
>>>>>>> the heartbeats going over the same connection
>>>>>>> 2) having long GC pauses that trigger the failure detector since
>>>>>>> nodes don't reply to heartbeats
>>>>>>>
>>>>>>> B/
>>>>>>>
>>>>>>> We are using Akka 2.3.6 and rely heavily on cluster aware routers,
>>>>>>> with a lot of remote messages going around.
>>>>>>>
>>>>>>> Can anyone shed some light on this, or point me to some
>>>>>>> documentation about these things?
>>>>>>> --
>>>>>>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>> >>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
>>>>>>> ---
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "Akka User List" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to akka-user+...@googlegroups.com.
>>>>>>> To post to this group, send email to akka...@googlegroups.com.
>>>>>>> Visit this group at http://groups.google.com/group/akka-user.
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>>> Björn Antonsson
>>>>>>> Typesafe <http://typesafe.com/> – Reactive Apps on the JVM
>>>>>>> twitter: @bantonsson <http://twitter.com/#!/bantonsson>
>>>>>>>
> -- 
> Cheers,
> √
