Re: [akka-user] High Quarantine Frequency of akka-remote

Endre Varga Tue, 05 May 2015 23:35:22 -0700

Hi Zhuchen,

On Tue, May 5, 2015 at 10:30 PM, Zhuchen Wang <zcx.w...@gmail.com> wrote:


> Here is the log.
> 2015-05-05 09:51:15,029 WARN
> [channelservice-akka.actor.default-dispatcher-4] Association with remote
> system [akka.tcp://system@host2] has failed, address is now gated for
> [3000] ms. Reason: [Disassociated]
>
> 2015-05-05 09:51:17,697 WARN
> [channelservice-akka.actor.default-dispatcher-54] Detected unreachable:
> [akka.tcp://system@host2]
>

The above line means that remote deathwatch detected the remote system to
be unreachable, that is, heartbeats were missing for too long.
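
To make "missing for too long" concrete, here is a rough sketch of the kind
of calculation a phi accrual failure detector performs, using the
watch-failure-detector values quoted further down in this thread
(heartbeat-interval 10s, acceptable-heartbeat-pause 30s, min-std-deviation
200ms, threshold 12.0). The logistic approximation of the normal
distribution and the assumption that heartbeats had been arriving exactly
every 10 s are mine; the formula in your Akka version may differ in detail.

```scala
// Sketch of a phi accrual style suspicion level. This is an illustration
// under stated assumptions, not a copy of Akka's implementation.
object PhiSketch {
  // Values taken from the watch-failure-detector block in this thread.
  val heartbeatIntervalMs = 10000.0
  val acceptablePauseMs   = 30000.0
  val minStdDeviationMs   = 200.0
  val threshold           = 12.0

  // Assumption: heartbeats have arrived perfectly regularly so far, so the
  // observed mean equals the interval and the deviation is clamped to the
  // configured minimum.
  val meanMs   = heartbeatIntervalMs + acceptablePauseMs
  val stdDevMs = minStdDeviationMs

  // phi = -log10(P(a heartbeat arrives later than timeSinceLastMs)),
  // using a logistic approximation of the normal CDF.
  def phi(timeSinceLastMs: Double): Double = {
    val y = (timeSinceLastMs - meanMs) / stdDevMs
    val e = math.exp(-y * (1.5976 + 0.070566 * y * y))
    if (timeSinceLastMs > meanMs) -math.log10(e / (1.0 + e))
    else -math.log10(1.0 - 1.0 / (1.0 + e))
  }

  // The remote system is considered unreachable once phi reaches the threshold.
  def isUnreachable(timeSinceLastMs: Double): Boolean =
    phi(timeSinceLastMs) >= threshold
}
```

Under these assumptions phi is still below the threshold after 40 s of
silence and above it after 42 s, so a "Detected unreachable" line suggests
that no heartbeat was processed for roughly 40 seconds, not that a single
heartbeat was lost.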


> 2015-05-05 09:51:17,699 WARN
> [channelservice-akka.actor.default-dispatcher-56] Association to
> [akka.tcp://system@host2] having UID [-648515237] is irrecoverably
> failed. UID is now quarantined and all messages to this UID will be
> delivered to dead letters. Remote actorsystem must be restarted to recover
> from this situation.
>
> 2015-05-05 09:51:17,731 WARN
> [channelservice-akka.actor.default-dispatcher-3] AssociationError
> [akka.tcp://system@host1] -> [akka.tcp://system@host2]: Error [Invalid
> address: akka.tcp://system@host2] [
> akka.remote.InvalidAssociation: Invalid address: akka.tcp://system@host2
> Caused by: akka.remote.transport.Transport$InvalidAssociationException:
> The remote system has a UID that has been quarantined. Association aborted.
> ]
>
> The Gated event usually happens first.
>
> There is no backpressuring right now. Don't heartbeats have higher
> priority than the normal remote messages?
>

To some extent yes, but every message shares the same TCP connection. Do
you send very large messages remotely? Have you tried increasing the
dispatcher threadpool for remoting?
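
As a starting point, a config fragment along these lines could be used to
give remoting and Netty more threads. The key names are written from memory
of the akka-remote reference configuration and the value 4 is only an
example, so please check both against the reference.conf shipped with your
Akka version before relying on them.

```
akka.remote {
  # Dispatcher used by the remoting actors (assumed key name).
  default-remote-dispatcher {
    fork-join-executor {
      parallelism-min = 4
      parallelism-max = 4
    }
  }
  # Netty worker threads for inbound and outbound connections.
  netty.tcp {
    server-socket-worker-pool {
      pool-size-min = 4
      pool-size-max = 4
    }
    client-socket-worker-pool {
      pool-size-min = 4
      pool-size-max = 4
    }
  }
}
```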


>
> We did see ResendBufferCapacityReachedException before and we increased
> the buffer size then. Does this mean the receiver is overwhelmed?
>

This means that there are a lot of system messages, and the default buffer
size of 1000 was not enough to contain all of the unacknowledged system
messages. Your log indicates that this is no longer the problem, as the
Quarantining happens because of missing heartbeats.


>
> GC should not be a problem in this case. We have been monitoring the GC
> overhead.
>
> I am about to remove the remote watch code to rule out the
> Quarantine event entirely.
>

While you can remove the watch, it will not fix the underlying issue. The
log snippet you pasted is too short and does not contain all of the
information. Can you turn on debug logging and gather a longer log?
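
For reference, a config fragment like the following is one way to get more
detail out of remoting. It is a sketch rather than a recommended production
setting: the message logging options are very verbose under load, and the
log backend (for example logback) must also allow DEBUG output.

```
akka {
  loglevel = "DEBUG"
  remote {
    # Log association and disassociation events.
    log-remote-lifecycle-events = on
    # Log every remote message at DEBUG level (very noisy).
    log-sent-messages = on
    log-received-messages = on
  }
}
```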

Also, I recommend adding a simple ping-pong actor on both systems that
periodically measures the round-trip time between a ping and its
corresponding pong and logs it to the console.
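
A minimal sketch of such a probe, assuming classic actors as in Akka 2.3
(newer versions offer timers instead of scheduler.schedule). The names Tick,
Ping, Pong, Rtt, Ponger and Pinger are made up for this example and are not
part of Akka.

```scala
import akka.actor.{Actor, ActorLogging, Cancellable}
import scala.concurrent.duration.FiniteDuration

// Hypothetical probe protocol; these names are not part of Akka.
case object Tick
final case class Ping(sentAtNanos: Long)
final case class Pong(sentAtNanos: Long)

object Rtt {
  // Round-trip time in milliseconds from two System.nanoTime() readings.
  def millis(sentAtNanos: Long, nowNanos: Long): Double =
    (nowNanos - sentAtNanos) / 1e6
}

// Runs on the remote system: echoes the timestamp straight back.
class Ponger extends Actor {
  def receive = {
    case Ping(sentAt) => sender() ! Pong(sentAt)
  }
}

// Runs on the local system: pings the remote Ponger once per interval
// and logs the measured round-trip time.
class Pinger(remotePongerPath: String, interval: FiniteDuration)
    extends Actor with ActorLogging {
  import context.dispatcher // execution context for the scheduler

  private val target = context.actorSelection(remotePongerPath)
  private var tick: Option[Cancellable] = None

  override def preStart(): Unit =
    tick = Some(context.system.scheduler.schedule(interval, interval, self, Tick))

  override def postStop(): Unit = tick.foreach(_.cancel())

  def receive = {
    case Tick => target ! Ping(System.nanoTime())
    case Pong(sentAt) =>
      // Both timestamps come from this JVM's nanoTime, so clock skew
      // between the hosts does not affect the measurement.
      val rtt = Rtt.millis(sentAt, System.nanoTime())
      log.info("ping-pong round trip to {}: {} ms", remotePongerPath, rtt)
  }
}
```

Start a Ponger under a known name on each system and a Pinger pointing at
the other system's Ponger path, with the real host and port filled in. If
the logged round-trip time climbs into seconds shortly before a quarantine,
heartbeats are most likely being delayed in the same way.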

-Endre


>
>
>
>
> On Tuesday, May 5, 2015 at 12:27:17 PM UTC-7, drewhk wrote:
>>
>> Also, are you sure that you are backpressuring the sender properly and
>> not overwhelming remoting itself? If remoting is building up buffer size
>> due to it not being able to send messages fast enough, then heartbeats can
>> get delayed arbitrarily long (although we take some measures to mitigate
>> that).
>>
>> You can also try increasing the dispatcher thread pool size for remoting
>> and Netty.
>>
>> You should also look into GC activity, since you mentioned that you see
>> this under load. Many cases similar to yours turn out to be caused by actor
>> mailbox buildup (lack of backpressure) and resulting high GC pauses.
>>
>> We can give much deeper help with access to source, but that is a
>> commercial service.
>>
>> -Endre
>>
>> On Tue, May 5, 2015 at 9:13 PM, Endre Varga <endre...@typesafe.com>
>> wrote:
>>
>>> What is the actual log message when the quarantine happens? Can you show
>>> snippets of your logs around the quarantine event? Can it be that your
>>> system message redelivery buffer gets filled because of Terminated messages?
>>>
>>> Without seeing a log snippet it is impossible to say anything more
>>> concrete.
>>>
>>> -Endre
>>>
>>> On Tue, May 5, 2015 at 9:11 PM, Zhuchen Wang <zcx....@gmail.com> wrote:
>>>
>>>> Upgrading to akka 2.3.10 doesn't help a lot.
>>>>
>>>> As I mentioned in
>>>> https://groups.google.com/forum/#!topic/akka-user/NGLi9GTZ42o, we do
>>>> not actually rely on akka to form the cluster.
>>>>
>>>> We use Zookeeper to do cluster management and partition allocation but
>>>> use akka-remote to communicate between nodes.
>>>>
>>>> Let's say we have node1, node2, node3 and partition P contains node1
>>>> and node2.
>>>>
>>>> Each node has a partitionManager actor.
>>>>
>>>> In node1
>>>> partitionManager will have a child actor
>>>> akka://node1/actorsystem/partitionManager/P and an ActorSelectionRoutee for
>>>> akka://node2/actorsystem/partitionManager/P
>>>>
>>>> In node2
>>>> partitionManager will have a child actor
>>>> akka://node2/actorsystem/partitionManager/P and an ActorSelectionRoutee for
>>>> akka://node1/actorsystem/partitionManager/P
>>>>
>>>> In node3
>>>> partitionManager will have 2 ActorSelectionRoutees for
>>>> akka://node1/actorsystem/partitionManager/P and
>>>> akka://node2/actorsystem/partitionManager/P
>>>>
>>>> All the actors are started locally thus no remote deployment is
>>>> involved.
>>>>
>>>> Channels can be created under a partition, and each channel actor is
>>>> replicated under all partition actors.
>>>>
>>>> For example chnl1
>>>>
>>>> There will be akka://node1/actorsystem/partitionManager/P/chnl1 and
>>>> akka://node2/actorsystem/partitionManager/P/chnl1 created in node1 and 
>>>> node2
>>>>
>>>> Now subscribers can subscribe to the channel. If the subscribers come
>>>> to node1 or node2, no remoting is involved.
>>>>
>>>> If subscribers come to node3, the partitionManager will pick one
>>>> ActorSelectionRoutee to forward the subscription.
>>>>
>>>> In this case we have remote death watch involved.
>>>>
>>>> akka://node3/actorsystem/subA *watches*
>>>> akka://node1/actorsystem/partitionManager/P/chnl1 and vice versa, so
>>>> that if the channel actor dies the subscribers are notified and can
>>>> re-subscribe to another partition member. In the graceful stop case,
>>>> the channel actor needs to wait for all subscribers to terminate
>>>> before stopping itself.
>>>>
>>>> Now the main logic is creating channel, subscribing to channel,
>>>> publishing to channel and stopping channel.
>>>>
>>>> In this use case, we get the Quarantined event almost daily.
>>>>
>>>> And our settings for the failure detector are
>>>>
>>>> watch-failure-detector {
>>>>    heartbeat-interval = 10s
>>>>    acceptable-heartbeat-pause = 30s
>>>>    min-std-deviation = 200ms
>>>>    threshold = 12.0
>>>> }
>>>>
>>>> Thanks,
>>>>
>>>>  --
>>>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>>>> >>>>>>>>>> Check the FAQ:
>>>> http://doc.akka.io/docs/akka/current/additional/faq.html
>>>> >>>>>>>>>> Search the archives:
>>>> https://groups.google.com/group/akka-user
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Akka User List" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to akka-user+...@googlegroups.com.
>>>> To post to this group, send email to akka...@googlegroups.com.
>>>> Visit this group at http://groups.google.com/group/akka-user.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>