On Mon, Oct 12, 2015 at 1:48 PM, Mike Kotsur <micc...@gmail.com> wrote:
> Hello Akka community, > > I'm observing strange behaviour with cluster sharding rebalancing. The > setup I have is: > > - Cluster with 3 seed nodes; > - state-store-mode = "ddata"; auto-down-unreachable-after = 10s; > rebalance-threshold = 3; remember-entities = off; > > - The source code <https://github.com/mkotsur/akkaClusterPoc/> and config > file > > <https://github.com/mkotsur/akkaClusterPoc/blob/master/src/main/resources/application.conf> > are available on github; > > Now I'll try to explain the problem without going into details of > the business logic, but if it's necessary I can certainly explain that as > well. First, I start a single node, and create 6 actors there, they all > fall into different shard regions. This works well: > > 2015-10-12 13:24:41,004 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-22 - Watching > Actor[akka://ClusterSystem/system/sharding/ping/f/f#1610049114]. > Total watchees [6]. > 2015-10-12 13:24:41,004 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-22 - --> Actor[akka: > //ClusterSystem/system/sharding/ping/a/a#973627457] > 2015-10-12 13:24:41,004 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-22 - --> Actor[akka: > //ClusterSystem/system/sharding/ping/b/b#1102107752] > 2015-10-12 13:24:41,004 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-22 - --> Actor[akka: > //ClusterSystem/system/sharding/ping/c/c#1988221987] > 2015-10-12 13:24:41,004 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-22 - --> Actor[akka: > //ClusterSystem/system/sharding/ping/e/e#-409924933] > 2015-10-12 13:24:41,004 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-22 - --> Actor[akka: > //ClusterSystem/system/sharding/ping/d/d#992486220] > 2015-10-12 13:24:41,004 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-22 - --> Actor[akka: > //ClusterSystem/system/sharding/ping/f/f#1610049114] > > Now I start the second node, and because of therebalance-threshold = 3, > some actors are moving to another node and getting activated there by > RestarterActor. > I know that I could do remember-entities = off instead, this was a > conscious decision in order to be able to use distributed data only and > avoid using akka-persistence. > > *This is the view of Node 1:* > 2015-10-12 13:31:18,155 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-4 - --> Actor[akka.tcp:// > ClusterSystem@127.0.0.1:2552/system/sharding/ping/a/a#980947620] > 2015-10-12 13:31:18,156 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-4 - --> Actor[akka.tcp:// > ClusterSystem@127.0.0.1:2552/system/sharding/ping/b/b#1343238772] > 2015-10-12 13:31:18,156 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-4 - --> Actor[akka: > //ClusterSystem/system/sharding/ping/c/c#1988221987] > 2015-10-12 13:31:18,156 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-4 - --> Actor[akka: > //ClusterSystem/system/sharding/ping/e/e#-409924933] > 2015-10-12 13:31:18,156 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-4 - --> Actor[akka: > //ClusterSystem/system/sharding/ping/d/d#992486220] > 2015-10-12 13:31:18,156 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-4 - --> Actor[akka: > //ClusterSystem/system/sharding/ping/f/f#1610049114] > > > *This is the view of Node 2:* > 2015-10-12 13:31:18,154 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-4 - --> Actor[akka: > //ClusterSystem/system/sharding/ping/a/a#980947620] > 2015-10-12 13:31:18,154 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-4 - --> Actor[akka: > //ClusterSystem/system/sharding/ping/b/b#1343238772] > 2015-10-12 13:31:18,155 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-4 - --> Actor[akka.tcp:// > ClusterSystem@127.0.0.1:2551/system/sharding/ping/c/c#1988221987] > 2015-10-12 13:31:18,155 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-4 - --> Actor[akka.tcp:// > ClusterSystem@127.0.0.1:2551/system/sharding/ping/e/e#-409924933] > 2015-10-12 13:31:18,155 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-4 - --> Actor[akka.tcp:// > ClusterSystem@127.0.0.1:2551/system/sharding/ping/d/d#992486220] > 2015-10-12 13:31:18,155 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-4 - --> Actor[akka.tcp:// > ClusterSystem@127.0.0.1:2551/system/sharding/ping/f/f#1610049114] > > > > From this I conclude that regions ''a" and "b" have migrated to Node 2, > and regions "c".."f" remain on Node 1. Now, if I shut down the second node > (which is *not* the leader and doesn't host ShardCoordinator actor), > everything works fine: "a" and "b" get restarted on Node 1 and messages to > those are routed properly. > > The problem arises when I kill Node 1. This is what happens then: > > 2015-10-12 13:37:30,653 WARN a.remote.ReliableDeliverySupervisor > ClusterSystem-akka.remote.default-remote-dispatcher-5 - Association with > remote system [akka.tcp://ClusterSystem@127.0.0.1:2551] has failed, > address is now gated for [5000] ms. Reason: [Disassociated] > 2015-10-12 13:37:31,010 INFO a.r. > RemoteActorRefProvider$RemoteDeadLetterActorRef ClusterSystem-akka.actor. > default-dispatcher-15 - Message [akka.cluster. > ClusterHeartbeatSender$Heartbeat] from > Actor[akka://ClusterSystem/system/cluster/core/daemon/heartbeatSender#-1248042282] > to Actor[akka://ClusterSystem/deadLetters] was not delivered. [2] dead > letters encountered. This logging can be turned off or adjusted with > configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'. > 2015-10-12 13:37:31,110 INFO a.r. > RemoteActorRefProvider$RemoteDeadLetterActorRef ClusterSystem-akka.actor. > default-dispatcher-18 - Message [akka.cluster.ddata. > Replicator$Internal$Status] from Actor[akka: > //ClusterSystem/system/ddataReplicator#-2138832112] to > Actor[akka://ClusterSystem/deadLetters] was not delivered. [3] dead letters > encountered. This logging can be turned off or adjusted with configuration > settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'. > 2015-10-12 13:37:31,122 INFO a.r. > RemoteActorRefProvider$RemoteDeadLetterActorRef ClusterSystem-akka.actor. > default-dispatcher-18 - Message [akka.cluster.pubsub. > DistributedPubSubMediator$Internal$Status] from > Actor[akka://ClusterSystem/system/distributedPubSubMediator#-1074047830] > to Actor[akka://ClusterSystem/deadLetters] was not delivered. [4] dead > letters encountered. This logging can be turned off or adjusted with > configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'. > 2015-10-12 13:37:31,651 INFO a.r. > RemoteActorRefProvider$RemoteDeadLetterActorRef ClusterSystem-akka.actor. > default-dispatcher-18 - Message [akka.cluster.GossipStatus] from Actor[ > akka://ClusterSystem/system/cluster/core/daemon#-1714087228] to > Actor[akka://ClusterSystem/deadLetters] was not delivered. [5] dead letters > encountered. This logging can be turned off or adjusted with configuration > settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'. > 2015-10-12 13:37:32,010 INFO a.r. > RemoteActorRefProvider$RemoteDeadLetterActorRef ClusterSystem-akka.actor. > default-dispatcher-18 - Message [akka.cluster. > ClusterHeartbeatSender$Heartbeat] from > Actor[akka://ClusterSystem/system/cluster/core/daemon/heartbeatSender#-1248042282] > to Actor[akka://ClusterSystem/deadLetters] was not delivered. [6] dead > letters encountered. This logging can be turned off or adjusted with > configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'. > 2015-10-12 13:37:32,120 INFO a.r. > RemoteActorRefProvider$RemoteDeadLetterActorRef ClusterSystem-akka.actor. > default-dispatcher-14 - Message [akka.cluster.pubsub. > DistributedPubSubMediator$Internal$Status] from > Actor[akka://ClusterSystem/system/distributedPubSubMediator#-1074047830] > to Actor[akka://ClusterSystem/deadLetters] was not delivered. [7] dead > letters encountered. This logging can be turned off or adjusted with > configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'. > 2015-10-12 13:37:32,651 INFO a.r. > RemoteActorRefProvider$RemoteDeadLetterActorRef ClusterSystem-akka.actor. > default-dispatcher-15 - Message [akka.cluster.GossipStatus] from Actor[ > akka://ClusterSystem/system/cluster/core/daemon#-1714087228] to > Actor[akka://ClusterSystem/deadLetters] was not delivered. [8] dead letters > encountered. This logging can be turned off or adjusted with configuration > settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'. > 2015-10-12 13:37:33,012 INFO a.r. > RemoteActorRefProvider$RemoteDeadLetterActorRef ClusterSystem-akka.actor. > default-dispatcher-15 - Message [akka.cluster. > ClusterHeartbeatSender$Heartbeat] from > Actor[akka://ClusterSystem/system/cluster/core/daemon/heartbeatSender#-1248042282] > to Actor[akka://ClusterSystem/deadLetters] was not delivered. [9] dead > letters encountered. This logging can be turned off or adjusted with > configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'. > 2015-10-12 13:37:33,112 INFO a.r. > RemoteActorRefProvider$RemoteDeadLetterActorRef ClusterSystem-akka.actor. > default-dispatcher-15 - Message [akka.cluster.ddata. > Replicator$Internal$Status] from Actor[akka: > //ClusterSystem/system/ddataReplicator#-2138832112] to > Actor[akka://ClusterSystem/deadLetters] was not delivered. [10] dead > letters encountered, no more dead letters will be logged. This logging can > be turned off or adjusted with configuration settings > 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. > 2015-10-12 13:37:34,654 WARN akka.cluster.ClusterCoreDaemon ClusterSystem > -akka.actor.default-dispatcher-18 - Cluster Node [akka.tcp:// > ClusterSystem@127.0.0.1:2552] - Marking node(s) as UNREACHABLE > [Member(address = akka.tcp://ClusterSystem@127.0.0.1:2551, status = Up)] > 2015-10-12 13:37:36,016 WARN a.remote.ReliableDeliverySupervisor > ClusterSystem-akka.remote.default-remote-dispatcher-16 - Association with > remote system [akka.tcp://ClusterSystem@127.0.0.1:2551] has failed, > address is now gated for [5000] ms. Reason: [Association failed with > [akka.tcp://ClusterSystem@127.0.0.1:2551]] Caused by: [Connection > refused: /127.0.0.1:2551] > 2015-10-12 13:37:41,113 WARN a.remote.ReliableDeliverySupervisor > ClusterSystem-akka.remote.default-remote-dispatcher-16 - Association with > remote system [akka.tcp://ClusterSystem@127.0.0.1:2551] has failed, > address is now gated for [5000] ms. Reason: [Association failed with > [akka.tcp://ClusterSystem@127.0.0.1:2551]] Caused by: [Connection > refused: /127.0.0.1:2551] > 2015-10-12 13:37:44,671 INFO a.c.Cluster(akka://ClusterSystem) > ClusterSystem-akka.actor.default-dispatcher-20 - Cluster Node [akka.tcp:// > ClusterSystem@127.0.0.1:2552] - Leader is auto-downing unreachable node > [akka.tcp://ClusterSystem@127.0.0.1:2551] > 2015-10-12 13:37:44,671 INFO a.c.Cluster(akka://ClusterSystem) > ClusterSystem-akka.actor.default-dispatcher-20 - Cluster Node [akka.tcp:// > ClusterSystem@127.0.0.1:2552] - Marking unreachable node [akka.tcp:// > ClusterSystem@127.0.0.1:2551] as [Down] > 2015-10-12 13:37:45,655 INFO a.c.Cluster(akka://ClusterSystem) > ClusterSystem-akka.actor.default-dispatcher-14 - Cluster Node [akka.tcp:// > ClusterSystem@127.0.0.1:2552] - Leader is removing unreachable node > [akka.tcp://ClusterSystem@127.0.0.1:2551] > 2015-10-12 13:37:45,658 INFO a.c.s.ClusterSingletonManager ClusterSystem- > akka.actor.default-dispatcher-4 - Younger observed OldestChanged: [Some( > akka.tcp://ClusterSystem@127.0.0.1:2551) -> myself] > 2015-10-12 13:37:45,660 INFO a.c.s.ClusterSingletonManager ClusterSystem- > akka.actor.default-dispatcher-4 - ClusterSingletonManager state change [ > Younger -> BecomingOldest] > 2015-10-12 13:37:45,661 WARN akka.remote.Remoting ClusterSystem-akka. > remote.default-remote-dispatcher-16 - Association to [akka.tcp:// > ClusterSystem@127.0.0.1:2551] having UID [419477301] is irrecoverably > failed. UID is now quarantined and all messages to this UID will be > delivered to dead letters. Remote actorsystem must be restarted to recover > from this situation. > 2015-10-12 13:37:45,661 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-3 - Death watched termination > of Actor[akka.tcp:// > ClusterSystem@127.0.0.1:2551/system/sharding/ping/c/c#1988221987]. Total > watchees [5]. > 2015-10-12 13:37:45,661 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-3 - Death watched termination > of Actor[akka.tcp:// > ClusterSystem@127.0.0.1:2551/system/sharding/ping/e/e#-409924933]. Total > watchees [4]. > 2015-10-12 13:37:45,662 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-3 - Death watched termination > of Actor[akka.tcp:// > ClusterSystem@127.0.0.1:2551/system/sharding/ping/d/d#992486220]. Total > watchees [3]. > 2015-10-12 13:37:45,662 INFO c.x.shool.cluster.RestarterActor > ClusterSystem-akka.actor.default-dispatcher-3 - Death watched termination > of Actor[akka.tcp:// > ClusterSystem@127.0.0.1:2551/system/sharding/ping/f/f#1610049114]. Total > watchees [2]. > 2015-10-12 13:37:45,662 INFO a.c.s.ClusterSingletonManager ClusterSystem- > akka.actor.default-dispatcher-4 - Previous oldest [akka.tcp:// > ClusterSystem@127.0.0.1:2551] removed > 2015-10-12 13:37:45,662 INFO a.c.s.ClusterSingletonManager ClusterSystem- > akka.actor.default-dispatcher-4 - Singleton manager starting singleton > actor [akka://ClusterSystem/system/sharding/pingCoordinator/singleton] > 2015-10-12 13:37:45,664 INFO a.c.s.ClusterSingletonManager ClusterSystem- > akka.actor.default-dispatcher-4 - ClusterSingletonManager state change [ > BecomingOldest -> Oldest] > 2015-10-12 13:37:45,665 WARN a.remote.ReliableDeliverySupervisor > ClusterSystem-akka.remote.default-remote-dispatcher-16 - Association with > remote system [akka.tcp://ClusterSystem@127.0.0.1:2551] has failed, > address is now gated for [5000] ms. Reason: [Association failed with > [akka.tcp://ClusterSystem@127.0.0.1:2551]] Caused by: [Connection > refused: /127.0.0.1:2551] > 2015-10-12 13:37:46,193 WARN a.remote.ReliableDeliverySupervisor > ClusterSystem-akka.remote.default-remote-dispatcher-16 - Association with > remote system [akka.tcp://ClusterSystem@127.0.0.1:2551] has failed, > address is now gated for [5000] ms. Reason: [Association failed with > [akka.tcp://ClusterSystem@127.0.0.1:2551]] Caused by: [Connection > refused: /127.0.0.1:2551] > 2015-10-12 13:37:46,194 INFO a.c.sharding.DDataShardCoordinator > ClusterSystem-akka.actor.default-dispatcher-4 - Sharding Coordinator was > moved to the active state State(Map(e -> Actor[akka.tcp:// > ClusterSystem@127.0.0.1:2551/system/sharding/ping#-1326393304], f -> > Actor[akka.tcp:// > ClusterSystem@127.0.0.1:2551/system/sharding/ping#-1326393304], a -> > Actor[akka://ClusterSystem/system/sharding/ping#-565165058], b -> > Actor[akka://ClusterSystem/system/sharding/ping#-565165058], c -> > Actor[akka.tcp:// > ClusterSystem@127.0.0.1:2551/system/sharding/ping#-1326393304], d -> > Actor[akka.tcp:// > ClusterSystem@127.0.0.1:2551/system/sharding/ping#-1326393304] > ),Map(Actor[akka.tcp:// > ClusterSystem@127.0.0.1:2551/system/sharding/ping#-1326393304] -> > Vector(e, f, c, d), > Actor[akka://ClusterSystem/system/sharding/ping#-565165058] -> Vector(a, > b)),Set(),Set()) > > > The last message here already indicates the problem: Sharding Coordinator > was successfully moved, but its state still contains shard regions on Node > 1 (2551), which is down. This results in failures to deliver messages to > the actors that live in "c" ... "f". > Those are supposed to be removed by the watch in the coordinator, but I'm expecting them to be removed immediately since 2551 is no longer a cluster member. Please create an issue <https://github.com/akka/akka/issues>, and attach a log with DEBUG loglevel. Regards, Patrik > > Help is highly appreciated. Thanks in advance. > > -- > >>>>>>>>>> Read the docs: http://akka.io/docs/ > >>>>>>>>>> Check the FAQ: > http://doc.akka.io/docs/akka/current/additional/faq.html > >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user > --- > You received this message because you are subscribed to the Google Groups > "Akka User List" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to akka-user+unsubscr...@googlegroups.com. > To post to this group, send email to akka-user@googlegroups.com. > Visit this group at http://groups.google.com/group/akka-user. > For more options, visit https://groups.google.com/d/optout. > -- Patrik Nordwall Typesafe <http://typesafe.com/> - Reactive apps on the JVM Twitter: @patriknw -- >>>>>>>>>> Read the docs: http://akka.io/docs/ >>>>>>>>>> Check the FAQ: >>>>>>>>>> http://doc.akka.io/docs/akka/current/additional/faq.html >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user --- You received this message because you are subscribed to the Google Groups "Akka User List" group. To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+unsubscr...@googlegroups.com. To post to this group, send email to akka-user@googlegroups.com. Visit this group at http://groups.google.com/group/akka-user. For more options, visit https://groups.google.com/d/optout.