[ https://issues.apache.org/jira/browse/HELIX-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053664#comment-17053664 ]
Craig Murphey commented on HELIX-822:
-------------------------------------
We'll give 0.9.x a shot, then let you know if that resolves our issue. Thanks
again!
> OnlineOffline cluster stops rebalancing
> ---------------------------------------
>
> Key: HELIX-822
> URL: https://issues.apache.org/jira/browse/HELIX-822
> Project: Apache Helix
> Issue Type: Bug
> Components: helix-core
> Affects Versions: 0.8.x
> Reporter: Craig Murphey
> Priority: Major
> Attachments: Screen Shot 2020-03-05 at 11.28.53 AM.png
>
>
> We recently upgraded our controller to 0.8.4, then downgraded it back to
> 0.8.2. Since the downgrade, some time after a controller is elected master,
> we've seen our LiveInstanceChangeListener not get called for a live instance
> update.
> On the controller, a thread started at controller startup continuously logs
> the external state, and it does see the instance count decrease.
> At the time we would expect the listener to be notified, we see a large
> number of znodes being created and deleted.
> !Screen Shot 2020-03-05 at 11.28.53 AM.png!
> Upon inspection of our instances with helix-admin.sh, we found that we have
> many more instances than live instances (20 live instances, 60-100
> instances). This is because we register each participant by hostname, which
> can change over time.
> Looking into these instances, we found that many of the non-live instances
> have many leftover messages.
> We are able to mitigate the issue by restarting the master controller
> manually.
> How do leftover instances affect the overall cluster health? Is it possible
> that the controller is trying to tell offline instances that their resource
> is dropped, and that this is preventing the controller from issuing the live
> instance change event?
> Here's a snapshot of what we saw in zk:
>
> {noformat}
> So, in DCA, there are a lot of messages in Zookeeper for instances that are
> not live ->
> $ zkcli -h dlmzk ls /DLM/INSTANCES | awk -F \' '{print $2}' |
>   while read host; do
>     echo -n "$host : "
>     zkcli -h dlmzk ls /DLM/INSTANCES/$host/MESSAGES |
>       awk -F \' '{print $2}' | grep -v "^$" | wc -l
>   done | sort -nk 3
> agent1016-dca1_8274 : 0
> agent1053-dca1_8274 : 0
> agent1100-dca1_8274 : 0
> agent1346-dca1_8274 : 0
> agent1397-dca1_8274 : 0
> agent1406-dca1_8274 : 0
> agent1412-dca1_8274 : 0
> agent1549-dca1_8274 : 0
> agent1558-dca1_8274 : 0
> agent1573-dca1_8274 : 0
> agent1584-dca1_8274 : 0
> agent211-dca1_8274 : 0
> agent2124-dca1_8274 : 0
> agent2148-dca1_8274 : 0
> agent2149-dca1_8274 : 0
> agent2153-dca1_8274 : 0
> agent2184-dca1_8274 : 0
> agent21-dca1_8274 : 0
> agent2287-dca1_8274 : 0
> agent2713-dca1_8274 : 0
> agent2763-dca1_8274 : 0
> agent27-dca1_8274 : 0
> agent2878-dca1_8274 : 0
> agent2900-dca1_8274 : 0
> agent2930-dca1_8274 : 0
> agent31-dca1_8274 : 0
> agent3372-dca1_8274 : 0
> agent3376-dca1_8274 : 0
> agent3435-dca1_8274 : 0
> agent3436-dca1_8274 : 0
> agent3473-dca1_8274 : 0
> agent3543-dca1_8274 : 0
> agent3564-dca1_8274 : 0
> agent3572-dca1_8274 : 0
> agent3601-dca1_8274 : 0
> agent3646-dca1_8274 : 0
> agent3647-dca1_8274 : 0
> agent3648-dca1_8274 : 0
> agent3651-dca1_8274 : 0
> agent3671-dca1_8274 : 0
> agent3677-dca1_8274 : 0
> agent3678-dca1_8274 : 0
> agent3699-dca1_8274 : 0
> agent3714-dca1_8274 : 0
> agent3726-dca1_8274 : 0
> agent3991-dca1_8274 : 0
> agent4070-dca1_8274 : 0
> agent4096-dca1_8274 : 0
> agent4121-dca1_8274 : 0
> agent4545-dca1_8274 : 0
> agent4581-dca1_8274 : 0
> agent4601-dca1_8274 : 0
> agent4612-dca1_8274 : 0
> agent4649-dca1_8274 : 0
> agent4650-dca1_8274 : 0
> agent4651-dca1_8274 : 0
> agent4664-dca1_8274 : 0
> agent4672-dca1_8274 : 0
> agent4678-dca1_8274 : 0
> agent46-dca1_8274 : 0
> agent4702-dca1_8274 : 0
> agent4722-dca1_8274 : 0
> agent4726-dca1_8274 : 0
> agent4729-dca1_8274 : 0
> agent4730-dca1_8274 : 0
> agent5233-dca1_8274 : 0
> agent5261-dca1_8274 : 0
> agent5284-dca1_8274 : 0
> agent63-dca1_8274 : 0
> agent6444-dca1_8274 : 0
> agent79-dca1_8274 : 0
> agent83-dca1_8274 : 0
> agent84-dca1_8274 : 0
> agent90-dca1_8274 : 0
> appdocker1204-dca1_8274 : 0
> appdocker1454-dca1_8274 : 0
> appdocker1858-dca1_8274 : 0
> appdocker1950-dca1_8274 : 0
> appdocker1966-dca1_8274 : 0
> appdocker1970-dca1_8274 : 0
> appdocker1985-dca1_8274 : 0
> appdocker2012-dca1_8274 : 0
> appdocker2046-dca1_8274 : 0
> appdocker255-dca1_8274 : 0
> appdocker30-dca1_8274 : 0
> appdocker507-dca1_8274 : 0
> appdocker568-dca1_8274 : 0
> appdocker580-dca1_8274 : 0
> appdocker61-dca1_8274 : 0
> appdocker661-dca1_8274 : 0
> appdocker693-dca1_8274 : 0
> appdocker77-dca1_8274 : 0
> appdocker791-dca1_8274 : 0
> appdocker874-dca1_8274 : 0
> appdocker909-dca1_8274 : 0
> appdocker949-dca1_8274 : 0
> compute1699-dca1_8274 : 0
> compute2072-dca1_8274 : 0
> compute228-dca1_8274 : 0
> compute2527-dca1_8274 : 0
> compute2541-dca1_8274 : 0
> compute2579-dca1_8274 : 0
> compute2608-dca1_8274 : 0
> compute2792-dca1_8274 : 0
> compute2822-dca1_8274 : 0
> compute2842-dca1_8274 : 0
> compute2849-dca1_8274 : 0
> compute2862-dca1_8274 : 0
> compute2928-dca1_8274 : 0
> compute2937-dca1_8274 : 0
> compute2946-dca1_8274 : 0
> compute295-dca1_8274 : 0
> compute2964-dca1_8274 : 0
> compute2999-dca1_8274 : 0
> compute3026-dca1_8274 : 0
> compute3045-dca1_8274 : 0
> compute3209-dca1_8274 : 0
> compute3217-dca1_8274 : 0
> compute3244-dca1_8274 : 0
> compute3247-dca1_8274 : 0
> compute3363-dca1_8274 : 0
> compute3373-dca1_8274 : 0
> compute3383-dca1_8274 : 0
> compute3385-dca1_8274 : 0
> compute3391-dca1_8274 : 0
> compute3413-dca1_8274 : 0
> compute3449-dca1_8274 : 0
> compute3452-dca1_8274 : 0
> compute3525-dca1_8274 : 0
> compute3526-dca1_8274 : 0
> compute3530-dca1_8274 : 0
> compute3546-dca1_8274 : 0
> compute3571-dca1_8274 : 0
> compute3584-dca1_8274 : 0
> compute3600-dca1_8274 : 0
> compute3621-dca1_8274 : 0
> compute3678-dca1_8274 : 0
> compute3691-dca1_8274 : 0
> compute3695-dca1_8274 : 0
> compute36-dca1_8274 : 0
> compute3750-dca1_8274 : 0
> compute3770-dca1_8274 : 0
> compute3809-dca1_8274 : 0
> compute3846-dca1_8274 : 0
> compute3857-dca1_8274 : 0
> compute3919-dca1_8274 : 0
> compute3985-dca1_8274 : 0
> compute4033-dca1_8274 : 0
> compute4036-dca1_8274 : 0
> compute4103-dca1_8274 : 0
> compute4141-dca1_8274 : 0
> compute4161-dca1_8274 : 0
> compute4191-dca1_8274 : 0
> compute4239-dca1_8274 : 0
> compute42-dca1_8274 : 0
> compute4305-dca1_8274 : 0
> compute4339-dca1_8274 : 0
> compute4396-dca1_8274 : 0
> compute4474-dca1_8274 : 0
> compute4502-dca1_8274 : 0
> compute4532-dca1_8274 : 0
> compute4548-dca1_8274 : 0
> compute4716-dca1_8274 : 0
> compute4764-dca1_8274 : 0
> compute4817-dca1_8274 : 0
> compute4873-dca1_8274 : 0
> compute4887-dca1_8274 : 0
> compute4900-dca1_8274 : 0
> compute4924-dca1_8274 : 0
> compute4962-dca1_8274 : 0
> compute4966-dca1_8274 : 0
> compute4967-dca1_8274 : 0
> compute4980-dca1_8274 : 0
> compute4994-dca1_8274 : 0
> compute4998-dca1_8274 : 0
> compute5303-dca1_8274 : 0
> compute5338-dca1_8274 : 0
> compute5659-dca1_8274 : 0
> compute5661-dca1_8274 : 0
> compute5675-dca1_8274 : 0
> compute5698-dca1_8274 : 0
> compute5710-dca1_8274 : 0
> compute5933-dca1_8274 : 0
> compute5978-dca1_8274 : 0
> compute6011-dca1_8274 : 0
> compute6034-dca1_8274 : 0
> compute6089-dca1_8274 : 0
> compute6269-dca1_8274 : 0
> compute6339-dca1_8274 : 0
> compute6358-dca1_8274 : 0
> compute6366-dca1_8274 : 0
> compute6432-dca1_8274 : 0
> compute6716-dca1_8274 : 0
> compute6717-dca1_8274 : 0
> compute6767-dca1_8274 : 0
> compute6791-dca1_8274 : 0
> compute6825-dca1_8274 : 0
> compute6892-dca1_8274 : 0
> compute68-dca1_8274 : 0
> compute6905-dca1_8274 : 0
> compute6937-dca1_8274 : 0
> compute6992-dca1_8274 : 0
> compute6994-dca1_8274 : 0
> compute7029-dca1_8274 : 0
> compute7179-dca1_8274 : 0
> compute73-dca1_8274 : 0
> compute7582-dca1_8274 : 0
> compute7586-dca1_8274 : 0
> compute7601-dca1_8274 : 0
> compute7614-dca1_8274 : 0
> compute7700-dca1_8274 : 0
> compute7832-dca1_8274 : 0
> compute7837-dca1_8274 : 0
> compute8696-dca1_8274 : 0
> compute8697-dca1_8274 : 0
> compute8786-dca1_8274 : 0
> compute8864-dca1_8274 : 0
> compute8868-dca1_8274 : 0
> mpdocker01-dca1_8274 : 0
> mpdocker02-dca1_8274 : 0
> mpdocker03-dca1_8274 : 0
> mpdocker04-dca1_8274 : 0
> mpdocker05-dca1_8274 : 0
> mpdocker06-dca1_8274 : 0
> mpdocker07-dca1_8274 : 0
> mpdocker08-dca1_8274 : 0
> mpdocker09-dca1_8274 : 0
> agent1601-dca1_8274 : 2
> agent201-dca1_8274 : 2
> agent1415-dca1_8274 : 3
> agent4605-dca1_8274 : 3
> agent5212-dca1_8274 : 3
> agent5236-dca1_8274 : 3
> agent5242-dca1_8274 : 3
> compute4763-dca1_8274 : 3
> compute4916-dca1_8274 : 3
> compute6933-dca1_8274 : 3
> compute6984-dca1_8274 : 3
> compute7713-dca1_8274 : 3
> agent2213-dca1_8274 : 5
> agent3394-dca1_8274 : 5
> agent3618-dca1_8274 : 5
> agent4574-dca1_8274 : 5
> agent4677-dca1_8274 : 5
> agent47-dca1_8274 : 5
> compute2824-dca1_8274 : 5
> compute3640-dca1_8274 : 5
> compute3861-dca1_8274 : 5
> compute7159-dca1_8274 : 5
> compute7600-dca1_8274 : 5
> compute7839-dca1_8274 : 5
> compute2985-dca1_8274 : 6
> compute3615-dca1_8274 : 6
> compute4692-dca1_8274 : 6
> agent2209-dca1_8274 : 8
> agent2214-dca1_8274 : 8
> compute3710-dca1_8274 : 8
> compute6329-dca1_8274 : 8
> agent5265-dca1_8274 : 9
> compute7746-dca1_8274 : 13
> agent5179-dca1_8274 : 14
> agent4548-dca1_8274 : 15
> agent3611-dca1_8274 : 20
> agent3721-dca1_8274 : 23
> compute3764-dca1_8274 : 23
> agent3989-dca1_8274 : 30
> agent4145-dca1_8274 : 51
> compute3781-dca1_8274 : 55
> agent2168-dca1_8274 : 60
> agent5352-dca1_8274 : 68
> agent3533-dca1_8274 : 78
> compute4857-dca1_8274 : 78
> compute2982-dca1_8274 : 110
> agent4552-dca1_8274 : 113
> appdocker1082-dca1_8274 : 135
> appdocker538-dca1_8274 : 137
> compute1620-dca1_8274 : 512
> None of the LIVEINSTANCES have any messages ->
> $ zkcli -h dlmzk ls /DLM/LIVEINSTANCES | awk -F \' '{print $2}' |
>   while read host; do
>     echo -n "$host : "
>     zkcli -h dlmzk ls /DLM/INSTANCES/$host/MESSAGES |
>       awk -F \' '{print $2}' | grep -v "^$" | wc -l
>   done | sort -nk 3
> agent1412-dca1_8274 : 0
> agent1584-dca1_8274 : 0
> agent2149-dca1_8274 : 0
> agent3435-dca1_8274 : 0
> agent3473-dca1_8274 : 0
> agent3564-dca1_8274 : 0
> agent3572-dca1_8274 : 0
> agent3677-dca1_8274 : 0
> agent4070-dca1_8274 : 0
> agent4096-dca1_8274 : 0
> agent6444-dca1_8274 : 0
> compute3045-dca1_8274 : 0
> compute3525-dca1_8274 : 0
> compute3678-dca1_8274 : 0
> compute4239-dca1_8274 : 0
> compute4305-dca1_8274 : 0
> compute4967-dca1_8274 : 0
> compute4980-dca1_8274 : 0
> compute6716-dca1_8274 : 0
> compute6992-dca1_8274 : 0
> {noformat}
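
The two listings above can be cross-checked mechanically. What follows is a
minimal sketch, not part of the original report: it assumes the "host : count"
output of both commands has been captured as text, and the helper names
(parse_counts, stale_instances) are hypothetical. It reports the non-live
instances that still hold messages, largest backlog first. The sample input is
a small excerpt of the output quoted above.

```python
from typing import Dict, Iterable, List, Tuple


def parse_counts(listing: str) -> Dict[str, int]:
    """Parse lines of the form 'host : count' into {host: count}.

    Lines that do not end in an integer count (for example the shell
    command itself) are skipped. A leading '>' quote marker is tolerated.
    """
    counts: Dict[str, int] = {}
    for raw in listing.splitlines():
        line = raw.strip().lstrip(">").strip()
        if " : " not in line:
            continue
        host, _, count = line.rpartition(" : ")
        count = count.strip()
        if count.isdigit():
            counts[host.strip()] = int(count)
    return counts


def stale_instances(all_counts: Dict[str, int],
                    live_hosts: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (host, count) for non-live instances with pending messages."""
    live = set(live_hosts)
    stale = [(h, c) for h, c in all_counts.items() if h not in live and c > 0]
    # Largest backlog first, so the worst offenders are at the top.
    return sorted(stale, key=lambda hc: hc[1], reverse=True)


if __name__ == "__main__":
    # Excerpt of the /DLM/INSTANCES listing quoted above.
    all_listing = """
    agent1412-dca1_8274 : 0
    agent1601-dca1_8274 : 2
    appdocker538-dca1_8274 : 137
    compute1620-dca1_8274 : 512
    """
    # Excerpt of the /DLM/LIVEINSTANCES listing quoted above.
    live_listing = """
    agent1412-dca1_8274 : 0
    """
    all_counts = parse_counts(all_listing)
    live_hosts = parse_counts(live_listing).keys()
    for host, count in stale_instances(all_counts, live_hosts):
        print(f"{host} : {count}")
```

Run against the full listings, this would give the set of instance names that
are candidates for cleanup, which may be easier to act on than the sorted raw
output.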
>
> Current Version: 0.8.2
> StateModel: OnlineOffline
> {code:java}
> ./helix-admin.sh -zkSvr dlmzk --listStateModel DLM OnlineOffline
> StateModelDefinition: { "id" : "OnlineOffline", "mapFields" : {
> "DROPPED.meta" : { "count" : "-1" }, "OFFLINE.meta" : { "count" : "-1" },
> "OFFLINE.next" : { "DROPPED" : "DROPPED", "ONLINE" : "ONLINE" },
> "ONLINE.meta" : { "count" : "R" }, "ONLINE.next" : { "DROPPED" : "OFFLINE",
> "OFFLINE" : "OFFLINE" } }, "listFields" : { "STATE_PRIORITY_LIST" : [
> "ONLINE", "OFFLINE", "DROPPED" ], "STATE_TRANSITION_PRIORITYLIST" : [
> "OFFLINE-ONLINE", "ONLINE-OFFLINE", "OFFLINE-DROPPED" ] }, "simpleFields" : {
> "INITIAL_STATE" : "OFFLINE" } }
> {code}
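
To make the question about dropped resources more concrete, here is a small
sketch that walks the state model definition printed above. It assumes that
each "<STATE>.next" map gives the next hop toward a target state, which is my
reading of the definition rather than something confirmed in this thread. The
function name transition_path is hypothetical. Under that assumption, a
replica that is ONLINE has to pass through OFFLINE before it can be DROPPED.

```python
import json

# State model definition copied from the helix-admin.sh output above.
STATE_MODEL_JSON = """
{ "id" : "OnlineOffline", "mapFields" : {
  "DROPPED.meta" : { "count" : "-1" }, "OFFLINE.meta" : { "count" : "-1" },
  "OFFLINE.next" : { "DROPPED" : "DROPPED", "ONLINE" : "ONLINE" },
  "ONLINE.meta" : { "count" : "R" }, "ONLINE.next" : { "DROPPED" : "OFFLINE",
  "OFFLINE" : "OFFLINE" } }, "listFields" : { "STATE_PRIORITY_LIST" : [
  "ONLINE", "OFFLINE", "DROPPED" ], "STATE_TRANSITION_PRIORITYLIST" : [
  "OFFLINE-ONLINE", "ONLINE-OFFLINE", "OFFLINE-DROPPED" ] }, "simpleFields" : {
  "INITIAL_STATE" : "OFFLINE" } }
"""


def transition_path(model, start, target):
    """Follow the '<STATE>.next' tables from start until target is reached.

    Assumption: model["mapFields"]["<STATE>.next"][target] is the next
    state to move to when heading from <STATE> toward target.
    """
    path = [start]
    seen = {start}
    current = start
    while current != target:
        next_table = model["mapFields"].get(current + ".next", {})
        if target not in next_table:
            raise ValueError("no route from %s to %s" % (current, target))
        current = next_table[target]
        if current in seen:
            raise ValueError("cycle detected at %s" % current)
        seen.add(current)
        path.append(current)
    return path


if __name__ == "__main__":
    model = json.loads(STATE_MODEL_JSON)
    print(" -> ".join(transition_path(model, "ONLINE", "DROPPED")))
    print(" -> ".join(transition_path(model, "OFFLINE", "ONLINE")))
```

If that reading is right, each drop of an ONLINE replica implies two
transitions, which would be consistent with messages accumulating under
instances that are no longer live to process them.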
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)