[
https://issues.apache.org/jira/browse/HELIX-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053171#comment-17053171
]
Junkai Xue commented on HELIX-822:
----------------------------------
[~craigmurphey] thanks for reporting this. How do you start your controller? If
you use the native Helix leader-election mode, then this would be a known issue.
We do have some fixes in later releases.
BTW, we have moved our issue tracking from Jira to GitHub issues for Helix.
> OnlineOffline cluster stops rebalancing
> ---------------------------------------
>
> Key: HELIX-822
> URL: https://issues.apache.org/jira/browse/HELIX-822
> Project: Apache Helix
> Issue Type: Bug
> Components: helix-core
> Affects Versions: 0.8.x
> Reporter: Craig Murphey
> Priority: Major
> Attachments: Screen Shot 2020-03-05 at 11.28.53 AM.png
>
>
> We recently upgraded our controller to use 0.8.4, then downgraded it back to
> 0.8.2. Since then, some time after a controller is elected master,
> we've seen our LiveInstanceChangeListener not get called for a live instance
> update.
> On the controller, we have a thread, spun up on controller start, that
> constantly logs the external state, and it sees the instance count decrease.
> At the time we would expect the listener to be notified, we do see a
> large number of znodes being created and deleted.
> !Screen Shot 2020-03-05 at 11.28.53 AM.png!
> Upon inspection of our instances with helix-admin.sh, we found that we have
> many more instances than live instances (20 live instances, 60-100
> instances). This is because we register the participant by hostname, which
> can change over time.
> Looking into these instances, we found that many of the non-live instances
> have many leftover messages.
> We are able to mitigate the issue by restarting the master controller
> manually.
> How do leftover instances affect the overall cluster health? Is it possible
> that the controller is trying to tell offline instances that their resource
> is dropped, which is preventing the controller from issuing the live instance
> change event?
> Here's a snapshot of what we saw in zk:
>
> {noformat}
> So, in DCA, there are a lot of messages in Zookeeper for instances that are
> not live ->
> $ zkcli -h dlmzk ls /DLM/INSTANCES | awk -F \' '{print $2}' | while read
> host; do echo -n "$host : "; zkcli -h dlmzk ls /DLM/INSTANCES/$host/MESSAGES
> | awk -F \' '{print $2}' | grep -v "^$" | wc -l ;done | sort -nk 3
> agent1016-dca1_8274 : 0
> agent1053-dca1_8274 : 0
> agent1100-dca1_8274 : 0
> agent1346-dca1_8274 : 0
> agent1397-dca1_8274 : 0
> agent1406-dca1_8274 : 0
> agent1412-dca1_8274 : 0
> agent1549-dca1_8274 : 0
> agent1558-dca1_8274 : 0
> agent1573-dca1_8274 : 0
> agent1584-dca1_8274 : 0
> agent211-dca1_8274 : 0
> agent2124-dca1_8274 : 0
> agent2148-dca1_8274 : 0
> agent2149-dca1_8274 : 0
> agent2153-dca1_8274 : 0
> agent2184-dca1_8274 : 0
> agent21-dca1_8274 : 0
> agent2287-dca1_8274 : 0
> agent2713-dca1_8274 : 0
> agent2763-dca1_8274 : 0
> agent27-dca1_8274 : 0
> agent2878-dca1_8274 : 0
> agent2900-dca1_8274 : 0
> agent2930-dca1_8274 : 0
> agent31-dca1_8274 : 0
> agent3372-dca1_8274 : 0
> agent3376-dca1_8274 : 0
> agent3435-dca1_8274 : 0
> agent3436-dca1_8274 : 0
> agent3473-dca1_8274 : 0
> agent3543-dca1_8274 : 0
> agent3564-dca1_8274 : 0
> agent3572-dca1_8274 : 0
> agent3601-dca1_8274 : 0
> agent3646-dca1_8274 : 0
> agent3647-dca1_8274 : 0
> agent3648-dca1_8274 : 0
> agent3651-dca1_8274 : 0
> agent3671-dca1_8274 : 0
> agent3677-dca1_8274 : 0
> agent3678-dca1_8274 : 0
> agent3699-dca1_8274 : 0
> agent3714-dca1_8274 : 0
> agent3726-dca1_8274 : 0
> agent3991-dca1_8274 : 0
> agent4070-dca1_8274 : 0
> agent4096-dca1_8274 : 0
> agent4121-dca1_8274 : 0
> agent4545-dca1_8274 : 0
> agent4581-dca1_8274 : 0
> agent4601-dca1_8274 : 0
> agent4612-dca1_8274 : 0
> agent4649-dca1_8274 : 0
> agent4650-dca1_8274 : 0
> agent4651-dca1_8274 : 0
> agent4664-dca1_8274 : 0
> agent4672-dca1_8274 : 0
> agent4678-dca1_8274 : 0
> agent46-dca1_8274 : 0
> agent4702-dca1_8274 : 0
> agent4722-dca1_8274 : 0
> agent4726-dca1_8274 : 0
> agent4729-dca1_8274 : 0
> agent4730-dca1_8274 : 0
> agent5233-dca1_8274 : 0
> agent5261-dca1_8274 : 0
> agent5284-dca1_8274 : 0
> agent63-dca1_8274 : 0
> agent6444-dca1_8274 : 0
> agent79-dca1_8274 : 0
> agent83-dca1_8274 : 0
> agent84-dca1_8274 : 0
> agent90-dca1_8274 : 0
> appdocker1204-dca1_8274 : 0
> appdocker1454-dca1_8274 : 0
> appdocker1858-dca1_8274 : 0
> appdocker1950-dca1_8274 : 0
> appdocker1966-dca1_8274 : 0
> appdocker1970-dca1_8274 : 0
> appdocker1985-dca1_8274 : 0
> appdocker2012-dca1_8274 : 0
> appdocker2046-dca1_8274 : 0
> appdocker255-dca1_8274 : 0
> appdocker30-dca1_8274 : 0
> appdocker507-dca1_8274 : 0
> appdocker568-dca1_8274 : 0
> appdocker580-dca1_8274 : 0
> appdocker61-dca1_8274 : 0
> appdocker661-dca1_8274 : 0
> appdocker693-dca1_8274 : 0
> appdocker77-dca1_8274 : 0
> appdocker791-dca1_8274 : 0
> appdocker874-dca1_8274 : 0
> appdocker909-dca1_8274 : 0
> appdocker949-dca1_8274 : 0
> compute1699-dca1_8274 : 0
> compute2072-dca1_8274 : 0
> compute228-dca1_8274 : 0
> compute2527-dca1_8274 : 0
> compute2541-dca1_8274 : 0
> compute2579-dca1_8274 : 0
> compute2608-dca1_8274 : 0
> compute2792-dca1_8274 : 0
> compute2822-dca1_8274 : 0
> compute2842-dca1_8274 : 0
> compute2849-dca1_8274 : 0
> compute2862-dca1_8274 : 0
> compute2928-dca1_8274 : 0
> compute2937-dca1_8274 : 0
> compute2946-dca1_8274 : 0
> compute295-dca1_8274 : 0
> compute2964-dca1_8274 : 0
> compute2999-dca1_8274 : 0
> compute3026-dca1_8274 : 0
> compute3045-dca1_8274 : 0
> compute3209-dca1_8274 : 0
> compute3217-dca1_8274 : 0
> compute3244-dca1_8274 : 0
> compute3247-dca1_8274 : 0
> compute3363-dca1_8274 : 0
> compute3373-dca1_8274 : 0
> compute3383-dca1_8274 : 0
> compute3385-dca1_8274 : 0
> compute3391-dca1_8274 : 0
> compute3413-dca1_8274 : 0
> compute3449-dca1_8274 : 0
> compute3452-dca1_8274 : 0
> compute3525-dca1_8274 : 0
> compute3526-dca1_8274 : 0
> compute3530-dca1_8274 : 0
> compute3546-dca1_8274 : 0
> compute3571-dca1_8274 : 0
> compute3584-dca1_8274 : 0
> compute3600-dca1_8274 : 0
> compute3621-dca1_8274 : 0
> compute3678-dca1_8274 : 0
> compute3691-dca1_8274 : 0
> compute3695-dca1_8274 : 0
> compute36-dca1_8274 : 0
> compute3750-dca1_8274 : 0
> compute3770-dca1_8274 : 0
> compute3809-dca1_8274 : 0
> compute3846-dca1_8274 : 0
> compute3857-dca1_8274 : 0
> compute3919-dca1_8274 : 0
> compute3985-dca1_8274 : 0
> compute4033-dca1_8274 : 0
> compute4036-dca1_8274 : 0
> compute4103-dca1_8274 : 0
> compute4141-dca1_8274 : 0
> compute4161-dca1_8274 : 0
> compute4191-dca1_8274 : 0
> compute4239-dca1_8274 : 0
> compute42-dca1_8274 : 0
> compute4305-dca1_8274 : 0
> compute4339-dca1_8274 : 0
> compute4396-dca1_8274 : 0
> compute4474-dca1_8274 : 0
> compute4502-dca1_8274 : 0
> compute4532-dca1_8274 : 0
> compute4548-dca1_8274 : 0
> compute4716-dca1_8274 : 0
> compute4764-dca1_8274 : 0
> compute4817-dca1_8274 : 0
> compute4873-dca1_8274 : 0
> compute4887-dca1_8274 : 0
> compute4900-dca1_8274 : 0
> compute4924-dca1_8274 : 0
> compute4962-dca1_8274 : 0
> compute4966-dca1_8274 : 0
> compute4967-dca1_8274 : 0
> compute4980-dca1_8274 : 0
> compute4994-dca1_8274 : 0
> compute4998-dca1_8274 : 0
> compute5303-dca1_8274 : 0
> compute5338-dca1_8274 : 0
> compute5659-dca1_8274 : 0
> compute5661-dca1_8274 : 0
> compute5675-dca1_8274 : 0
> compute5698-dca1_8274 : 0
> compute5710-dca1_8274 : 0
> compute5933-dca1_8274 : 0
> compute5978-dca1_8274 : 0
> compute6011-dca1_8274 : 0
> compute6034-dca1_8274 : 0
> compute6089-dca1_8274 : 0
> compute6269-dca1_8274 : 0
> compute6339-dca1_8274 : 0
> compute6358-dca1_8274 : 0
> compute6366-dca1_8274 : 0
> compute6432-dca1_8274 : 0
> compute6716-dca1_8274 : 0
> compute6717-dca1_8274 : 0
> compute6767-dca1_8274 : 0
> compute6791-dca1_8274 : 0
> compute6825-dca1_8274 : 0
> compute6892-dca1_8274 : 0
> compute68-dca1_8274 : 0
> compute6905-dca1_8274 : 0
> compute6937-dca1_8274 : 0
> compute6992-dca1_8274 : 0
> compute6994-dca1_8274 : 0
> compute7029-dca1_8274 : 0
> compute7179-dca1_8274 : 0
> compute73-dca1_8274 : 0
> compute7582-dca1_8274 : 0
> compute7586-dca1_8274 : 0
> compute7601-dca1_8274 : 0
> compute7614-dca1_8274 : 0
> compute7700-dca1_8274 : 0
> compute7832-dca1_8274 : 0
> compute7837-dca1_8274 : 0
> compute8696-dca1_8274 : 0
> compute8697-dca1_8274 : 0
> compute8786-dca1_8274 : 0
> compute8864-dca1_8274 : 0
> compute8868-dca1_8274 : 0
> mpdocker01-dca1_8274 : 0
> mpdocker02-dca1_8274 : 0
> mpdocker03-dca1_8274 : 0
> mpdocker04-dca1_8274 : 0
> mpdocker05-dca1_8274 : 0
> mpdocker06-dca1_8274 : 0
> mpdocker07-dca1_8274 : 0
> mpdocker08-dca1_8274 : 0
> mpdocker09-dca1_8274 : 0
> agent1601-dca1_8274 : 2
> agent201-dca1_8274 : 2
> agent1415-dca1_8274 : 3
> agent4605-dca1_8274 : 3
> agent5212-dca1_8274 : 3
> agent5236-dca1_8274 : 3
> agent5242-dca1_8274 : 3
> compute4763-dca1_8274 : 3
> compute4916-dca1_8274 : 3
> compute6933-dca1_8274 : 3
> compute6984-dca1_8274 : 3
> compute7713-dca1_8274 : 3
> agent2213-dca1_8274 : 5
> agent3394-dca1_8274 : 5
> agent3618-dca1_8274 : 5
> agent4574-dca1_8274 : 5
> agent4677-dca1_8274 : 5
> agent47-dca1_8274 : 5
> compute2824-dca1_8274 : 5
> compute3640-dca1_8274 : 5
> compute3861-dca1_8274 : 5
> compute7159-dca1_8274 : 5
> compute7600-dca1_8274 : 5
> compute7839-dca1_8274 : 5
> compute2985-dca1_8274 : 6
> compute3615-dca1_8274 : 6
> compute4692-dca1_8274 : 6
> agent2209-dca1_8274 : 8
> agent2214-dca1_8274 : 8
> compute3710-dca1_8274 : 8
> compute6329-dca1_8274 : 8
> agent5265-dca1_8274 : 9
> compute7746-dca1_8274 : 13
> agent5179-dca1_8274 : 14
> agent4548-dca1_8274 : 15
> agent3611-dca1_8274 : 20
> agent3721-dca1_8274 : 23
> compute3764-dca1_8274 : 23
> agent3989-dca1_8274 : 30
> agent4145-dca1_8274 : 51
> compute3781-dca1_8274 : 55
> agent2168-dca1_8274 : 60
> agent5352-dca1_8274 : 68
> agent3533-dca1_8274 : 78
> compute4857-dca1_8274 : 78
> compute2982-dca1_8274 : 110
> agent4552-dca1_8274 : 113
> appdocker1082-dca1_8274 : 135
> appdocker538-dca1_8274 : 137
> compute1620-dca1_8274 : 512
> None of the LIVEINSTANCES have any messages ->
> $ zkcli -h dlmzk ls /DLM/LIVEINSTANCES | awk -F \' '{print $2}' | while read
> host; do echo -n "$host : "; zkcli -h dlmzk ls /DLM/INSTANCES/$host/MESSAGES
> | awk -F \' '{print $2}' | grep -v "^$" | wc -l ;done | sort -nk 3
> agent1412-dca1_8274 : 0
> agent1584-dca1_8274 : 0
> agent2149-dca1_8274 : 0
> agent3435-dca1_8274 : 0
> agent3473-dca1_8274 : 0
> agent3564-dca1_8274 : 0
> agent3572-dca1_8274 : 0
> agent3677-dca1_8274 : 0
> agent4070-dca1_8274 : 0
> agent4096-dca1_8274 : 0
> agent6444-dca1_8274 : 0
> compute3045-dca1_8274 : 0
> compute3525-dca1_8274 : 0
> compute3678-dca1_8274 : 0
> compute4239-dca1_8274 : 0
> compute4305-dca1_8274 : 0
> compute4967-dca1_8274 : 0
> compute4980-dca1_8274 : 0
> compute6716-dca1_8274 : 0
> compute6992-dca1_8274 : 0
> {noformat}
>
> Current Version: 0.8.2
> StateModel: OnlineOffline
> {code:java}
> ./helix-admin.sh -zkSvr dlmzk --listStateModel DLM OnlineOffline
> StateModelDefinition: { "id" : "OnlineOffline", "mapFields" : {
> "DROPPED.meta" : { "count" : "-1" }, "OFFLINE.meta" : { "count" : "-1" },
> "OFFLINE.next" : { "DROPPED" : "DROPPED", "ONLINE" : "ONLINE" },
> "ONLINE.meta" : { "count" : "R" }, "ONLINE.next" : { "DROPPED" : "OFFLINE",
> "OFFLINE" : "OFFLINE" } }, "listFields" : { "STATE_PRIORITY_LIST" : [
> "ONLINE", "OFFLINE", "DROPPED" ], "STATE_TRANSITION_PRIORITYLIST" : [
> "OFFLINE-ONLINE", "ONLINE-OFFLINE", "OFFLINE-DROPPED" ] }, "simpleFields" : {
> "INITIAL_STATE" : "OFFLINE" } }
> {code}
>
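For anyone triaging a similar backlog, the per-instance counts in the quoted snapshot can be condensed into a short summary. The following is only a sketch: the function name `summarize_backlog` is hypothetical, and it assumes input lines in the `instance : count` form printed by the reporter's pipeline. It does not talk to ZooKeeper itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of Helix or zkcli): reads "instance : count"
# lines on stdin, lists instances with a non-zero message count, and prints
# a one-line total at the end.
summarize_backlog() {
  awk -F ' : ' '
    # Only consider well-formed lines whose second field is a plain integer.
    NF == 2 && $2 ~ /^[0-9]+$/ {
      total += $2
      if ($2 > 0) {
        stale++
        print $1 " has " $2 " pending message(s)"
      }
    }
    END {
      printf "%d instance(s) with leftover messages, %d message(s) total\n", stale, total
    }
  '
}

# Usage (assumption: the output of the pipeline from the report is piped in):
#   <pipeline from the report> | summarize_backlog
```

Piping the output of the first command in the report through this function would list only the instances that still hold messages, which may make it easier to compare against the live-instance list.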
--
This message was sent by Atlassian Jira
(v8.3.4#803005)