[ https://issues.apache.org/jira/browse/HELIX-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414392#comment-16414392 ]
ASF GitHub Bot commented on HELIX-683:
--------------------------------------
GitHub user zhan849 opened a pull request:
https://github.com/apache/helix/pull/162
[HELIX-683] clean monitoring cache upon helix controller enable monitoring
In this PR I added methods to clear the monitoring records in the cache when
cluster status monitoring is enabled. I also added tests that reproduce the
following sequence: a resource loses its top state, the controller loses
leadership, the resource regains its top state, and the controller regains
leadership. This sequence causes a metrics reporting problem.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zhan849/helix harry/controller-monitor-cache-cleanup
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/helix/pull/162.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #162
----
commit 373da77547fa1ea4a39c760e80da75e9d453d4f5
Author: Harry Zhang <zhan849@...>
Date: 2018-03-26T19:14:07Z
[HELIX-683] clean monitoring cache upon helix controller enable monitoring
----
> Clean monitoring cache upon helix controller enable monitoring
> --------------------------------------------------------------
>
> Key: HELIX-683
> URL: https://issues.apache.org/jira/browse/HELIX-683
> Project: Apache Helix
> Issue Type: Bug
> Reporter: Hao Zhang
> Priority: Major
>
> We found a bug in cluster status reporting, specifically in the partition
> masterless duration. The root cause is that the duration is calculated from
> the controller cache, and this cache is currently not cleaned when leadership
> changes. As a result, if controller A starts a mastership handoff but is
> interrupted, the start time stays in the cache until the next mastership
> handoff on the same partition. The duration of that later handoff is then
> calculated from the stale start time, which can make it extremely large.
> To fix it, we could clean the cache when leadership changes.
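
The stale start time described in the issue can be illustrated with a minimal
sketch. This is not the Helix implementation: the class names
MissingTopStateCache and StaleStartTimeDemo, the partition name "db_0", and the
timestamps are all hypothetical, and the cache is reduced to a map from
partition to the time its top state went missing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the controller-side cache; not Helix's actual class.
class MissingTopStateCache {
    // partition name -> timestamp (ms) at which the top state went missing
    private final Map<String, Long> missingSince = new HashMap<>();

    // Called when a partition loses its top-state replica.
    void recordTopStateMissing(String partition, long nowMs) {
        // Keeps the earliest start time, which is how a stale entry survives.
        missingSince.putIfAbsent(partition, nowMs);
    }

    // Called when the top state is observed again. Returns the handoff
    // duration in ms, or -1 if no start time was recorded.
    long recordTopStateRecovered(String partition, long nowMs) {
        Long start = missingSince.remove(partition);
        return start == null ? -1L : nowMs - start;
    }

    // The proposed fix in sketch form: drop all records when monitoring is
    // enabled again, i.e. when this controller gains leadership.
    void clear() {
        missingSince.clear();
    }
}

class StaleStartTimeDemo {
    // Replays the scenario from the issue; clearOnLeadership toggles the fix.
    static long replay(boolean clearOnLeadership) {
        MissingTopStateCache cache = new MissingTopStateCache();

        // Controller A sees the top state go missing at t = 1,000 ms.
        cache.recordTopStateMissing("db_0", 1_000L);

        // Controller A loses leadership. The top state recovers under another
        // controller, so this cache never sees the recovery.

        // Controller A regains leadership and enables monitoring again.
        if (clearOnLeadership) {
            cache.clear();
        }

        // A later handoff on the same partition takes 500 ms.
        cache.recordTopStateMissing("db_0", 3_600_000L);
        return cache.recordTopStateRecovered("db_0", 3_600_500L);
    }

    public static void main(String[] args) {
        System.out.println("without clear: " + replay(false) + " ms");
        System.out.println("with clear:    " + replay(true) + " ms");
    }
}
```

With these assumed timestamps, the replay without the clear reports 3,599,500 ms
for a handoff that actually took 500 ms, because the duration is measured from
the entry left over from the interrupted handoff. Clearing the cache when
leadership is regained makes the same replay report 500 ms.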
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)