[ 
https://issues.apache.org/jira/browse/HELIX-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15094393#comment-15094393
 ] 

Marco P. commented on HELIX-621:
--------------------------------

I'm still digging. There is something going on but I'm not clear on what 
exactly yet.

Here are some snippets from the spectator.

Creation:
{code}
HelixManagerFactory.getZKHelixManager(
                clusterName,
                id,
                InstanceType.SPECTATOR,
                zkConnectString);
{code}

Then a listener is added that just prints the set of live instances:
{code}
    @Override
    public void onLiveInstanceChange(List<LiveInstance> liveInstances, NotificationContext changeContext) {
        Set<String> liveInstanceIds = new HashSet<>();
        for (LiveInstance liveInstance : liveInstances) {
            liveInstanceIds.add(liveInstance.getInstanceName());
        }
        System.out.println("Live instances: " + liveInstanceIds);
    }
{code}

This works as expected most of the time:

bq. Live instances: [Participant-1, Participant-2, Participant-3]

Then I kill one, say, Participant-1, and I get a notification which prints:

bq. Live instances: [Participant-2, Participant-3]

However, in some cases after killing a participant, I do get a notification 
(meaning a watch fired!), but it still prints all 3 participants:

bq. Live instances: [Participant-1, Participant-2, Participant-3]

If I check in ZooKeeper, under LIVEINSTANCES, the node for the killed 
participant is gone.

So it seems that the watch fires correctly, but the notification still 
delivers a stale list rather than the most recent ZooKeeper state.
I cannot understand how this can happen, unless there is some caching somewhere 
that I'm not seeing.

Any ideas or pointers on what to look out for?


> Missing listener notification of LiveInstances changes (and possibly other 
> state change)
> ----------------------------------------------------------------------------------------
>
>                 Key: HELIX-621
>                 URL: https://issues.apache.org/jira/browse/HELIX-621
>             Project: Apache Helix
>          Issue Type: Bug
>          Components: helix-core
>    Affects Versions: 0.6.5
>            Reporter: Marco P.
>
> I noticed sometimes my LiveInstanceChangeListener was not notified of an 
> instance disconnecting.
> Digging a little bit I found out:
>  - A reliable way to consistently reproduce this problem
>  - The problem does not seem to be limited to LiveInstances, it can happen to 
> other listeners using the same strategy
> This is bad because an application relies on notifications, and its view of 
> the system (LiveInstances or otherwise) can get very outdated.
> The problem at the core is this logic:
> 1) Set watch W on some path P
> 2) Event E1 modifies P triggering W
> 3) The callback for W re-sets W on P
> If, however, a second event E2 modifies P between steps 2 and 3, W will not 
> trigger (until P is modified again).
> An example of why this is bad:
>  - 2 live instances L1, L2 and a spectator S watching them.
> 1) L1 disconnects
> 2) S's watch on LIVEINSTANCES fires
> 3) S reads the children of LIVEINSTANCES: {L2}
> 4) L2 disconnects
> 5) S notifies LiveInstanceChangeListeners and goes back to watching 
> LIVEINSTANCES
> The application receives a notification that the live instances now consist 
> of {L2}. 
> And no further notification until another instance joins.
> The reality is that no instances are live.
> Again, this is not limited to LIVEINSTANCES, although that's the one I can 
> reliably reproduce.
> Fixing this is not trivial: it requires firing the watch again when 
> re-setting it IF the version of the watched node has changed since the last 
> time the watch fired.
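
The race in the description, and the version check proposed as a fix, can be illustrated with a small self-contained simulation. This is only a sketch under stated assumptions: it does not use the Helix or ZooKeeper APIs, and `FakeNode`, `Spectator`, `betweenReadAndRearm`, and `recheckVersion` are hypothetical names invented for the example. The interleaving is forced deterministically with a hook rather than with real threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class WatchRaceSketch {

    /** Hypothetical stand-in for a watched parent node; not a ZooKeeper or Helix class. */
    static class FakeNode {
        private final Set<String> children = new TreeSet<>();
        private int version = 0;   // bumped on every change to the children
        private Runnable watch;    // at most one pending one-shot watch

        FakeNode(String... initial) {
            for (String child : initial) {
                children.add(child);
            }
        }

        void setWatch(Runnable w) { watch = w; }
        Set<String> readChildren() { return new TreeSet<>(children); }
        int version() { return version; }

        void remove(String child) {
            children.remove(child);
            version++;
            Runnable pending = watch;
            watch = null;              // one-shot: cleared before it fires
            if (pending != null) {
                pending.run();
            }
        }
    }

    /** Hypothetical spectator following steps 1-3 of the issue description. */
    static class Spectator {
        private final FakeNode node;
        private final boolean recheckVersion;
        final List<Set<String>> notifications = new ArrayList<>();
        // Hook used only to inject a second event while no watch is set.
        Runnable betweenReadAndRearm = () -> { };

        Spectator(FakeNode node, boolean recheckVersion) {
            this.node = node;
            this.recheckVersion = recheckVersion;
            node.setWatch(this::onWatch);
        }

        void onWatch() {
            Set<String> snapshot = node.readChildren();
            int seenVersion = node.version();
            Runnable hook = betweenReadAndRearm;
            betweenReadAndRearm = () -> { };   // inject the extra event only once
            hook.run();                        // E2 happens here, unobserved
            notifications.add(snapshot);       // listener sees the snapshot
            node.setWatch(this::onWatch);      // re-set the watch
            // Proposed fix: if the node changed while unwatched, fire again.
            if (recheckVersion && node.version() != seenVersion) {
                onWatch();
            }
        }
    }

    /** Runs the L1/L2 scenario and returns every notification the listener got. */
    static List<Set<String>> run(boolean recheckVersion) {
        FakeNode liveInstances = new FakeNode("L1", "L2");
        Spectator s = new Spectator(liveInstances, recheckVersion);
        s.betweenReadAndRearm = () -> liveInstances.remove("L2");
        liveInstances.remove("L1");
        return s.notifications;
    }

    public static void main(String[] args) {
        System.out.println("naive:         " + run(false));
        System.out.println("version check: " + run(true));
    }
}
```

Without the version check, the only notification recorded is `[L2]`, even though the simulated node ends up with no children. With the check enabled, a second notification carrying the empty set follows. Whether a comparable check can be added cleanly inside Helix's callback handling is a separate question that this sketch does not answer.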



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
