Resending. Reply to all.

We can probably fix the first 2 issues within 2 weeks, including the
additional testing and validation required.
For issue 1, we can split the original reset into 2 methods. For new session
handling, we should not interrupt the callback thread. For client closing,
we should interrupt the thread and shut down.
For issue 2, we additionally need a try/catch for the ZooKeeper NPE.
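
A rough sketch of the split for issue 1, to make the idea concrete. This is
not the actual ZkClient code: the class CallbackWorker and its method names
are hypothetical, and dropping queued callbacks on a session reset is an
assumption, since what the reset should do with pending work is still open.
The point is only that the reset path never interrupts, while the close path
interrupts and then joins so the thread does not leak.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for the ZkClient event thread; names are illustrative only.
final class CallbackWorker {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread thread;
    private volatile boolean closed = false;

    CallbackWorker() {
        thread = new Thread(this::runLoop, "callback-worker");
        thread.setDaemon(true);
        thread.start();
    }

    void submit(Runnable callback) {
        if (!closed) {
            queue.add(callback);
        }
    }

    // New session: no interrupt, so a callback that is already running can finish.
    // Assumption: callbacks still waiting in the queue are dropped.
    void resetForNewSession() {
        queue.clear();
    }

    // Client closing: interrupt the thread and wait for it to exit,
    // so that the thread is not leaked.
    void close() {
        closed = true;
        thread.interrupt();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        }
    }

    boolean isAlive() {
        return thread.isAlive();
    }

    private void runLoop() {
        try {
            while (!closed) {
                queue.take().run();
            }
        } catch (InterruptedException e) {
            // Interrupted by close(): leave the loop and let the thread end.
        }
    }
}
```

With this shape, session reestablishment would call resetForNewSession() and
only the shutdown path would call close().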

Issue 3 will take more time since we need to change both ZkClient and the
event handler. Some interfaces may need to be updated. Moreover, it changes
the current ZkClient behavior, so we'd better run it in the test
environment for a longer time.
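
For reference, the idea behind issue 3 could look roughly like the sketch
below. The types SessionEvent and SessionAwareProcessor are hypothetical and
do not correspond to existing Helix interfaces; the real change would touch
ZkClient and the event handler. Each event records the session Id it was
created under, and the processor skips events from any other session.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event type: carries the Id of the session it was created in.
final class SessionEvent {
    final long sessionId;
    final String description;

    SessionEvent(long sessionId, String description) {
        this.sessionId = sessionId;
        this.description = description;
    }
}

// Hypothetical processor: handles only events that belong to the current session.
final class SessionAwareProcessor {
    private volatile long currentSessionId;
    private final List<String> handled = new ArrayList<>();

    SessionAwareProcessor(long initialSessionId) {
        this.currentSessionId = initialSessionId;
    }

    // Called once a new session has been established.
    void onNewSession(long newSessionId) {
        currentSessionId = newSessionId;
    }

    // Returns true if the event was handled, false if it was discarded as expired.
    boolean process(SessionEvent event) {
        if (event.sessionId != currentSessionId) {
            return false; // event belongs to an older session
        }
        handled.add(event.description);
        return true;
    }

    List<String> handled() {
        return handled;
    }
}
```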

With the ephemeral node owner fix in place, the 3rd issue does not impact
correctness. So maybe we can fix the first 2 issues first and plan the 3rd
issue for the next release? If that's the case, we should have a release
candidate in about 2 weeks.

Best Regards,
Jiajun


On Mon, Jan 7, 2019 at 3:14 PM kishore g <[email protected]> wrote:

> I think the pending issues are the ones that are affecting us. What does
> it take to fix those issues?
>
> On Mon, Jan 7, 2019 at 2:54 PM Wang Jiajun <[email protected]> wrote:
>
>> Hi Kishore,
>>
>> Hope you are doing well.
>> Since we last met to discuss potential ZkClient improvements in
>> Helix, we have completed the fix for one issue. However, resolving the
>> whole list will take more time. Given that Pinot is still waiting for
>> the new release, I'd like to hear your opinion on whether we should
>> release 0.8.3 based on the current state.
>>
>> Fixed issues:
>>
>>    1. For an Ephemeral node, the source of truth should be the owner
>>    session Id instead of the node content.
>>    This fixes the leader election issue we found in Pinot cluster.
>>
>> Pending issues:
>>
>>    1. ZkClient should not interrupt the callback handling during session
>>    reestablishment or other reset logic. Interrupt for shutdown should only
>>    happen when things are closed. To fix this problem, we need to think
>>    about how to avoid leaking threads.
>>    2. ZkConnection.getZookeeper() == null can cause retryUntilConnect
>>    to terminate earlier than expected. It should keep waiting when this
>>    error occurs.
>>    3. The ZkClient event should keep a session Id. The event processor
>>    can then discard expired events.
>>
>> Best Regards,
>> Jiajun
>>
>
