Hi Kishore,

I have sent a pull request to fix the first 2 issues: https://github.com/apache/helix/pull/297

The 3rd one requires a much larger change. It also no longer breaks any logic now that we have fixed the ephemeral node owner validation logic, so we think it can be scheduled for a future release.

Best Regards,
Jiajun

On Mon, Jan 7, 2019 at 3:57 PM Wang Jiajun <[email protected]> wrote:

> Resending. Reply to all.
>
> We can probably fix the first 2 issues within 2 weeks, considering the
> additional test and validation required.
> For issue 1, we can make the original reset into 2 methods. For new
> session handling, we should not interrupt. For client closing, we shall
> interrupt thread and shut down.
> For issue 2, we need to try catch for zookeeper NPE in addition.
>
> Issue 3 will take more time since we need to change both ZkClient and
> event handler. There may be some interfaces need to be updated. Moreover,
> it changes the current ZkClient behavior. So we'd better run it in the test
> environment for a longer time.
>
> With the ephemeral node's owner fixed, the 3rd issue does not impact
> correctness. So maybe we can plan for fixing the first 2 issues first? And
> then plan for the 3rd issue in the next release? If that's the case, we
> shall have a release candidate after 2 weeks.
>
> Best Regards,
> Jiajun
>
>
> On Mon, Jan 7, 2019 at 3:14 PM kishore g <[email protected]> wrote:
>
>> I think the pending issues are the ones that are affecting us. What does
>> it take to fix those issues?
>>
>> On Mon, Jan 7, 2019 at 2:54 PM Wang Jiajun <[email protected]>
>> wrote:
>>
>>> Hi Kishore,
>>>
>>> Hope you are doing well.
>>> Since last time we met to discuss potential ZkClient improvements in
>>> Helix, we have completed the fix of one issue. However, the resolving of
>>> the whole list will take more time, given Pinot is still waiting for the
>>> new release, I'd like to hear your opinion that whether we shall release
>>> 0.8.3 based on the current situation.
>>>
>>> Fixed issues:
>>>
>>>    1. For an Ephemeral node, the source of truth should be the owner
>>>    session Id instead of the node content.
>>>    This fixes the leader election issue we found in Pinot cluster.
>>>
>>> Pending issues:
>>>
>>>    1. ZkClient should not interrupt the callback handling during
>>>    session reestablishment or other reset logic. Interrupt for shutdown
>>>    should only happen when things are closed. For fixing this problem,
>>>    we need to think about how to handle thread leaking.
>>>    2. ZkConnection.getZookeeper() == null potentially cause
>>>    retryUntilConnect to terminate earlier than expected. Should keep
>>>    waiting for this error.
>>>    3. The ZkClient event should keep a session Id. The event processor
>>>    can discard expired event.
>>>
>>> Best Regards,
>>> Jiajun
>>>
>>
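
For readers following the thread, pending issue 3 (events carrying a session Id so the processor can discard expired ones) could look roughly like the sketch below. This is only an illustration of the idea, not the Helix implementation: `SessionTaggedEvent` and `SessionAwareProcessor` are hypothetical names that do not exist in Helix or ZkClient, and the sketch assumes the current session Id is pushed to the processor whenever a new session is established.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical event wrapper: records the session in which the event was created. */
final class SessionTaggedEvent {
    final long sessionId;
    final String description;

    SessionTaggedEvent(long sessionId, String description) {
        this.sessionId = sessionId;
        this.description = description;
    }
}

/** Hypothetical processor: handles an event only if it belongs to the current session. */
final class SessionAwareProcessor {
    private long currentSessionId;
    private final List<String> handled = new ArrayList<>();

    SessionAwareProcessor(long initialSessionId) {
        this.currentSessionId = initialSessionId;
    }

    /** Assumed to be called once a new session has been established. */
    synchronized void onNewSession(long newSessionId) {
        currentSessionId = newSessionId;
    }

    /** Returns true if the event was handled, false if it was discarded as expired. */
    synchronized boolean process(SessionTaggedEvent event) {
        if (event.sessionId != currentSessionId) {
            return false; // event was queued under an older session, so drop it
        }
        handled.add(event.description);
        return true;
    }

    /** Returns a copy of the descriptions of all handled events, in order. */
    synchronized List<String> handled() {
        return new ArrayList<>(handled);
    }
}
```

With this shape, an event queued before a session expiry is dropped by a session Id comparison instead of by interrupting the callback thread, which is why it interacts with pending issue 1.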
