Santiago,

Thanks for the info. I will definitely explore your technique.
--ming

On May 4, 2013, at 11:32 AM, Santiago Perez <[email protected]> wrote:

> Hi Ming,
>
> We also have some issues when long GC pauses cause ZK expiration. In our use
> case we found a way to detect that the expiration had occurred by registering a
> ControllerChangeListener in HelixManager and catching a change notification
> of type INIT (NotificationContext.Type.INIT) after our service is already
> started (IOW when we see it for the second time).
>
> What we do when we detect this situation is call HelixManager.disconnect and
> then HelixManager.connect, essentially withdrawing and reconnecting the
> participant. This causes all the appropriate transitions to be triggered. Not
> sure if this would help your use case, but at least it gives you a way to
> intercept this behavior and take the necessary measures to keep your cluster
> in shape.
>
> While we're on the topic, I'd love to get a clear understanding of the
> expected set of transitions that should occur when a ZK session expires and a
> new one is created.
>
> Cheers,
> Santiago
>
>
> On Sat, May 4, 2013 at 11:29 AM, kishore g <[email protected]> wrote:
> Hi Ming,
>
> Need some more details:
> 1. How long was the GC, and what is the session timeout in ZK?
>
> The behavior you are seeing is expected. Because of the GC pause the
> ZooKeeper session is lost, so we call the transitions that take the
> partition back to the OFFLINE state.
>
> What is the behavior you are looking for when there is a GC?
>
> a. You don't want to lose mastership? or
> b. It's OK to lose mastership but you don't want to become master again?
>
> One question regarding your application: is it possible for your
> application to recover after a long GC pause?
>
> I don't think this is related to HELIX-79; in that case there were
> consecutive GCs, and I think we have a patch for that issue.
>
> Thanks,
> Kishore G
>
>
> On Sat, May 4, 2013 at 6:32 AM, Ming Fang <[email protected]> wrote:
> We're experiencing a potentially showstopper issue with how Helix is dealing
> with very long GCs.
> Our system is using the Master Slave model.
> A simple test runs just the Master under extreme load, causing seconds of GC.
> Under the long GC condition the Master gets transitioned to Slave, then to
> Offline.
> After the GC, we get transitioned back to Slave, then to Master.
>
> I found this Jira that may be related: HELIX-79.
> We're scheduled to go live with our system next week.
> Are there any quick workarounds for this problem?
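
For reference, the detection logic Santiago describes above (treat a second INIT notification as a sign of session expiration, then disconnect and reconnect) could be sketched roughly as follows. This is only a minimal model of the decision logic, not Helix code: NotificationType and Connection are hypothetical stand-ins for NotificationContext.Type and for the HelixManager.disconnect / HelixManager.connect calls mentioned in the thread, and resetting the flag after a reconnect is an assumption about how the new connection behaves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for NotificationContext.Type from the thread.
enum NotificationType { INIT, CALLBACK, FINALIZE }

// Hypothetical stand-in for the two HelixManager calls named in the thread.
// (A real connect call may throw a checked exception; omitted here.)
interface Connection {
    void connect();
    void disconnect();
}

final class SessionExpiryDetector {
    private final Connection connection;
    private boolean initSeen = false;

    SessionExpiryDetector(Connection connection) {
        this.connection = connection;
    }

    // Intended to be called from the controller-change callback.
    // Returns true if this notification triggered a reconnect.
    synchronized boolean onControllerChange(NotificationType type) {
        if (type != NotificationType.INIT) {
            return false;               // only INIT is of interest
        }
        if (!initSeen) {
            initSeen = true;            // first INIT: normal startup
            return false;
        }
        // Second INIT: treated as "the session expired and was recreated".
        connection.disconnect();
        connection.connect();
        // Assumption: the fresh connection delivers its own first INIT,
        // so the flag is reset. Drop this line if that does not hold.
        initSeen = false;
        return true;
    }
}

final class ReconnectDemo {
    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        Connection fake = new Connection() {
            @Override public void connect() { calls.add("connect"); }
            @Override public void disconnect() { calls.add("disconnect"); }
        };
        SessionExpiryDetector detector = new SessionExpiryDetector(fake);
        detector.onControllerChange(NotificationType.INIT);     // startup
        detector.onControllerChange(NotificationType.CALLBACK); // ignored
        detector.onControllerChange(NotificationType.INIT);     // reconnect
        System.out.println(calls); // prints [disconnect, connect]
    }
}
```

In a real participant the callback would extract the type from the notification context and the Connection would wrap the actual manager. Whether it is safe to reconnect from inside the callback thread is not covered by this sketch.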
