Santiago

Thanks for the info. I will definitely explore your technique.

--ming

On May 4, 2013, at 11:32 AM, Santiago Perez <[email protected]> wrote:

> Hi Ming,
> 
> We also have some issues when long GC pauses cause ZK expiration. In our use 
> case we found a way to detect the expiration had occurred by registering a 
> ControllerChangeListener in HelixManager and catching a change notification 
> of type INIT (NotificationContext.Type.INIT) after our service is already 
> started (IOW when we see it for the second time).
> 
> What we do when we detect this situation is call HelixManager.disconnect and 
> then HelixManager.connect, essentially withdrawing and reconnecting the 
> participant. This causes all the appropriate transitions to be triggered. Not 
> sure if this would help your use case but at least it gives you a way to 
> intercept this behavior and take the necessary measures to keep your cluster 
> in shape.
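
A minimal sketch of the detect-and-reconnect idea described above, not Santiago's actual code. To stay self-contained it does not use Helix classes: `Reconnectable` and `NotificationType` are hypothetical stand-ins for the `HelixManager.disconnect`/`HelixManager.connect` calls and for `NotificationContext.Type`, and `SessionExpiryDetector` is an invented name. In a real participant the `onControllerChange` method here would be driven from the registered ControllerChangeListener.

```java
/** Hypothetical stand-in for the two HelixManager calls used here. */
interface Reconnectable {
    void disconnect();
    void connect() throws Exception;
}

/** Hypothetical stand-in for NotificationContext.Type. */
enum NotificationType { INIT, CALLBACK, FINALIZE }

/**
 * Treats the first INIT as normal startup and any later INIT as a sign
 * that the ZK session expired and a new one was created.
 */
final class SessionExpiryDetector {
    private final Reconnectable manager;
    private boolean started = false;   // true once the first INIT was seen
    private int reconnects = 0;        // number of disconnect/connect cycles

    SessionExpiryDetector(Reconnectable manager) {
        this.manager = manager;
    }

    /** Call from the controller-change callback with the notification type. */
    synchronized void onControllerChange(NotificationType type) {
        if (type != NotificationType.INIT) {
            return;                    // only INIT is of interest
        }
        if (!started) {
            started = true;            // first INIT: service is starting up
            return;
        }
        // Second or later INIT: withdraw and rejoin the participant.
        manager.disconnect();
        try {
            manager.connect();
        } catch (Exception e) {
            throw new IllegalStateException("reconnect failed", e);
        }
        reconnects++;
    }

    synchronized int reconnectCount() {
        return reconnects;
    }
}
```

One thing to verify before relying on this: if reconnecting itself delivers a fresh INIT to the same listener, the `started` flag would need to be reset before `connect` to avoid a reconnect loop.
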
> 
> While we're on the topic, I'd love to get a clear understanding of the 
> expected set of transitions that should occur when the ZK session expires 
> and a new one is created. 
> 
> Cheers,
> Santiago
> 
> 
> On Sat, May 4, 2013 at 11:29 AM, kishore g <[email protected]> wrote:
> Hi Ming,
> 
> I need some more details:
> 1. How long was the GC, and what is the session timeout in ZK?
> 
> The behavior you are seeing is expected. When a GC pause causes the 
> ZooKeeper session to be lost, we invoke the transitions so that the 
> partition goes back to the OFFLINE state. 
> 
> What behavior are you looking for when there is a long GC pause?
> 
> a. You don't want to lose mastership? or
> b. It's OK to lose mastership, but you don't want to become master again?
> 
> One question regarding your application: can it recover after a long GC 
> pause?
> 
> I don't think this is related to HELIX-79. In that case there were 
> consecutive GCs, and I think we have a patch for that issue.
> 
> Thanks,
> Kishore G
> 
> 
> On Sat, May 4, 2013 at 6:32 AM, Ming Fang <[email protected]> wrote:
> We're experiencing a potentially showstopper issue with how Helix deals 
> with very long GCs.
> Our system uses the Master Slave model.
> A simple test runs just the Master under extreme load, causing 
> seconds of GC.
> Under the long GC condition, the Master gets transitioned to Slave and then 
> to Offline.
> After the GC, we get transitioned back to Slave and then to Master.
> 
> I found a Jira that may be related: HELIX-79.
> We're scheduled to go live with our system next week.
> Are there any quick workarounds for this problem?
> 
> 
> 
> 
