[ https://issues.apache.org/jira/browse/SLIDER-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633382#comment-15633382 ]
Gour Saha commented on SLIDER-1161:
-----------------------------------

[~sumit.nigam] I would encourage you to take this up and provide a fix if you already have a solution or are working on one. We can make you a contributor and assign the JIRA to you.

> Improve regionserver status check in HBase Slider app package
> -------------------------------------------------------------
>
>                 Key: SLIDER-1161
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1161
>             Project: Slider
>          Issue Type: Improvement
>          Components: app-package
>    Affects Versions: Slider 0.80
>         Environment: RHEL-6 (64 Bit)
>            Reporter: Sandeep Nemuri
>
> *PROBLEM* :
> Using Slider to launch HBase containers.
> Following is the problem statement and details:
> 1. Assume a region server goes into a long pause and loses its heartbeat
> with ZooKeeper.
> 2. HMaster notices this and marks the region server as DEAD.
> 3. However, the Slider agent continues to 'ps' the region server process
> every heartbeat.monitor.interval (45000ms in my case), and because it only
> checks that the region server process is alive, it does not consider it dead.
> 4. After the long pause, the region server finally recovers and reports back
> to HMaster.
> 5. HMaster responds to the region server with YouAreAlreadyDeadException.
> 6. The region server now brings itself down, and Slider notices that the
> process is no longer running.
> 7. Slider now launches a new region server.
> The issue, as described in the steps above, is that there can be a huge
> delay between steps 4 and 6. During that time we are operating with fewer
> region servers, which puts more and more load on the remaining ones.
> The issue can be solved if Slider syncs up with HMaster to find out whether
> a region server is alive or not. That way, it would know immediately that
> HMaster has already marked a region server as dead, and could then bring
> down that region server and launch a new one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
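
The check proposed in the description (treat a region server as failed when HMaster has already marked it dead, even though its process is still running) could be sketched roughly as below. This is only an illustration, not the Slider agent's actual code: `RegionServerStatus`, `check_region_server` and the `fetch_dead_servers` callable are hypothetical names, and how the dead-server list is obtained from HMaster is deliberately left to the caller.

```python
from enum import Enum
from typing import Callable, Iterable


class RegionServerStatus(Enum):
    LIVE = "live"                      # process running, master has not marked it dead
    PROCESS_GONE = "process_gone"      # what the existing 'ps' check already detects
    DEAD_ON_MASTER = "dead_on_master"  # process running, but master marked it dead


def check_region_server(
    process_alive: bool,
    server_name: str,
    fetch_dead_servers: Callable[[], Iterable[str]],
) -> RegionServerStatus:
    """Combine the local process check with the master's view.

    fetch_dead_servers is a hypothetical hook that returns the names of the
    servers HMaster currently considers dead. It is an assumption of this
    sketch, not an existing Slider or HBase function.
    """
    if not process_alive:
        return RegionServerStatus.PROCESS_GONE

    try:
        dead = set(fetch_dead_servers())
    except Exception:
        # Master unreachable: fall back to the process-only result rather
        # than restarting a region server that may be healthy.
        return RegionServerStatus.LIVE

    if server_name in dead:
        return RegionServerStatus.DEAD_ON_MASTER
    return RegionServerStatus.LIVE
```

A caller would treat `DEAD_ON_MASTER` the same way as `PROCESS_GONE`: stop the stale region server and request a replacement, instead of waiting for the paused process to recover and shut itself down. Falling back to `LIVE` when the master cannot be queried is a design choice of the sketch, so that a master outage does not trigger region server restarts.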