[ 
https://issues.apache.org/jira/browse/SLIDER-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633382#comment-15633382
 ] 

Gour Saha commented on SLIDER-1161:
-----------------------------------

[~sumit.nigam] I would encourage you to take this up and provide a fix, 
whether you already have a solution or still need to work on one. We can make 
you a contributor and assign the JIRA to you.

> Improve regionserver status check in HBase Slider app package
> -------------------------------------------------------------
>
>                 Key: SLIDER-1161
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1161
>             Project: Slider
>          Issue Type: Improvement
>          Components: app-package
>    Affects Versions: Slider 0.80
>         Environment: RHEL-6 (64 Bit)
>            Reporter: Sandeep Nemuri
>
> *PROBLEM* :
> We are using Slider to launch HBase containers.
> The problem statement and details are as follows:
> 1. Assume a region server goes into a long pause and loses its heartbeat 
> with ZooKeeper.
> 2. HMaster notices this and marks the region server as DEAD.
> 3. However, the Slider agent continues to 'ps' the region server process 
> every heartbeat.monitor.interval (45000ms in my case), and because it only 
> checks that the region server process is alive, it does not consider it dead.
> 4. After that long pause, the region server finally recovers and contacts 
> HMaster.
> 5. HMaster responds to the region server with YouAreAlreadyDeadException.
> 6. The region server then brings itself down, and Slider notices that the 
> process is no longer running.
> 7. Slider now launches a new region server.
> The issue, as the steps above show, is that there can be a huge delay 
> between steps 4 and 6. During that time we are operating with fewer region 
> servers, which puts more load on the remaining region servers.
> The issue can be solved if Slider syncs up with HMaster to find out whether 
> the region server is alive or not. That way, it would know immediately that 
> HMaster has already marked a region server as dead, and could then bring 
> down the region server and launch a new one.
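
The proposed check could be sketched roughly as below. This is only an illustration of the idea, not the Slider agent's actual code: region_server_healthy and the fetch_dead_servers callable are hypothetical names, and how the dead-server list is obtained from HMaster is left open. The sketch assumes HBase's "host,port,startcode" server name format and, as a simplification, matches on host and port only; a real check would probably also compare the startcode so that an older incarnation on the same host and port is not mistaken for the current one.

```python
import errno
import os


def process_alive(pid):
    """Return True if a process with this pid exists (the current 'ps'-style check).

    POSIX only: signal 0 probes for existence without delivering a signal.
    """
    try:
        os.kill(pid, 0)
    except OSError as e:
        # EPERM means the process exists but belongs to another user.
        return e.errno == errno.EPERM
    return True


def region_server_healthy(pid, host, port, fetch_dead_servers):
    """Combine the process check with the master's view of the server.

    fetch_dead_servers is a hypothetical callable returning an iterable of
    "host,port,startcode" names that HMaster has marked as dead.
    """
    if not process_alive(pid):
        return False
    prefix = "%s,%d," % (host, port)
    # Assumption: host and port are enough to identify this region server.
    return not any(name.startswith(prefix) for name in fetch_dead_servers())
```

With a check like this, the agent would report the region server as failed at step 3, as soon as HMaster lists it as dead, instead of waiting for the process to exit at step 6.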



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
