[ 
https://issues.apache.org/jira/browse/HBASE-5353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207994#comment-13207994
 ] 

Nicolas Spiegelberg commented on HBASE-5353:
--------------------------------------------

There's a lot of discussion here, so I might have missed something... How is 
searching for the master a problem?  We already have bin/get-active-master.rb, 
a JRuby script that searches ZK for the current owner of the master lock.  
Combine that with 'bin/hbase start master --backup', and you have a full 
solution to let your monitoring scripts ping the correct server for UI 
information.  Master downtime is dominated by the ZK ephemeral node timeout 
rather than by process setup.
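
To make the monitoring side concrete, here is a rough sketch of a probe that
takes the hostname reported for the active master and checks its web UI.  The
names master_ui_url and ui_is_up are made up for illustration, the info port
60010 is an assumption (use whatever hbase.master.info.port is set to), and
the call that actually runs bin/get-active-master.rb is left out because the
invocation depends on your install.

```python
import http.client
from urllib.error import URLError
from urllib.request import urlopen

# Assumption: the master web UI listens on 60010; adjust to your configuration.
MASTER_INFO_PORT = 60010


def master_ui_url(active_master_host, port=MASTER_INFO_PORT):
    """Build the web UI URL for the host that currently owns the master lock."""
    host = active_master_host.strip()
    if not host:
        raise ValueError("no active master reported")
    return "http://%s:%d/" % (host, port)


def ui_is_up(url, timeout=5.0):
    """Return True if the UI answers the request without an HTTP error."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

A monitoring script would feed the script's output into master_ui_url and
alert when ui_is_up returns False for longer than the ZK session timeout.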

The common 'annoyance' that we see internally is on the developer side (not 
the monitoring side).  You see that servers are down, so you check out the web 
UI.  The master went down, a backup master took over, then auto-remediation 
restarted the downed server in backup mode.  This means that the UI is 
inaccessible from the normal location.  DNS propagation takes a lot longer 
than restarting a process, so that's not really an option for us.  Because of 
this, I think a more important feature is to have the backup masters set up a 
web server with an HTTP redirect to the active master's UI.
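
As a rough illustration of the redirect idea (not HBase code, just a
standalone sketch), a backup master could run something like the following.
The get_active_master_ui callable is hypothetical; it stands in for whatever
lookup returns the active master's UI base URL, for example a ZK read like
the one bin/get-active-master.rb does.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_redirect_handler(get_active_master_ui):
    """Build a handler that redirects every GET to the active master's UI.

    get_active_master_ui is a hypothetical callable that returns a base URL
    such as "http://master2.example.com:60010", or None if no master is found.
    """

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            base = get_active_master_ui()
            if base is None:
                self.send_error(503, "No active master found")
                return
            # Keep the requested path so links into the UI still work.
            self.send_response(302)
            self.send_header("Location", base.rstrip("/") + self.path)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    return RedirectHandler


def start_redirect_server(port, get_active_master_ui):
    """Start the redirecting web server on a daemon thread and return it."""
    server = HTTPServer(("", port), make_redirect_handler(get_active_master_ui))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The sketch uses a temporary redirect (302) rather than a permanent one on
purpose: the active master can change again, so browsers should not cache
the target.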
                
> HA/Distributed HMaster via RegionServers
> ----------------------------------------
>
>                 Key: HBASE-5353
>                 URL: https://issues.apache.org/jira/browse/HBASE-5353
>             Project: HBase
>          Issue Type: Improvement
>          Components: master, regionserver
>    Affects Versions: 0.94.0
>            Reporter: Jesse Yates
>            Priority: Minor
>
> Currently, the HMaster node(s) must be considered a 'special' node (though 
> not a single point of failure), meaning that the node must be protected more 
> than the other cluster machines or at least specially monitored. Minimally, 
> we always need to ensure that the master is running, rather than letting the 
> system handle that internally. It should be possible to instead have the 
> HMaster be much more available, either in a distributed sense (meaning a bit 
> of a rewrite) or multiple, dynamically created instances combined with the hot 
> fail-over of masters. 
