[ 
https://issues.apache.org/jira/browse/HBASE-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882237#comment-13882237
 ] 

Feng Honghua commented on HBASE-10296:
--------------------------------------

sorry for the late reply :-)

[~lhofhansl], [~apurtell] : thanks for clarifying [~lhofhansl]'s proposal that 
'all servers serve both master and regionserver roles as needed'; I now 
understand what he really meant, and it's really interesting :-). But if we 
eventually decide to replace ZK with a consensus algorithm (Paxos, ZAB or Raft) 
running within the masters, it becomes less appealing to make all servers run 
both master and regionserver roles as needed:
* By this replacement, we reduce the total machine count from X masters 
(typically 3?) plus Y ZooKeepers (typically 3 or 5) to only Y masters (as with 
ZooKeeper, typically 3 or 5), by incorporating the master logic (create/remove 
tables, failover, balancing...) and the ZooKeeper logic (replicating states) 
into the same set of processes/servers. After this replacement and 
incorporation, the standby masters are not as 'standby' as before, when they 
just stayed idle waiting for the active master to die and then competed to take 
over, wasting the standby masters' machine resources most of the time; these 
standby masters now participate in consensus making, state 
replication/persistence, snapshotting, etc. We don't mind scheduling some 
separate machines for these tasks, just as we don't mind scheduling separate 
machines for the followers within a ZooKeeper ensemble and never think about 
reusing the follower/standby ZooKeeper servers as regionservers, right?
* It's preferable to keep the membership of the consensus algorithm fixed, 
within a small set of pre-designated machines (as sketched below). Supporting 
dynamic membership for the consensus algorithm can noticeably complicate the 
overall design/implementation, and that is what we would need if we permitted 
every server of an HBase cluster to play the master role.
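
To make the fixed-membership point concrete, a minimal sketch of what a 
pre-designated master quorum could look like; this is only my assumption for 
illustration, the config value format, the class and all names here are 
invented and are not existing HBase APIs:

{code:java}
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch only: the consensus group is a fixed, pre-designated set
// of master addresses read once at startup and never changed at runtime.
public final class FixedMasterQuorum {
  private final List<String> members; // e.g. "host1:16000,host2:16000,host3:16000"

  public FixedMasterQuorum(String quorumSetting) {
    this.members =
        Collections.unmodifiableList(Arrays.asList(quorumSetting.split(",")));
  }

  public List<String> members() {
    return members; // fixed for the lifetime of the cluster, no joins/leaves
  }

  public int majority() {
    return members.size() / 2 + 1; // simple majority needed to commit a change
  }
}
{code}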

I also agree with you two that a fully P2P architecture like Dynamo/Cassandra 
has a much more difficult time handling failover.

bq. What would the steps involved moving off zk to a group of masters keeping 
consensus look like?
[~stack] : the rough steps I can think of for now are as below:
# implement a robust/reliable consensus lib (Paxos/ZAB/Raft)
# redesign the master based on this consensus lib. Then we no longer need to 
write out the HBase/master states, such as region-assign status, replication 
info and table info, to a far-away/outside persistent/reliable storage such as 
ZooKeeper or another HBase system table; we just replicate them among the 
masters, and the master itself is the only truth about these states (a rough 
sketch of what this could look like follows this list).
# HBASE-5487 aims for a master redesign that stores the states in a system 
table. Though it can avoid the state-maintenance problems derived from events 
missed by ZooKeeper's watch/notify mechanism, it still has the problem of 
keeping/maintaining the truth in two different places (master memory and the 
system table), and we still need to be very careful in the implementation. 
It's always a headache when the state reader/maintainer (the master) and the 
state persistence layer are not on the same server;
# HBASE-10295 aims at moving replication info from ZooKeeper to another system 
table. If we achieve using a consensus lib within the master, we can represent 
replication info as just another in-memory data structure, rather than a 
separate system table. (Personally I don't think using ZooKeeper nodes to 
store replication info is as severe a problem as region-assign status, since 
the replication-aware logic is inherently idempotent: it only cares about the 
final state of some replication info when it changes, not about how it reached 
that final state (the state-transition process). On the contrary, 
region-assignment logic is more like a state machine: it does care that a 
state is transitioned from a valid previous state (it looks like a transition 
from an invalid state when some event/state is missed), otherwise the code can 
be pretty tricky and hard to understand/maintain; a sketch contrasting the two 
follows this list. Another concern with moving replication info from ZooKeeper 
to another system table is that it's hard for an HBase table to represent a 
deep, tree-like hierarchical structure: an HBase table can naturally represent 
no more than a 3-layer structure (via row + cf + qualifier), whereas ZooKeeper 
and in-memory data structures have no such layer limit; this is also sketched 
after the list.)
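
For steps 1 and 2 above, a minimal sketch of what a log-replication style 
consensus interface plus a master-side replicated state machine could look 
like; this is only my rough assumption, ConsensusLog, RegionStateMachine and 
every method name here are invented for illustration, not existing HBase or 
library APIs:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical consensus-lib interface (Paxos/ZAB/Raft behind it): the caller
// appends a command; once a majority of masters accept it, apply() is invoked
// on every master in the same order.
interface ConsensusLog {
  void append(byte[] command);
  void registerApplier(java.util.function.Consumer<byte[]> applier);
}

// Master-side replicated state machine for region-assign status: the in-memory
// map, identical on the active master and every standby, is the only truth;
// no ZooKeeper znode or system table is written.
final class RegionStateMachine {
  enum State { OFFLINE, OPENING, OPEN, CLOSING, CLOSED }

  private final Map<String, State> regionStates = new ConcurrentHashMap<>();
  private final ConsensusLog log;

  RegionStateMachine(ConsensusLog log) {
    this.log = log;
    this.log.registerApplier(this::apply);
  }

  // Called on the active master when it wants to move a region to a new state.
  void propose(String regionName, State next) {
    log.append((regionName + "=" + next.name()).getBytes());
  }

  // Called on every master, active and standby, in identical log order, so all
  // masters converge to the same in-memory picture and failover needs no rebuild.
  private void apply(byte[] command) {
    String[] kv = new String(command).split("=");
    regionStates.put(kv[0], State.valueOf(kv[1]));
  }
}
{code}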
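On the idempotence point in step 4, a hedged sketch of the difference, with all 
class and method names invented for illustration: replication info can be 
applied blindly (last state wins), whereas a region-assignment transition must 
be validated against the previous state, so a missed event shows up as an 
"impossible" transition:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: contrasts why replication info tolerates missed intermediate
// updates while region assignment does not.
final class StateUpdateExamples {

  // Replication info is idempotent: overwriting with the latest value yields
  // the correct final state, so a missed intermediate change is harmless.
  private final Map<String, Boolean> peerEnabled = new ConcurrentHashMap<>();

  void applyReplicationPeerState(String peerId, boolean enabled) {
    peerEnabled.put(peerId, enabled); // last write wins, no history needed
  }

  // Region assignment is a state machine: a transition is only legal from a
  // valid previous state.
  enum RegionState { OFFLINE, OPENING, OPEN, CLOSING, CLOSED }

  private final Map<String, RegionState> regionState = new ConcurrentHashMap<>();

  void applyRegionTransition(String region, RegionState from, RegionState to) {
    RegionState current = regionState.getOrDefault(region, RegionState.OFFLINE);
    if (current != from) {
      // With one-time watches a notification can be missed, and we land here.
      throw new IllegalStateException("Region " + region + " is " + current
          + ", cannot apply " + from + " -> " + to);
    }
    regionState.put(region, to);
  }
}
{code}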
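And on the hierarchy point, a rough sketch of why a deep ZooKeeper-style path 
fits awkwardly into an HBase table's row + cf + qualifier layout; the example 
path and the flattening scheme are illustrative assumptions only, not how 
HBASE-10295 actually lays out its table:

{code:java}
// Sketch only: a deep, tree-like path such as
//   /hbase/replication/peers/<peerId>/rs/<regionserver>/<queueId>/<wal>
// has more levels than the three natural coordinates of an HBase cell
// (row, column family, qualifier), so the extra levels must be flattened into
// one coordinate (here, packed into the row key with a separator).
final class TreePathToCell {
  static String[] toCell(String peerId, String regionserver, String queueId, String wal) {
    String row = peerId + "-" + regionserver + "-" + queueId; // 3 levels squeezed into the row key
    String family = "wals";                                   // fixed column family
    String qualifier = wal;                                   // leaf level
    return new String[] { row, family, qualifier };
  }
}
{code}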

> Replace ZK with a paxos running within master processes to provide better 
> master failover performance and state consistency
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-10296
>                 URL: https://issues.apache.org/jira/browse/HBASE-10296
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: master, Region Assignment, regionserver
>            Reporter: Feng Honghua
>
> Currently the master relies on ZK to elect the active master, monitor liveness 
> and store almost all of its states, such as region states, table info, 
> replication info and so on. And ZK also serves as a channel for 
> master-regionserver communication (such as in region assignment) and 
> client-regionserver communication (such as replication state/behavior changes). 
> But ZK as a communication channel is fragile due to its one-time watches and 
> asynchronous notification mechanism, which together can lead to missed 
> events (hence missed messages); for example, the master must rely on the state 
> transition logic's idempotence to maintain the region-assignment state 
> machine's correctness. Actually, almost all of the trickiest inconsistency 
> issues can trace their root cause back to the fragility of ZK as a 
> communication channel.
> Replacing ZK with Paxos running within the master processes has the following 
> benefits:
> 1. Better master failover performance: all masters, whether the active one or 
> the standby ones, have the same latest states in memory (except lagging ones, 
> which can eventually catch up later on). Whenever the active master dies, the 
> newly elected active master can immediately play its role without such 
> failover work as rebuilding its in-memory states by consulting the meta table 
> and ZK.
> 2. Better state consistency: the master's in-memory states are the only truth 
> about the system, which eliminates inconsistency from the very beginning. And 
> though the states are held by all masters, Paxos guarantees they are identical 
> at any time.
> 3. A more direct and simpler communication pattern: clients change state by 
> sending requests to the master, and the master and regionservers talk directly 
> to each other by sending requests and responses... none of them needs to go 
> through a third-party storage like ZK, which can introduce more uncertainty, 
> worse latency and more complexity.
> 4. ZK would only be used for liveness monitoring, to determine whether a 
> regionserver is dead, and later on we can eliminate ZK entirely once we build 
> heartbeats between master and regionserver.
> I know this might look like a very crazy re-architecture, but it deserves deep 
> thinking and serious discussion, right?



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
