Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "Hbase/MasterRewrite" page has been changed by stack.
http://wiki.apache.org/hadoop/Hbase/MasterRewrite?action=diff&rev1=9&rev2=10

--------------------------------------------------

  * [[#scope|Design Scope]]
  * [[#design|Design]]
   * [[#all|Move all region state transitions to zookeeper]]
+   * [[#clean|Region State changes are clean, minimal, and comprehensive]]
+   * [[#state|Cluster/Table State Changes via ZK]]
+   * [[#schema|Schema]]
   * [[#distinct|In Zookeeper, a State and a Schema section]]
-   * [[#clean|State changes are clean, minimal, and comprehensive]]
-   * [[#schema|Schema]]
   * [[#balancer|Load Assignment/Balancer]]
   * [[#root|Remove -ROOT-]]
   * [[#root|Remove Heartbeat]]
@@ -35, +36 @@
 <<Anchor(problems)>>
 == Problems with current Master ==
- There is a good list in the [[https://issues.apache.org/jira/secure/ManageLinks.jspa?id=12434794|Issue Links]] section of HBASE-1816.
+ There is a list in the [[https://issues.apache.org/jira/secure/ManageLinks.jspa?id=12434794|Issue Links]] section of HBASE-1816.
 <<Anchor(scope)>>
 == Design Scope ==
  1. Rewrite of Master is for HBase 0.21
  1. Design for:
-   1. Regionserver loading (TODO: These numbers don't make sense -- jgray do you remember what they were about?)
+   1. A cluster of 200 regionservers... (TODO: These numbers don't make sense -- jgray do you remember what they were about? See misc section below)
-    1. 200 regionservers
-    1. 32 outstanding wal logs per regionserver
-    1. 200 regions per regionserver being written to
-    1. 2GB or 30 hour full log roll
-    1. 10MB/sec write speed
-    1. 1.2M edits per 2G
-    1. 7k writes/second across cluster (?) -- whats this? Wrong.
-    1. 1.2M edits per 30 hours?
-    1. 100 writes/sec across cluster (?) -- Whats this? Wrong?
 <<Anchor(design)>>
 == Design ==
 <<Anchor(all)>>
 === Move all region state transitions to zookeeper ===
- Run state transitions by changing state in zookeeper rather than inside in Master.
+ Run state transitions by changing state in zookeeper rather than in Master.
- Keep up a region transition trail; regions move through states from unassigned to opening to open, etc. A region can't jump states as in going from unassigned to open.
+ Keep up a region transition trail; regions move through states from ''unassigned'' to ''opening'' to ''open'', etc. A region can't jump states, e.g. go straight from ''unassigned'' to ''open''.
+ 
+ Master (or client) moves regions between states. Watchers on RegionServers notice additions and act on them. Master (or client) can do transitions in bulk; e.g. assign a regionserver 50 regions to open on startup. The effect is that Master "pushes" work out to regionservers rather than waiting on them to heartbeat.
 
 A problem we have in the current master is that states do not form a circle. Once a region is open, master stops keeping state; region state is moved to the .META. table once assigned, with its condition checked periodically by .META. table scan. This makes for confusion and evil such as region double assignment, because there are race-condition potholes as we move from one system -- internal state maps in master -- to the other during the update to state in .META.
 
 Current thinking is to keep the whole region lifecycle up in zookeeper, but that won't scale. Postulate 100k regions -- 100TB at 1G regions -- each with two or three possible states, each state with watchers for change: that is too much to put in a zk cluster. TODO: how to manage transition from zk to .META.?
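Roughly, against the plain ZooKeeper java API the push could look like the sketch below. The znode paths, class and method names are illustrative only (they follow the example hierarchy in the State/Schema section below), not the actual implementation.

{{{
// Sketch only: Master writes a region into a regionserver's to_open queue;
// the regionserver keeps a child watch on that queue and reacts.
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class RegionAssignmentSketch {
  private final ZooKeeper zk;
  private final String toOpenQueue;  // e.g. /hbase/regionservers/to_open/<rs>

  RegionAssignmentSketch(ZooKeeper zk, String regionServerName) {
    this.zk = zk;
    this.toOpenQueue = "/hbase/regionservers/to_open/" + regionServerName;
  }

  // Master (or client) side: the write to zk is the message -- no RPC,
  // no waiting on a heartbeat.
  void assign(String regionName, byte[] extraInfo) throws Exception {
    zk.create(toOpenQueue + "/" + regionName, extraInfo,
        Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
  }

  // Regionserver side: the watcher fires on any queue change; re-read and act.
  void watchToOpenQueue() throws Exception {
    List<String> regions = zk.getChildren(toOpenQueue, new Watcher() {
      public void process(WatchedEvent event) {
        try {
          watchToOpenQueue();  // re-register the watch, pick up new regions
        } catch (Exception e) {
          // log/retry in real code
        }
      }
    });
    for (String region : regions) {
      // open the region, then advance its state znode (opening -> open)
    }
  }
}
}}}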
- <<Anchor(distinct)>>
- === In Zookeeper, a State and a Schema section ===
- Two locations in zk; one for schema and then one for state. No connection. For example, could have hierarchy in zk as follows:
- 
- {{{/hbase/tables/name/schema/{family1, family2}
- /hbase/tables/name/state/{attributes [read-only, enabled, nocompact, noflush]}
- /hbase/regionservers/openregions/{list of regions...}
- /hbase/regionserver/to_open/{list of regions....}
- /hbase/regionservers/to_close/{list of regions...}
- <<Anchor(clean)>>
- === State changes are clean, minimal, and comprehensive ===
+ === Region State changes are clean, minimal, and comprehensive ===
 Currently, moving a region from opening to open may involve a region compaction -- i.e. a change to content in the filesystem. Better if modification of filesystem content is only done when there is no question of ownership.
 
 In the current o.a.h.h.master.RegionManager.RegionState inner class, here are the possible states:
@@ -87, +71 @@
      private volatile boolean offlined = false;}}}
 It's incomplete.
+ <<Anchor(state)>>
+ === Cluster/Table State ===
+ Move cluster and table state to zookeeper.
+ 
+ Do shutdown via ZK.
+ 
+ Remove the table state change mechanism from master and instead update state in zk. Clients can set table state. Watchers in regionservers react to table state changes. This will make table state changes near instantaneous; e.g. setting a table read-only or disabling compactions/flushes, all without having to take the table offline.
+ 
+ 
+ <<Anchor(distinct)>>
+ === In Zookeeper, a State and a Schema section ===
+ Two locations in zk; one for schema and one for state. No connection between them. For example, could have a hierarchy in zk as follows:
+ 
+ {{{/hbase/tables/name/schema/{family1, family2}
+ /hbase/tables/name/state/{attributes [read-only, enabled, nocompact, noflush]}
+ /hbase/regionservers/openregions/{list of regions...}
+ /hbase/regionserver/to_open/{list of regions....}
+ /hbase/regionservers/to_close/{list of regions...}
+ 
+ 
+ 
 <<Anchor(schema)>>
 === Schema Edits ===
- Move Table Schema from .META.
+ Move Table Schema out of .META. Rather than storing the complete Schema in each region, store it once in ZK.
+ 
+ Edit Schema with tables online by making the change in ZK. Watchers inform watching RegionServers of changes. Make it so we can add families, amend table and family configurations, etc.
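A sketch of the table-state watch described above, again against the plain ZooKeeper java API. The state path follows the example hierarchy, and the plain-string encoding of the attributes is an assumption, not a settled format.

{{{
// Sketch only: a client or the master flips a table attribute by rewriting
// the state znode; each regionserver keeps a data watch on it and reacts
// at once, with no table offline/online cycle.
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class TableStateSketch {
  private final ZooKeeper zk;
  private final String statePath;  // e.g. /hbase/tables/<table>/state

  TableStateSketch(ZooKeeper zk, String tableName) {
    this.zk = zk;
    this.statePath = "/hbase/tables/" + tableName + "/state";
  }

  // Client or master side: mark the table read-only.
  void setReadOnly() throws Exception {
    zk.setData(statePath, "read-only".getBytes("UTF-8"), -1);  // -1 = any version
  }

  // Regionserver side: data watch on the state znode.
  void watchTableState() throws Exception {
    byte[] data = zk.getData(statePath, new Watcher() {
      public void process(WatchedEvent event) {
        try {
          watchTableState();  // re-register the watch and re-read the state
        } catch (Exception e) {
          // log/retry in real code
        }
      }
    }, null);
    if (new String(data, "UTF-8").contains("read-only")) {
      // reject writes, stop flushes/compactions, etc.
    }
  }
}
}}}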
 <<Anchor(balancer)>>
 === Region Assignment/Balancer ===
@@ -113, +119 @@
    * To Open Queue
     * Regionservers watch their own to open queues /hbase/rsopen/region(extra_info, which hlogs to replay or it’s a split, etc)
+  Safe-mode assignment
+   * Collect all regions to assign
+   * Randomize and assign out in bulk, one msg per RS
+   * NO MORE SAFE-MODE
+   * Region assignment is always
+    * Look at all regions to be assigned
+    * Make a single decision for the assignment of all of these
+ 
 <<Anchor(root)>>
 === Remove -ROOT- ===
 Remove -ROOT- from the filesystem; have it live only up in zk (possible now that the Region Historian feature has been removed).
@@ -135, +149 @@
 == Miscellaneous ==
  * At meetup we talked of moving .META. to zk and adding a getClosest to the zk code base. That's been punted on for now.
+  * At meetup we had design numbers, but they don't make sense now that we've lost the context
+   1. 200 regionservers
+   1. 32 outstanding wal logs per regionserver
+   1. 200 regions per regionserver being written to
+   1. 2GB or 30 hour full log roll
+   1. 10MB/sec write speed
+   1. 1.2M edits per 2G
+   1. 7k writes/second across cluster (?) -- what's this? Wrong.
+   1. 1.2M edits per 30 hours?
+   1. 100 writes/sec across cluster (?) -- What's this? Wrong?
- 
- Administrative functions
-  * Hadoop RPC listeners on Master and Regionservers -- master can now push messages
-  * Clients and Master can talk to RS
- 
- Safe-mode assignment
-  * Collect all regions to assign
-  * Randomize and assign out in bulk, one msg per RS
-  * NO MORE SAFE-MODE
-  * Region assignment is always
-   * Look at all regions to be assigned
-   * Make a single decision for the assignment of all of these
- 
- 
- No more ROOT
- 
- Worker pool region closing
-  * Parallel flushes
- 
- No more CHANGE TABLE STATE
-  * Process server shutdown after RS crash
-  * Separate META scan?
-   * To figure out regions on an RS
-   * Separate map of RS -> region
-   * Trade-off between two-writes during assignment
- 
- Table Schema Information
-  * Online schema edits?
-   * If complex, punt to 0.22
-   * Rather than storing with each region, stored once in ZK
- 
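The bulk, randomized assignment in the Region Assignment/Balancer notes above ("randomize and assign out in bulk, one msg per RS") is at its core a shuffle plus a round-robin split. A minimal sketch, with illustrative names only:

{{{
// Sketch only: shuffle the full region list and deal it out round-robin so
// that each regionserver gets a single bulk batch of regions to open.
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BulkAssignSketch {
  static Map<String, List<String>> plan(List<String> regions,
      List<String> regionServers) {
    Map<String, List<String>> plan = new HashMap<String, List<String>>();
    for (String rs : regionServers) {
      plan.put(rs, new ArrayList<String>());
    }
    List<String> shuffled = new ArrayList<String>(regions);
    Collections.shuffle(shuffled);
    for (int i = 0; i < shuffled.size(); i++) {
      plan.get(regionServers.get(i % regionServers.size())).add(shuffled.get(i));
    }
    return plan;  // one entry per regionserver: its single bulk "to open" message
  }
}
}}}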