Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "ZooKeeper/HBaseMeetingNotes" page has been changed by PatrickHunt.
http://wiki.apache.org/hadoop/ZooKeeper/HBaseMeetingNotes

--------------------------------------------------

New page:
== Meeting notes - ZooKeeper use in HBase ==

=== 10/23/2009 phunt, mahadev, jdcryans, stack ===
Reviewed issues identified by the HBase team. Also discussed how they might better 
troubleshoot when problems occur. Mentioned that they (hbase) should feel free 
to ping the zk user list if they have questions. JIRAs are great too if they 
have specific ideas or concerns.

The HBase team is going to create JIRAs for the issues they identified (especially improved logging).

Phunt also asked the HBase team to fill in some use cases describing how they are 
currently using zk. This will allow us (zk) to better understand their usage, 
provide better advice, vet their choices, and in some cases even test to verify 
behavior under real conditions.

Some questions identified:

 * Stack - Impression is that zk is too flaky and sensitive at the moment in 
hbase deploys (I'm talking particularly about the new-user experience). I'd 
think a single server with its Nk reads/second and fairly rare updates should 
be able to host a big hbase cluster with a session timeout of 10/20 seconds, but 
we are finding that we are recommending quorums of 3-5 on dedicated machines 
with 60-second sessions because masters and regionservers too often time out 
against zk. GC on RS can be a killer; pauses of tens of seconds. CMS helps 
some. What is the experience of other users of zk? Do you have 
recommendations for us regarding deployment? When fellas are having trouble in the 
field with session timeouts, how do you debug/approach this issue?

zk: The biggest issues we typically see are: GC causing the VM to stall (client or 
server; unfortunately we have no control over the JVM, and in hadoop I've heard they 
can see GC pauses of over 60 seconds), server disk IO causing stalls (sequential 
consistency means that reads have to wait for writes to complete), and network 
connectivity problems. JVM tools such as GC monitoring and 
jvisualvm may help to track these down. Over-subscribing a host (running high 
CPU/disk processes on the same box(es) as zookeeper) will obviously slow down 
its processing.

also see: https://issues.apache.org/jira/browse/ZOOKEEPER-545
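
For reference, a hypothetical sketch of where the session timeout comes from on 
the client side, using the standard ZooKeeper Java API. The connect string, 
timeout value, and class name are illustrative only; the server adjusts the 
requested timeout to fit its configured min/max bounds, and HBase exposes a 
corresponding setting (zookeeper.session.timeout) in its own configuration.

{{{
import java.io.IOException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only: connect string and timeout are made up.
public class SessionTimeoutExample {
    public static void main(String[] args) throws IOException, InterruptedException {
        String connect = "zk1:2181,zk2:2181,zk3:2181";
        int requestedTimeoutMs = 60000; // the 60 second figure mentioned above

        ZooKeeper zk = new ZooKeeper(connect, requestedTimeoutMs, new Watcher() {
            public void process(WatchedEvent event) {
                // Disconnected/Expired state changes arrive here; a client that
                // stalls longer than its session timeout (e.g. a long GC pause)
                // will see the session expire once it reconnects.
                System.out.println("Event: " + event);
            }
        });

        Thread.sleep(1000); // crude wait for the connection to come up
        System.out.println("Negotiated session timeout: "
                + zk.getSessionTimeout() + "ms");
        zk.close();
    }
}
}}}

Turning on GC logging on the client side (e.g. -verbose:gc -XX:+PrintGCDetails) 
is one way to correlate session expirations with pauses like the ones described 
above.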

 * hbase - We'd like to know what kind of limits a 3-node quorum has. What's 
a reasonable number of znodes, and how many reads and writes per second can it 
support? How many watches?

zk: This is easiest to see at 
http://hadoop.apache.org/zookeeper/docs/current/zookeeperOver.html#Performance 
however those numbers are for optimum conditions (a dedicated server-class host 
with sufficient memory and a dedicated spindle for the transaction log) and a 
large number of clients. With a smaller number of clients using synchronous 
operations you are probably limited more by network latency than anything else.
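
As an illustration of the latency point, a hypothetical sketch using the 
ZooKeeper Java API: synchronous reads pay a full network round trip each before 
the next request is issued, while asynchronous reads are pipelined on the 
session's connection. The paths and class name are made up.

{{{
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.AsyncCallback.DataCallback;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SyncVsAsyncReads {
    // Illustrative paths only.
    static final String[] PATHS = {"/demo/a", "/demo/b", "/demo/c"};

    // Synchronous: each getData blocks for a round trip, so a single client's
    // throughput is roughly bounded by 1/latency.
    static void readSync(ZooKeeper zk) throws Exception {
        for (String p : PATHS) {
            zk.getData(p, false, new Stat());
        }
    }

    // Asynchronous: all requests go out back-to-back; completions are
    // delivered later on the client's event thread.
    static void readAsync(ZooKeeper zk) throws InterruptedException {
        final CountDownLatch done = new CountDownLatch(PATHS.length);
        DataCallback cb = new DataCallback() {
            public void processResult(int rc, String path, Object ctx,
                                      byte[] data, Stat stat) {
                // rc == 0 means the read succeeded
                done.countDown();
            }
        };
        for (String p : PATHS) {
            zk.getData(p, false, cb, null);
        }
        done.await();
    }
}
}}}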

 * hbase - How expensive is a WatchEvent?  What if we had a watcher per region 
hosted by a regionserver?  Let's say up to 1k regions on a regionserver.  That's 
1k watchers.  Let's say a cluster of 1-200 regionservers.  That's 200k watchers 
the zk cluster needs to handle.

zk: Very inexpensive by design; the intent was to support large numbers of 
watchers. I've tested a single client creating 100k ephemeral nodes in session 
a, then session b setting sync watches on all nodes, then session c setting 
async watches on all nodes, then closing session a. This was all done on a 
single-core, 5-year-old laptop, and 200k watches were delivered in < 10 sec. 
Granted, that's all the zk cluster was doing, but it gives you some idea.
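
A scaled-down, hypothetical sketch of that test using the ZooKeeper Java API. 
The connect string, paths, and node count are illustrative only.

{{{
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WatchScaleSketch {
    public static void main(String[] args) throws Exception {
        final int N = 1000; // scaled down from the 100k used in the test above
        String connect = "localhost:2181"; // illustrative
        Watcher noop = new Watcher() { public void process(WatchedEvent e) {} };

        // Session a creates the ephemeral nodes.
        ZooKeeper sessionA = new ZooKeeper(connect, 30000, noop);
        for (int i = 0; i < N; i++) {
            sessionA.create("/watch-test-" + i, new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }

        // Session b sets a watch on every node via exists().
        final CountDownLatch deleted = new CountDownLatch(N);
        Watcher deleteWatcher = new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getType() == EventType.NodeDeleted) {
                    deleted.countDown();
                }
            }
        };
        ZooKeeper sessionB = new ZooKeeper(connect, 30000, noop);
        for (int i = 0; i < N; i++) {
            sessionB.exists("/watch-test-" + i, deleteWatcher);
        }

        // Closing session a removes its ephemerals; every watch fires once.
        sessionA.close();
        deleted.await();
        System.out.println(N + " NodeDeleted events delivered");
        sessionB.close();
    }
}
}}}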

 * hbase - Would like to have queues up in zk: queues of regions moving 
through state changes from unassigned to opening to open. Would also like to 
keep a total list of all regions and then list them by the regionserver they 
are assigned to. What would you suggest? (One possible approach is sketched 
just below.)
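
One commonly used pattern for queues in zk, sketched here as a hypothetical 
illustration rather than a recommendation recorded at the meeting, is the 
sequential-znode queue from the standard ZooKeeper recipes: producers create 
PERSISTENT_SEQUENTIAL children under a queue znode and consumers take the 
lowest-numbered child. The path and class name below are made up.

{{{
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only: queue path and payload format are made up.
public class RegionQueueSketch {
    private final ZooKeeper zk;
    private final String queuePath; // e.g. "/hbase/unassigned"

    public RegionQueueSketch(ZooKeeper zk, String queuePath) {
        this.zk = zk;
        this.queuePath = queuePath;
    }

    // Producer: enqueue a region by creating a sequential child znode.
    public String enqueue(byte[] regionInfo) throws Exception {
        return zk.create(queuePath + "/region-", regionInfo,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
    }

    // Consumer: read and remove the lowest-numbered child.
    public byte[] dequeue() throws Exception {
        List<String> children = zk.getChildren(queuePath, false);
        if (children.isEmpty()) {
            return null;
        }
        Collections.sort(children); // sequence suffixes sort lexicographically
        String head = queuePath + "/" + children.get(0);
        byte[] data = zk.getData(head, false, null);
        zk.delete(head, -1); // -1 skips the version check
        return data;
    }
}
}}}

A real implementation would also handle the race where two consumers grab the 
same head node (retry on NoNodeException) and would set a child watch rather 
than polling when the queue is empty.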


Background:

Below are some notes from our master rewrite design doc. It's from the section 
where we talk of moving all state and schemas to zk:

 * Tables are offlined, onlined, made read-only, and dropped (add a state that 
freezes flushes and compactions to facilitate snapshotting). Currently the HBase 
Master does this by messaging regionservers. Instead, move this state to zookeeper 
and let regionservers watch for changes and react. Allow that a cluster may have up 
to 100 tables. Tables are made of regions. There may be thousands of regions 
per table. A regionserver could be carrying a region from each of the 100 
tables. TODO: Should a regionserver have a table watcher or a watcher per region? 
(A sketch of this watch-and-react pattern follows these notes.)

 * Tables have schema. Tables are made of column families. Column families have 
schema/attributes. Column families can be added and removed. Currently the 
schema is written into a column in the .META. catalog family. Move all schema 
to zookeeper. Regionservers would have watchers on schema and would react to 
changes. TODO: A watcher per column family or a watcher per table or a watcher 
on the parent directory for schema?

 * Run region state transitions -- i.e. opening, closing -- by changing state 
in zookeeper rather than in Master maps as is currently done.

 * Keep up a region transition trail; regions move through states from 
unassigned to opening to open, etc. A region can't jump states, as in going from 
unassigned straight to open. (A sketch of enforcing this with versioned updates 
follows these notes.)

 * Master (or client) moves regions between states. Watchers on RegionServers 
notice changes and act on them. Master (or client) can do transitions in bulk; 
e.g. assign a regionserver 50 regions to open on startup. The effect is that the 
Master "pushes" work out to regionservers rather than waiting on them to heartbeat.

 * A problem we have in the current master is that states do not make a circle. 
Once a region is open, the master stops keeping account of the region's state; 
region state is now kept out in the .META. catalog table, with its condition 
checked periodically by a .META. table scan. State spanning two systems currently 
makes for confusion and evil such as region double assignment, because there are 
race condition potholes as we move from one system -- internal state maps in the 
master -- to the other during the update to state in .META. Current thinking is 
to keep the region lifecycle all up in zookeeper, but that won't scale. Postulate 
100k regions -- 100TB at 1G per region -- each with two or three possible states, 
each with watchers for state change. My guess is that this is too much to put in 
zk. TODO: how to manage the transition from zk to .META.?

 * State and Schema are distinct in zk. No interactions.
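
Relating to the watcher-granularity TODOs above, a hypothetical sketch of the 
watch-and-react pattern a regionserver might use on a single per-table state 
znode. ZooKeeper watches are one-shot, so the watch is re-registered each time 
it fires; the path and class name are made up.

{{{
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Illustrative only: a regionserver keeping an eye on one table-state znode.
public class TableStateWatcher implements Watcher {
    private final ZooKeeper zk;
    private final String statePath; // e.g. "/hbase/tables/t1/state" (made up)

    public TableStateWatcher(ZooKeeper zk, String statePath) {
        this.zk = zk;
        this.statePath = statePath;
    }

    // Read the current state and (re)register the watch in one call.
    public void start() throws KeeperException, InterruptedException {
        byte[] state = zk.getData(statePath, this, new Stat());
        react(state);
    }

    public void process(WatchedEvent event) {
        if (event.getType() == EventType.NodeDataChanged) {
            try {
                start(); // watches are one-shot: re-read and re-register
            } catch (Exception e) {
                // real code would handle connection loss / retries here
            }
        }
    }

    private void react(byte[] state) {
        // e.g. online/offline/make read-only the local regions of this table
    }
}
}}}

Relating to the transition-trail and double-assignment notes, a hypothetical 
sketch of using ZooKeeper's versioned setData as a compare-and-swap, so a region 
cannot skip states or be moved by two actors at once. The state encoding, path, 
and class name are made up.

{{{
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Illustrative only: advance a region's state znode only if it is still in
// the expected prior state and still at the version we last read.
public class RegionStateTransition {
    public static boolean transition(ZooKeeper zk, String regionPath,
                                     String expected, String next)
            throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        String current = new String(zk.getData(regionPath, false, stat));
        if (!current.equals(expected)) {
            return false; // would be an illegal jump, e.g. unassigned -> open
        }
        try {
            // setData fails with BadVersion if anyone else changed the node
            // since we read it, so two masters/clients cannot both "win".
            zk.setData(regionPath, next.getBytes(), stat.getVersion());
            return true;
        } catch (KeeperException.BadVersionException e) {
            return false; // lost the race; re-read and decide again
        }
    }
}
}}}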
