[ https://issues.apache.org/jira/browse/HBASE-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923372#action_12923372 ]
HBase Review Board commented on HBASE-2998:
-------------------------------------------

Message from: st...@duboce.net

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1057/
-----------------------------------------------------------

(Updated 2010-10-21 01:54:34.658192)

Review request for hbase, Jean-Daniel Cryans and Jonathan Gray.

Changes
-------

New patch includes faster assign of regions on startup (uses async create/exists-set-watcher). Getting this working helps w/ rolling restart tests. Assign and watcher set for 2k regions runs fast now... used to be 90 seconds for 2k regions over 10 servers ... now it's a matter of seconds for total bulk assign of all regions in just over a minute.

This patch is not yet ready. I need to test more.

Summary
-------

Fix 'hbase zkcli' so it reads zk ensemble location from hbase config/zoo.cfg. This fixes rolling restart. Patch also includes fix so rolling restarts work on new master.

A src/test/java/org/apache/hadoop/hbase/zookeeper/TestZooKeeperMainServerArg.java
  Test for new TZMSA class.
M src/main/java/org/apache/hadoop/hbase/zookeeper/ZKServerTool.java
  Minor edit of javadoc.
A src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperMainServerArg.java
  Tool to emit what ZooKeeperMain wants for a server argument.
M src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
  (isAbort): Added.
M src/main/java/org/apache/hadoop/hbase/regionserver/ShutdownHook.java
  Shutdown hook now needs to start up region shutdowns since new master
  changed how shutdown sequence runs.
M src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
  Don't do opens if server is stopped.
M src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java
  Minor formatting.
M bin/hbase
  Run new ZKMSA tool to figure '-server host:port' to pass ZKM
M bin/hbase-daemon.sh
  Make default wait be longer.

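For illustration only, the "emit what ZooKeeperMain wants for a server argument" step could be sketched roughly as below. This is not the code in the patch: the class name ZkServerArgSketch and the method toServerArg are hypothetical, and the sketch assumes the ZooKeeper settings are already available as zoo.cfg-style properties (clientPort plus server.N=host:peerPort:electionPort entries).

```java
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical sketch; names are illustrative and not taken from the patch.
public class ZkServerArgSketch {

    /**
     * Builds the "host:port" string that a '-server' flag expects, using the
     * first server.N entry (in sorted key order) and the clientPort setting.
     * Returns null when either piece is missing.
     */
    public static String toServerArg(Properties zkProps) {
        String clientPort = zkProps.getProperty("clientPort");
        if (clientPort == null || clientPort.trim().isEmpty()) {
            return null;
        }
        // Collect server.N keys in sorted order so the choice is deterministic.
        TreeSet<String> serverKeys = new TreeSet<>();
        for (String key : zkProps.stringPropertyNames()) {
            if (key.startsWith("server.")) {
                serverKeys.add(key);
            }
        }
        if (serverKeys.isEmpty()) {
            return null;
        }
        // Assumed value layout: host:peerPort:electionPort; keep only the host.
        String value = zkProps.getProperty(serverKeys.first()).trim();
        int colon = value.indexOf(':');
        String host = colon < 0 ? value : value.substring(0, colon);
        return host.isEmpty() ? null : host + ":" + clientPort.trim();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("clientPort", "2181");
        props.setProperty("server.0", "zk1.example.org:2888:3888");
        // Prints: zk1.example.org:2181
        System.out.println(toServerArg(props));
    }
}
```

A wrapper script could then pass the printed value along as '-server host:port', and treat an empty result as a reason to stop rather than to carry on.
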
This addresses bug hbase-2998.
    http://issues.apache.org/jira/browse/hbase-2998

Diffs (updated)
-----

  trunk/bin/hbase 1025815
  trunk/bin/hbase-daemon.sh 1025815
  trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1025815
  trunk/src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java 1025815
  trunk/src/main/java/org/apache/hadoop/hbase/executor/ExecutorService.java 1025815
  trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1025815
  trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1025815
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1025815
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1025815
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKServerTool.java 1025815
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKUtil.java 1025815
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWatcher.java 1025815

Diff: http://review.cloudera.org/r/1057/diff

Testing
-------

Thanks,

stack

> rolling-restart.sh shouldn't rely on zoo.cfg
> --------------------------------------------
>
>                 Key: HBASE-2998
>                 URL: https://issues.apache.org/jira/browse/HBASE-2998
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.90.0
>
>         Attachments: 2998.txt
>
>
> I tried the rolling-restart script on our dev environment, which is
> configured with zoo.cfg for zookeeper, and it worked pretty well. Then I
> tried it on our MR cluster, which doesn't have a zoo.cfg, and we suffered
> some downtime (no biggie tho, nothing critical was running). When the script
> calls this line:
> {code}
> bin/hbase zkcli stat $zmaster
> {code}
> It directly runs a ZooKeeperMain which isn't modified to read from the HBase
> configuration files.
> What happens next if ZK isn't running on the master node
> is that it receives a ConnectionRefused, ignores it, proceeds to restart the
> master (which waits on the znode), and then starts restarting the region
> servers. They can't shut down properly in under 60 seconds, since they need a
> master, so they get killed. What follows is pretty ugly and pretty much
> requires a whole restart.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.