[jira] [Commented] (HBASE-4120) isolation and allocation
[ https://issues.apache.org/jira/browse/HBASE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13424501#comment-13424501 ] gaojinchao commented on HBASE-4120: ---

Hi Liu Jia, are you working on this issue now? When do you plan to finish?

isolation and allocation

Key: HBASE-4120
URL: https://issues.apache.org/jira/browse/HBASE-4120
Project: HBase
Issue Type: New Feature
Components: master, regionserver
Affects Versions: 0.90.2, 0.90.3, 0.90.4, 0.92.0
Reporter: Liu Jia
Assignee: Liu Jia
Fix For: 0.96.0
Attachments: Design_document_for_HBase_isolation_and_allocation.pdf, Design_document_for_HBase_isolation_and_allocation_Revised.pdf, HBase_isolation_and_allocation_user_guide.pdf, Performance_of_Table_priority.pdf, Simple_YCSB_Tests_For_TablePriority_Trunk_and_0.90.4.pdf, System Structure.jpg, TablePriority.patch, TablePriority_v12.patch, TablePriority_v12.patch, TablePriority_v15_with_coprocessor.patch, TablePriority_v16_with_coprocessor.patch, TablePriority_v17.patch, TablePriority_v17.patch, TablePriority_v8.patch, TablePriority_v8.patch, TablePriority_v8_for_trunk.patch, TablePrioriy_v9.patch

The HBase isolation and allocation tool is designed to help users manage cluster resources among different applications and tables. When a large HBase cluster hosts many applications, resource contention becomes a problem. At Taobao there is a cluster where many departments test the performance of their HBase-based applications. On this cluster of 12 servers, only one application could run exclusively at a time, and the other applications had to wait until the previous test finished. After we added the allocation management function, applications can share the cluster and run concurrently. And if a test engineer wants to make sure there is no interference, he or she can move the other tables out of the group.

Within a group we use table priority to allocate resources: when the system is busy, we can make sure that high-priority tables are not affected by lower-priority tables. Different groups can have different region server configurations: a group optimized for reading can have a large block cache, while one optimized for writing can have a large memstore. Tables and region servers can be moved easily between groups; after changing the configuration, a group can be restarted alone instead of restarting the whole cluster.

Git repository: https://github.com/ICT-Ope/HBase_allocation. We hope our work is helpful.
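To make the per-group tuning concrete (the grouping mechanism itself comes from the attached patches; the keys below are just the stock HBase settings one might vary per group): a read-optimized group's region servers could carry, in their hbase-site.xml,

    hfile.block.cache.size = 0.4
    hbase.regionserver.global.memstore.upperLimit = 0.25

while a write-optimized group inverts the balance, e.g.

    hfile.block.cache.size = 0.1
    hbase.regionserver.global.memstore.upperLimit = 0.5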
[jira] [Commented] (HBASE-4246) Cluster with too many regions cannot withstand some master failover scenarios
[ https://issues.apache.org/jira/browse/HBASE-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13398201#comment-13398201 ] gaojinchao commented on HBASE-4246: ---

The version is 0.90.x. I have asked the customer to raise jute.maxbuffer to 64 MB.

Cluster with too many regions cannot withstand some master failover scenarios

Key: HBASE-4246
URL: https://issues.apache.org/jira/browse/HBASE-4246
Project: HBase
Issue Type: Bug
Components: master, zookeeper
Affects Versions: 0.90.4
Reporter: Todd Lipcon
Priority: Critical
Fix For: 0.96.0

We ran into the following sequence of events:
- master startup failed after only ROOT had been assigned (for another reason)
- restarted the master without restarting other servers. Since there was at least one region assigned, it went through the failover code path
- master scanned META and inserted every region into /hbase/unassigned in ZK.
- then, it called listChildren on the /hbase/unassigned znode, and crashed with "Packet len6080218 is out of range!" since the IPC response was larger than the default maximum.
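For reference, jute.maxbuffer is read as a JVM system property by both the ZooKeeper server and the ZooKeeper client, so it has to be raised on both sides for the change to take effect. A minimal sketch, assuming the usual file locations:

    # ZooKeeper server side, conf/java.env
    JVMFLAGS="$JVMFLAGS -Djute.maxbuffer=67108864"

    # HBase side (the master is the ZK client here), conf/hbase-env.sh
    export HBASE_OPTS="$HBASE_OPTS -Djute.maxbuffer=67108864"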
[jira] [Commented] (HBASE-4246) Cluster with too many regions cannot withstand some master failover scenarios
[ https://issues.apache.org/jira/browse/HBASE-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397214#comment-13397214 ] gaojinchao commented on HBASE-4246: ---

Hi, it also happened in our cluster when we restarted the whole cluster (it has 129,723 regions).

2012-06-19 19:29:00,961 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:2-0x137ed2eb936fb85 Creating (or updating) unassigned node for 80400ccd4a1f3438cc23774ca8a88d17 with OFFLINE state
2012-06-19 19:29:00,965 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=172-16-6-2:2, region=80400ccd4a1f3438cc23774ca8a88d17
2012-06-19 19:29:00,966 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:2-0x137ed2eb936fb85 Creating (or updating) unassigned node for 7f1a56641906ae0a6cc6919bd927df76 with OFFLINE state
2012-06-19 19:29:00,969 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=172-16-6-2:2, region=7f1a56641906ae0a6cc6919bd927df76
2012-06-19 19:29:01,070 WARN org.apache.zookeeper.ClientCnxn: Session 0x137ed2eb936fb85 for server 172-16-6-1/172.16.6.1:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len4670048 is out of range!
	at org.apache.zookeeper.ClientCnxn$SendThread.readLength(ClientCnxn.java:721)
	at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:880)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145)
2012-06-19 19:29:01,174 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: master:2-0x137ed2eb936fb85 Unable to list children of znode /hbase/unassigned
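The numbers line up with the size of the GetChildren response: each child of /hbase/unassigned is a 32-character encoded region name plus a 4-byte length prefix in the serialized reply, so 129,723 regions x 36 bytes = 4,670,028 bytes, almost exactly the reported "Packet len4670048", and well over the ZooKeeper client's default jute.maxbuffer limit.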
[jira] [Commented] (HBASE-6055) Snapshots in HBase 0.96
[ https://issues.apache.org/jira/browse/HBASE-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13290027#comment-13290027 ] gaojinchao commented on HBASE-6055: ---

Fine, thanks. I will set aside some time for this feature.

Snapshots in HBase 0.96

Key: HBASE-6055
URL: https://issues.apache.org/jira/browse/HBASE-6055
Project: HBase
Issue Type: New Feature
Components: client, master, regionserver, zookeeper
Reporter: Jesse Yates
Assignee: Jesse Yates
Fix For: 0.96.0
Attachments: Snapshots in HBase.docx

Continuation of HBASE-50 for the current trunk. Since the implementation has drastically changed, opening as a new ticket.
[jira] [Commented] (HBASE-6055) Snapshots in HBase 0.96
[ https://issues.apache.org/jira/browse/HBASE-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13289900#comment-13289900 ] gaojinchao commented on HBASE-6055: ---

Hi Jesse,
I am considering a solution that doesn't use the HLog. The idea is to handle only the memstore: take a snapshot of it and flush it asynchronously to HFiles. If a region server goes down, we can finish producing the HFiles by replaying its edit log. Do you think this is feasible? If we can do it, there are two relatively large benefits:
1. Restoring the snapshot is easier.
2. We can achieve incremental backups via HFiles.
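A rough sketch of the flow proposed above, in client-API terms; the table name is a placeholder, and the manifest/recovery half (the new work) is only described in comments:

    HBaseAdmin admin = new HBaseAdmin(conf);
    // Ask every region of the table to flush its memstore to HFiles;
    // the snapshot would then reference only the resulting HFiles,
    // with no HLogs bundled in.
    admin.flush("myTable");
    // Recovery idea from the comment above: if a region server dies
    // before its flush completes, replay that server's edit log to
    // finish producing the missing HFiles.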
[jira] [Commented] (HBASE-6055) Snapshots in HBase 0.96
[ https://issues.apache.org/jira/browse/HBASE-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13284366#comment-13284366 ] gaojinchao commented on HBASE-6055: ---

Hi Jesse,
Are you working on this feature? I am interested in it and will study your code. One question: while we are creating snapshots, do we need to stop the balancer?
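On the balancer question, the client API already allows switching it off around an operation; a minimal sketch (the snapshot call itself is elided):

    HBaseAdmin admin = new HBaseAdmin(conf);
    // Remember the old state and stop the balancer so regions don't
    // move while the snapshot is being taken.
    boolean oldState = admin.balanceSwitch(false);
    try {
      // ... create the snapshot ...
    } finally {
      admin.balanceSwitch(oldState);  // restore the previous state
    }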
[jira] [Commented] (HBASE-6055) Snapshots in HBase 0.96
[ https://issues.apache.org/jira/browse/HBASE-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13283894#comment-13283894 ] gaojinchao commented on HBASE-6055: ---

This is a very useful feature. :)
[jira] [Commented] (HBASE-5546) Master assigns region in the original region server when opening region failed
[ https://issues.apache.org/jira/browse/HBASE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13272971#comment-13272971 ] gaojinchao commented on HBASE-5546: ---

+1, good job!

Master assigns region in the original region server when opening region failed

Key: HBASE-5546
URL: https://issues.apache.org/jira/browse/HBASE-5546
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.92.0
Reporter: gaojinchao
Assignee: Ashutosh Jindal
Priority: Minor
Fix For: 0.96.0
Attachments: hbase-5546.patch, hbase-5546_1.patch

The master assigns a region back to the original region server when the RS_ZK_REGION_FAILED_OPEN event comes in. Maybe we should choose another region server.

[2012-03-07 10:14:21,750] [DEBUG] [main-EventThread] [org.apache.hadoop.hbase.master.AssignmentManager 553] Handling transition=RS_ZK_REGION_FAILED_OPEN, server=158-1-130-11,20020,1331108408232, region=c70e98bdca98a0657a56436741523053
[2012-03-07 10:14:31,826] [DEBUG] [main-EventThread] [org.apache.hadoop.hbase.master.AssignmentManager 553] Handling transition=RS_ZK_REGION_FAILED_OPEN, server=158-1-130-11,20020,1331108408232, region=c70e98bdca98a0657a56436741523053
[... the same transition is handled every ~10 seconds ...]
[2012-03-07 10:16:22,676] [DEBUG] [main-EventThread] [org.apache.hadoop.hbase.master.AssignmentManager 553] Handling transition=RS_ZK_REGION_FAILED_OPEN, server=158-1-130-11,20020,1331108408232, region=c70e98bdca98a0657a56436741523053
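A sketch of the direction such a fix could take in the AssignmentManager (illustrative only, not the attached patch; variable names are made up):

    // On RS_ZK_REGION_FAILED_OPEN: build a new plan that excludes the
    // server that just failed, instead of reusing the old plan.
    List<ServerName> candidates =
        new ArrayList<ServerName>(serverManager.getOnlineServersList());
    candidates.remove(failedServer);
    if (!candidates.isEmpty()) {
      RegionPlan newPlan = new RegionPlan(regionInfo, failedServer,
          balancer.randomAssignment(candidates));  // pick a different destination
      regionPlans.put(regionInfo.getEncodedName(), newPlan);
    }
    assign(regionInfo);  // retry the open with the new plan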
[jira] [Commented] (HBASE-4340) Hbase can't balance if ServerShutdownHandler encountered exception
[ https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13102203#comment-13102203 ] gaojinchao commented on HBASE-4340: ---

Thanks for your work, Ted. I want to put the patch through review, and then make a trunk patch. Running all the test cases takes two hours. :)

Hbase can't balance if ServerShutdownHandler encountered exception

Key: HBASE-4340
URL: https://issues.apache.org/jira/browse/HBASE-4340
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
Fix For: 0.90.5
Attachments: HBASE-4340_branch90.patch

Version: 0.90.4
Cluster: 40 boxes

As the logs below show, the balancer couldn't run because of a dead RS. I dug deeper and found two issues:
1. ServerShutdownHandler didn't clear numProcessing when it hit certain exceptions. It seems that whatever exception occurs, we should either clear the flag or shut down the master.
2. "dead regionserver(s): [158-1-130-12,20020,1314971097929]" is inaccurate. The dead server should be 158-1-130-10,20020,1315068597979.

//master logs:
2011-09-05 00:28:00,487 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
2011-09-05 00:33:00,489 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
[... the same message is logged every 5 minutes ...]
2011-09-05 02:08:00,538 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
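The first issue is a missing try/finally; a minimal sketch of the shape of such a fix in ServerShutdownHandler (details elided, names abbreviated):

    @Override
    public void process() throws IOException {
      try {
        // ... split the dead server's logs, scan META, reassign its regions ...
        // (any exception thrown here previously left numProcessing
        //  incremented, which blocked the balancer forever)
      } finally {
        // Always mark this dead server as processed so the balancer
        // can run again.
        this.deadServers.finish(serverName);
      }
    }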
[jira] [Commented] (HBASE-4340) Hbase can't balance.
[ https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13101180#comment-13101180 ] gaojinchao commented on HBASE-4340: ---

Yes, all test cases have passed.
[jira] [Commented] (HBASE-4212) TestMasterFailover fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13101213#comment-13101213 ] gaojinchao commented on HBASE-4212: ---

@Stack, thanks for your review. In our environment it often fails, so we skip this case (in my setup, all test cases are run automatically every day).

The steps for opening the root region:
Step A: Master tells a region server to open the root region.
Step B: The region server opens the root region and sets the zk node (rootServerZNode). Once this is done, the CatalogTracker can work.
Step C: The region server updates the zk node (assignmentZNode) to tell the master that root has opened (this step may fail, but we have already advertised that root can be used).
Step D: Master deletes the zk node (assignmentZNode) and adds the root region to the online set.

In my case, the master skipped step D because of a delay, and forced the root region online in processFailover. So the zk node was never deleted and the failover test failed.

finishInitialization code:

    // Make sure root and meta assigned before proceeding.
    assignRootAndMeta();
    // Is this fresh start with no regions assigned or are we a master joining
    // an already-running cluster? If regionsCount == 0, then for sure a
    // fresh start. TODO: Be fancier. If regionsCount == 2, perhaps the
    // 2 are .META. and -ROOT- and we should fall into the fresh startup
    // branch below. For now, do processFailover.
    if (regionCount == 0) {
      LOG.info("Master startup proceeding: cluster startup");
      this.assignmentManager.cleanoutUnassigned();
      this.assignmentManager.assignAllUserRegions();
    } else {
      LOG.info("Master startup proceeding: master failover");
      this.assignmentManager.processFailover();
    }

processFailover code:

    HServerInfo hsi =
      this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
    regionOnline(HRegionInfo.FIRST_META_REGIONINFO, hsi);
    hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
    regionOnline(HRegionInfo.ROOT_REGIONINFO, hsi);

TestMasterFailover fails occasionally

Key: HBASE-4212
URL: https://issues.apache.org/jira/browse/HBASE-4212
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
Fix For: 0.90.5
Attachments: HBASE-4212_TrunkV1.patch, HBASE-4212_branch90V1.patch

It seems to be a bug: the root region in RIT can't be moved. In the failover process, the master enforces root online but does not clean the zk node, so the test waits forever.

    void processFailover() throws KeeperException, IOException, InterruptedException {
      // we enforce on-line root.
      HServerInfo hsi =
        this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
      regionOnline(HRegionInfo.FIRST_META_REGIONINFO, hsi);
      hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
      regionOnline(HRegionInfo.ROOT_REGIONINFO, hsi);

It seems that we should wait for the assignment to finish, as we do for the meta region:

    int assignRootAndMeta() throws InterruptedException, IOException, KeeperException {
      int assigned = 0;
      long timeout = this.conf.getLong("hbase.catalog.verification.timeout", 1000);
      // Work on ROOT region. Is it in zk in transition?
      boolean rit = this.assignmentManager.
        processRegionInTransitionAndBlockUntilAssigned(HRegionInfo.ROOT_REGIONINFO);
      if (!catalogTracker.verifyRootRegionLocation(timeout)) {
        this.assignmentManager.assignRoot();
        this.catalogTracker.waitForRoot();
        // we need to add this line to guarantee that the transition has completed
        this.assignmentManager.waitForAssignment(HRegionInfo.ROOT_REGIONINFO);
        assigned++;
      }

logs:
2011-08-16 07:45:40,715 DEBUG [RegionServer:0;C4S2.site,47710,1313495126115-EventThread] zookeeper.ZooKeeperWatcher(252): regionserver:47710-0x131d2690f780004 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/70236052
2011-08-16 07:45:40,715 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] zookeeper.ZKAssign(712): regionserver:47710-0x131d2690f780004 Successfully transitioned node 70236052 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
2011-08-16 07:45:40,715 DEBUG [Thread-760-EventThread] zookeeper.ZooKeeperWatcher(252): master:60701-0x131d2690f780009 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/70236052
2011-08-16 07:45:40,716 INFO [PostOpenDeployTasks:70236052] catalog.RootLocationEditor(62): Setting ROOT region location in ZooKeeper as C4S2.site:47710
2011-08-16 07:45:40,716 DEBUG [Thread-760-EventThread] zookeeper.ZKUtil(1109): master:60701-0x131d2690f780009 Retrieved
[jira] [Updated] (HBASE-4340) Hbase can't balance.
[ https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4340: --

Attachment: HBASE-4340_branch90.patch
[jira] [Commented] (HBASE-4340) Hbase can't balance.
[ https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13100207#comment-13100207 ] gaojinchao commented on HBASE-4340: ---

I have made a patch; please review.
[jira] [Updated] (HBASE-4340) Hbase can't balance.
[ https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4340: --

Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-2158) Change how high/low global limit works; start taking on writes as soon as we dip below high limit rather than block until low limit as we currently do.
[ https://issues.apache.org/jira/browse/HBASE-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13098645#comment-13098645 ] gaojinchao commented on HBASE-2158: ---

When memory reaches the low limit, an emergency flusher kicks in, so I think it is difficult to reach the high limit; and if we do reach it, regions are flushed one by one:

    if (fqe == null || fqe instanceof WakeupFlushThread) {
      if (isAboveLowWaterMark()) {
        LOG.info("Flush thread woke up with memory above low water.");
        if (!flushOneForGlobalPressure()) {
          // Wasn't able to flush any region, but we're above low water mark
          // This is unlikely to happen, but might happen when closing the
          // entire server - another thread is flushing regions. We'll just
          // sleep a little bit to avoid spinning, and then pretend that
          // we flushed one, so anyone blocked will check again
          lock.lock();
          try {
            Thread.sleep(1000);
            flushOccurred.signalAll();
          } finally {
            lock.unlock();
          }
        }
        // Enqueue another one of these tokens so we'll wake up again
        wakeupFlushThread();
      }
      continue;
    }

Change how high/low global limit works; start taking on writes as soon as we dip below high limit rather than block until low limit as we currently do.

Key: HBASE-2158
URL: https://issues.apache.org/jira/browse/HBASE-2158
Project: HBase
Issue Type: Improvement
Reporter: stack

A Ryan Rawson suggestion. See HBASE-2149 for more context.
[jira] [Created] (HBASE-4340) Hbase can't balance.
Hbase can't balance.

Key: HBASE-4340
URL: https://issues.apache.org/jira/browse/HBASE-4340
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.4
Reporter: gaojinchao
Fix For: 0.90.5

// the exception logs:
2011-09-03 18:13:27,550 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=158-1-133-11,20020,1315069437236, region=0db4088d75c58dd22f93f389d90ba6cc
2011-09-03 18:13:27,550 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event
[jira] [Assigned] (HBASE-4340) Hbase can't balance.
[ https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao reassigned HBASE-4340: -

Assignee: gaojinchao
[jira] [Assigned] (HBASE-3521) region be merged with others automatically when all data in the region has expired and removed, or region gets too small.
[ https://issues.apache.org/jira/browse/HBASE-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao reassigned HBASE-3521: -

Assignee: gaojinchao

region be merged with others automatically when all data in the region has expired and removed, or region gets too small.

Key: HBASE-3521
URL: https://issues.apache.org/jira/browse/HBASE-3521
Project: HBase
Issue Type: Improvement
Components: master, regionserver, scripts
Affects Versions: 0.90.0
Reporter: zhoushuaifeng
Assignee: gaojinchao
Priority: Minor

We have tested a cluster with more than 30,000 regions, where the max size of a region is 512 MB. In this situation the data volume no longer grows, but old data is removed and new data inserted, so regions keep accumulating and some of them become very small or empty. This occupies too much heap, and gets worse when regions cannot be merged, which limits how long HBase can keep running. A script that does a survey to remove empty regions, or picks out adjacent small regions and then does an online merge, seems like it would be useful.
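A very rough sketch of the survey half of such a script against the 0.90 client API (illustrative only; the size check and the merge itself, e.g. via the offline Merge tool, are left out):

    HTable meta = new HTable(conf, ".META.");
    ResultScanner scanner = meta.getScanner(new Scan());
    HRegionInfo prev = null;
    for (Result r : scanner) {
      HRegionInfo info = Writables.getHRegionInfo(
          r.getValue(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER));
      // If two adjacent regions belong to the same table and both are
      // "small" (e.g. by summing their store file sizes in HDFS), queue
      // the pair (prev, info) as a merge candidate.
      if (prev != null &&
          Bytes.equals(prev.getTableDesc().getName(), info.getTableDesc().getName())) {
        // ... record candidate pair ...
      }
      prev = info;
    }
    scanner.close();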
[jira] [Commented] (HBASE-3521) region be merged with others automatically when all data in the region has expired and removed, or region gets too small.
[ https://issues.apache.org/jira/browse/HBASE-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13098560#comment-13098560 ] gaojinchao commented on HBASE-3521: --- Thanks a lot. This is what I want. :) region be merged with others automatically when all data in the region has expired and removed, or region gets too small. - Key: HBASE-3521 URL: https://issues.apache.org/jira/browse/HBASE-3521 Project: HBase Issue Type: Improvement Components: master, regionserver, scripts Affects Versions: 0.90.0 Reporter: zhoushuaifeng Assignee: gaojinchao Priority: Minor We have tested a cluster with more than 30,000 regions, with a max region size of 512 MB. In this situation the data is no longer growing, but old data is removed and new data inserted, so regions become more and more numerous, and some of them may be very small or empty. This occupies too much heap, and even more if regions cannot be merged, which limits how long HBase can keep running. A script that surveys for empty regions to remove, or picks out adjacent small regions and then does an online merge, seems like it would be useful. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-2158) Change how high/low global limit works; start taking on writes as soon as we dip below high limit rather than block until low limit as we currently do.
[ https://issues.apache.org/jira/browse/HBASE-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13098586#comment-13098586 ] gaojinchao commented on HBASE-2158: --- I agree. This issue should be closed. Change how high/low global limit works; start taking on writes as soon as we dip below high limit rather than block until low limit as we currently do. --- Key: HBASE-2158 URL: https://issues.apache.org/jira/browse/HBASE-2158 Project: HBase Issue Type: Improvement Reporter: stack A Ryan Rawson suggestion. See HBASE-2149 for more context. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4132) Extend the WALActionsListener API to accommodate log archival
[ https://issues.apache.org/jira/browse/HBASE-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13093403#comment-13093403 ] gaojinchao commented on HBASE-4132: --- Could we add the following API?
{code}
/**
 * The WAL needs to be archived. It is going to be moved from oldPath to
 * newPath.
 * @param oldPath the path to the old hlog
 * @param newPath the path to the new hlog
 * @return true if default behavior should be bypassed, false otherwise
 */
boolean preArchiveLog(Path oldPath, Path newPath) throws IOException;

/**
 * The WAL has been archived. It is moved from oldPath to newPath.
 * @param oldPath the path to the old hlog
 * @param newPath the path to the new hlog
 * @param archivalWasSuccessful true, if the archival was successful
 */
void postArchiveLog(Path oldPath, Path newPath, boolean archivalWasSuccessful) throws IOException;
{code}
Extend the WALActionsListener API to accommodate log archival Key: HBASE-4132 URL: https://issues.apache.org/jira/browse/HBASE-4132 Project: HBase Issue Type: Improvement Components: regionserver Reporter: dhruba borthakur Fix For: 0.92.0 Attachments: walArchive.txt The WALObserver interface exposes the log roll events. It would be nice to extend it to accommodate log archival events as well. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
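For illustration, a minimal listener built on the two proposed hooks. This assumes the methods above get added to WALActionsListener; it is a sketch of how a consumer might use them, not an existing API:
{code}
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.Path;

// Sketch of a listener using the *proposed* archival hooks above.
// preArchiveLog/postArchiveLog are not part of any released interface.
public class ArchiveAuditListener {
  private static final Log LOG = LogFactory.getLog(ArchiveAuditListener.class);

  /** Return true to bypass the default archival behavior. */
  public boolean preArchiveLog(Path oldPath, Path newPath) throws IOException {
    LOG.info("About to archive " + oldPath + " to " + newPath);
    return false; // keep default behavior
  }

  /** Called after the move; record failures for later inspection. */
  public void postArchiveLog(Path oldPath, Path newPath,
      boolean archivalWasSuccessful) throws IOException {
    if (!archivalWasSuccessful) {
      LOG.warn("Archival of " + oldPath + " to " + newPath + " failed");
    }
  }
}
{code}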
[jira] [Commented] (HBASE-4124) ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13092583#comment-13092583 ] gaojinchao commented on HBASE-4124: --- @Ted thanks for your work. sn has already been null-checked in the statement above:
{code}
if (sn == null) {
  LOG.warn("Region in transition " + regionInfo.getEncodedName() +
    " references a null server; letting RIT timeout so will be " +
    "assigned elsewhere");
  break;
}
{code}
ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Assignee: gaojinchao Fix For: 0.90.5 Attachments: 4124-trunk.v2, HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, HBASE-4124_Branch90V4.patch, HBASE-4124_TrunkV1.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4124) ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4124: -- Attachment: HBASE-4124_TrunkV2.patch I am running all the test cases. My new modification is cleaner. ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Assignee: gaojinchao Fix For: 0.90.5 Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, HBASE-4124_Branch90V4.patch, HBASE-4124_TrunkV1.patch, HBASE-4124_TrunkV2.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4134) The total number of regions was more than the actual region count after the hbck fix
[ https://issues.apache.org/jira/browse/HBASE-4134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4134: -- Fix Version/s: (was: 0.92.0) 0.94.0 The total number of regions was more than the actual region count after the hbck fix Key: HBASE-4134 URL: https://issues.apache.org/jira/browse/HBASE-4134 Project: HBase Issue Type: Bug Affects Versions: 0.90.3 Reporter: feng xu Fix For: 0.94.0 1. I found the problem (some regions were multiply assigned) while running hbck to check the cluster's health. Here's the result:
{noformat}
ERROR: Region test1,230778,1311216270050.fff783529fcd983043610eaa1cc5c2fe. is listed in META on region server 158-1-91-101:20020 but is multiply assigned to region servers 158-1-91-101:20020, 158-1-91-105:20020
ERROR: Region test1,252103,1311216293671.fff9ed2cb69bdce535451a07686c0db5. is listed in META on region server 158-1-91-101:20020 but is multiply assigned to region servers 158-1-91-101:20020, 158-1-91-105:20020
ERROR: Region test1,282187,1311216322104.52782c0241a598b3e37ca8729da0. is listed in META on region server 158-1-91-103:20020 but is multiply assigned to region servers 158-1-91-103:20020, 158-1-91-105:20020
Summary:
-ROOT- is okay. Number of regions: 1 Deployed on: 158-1-91-105:20020
.META. is okay. Number of regions: 1 Deployed on: 158-1-91-103:20020
test1 is okay. Number of regions: 25297 Deployed on: 158-1-91-101:20020 158-1-91-103:20020 158-1-91-105:20020
14829 inconsistencies detected.
Status: INCONSISTENT
{noformat}
2. Then I tried to use hbck -fix to fix the problem. Everything seemed OK, but I found that the total number of regions reported by the load balancer (35029) was larger than the actual region count (25299) after the fix. Here's the related log snippet:
{noformat}
2011-07-22 02:19:02,866 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing. servers=3 regions=25299 average=8433.0 mostloaded=8433
2011-07-22 03:06:11,832 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing. servers=3 regions=35029 average=11676.333 mostloaded=11677 leastloaded=11676
{noformat}
3. I tracked one region's behavior during that time, taking the region test1,282187,1311216322104.52782c0241a598b3e37ca8729da0. as an example: (1) It was assigned to 158-1-91-101 at first. (2) HBCK sent a closing request to the RegionServer, and the RegionServer closed it silently without notifying the HMaster. (3) As far as the HMaster knew, the region was still carried by RS 158-1-91-103. (4) HBCK then triggered a new assignment. The region was indeed assigned again, but the old assignment information still remained in AM#regions and AM#servers. That's why the reported region count was larger than the actual number (a toy model of this bookkeeping follows below).
{noformat}
Line 178967: 2011-07-22 02:47:51,247 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/52782c0241a598b3e37ca8729da0 (region=test1,282187,1311216322104.52782c0241a598b3e37ca8729da0., server=HBCKServerName, state=M_ZK_REGION_OFFLINE)
Line 178968: 2011-07-22 02:47:51,247 INFO org.apache.hadoop.hbase.master.AssignmentManager: Handling HBCK triggered transition=M_ZK_REGION_OFFLINE, server=HBCKServerName, region=52782c0241a598b3e37ca8729da0
Line 178969: 2011-07-22 02:47:51,248 INFO org.apache.hadoop.hbase.master.AssignmentManager: HBCK repair is triggering assignment of region=test1,282187,1311216322104.52782c0241a598b3e37ca8729da0.
Line 178970: 2011-07-22 02:47:51,248 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for test1,282187,1311216322104.52782c0241a598b3e37ca8729da0. so generated a random one; hri=test1,282187,1311216322104.52782c0241a598b3e37ca8729da0., src=, dest=158-1-91-101,20020,1311231878544; 3 (online=3, exclude=null) available servers
Line 178971: 2011-07-22 02:47:51,248 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region test1,282187,1311216322104.52782c0241a598b3e37ca8729da0. to 158-1-91-101,20020,1311231878544
Line 178983: 2011-07-22 02:47:51,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=158-1-91-101,20020,1311231878544, region=52782c0241a598b3e37ca8729da0
Line 179001: 2011-07-22 02:47:51,318 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=158-1-91-101,20020,1311231878544, region=52782c0241a598b3e37ca8729da0
Line 179002: 2011-07-22 02:47:51,319 DEBUG
{noformat}
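A toy model of the AM#regions / AM#servers bookkeeping described in (4), in plain Java (not HBase code), showing how skipping the old-mapping cleanup makes the summed region count drift above the real count:
{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: region -> server and server -> regions maps standing
// in for AssignmentManager#regions / #servers.
public class AssignmentBook {
  private final Map<String, String> regions = new HashMap<String, String>();
  private final Map<String, Set<String>> servers = new HashMap<String, Set<String>>();

  public void assign(String region, String server) {
    String old = regions.put(region, server);
    // If the old mapping is not removed first -- as happened with the
    // HBCK-triggered assignment above -- the old server's set still
    // holds the region, so summing set sizes over-counts regions.
    if (old != null && !old.equals(server)) {
      servers.get(old).remove(region);
    }
    Set<String> rs = servers.get(server);
    if (rs == null) {
      rs = new HashSet<String>();
      servers.put(server, rs);
    }
    rs.add(region);
  }
}
{code}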
[jira] [Commented] (HBASE-4124) ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13092618#comment-13092618 ] gaojinchao commented on HBASE-4124: --- All test cases passed. Thanks. ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Assignee: gaojinchao Fix For: 0.90.5 Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, HBASE-4124_Branch90V4.patch, HBASE-4124_TrunkV1.patch, HBASE-4124_TrunkV2.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4124) ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4124: -- Attachment: HBASE-4124_TrunkV1.patch I have made a patch. I found that two test cases (TestAdmin and RollLoging) can't pass; they fail with the raw trunk as well. ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Assignee: gaojinchao Fix For: 0.90.5 Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, HBASE-4124_Branch90V4.patch, HBASE-4124_TrunkV1.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4124: -- Attachment: HBASE-4124_Branch90V4.patch Modified the comments according to the review. Thanks to Ted for the careful review. ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Assignee: gaojinchao Fix For: 0.90.5 Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, HBASE-4124_Branch90V4.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-3845) data loss because lastSeqWritten can miss memstore edits
[ https://issues.apache.org/jira/browse/HBASE-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-3845: -- Attachment: HBASE-3845_branch90V2.patch Modified the code according to the review. data loss because lastSeqWritten can miss memstore edits Key: HBASE-3845 URL: https://issues.apache.org/jira/browse/HBASE-3845 Project: HBase Issue Type: Bug Affects Versions: 0.90.3 Reporter: Prakash Khemani Assignee: ramkrishna.s.vasudevan Priority: Critical Fix For: 0.92.0 Attachments: 0001-HBASE-3845-data-loss-because-lastSeqWritten-can-miss.patch, HBASE-3845-fix-TestResettingCounters-test.txt, HBASE-3845_1.patch, HBASE-3845_2.patch, HBASE-3845_4.patch, HBASE-3845_5.patch, HBASE-3845_6.patch, HBASE-3845__trunk.patch, HBASE-3845_branch90V1.patch, HBASE-3845_branch90V2.patch, HBASE-3845_trunk_2.patch, HBASE-3845_trunk_3.patch (I don't have a test case to prove this yet but I have run it by Dhruba and Kannan internally and wanted to put this up for some feedback.) In this discussion let us assume that the region has only one column family. That way I can use region/memstore interchangeably. After a memstore flush it is possible for lastSeqWritten to have a log-sequence-id for a region that is not the earliest log-sequence-id for that region's memstore. HLog.append() does a putIfAbsent into lastSequenceWritten. This is to ensure that we only keep track of the earliest log-sequence-number that is present in the memstore. Every time the memstore is flushed we remove the region's entry in lastSequenceWritten and wait for the next append to populate this entry again. This is where the problem happens. step 1: flusher.prepare() snapshots the memstore under HRegion.updatesLock.writeLock(). step 2 : as soon as the updatesLock.writeLock() is released new entries will be added into the memstore. step 3 : wal.completeCacheFlush() is called. This method removes the region's entry from lastSeqWritten. step 4: the next append will create a new entry for the region in lastSeqWritten(). But this will be the log seq id of the current append. All the edits that were added in step 2 are missing. == as a temporary measure, instead of removing the region's entry in step 3 I will replace it with the log-seq-id of the region-flush-event. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
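To make the temporary measure concrete, a minimal sketch (signature simplified; lastSeqWritten is assumed to be a ConcurrentMap keyed by encoded region name; this is an illustration, not the committed patch):
{code}
private final Object updateLock = new Object();
private final ConcurrentMap<byte[], Long> lastSeqWritten =
    new ConcurrentSkipListMap<byte[], Long>(Bytes.BYTES_COMPARATOR);

// Sketch: on flush completion, replace the region's entry with the
// flush event's sequence id instead of removing it, so edits appended
// between the snapshot (step 1) and completeCacheFlush() (step 3) still
// have a recorded lower bound and are replayed on recovery.
void completeCacheFlush(final byte[] encodedRegionName, final long flushSeqId) {
  synchronized (updateLock) {
    // Previously: lastSeqWritten.remove(encodedRegionName);
    lastSeqWritten.put(encodedRegionName, Long.valueOf(flushSeqId));
  }
}
{code}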
[jira] [Work started] (HBASE-3933) Hmaster throw NullPointerException
[ https://issues.apache.org/jira/browse/HBASE-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HBASE-3933 started by gaojinchao. Hmaster throw NullPointerException -- Key: HBASE-3933 URL: https://issues.apache.org/jira/browse/HBASE-3933 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: gaojinchao Assignee: gaojinchao Attachments: Hmastersetup0.90 NullPointerException while the HMaster is starting.
{code}
java.lang.NullPointerException
  at java.util.TreeMap.getEntry(TreeMap.java:324)
  at java.util.TreeMap.get(TreeMap.java:255)
  at org.apache.hadoop.hbase.master.AssignmentManager.addToServers(AssignmentManager.java:1512)
  at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:606)
  at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
  at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:402)
  at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
{code}
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3933) Hmaster throw NullPointerException
[ https://issues.apache.org/jira/browse/HBASE-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090853#comment-13090853 ] gaojinchao commented on HBASE-3933: --- I studied TRUNK. It has been fixed there, so we can close this issue. Trunk code:
{code}
// Wait for region servers to report in.
this.serverManager.waitForRegionServers(status);
// Check zk for regionservers that are up but didn't register
for (ServerName sn: this.regionServerTracker.getOnlineServers()) {
  if (!this.serverManager.isServerOnline(sn)) {
    // Not registered; add it.
    LOG.info("Registering server found up in zk: " + sn);
    this.serverManager.recordNewServer(sn, HServerLoad.EMPTY_HSERVERLOAD);
  }
}
{code}
Hmaster throw NullPointerException -- Key: HBASE-3933 URL: https://issues.apache.org/jira/browse/HBASE-3933 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: gaojinchao Assignee: gaojinchao Attachments: Hmastersetup0.90 NullPointerException while the HMaster is starting.
{code}
java.lang.NullPointerException
  at java.util.TreeMap.getEntry(TreeMap.java:324)
  at java.util.TreeMap.get(TreeMap.java:255)
  at org.apache.hadoop.hbase.master.AssignmentManager.addToServers(AssignmentManager.java:1512)
  at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:606)
  at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
  at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:402)
  at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
{code}
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3845) data loss because lastSeqWritten can miss memstore edits
[ https://issues.apache.org/jira/browse/HBASE-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090882#comment-13090882 ] gaojinchao commented on HBASE-3845: --- @Stack Please review the patch and give some suggestions. :) data loss because lastSeqWritten can miss memstore edits Key: HBASE-3845 URL: https://issues.apache.org/jira/browse/HBASE-3845 Project: HBase Issue Type: Bug Affects Versions: 0.90.3 Reporter: Prakash Khemani Assignee: ramkrishna.s.vasudevan Priority: Critical Fix For: 0.92.0 Attachments: 0001-HBASE-3845-data-loss-because-lastSeqWritten-can-miss.patch, HBASE-3845-fix-TestResettingCounters-test.txt, HBASE-3845_1.patch, HBASE-3845_2.patch, HBASE-3845_4.patch, HBASE-3845_5.patch, HBASE-3845_6.patch, HBASE-3845__trunk.patch, HBASE-3845_branch90V1.patch, HBASE-3845_branch90V2.patch, HBASE-3845_trunk_2.patch, HBASE-3845_trunk_3.patch (I don't have a test case to prove this yet but I have run it by Dhruba and Kannan internally and wanted to put this up for some feedback.) In this discussion let us assume that the region has only one column family. That way I can use region/memstore interchangeably. After a memstore flush it is possible for lastSeqWritten to have a log-sequence-id for a region that is not the earliest log-sequence-id for that region's memstore. HLog.append() does a putIfAbsent into lastSequenceWritten. This is to ensure that we only keep track of the earliest log-sequence-number that is present in the memstore. Every time the memstore is flushed we remove the region's entry in lastSequenceWritten and wait for the next append to populate this entry again. This is where the problem happens. step 1: flusher.prepare() snapshots the memstore under HRegion.updatesLock.writeLock(). step 2 : as soon as the updatesLock.writeLock() is released new entries will be added into the memstore. step 3 : wal.completeCacheFlush() is called. This method removes the region's entry from lastSeqWritten. step 4: the next append will create a new entry for the region in lastSeqWritten(). But this will be the log seq id of the current append. All the edits that were added in step 2 are missing. == as a temporary measure, instead of removing the region's entry in step 3 I will replace it with the log-seq-id of the region-flush-event. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4124) ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13091474#comment-13091474 ] gaojinchao commented on HBASE-4124: --- @Ted I am making a patch for TRUNK, but I have some questions about TRUNK. It seems to be a bug: in the assign function, when we get the return value ALREADY_OPENED, should we update the meta table, or do we do this on the region server? HMaster code:
{code}
RegionOpeningState regionOpenState = serverManager.sendRegionOpen(plan
    .getDestination(), state.getRegion());
if (regionOpenState == RegionOpeningState.ALREADY_OPENED) {
{code}
Region server code (if we don't update the meta, the client may access the old server):
{code}
HRegion onlineRegion = this.getFromOnlineRegions(region.getEncodedName());
if (null != onlineRegion) {
  LOG.warn("Attempted open of " + region.getEncodedName() +
    " but already online on this server");
  return RegionOpeningState.ALREADY_OPENED;
}
{code}
ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Assignee: gaojinchao Fix For: 0.90.5 Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, HBASE-4124_Branch90V4.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
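Purely to illustrate the first option (updating meta on the master side); updateMetaLocation() is a hypothetical helper, not an existing method, and the surrounding calls follow the snippet above:
{code}
RegionOpeningState regionOpenState =
    serverManager.sendRegionOpen(plan.getDestination(), state.getRegion());
if (regionOpenState == RegionOpeningState.ALREADY_OPENED) {
  // Hypothetical handling: repoint .META. at the server that actually
  // carries the region so clients stop going to the stale location,
  // and mirror that in the master's in-memory state.
  updateMetaLocation(state.getRegion(), plan.getDestination()); // hypothetical helper
  regionOnline(state.getRegion(), plan.getDestination());
}
{code}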
[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4124: -- Attachment: HBASE-4124_Branch90V3.patch ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Assignee: gaojinchao Fix For: 0.90.5 Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090141#comment-13090141 ] gaojinchao commented on HBASE-4124: --- @Ted Does this need a patch for trunk? Trunk has changed a lot, so I need some time to study it. ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Assignee: gaojinchao Fix For: 0.90.5 Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090677#comment-13090677 ] gaojinchao commented on HBASE-4124: --- @Ted I have run all the tests. Thanks for your work. ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Assignee: gaojinchao Fix For: 0.90.5 Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090698#comment-13090698 ] gaojinchao commented on HBASE-4124: --- @ram "How come we have a dead RS if we don't kill the RS?" gao: If you stop the cluster, the meta will still hold the old server information. "If the master is also killed, how can the regions be assigned to some other RS?" gao: When the master starts up, it collects the regions belonging to the same region server and calls sendRegionOpen(destination, regions). If the number of regions is relatively large, the region server needs a long time to open them; if the master crashes during that window, the new master may reopen the regions on another region server. ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Assignee: gaojinchao Fix For: 0.90.5 Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-3845) data loss because lastSeqWritten can miss memstore edits
[ https://issues.apache.org/jira/browse/HBASE-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-3845: -- Attachment: HBASE-3845_branch90V1.patch data loss because lastSeqWritten can miss memstore edits Key: HBASE-3845 URL: https://issues.apache.org/jira/browse/HBASE-3845 Project: HBase Issue Type: Bug Affects Versions: 0.90.3 Reporter: Prakash Khemani Assignee: ramkrishna.s.vasudevan Priority: Critical Fix For: 0.92.0 Attachments: 0001-HBASE-3845-data-loss-because-lastSeqWritten-can-miss.patch, HBASE-3845-fix-TestResettingCounters-test.txt, HBASE-3845_1.patch, HBASE-3845_2.patch, HBASE-3845_4.patch, HBASE-3845_5.patch, HBASE-3845_6.patch, HBASE-3845__trunk.patch, HBASE-3845_branch90V1.patch, HBASE-3845_trunk_2.patch, HBASE-3845_trunk_3.patch (I don't have a test case to prove this yet but I have run it by Dhruba and Kannan internally and wanted to put this up for some feedback.) In this discussion let us assume that the region has only one column family. That way I can use region/memstore interchangeably. After a memstore flush it is possible for lastSeqWritten to have a log-sequence-id for a region that is not the earliest log-sequence-id for that region's memstore. HLog.append() does a putIfAbsent into lastSequenceWritten. This is to ensure that we only keep track of the earliest log-sequence-number that is present in the memstore. Every time the memstore is flushed we remove the region's entry in lastSequenceWritten and wait for the next append to populate this entry again. This is where the problem happens. step 1: flusher.prepare() snapshots the memstore under HRegion.updatesLock.writeLock(). step 2 : as soon as the updatesLock.writeLock() is released new entries will be added into the memstore. step 3 : wal.completeCacheFlush() is called. This method removes the region's entry from lastSeqWritten. step 4: the next append will create a new entry for the region in lastSeqWritten(). But this will be the log seq id of the current append. All the edits that were added in step 2 are missing. == as a temporary measure, instead of removing the region's entry in step 3 I will replace it with the log-seq-id of the region-flush-event. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3845) data loss because lastSeqWritten can miss memstore edits
[ https://issues.apache.org/jira/browse/HBASE-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090740#comment-13090740 ] gaojinchao commented on HBASE-3845: --- @RAM I have run all the unit tests; please help review it first. Thanks. I will construct the scenario to verify it today. data loss because lastSeqWritten can miss memstore edits Key: HBASE-3845 URL: https://issues.apache.org/jira/browse/HBASE-3845 Project: HBase Issue Type: Bug Affects Versions: 0.90.3 Reporter: Prakash Khemani Assignee: ramkrishna.s.vasudevan Priority: Critical Fix For: 0.92.0 Attachments: 0001-HBASE-3845-data-loss-because-lastSeqWritten-can-miss.patch, HBASE-3845-fix-TestResettingCounters-test.txt, HBASE-3845_1.patch, HBASE-3845_2.patch, HBASE-3845_4.patch, HBASE-3845_5.patch, HBASE-3845_6.patch, HBASE-3845__trunk.patch, HBASE-3845_branch90V1.patch, HBASE-3845_trunk_2.patch, HBASE-3845_trunk_3.patch (I don't have a test case to prove this yet but I have run it by Dhruba and Kannan internally and wanted to put this up for some feedback.) In this discussion let us assume that the region has only one column family. That way I can use region/memstore interchangeably. After a memstore flush it is possible for lastSeqWritten to have a log-sequence-id for a region that is not the earliest log-sequence-id for that region's memstore. HLog.append() does a putIfAbsent into lastSequenceWritten. This is to ensure that we only keep track of the earliest log-sequence-number that is present in the memstore. Every time the memstore is flushed we remove the region's entry in lastSequenceWritten and wait for the next append to populate this entry again. This is where the problem happens. step 1: flusher.prepare() snapshots the memstore under HRegion.updatesLock.writeLock(). step 2 : as soon as the updatesLock.writeLock() is released new entries will be added into the memstore. step 3 : wal.completeCacheFlush() is called. This method removes the region's entry from lastSeqWritten. step 4: the next append will create a new entry for the region in lastSeqWritten(). But this will be the log seq id of the current append. All the edits that were added in step 2 are missing. == as a temporary measure, instead of removing the region's entry in step 3 I will replace it with the log-seq-id of the region-flush-event. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao reassigned HBASE-4124: - Assignee: gaojinchao ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Assignee: gaojinchao Fix For: 0.90.5 Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4124: -- Fix Version/s: 0.90.5 ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Fix For: 0.90.5 Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4124: -- Attachment: HBASE-4124_Branch90V2.patch ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13088146#comment-13088146 ] gaojinchao commented on HBASE-4124: --- I have finished the test. To describe the scenario: step 1: start up the cluster. step 2: abort the master right after it finishes calling sendRegionOpen(destination, regions). step 3: start up the cluster again. The above steps reproduce the issue: during master failover, the meta records the dead server, but the region is being processed by a living region server. ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13088147#comment-13088147 ] gaojinchao commented on HBASE-4124: --- Sorry, step 3 should be: start up the master again. ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4124: -- Attachment: HBASE-4124_Branch90V2.patch ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4124: -- Attachment: (was: HBASE-4124_Branch90V2.patch) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4124: -- Attachment: HBASE-4124_Branch90V2.patch ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4124: -- Attachment: (was: HBASE-4124_Branch90V2.patch) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13088173#comment-13088173 ] gaojinchao commented on HBASE-4124: --- I have added a test case for opening a region. ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Key: HBASE-4124 URL: https://issues.apache.org/jira/browse/HBASE-4124 Project: HBase Issue Type: Bug Components: master Reporter: fulin wang Attachments: HBASE-4124_Branch90V1_trial.patch, HBASE-4124_Branch90V2.patch, log.txt Original Estimate: 0.4h Remaining Estimate: 0.4h ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'. Issue: The RS failed because of 'already online on this server' and returned; the HM cannot receive the message and reports 'Regions in transition timed out'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-4212) TestMasterFailover fails occasionally
TestMasterFailover fails occasionally - Key: HBASE-4212 URL: https://issues.apache.org/jira/browse/HBASE-4212 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.4 Reporter: gaojinchao Fix For: 0.90.5 It seems to be a bug: the ROOT region in RIT can't be removed. In the failover process, the master enforces ROOT on-line but does not clean the zk node, so the test will wait forever.
{code}
void processFailover() throws KeeperException, IOException, InterruptedException {
  // we enforce on-line root.
  HServerInfo hsi =
    this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
  regionOnline(HRegionInfo.FIRST_META_REGIONINFO, hsi);
  hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
  regionOnline(HRegionInfo.ROOT_REGIONINFO, hsi);
{code}
It seems that we should wait for the assignment to finish, as is done for the meta region:
{code}
int assignRootAndMeta() throws InterruptedException, IOException, KeeperException {
  int assigned = 0;
  long timeout = this.conf.getLong("hbase.catalog.verification.timeout", 1000);
  // Work on ROOT region. Is it in zk in transition?
  boolean rit = this.assignmentManager.
    processRegionInTransitionAndBlockUntilAssigned(HRegionInfo.ROOT_REGIONINFO);
  if (!catalogTracker.verifyRootRegionLocation(timeout)) {
    this.assignmentManager.assignRoot();
    this.catalogTracker.waitForRoot();
    // we need to add this code to guarantee that the transition has completed
    this.assignmentManager.waitForAssignment(HRegionInfo.ROOT_REGIONINFO);
    assigned++;
  }
{code}
logs:
{noformat}
2011-08-16 07:45:40,715 DEBUG [RegionServer:0;C4S2.site,47710,1313495126115-EventThread] zookeeper.ZooKeeperWatcher(252): regionserver:47710-0x131d2690f780004 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/70236052
2011-08-16 07:45:40,715 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] zookeeper.ZKAssign(712): regionserver:47710-0x131d2690f780004 Successfully transitioned node 70236052 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
2011-08-16 07:45:40,715 DEBUG [Thread-760-EventThread] zookeeper.ZooKeeperWatcher(252): master:60701-0x131d2690f780009 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/70236052
2011-08-16 07:45:40,716 INFO [PostOpenDeployTasks:70236052] catalog.RootLocationEditor(62): Setting ROOT region location in ZooKeeper as C4S2.site:47710
2011-08-16 07:45:40,716 DEBUG [Thread-760-EventThread] zookeeper.ZKUtil(1109): master:60701-0x131d2690f780009 Retrieved 52 byte(s) of data from znode /hbase/unassigned/70236052 and set watcher; region=-ROOT-,,0, server=C4S2.site,47710,1313495126115, state=RS_ZK_REGION_OPENING
2011-08-16 07:45:40,717 DEBUG [Thread-760-EventThread] master.AssignmentManager(477): Handling transition=RS_ZK_REGION_OPENING, server=C4S2.site,47710,1313495126115, region=70236052/-ROOT-
2011-08-16 07:45:40,725 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] zookeeper.ZKAssign(661): regionserver:47710-0x131d2690f780004 Attempting to transition node 70236052/-ROOT- from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
2011-08-16 07:45:40,727 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] zookeeper.ZKUtil(1109): regionserver:47710-0x131d2690f780004 Retrieved 52 byte(s) of data from znode /hbase/unassigned/70236052; data=region=-ROOT-,,0, server=C4S2.site,47710,1313495126115, state=RS_ZK_REGION_OPENING
2011-08-16 07:45:40,740 DEBUG [RegionServer:0;C4S2.site,47710,1313495126115-EventThread] zookeeper.ZooKeeperWatcher(252): regionserver:47710-0x131d2690f780004 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/70236052
2011-08-16 07:45:40,740 DEBUG [Thread-760-EventThread] zookeeper.ZooKeeperWatcher(252): master:60701-0x131d2690f780009 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/70236052
2011-08-16 07:45:40,740 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] zookeeper.ZKAssign(712): regionserver:47710-0x131d2690f780004 Successfully transitioned node 70236052 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
2011-08-16 07:45:40,741 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] handler.OpenRegionHandler(121): Opened -ROOT-,,0.70236052
2011-08-16 07:45:40,741 DEBUG [Thread-760-EventThread] zookeeper.ZKUtil(1109): master:60701-0x131d2690f780009 Retrieved 52 byte(s) of data from znode /hbase/unassigned/70236052 and set watcher; region=-ROOT-,,0, server=C4S2.site,47710,1313495126115, state=RS_ZK_REGION_OPENED
2011-08-16 07:45:40,741 DEBUG [Thread-760-EventThread] master.AssignmentManager(477): Handling transition=RS_ZK_REGION_OPENED, server=C4S2.site,47710,1313495126115, region=70236052/-ROOT-
{noformat}
// It said that the zk node can't be cleaned because we have
[jira] [Updated] (HBASE-4212) TestMasterFailover fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4212: -- Attachment: HBASE-4212_branch90V1.patch TestMasterFailover fails occasionally - Key: HBASE-4212 URL: https://issues.apache.org/jira/browse/HBASE-4212 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.4 Reporter: gaojinchao Fix For: 0.90.5 Attachments: HBASE-4212_branch90V1.patch It seems to be a bug: the ROOT region in RIT can't be removed. In the failover process, the master enforces ROOT on-line but does not clean the zk node, so the test will wait forever.
{code}
void processFailover() throws KeeperException, IOException, InterruptedException {
  // we enforce on-line root.
  HServerInfo hsi =
    this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
  regionOnline(HRegionInfo.FIRST_META_REGIONINFO, hsi);
  hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
  regionOnline(HRegionInfo.ROOT_REGIONINFO, hsi);
{code}
It seems that we should wait for the assignment to finish, as is done for the meta region:
{code}
int assignRootAndMeta() throws InterruptedException, IOException, KeeperException {
  int assigned = 0;
  long timeout = this.conf.getLong("hbase.catalog.verification.timeout", 1000);
  // Work on ROOT region. Is it in zk in transition?
  boolean rit = this.assignmentManager.
    processRegionInTransitionAndBlockUntilAssigned(HRegionInfo.ROOT_REGIONINFO);
  if (!catalogTracker.verifyRootRegionLocation(timeout)) {
    this.assignmentManager.assignRoot();
    this.catalogTracker.waitForRoot();
    // we need to add this code to guarantee that the transition has completed
    this.assignmentManager.waitForAssignment(HRegionInfo.ROOT_REGIONINFO);
    assigned++;
  }
{code}
logs:
{noformat}
2011-08-16 07:45:40,715 DEBUG [RegionServer:0;C4S2.site,47710,1313495126115-EventThread] zookeeper.ZooKeeperWatcher(252): regionserver:47710-0x131d2690f780004 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/70236052
2011-08-16 07:45:40,715 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] zookeeper.ZKAssign(712): regionserver:47710-0x131d2690f780004 Successfully transitioned node 70236052 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
2011-08-16 07:45:40,715 DEBUG [Thread-760-EventThread] zookeeper.ZooKeeperWatcher(252): master:60701-0x131d2690f780009 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/70236052
2011-08-16 07:45:40,716 INFO [PostOpenDeployTasks:70236052] catalog.RootLocationEditor(62): Setting ROOT region location in ZooKeeper as C4S2.site:47710
2011-08-16 07:45:40,716 DEBUG [Thread-760-EventThread] zookeeper.ZKUtil(1109): master:60701-0x131d2690f780009 Retrieved 52 byte(s) of data from znode /hbase/unassigned/70236052 and set watcher; region=-ROOT-,,0, server=C4S2.site,47710,1313495126115, state=RS_ZK_REGION_OPENING
2011-08-16 07:45:40,717 DEBUG [Thread-760-EventThread] master.AssignmentManager(477): Handling transition=RS_ZK_REGION_OPENING, server=C4S2.site,47710,1313495126115, region=70236052/-ROOT-
2011-08-16 07:45:40,725 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] zookeeper.ZKAssign(661): regionserver:47710-0x131d2690f780004 Attempting to transition node 70236052/-ROOT- from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
2011-08-16 07:45:40,727 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] zookeeper.ZKUtil(1109): regionserver:47710-0x131d2690f780004 Retrieved 52 byte(s) of data from znode /hbase/unassigned/70236052; data=region=-ROOT-,,0, server=C4S2.site,47710,1313495126115, state=RS_ZK_REGION_OPENING
2011-08-16 07:45:40,740 DEBUG [RegionServer:0;C4S2.site,47710,1313495126115-EventThread] zookeeper.ZooKeeperWatcher(252): regionserver:47710-0x131d2690f780004 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/70236052
2011-08-16 07:45:40,740 DEBUG [Thread-760-EventThread] zookeeper.ZooKeeperWatcher(252): master:60701-0x131d2690f780009 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/70236052
2011-08-16 07:45:40,740 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] zookeeper.ZKAssign(712): regionserver:47710-0x131d2690f780004 Successfully transitioned node 70236052 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
2011-08-16 07:45:40,741 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] handler.OpenRegionHandler(121): Opened -ROOT-,,0.70236052
2011-08-16 07:45:40,741 DEBUG [Thread-760-EventThread] zookeeper.ZKUtil(1109): master:60701-0x131d2690f780009 Retrieved 52 byte(s) of data from znode /hbase/unassigned/70236052 and set watcher;
{noformat}
[jira] [Commented] (HBASE-4212) TestMasterFailover fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086199#comment-13086199 ] gaojinchao commented on HBASE-4212: --- I have made a patch. Please review it. Thanks.
[jira] [Commented] (HBASE-4212) TestMasterFailover fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086202#comment-13086202 ] gaojinchao commented on HBASE-4212: --- I ran the test 10 times; the logs show that META is assigned only after ROOT assignment has finished.

2011-08-17 05:06:51,419 DEBUG [MASTER_OPEN_REGION-C4S2.site:47578-0] zookeeper.ZKUtil(1109): master:47578-0x131d6fe02e50009 Retrieved 52 byte(s) of data from znode /hbase/unassigned/70236052; data=region=-ROOT-,,0, server=C4S2.site,60960,1313571996605, state=RS_ZK_REGION_OPENED
2011-08-17 05:06:51,425 DEBUG [Thread-755-EventThread] zookeeper.ZooKeeperWatcher(252): master:47578-0x131d6fe02e50009 Received ZooKeeper Event, type=NodeDeleted, state=SyncConnected, path=/hbase/unassigned/70236052
2011-08-17 05:06:51,425 DEBUG [MASTER_OPEN_REGION-C4S2.site:47578-0] zookeeper.ZKAssign(420): master:47578-0x131d6fe02e50009 Successfully deleted unassigned node for region 70236052 in expected state RS_ZK_REGION_OPENED
2011-08-17 05:06:51,426 INFO [Master:0;C4S2.site:47578] master.HMaster(437): -ROOT- assigned=1, rit=false, location=C4S2.site:60960
2011-08-17 05:06:51,426 DEBUG [MASTER_OPEN_REGION-C4S2.site:47578-0] handler.OpenedRegionHandler(108): Opened region -ROOT-,,0.70236052 on C4S2.site,60960,1313571996605
2011-08-17 05:06:51,427 DEBUG [Master:0;C4S2.site:47578] zookeeper.ZKUtil(553): master:47578-0x131d6fe02e50009 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
2011-08-17 05:06:51,429 INFO [Master:0;C4S2.site:47578] catalog.CatalogTracker(421): Passed metaserver is null
[jira] [Updated] (HBASE-4212) TestMasterFailover fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4212: -- Assignee: gaojinchao Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-4212) TestMasterFailover fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4212: -- Attachment: HBASE-4212_TrunkV1.patch
[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.
[ https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4124: -- Attachment: HBASE-4124_Branch90V1_trial.patch

I have attached a trial patch for this issue, but I have only run the unit tests so far. Please review it first and give me suggestions; I will test it tomorrow. Thanks.

ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

Key: HBASE-4124
URL: https://issues.apache.org/jira/browse/HBASE-4124
Project: HBase
Issue Type: Bug
Components: master
Reporter: fulin wang
Attachments: HBASE-4124_Branch90V1_trial.patch, log.txt
Original Estimate: 0.4h
Remaining Estimate: 0.4h

ZooKeeper restarted while a region was being assigned; the new active HMaster re-assigned it, but the region server warned 'already online on this server'. Issue: the RS failed because of 'already online on this server' and returned, so the HM never receives the message and reports 'Regions in transition timed out'.
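The failure mode here is that the region server bails out silently when the region is already online, so the master's region-in-transition entry never clears. A minimal sketch of the idea behind such a fix, with entirely hypothetical names (this is not the actual HBASE-4124 patch): even on the "already online" path, the RS should still complete the handshake with the master.

{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical open-region handler sketch. The key point: when the region
// is already online, acknowledge to the master anyway instead of just
// warning and returning, otherwise the master's RIT entry times out forever.
public class OpenRegionHandlerSketch {
    interface MasterNotifier {
        void regionOpened(String regionName); // e.g. transition the znode to OPENED
    }

    private final Set<String> onlineRegions = ConcurrentHashMap.newKeySet();
    private final MasterNotifier notifier;

    OpenRegionHandlerSketch(MasterNotifier notifier) {
        this.notifier = notifier;
    }

    void process(String regionName) {
        if (!onlineRegions.add(regionName)) {
            // Buggy behaviour: warn "already online on this server" and return.
            // Sketched fix: still tell the master the region is open so the
            // re-assignment triggered by the ZK restart can complete.
            notifier.regionOpened(regionName);
            return;
        }
        // ... normal open path: initialize the region, then acknowledge ...
        notifier.regionOpened(regionName);
    }
}
{code}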
[jira] [Commented] (HBASE-3845) data loss because lastSeqWritten can miss memstore edits
[ https://issues.apache.org/jira/browse/HBASE-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086728#comment-13086728 ] gaojinchao commented on HBASE-3845: --- Hi, has the patch not been applied to the branch yet?

data loss because lastSeqWritten can miss memstore edits

Key: HBASE-3845
URL: https://issues.apache.org/jira/browse/HBASE-3845
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.3
Reporter: Prakash Khemani
Assignee: ramkrishna.s.vasudevan
Priority: Critical
Fix For: 0.90.5
Attachments: 0001-HBASE-3845-data-loss-because-lastSeqWritten-can-miss.patch, HBASE-3845-fix-TestResettingCounters-test.txt, HBASE-3845_1.patch, HBASE-3845_2.patch, HBASE-3845_4.patch, HBASE-3845_5.patch, HBASE-3845_6.patch, HBASE-3845__trunk.patch, HBASE-3845_trunk_2.patch, HBASE-3845_trunk_3.patch

(I don't have a test case to prove this yet, but I have run it by Dhruba and Kannan internally and wanted to put this up for some feedback.) In this discussion let us assume that the region has only one column family; that way I can use region/memstore interchangeably.

After a memstore flush it is possible for lastSeqWritten to hold a log-sequence-id for a region that is not the earliest log-sequence-id in that region's memstore. HLog.append() does a putIfAbsent into lastSeqWritten; this ensures that we only keep track of the earliest log-sequence-number present in the memstore. Every time the memstore is flushed we remove the region's entry from lastSeqWritten and wait for the next append to repopulate it. This is where the problem happens.

Step 1: flusher.prepare() snapshots the memstore under HRegion.updatesLock.writeLock().
Step 2: as soon as updatesLock.writeLock() is released, new entries are added to the memstore.
Step 3: wal.completeCacheFlush() is called. This method removes the region's entry from lastSeqWritten.
Step 4: the next append creates a new entry for the region in lastSeqWritten, but with the log seq id of the current append. All the edits that were added in step 2 are missing.

As a temporary measure, instead of removing the region's entry in step 3, I will replace it with the log-seq-id of the region-flush-event.
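The four-step race and the proposed temporary measure are easy to see in miniature. A self-contained sketch using a plain concurrent map (illustrative only, not the HLog code; the flushSeqId parameter stands in for the log-seq-id of the flush event):

{code}
import java.util.concurrent.ConcurrentHashMap;

// lastSeqWritten maps region -> earliest log seq id still in the memstore.
public class LastSeqWrittenSketch {
    private final ConcurrentHashMap<String, Long> lastSeqWritten =
        new ConcurrentHashMap<>();

    // HLog.append(): only the first edit after a flush wins, so the map
    // tracks the earliest unflushed sequence id for the region.
    void append(String region, long seqId) {
        lastSeqWritten.putIfAbsent(region, seqId);
    }

    // Buggy completeCacheFlush(): removing the entry opens a window where
    // edits appended between the snapshot (step 2) and this call (step 3)
    // are forgotten; the next append() re-creates the entry with a *later*
    // seq id (step 4), so those edits can be lost on log roll.
    void completeCacheFlushBuggy(String region) {
        lastSeqWritten.remove(region);
    }

    // Temporary measure from the issue: replace the entry with the seq id
    // of the flush event instead of removing it. The flush seq id predates
    // any edit added during the flush, so this is a conservative bound and
    // no step-2 edit can be attributed a too-late sequence id.
    void completeCacheFlushFixed(String region, long flushSeqId) {
        lastSeqWritten.put(region, flushSeqId);
    }
}
{code}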
[jira] [Commented] (HBASE-3933) Hmaster throw NullPointerException
[ https://issues.apache.org/jira/browse/HBASE-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13084957#comment-13084957 ] gaojinchao commented on HBASE-3933: --- Hi all, I have a new idea for this issue: why don't we get the region server list from ZK during failover? That way we can avoid the case where the HLog has been split while the region server is still serving.

Hmaster throw NullPointerException
--

Key: HBASE-3933
URL: https://issues.apache.org/jira/browse/HBASE-3933
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
Attachments: Hmastersetup0.90

NullPointerException while hmaster is starting:
{code}
java.lang.NullPointerException
  at java.util.TreeMap.getEntry(TreeMap.java:324)
  at java.util.TreeMap.get(TreeMap.java:255)
  at org.apache.hadoop.hbase.master.AssignmentManager.addToServers(AssignmentManager.java:1512)
  at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:606)
  at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
  at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:402)
  at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
{code}
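The stack trace points to a null key reaching TreeMap.get inside addToServers, i.e. the master looked up a server it no longer (or never) knew about during failover. A minimal sketch of the defensive guard implied by the comment, with hypothetical names (the real suggestion is to rebuild the live-server list from ZK rather than trusting META):

{code}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class FailoverSketch {
    // serverName -> regions hosted there. In the real code a TreeMap.get
    // with a null key is what throws the NPE in the stack trace above.
    private final Map<String, List<String>> serversToRegions =
        new ConcurrentHashMap<>();

    // Called during failover for every region found in META. If the server
    // location is unknown (e.g. the catalog tracker returned null), skip
    // the bookkeeping and let the region be reassigned instead of crashing
    // the master with a NullPointerException.
    void regionOnline(String region, String serverName) {
        if (serverName == null) {
            return; // unknown host: leave the region for reassignment
        }
        serversToRegions
            .computeIfAbsent(serverName, s -> new CopyOnWriteArrayList<>())
            .add(region);
    }
}
{code}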
[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...
[ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13082212#comment-13082212 ] gaojinchao commented on HBASE-4064: --- I will study the trunk code and confirm whether this has been fixed there.

Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

Key: HBASE-4064
URL: https://issues.apache.org/jira/browse/HBASE-4064
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
Fix For: 0.90.5
Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, disableflow.png

1. If there is a stale RegionState object with PENDING_CLOSE left in regionsInTransition (the RegionState was left behind by some exception and should have been removed), but the region is not currently assigned anywhere, TimeoutMonitor falls into an endless loop:

2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
.

2. In the following scenario, two concurrent unassign calls for the same region can lead to the above problem. The first unassign call sends its RPC successfully; the master watches the RS_ZK_REGION_CLOSED event and, while processing it, creates a ClosedRegionHandler to remove the region's state in the master. While the ClosedRegionHandler is running in an hbase.master.executor.closeregion.threads thread (A), another unassign call for the same region runs in another thread (B). When thread B evaluates if (!regions.containsKey(region)), this.regions still holds the region info; the CPU then switches to thread A, which removes the region from both this.regions and regionsInTransition, and then switches back to thread B. Thread B continues and throws an exception with the message Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f, but without removing the newly added RegionState from regionsInTransition, and it can never be removed.

public void unassign(HRegionInfo region, boolean force) {
  LOG.debug("Starting unassignment of region " + region.getRegionNameAsString() +
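The scenario above is a classic check-then-act race: thread B's containsKey check and its later RIT insertion are not atomic with thread A's cleanup. A minimal sketch of making the whole decision atomic under one lock (simplified names and state, not the actual HBASE-4064 patch):

{code}
import java.util.HashMap;
import java.util.Map;

public class UnassignSketch {
    private final Map<String, String> regions = new HashMap<>();             // region -> server
    private final Map<String, String> regionsInTransition = new HashMap<>(); // region -> state

    // Buggy shape: the containsKey check and the RIT update happen in two
    // steps, so ClosedRegionHandler can remove the region in between and
    // leave a PENDING_CLOSE entry in RIT that nothing will ever clear.
    // Fixed shape: do the lookup and the RIT insertion under the same lock,
    // and bail out *before* touching RIT when the region is already gone.
    void unassign(String region) {
        synchronized (regions) {
            String server = regions.get(region);
            if (server == null) {
                // Region is not assigned anywhere: nothing to close, and
                // crucially we never add a PENDING_CLOSE entry for it.
                return;
            }
            regionsInTransition.put(region, "PENDING_CLOSE");
        }
        // ... send the close RPC outside the lock ...
    }

    // ClosedRegionHandler path: remove from both maps under the same lock.
    void regionClosed(String region) {
        synchronized (regions) {
            regions.remove(region);
            regionsInTransition.remove(region);
        }
    }
}
{code}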
[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...
[ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13076000#comment-13076000 ] gaojinchao commented on HBASE-4064: --- Do we need to fix this issue? If so, I will test it; otherwise, should I close the issue?
[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...
[ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070331#comment-13070331 ] gaojinchao commented on HBASE-4064: --- The master may crash because the pool shutdown is asynchronous. The master shows:

2011-07-22 13:33:27,806 INFO org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 regions of which 2156 are online.
2011-07-22 13:34:28,646 INFO org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 regions of which 982 are online.
2011-07-22 13:34:31,079 WARN org.apache.hadoop.hbase.master.AssignmentManager: gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,080 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x31502ef4f0 Creating (or updating) unassigned node for c9b1c97ac6c00033ceb1890e45e66229 with OFFLINE state
2011-07-22 13:34:31,104 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. state=OFFLINE, ts=1311312871080
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. so generated a random one; hri=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229., src=, dest=C4C2.site,60020,1311310281335; 3 (online=3, exclude=null) available servers
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. to C4C2.site,60020,1311310281335
2011-07-22 13:34:31,122 WARN org.apache.hadoop.hbase.master.AssignmentManager: gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,123 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state trying to OFFLINE; ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. state=PENDING_OPEN, ts=1311312871121
java.lang.IllegalStateException
  at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1081)
  at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1036)
  at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:864)
  at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:844)
  at java.lang.Thread.run(Thread.java:662)
2011-07-22 13:34:31,125 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
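The crash above is consistent with "pool shutdown is asynchronous": ExecutorService.shutdown() returns immediately while already-submitted assign tasks keep running and race the assignment state machine. A small runnable illustration of that property (plain java.util.concurrent, not the HBase code itself):

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownIsAsync {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> {
            try { Thread.sleep(500); } catch (InterruptedException ignored) { }
            System.out.println("task still ran after shutdown() returned");
        });

        pool.shutdown(); // returns immediately; in-flight tasks keep running
        System.out.println("shutdown() returned, terminated? " + pool.isTerminated());

        // To really wait for in-flight work (what the enable/disable handler
        // would need before another actor may mutate region state):
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("after awaitTermination, terminated? " + pool.isTerminated());
    }
}
{code}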
[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...
[ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4064: -- Attachment: (was: HBASE-4064_branch90V2.patch)
[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...
[ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4064: -- Attachment: disableflow.png
[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...
[ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070304#comment-13070304 ] gaojinchao commented on HBASE-4064: --- !disableflow.png!
[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...
[ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070306#comment-13070306 ] gaojinchao commented on HBASE-4064: --- The patch can't solve the issue J-D raised, but it is an improvement for disabling a table. I made a flow chart (A-B-C-D). We can see there is a window between removing the region from RIT and removing it from the region collections, so my patch changes the position of these two steps.
[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...
[ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4064: -- Attachment: HBASE-4064_branch90V2.patch Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long... Key: HBASE-4064 URL: https://issues.apache.org/jira/browse/HBASE-4064 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: Jieshan Bean Fix For: 0.90.5 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, HBASE-4064_branch90V2.patch, disableflow.png
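The failure above is a classic check-then-act race: thread B's containsKey test and its later read of the hosting server are separated in time, and thread A's removal can land in between. The sketch below models that window with plain Java collections and invented names; it is only an illustration of the race shape, not the actual AssignmentManager code.
{code}
import java.util.HashMap;
import java.util.Map;

// A model of the check-then-act window described above; all names are
// invented for illustration and none of this is the real HBase code.
public class UnassignRaceSketch {
  private final Map<String, String> regions = new HashMap<String, String>();

  // Racy shape: thread A (the close handler) can remove the region between
  // the containsKey check and the get, so the server comes back null.
  public void unassignRacy(String region) {
    if (!regions.containsKey(region)) {
      return; // "not currently assigned anywhere"
    }
    String server = regions.get(region); // window: a concurrent remove may land here
    if (server == null) {
      // corresponds to "Passed server is null for <encoded name>"
      throw new IllegalStateException("Passed server is null for " + region);
    }
    // ... proceed with the close RPC ...
  }

  // Safer shape: one read under the same lock that guards removal, so the
  // check and the use cannot be separated by a concurrent remove.
  public void unassignSafe(String region) {
    synchronized (regions) {
      String server = regions.get(region);
      if (server == null) {
        return; // already closed and removed; nothing to do
      }
      // ... proceed with the close RPC while still consistent ...
    }
  }
}
{code}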
[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...
[ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070310#comment-13070310 ] gaojinchao commented on HBASE-4064: --- I have made a patch, but I haven't verified it yet. I want to review whether it is reasonable first, then do that. In my cluster I had changed the parameter (hbase.bulk.assignment.waiton.empty.rit) to avoid this issue. Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long... Key: HBASE-4064 URL: https://issues.apache.org/jira/browse/HBASE-4064 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: Jieshan Bean Fix For: 0.90.5 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, HBASE-4064_branch90V2.patch, disableflow.png
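For reference, the workaround mentioned in the comment can be set like any other HBase configuration property. The property name comes from the comment above; the value chosen here and the assumption that it is a timeout in milliseconds are illustrative only, not a recommendation.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RitWaitTuning {
  public static Configuration tunedConf() {
    Configuration conf = HBaseConfiguration.create();
    // Property name taken from the comment above; the value (assumed to be
    // a timeout in milliseconds, here 10 minutes) is only an example.
    conf.setInt("hbase.bulk.assignment.waiton.empty.rit", 600000);
    return conf;
  }
}
{code}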
[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...
[ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13069442#comment-13069442 ] gaojinchao commented on HBASE-4064: --- @J-D Thanks for your reply. :) I got it. In my case, the race is between the disable thread and the ClosedRegionHandler threads:
1. The disable thread gets the regions from the regions collection (see getRegionsOfTable).
2. The thread pool takes a region and sends a close request to the region server; at the same time it puts the region into RIT (regionsInTransition), which indicates that the region is being processed.
3. The region server finishes closing the region, changes the ZK state, and notifies the master.
4. When the master receives the watcher event, it removes the region from RIT and then removes it from the regions collection.
There is a short window here: when disabling the table can't finish within one period, the region may be unassigned again. My patch tries to fix the above case by removing the region from the regions collection first, so the disable thread can't pick up a region that is still being processed. I found another issue yesterday: the enable threads have a similar race condition. (I changed the period to 1 minute in order to reproduce the issue.) It seems the pool couldn't finish before a new enable process started; we need a sleep when an enable period finishes. The master logs:
2011-07-22 13:33:27,806 INFO org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 regions of which 2156 are online.
2011-07-22 13:34:28,646 INFO org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 regions of which 982 are online.
2011-07-22 13:34:31,079 WARN org.apache.hadoop.hbase.master.AssignmentManager: gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,080 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x31502ef4f0 Creating (or updating) unassigned node for c9b1c97ac6c00033ceb1890e45e66229 with OFFLINE state
2011-07-22 13:34:31,104 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. state=OFFLINE, ts=1311312871080
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. so generated a random one; hri=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229., src=, dest=C4C2.site,60020,1311310281335; 3 (online=3, exclude=null) available servers
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. to C4C2.site,60020,1311310281335
2011-07-22 13:34:31,122 WARN org.apache.hadoop.hbase.master.AssignmentManager: gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,123 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state trying to OFFLINE; ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. state=PENDING_OPEN, ts=1311312871121
java.lang.IllegalStateException
at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1081)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1036)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:864)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:844)
at org.apache.hadoop.hbase.master.handler.EnableTableHandler$BulkEnabler$1.run(EnableTableHandler.java:154)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
2011-07-22 13:34:31,125 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
2011-07-22 13:34:31,482 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:6-0x31502ef4f0 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/c9b1c97ac6c00033ceb1890e45e66229
2011-07-22 13:34:31,482 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6-0x31502ef4f0 Unable to get data of znode /hbase/unassigned/c9b1c97ac6c00033ceb1890e45e66229
Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long... Key: HBASE-4064 URL: https://issues.apache.org/jira/browse/HBASE-4064 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: Jieshan Bean Fix For: 0.90.5 Attachments:
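A minimal sketch of the reordering the comment describes: make the region invisible to the disable thread before it leaves RIT, closing the window between "remove from RIT" and "remove from the regions collection". This models the idea with plain maps and invented stand-ins; it is not the actual HBASE-4064 patch.
{code}
import java.util.HashMap;
import java.util.Map;

// Sketch of the ordering fix described above; the two maps are illustrative
// stand-ins for this.regions and this.regionsInTransition.
class RegionOfflineSketch {
  private final Map<String, String> regions = new HashMap<String, String>();
  private final Map<String, String> regionsInTransition = new HashMap<String, String>();

  void regionOffline(String encodedName) {
    // 1. Remove from the regions collection first, so a concurrent disable
    //    thread's getRegionsOfTable() no longer returns this region.
    synchronized (regions) {
      regions.remove(encodedName);
    }
    // 2. Only then drop it from RIT and wake up anyone waiting on it.
    synchronized (regionsInTransition) {
      if (regionsInTransition.remove(encodedName) != null) {
        regionsInTransition.notifyAll();
      }
    }
  }
}
{code}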
[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...
[ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13068876#comment-13068876 ] gaojinchao commented on HBASE-4064: --- Please don't merge the patch; I found another issue and need to dig into whether it is related to this patch. Thanks. Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long... Key: HBASE-4064 URL: https://issues.apache.org/jira/browse/HBASE-4064 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: Jieshan Bean Fix For: 0.90.5 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
[ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13068747#comment-13068747 ] gaojinchao commented on HBASE-4095: --- I added some logging and found that initialReplication is zero. When we create a file in HDFS and don't write any data yet, the reported replication is zero, so the solution has an issue.
2011-07-20 19:38:20,517 WARN [RegionServer:1;C4C3.site,41763,1311161899551] wal.HLog(478): gjc:rollWriter start1311161900517
2011-07-20 19:38:20,650 WARN [RegionServer:0;C4C3.site,35697,1311161899494] wal.HLog(478): gjc:rollWriter start1311161900650
2011-07-20 19:38:20,707 WARN [RegionServer:1;C4C3.site,41763,1311161899551] wal.HLog(518): gjc:updateLock start1311161900707
2011-07-20 19:38:20,707 WARN [RegionServer:1;C4C3.site,41763,1311161899551] wal.HLog(532): gjc:initialReplication start0
2011-07-20 19:38:21,238 WARN [RegionServer:0;C4C3.site,35697,1311161899494] wal.HLog(518): gjc:updateLock start1311161901238
2011-07-20 19:38:21,239 WARN [RegionServer:0;C4C3.site,35697,1311161899494] wal.HLog(532): gjc:initialReplication start0
2011-07-20 19:38:41,726 WARN [IPC Server handler 4 on 37616] wal.HLog(478): gjc:rollWriter start1311161921726
2011-07-20 19:38:41,769 WARN [IPC Server handler 4 on 37616] wal.HLog(518): gjc:updateLock start1311161921769
2011-07-20 19:38:41,769 WARN [IPC Server handler 4 on 37616] wal.HLog(532): gjc:initialReplication start0
Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked Key: HBASE-4095 URL: https://issues.apache.org/jira/browse/HBASE-4095 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.90.3 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif
Some large HLog files (larger than 10 GB) appeared in our environment, and I found the reason why they got so huge:
1. The number of replicas is less than the expected number, so the checkLowReplication method is called on each sync.
2. checkLowReplication requests a log roll first and sets logRollRequested to true:
{noformat}
private void checkLowReplication() {
  // if the number of replicas in HDFS has fallen below the initial
  // value, then roll logs.
  try {
    int numCurrentReplicas = getLogReplication();
    if (numCurrentReplicas != 0 && numCurrentReplicas < this.initialReplication) {
      LOG.warn("HDFS pipeline error detected. " +
        "Found " + numCurrentReplicas + " replicas but expecting " +
        this.initialReplication + " replicas. " +
        " Requesting close of hlog.");
      requestLogRoll();
      logRollRequested = true;
    }
  } catch (Exception e) {
    LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas " + e +
      " still proceeding ahead...");
  }
}
{noformat}
3. requestLogRoll() just submits the roll request; it may not execute in time, because it must acquire the unfair cacheFlushLock, which may be held by the cache-flush threads.
4. logRollRequested stays true until the log roll executes, so during that time every log-roll request made from sync() is skipped.
Here are the logs from when the problem happened (please note the file size of hlog 193-195-5-111%3A20020.1309937386639 in the last row):
2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas. Requesting close of hlog.
2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas. Requesting close of hlog.
2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore
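Following the observation that initialReplication is recorded as zero for a just-created, unwritten HDFS file, one possible guard is to skip the comparison while either replica count is still zero. The sketch below only illustrates that idea with stub members standing in for the real HLog fields; it is not the committed HBASE-4095 fix.
{code}
// Sketch only: skip the low-replication check while either count is 0
// (a just-created, unwritten HDFS file reports replication 0, which made
// the original comparison meaningless). Stub members stand in for the
// real HLog fields and methods.
class LowReplicationCheckSketch {
  private int initialReplication; // captured when the writer is created

  void checkLowReplication() {
    int numCurrentReplicas = getLogReplication();
    if (initialReplication == 0 || numCurrentReplicas == 0) {
      return; // pipeline not established yet; nothing meaningful to compare
    }
    if (numCurrentReplicas < initialReplication) {
      requestLogRoll();
    }
  }

  private int getLogReplication() { return 3; } // stub
  private void requestLogRoll() { }             // stub
}
{code}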
[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...
[ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13068796#comment-13068796 ] gaojinchao commented on HBASE-4064: --- Hi, I verified the issue by adding a sleep in regionOffline. I think V2 is OK. Code below:
public void regionOffline(final HRegionInfo regionInfo) {
  synchronized (this.regionsInTransition) {
    if (this.regionsInTransition.remove(regionInfo.getEncodedName()) != null) {
      this.regionsInTransition.notifyAll();
    }
  }
  try {
    Thread.sleep(1000);
  } catch (Throwable e) {
    ;
  }
  // remove the region plan as well just in case.
  clearRegionPlan(regionInfo);
  setOffline(regionInfo);
}
Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long... Key: HBASE-4064 URL: https://issues.apache.org/jira/browse/HBASE-4064 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: Jieshan Bean Fix For: 0.90.5 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
[jira] [Updated] (HBASE-4112) Creating table may throw NullPointerException
[ https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4112: -- Resolution: Fixed Status: Resolved (was: Patch Available) Creating table may throw NullPointerException - Key: HBASE-4112 URL: https://issues.apache.org/jira/browse/HBASE-4112 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: gaojinchao Assignee: gaojinchao Fix For: 0.90.4 Attachments: HBASE-4112_Trunk.patch, HBASE-4112_branch90V1.patch
It happened on the latest branch 0.90, but I can't reproduce it. It seems better to use the getHRegionInfoOrNull API, or to check the input parameter before calling getHRegionInfo. Code:
public static Writable getWritable(final byte [] bytes, final Writable w) throws IOException {
  return getWritable(bytes, 0, bytes.length, w); // It seems the input parameter bytes is null
}
logs:
11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Socket connection established to C4C3.site/157.5.100.3:2181, initiating session
11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Session establishment complete on server C4C3.site/157.5.100.3:2181, sessionid = 0x2312b8e3f72, negotiated timeout = 18
[INFO] Create : ufdr111 222!
[INFO] Create : ufdr111 start!
java.lang.NullPointerException
at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
at org.apache.hadoop.hbase.client.HBaseAdmin$1.processRow(HBaseAdmin.java:306)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:190)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:95)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:73)
at org.apache.hadoop.hbase.client.HBaseAdmin.createTable(HBaseAdmin.java:325)
at createTable.main(createTable.java:96)
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4112) Creating table threw NullPointerException
[ https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4112: -- Attachment: HBASE-4112_branch90V1.patch Creating table threw NullPointerException - Key: HBASE-4112 URL: https://issues.apache.org/jira/browse/HBASE-4112 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: gaojinchao Fix For: 0.90.4 Attachments: HBASE-4112_branch90V1.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4112) Creating table threw NullPointerException
[ https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13066918#comment-13066918 ] gaojinchao commented on HBASE-4112: --- The reason is that the META table had some dirty data (e.g., a row with only a column=info:server cell); recreating the table will then throw an exception. I have made a patch and verified it; please review it. Thanks. All tests passed. Creating table threw NullPointerException - Key: HBASE-4112 URL: https://issues.apache.org/jira/browse/HBASE-4112 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: gaojinchao Fix For: 0.90.4 Attachments: HBASE-4112_branch90V1.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
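The direction described above (tolerate META rows whose info:regioninfo cell is missing instead of throwing NullPointerException) can be sketched as a null-guarding wrapper around the deserialization. This is a hypothetical helper, not the committed HBASE-4112 patch; the inner deserialization mirrors what Writables.getWritable does in the stack trace above.
{code}
import java.io.IOException;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.Writable;

// Hypothetical null-guarding helper (not the committed HBASE-4112 patch):
// returns null for a dirty META row instead of throwing NullPointerException.
final class WritablesSketch {
  static Writable getWritableOrNull(byte[] bytes, Writable w) throws IOException {
    if (bytes == null || bytes.length == 0) {
      return null; // row had no serialized HRegionInfo (dirty META data)
    }
    DataInputBuffer in = new DataInputBuffer();
    try {
      in.reset(bytes, bytes.length);
      w.readFields(in);
      return w;
    } finally {
      in.close();
    }
  }
}
{code}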
[jira] [Commented] (HBASE-4112) Creating table threw NullPointerException
[ https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13067444#comment-13067444 ] gaojinchao commented on HBASE-4112: --- Returning false means the scan is finished; returning true means continue and process the next record. In this case, true is better (my test shows this as well).
// the code segment for metaScan
for (Result rr : rrs) {
  if (processedRows >= rowUpperLimit) {
    break done;
  }
  if (!visitor.processRow(rr))
    break done; // exit completely
  processedRows++;
}
Creating table threw NullPointerException - Key: HBASE-4112 URL: https://issues.apache.org/jira/browse/HBASE-4112 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: gaojinchao Assignee: gaojinchao Fix For: 0.90.4 Attachments: HBASE-4112_branch90V1.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
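Given that contract (return true to keep scanning, false to stop the whole META scan), a caller-side visitor can simply skip dirty rows. Illustrative only; it assumes the 0.90-era client API that appears in the stack trace above, and is not the committed patch.
{code}
import java.io.IOException;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.MetaScanner;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Writables;

// Illustrative visitor: returning true skips a dirty row and keeps scanning;
// returning false would end the whole META scan early.
class SkipDirtyRowsVisitor implements MetaScanner.MetaScannerVisitor {
  public boolean processRow(Result rowResult) throws IOException {
    byte[] bytes = rowResult.getValue(HConstants.CATALOG_FAMILY,
        HConstants.REGIONINFO_QUALIFIER);
    if (bytes == null) {
      return true; // dirty META row: no info:regioninfo cell, skip it
    }
    HRegionInfo info = Writables.getHRegionInfo(bytes);
    // ... examine info as needed ...
    return true; // keep scanning
  }
}
{code}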
[jira] [Commented] (HBASE-4112) Creating table threw NullPointerException
[ https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13067449#comment-13067449 ] gaojinchao commented on HBASE-4112: --- OK, I'll try to make a patch for TRUNK. Creating table threw NullPointerException - Key: HBASE-4112 URL: https://issues.apache.org/jira/browse/HBASE-4112 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: gaojinchao Assignee: gaojinchao Fix For: 0.90.4 Attachments: HBASE-4112_branch90V1.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4112) Creating table threw NullPointerException
[ https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4112: -- Attachment: HBASE-4112_Trunk.patch Creating table threw NullPointerException - Key: HBASE-4112 URL: https://issues.apache.org/jira/browse/HBASE-4112 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: gaojinchao Assignee: gaojinchao Fix For: 0.90.4 Attachments: HBASE-4112_Trunk.patch, HBASE-4112_branch90V1.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...
[ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13066387#comment-13066387 ] gaojinchao commented on HBASE-4064: --- @Stack: I will reproduce and verify it after I finish the review, because that may take a lot of time. Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long... Key: HBASE-4064 URL: https://issues.apache.org/jira/browse/HBASE-4064 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: Jieshan Bean Fix For: 0.90.5 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...
[ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4064: -- Attachment: HBASE-4064_branch90V2.patch I have tried to make a patch: if the region is in RIT, it shouldn't be unassigned again, so it seems changing the position of the code can solve this issue. All tests passed; please review it and give some suggestions. Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long... Key: HBASE-4064 URL: https://issues.apache.org/jira/browse/HBASE-4064 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: Jieshan Bean Fix For: 0.90.5 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
[jira] [Commented] (HBASE-3933) Hmaster throw NullPointerException
[ https://issues.apache.org/jira/browse/HBASE-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13064379#comment-13064379 ] gaojinchao commented on HBASE-3933: --- OK, thanks. It happens rarely; I can't come up with a better change for now. Hmaster throw NullPointerException -- Key: HBASE-3933 URL: https://issues.apache.org/jira/browse/HBASE-3933 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: gaojinchao Assignee: gaojinchao Attachments: Hmastersetup0.90
NullPointerException while hmaster is starting.
{code}
java.lang.NullPointerException
at java.util.TreeMap.getEntry(TreeMap.java:324)
at java.util.TreeMap.get(TreeMap.java:255)
at org.apache.hadoop.hbase.master.AssignmentManager.addToServers(AssignmentManager.java:1512)
at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:606)
at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:402)
at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
{code}
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
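The stack trace above ends in TreeMap.getEntry, which is how TreeMap.get reacts to a null key, suggesting a null server name reached addToServers during failover. A defensive guard along these lines (a sketch with simplified types; the comment above says no better change was found, so this is only illustrative) would avoid aborting master startup:
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative guard only: TreeMap.get(null) throws NullPointerException,
// so a null server name is skipped and logged instead of crashing the
// master. Simplified types; not a committed HBASE-3933 fix.
class AddToServersSketch {
  private final TreeMap<String, List<String>> servers =
      new TreeMap<String, List<String>>();

  void addToServers(String serverName, String encodedRegionName) {
    if (serverName == null) {
      System.err.println("Skipping region " + encodedRegionName
          + " with null server name during failover");
      return;
    }
    List<String> regions = servers.get(serverName);
    if (regions == null) {
      regions = new ArrayList<String>();
      servers.put(serverName, regions);
    }
    regions.add(encodedRegionName);
  }
}
{code}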
[jira] [Commented] (HBASE-3995) HBASE-3946 broke TestMasterFailover
[ https://issues.apache.org/jira/browse/HBASE-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055406#comment-13055406 ] gaojinchao commented on HBASE-3995: --- Hi, Stack. The following code snippet repeats the if (storedInfo == null) check:
if (storedInfo == null) {
  ...
  if (storedInfo == null) {
    storedInfo = this.onlineServers.get(info.getServerName());
  }
}
HBASE-3946 broke TestMasterFailover --- Key: HBASE-3995 URL: https://issues.apache.org/jira/browse/HBASE-3995 Project: HBase Issue Type: Bug Reporter: stack Assignee: stack Priority: Blocker Fix For: 0.90.4 Attachments: am.txt TestMasterFailover is all about a new master coming up on an existing cluster. Previous to HBASE-3946, a new master joining a cluster and processing any dead servers would assign all regions found on the dead server even if they were split parents. We don't want that. But TestMasterFailover mocks up some pretty interesting conditions. The one we were failing on was that while the master was offline, we'd manually add a region to zk that was in CLOSING state. We'd then go and disable the table up in zk (while the master was offline). Finally, we'd kill the server that was supposed to be hosting the region from the disabled table in CLOSING state. Then we'd have the master join the cluster. It had to figure it out. Before HBASE-3946, we'd just force offline every region that had been on the dead server. This would cause all of them to be assigned; on assign, regions from disabled tables are skipped, so it all worked (except it would online the parent of a split, should there be one). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4028) Hmaster crashes caused by splitting log.
[ https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055248#comment-13055248 ] gaojinchao commented on HBASE-4028: --- Ted, thanks for your work. Hmaster crashes caused by splitting log. Key: HBASE-4028 URL: https://issues.apache.org/jira/browse/HBASE-4028 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: gaojinchao Assignee: gaojinchao Fix For: 0.90.4 Attachments: HBASE-4028-0.90V2, Screenshot-2.png, Verifiedresult.png, hbase-root-master-157-5-100-8.rar
In my performance cluster (0.90.3), the HMaster memory went from 100 MB up to 4 GB when one region server crashed. I added some logging in the doneWriting function and found that the value of totalBuffered is negative.
10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used -565832
hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used -565832release size25168
void doneWriting(RegionEntryBuffer buffer) {
  synchronized (this) {
    LOG.warn("gjc1: relase currentlyWriting " + biggestBufferKey + buffer.encodedRegionName);
    boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
    assert removed;
  }
  long size = buffer.heapSize();
  synchronized (dataAvailable) {
    totalBuffered -= size;
    LOG.warn("gjc:release Used " + totalBuffered);
    // We may unblock writers
    dataAvailable.notifyAll();
  }
  LOG.warn("gjc:release Used " + totalBuffered + " release size " + size);
}
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
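One way a counter like totalBuffered can go negative is if the bytes charged when a buffer is queued and the bytes subtracted in doneWriting are computed at different times, since heapSize() may differ between the two calls. The logs above do not confirm that this is the actual cause here; the sketch below only illustrates symmetric accounting, where the exact charged amount is remembered and released.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative symmetric accounting, not the HBASE-4028 patch: the exact
// number of bytes added per buffer is remembered and the same number is
// subtracted on release, so the total can never go negative.
class BufferAccountingSketch {
  private final AtomicLong totalBuffered = new AtomicLong();
  private final Map<String, Long> chargedBytes =
      new ConcurrentHashMap<String, Long>();

  void charge(String encodedRegionName, long heapSize) {
    chargedBytes.put(encodedRegionName, heapSize);
    totalBuffered.addAndGet(heapSize);
  }

  void release(String encodedRegionName) {
    Long size = chargedBytes.remove(encodedRegionName);
    if (size != null) {
      totalBuffered.addAndGet(-size.longValue());
    }
  }
}
{code}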
[jira] [Commented] (HBASE-4028) Hmaster crashes caused by splitting log.
[ https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054846#comment-13054846 ] gaojinchao commented on HBASE-4028: --- Oh, my god! There is another bug, and it is well hidden. :) In the following code snippet, thrown is declared as protected AtomicReference<Throwable> thrown = new AtomicReference<Throwable>(); It is thrown.get() that can be null, not thrown itself, so the condition below is wrong: while (totalBuffered > maxHeapUsage && thrown == null) I have made a new patch. Please review it. Thanks. Hmaster crashes caused by splitting log. Key: HBASE-4028 URL: https://issues.apache.org/jira/browse/HBASE-4028 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: gaojinchao Assignee: gaojinchao Fix For: 0.90.4 Attachments: HBASE-4028-0.90V1.patch, Screenshot-2.png, hbase-root-master-157-5-100-8.rar -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
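Since thrown is an AtomicReference, the expression thrown == null compares the always non-null reference holder, so the loop condition is permanently false and the throttle never waits; that would explain the unbounded buffering seen earlier. A corrected sketch (a simplified wrapper with names from the quoted snippet, not the full HLogSplitter code):
{code}
import java.util.concurrent.atomic.AtomicReference;

// Corrected wait loop from the comment above: `thrown` is an
// AtomicReference<Throwable>, so `thrown == null` is always false and the
// original loop never blocked; testing thrown.get() restores the throttle.
class EntryBuffersSketch {
  private final AtomicReference<Throwable> thrown =
      new AtomicReference<Throwable>();
  private final Object dataAvailable = new Object();
  private long totalBuffered = 0;
  private final long maxHeapUsage = 128 * 1024 * 1024; // illustrative limit

  void waitForRoom() throws InterruptedException {
    synchronized (dataAvailable) {
      while (totalBuffered > maxHeapUsage && thrown.get() == null) {
        dataAvailable.wait(3000); // matches the 3-second cadence in the logs
      }
    }
  }
}
{code}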
[jira] [Updated] (HBASE-4028) Hmaster crashes caused by splitting log.
[ https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4028: -- Attachment: HBASE-4028-0.90V2
[jira] [Commented] (HBASE-4028) Hmaster crashes caused by splitting log.
[ https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054847#comment-13054847 ] gaojinchao commented on HBASE-4028: --- The verified result:
hbase-root-master-157-5-111-22.log:2011-06-25 17:18:53,768 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:18:56,768 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:18:59,768 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:02,768 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:05,769 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:08,769 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:11,769 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:14,769 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:17,769 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:20,770 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:23,770 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of buffered edits, waiting for IO thre
[jira] [Updated] (HBASE-4028) Hmaster crashes caused by splitting log.
[ https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4028: -- Attachment: (was: HBASE-4028-0.90V1.patch)
[jira] [Assigned] (HBASE-4028) Hmaster crashes caused by splitting log.
[ https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao reassigned HBASE-4028: - Assignee: gaojinchao
[jira] [Updated] (HBASE-4028) Hmaster crashes caused by splitting log.
[ https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4028: -- Attachment: Screenshot-2.png
[jira] [Updated] (HBASE-4028) Hmaster crashes caused by splitting log.
[ https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4028: -- Attachment: hbase-root-master-157-5-100-8.rar
[jira] [Updated] (HBASE-4028) Hmaster crashes caused by splitting log.
[ https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4028: -- Attachment: HBASE-4028-0.90V1.patch
[jira] [Commented] (HBASE-3892) Table can't disable
[ https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13048448#comment-13048448 ] gaojinchao commented on HBASE-3892: --- TRUNK doesn't need this; it has been modified for the zk watcher. The code below should protect this case:
// RegionState must be null, or SPLITTING or PENDING_CLOSE.
if (!isInStateForSplitting(regionState)) break;
Table can't disable --- Key: HBASE-3892 URL: https://issues.apache.org/jira/browse/HBASE-3892 Project: HBase Issue Type: Bug Affects Versions: 0.90.3 Reporter: gaojinchao Fix For: 0.90.4 Attachments: AssignmentManager_90v3.patch, AssignmentManager_90v4.patch, logs.rar In TimeoutMonitor: if the node exists and the node state is RS_ZK_REGION_CLOSED, we should send a zk message again when closing the region times out; in this case, some messages may be lost. I see. It seems like a bug. This is my analysis.
// disable table and master sent Close message to region server, Region state was set PENDING_CLOSE
2011-05-08 17:44:25,745 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175) for region ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
// received splitting message and cleared Region state (PENDING_CLOSE)
2011-05-08 17:46:45,303 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
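For context on the trunk guard quoted in the comment above, here is a minimal sketch of that kind of state gate (hypothetical enum and method shape; the real trunk AssignmentManager checks its RegionState object, not a bare enum):

public class SplitStateGate {
    enum RegionState { SPLITTING, PENDING_CLOSE, OPEN, CLOSED }

    // RegionState must be null, or SPLITTING or PENDING_CLOSE.
    static boolean isInStateForSplitting(RegionState state) {
        return state == null
            || state == RegionState.SPLITTING
            || state == RegionState.PENDING_CLOSE;
    }

    public static void main(String[] args) {
        System.out.println(isInStateForSplitting(null));                      // true
        System.out.println(isInStateForSplitting(RegionState.OPEN));          // false
        System.out.println(isInStateForSplitting(RegionState.PENDING_CLOSE)); // true
    }
}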
[jira] [Commented] (HBASE-3892) Table can't disable
[ https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13047117#comment-13047117 ] gaojinchao commented on HBASE-3892: --- It didn't reproduce, so my guess is that J-D is right. The logs below show that the region server repeated the message at an interval of 60 s, so it should be an IPC timeout.
2011-05-08 17:43:45,507 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:6
2011-05-08 17:44:45,521 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:6
2011-05-08 17:45:45,524 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:6
2011-05-08 17:46:45,528 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:6
The HMaster logs show the regionServerReport IPC had been closed, which also points to an IPC timeout:
2011-05-08 17:52:47,703 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call regionServerReport(serverName=C4C3.site,60020,1304820199474, load=(requests=0, regions=55, usedHeap=1058, maxHeap=8175), [Lorg.apache.hadoop.hbase.HMsg;@1453ecec, [Lorg.apache.hadoop.hbase.HRegionInfo;@11e78461) from 157.5.100.3:37518: output error
2011-05-08 17:52:47,704 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 6 caught: java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
But I can't dig out the root cause yet.
[jira] [Commented] (HBASE-3892) Table can't disable
[ https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13045889#comment-13045889 ] gaojinchao commented on HBASE-3892: --- No, it needs review and merge.
[jira] [Updated] (HBASE-3892) Table can't disable
[ https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-3892: -- Attachment: AssignmentManager_90v4.patch
[jira] [Commented] (HBASE-3892) Table can't disable
[ https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042715#comment-13042715 ] gaojinchao commented on HBASE-3892: --- I am not familiar with the unit tests (it seems difficult to send a double report of a split and test the cluster function), so I verified it by modifying the region server code. Logs:
2011-06-02 19:57:49,056 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr2,8613816508215#1308,1307014802589.c5a8a47d9c84f417b9fcc4c8019e7c7e.: Daughters; ufdr2,8613816508215#1308,1307015867020.37481173e31ea469bcaa310cf8d7d980., ufdr2,8613816595415#3432,1307015867020.afbf02ef235cabe66026f7c393d79bc0. from C4C4.site,60020,1307015130114
2011-06-02 19:57:49,057 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6-0x30502c278f Unable to get data of znode /hbase/unassigned/c5a8a47d9c84f417b9fcc4c8019e7c7e because node does not exist (not necessarily an error)
2011-06-02 19:57:49,081 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr2,8613816508215#1308,1307014802589.c5a8a47d9c84f417b9fcc4c8019e7c7e.: Daughters; ufdr2,8613816508215#1308,1307015867020.37481173e31ea469bcaa310cf8d7d980., ufdr2,8613816595415#3432,1307015867020.afbf02ef235cabe66026f7c393d79bc0. from C4C4.site,60020,1307015130114
2011-06-02 19:57:49,083 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6-0x30502c278f Unable to get data of znode /hbase/unassigned/c5a8a47d9c84f417b9fcc4c8019e7c7e because node does not exist (not necessarily an error)
2011-06-02 19:57:49,083 WARN org.apache.hadoop.hbase.master.AssignmentManager: Trying to process the split of 37481173e31ea469bcaa310cf8d7d980, but it was already done and one daughter is on region server serverName=C4C4.site,60020,1307015130114, load=(requests=0, regions=0, usedHeap=0, maxHeap=0)
2011-06-02 19:57:56,468 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr2,8613819021840#1446,1307014756068.f865d41d918297f30b576b9ea3ccea07.: Daughters; ufdr2,8613819021840#1446,1307015873554.baa21e4f0cfa5840f009d0fac8e83d15., ufdr2,8613819104397#3916,1307015873554.fb63f608e5e37f5e85d71c925bc78010. from C4C3.site,60020,1307015129703
2011-06-02 19:57:56,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6-0x30502c278f Unable to get data of znode /hbase/unassigned/f865d41d918297f30b576b9ea3ccea07 because node does not exist (not necessarily an error)
2011-06-02 19:57:56,472 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr2,8613819021840#1446,1307014756068.f865d41d918297f30b576b9ea3ccea07.: Daughters; ufdr2,8613819021840#1446,1307015873554.baa21e4f0cfa5840f009d0fac8e83d15., ufdr2,8613819104397#3916,1307015873554.fb63f608e5e37f5e85d71c925bc78010. from C4C3.site,60020,1307015129703
2011-06-02 19:57:56,474 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6-0x30502c278f Unable to get data of znode /hbase/unassigned/f865d41d918297f30b576b9ea3ccea07 because node does not exist (not necessarily an error)
2011-06-02 19:57:56,474 WARN org.apache.hadoop.hbase.master.AssignmentManager: Trying to process the split of baa21e4f0cfa5840f009d0fac8e83d15, but it was already done and one daughter is on region server serverName=C4C3.site,60020,1307015129703, load=(requests=0, regions=0, usedHeap=0, maxHeap=0)
Thanks for your hint. It should be a 60-second timeout; the region server repeated the message at roughly 60 s intervals:
2011-05-08 17:43:45,507 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:6
2011-05-08 17:44:45,521 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:6
It seems to be a race on regionsInTransition, so the IPC was blocked. I will try to reproduce it. The HMaster logs show it received many RS_ZK_REGION_CLOSED messages:
2011-05-08 17:43:45,157 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/83c05d9ead23d9a260edf30dc8739cf7 and set watcher; region=ufdr,2011050802#8613815394007#0610,1304847545412.83c05d9ead23d9a260edf30dc8739cf7., server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSING
2011-05-08 17:43:45,525 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
2011-05-08 17:43:48,943 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/5e3bacf3f43b6bad874e80c2f971e632 and set watcher;
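The "already done" warnings in the verification logs above come from guarding against a repeated split report. A minimal sketch of that idea, assuming a simple map-based model (hypothetical names; not the actual AssignmentManager patch):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SplitReportGuard {
    // region-in-transition state keyed by encoded region name
    private final Map<String, String> regionsInTransition = new HashMap<String, String>();
    // parents whose split has already been processed
    private final Set<String> processedSplits = new HashSet<String>();

    synchronized void handleSplitReport(String parentEncodedName) {
        if (!processedSplits.add(parentEncodedName)) {
            // A repeated report: log and return without touching RIT state,
            // so a PENDING_CLOSE entry for a daughter cannot be clobbered.
            System.out.println("Trying to process the split of "
                + parentEncodedName + ", but it was already done");
            return;
        }
        // First report: normal split handling would update RIT state here.
        regionsInTransition.remove(parentEncodedName);
    }
}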
[jira] [Updated] (HBASE-3892) Table can't disable
[ https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-3892: -- Attachment: AssignmentManager_90v3.patch
[jira] [Updated] (HBASE-3892) Table can't disable
[ https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-3892: -- Attachment: (was: AssignmentManager_90.patch)
[jira] [Commented] (HBASE-3892) Table can't disable
[ https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042050#comment-13042050 ] gaojinchao commented on HBASE-3892: --- I kept digging and found that the repeated message was sent by the region server: if regionServerReport throws an exception, the region server connects to the HMaster again and sends the message again.
// region server logs
2011-05-08 17:43:45,507 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:6
2011-05-08 17:44:45,521 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:6
2011-05-08 17:45:45,524 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:6
2011-05-08 17:46:45,528 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:6
2011-05-08 17:47:45,531 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:6
2011-05-08 17:48:45,535 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:6
2011-05-08 17:49:46,091 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:6
2011-05-08 17:50:46,096 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:6
2011-05-08 17:51:46,099 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:6
2011-05-08 17:52:46,104 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:6
// region server code
List<HMsg> tryRegionServerReport(final List<HMsg> outboundMessages) throws IOException {
  this.serverInfo.setLoad(buildServerLoad());
  this.requestCount.set(0);
  addOutboundMsgs(outboundMessages);
  HMsg[] msgs = null;
  while (!this.stopped) {
    try {
      msgs = this.hbaseMaster.regionServerReport(this.serverInfo,
          outboundMessages.toArray(HMsg.EMPTY_HMSG_ARRAY), getMostLoadedRegions());
      break;
    } catch (IOException ioe) {
      if (ioe instanceof RemoteException) {
        ioe = ((RemoteException) ioe).unwrapRemoteException();
      }
      if (ioe instanceof YouAreDeadException) {
        // This will be caught and handled as a fatal error in run()
        throw ioe;
      }
      // Couldn't connect to the master, get location from zk and reconnect
      // Method blocks until new master is found or we are stopped
      getMaster();
    }
  }
Why did regionServerReport throw an exception? It seems the HMaster was busy and the IPC was blocked.
Hmaster logs:
2011-05-08 17:44:25,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call regionServerReport(serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175), [Lorg.apache.hadoop.hbase.HMsg;@520ed128, [Lorg.apache.hadoop.hbase.HRegionInfo;@4ac5c32e) from 157.5.100.4:50187: output error
2011-05-08 17:44:25,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 11 on 6 caught: java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
[jira] [Commented] (HBASE-3892) Table can't disable
[ https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042062#comment-13042062 ] gaojinchao commented on HBASE-3892: --- The patch (AssignmentManager_90v2) looks beneficial. Thanks.
[jira] [Commented] (HBASE-3892) Table can't disable
[ https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13041536#comment-13041536 ] gaojinchao commented on HBASE-3892: --- Hi, Stack. I made a mistake in my analysis above. I read the code again; the root cause is that the split message is received repeatedly:

2011-05-08 17:42:45,514 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467

// The master starts closing (offlining) the daughter region.
2011-05-08 17:43:37,599 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. (offlining)

2011-05-08 17:43:45,525 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467

// The RIT state is set and a CLOSE message is sent.
2011-05-08 17:44:25,745 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175) for region ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.

// The split message is received again and the RIT entry is deleted, so the CLOSED event can no longer be processed.
2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
2011-05-08 17:46:45,303 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
[the identical REGION_SPLIT message then repeats roughly once a minute, at 17:46:45,538, 17:47:45,548, 17:48:45,545, 17:49:46,108, 17:50:46,105, and 17:51:46,117]
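The log above suggests two missing guards on the master: it re-processes the same REGION_SPLIT report every minute, and in doing so it overwrites the daughter region's PENDING_CLOSE entry in the regions-in-transition map, which drops the CLOSED event. Below is a minimal sketch of such a guard; it is illustrative only, with class, field, and state names that are my assumptions, not the actual HBase 0.90 code or the attached patch:

{code:java}
// A minimal sketch, assuming hypothetical names (SplitReportGuard,
// shouldProcessSplit, regionsInTransition). NOT the real HBase code or patch.
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SplitReportGuard {

  // Encoded names of parent regions whose split report was already applied.
  private final Set<String> processedSplits =
      Collections.synchronizedSet(new HashSet<String>());

  // Regions in transition, keyed by encoded region name, e.g. "PENDING_CLOSE".
  private final Map<String, String> regionsInTransition =
      Collections.synchronizedMap(new HashMap<String, String>());

  /**
   * Decide whether a REGION_SPLIT report should be applied. Returns false for
   * duplicates, and for splits whose daughter is already being closed.
   */
  public boolean shouldProcessSplit(String parent, String... daughters) {
    // Never overwrite a daughter the master is already closing: applying the
    // split would delete its RIT entry and the CLOSED event would be lost.
    for (String daughter : daughters) {
      String state = regionsInTransition.get(daughter);
      if ("PENDING_CLOSE".equals(state) || "CLOSING".equals(state)) {
        return false;
      }
    }
    // Drop exact duplicates: the region server re-sends REGION_SPLIT about
    // once a minute until it is acknowledged, as the log above shows.
    return processedSplits.add(parent);
  }
}
{code}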
[jira] [Updated] (HBASE-3892) Table can't disable
[ https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-3892: -- Attachment: AssignmentManager_90v2.patch Table can't disable --- Key: HBASE-3892 URL: https://issues.apache.org/jira/browse/HBASE-3892 Project: HBase Issue Type: Bug Affects Versions: 0.90.3 Reporter: gaojinchao Fix For: 0.90.4 Attachments: AssignmentManager_90.patch, AssignmentManager_90v2.patch, logs.rar

In TimeoutMonitor: when closing a region times out, if the znode exists and its state is RS_ZK_REGION_CLOSED, we should send the ZK message again; otherwise, as in this case, the message may be lost. I see. It seems like a bug. This is my analysis:

// The table is disabled and the master sends a CLOSE message to the region server; the region state is set to PENDING_CLOSE.
2011-05-08 17:44:25,745 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175) for region ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467

// The split message is received and the region's PENDING_CLOSE state is cleared.
2011-05-08 17:46:45,303 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
[the identical REGION_SPLIT message then repeats roughly once a minute, at 17:46:45,538, 17:47:45,548, 17:48:45,545, 17:49:46,108, 17:50:46,105, 17:51:46,117, and 17:52:46,112]
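For clarity, the TimeoutMonitor change described above boils down to: when a PENDING_CLOSE region times out, first look at its unassigned znode; if the region server has already moved it to RS_ZK_REGION_CLOSED, re-drive the CLOSED handling instead of only resending CLOSE. A rough Java sketch follows; the ZkStateReader interface and the handler method names are placeholders of mine, not the APIs used by AssignmentManager_90v2.patch:

{code:java}
// A rough sketch only; ZkStateReader and the handler methods are placeholders,
// not real HBase 0.90 APIs. The real fix is in AssignmentManager_90v2.patch.
public class CloseTimeoutHandler {

  /** Placeholder for a ZooKeeper reader; returns null if the znode is absent. */
  interface ZkStateReader {
    String getNodeState(String znodePath);
  }

  private final ZkStateReader zk;

  CloseTimeoutHandler(ZkStateReader zk) {
    this.zk = zk;
  }

  /** Called by the TimeoutMonitor when a PENDING_CLOSE region has timed out. */
  void onCloseTimeout(String encodedRegionName) {
    String state = zk.getNodeState("/hbase/unassigned/" + encodedRegionName);
    if ("RS_ZK_REGION_CLOSED".equals(state)) {
      // The region server already closed the region, but the master lost the
      // event (its RIT entry was overwritten by the repeated split message),
      // so re-process the CLOSED event and let the disable complete.
      handleClosedEvent(encodedRegionName);
    } else {
      // The CLOSE really did go missing: send it to the region server again.
      resendClose(encodedRegionName);
    }
  }

  void handleClosedEvent(String encodedRegionName) { /* delegate to the master */ }

  void resendClose(String encodedRegionName) { /* delegate to the master */ }
}
{code}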