[jira] [Commented] (HBASE-4120) isolation and allocation

2012-07-29 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424501#comment-13424501
 ] 

gaojinchao commented on HBASE-4120:
---

Hi Liu Jia,
Are you working on this issue now? When do you plan to finish it?

 isolation and allocation
 

 Key: HBASE-4120
 URL: https://issues.apache.org/jira/browse/HBASE-4120
 Project: HBase
  Issue Type: New Feature
  Components: master, regionserver
Affects Versions: 0.90.2, 0.90.3, 0.90.4, 0.92.0
Reporter: Liu Jia
Assignee: Liu Jia
 Fix For: 0.96.0

 Attachments: Design_document_for_HBase_isolation_and_allocation.pdf, 
 Design_document_for_HBase_isolation_and_allocation_Revised.pdf, 
 HBase_isolation_and_allocation_user_guide.pdf, 
 Performance_of_Table_priority.pdf, 
 Simple_YCSB_Tests_For_TablePriority_Trunk_and_0.90.4.pdf, System 
 Structure.jpg, TablePriority.patch, TablePriority_v12.patch, 
 TablePriority_v12.patch, TablePriority_v15_with_coprocessor.patch, 
 TablePriority_v16_with_coprocessor.patch, TablePriority_v17.patch, 
 TablePriority_v17.patch, TablePriority_v8.patch, TablePriority_v8.patch, 
 TablePriority_v8_for_trunk.patch, TablePrioriy_v9.patch


 The HBase isolation and allocation tool is designed to help users manage 
 cluster resources among different applications and tables.
 When we run a large-scale HBase cluster with many applications on it, there 
 will be lots of problems. At Taobao there is a cluster on which many 
 departments test the performance of their HBase-based applications. On this 
 cluster of 12 servers, only one application can run exclusively at a time, 
 and all the other applications must wait until the previous test has 
 finished.
 After we added the allocation management function to the cluster, 
 applications can share the cluster and run concurrently. If a test engineer 
 wants to make sure there is no interference, he/she can move the other 
 tables out of the group.
 Within a group we use table priority to allocate resources: when the system 
 is busy, we can make sure high-priority tables are not affected by 
 lower-priority tables.
 Different groups can have different region server configurations: groups 
 optimized for reading can have a large block cache, and groups optimized for 
 writing can have a large memstore.
 Tables and region servers can be moved easily between groups, and after 
 changing the configuration, a group can be restarted alone instead of 
 restarting the whole cluster.
 Git repository: https://github.com/ICT-Ope/HBase_allocation
 We hope our work is helpful.
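As a rough illustration of the per-group tuning described above (the group 
mechanism itself comes from the attached patches and is not shown here; the 
two keys below are stock HBase settings, and the values are only 
illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Read-optimized group: large block cache, smaller global memstore.
Configuration readGroup = HBaseConfiguration.create();
readGroup.setFloat("hfile.block.cache.size", 0.40f);
readGroup.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.25f);

// Write-optimized group: small block cache, larger global memstore.
Configuration writeGroup = HBaseConfiguration.create();
writeGroup.setFloat("hfile.block.cache.size", 0.10f);
writeGroup.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.45f);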

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4246) Cluster with too many regions cannot withstand some master failover scenarios

2012-06-20 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398201#comment-13398201
 ] 

gaojinchao commented on HBASE-4246:
---

The version is 0.90.x. I have asked the customer to raise jute.maxbuffer to 64 MB. 
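For reference, jute.maxbuffer is a plain JVM system property read by the 
ZooKeeper client, so it has to be in effect on the master's JVM before the 
client connects (typically via -Djute.maxbuffer=67108864 in HBASE_OPTS); a 
minimal sketch of the programmatic equivalent:

// Raise the ZooKeeper client packet limit to 64 MB. Must run before the
// ZooKeeper client classes are initialized.
System.setProperty("jute.maxbuffer", String.valueOf(64 * 1024 * 1024));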

 Cluster with too many regions cannot withstand some master failover scenarios
 -

 Key: HBASE-4246
 URL: https://issues.apache.org/jira/browse/HBASE-4246
 Project: HBase
  Issue Type: Bug
  Components: master, zookeeper
Affects Versions: 0.90.4
Reporter: Todd Lipcon
Priority: Critical
 Fix For: 0.96.0


 We ran into the following sequence of events:
 - master startup failed after only ROOT had been assigned (for another reason)
 - restarted the master without restarting other servers. Since there was at 
 least one region assigned, it went through the failover code path
 - master scanned META and inserted every region into /hbase/unassigned in ZK.
 - then, it called listChildren on the /hbase/unassigned znode, and crashed 
 with "Packet len6080218 is out of range!" since the IPC response was larger 
 than the default maximum.





[jira] [Commented] (HBASE-4246) Cluster with too many regions cannot withstand some master failover scenarios

2012-06-19 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397214#comment-13397214
 ] 

gaojinchao commented on HBASE-4246:
---

Hi, it also happened in our cluster when we restarted the whole cluster (it 
has 129,723 regions).

2012-06-19 19:29:00,961 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:2-0x137ed2eb936fb85 Creating (or updating) unassigned node for 
80400ccd4a1f3438cc23774ca8a88d17 with OFFLINE state
2012-06-19 19:29:00,965 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=M_ZK_REGION_OFFLINE, server=172-16-6-2:2, 
region=80400ccd4a1f3438cc23774ca8a88d17
2012-06-19 19:29:00,966 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:2-0x137ed2eb936fb85 Creating (or updating) unassigned node for 
7f1a56641906ae0a6cc6919bd927df76 with OFFLINE state
2012-06-19 19:29:00,969 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=M_ZK_REGION_OFFLINE, server=172-16-6-2:2, 
region=7f1a56641906ae0a6cc6919bd927df76
2012-06-19 19:29:01,070 WARN org.apache.zookeeper.ClientCnxn: Session 
0x137ed2eb936fb85 for server 172-16-6-1/172.16.6.1:2181, unexpected error, 
closing socket connection and attempting reconnect
2012-06-19 19:29:01,070 WARN org.apache.zookeeper.ClientCnxn: Session 
0x137ed2eb936fb85 for server 172-16-6-1/172.16.6.1:2181, unexpected error, 
closing socket connection and attempting reconnect
java.io.IOException: Packet len4670048 is out of range!
at 
org.apache.zookeeper.ClientCnxn$SendThread.readLength(ClientCnxn.java:721)
at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:880)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145)
2012-06-19 19:29:01,174 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:2-0x137ed2eb936fb85 Unable to list children of znode 
/hbase/unassigned 

 Cluster with too many regions cannot withstand some master failover scenarios
 -

 Key: HBASE-4246
 URL: https://issues.apache.org/jira/browse/HBASE-4246
 Project: HBase
  Issue Type: Bug
  Components: master, zookeeper
Affects Versions: 0.90.4
Reporter: Todd Lipcon
Priority: Critical
 Fix For: 0.96.0


 We ran into the following sequence of events:
 - master startup failed after only ROOT had been assigned (for another reason)
 - restarted the master without restarting other servers. Since there was at 
 least one region assigned, it went through the failover code path
 - master scanned META and inserted every region into /hbase/unassigned in ZK.
 - then, it called listChildren on the /hbase/unassigned znode, and crashed 
 with "Packet len6080218 is out of range!" since the IPC response was larger 
 than the default maximum.





[jira] [Commented] (HBASE-6055) Snapshots in HBase 0.96

2012-06-06 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290027#comment-13290027
 ] 

gaojinchao commented on HBASE-6055:
---

Fine, thanks. I will set aside some time for this feature.

 Snapshots in HBase 0.96
 ---

 Key: HBASE-6055
 URL: https://issues.apache.org/jira/browse/HBASE-6055
 Project: HBase
  Issue Type: New Feature
  Components: client, master, regionserver, zookeeper
Reporter: Jesse Yates
Assignee: Jesse Yates
 Fix For: 0.96.0

 Attachments: Snapshots in HBase.docx


 Continuation of HBASE-50 for the current trunk. Since the implementation has 
 drastically changed, opening as a new ticket.





[jira] [Commented] (HBASE-6055) Snapshots in HBase 0.96

2012-06-05 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289900#comment-13289900
 ] 

gaojinchao commented on HBASE-6055:
---

Hi Jesse,
I am considering a solution that doesn't use the HLog. The idea is to handle 
only the memstore and flush it to HFiles asynchronously; when the region 
server goes down, we can finish producing the HFiles by replaying the edit 
log. Do you think this is feasible?
If we can do this, there are two relatively large benefits:
1. Restoring the snapshot is easier.
2. We can achieve incremental backups at the HFile level.
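For what it's worth, a rough sketch of the flow proposed above, in 
hypothetical Java (asyncFlushMemstoreToHFile, getStoreFilePaths and 
createReference are invented names for illustration; none of this is real 
HBase API):

// Hypothetical sketch only -- these methods do not exist in HBase; they
// just illustrate snapshotting from the memstore without copying the HLog.
Path snapshotRegion(HRegion region, Path snapshotDir) throws Exception {
  // 1. Kick off an asynchronous flush of the current memstore contents
  //    into a new HFile under the snapshot directory.
  Future<Path> flushing = region.asyncFlushMemstoreToHFile(snapshotDir);
  // 2. Reference the region's existing immutable HFiles; nothing is copied.
  for (Path hfile : region.getStoreFilePaths()) {
    createReference(snapshotDir, hfile);
  }
  // 3. If the region server dies before the flush completes, recovery
  //    replays the edit log and redoes the flush, so the snapshot can
  //    still be finished -- which is what makes restore and incremental
  //    backup simpler.
  return flushing.get();
}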

 Snapshots in HBase 0.96
 ---

 Key: HBASE-6055
 URL: https://issues.apache.org/jira/browse/HBASE-6055
 Project: HBase
  Issue Type: New Feature
  Components: client, master, regionserver, zookeeper
Reporter: Jesse Yates
Assignee: Jesse Yates
 Fix For: 0.96.0

 Attachments: Snapshots in HBase.docx


 Continuation of HBASE-50 for the current trunk. Since the implementation has 
 drastically changed, opening as a new ticket.





[jira] [Commented] (HBASE-6055) Snapshots in HBase 0.96

2012-05-28 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284366#comment-13284366
 ] 

gaojinchao commented on HBASE-6055:
---

Hi Jesse, are you working on this feature? I am interested in it and will 
study your code.
One question: when we are creating snapshots, do we need to stop the 
balancer?

 Snapshots in HBase 0.96
 ---

 Key: HBASE-6055
 URL: https://issues.apache.org/jira/browse/HBASE-6055
 Project: HBase
  Issue Type: New Feature
  Components: client, master, regionserver, zookeeper
Reporter: Jesse Yates
Assignee: Jesse Yates
 Fix For: 0.96.0

 Attachments: Snapshots in HBase.docx


 Continuation of HBASE-50 for the current trunk. Since the implementation has 
 drastically changed, opening as a new ticket.





[jira] [Commented] (HBASE-6055) Snapshots in HBase 0.96

2012-05-25 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283894#comment-13283894
 ] 

gaojinchao commented on HBASE-6055:
---

This is a very useful feature. :0


 Snapshots in HBase 0.96
 ---

 Key: HBASE-6055
 URL: https://issues.apache.org/jira/browse/HBASE-6055
 Project: HBase
  Issue Type: New Feature
  Components: client, master, regionserver, zookeeper
Reporter: Jesse Yates
Assignee: Jesse Yates
 Fix For: 0.96.0

 Attachments: Snapshots in HBase.docx


 Continuation of HBASE-50 for the current trunk. Since the implementation has 
 drastically changed, opening as a new ticket.





[jira] [Commented] (HBASE-5546) Master assigns region in the original region server when opening region failed

2012-05-10 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272971#comment-13272971
 ] 

gaojinchao commented on HBASE-5546:
---

+1, Good job!
 

 Master assigns region in the original region server when opening region 
 failed  
 

 Key: HBASE-5546
 URL: https://issues.apache.org/jira/browse/HBASE-5546
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.92.0
Reporter: gaojinchao
Assignee: Ashutosh Jindal
Priority: Minor
 Fix For: 0.96.0

 Attachments: hbase-5546.patch, hbase-5546_1.patch


 Master assigns the region to the original region server when an 
 RS_ZK_REGION_FAILED_OPEN event arrives.
 Maybe we should choose another region server.
 [2012-03-07 10:14:21,750] [DEBUG] [main-EventThread] 
 [org.apache.hadoop.hbase.master.AssignmentManager 553] Handling 
 transition=RS_ZK_REGION_FAILED_OPEN, server=158-1-130-11,20020,1331108408232, 
 region=c70e98bdca98a0657a56436741523053
 [... the identical RS_ZK_REGION_FAILED_OPEN transition repeats about every 
 10 seconds ...]
 [2012-03-07 10:16:22,676] [DEBUG] [main-EventThread] 
 [org.apache.hadoop.hbase.master.AssignmentManager 553] Handling 
 transition=RS_ZK_REGION_FAILED_OPEN, server=158-1-130-11,20020,1331108408232, 
 region=c70e98bdca98a0657a56436741523053





[jira] [Commented] (HBASE-4340) Hbase can't balance if ServerShutdownHandler encountered exception

2011-09-10 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102203#comment-13102203
 ] 

gaojinchao commented on HBASE-4340:
---

Thanks for your work, Ted.
I will put the patch up for review and then make a trunk patch. Running all 
the test cases takes two hours. :)

 Hbase can't balance if ServerShutdownHandler encountered exception
 --

 Key: HBASE-4340
 URL: https://issues.apache.org/jira/browse/HBASE-4340
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4340_branch90.patch


 Version: 0.90.4
 Cluster: 40 boxes
 As the logs below show, the balancer couldn't run because of a dead RS.
 I dug deeply and found two issues:
 1.   ServerShutdownHandler didn't clear numProcessing when handling some 
 exceptions. It seems that whatever the exception, we should either clear the 
 flag or shut down the master.
 2.   "dead regionserver(s): [158-1-130-12,20020,1314971097929]" is 
 inaccurate. The dead server should be 158-1-130-10,20020,1315068597979.
 //master logs:
 2011-09-05 00:28:00,487 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
 running balancer because processing dead regionserver(s): 
 [158-1-130-12,20020,1314971097929]
 [... the identical message repeats every 5 minutes; the mail is truncated 
 partway through the 2011-09-05 02:13:00,539 entry ...]

[jira] [Commented] (HBASE-4340) Hbase can't balance.

2011-09-09 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101180#comment-13101180
 ] 

gaojinchao commented on HBASE-4340:
---

Yes, all test cases have passed.

 Hbase can't balance.
 

 Key: HBASE-4340
 URL: https://issues.apache.org/jira/browse/HBASE-4340
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4340_branch90.patch


 Version: 0.90.4
 Cluster: 40 boxes
 As the logs below show, the balancer couldn't run because of a dead RS.
 I dug deeply and found two issues:
 1.   ServerShutdownHandler didn't clear numProcessing when handling some 
 exceptions. It seems that whatever the exception, we should either clear the 
 flag or shut down the master.
 2.   "dead regionserver(s): [158-1-130-12,20020,1314971097929]" is 
 inaccurate. The dead server should be 158-1-130-10,20020,1315068597979.
 //master logs:
 2011-09-05 00:28:00,487 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
 running balancer because processing dead regionserver(s): 
 [158-1-130-12,20020,1314971097929]
 [... the identical message repeats every 5 minutes; the mail is truncated 
 partway through the 2011-09-05 02:18:00,543 entry ...]

[jira] [Commented] (HBASE-4212) TestMasterFailover fails occasionally

2011-09-09 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101213#comment-13101213
 ] 

gaojinchao commented on HBASE-4212:
---

@Stack, thanks for your review. 
In our environment it often fails, so we skip this case (in our setup, all 
test cases are run automatically every day). 

The steps for opening the root region:
Step A: Master tells the region server to open the root region.
Step B: Region server opens the root region and sets the zk node 
(rootServerZNode). Once this is done, the catalog tracker can work.
Step C: Region server updates the zk node (assignmentZNode) to tell the 
master that root has opened (this step may fail, but we have already 
advertised that root can be used).
Step D: Master deletes the zk node (assignmentZNode) and adds the root 
region to the online set.

In my case, the master skipped step D because the transition was delayed: 
the master forced the root region online in processFailover, so the zk node 
was never deleted and the failover test case failed.

finishInitialization code:
// Make sure root and meta assigned before proceeding.
assignRootAndMeta();

// Is this fresh start with no regions assigned or are we a master joining
// an already-running cluster?  If regionsCount == 0, then for sure a
// fresh start.  TODO: Be fancier.  If regionsCount == 2, perhaps the
// 2 are .META. and -ROOT- and we should fall into the fresh startup
// branch below.  For now, do processFailover.
if (regionCount == 0) {
  LOG.info("Master startup proceeding: cluster startup");
  this.assignmentManager.cleanoutUnassigned();
  this.assignmentManager.assignAllUserRegions();
} else {
  LOG.info("Master startup proceeding: master failover");
  this.assignmentManager.processFailover();
}

processFailover code:
HServerInfo hsi =
  this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
regionOnline(HRegionInfo.FIRST_META_REGIONINFO, hsi);
hsi = 
this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
regionOnline(HRegionInfo.ROOT_REGIONINFO, hsi);


 TestMasterFailover fails occasionally
 -

 Key: HBASE-4212
 URL: https://issues.apache.org/jira/browse/HBASE-4212
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4212_TrunkV1.patch, HBASE-4212_branch90V1.patch


 It seems to be a bug: the root region in RIT can't be moved.
 In the failover process, the master enforces root online but does not clean 
 up the zk node, so the test will wait forever.
   void processFailover() throws KeeperException, IOException, 
   InterruptedException {
     ...
     // we enforce on-line root.
     HServerInfo hsi = this.serverManager.getHServerInfo(
         this.catalogTracker.getMetaLocation());
     regionOnline(HRegionInfo.FIRST_META_REGIONINFO, hsi);
     hsi = this.serverManager.getHServerInfo(
         this.catalogTracker.getRootLocation());
     regionOnline(HRegionInfo.ROOT_REGIONINFO, hsi);
 It seems that we should wait for the assignment to finish, as is done for 
 the meta region:
   int assignRootAndMeta()
       throws InterruptedException, IOException, KeeperException {
     int assigned = 0;
     long timeout =
         this.conf.getLong("hbase.catalog.verification.timeout", 1000);
     // Work on ROOT region.  Is it in zk in transition?
     boolean rit = this.assignmentManager.
         processRegionInTransitionAndBlockUntilAssigned(
             HRegionInfo.ROOT_REGIONINFO);
     if (!catalogTracker.verifyRootRegionLocation(timeout)) {
       this.assignmentManager.assignRoot();
       this.catalogTracker.waitForRoot();
       // we need to add this code to guarantee the transition has completed
       this.assignmentManager.waitForAssignment(HRegionInfo.ROOT_REGIONINFO);
       assigned++;
     }
 logs:
 2011-08-16 07:45:40,715 DEBUG 
 [RegionServer:0;C4S2.site,47710,1313495126115-EventThread] 
 zookeeper.ZooKeeperWatcher(252): regionserver:47710-0x131d2690f780004 
 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, 
 path=/hbase/unassigned/70236052
 2011-08-16 07:45:40,715 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] 
 zookeeper.ZKAssign(712): regionserver:47710-0x131d2690f780004 Successfully 
 transitioned node 70236052 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
 2011-08-16 07:45:40,715 DEBUG [Thread-760-EventThread] 
 zookeeper.ZooKeeperWatcher(252): master:60701-0x131d2690f780009 Received 
 ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, 
 path=/hbase/unassigned/70236052
 2011-08-16 07:45:40,716 INFO  [PostOpenDeployTasks:70236052] 
 catalog.RootLocationEditor(62): Setting ROOT region location in ZooKeeper as 
 C4S2.site:47710
 2011-08-16 07:45:40,716 DEBUG [Thread-760-EventThread] 
 zookeeper.ZKUtil(1109): master:60701-0x131d2690f780009 Retrieved 

[jira] [Updated] (HBASE-4340) Hbase can't balance.

2011-09-08 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4340:
--

Attachment: HBASE-4340_branch90.patch

 Hbase can't balance.
 

 Key: HBASE-4340
 URL: https://issues.apache.org/jira/browse/HBASE-4340
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4340_branch90.patch


 Version: 0.90.4
 Cluster: 40 boxes
 As the logs below show, the balancer couldn't run because of a dead RS.
 I dug deeply and found two issues:
 1.   ServerShutdownHandler didn't clear numProcessing when handling some 
 exceptions. It seems that whatever the exception, we should either clear the 
 flag or shut down the master.
 2.   "dead regionserver(s): [158-1-130-12,20020,1314971097929]" is 
 inaccurate. The dead server should be 158-1-130-10,20020,1315068597979.
 //master logs:
 2011-09-05 00:28:00,487 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
 running balancer because processing dead regionserver(s): 
 [158-1-130-12,20020,1314971097929]
 [... the identical message repeats every 5 minutes; the mail is truncated 
 partway through the 2011-09-05 02:18:00,543 entry ...]

[jira] [Commented] (HBASE-4340) Hbase can't balance.

2011-09-08 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100207#comment-13100207
 ] 

gaojinchao commented on HBASE-4340:
---

I have made a patch; please review.
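(Not the actual attached patch -- just a minimal sketch of the shape such a 
fix typically takes, with hypothetical names: release the numProcessing 
bookkeeping on every exit path of ServerShutdownHandler.)

// Hypothetical sketch; method and field names are illustrative and not
// copied from HBASE-4340_branch90.patch.
@Override
public void process() throws IOException {
  try {
    // Split logs and reassign the dead server's regions; several of
    // these steps can throw.
    handleDeadServer();
  } finally {
    // Clear the flag no matter what happened above; otherwise one
    // exception leaves numProcessing non-zero and the balancer never
    // runs again.
    serverManager.finishProcessingDeadServer(serverName);
  }
}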

 Hbase can't balance.
 

 Key: HBASE-4340
 URL: https://issues.apache.org/jira/browse/HBASE-4340
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4340_branch90.patch


 Version: 0.90.4
 Cluster: 40 boxes
 As the logs below show, the balancer couldn't run because of a dead RS.
 I dug deeply and found two issues:
 1.   ServerShutdownHandler didn't clear numProcessing when handling some 
 exceptions. It seems that whatever the exception, we should either clear the 
 flag or shut down the master.
 2.   "dead regionserver(s): [158-1-130-12,20020,1314971097929]" is 
 inaccurate. The dead server should be 158-1-130-10,20020,1315068597979.
 //master logs:
 2011-09-05 00:28:00,487 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
 running balancer because processing dead regionserver(s): 
 [158-1-130-12,20020,1314971097929]
 [... the identical message repeats every 5 minutes; the mail is truncated 
 partway through the 2011-09-05 02:18:00,543 entry ...]

[jira] [Updated] (HBASE-4340) Hbase can't balance.

2011-09-08 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4340:
--

Status: Patch Available  (was: Open)

 Hbase can't balance.
 

 Key: HBASE-4340
 URL: https://issues.apache.org/jira/browse/HBASE-4340
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4340_branch90.patch


 Version: 0.90.4
 Cluster: 40 boxes
 As the logs below show, the balancer couldn't run because of a dead RS.
 I dug deeply and found two issues:
 1.   ServerShutdownHandler didn't clear numProcessing when handling some 
 exceptions. It seems that whatever the exception, we should either clear the 
 flag or shut down the master.
 2.   "dead regionserver(s): [158-1-130-12,20020,1314971097929]" is 
 inaccurate. The dead server should be 158-1-130-10,20020,1315068597979.
 //master logs:
 2011-09-05 00:28:00,487 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
 running balancer because processing dead regionserver(s): 
 [158-1-130-12,20020,1314971097929]
 [... the identical message repeats every 5 minutes; the mail is truncated 
 partway through the 2011-09-05 02:18:00,543 entry ...]

[jira] [Commented] (HBASE-2158) Change how high/low global limit works; start taking on writes as soon as we dip below high limit rather than block until low limit as we currently do.

2011-09-07 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098645#comment-13098645
 ] 

gaojinchao commented on HBASE-2158:
---

When memory reaches the low limit, we start an emergency flusher, so I think 
it is difficult to reach the high limit; and if we do reach it, we will flush 
regions one by one.

if (fqe == null || fqe instanceof WakeupFlushThread) {
  if (isAboveLowWaterMark()) {
    LOG.info("Flush thread woke up with memory above low water.");
    if (!flushOneForGlobalPressure()) {
      // Wasn't able to flush any region, but we're above low water mark
      // This is unlikely to happen, but might happen when closing the
      // entire server - another thread is flushing regions. We'll just
      // sleep a little bit to avoid spinning, and then pretend that
      // we flushed one, so anyone blocked will check again
      lock.lock();
      try {
        Thread.sleep(1000);
        flushOccurred.signalAll();
      } finally {
        lock.unlock();
      }
    }
    // Enqueue another one of these tokens so we'll wake up again
    wakeupFlushThread();
  }
  continue;
}
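For context, a sketch of where the two watermarks come from in a 0.90-era 
setup (the configuration keys and defaults are the stock ones; the exact 
field names inside MemStoreFlusher may differ):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Global memstore watermarks as fractions of the region server heap.
Configuration conf = HBaseConfiguration.create();
long maxHeap = Runtime.getRuntime().maxMemory();
long highWater = (long) (maxHeap *
    conf.getFloat("hbase.regionserver.global.memstore.upperLimit", 0.4f));
long lowWater = (long) (maxHeap *
    conf.getFloat("hbase.regionserver.global.memstore.lowerLimit", 0.35f));
// Writes block once usage passes highWater; the flush thread keeps
// flushing until usage drops back under lowWater -- the behavior this
// issue proposes to relax.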

 Change how high/low global limit works; start taking on writes as soon as we 
 dip below high limit rather than block until low limit as we currently do.
 ---

 Key: HBASE-2158
 URL: https://issues.apache.org/jira/browse/HBASE-2158
 Project: HBase
  Issue Type: Improvement
Reporter: stack

 A Ryan Rawson suggestion.  See HBASE-2149 for more context.





[jira] [Created] (HBASE-4340) Hbase can't balance.

2011-09-06 Thread gaojinchao (JIRA)
Hbase can't balance.


 Key: HBASE-4340
 URL: https://issues.apache.org/jira/browse/HBASE-4340
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: gaojinchao
 Fix For: 0.90.5


Version: 0.90.4
Cluster: 40 boxes
As the logs below show, the balancer couldn't run because of a dead RS.
I dug deeply and found two issues:

1.   ServerShutdownHandler didn't clear numProcessing when handling some 
exceptions. It seems that whatever the exception, we should either clear the 
flag or shut down the master.

2.   "dead regionserver(s): [158-1-130-12,20020,1314971097929]" is 
inaccurate. The dead server should be 158-1-130-10,20020,1315068597979.

//master logs:
2011-09-05 00:28:00,487 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
running balancer because processing dead regionserver(s): 
[158-1-130-12,20020,1314971097929]
[... the identical message repeats every 5 minutes ...]
2011-09-05 02:18:00,543 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
running balancer because processing dead regionserver(s): 
[158-1-130-12,20020,1314971097929]

// the exception logs:
2011-09-03 18:13:27,550 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=RS_ZK_REGION_OPENING, 
server=158-1-133-11,20020,1315069437236, region=0db4088d75c58dd22f93f389d90ba6cc
2011-09-03 18:13:27,550 ERROR org.apache.hadoop.hbase.executor.EventHandler: 
Caught throwable while processing event 

[jira] [Assigned] (HBASE-4340) Hbase can't balance.

2011-09-06 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao reassigned HBASE-4340:
-

Assignee: gaojinchao

 Hbase can't balance.
 

 Key: HBASE-4340
 URL: https://issues.apache.org/jira/browse/HBASE-4340
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5


 Version: 0.90.4
 Cluster: 40 boxes
 As the logs below show, the balancer couldn't run because of a dead RS.
 I dug deeply and found two issues:
 1.   ServerShutdownHandler didn't clear numProcessing when handling some 
 exceptions. It seems that whatever the exception, we should either clear the 
 flag or shut down the master.
 2.   "dead regionserver(s): [158-1-130-12,20020,1314971097929]" is 
 inaccurate. The dead server should be 158-1-130-10,20020,1315068597979.
 //master logs:
 2011-09-05 00:28:00,487 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
 running balancer because processing dead regionserver(s): 
 [158-1-130-12,20020,1314971097929]
 [... the identical message repeats every 5 minutes ...]
 2011-09-05 02:18:00,543 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
 running balancer because processing dead regionserver(s): 
 [158-1-130-12,20020,1314971097929]
 // the exception logs 

[jira] [Assigned] (HBASE-3521) region be merged with others automatically when all data in the region has expired and removed, or region gets too small.

2011-09-06 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao reassigned HBASE-3521:
-

Assignee: gaojinchao

 region be merged with others automatically when all data in the region has 
 expired and removed, or region gets too small.
 -

 Key: HBASE-3521
 URL: https://issues.apache.org/jira/browse/HBASE-3521
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver, scripts
Affects Versions: 0.90.0
Reporter: zhoushuaifeng
Assignee: gaojinchao
Priority: Minor

 We have tested a cluster with more than 30,000 regions; the max size of a 
 region is 512MB. At this point the data is no longer growing, but old data 
 is removed and new data inserted, so the regions become more and more 
 numerous, and some of them may be very small or empty. This occupies too 
 much heap, and gets worse if regions cannot be merged; it limits how long 
 HBase can keep running. 
 A script that surveys the table to remove empty regions, or picks out 
 adjacent small regions and merges them online, seems like it would be 
 useful (a rough sketch follows). 
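
A rough survey sketch under these assumptions: region sizes are read from the per-region directories under the table directory in HDFS, the root dir is the default /hbase, and the 64MB threshold is arbitrary; 0.90 has no public online-merge API, so this only reports merge candidates:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists region directories of a table whose on-disk size is below a
// threshold, as candidates for merging. Illustrative, not a tested tool.
public class SmallRegionSurvey {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path tableDir = new Path("/hbase/test1");   // assumed root dir and table
    long threshold = 64L * 1024 * 1024;         // flag regions under 64MB
    for (FileStatus region : fs.listStatus(tableDir)) {
      if (!region.isDir()) {
        continue;                               // skip table-level files
      }
      long size = fs.getContentSummary(region.getPath()).getLength();
      if (size < threshold) {
        System.out.println(region.getPath().getName() + " " + size + " bytes");
      }
    }
  }
}
{code}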

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3521) region be merged with others automatically when all data in the region has expired and removed, or region gets too small.

2011-09-06 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13098560#comment-13098560
 ] 

gaojinchao commented on HBASE-3521:
---

Thanks a lot. This is what I wanted. :)

 region be merged with others automatically when all data in the region has 
 expired and removed, or region gets too small.
 -

 Key: HBASE-3521
 URL: https://issues.apache.org/jira/browse/HBASE-3521
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver, scripts
Affects Versions: 0.90.0
Reporter: zhoushuaifeng
Assignee: gaojinchao
Priority: Minor

 We have tested a cluster with more than 30,000 regions; the max size of a 
 region is 512MB. At this point the data is no longer growing, but old data 
 is removed and new data inserted, so the regions become more and more 
 numerous, and some of them may be very small or empty. This occupies too 
 much heap, and gets worse if regions cannot be merged; it limits how long 
 HBase can keep running. 
 A script that surveys the table to remove empty regions, or picks out 
 adjacent small regions and merges them online, seems like it would be 
 useful. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-2158) Change how high/low global limit works; start taking on writes as soon as we dip below high limit rather than block until low limit as we currently do.

2011-09-06 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13098586#comment-13098586
 ] 

gaojinchao commented on HBASE-2158:
---

I agree, this issue should be closed.

 Change how high/low global limit works; start taking on writes as soon as we 
 dip below high limit rather than block until low limit as we currently do.
 ---

 Key: HBASE-2158
 URL: https://issues.apache.org/jira/browse/HBASE-2158
 Project: HBase
  Issue Type: Improvement
Reporter: stack

 A Ryan Rawson suggestion.  See HBASE-2149 for more context.
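
The two policies, sketched side by side (the watermark values are made up; this is not the actual MemStoreFlusher code):

{code}
// Illustrative comparison of the two unblocking policies for global
// memstore pressure.
class WatermarkSketch {
  final long high = 900;  // stand-in for the upper global memstore limit
  final long low = 700;   // stand-in for the lower global memstore limit
  boolean blocked = false;

  // Current behavior: once blocked at the high limit, writes stay blocked
  // until usage falls all the way below the LOW limit.
  boolean blockWritesCurrent(long used) {
    if (used >= high) {
      blocked = true;
    } else if (used < low) {
      blocked = false;
    }
    return blocked;
  }

  // Suggested behavior: start taking writes again as soon as usage dips
  // below the HIGH limit.
  boolean blockWritesProposed(long used) {
    return used >= high;
  }
}
{code}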

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4132) Extend the WALActionsListener API to accomodate log archival

2011-08-29 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13093403#comment-13093403
 ] 

gaojinchao commented on HBASE-4132:
---

Could we add the following API?
  
 /**
   * The WAL needs to be archived. It is going to be moved from oldPath to
   * newPath.
   * 
   * @param oldPath
   *  the path to the old hlog
   * @param newPath
   *  the path to the new hlog
   * @return true if default behavior should be bypassed, false otherwise
   */
  boolean preArchiveLog(Path oldPath, Path newPath) throws IOException;

  /**
   * The WAL has been archived. It is moved from oldPath to newPath.
   * 
   * @param oldPath
   *  the path to the old hlog
   * @param newPath
   *  the path to the new hlog
   * @param archivalWasSuccessful
   *  true, if the archival was successful
   */
  void postArchiveLog(Path oldPath, Path newPath,
  boolean archivalWasSuccessful) throws IOException;
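
If the hooks were added, a listener might use them like this (a sketch against the proposed signatures only; the other WALActionsListener methods are omitted, and the class name is made up):

{code}
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.Path;

// Sketch of just the two proposed callbacks.
class ArchiveAuditListener {
  private static final Log LOG = LogFactory.getLog(ArchiveAuditListener.class);

  // Returning false keeps the default archival behavior.
  public boolean preArchiveLog(Path oldPath, Path newPath) throws IOException {
    LOG.info("about to archive " + oldPath + " to " + newPath);
    return false;
  }

  public void postArchiveLog(Path oldPath, Path newPath,
      boolean archivalWasSuccessful) throws IOException {
    LOG.info("archival of " + oldPath + " to " + newPath
        + (archivalWasSuccessful ? " succeeded" : " failed"));
  }
}
{code}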

 Extend the WALActionsListener API to accomodate log archival
 

 Key: HBASE-4132
 URL: https://issues.apache.org/jira/browse/HBASE-4132
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: dhruba borthakur
 Fix For: 0.92.0

 Attachments: walArchive.txt


 The WALObserver interface exposes the log roll events. It would be nice to 
 extend it to accommodate log archival events as well.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4124) ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'.

2011-08-28 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13092583#comment-13092583
 ] 

gaojinchao commented on HBASE-4124:
---

@Ted thanks for your work. 
sn has already been null-checked in the statement above:

if (sn == null) {
  LOG.warn("Region in transition " + regionInfo.getEncodedName() +
    " references a null server; letting RIT timeout so will be " +
    "assigned elsewhere");
  break;
}

 ZK restarted while a region is being assigned, new active HM re-assigns it 
 but the RS warns 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: 4124-trunk.v2, HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, 
 HBASE-4124_Branch90V4.patch, HBASE-4124_TrunkV1.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'.

2011-08-28 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: HBASE-4124_TrunkV2.patch

I am running all the test cases. My new modification is clearer.

 ZK restarted while a region is being assigned, new active HM re-assigns it 
 but the RS warns 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, 
 HBASE-4124_Branch90V4.patch, HBASE-4124_TrunkV1.patch, 
 HBASE-4124_TrunkV2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4134) The total number of regions was more than the actual region count after the hbck fix

2011-08-28 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4134:
--

Fix Version/s: (was: 0.92.0)
   0.94.0

 The total number of regions was more than the actual region count after the 
 hbck fix
 

 Key: HBASE-4134
 URL: https://issues.apache.org/jira/browse/HBASE-4134
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: feng xu
 Fix For: 0.94.0


 1. I found the problem (some regions were multiply assigned) while running 
 hbck to check the cluster's health. Here's the result:
 {noformat}
 ERROR: Region test1,230778,1311216270050.fff783529fcd983043610eaa1cc5c2fe. is 
 listed in META on region server 158-1-91-101:20020 but is multiply assigned 
 to region servers 158-1-91-101:20020, 158-1-91-105:20020 
 ERROR: Region test1,252103,1311216293671.fff9ed2cb69bdce535451a07686c0db5. is 
 listed in META on region server 158-1-91-101:20020 but is multiply assigned 
 to region servers 158-1-91-101:20020, 158-1-91-105:20020 
 ERROR: Region test1,282187,1311216322104.52782c0241a598b3e37ca8729da0. is 
 listed in META on region server 158-1-91-103:20020 but is multiply assigned 
 to region servers 158-1-91-103:20020, 158-1-91-105:20020 
 Summary: 
   -ROOT- is okay. 
 Number of regions: 1 
 Deployed on: 158-1-91-105:20020 
   .META. is okay. 
 Number of regions: 1 
 Deployed on: 158-1-91-103:20020 
   test1 is okay. 
 Number of regions: 25297 
 Deployed on: 158-1-91-101:20020 158-1-91-103:20020 158-1-91-105:20020 
 14829 inconsistencies detected. 
 Status: INCONSISTENT 
 {noformat}
 2. Then I tried to use hbck -fix to fix the problem. Everything seemed ok, 
 but I found that the total number of regions reported by the load balancer 
 (35029) was more than the actual region count (25299) after the fix.
 Here's the related logs snippet:
 {noformat}
 2011-07-22 02:19:02,866 INFO org.apache.hadoop.hbase.master.LoadBalancer: 
 Skipping load balancing.  servers=3 regions=25299 average=8433.0 
 mostloaded=8433 
 2011-07-22 03:06:11,832 INFO org.apache.hadoop.hbase.master.LoadBalancer: 
 Skipping load balancing.  servers=3 regions=35029 average=11676.333 
 mostloaded=11677 leastloaded=11676
 {noformat}
 3. I tracked one region's behavior during that time, taking the region 
 test1,282187,1311216322104.52782c0241a598b3e37ca8729da0. as an example:
 (1) It was assigned to 158-1-91-101 at first. 
 (2) HBCK sent a closing request to the RegionServer, and the RegionServer 
 closed it silently without notifying the HMaster.
 (3) The region was still carried by RS 158-1-91-103 as far as the HMaster 
 knew.
 (4) HBCK then triggered a new assignment.
 In fact the region was assigned again, but the old assignment information 
 still remained in AM#regions and AM#servers.
 That's why the reported region count was larger than the actual number.  
 {noformat}
 Line 178967: 2011-07-22 02:47:51,247 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase/unassigned/52782c0241a598b3e37ca8729da0 
 (region=test1,282187,1311216322104.52782c0241a598b3e37ca8729da0., 
 server=HBCKServerName, state=M_ZK_REGION_OFFLINE)
 Line 178968: 2011-07-22 02:47:51,247 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling HBCK triggered 
 transition=M_ZK_REGION_OFFLINE, server=HBCKServerName, 
 region=52782c0241a598b3e37ca8729da0
 Line 178969: 2011-07-22 02:47:51,248 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: HBCK repair is triggering 
 assignment of 
 region=test1,282187,1311216322104.52782c0241a598b3e37ca8729da0.
 Line 178970: 2011-07-22 02:47:51,248 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan 
 was found (or we are ignoring an existing plan) for 
 test1,282187,1311216322104.52782c0241a598b3e37ca8729da0. so generated a 
 random one; hri=test1,282187,1311216322104.52782c0241a598b3e37ca8729da0., 
 src=, dest=158-1-91-101,20020,1311231878544; 3 (online=3, exclude=null) 
 available servers
 Line 178971: 2011-07-22 02:47:51,248 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 test1,282187,1311216322104.52782c0241a598b3e37ca8729da0. to 
 158-1-91-101,20020,1311231878544
 Line 178983: 2011-07-22 02:47:51,285 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=158-1-91-101,20020,1311231878544, 
 region=52782c0241a598b3e37ca8729da0
 Line 179001: 2011-07-22 02:47:51,318 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, server=158-1-91-101,20020,1311231878544, 
 region=52782c0241a598b3e37ca8729da0
 Line 179002: 2011-07-22 02:47:51,319 DEBUG 
 

[jira] [Commented] (HBASE-4124) ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'.

2011-08-28 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13092618#comment-13092618
 ] 

gaojinchao commented on HBASE-4124:
---

All test cases passed. Thanks.


 ZK restarted while a region is being assigned, new active HM re-assigns it 
 but the RS warns 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, 
 HBASE-4124_Branch90V4.patch, HBASE-4124_TrunkV1.patch, 
 HBASE-4124_TrunkV2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'.

2011-08-26 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: HBASE-4124_TrunkV1.patch

I have made a patch. I found two test cases (TestAdmin and RollLoging) that 
don't pass; they also fail on the raw trunk.

 ZK restarted while a region is being assigned, new active HM re-assigns it 
 but the RS warns 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, 
 HBASE-4124_Branch90V4.patch, HBASE-4124_TrunkV1.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-25 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: HBASE-4124_Branch90V4.patch

Modified the comments per the review.
Thanks to Ted for the careful review.

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, 
 HBASE-4124_Branch90V4.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-3845) data loss because lastSeqWritten can miss memstore edits

2011-08-25 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-3845:
--

Attachment: HBASE-3845_branch90V2.patch

Modified the code per the review.

 data loss because lastSeqWritten can miss memstore edits
 

 Key: HBASE-3845
 URL: https://issues.apache.org/jira/browse/HBASE-3845
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: Prakash Khemani
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-3845-data-loss-because-lastSeqWritten-can-miss.patch, 
 HBASE-3845-fix-TestResettingCounters-test.txt, HBASE-3845_1.patch, 
 HBASE-3845_2.patch, HBASE-3845_4.patch, HBASE-3845_5.patch, 
 HBASE-3845_6.patch, HBASE-3845__trunk.patch, HBASE-3845_branch90V1.patch, 
 HBASE-3845_branch90V2.patch, HBASE-3845_trunk_2.patch, 
 HBASE-3845_trunk_3.patch


 (I don't have a test case to prove this yet but I have run it by Dhruba and 
 Kannan internally and wanted to put this up for some feedback.)
 In this discussion let us assume that the region has only one column family. 
 That way I can use region/memstore interchangeably.
 After a memstore flush it is possible for lastSeqWritten to have a 
 log-sequence-id for a region that is not the earliest log-sequence-id for 
 that region's memstore.
 HLog.append() does a putIfAbsent into lastSequenceWritten. This is to ensure 
 that we only keep track  of the earliest log-sequence-number that is present 
 in the memstore.
 Every time the memstore is flushed we remove the region's entry in 
 lastSequenceWritten and wait for the next append to populate this entry 
 again. This is where the problem happens.
 step 1:
 flusher.prepare() snapshots the memstore under 
 HRegion.updatesLock.writeLock().
 step 2 :
 as soon as the updatesLock.writeLock() is released new entries will be added 
 into the memstore.
 step 3 :
 wal.completeCacheFlush() is called. This method removes the region's entry 
 from lastSeqWritten.
 step 4:
 the next append will create a new entry for the region in lastSeqWritten(). 
 But this will be the log seq id of the current append. All the edits that 
 were added in step 2 are missing.
 ==
 as a temporary measure, instead of removing the region's entry in step 3 I 
 will replace it with the log-seq-id of the region-flush-event.
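
The temporary measure in that last step, sketched (assuming lastSeqWritten is a ConcurrentSkipListMap keyed by encoded region name, as in HLog; illustrative, not the committed patch):

{code}
import java.util.concurrent.ConcurrentSkipListMap;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative sketch of the proposed completeCacheFlush() change: instead
// of removing the region's entry (which lets the step-2 edits lose their
// earliest sequence id), replace it with the seq id of the flush event.
class LastSeqSketch {
  final ConcurrentSkipListMap<byte[], Long> lastSeqWritten =
      new ConcurrentSkipListMap<byte[], Long>(Bytes.BYTES_COMPARATOR);

  void completeCacheFlush(byte[] encodedRegionName, long flushSeqId) {
    // Old behavior: lastSeqWritten.remove(encodedRegionName);
    lastSeqWritten.put(encodedRegionName, flushSeqId);
  }
}
{code}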

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Work started] (HBASE-3933) Hmaster throw NullPointerException

2011-08-25 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-3933 started by gaojinchao.

 Hmaster throw NullPointerException
 --

 Key: HBASE-3933
 URL: https://issues.apache.org/jira/browse/HBASE-3933
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Attachments: Hmastersetup0.90


 NullPointerException while the hmaster is starting.
 {code}
   java.lang.NullPointerException
 at java.util.TreeMap.getEntry(TreeMap.java:324)
 at java.util.TreeMap.get(TreeMap.java:255)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.addToServers(AssignmentManager.java:1512)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:606)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
 at 
 org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:402)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
 {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3933) Hmaster throw NullPointerException

2011-08-25 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090853#comment-13090853
 ] 

gaojinchao commented on HBASE-3933:
---

I studied the trunk; it has been fixed there, so we can close this issue.

Trunk code:
// Wait for region servers to report in.
this.serverManager.waitForRegionServers(status);
// Check zk for regionservers that are up but didn't register
for (ServerName sn: this.regionServerTracker.getOnlineServers()) {
  if (!this.serverManager.isServerOnline(sn)) {
// Not registered; add it.
LOG.info("Registering server found up in zk: " + sn);
this.serverManager.recordNewServer(sn, HServerLoad.EMPTY_HSERVERLOAD);
  }
}

 Hmaster throw NullPointerException
 --

 Key: HBASE-3933
 URL: https://issues.apache.org/jira/browse/HBASE-3933
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Attachments: Hmastersetup0.90


 NullPointerException while the hmaster is starting.
 {code}
   java.lang.NullPointerException
 at java.util.TreeMap.getEntry(TreeMap.java:324)
 at java.util.TreeMap.get(TreeMap.java:255)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.addToServers(AssignmentManager.java:1512)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:606)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
 at 
 org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:402)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
 {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3845) data loss because lastSeqWritten can miss memstore edits

2011-08-25 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090882#comment-13090882
 ] 

gaojinchao commented on HBASE-3845:
---

@Stack
Please review the patch and give some suggestions. :)

 data loss because lastSeqWritten can miss memstore edits
 

 Key: HBASE-3845
 URL: https://issues.apache.org/jira/browse/HBASE-3845
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: Prakash Khemani
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-3845-data-loss-because-lastSeqWritten-can-miss.patch, 
 HBASE-3845-fix-TestResettingCounters-test.txt, HBASE-3845_1.patch, 
 HBASE-3845_2.patch, HBASE-3845_4.patch, HBASE-3845_5.patch, 
 HBASE-3845_6.patch, HBASE-3845__trunk.patch, HBASE-3845_branch90V1.patch, 
 HBASE-3845_branch90V2.patch, HBASE-3845_trunk_2.patch, 
 HBASE-3845_trunk_3.patch


 (I don't have a test case to prove this yet but I have run it by Dhruba and 
 Kannan internally and wanted to put this up for some feedback.)
 In this discussion let us assume that the region has only one column family. 
 That way I can use region/memstore interchangeably.
 After a memstore flush it is possible for lastSeqWritten to have a 
 log-sequence-id for a region that is not the earliest log-sequence-id for 
 that region's memstore.
 HLog.append() does a putIfAbsent into lastSequenceWritten. This is to ensure 
 that we only keep track  of the earliest log-sequence-number that is present 
 in the memstore.
 Every time the memstore is flushed we remove the region's entry in 
 lastSequenceWritten and wait for the next append to populate this entry 
 again. This is where the problem happens.
 step 1:
 flusher.prepare() snapshots the memstore under 
 HRegion.updatesLock.writeLock().
 step 2 :
 as soon as the updatesLock.writeLock() is released new entries will be added 
 into the memstore.
 step 3 :
 wal.completeCacheFlush() is called. This method removes the region's entry 
 from lastSeqWritten.
 step 4:
 the next append will create a new entry for the region in lastSeqWritten(). 
 But this will be the log seq id of the current append. All the edits that 
 were added in step 2 are missing.
 ==
 as a temporary measure, instead of removing the region's entry in step 3 I 
 will replace it with the log-seq-id of the region-flush-event.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4124) ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'.

2011-08-25 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13091474#comment-13091474
 ] 

gaojinchao commented on HBASE-4124:
---

@Ted
I am making a patch for trunk, but I have some questions about the trunk 
code; it seems to be a bug.
In the assign function, when we get the return value ALREADY_OPENED, should 
we update the meta table, or should the region server do it?

hmaster code:
  RegionOpeningState regionOpenState = serverManager.sendRegionOpen(
      plan.getDestination(), state.getRegion());
if (regionOpenState == RegionOpeningState.ALREADY_OPENED) {

region server code (if we don't update meta, the client may keep going to 
the old server; a toy sketch of the suggested handling follows the snippet):

HRegion onlineRegion = this.getFromOnlineRegions(region.getEncodedName());
if (null != onlineRegion) {
  LOG.warn("Attempted open of " + region.getEncodedName()
      + " but already online on this server");
  return RegionOpeningState.ALREADY_OPENED;
}
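
A self-contained toy of the suggested master-side handling; every type here (RegionOpeningState, Catalog) is a stand-in, not the real HBase API:

{code}
// Toy illustration: when the region server answers ALREADY_OPENED, the
// master could refresh the catalog entry itself instead of leaving the
// stale location in .META. for clients to follow.
class AlreadyOpenedSketch {
  enum RegionOpeningState { OPENED, ALREADY_OPENED, FAILED_OPENING }

  interface Catalog {
    void updateLocation(String regionName, String server);
  }

  void onOpenResponse(RegionOpeningState state, String regionName,
      String destination, Catalog meta) {
    if (state == RegionOpeningState.ALREADY_OPENED) {
      // Without this, clients keep following the old location in .META.
      meta.updateLocation(regionName, destination);
    }
  }
}
{code}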

 ZK restarted while a region is being assigned, new active HM re-assigns it 
 but the RS warns 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, 
 HBASE-4124_Branch90V4.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-24 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: HBASE-4124_Branch90V3.patch

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-24 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090141#comment-13090141
 ] 

gaojinchao commented on HBASE-4124:
---

@Ted
Does trunk need a patch too? 
The trunk code has changed a lot; I need some time to study it.

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-24 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090677#comment-13090677
 ] 

gaojinchao commented on HBASE-4124:
---

@Ted 
I have run all the tests. Thanks for your work.

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-24 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090698#comment-13090698
 ] 

gaojinchao commented on HBASE-4124:
---

@ram
How come we have a dead RS if we don't kill the RS?

gao: If you stop the cluster, the meta still holds the old server information.

If the master is also killed, how can the regions be assigned to some other RS?

gao: When the master starts up, it collects the regions belonging to the same 
region server and calls sendRegionOpen(destination, regions). If the number 
of regions is relatively large, the region server needs a long time to open 
them; if the master crashes during that window, the new master may reopen 
the regions on another region server.

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-3845) data loss because lastSeqWritten can miss memstore edits

2011-08-24 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-3845:
--

Attachment: HBASE-3845_branch90V1.patch

 data loss because lastSeqWritten can miss memstore edits
 

 Key: HBASE-3845
 URL: https://issues.apache.org/jira/browse/HBASE-3845
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: Prakash Khemani
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-3845-data-loss-because-lastSeqWritten-can-miss.patch, 
 HBASE-3845-fix-TestResettingCounters-test.txt, HBASE-3845_1.patch, 
 HBASE-3845_2.patch, HBASE-3845_4.patch, HBASE-3845_5.patch, 
 HBASE-3845_6.patch, HBASE-3845__trunk.patch, HBASE-3845_branch90V1.patch, 
 HBASE-3845_trunk_2.patch, HBASE-3845_trunk_3.patch


 (I don't have a test case to prove this yet but I have run it by Dhruba and 
 Kannan internally and wanted to put this up for some feedback.)
 In this discussion let us assume that the region has only one column family. 
 That way I can use region/memstore interchangeably.
 After a memstore flush it is possible for lastSeqWritten to have a 
 log-sequence-id for a region that is not the earliest log-sequence-id for 
 that region's memstore.
 HLog.append() does a putIfAbsent into lastSequenceWritten. This is to ensure 
 that we only keep track  of the earliest log-sequence-number that is present 
 in the memstore.
 Every time the memstore is flushed we remove the region's entry in 
 lastSequenceWritten and wait for the next append to populate this entry 
 again. This is where the problem happens.
 step 1:
 flusher.prepare() snapshots the memstore under 
 HRegion.updatesLock.writeLock().
 step 2 :
 as soon as the updatesLock.writeLock() is released new entries will be added 
 into the memstore.
 step 3 :
 wal.completeCacheFlush() is called. This method removes the region's entry 
 from lastSeqWritten.
 step 4:
 the next append will create a new entry for the region in lastSeqWritten(). 
 But this will be the log seq id of the current append. All the edits that 
 were added in step 2 are missing.
 ==
 as a temporary measure, instead of removing the region's entry in step 3 I 
 will replace it with the log-seq-id of the region-flush-event.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3845) data loss because lastSeqWritten can miss memstore edits

2011-08-24 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090740#comment-13090740
 ] 

gaojinchao commented on HBASE-3845:
---

@RAM
I have run all the unit tests; please help review the patch first. Thanks.

I will construct the scenario today to verify it.

 data loss because lastSeqWritten can miss memstore edits
 

 Key: HBASE-3845
 URL: https://issues.apache.org/jira/browse/HBASE-3845
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: Prakash Khemani
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-3845-data-loss-because-lastSeqWritten-can-miss.patch, 
 HBASE-3845-fix-TestResettingCounters-test.txt, HBASE-3845_1.patch, 
 HBASE-3845_2.patch, HBASE-3845_4.patch, HBASE-3845_5.patch, 
 HBASE-3845_6.patch, HBASE-3845__trunk.patch, HBASE-3845_branch90V1.patch, 
 HBASE-3845_trunk_2.patch, HBASE-3845_trunk_3.patch


 (I don't have a test case to prove this yet but I have run it by Dhruba and 
 Kannan internally and wanted to put this up for some feedback.)
 In this discussion let us assume that the region has only one column family. 
 That way I can use region/memstore interchangeably.
 After a memstore flush it is possible for lastSeqWritten to have a 
 log-sequence-id for a region that is not the earliest log-sequence-id for 
 that region's memstore.
 HLog.append() does a putIfAbsent into lastSequenceWritten. This is to ensure 
 that we only keep track  of the earliest log-sequence-number that is present 
 in the memstore.
 Every time the memstore is flushed we remove the region's entry in 
 lastSequenceWritten and wait for the next append to populate this entry 
 again. This is where the problem happens.
 step 1:
 flusher.prepare() snapshots the memstore under 
 HRegion.updatesLock.writeLock().
 step 2 :
 as soon as the updatesLock.writeLock() is released new entries will be added 
 into the memstore.
 step 3 :
 wal.completeCacheFlush() is called. This method removes the region's entry 
 from lastSeqWritten.
 step 4:
 the next append will create a new entry for the region in lastSeqWritten(). 
 But this will be the log seq id of the current append. All the edits that 
 were added in step 2 are missing.
 ==
 as a temporary measure, instead of removing the region's entry in step 3 I 
 will replace it with the log-seq-id of the region-flush-event.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Assigned] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-22 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao reassigned HBASE-4124:
-

Assignee: gaojinchao

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-22 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Fix Version/s: 0.90.5

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-20 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: HBASE-4124_Branch90V2.patch

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-20 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13088146#comment-13088146
 ] 

gaojinchao commented on HBASE-4124:
---

I have finished the test. Here is the scenario:
step 1: start up the cluster 
step 2: abort the master right after it finishes calling sendRegionOpen(destination, regions)
step 3: start up the cluster again.

The above steps reproduce the issue: during master failover, the meta 
records the dead server, but the region is actually being opened on a 
living region server.


 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-20 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13088147#comment-13088147
 ] 

gaojinchao commented on HBASE-4124:
---

Sorry, step 3 should be: start up the master again.

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-20 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: HBASE-4124_Branch90V2.patch

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-20 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: (was: HBASE-4124_Branch90V2.patch)

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-20 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: HBASE-4124_Branch90V2.patch

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-20 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: (was: HBASE-4124_Branch90V2.patch)

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-20 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13088173#comment-13088173
 ] 

gaojinchao commented on HBASE-4124:
---

I have added a test case for opening a region.

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-4212) TestMasterFailover fails occasionally

2011-08-17 Thread gaojinchao (JIRA)
TestMasterFailover fails occasionally
-

 Key: HBASE-4212
 URL: https://issues.apache.org/jira/browse/HBASE-4212
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.4
Reporter: gaojinchao
 Fix For: 0.90.5


It seems to be a bug: the root region in RIT can't be moved.
In the failover process the master enforces root online but does not clean 
the zk node, so the test waits forever.

  void processFailover() throws KeeperException, IOException, 
      InterruptedException {

    // we enforce on-line root.
    HServerInfo hsi =
      this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
    regionOnline(HRegionInfo.FIRST_META_REGIONINFO, hsi);
    hsi =
      this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
    regionOnline(HRegionInfo.ROOT_REGIONINFO, hsi);

It seems we should wait for the assignment to finish, as is done for the meta region:

  int assignRootAndMeta()
      throws InterruptedException, IOException, KeeperException {
    int assigned = 0;
    long timeout =
      this.conf.getLong("hbase.catalog.verification.timeout", 1000);

    // Work on ROOT region.  Is it in zk in transition?
    boolean rit = this.assignmentManager.
      processRegionInTransitionAndBlockUntilAssigned(HRegionInfo.ROOT_REGIONINFO);
    if (!catalogTracker.verifyRootRegionLocation(timeout)) {
      this.assignmentManager.assignRoot();
      this.catalogTracker.waitForRoot();

      // we need to add this line and guarantee that the transition has completed
      this.assignmentManager.waitForAssignment(HRegionInfo.ROOT_REGIONINFO);
      assigned++;
    }

logs:
2011-08-16 07:45:40,715 DEBUG 
[RegionServer:0;C4S2.site,47710,1313495126115-EventThread] 
zookeeper.ZooKeeperWatcher(252): regionserver:47710-0x131d2690f780004 Received 
ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, 
path=/hbase/unassigned/70236052
2011-08-16 07:45:40,715 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] 
zookeeper.ZKAssign(712): regionserver:47710-0x131d2690f780004 Successfully 
transitioned node 70236052 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
2011-08-16 07:45:40,715 DEBUG [Thread-760-EventThread] 
zookeeper.ZooKeeperWatcher(252): master:60701-0x131d2690f780009 Received 
ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, 
path=/hbase/unassigned/70236052
2011-08-16 07:45:40,716 INFO  [PostOpenDeployTasks:70236052] 
catalog.RootLocationEditor(62): Setting ROOT region location in ZooKeeper as 
C4S2.site:47710
2011-08-16 07:45:40,716 DEBUG [Thread-760-EventThread] zookeeper.ZKUtil(1109): 
master:60701-0x131d2690f780009 Retrieved 52 byte(s) of data from znode 
/hbase/unassigned/70236052 and set watcher; region=-ROOT-,,0, 
server=C4S2.site,47710,1313495126115, state=RS_ZK_REGION_OPENING
2011-08-16 07:45:40,717 DEBUG [Thread-760-EventThread] 
master.AssignmentManager(477): Handling transition=RS_ZK_REGION_OPENING, 
server=C4S2.site,47710,1313495126115, region=70236052/-ROOT-
2011-08-16 07:45:40,725 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] 
zookeeper.ZKAssign(661): regionserver:47710-0x131d2690f780004 Attempting to 
transition node 70236052/-ROOT- from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
2011-08-16 07:45:40,727 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] 
zookeeper.ZKUtil(1109): regionserver:47710-0x131d2690f780004 Retrieved 52 
byte(s) of data from znode /hbase/unassigned/70236052; data=region=-ROOT-,,0, 
server=C4S2.site,47710,1313495126115, state=RS_ZK_REGION_OPENING
2011-08-16 07:45:40,740 DEBUG 
[RegionServer:0;C4S2.site,47710,1313495126115-EventThread] 
zookeeper.ZooKeeperWatcher(252): regionserver:47710-0x131d2690f780004 Received 
ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, 
path=/hbase/unassigned/70236052
2011-08-16 07:45:40,740 DEBUG [Thread-760-EventThread] 
zookeeper.ZooKeeperWatcher(252): master:60701-0x131d2690f780009 Received 
ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, 
path=/hbase/unassigned/70236052
2011-08-16 07:45:40,740 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] 
zookeeper.ZKAssign(712): regionserver:47710-0x131d2690f780004 Successfully 
transitioned node 70236052 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
2011-08-16 07:45:40,741 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] 
handler.OpenRegionHandler(121): Opened -ROOT-,,0.70236052
2011-08-16 07:45:40,741 DEBUG [Thread-760-EventThread] zookeeper.ZKUtil(1109): 
master:60701-0x131d2690f780009 Retrieved 52 byte(s) of data from znode 
/hbase/unassigned/70236052 and set watcher; region=-ROOT-,,0, 
server=C4S2.site,47710,1313495126115, state=RS_ZK_REGION_OPENED
2011-08-16 07:45:40,741 DEBUG [Thread-760-EventThread] 
master.AssignmentManager(477): Handling transition=RS_ZK_REGION_OPENED, 
server=C4S2.site,47710,1313495126115, region=70236052/-ROOT-

// ... It said that the zk node can't be cleaned because we have 

[jira] [Updated] (HBASE-4212) TestMasterFailover fails occasionally

2011-08-17 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4212:
--

Attachment: HBASE-4212_branch90V1.patch

 TestMasterFailover fails occasionally
 -

 Key: HBASE-4212
 URL: https://issues.apache.org/jira/browse/HBASE-4212
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.4
Reporter: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4212_branch90V1.patch


 It seems to be a bug: the ROOT region in RIT can't be moved. In the failover 
 process the master forces ROOT on-line but does not clean the zk node, so the 
 test will wait forever.

[jira] [Commented] (HBASE-4212) TestMasterFailover fails occasionally

2011-08-17 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086199#comment-13086199
 ] 

gaojinchao commented on HBASE-4212:
---

I have made a patch. Please review it. Thanks.

 TestMasterFailover fails occasionally
 -

 Key: HBASE-4212
 URL: https://issues.apache.org/jira/browse/HBASE-4212
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.4
Reporter: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4212_branch90V1.patch



[jira] [Commented] (HBASE-4212) TestMasterFailover fails occasionally

2011-08-17 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086202#comment-13086202
 ] 

gaojinchao commented on HBASE-4212:
---

I tested 10 times, and the logs show that META is assigned after ROOT has finished.

2011-08-17 05:06:51,419 DEBUG [MASTER_OPEN_REGION-C4S2.site:47578-0] 
zookeeper.ZKUtil(1109): master:47578-0x131d6fe02e50009 Retrieved 52 byte(s) of 
data from znode /hbase/unassigned/70236052; data=region=-ROOT-,,0, 
server=C4S2.site,60960,1313571996605, state=RS_ZK_REGION_OPENED
2011-08-17 05:06:51,425 DEBUG [Thread-755-EventThread] 
zookeeper.ZooKeeperWatcher(252): master:47578-0x131d6fe02e50009 Received 
ZooKeeper Event, type=NodeDeleted, state=SyncConnected, 
path=/hbase/unassigned/70236052
2011-08-17 05:06:51,425 DEBUG [MASTER_OPEN_REGION-C4S2.site:47578-0] 
zookeeper.ZKAssign(420): master:47578-0x131d6fe02e50009 Successfully deleted 
unassigned node for region 70236052 in expected state RS_ZK_REGION_OPENED
2011-08-17 05:06:51,426 INFO  [Master:0;C4S2.site:47578] master.HMaster(437): 
-ROOT- assigned=1, rit=false, location=C4S2.site:60960
2011-08-17 05:06:51,426 DEBUG [MASTER_OPEN_REGION-C4S2.site:47578-0] 
handler.OpenedRegionHandler(108): Opened region -ROOT-,,0.70236052 on 
C4S2.site,60960,1313571996605
2011-08-17 05:06:51,427 DEBUG [Master:0;C4S2.site:47578] zookeeper.ZKUtil(553): 
master:47578-0x131d6fe02e50009 Unable to get data of znode 
/hbase/unassigned/1028785192 because node does not exist (not an error)
2011-08-17 05:06:51,429 INFO  [Master:0;C4S2.site:47578] 
catalog.CatalogTracker(421): Passed metaserver is null

 TestMasterFailover fails occasionally
 -

 Key: HBASE-4212
 URL: https://issues.apache.org/jira/browse/HBASE-4212
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.4
Reporter: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4212_branch90V1.patch



[jira] [Updated] (HBASE-4212) TestMasterFailover fails occasionally

2011-08-17 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4212:
--

Assignee: gaojinchao
  Status: Patch Available  (was: Open)

 TestMasterFailover fails occasionally
 -

 Key: HBASE-4212
 URL: https://issues.apache.org/jira/browse/HBASE-4212
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4212_branch90V1.patch



[jira] [Updated] (HBASE-4212) TestMasterFailover fails occasionally

2011-08-17 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4212:
--

Attachment: HBASE-4212_TrunkV1.patch

 TestMasterFailover fails occasionally
 -

 Key: HBASE-4212
 URL: https://issues.apache.org/jira/browse/HBASE-4212
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4212_TrunkV1.patch, HBASE-4212_branch90V1.patch



[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-17 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: HBASE-4124_Branch90V1_trial.patch

I have made a patch that tries to fix this issue, but so far I have only run 
the UT tests. Please review it first and give me some suggestions; I will 
test it tomorrow. Thanks.

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region; the new active HM re-assigned it, but 
 the RS warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3845) data loss because lastSeqWritten can miss memstore edits

2011-08-17 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086728#comment-13086728
 ] 

gaojinchao commented on HBASE-3845:
---

Hi, has the patch been applied to the branch yet?

 data loss because lastSeqWritten can miss memstore edits
 

 Key: HBASE-3845
 URL: https://issues.apache.org/jira/browse/HBASE-3845
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: Prakash Khemani
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Fix For: 0.90.5

 Attachments: 
 0001-HBASE-3845-data-loss-because-lastSeqWritten-can-miss.patch, 
 HBASE-3845-fix-TestResettingCounters-test.txt, HBASE-3845_1.patch, 
 HBASE-3845_2.patch, HBASE-3845_4.patch, HBASE-3845_5.patch, 
 HBASE-3845_6.patch, HBASE-3845__trunk.patch, HBASE-3845_trunk_2.patch, 
 HBASE-3845_trunk_3.patch


 (I don't have a test case to prove this yet but I have run it by Dhruba and 
 Kannan internally and wanted to put this up for some feedback.)
 In this discussion let us assume that the region has only one column family. 
 That way I can use region/memstore interchangeably.
 After a memstore flush it is possible for lastSeqWritten to have a 
 log-sequence-id for a region that is not the earliest log-sequence-id for 
 that region's memstore.
 HLog.append() does a putIfAbsent into lastSequenceWritten. This is to ensure 
 that we only keep track  of the earliest log-sequence-number that is present 
 in the memstore.
 Every time the memstore is flushed we remove the region's entry in 
 lastSequenceWritten and wait for the next append to populate this entry 
 again. This is where the problem happens.
 step 1:
 flusher.prepare() snapshots the memstore under 
 HRegion.updatesLock.writeLock().
 step 2 :
 as soon as the updatesLock.writeLock() is released new entries will be added 
 into the memstore.
 step 3 :
 wal.completeCacheFlush() is called. This method removes the region's entry 
 from lastSeqWritten.
 step 4:
 the next append will create a new entry for the region in lastSeqWritten(). 
 But this will be the log seq id of the current append. All the edits that 
 were added in step 2 are missing.
 ==
 as a temporary measure, instead of removing the region's entry in step 3 I 
 will replace it with the log-seq-id of the region-flush-event.
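 A minimal sketch of this temporary measure (assuming, as in HLog, that 
 lastSeqWritten maps region name to the earliest log seq id still in the 
 memstore; the method shape is illustrative, not the committed patch):
 {code}
 void completeCacheFlush(byte[] regionName, long flushSeqId) {
   // Before: this.lastSeqWritten.remove(regionName) -- this loses track of
   // edits appended between the snapshot (step 2) and this call (step 3),
   // because the next append() repopulates the entry with its own, too-new
   // seq id (step 4).
   // After: pin the region to the flush event's seq id instead, so edits
   // added during the flush can never look older than what was persisted.
   this.lastSeqWritten.put(regionName, Long.valueOf(flushSeqId));
 }
 {code}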

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3933) Hmaster throw NullPointerException

2011-08-14 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13084957#comment-13084957
 ] 

gaojinchao commented on HBASE-3933:
---

Hi all, I have a new idea for this issue: why don't we get the region server 
list from ZK during failover? 
That way we can avoid the case where an HLog has been split while its region 
server is still serving.
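
A minimal sketch of that idea using the plain ZooKeeper client API (the 
/hbase/rs path and the wiring are assumptions, not a patch):
{code}
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Each live region server registers an ephemeral znode under /hbase/rs, so
// reading the children during failover yields the servers that are actually
// alive, instead of a possibly stale ServerManager view.
static List<String> liveRegionServers(ZooKeeper zk)
    throws KeeperException, InterruptedException {
  return zk.getChildren("/hbase/rs", false); // no watch needed for one pass
}
{code}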


 Hmaster throw NullPointerException
 --

 Key: HBASE-3933
 URL: https://issues.apache.org/jira/browse/HBASE-3933
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Attachments: Hmastersetup0.90


 NullPointerException while hmaster starting.
 {code}
   java.lang.NullPointerException
 at java.util.TreeMap.getEntry(TreeMap.java:324)
 at java.util.TreeMap.get(TreeMap.java:255)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.addToServers(AssignmentManager.java:1512)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:606)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
 at 
 org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:402)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
 {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-08-10 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13082212#comment-13082212
 ] 

gaojinchao commented on HBASE-4064:
---

I will study the trunk code and confirm whether this has been fixed there.

 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 disableflow.png


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition (the RegionState was left behind by some exception and 
 should have been removed, which is why I call it a rubbish object), but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls for the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully; the master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state in the master. E.g.,
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 When thread B evaluates if (!regions.containsKey(region)), this.regions 
 still holds the region info; the CPU then switches to thread A.
 Thread A removes the region from both this.regions and regionsInTransition, 
 then execution switches back to thread B. Thread B continues and throws an 
 exception with the msg of "Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
 RegionState from regionsInTransition, so it can never be removed.
  public void unassign(HRegionInfo region, boolean force) {
 LOG.debug("Starting unassignment of region " +
   region.getRegionNameAsString() + 
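 A minimal sketch of a guard against this race (an illustration under the 
 assumptions above, not the attached patch): do the membership check and the 
 RIT cleanup while holding the same locks ClosedRegionHandler uses, so 
 thread B cannot strand a PENDING_CLOSE entry:
 {code}
 public void unassign(HRegionInfo region, boolean force) {
   LOG.debug("Starting unassignment of region " +
       region.getRegionNameAsString() + " (offlining)");
   synchronized (this.regions) {
     // Re-check membership under the lock ClosedRegionHandler holds when
     // it removes the region, so the check cannot go stale mid-flight.
     if (!this.regions.containsKey(region)) {
       LOG.debug("Region " + region.getRegionNameAsString() +
           " is not currently assigned anywhere");
       synchronized (this.regionsInTransition) {
         // Drop any leftover RegionState so TimeoutMonitor stops looping.
         this.regionsInTransition.remove(region.getEncodedName());
       }
       return;
     }
     // ... continue with the normal unassign path ...
   }
 }
 {code}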

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-08-01 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13076000#comment-13076000
 ] 

gaojinchao commented on HBASE-4064:
---

Do we need to fix this issue? If so, I will test it; otherwise I will close 
it.

 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 disableflow.png



[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-25 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070331#comment-13070331
 ] 

gaojinchao commented on HBASE-4064:
---

The master may crash because the pool shutdown is asynchronous. 

The master show :
2011-07-22 13:33:27,806 INFO 
org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 
regions of which 2156 are online.

2011-07-22 13:34:28,646 INFO 
org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 
regions of which 982 are online.
2011-07-22 13:34:31,079 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,080 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:6-0x31502ef4f0 Creating (or updating) unassigned node for 
c9b1c97ac6c00033ceb1890e45e66229 with OFFLINE state
2011-07-22 13:34:31,104 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Forcing OFFLINE; 
was=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. 
state=OFFLINE, ts=1311312871080
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
No previous transition plan was found (or we are ignoring an existing plan) for 
ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. so generated a 
random one; 
hri=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229., src=, 
dest=C4C2.site,60020,1311310281335; 3 (online=3, exclude=null) available servers
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Assigning region 
ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. to 
C4C2.site,60020,1311310281335
2011-07-22 13:34:31,122 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,123 FATAL org.apache.hadoop.hbase.master.HMaster: 
Unexpected state trying to OFFLINE; 
ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. 
state=PENDING_OPEN, ts=1311312871121
java.lang.IllegalStateException
at 
org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1081)
at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1036)
at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:864)
at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:844)
at java.lang.Thread.run(Thread.java:662)
2011-07-22 13:34:31,125 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
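
A minimal sketch of the synchronous alternative (assuming the bulk-assign 
pool is a java.util.concurrent ExecutorService; the timeout is illustrative):
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Block until in-flight bulk assigns have finished before the next pass
// forces regions OFFLINE, so no late assign can race the OFFLINE transition
// seen in the FATAL log above.
static void shutdownPoolAndWait(ExecutorService pool)
    throws InterruptedException {
  pool.shutdown();                      // stop accepting new assign tasks
  if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
    pool.shutdownNow();                 // interrupt stragglers as last resort
  }
}
{code}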


 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 HBASE-4064_branch90V2.patch, disableflow.png



[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-25 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4064:
--

Attachment: (was: HBASE-4064_branch90V2.patch)

 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 disableflow.png



[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-24 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4064:
--

Attachment: disableflow.png

 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 disableflow.png



[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-24 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070304#comment-13070304
 ] 

gaojinchao commented on HBASE-4064:
---

!disableflow.png!

 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 disableflow.png


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition(The RegionState was remained by some exception which 
 should be removed, that's why I called it as rubbish object), but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls on the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully. The master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state on the master, e.g.:
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 While thread B evaluates "if (!regions.containsKey(region))", this.regions 
 still holds the region info; now the CPU switches to thread A.
 Thread A removes the region from both this.regions and regionsInTransition, 
 then execution switches back to thread B. Thread B continues and throws an 
 exception with the message "Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
 RegionState from regionsInTransition, so it can never be removed.
  public void unassign(HRegionInfo region, boolean force) {
    LOG.debug("Starting unassignment of region " +
      region.getRegionNameAsString() + " (offlining)");
    synchronized (this.regions) {

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-24 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070306#comment-13070306
 ] 

gaojinchao commented on HBASE-4064:
---

The patch can't solve J-D's issue, but it is an improvement for disabling a 
table.

I made a flow chart (A -> B -> C -> D). We can see there is a window between 
"Remove region from RIT" and "Remove region from region collections", so my 
patch changes the order of those two steps.
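
For illustration, here is a minimal sketch of that reordering idea. It is 
not the attached patch; it assumes the 0.90-era AssignmentManager method and 
field names (setOffline, clearRegionPlan, regionsInTransition) that appear 
in the code pasted later in this thread:
{noformat}
// Hypothetical sketch, not HBASE-4064's actual patch: take the region
// out of the online-regions bookkeeping *before* clearing it from
// regionsInTransition, so a concurrent disable thread can no longer
// pick up a region whose close is still being finalized.
public void regionOffline(final HRegionInfo regionInfo) {
  setOffline(regionInfo);        // 1. remove from this.regions first
  clearRegionPlan(regionInfo);   // 2. drop any stale region plan
  synchronized (this.regionsInTransition) {
    // 3. only now leave RIT; the window described above is gone
    if (this.regionsInTransition.remove(regionInfo.getEncodedName()) != null) {
      this.regionsInTransition.notifyAll();
    }
  }
}
{noformat}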


 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 disableflow.png


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition (the RegionState was left behind by some exception and 
 should have been removed; that's why I call it a rubbish object) but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls on the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully. The master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state on the master, e.g.:
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 While thread B evaluates "if (!regions.containsKey(region))", this.regions 
 still holds the region info; now the CPU switches to thread A.
 Thread A removes the region from both this.regions and regionsInTransition, 
 then execution switches back to thread B. Thread B continues and throws an 
 exception with the message "Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
 RegionState from regionsInTransition, so it can never be removed.

[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-24 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4064:
--

Attachment: HBASE-4064_branch90V2.patch

 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 HBASE-4064_branch90V2.patch, disableflow.png


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition (the RegionState was left behind by some exception and 
 should have been removed; that's why I call it a rubbish object) but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls on the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully. The master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state on the master, e.g.:
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 While thread B evaluates "if (!regions.containsKey(region))", this.regions 
 still holds the region info; now the CPU switches to thread A.
 Thread A removes the region from both this.regions and regionsInTransition, 
 then execution switches back to thread B. Thread B continues and throws an 
 exception with the message "Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
 RegionState from regionsInTransition, so it can never be removed.
  public void unassign(HRegionInfo region, boolean force) {
    LOG.debug("Starting unassignment of region " +
      region.getRegionNameAsString() + " (offlining)");
    synchronized (this.regions) {

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-24 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070310#comment-13070310
 ] 

gaojinchao commented on HBASE-4064:
---

I have made a patch, but I haven't verified it yet. I want to review whether 
it is reasonable first, then do the verification.

In my cluster I had changed the parameter 
(hbase.bulk.assignment.waiton.empty.rit) to avoid this issue.
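
As a hedged illustration only (the timeout value below is a placeholder, not 
a recommendation), the parameter can be overridden in code as well as in 
hbase-site.xml:
{noformat}
// Illustrative only: raise the wait-on-empty-RIT period to 10 minutes.
// Uses org.apache.hadoop.conf.Configuration and
// org.apache.hadoop.hbase.HBaseConfiguration.
Configuration conf = HBaseConfiguration.create();
conf.setLong("hbase.bulk.assignment.waiton.empty.rit", 10 * 60 * 1000L);
{noformat}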


 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 HBASE-4064_branch90V2.patch, disableflow.png


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition (the RegionState was left behind by some exception and 
 should have been removed; that's why I call it a rubbish object) but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls on the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully. The master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state on the master, e.g.:
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 While thread B evaluates "if (!regions.containsKey(region))", this.regions 
 still holds the region info; now the CPU switches to thread A.
 Thread A removes the region from both this.regions and regionsInTransition, 
 then execution switches back to thread B. Thread B continues and throws an 
 exception with the message "Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
 RegionState from regionsInTransition, so it can never be removed.

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-22 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13069442#comment-13069442
 ] 

gaojinchao commented on HBASE-4064:
---

@J-D Thanks for your reply. :)

I got it. In my case, the race is between the disable threads and the 
ClosedRegionHandler threads.

 1. The disable thread gets the regions from the regions collection (see 
getRegionsOfTable).

 2. The thread pool gets a region and sends the close request to the region 
server; at the same time it puts the region into RIT (regionsInTransition), 
which indicates that the region is being processed.

 3. The region server finishes closing the region, changes the ZK state, and 
notifies the master.

 4. When the master receives the watcher event, it removes the region from 
RIT and then removes it from the regions collection.
   There is a short window here: if the disable table operation can't finish 
within one period, the region may be unassigned again.

My patch tries to fix the above case: remove the region from the regions 
collection first, so the disable thread can't pick up a region that is still 
being processed.
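
A timeline sketch of the window in steps 1-4 (the interleaving is 
illustrative, reconstructed from the description above):
{noformat}
// Disable thread                        ClosedRegionHandler thread
// --------------                        --------------------------
// getRegionsOfTable(t)
//   -> region R is still in this.regions
//                                       remove R from regionsInTransition
// unassign(R) -> R re-enters RIT
//                                       remove R from this.regions
// Result: R sits in RIT but is assigned nowhere, so TimeoutMonitor
// re-runs the forced unassign forever ("PENDING_CLOSE for too long").
{noformat}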


I found this issue yesterday: the enable threads are also in a race 
condition. (I changed the period to 1 minute in order to reproduce the 
issue.) It seems the pool couldn't finish before a new enable process 
started; we need a sleep after an enable period finishes.

The master logs:
2011-07-22 13:33:27,806 INFO 
org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 
regions of which 2156 are online.

2011-07-22 13:34:28,646 INFO 
org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 
regions of which 982 are online.
2011-07-22 13:34:31,079 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,080 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:6-0x31502ef4f0 Creating (or updating) unassigned node for 
c9b1c97ac6c00033ceb1890e45e66229 with OFFLINE state
2011-07-22 13:34:31,104 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Forcing OFFLINE; 
was=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. 
state=OFFLINE, ts=1311312871080
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
No previous transition plan was found (or we are ignoring an existing plan) for 
ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. so generated a 
random one; 
hri=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229., src=, 
dest=C4C2.site,60020,1311310281335; 3 (online=3, exclude=null) available servers
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Assigning region 
ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. to 
C4C2.site,60020,1311310281335
2011-07-22 13:34:31,122 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,123 FATAL org.apache.hadoop.hbase.master.HMaster: 
Unexpected state trying to OFFLINE; 
ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. 
state=PENDING_OPEN, ts=1311312871121
java.lang.IllegalStateException
at 
org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1081)
at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1036)
at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:864)
at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:844)
at 
org.apache.hadoop.hbase.master.handler.EnableTableHandler$BulkEnabler$1.run(EnableTableHandler.java:154)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
2011-07-22 13:34:31,125 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
2011-07-22 13:34:31,482 DEBUG 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: 
master:6-0x31502ef4f0 Received ZooKeeper Event, type=NodeDataChanged, 
state=SyncConnected, path=/hbase/unassigned/c9b1c97ac6c00033ceb1890e45e66229
2011-07-22 13:34:31,482 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x31502ef4f0 Unable to get data of znode 
/hbase/unassigned/c9b1c97ac6c00033ceb1890e45e66229
 


 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: 

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-21 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13068876#comment-13068876
 ] 

gaojinchao commented on HBASE-4064:
---

Please don't merge the patch; I found another issue and need to dig into 
whether it is related to this patch. Thanks.


 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition (the RegionState was left behind by some exception and 
 should have been removed; that's why I call it a rubbish object) but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls on the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully. The master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state on the master, e.g.:
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 While thread B evaluates "if (!regions.containsKey(region))", this.regions 
 still holds the region info; now the CPU switches to thread A.
 Thread A removes the region from both this.regions and regionsInTransition, 
 then execution switches back to thread B. Thread B continues and throws an 
 exception with the message "Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
 RegionState from regionsInTransition, so it can never be removed.
  public void unassign(HRegionInfo region, boolean force) {
    LOG.debug("Starting unassignment of region " +
      region.getRegionNameAsString() + " (offlining)");

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

2011-07-20 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13068747#comment-13068747
 ] 

gaojinchao commented on HBASE-4095:
---

I added some logging and found that initialReplication is zero. When we 
create a file in HDFS and don't write any data, the reported replication is 
zero, so the solution has an issue.

2011-07-20 19:38:20,517 WARN  [RegionServer:1;C4C3.site,41763,1311161899551] 
wal.HLog(478): gjc:rollWriter start1311161900517
2011-07-20 19:38:20,650 WARN  [RegionServer:0;C4C3.site,35697,1311161899494] 
wal.HLog(478): gjc:rollWriter start1311161900650
2011-07-20 19:38:20,707 WARN  [RegionServer:1;C4C3.site,41763,1311161899551] 
wal.HLog(518): gjc:updateLock start1311161900707
2011-07-20 19:38:20,707 WARN  [RegionServer:1;C4C3.site,41763,1311161899551] 
wal.HLog(532): gjc:initialReplication start0
2011-07-20 19:38:21,238 WARN  [RegionServer:0;C4C3.site,35697,1311161899494] 
wal.HLog(518): gjc:updateLock start1311161901238
2011-07-20 19:38:21,239 WARN  [RegionServer:0;C4C3.site,35697,1311161899494] 
wal.HLog(532): gjc:initialReplication start0
2011-07-20 19:38:41,726 WARN  [IPC Server handler 4 on 37616] wal.HLog(478): 
gjc:rollWriter start1311161921726
2011-07-20 19:38:41,769 WARN  [IPC Server handler 4 on 37616] wal.HLog(518): 
gjc:updateLock start1311161921769
2011-07-20 19:38:41,769 WARN  [IPC Server handler 4 on 37616] wal.HLog(532): 
gjc:initialReplication start0
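
One possible direction, sketched here only as an assumption (this is not the 
attached patch): initialize initialReplication lazily on the first sync that 
reports a non-zero pipeline, so a value captured as 0 at file-creation time 
can never permanently disable the low-replication check. Field and method 
names follow the snippet quoted below:
{noformat}
private void checkLowReplication() {
  try {
    int numCurrentReplicas = getLogReplication();
    // Sketch: if initialReplication was captured as 0 at creation time
    // (no data written yet), adopt the first real pipeline size instead.
    if (this.initialReplication == 0 && numCurrentReplicas > 0) {
      this.initialReplication = numCurrentReplicas;
    }
    if (numCurrentReplicas != 0 &&
        numCurrentReplicas < this.initialReplication) {
      requestLogRoll();
      logRollRequested = true;
    }
  } catch (Exception e) {
    LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas " + e +
      " still proceeding ahead...");
  }
}
{noformat}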


 Hlog may not be rolled in a long time if checkLowReplication's request of 
 LogRoll is blocked
 

 Key: HBASE-4095
 URL: https://issues.apache.org/jira/browse/HBASE-4095
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.90.3
Reporter: Jieshan Bean
Assignee: Jieshan Bean
 Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, 
 HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif


 Some large HLog files (larger than 10G) appeared in our environment, and I 
 found the reason why they got so huge:
 1. The replica count is less than the expected number, so 
 checkLowReplication will be called on each sync.
 2. checkLowReplication requests a log roll first and sets 
 logRollRequested to true: 
 {noformat}
 private void checkLowReplication() {
   // if the number of replicas in HDFS has fallen below the initial
   // value, then roll logs.
   try {
     int numCurrentReplicas = getLogReplication();
     if (numCurrentReplicas != 0 &&
         numCurrentReplicas < this.initialReplication) {
       LOG.warn("HDFS pipeline error detected. " +
         "Found " + numCurrentReplicas + " replicas but expecting " +
         this.initialReplication + " replicas. " +
         " Requesting close of hlog.");
       requestLogRoll();
       logRollRequested = true;
     }
   } catch (Exception e) {
     LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
       " still proceeding ahead...");
   }
 }
 {noformat}
 3. requestLogRoll() just submits the roll request. It may not execute in 
 time, because it must acquire the unfair cacheFlushLock, and that lock may 
 be held by the cache-flush threads.
 4. logRollRequested stays true until the log roll executes, so during that 
 time every log-roll request made from sync() is skipped.
 Here are the logs from when the problem happened (please note the file size 
 of hlog 193-195-5-111%3A20020.1309937386639 in the last row):
 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: 
 HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  
 Requesting close of hlog.
 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: 
 Roll 
 /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119,
  entries=32434, filesize=239589754. New hlog 
 /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: 
 HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  
 Requesting close of hlog.
 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: 
 Renaming flushed file at 
 hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847
  to 
 hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: 
 Added 
 hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983,
  entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Finished memstore 

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-20 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13068796#comment-13068796
 ] 

gaojinchao commented on HBASE-4064:
---

Hi, I verified the issue by adding a sleep in regionOffline. I think V2 is OK.

Below is the code (the sleep is test-only, to widen the race window):
 public void regionOffline(final HRegionInfo regionInfo) {
   synchronized (this.regionsInTransition) {
     if (this.regionsInTransition.remove(regionInfo.getEncodedName()) != null) {
       this.regionsInTransition.notifyAll();
     }
   }
   // test-only sleep: widens the window between the RIT removal above
   // and the regions-collection removal below
   try {
     Thread.sleep(1000);
   } catch (Throwable e) {
     // ignored
   }
   // remove the region plan as well just in case.
   clearRegionPlan(regionInfo);
   setOffline(regionInfo);
 }

 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition (the RegionState was left behind by some exception and 
 should have been removed; that's why I call it a rubbish object) but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls on the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully. The master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state on the master, e.g.:
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 While thread B evaluates "if (!regions.containsKey(region))", this.regions 
 still holds the region info; now the CPU switches to thread A.
 Thread A removes the region from both this.regions and regionsInTransition, 
 then execution switches back to thread B.

[jira] [Updated] (HBASE-4112) Creating table may throw NullPointerException

2011-07-19 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4112:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

 Creating table may throw NullPointerException
 -

 Key: HBASE-4112
 URL: https://issues.apache.org/jira/browse/HBASE-4112
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4112_Trunk.patch, HBASE-4112_branch90V1.patch


  It happened in the latest branch 0.90, but I can't reproduce it.
 
  It seems using the API getHRegionInfoOrNull is better, or checking the 
  input parameter before calling getHRegionInfo.
 
  Code:
   public static Writable getWritable(final byte [] bytes, final Writable w)
   throws IOException {
     return getWritable(bytes, 0, bytes.length, w);
   }
   return getWritable(bytes, 0, bytes.length, w);  // It seems the input 
   parameter "bytes" is null
 
  logs:
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Socket connection 
  established to C4C3.site/157.5.100.3:2181, initiating session
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Session establishment 
  complete on server C4C3.site/157.5.100.3:2181, sessionid = 
  0x2312b8e3f72, negotiated timeout = 18
  [INFO] Create : ufdr111 222!
  [INFO] Create : ufdr111 start!
  java.lang.NullPointerException
 at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
 at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
 at org.apache.hadoop.hbase.client.HBaseAdmin$1.processRow(HBaseAdmin.java:306)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:190)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:95)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:73)
 at org.apache.hadoop.hbase.client.HBaseAdmin.createTable(HBaseAdmin.java:325)
 at createTable.main(createTable.java:96)
 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4112) Creating table threw NullPointerException

2011-07-18 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4112:
--

Attachment: HBASE-4112_branch90V1.patch

 Creating table threw NullPointerException
 -

 Key: HBASE-4112
 URL: https://issues.apache.org/jira/browse/HBASE-4112
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4112_branch90V1.patch


  It happened in the latest branch 0.90, but I can't reproduce it.
 
  It seems using the API getHRegionInfoOrNull is better, or checking the 
  input parameter before calling getHRegionInfo.
 
  Code:
   public static Writable getWritable(final byte [] bytes, final Writable w)
   throws IOException {
     return getWritable(bytes, 0, bytes.length, w);
   }
   return getWritable(bytes, 0, bytes.length, w);  // It seems the input 
   parameter "bytes" is null
 
  logs:
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Socket connection 
  established to C4C3.site/157.5.100.3:2181, initiating session
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Session establishment 
  complete on server C4C3.site/157.5.100.3:2181, sessionid = 
  0x2312b8e3f72, negotiated timeout = 18
  [INFO] Create : ufdr111 222!
  [INFO] Create : ufdr111 start!
  java.lang.NullPointerException
 at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
 at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
 at org.apache.hadoop.hbase.client.HBaseAdmin$1.processRow(HBaseAdmin.java:306)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:190)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:95)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:73)
 at org.apache.hadoop.hbase.client.HBaseAdmin.createTable(HBaseAdmin.java:325)
 at createTable.main(createTable.java:96)
 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4112) Creating table threw NullPointerException

2011-07-18 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13066918#comment-13066918
 ] 

gaojinchao commented on HBASE-4112:
---

The reason is that the META table had some dirty data (e.g. the 
column=info:server cell), so recreating the table threw the exception.
I have made a patch and verified it. Please review it. Thanks.


All tests passed.

 Creating table threw NullPointerException
 -

 Key: HBASE-4112
 URL: https://issues.apache.org/jira/browse/HBASE-4112
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4112_branch90V1.patch


  It happened in the latest branch 0.90, but I can't reproduce it.
 
  It seems using the API getHRegionInfoOrNull is better, or checking the 
  input parameter before calling getHRegionInfo.
 
  Code:
   public static Writable getWritable(final byte [] bytes, final Writable w)
   throws IOException {
     return getWritable(bytes, 0, bytes.length, w);
   }
   return getWritable(bytes, 0, bytes.length, w);  // It seems the input 
   parameter "bytes" is null
 
  logs:
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Socket connection 
  established to C4C3.site/157.5.100.3:2181, initiating session
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Session establishment 
  complete on server C4C3.site/157.5.100.3:2181, sessionid = 
  0x2312b8e3f72, negotiated timeout = 18
  [INFO] Create : ufdr111 222!
  [INFO] Create : ufdr111 start!
  java.lang.NullPointerException
 at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
 at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
 at org.apache.hadoop.hbase.client.HBaseAdmin$1.processRow(HBaseAdmin.java:306)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:190)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:95)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:73)
 at org.apache.hadoop.hbase.client.HBaseAdmin.createTable(HBaseAdmin.java:325)
 at createTable.main(createTable.java:96)
 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4112) Creating table threw NullPointerException

2011-07-18 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13067444#comment-13067444
 ] 

gaojinchao commented on HBASE-4112:
---

False means the scan is finished; true means continue and process the next 
record. In this case, true is better (my test shows the same).

// the code segment from metaScan:
  for (Result rr : rrs) {
    if (processedRows >= rowUpperLimit) {
      break done;
    }
    if (!visitor.processRow(rr))
      break done; // exit completely
    processedRows++;
  }
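
For illustration, a visitor in the spirit of the fix, sketched under the 
assumption that the 0.90 client exposes MetaScanner.MetaScannerVisitor and 
Writables.getHRegionInfoOrNull as discussed in this thread:
{noformat}
// Sketch: skip dirty META rows (no usable info:regioninfo cell) by
// returning true, so the scan continues instead of hitting the NPE.
MetaScannerVisitor visitor = new MetaScannerVisitor() {
  public boolean processRow(Result rowResult) throws IOException {
    byte[] bytes = rowResult.getValue(HConstants.CATALOG_FAMILY,
        HConstants.REGIONINFO_QUALIFIER);
    HRegionInfo info = Writables.getHRegionInfoOrNull(bytes);
    if (info == null) {
      return true;  // dirty row: continue with the next record
    }
    // ... examine info here ...
    return true;
  }
};
{noformat}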

 Creating table threw NullPointerException
 -

 Key: HBASE-4112
 URL: https://issues.apache.org/jira/browse/HBASE-4112
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4112_branch90V1.patch


  It happened in the latest branch 0.90, but I can't reproduce it.
 
  It seems using the API getHRegionInfoOrNull is better, or checking the 
  input parameter before calling getHRegionInfo.
 
  Code:
   public static Writable getWritable(final byte [] bytes, final Writable w)
   throws IOException {
     return getWritable(bytes, 0, bytes.length, w);
   }
   return getWritable(bytes, 0, bytes.length, w);  // It seems the input 
   parameter "bytes" is null
 
  logs:
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Socket connection 
  established to C4C3.site/157.5.100.3:2181, initiating session
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Session establishment 
  complete on server C4C3.site/157.5.100.3:2181, sessionid = 
  0x2312b8e3f72, negotiated timeout = 18
  [INFO] Create : ufdr111 222!
  [INFO] Create : ufdr111 start!
  java.lang.NullPointerException
 at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
 at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
 at org.apache.hadoop.hbase.client.HBaseAdmin$1.processRow(HBaseAdmin.java:306)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:190)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:95)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:73)
 at org.apache.hadoop.hbase.client.HBaseAdmin.createTable(HBaseAdmin.java:325)
 at createTable.main(createTable.java:96)
 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4112) Creating table threw NullPointerException

2011-07-18 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13067449#comment-13067449
 ] 

gaojinchao commented on HBASE-4112:
---

OK, I'll try to make a patch for TRUNK.

 Creating table threw NullPointerException
 -

 Key: HBASE-4112
 URL: https://issues.apache.org/jira/browse/HBASE-4112
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4112_branch90V1.patch


  It happened in the latest branch 0.90, but I can't reproduce it.
 
  It seems using the API getHRegionInfoOrNull is better, or checking the 
  input parameter before calling getHRegionInfo.
 
  Code:
   public static Writable getWritable(final byte [] bytes, final Writable w)
   throws IOException {
     return getWritable(bytes, 0, bytes.length, w);
   }
   return getWritable(bytes, 0, bytes.length, w);  // It seems the input 
   parameter "bytes" is null
 
  logs:
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Socket connection 
  established to C4C3.site/157.5.100.3:2181, initiating session
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Session establishment 
  complete on server C4C3.site/157.5.100.3:2181, sessionid = 
  0x2312b8e3f72, negotiated timeout = 18
  [INFO] Create : ufdr111 222!
  [INFO] Create : ufdr111 start!
  java.lang.NullPointerException
 at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
 at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
 at org.apache.hadoop.hbase.client.HBaseAdmin$1.processRow(HBaseAdmin.java:306)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:190)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:95)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:73)
 at org.apache.hadoop.hbase.client.HBaseAdmin.createTable(HBaseAdmin.java:325)
 at createTable.main(createTable.java:96)
 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4112) Creating table threw NullPointerException

2011-07-18 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4112:
--

Attachment: HBASE-4112_Trunk.patch

 Creating table threw NullPointerException
 -

 Key: HBASE-4112
 URL: https://issues.apache.org/jira/browse/HBASE-4112
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4112_Trunk.patch, HBASE-4112_branch90V1.patch


  It happened in the latest branch 0.90, but I can't reproduce it.
 
  It seems using the API getHRegionInfoOrNull is better, or checking the 
  input parameter before calling getHRegionInfo.
 
  Code:
   public static Writable getWritable(final byte [] bytes, final Writable w)
   throws IOException {
     return getWritable(bytes, 0, bytes.length, w);
   }
   return getWritable(bytes, 0, bytes.length, w);  // It seems the input 
   parameter "bytes" is null
 
  logs:
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Socket connection 
  established to C4C3.site/157.5.100.3:2181, initiating session
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Session establishment 
  complete on server C4C3.site/157.5.100.3:2181, sessionid = 
  0x2312b8e3f72, negotiated timeout = 18
  [INFO] Create : ufdr111 222!
  [INFO] Create : ufdr111 start!
  java.lang.NullPointerException
 at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
 at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
 at org.apache.hadoop.hbase.client.HBaseAdmin$1.processRow(HBaseAdmin.java:306)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:190)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:95)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:73)
 at org.apache.hadoop.hbase.client.HBaseAdmin.createTable(HBaseAdmin.java:325)
 at createTable.main(createTable.java:96)
 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-16 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13066387#comment-13066387
 ] 

gaojinchao commented on HBASE-4064:
---

@Stack:
I will reproduce and verify it after finishing the review, because it may 
take a lot of time.


 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition (the RegionState was left behind by some exception and 
 should have been removed; that's why I call it a rubbish object) but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls on the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully. The master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state on the master, e.g.:
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 While thread B evaluates "if (!regions.containsKey(region))", this.regions 
 still holds the region info; now the CPU switches to thread A.
 Thread A removes the region from both this.regions and regionsInTransition, 
 then execution switches back to thread B. Thread B continues and throws an 
 exception with the message "Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
 RegionState from regionsInTransition, so it can never be removed.
  public void unassign(HRegionInfo region, boolean force) {
    LOG.debug("Starting unassignment of region " +
      region.getRegionNameAsString() + " (offlining)");

[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-14 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4064:
--

Attachment: HBASE-4064_branch90V2.patch

I tried to make a patch: if the region is in RIT, it shouldn't be unassigned 
again, so it seems changing the code position can solve this issue.
All tests passed. Please review and give some suggestions.
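
A minimal sketch of that idea (assuming the 0.90 unassign() and 
regionsInTransition map shown elsewhere in this thread; this is not the 
attached patch):
{noformat}
// Sketch: if the region is already in transition, bail out instead of
// force-unassigning it a second time.
public void unassign(HRegionInfo region, boolean force) {
  synchronized (this.regionsInTransition) {
    if (this.regionsInTransition.containsKey(region.getEncodedName())) {
      return;  // a close/open is already in flight; don't start another
    }
  }
  // ... continue with the normal unassign path ...
}
{noformat}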

 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition (the RegionState was left behind by some exception and 
 should have been removed; that's why I call it a rubbish object), but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls for the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully; the master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state in the master, e.g.
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 When thread B evaluates if (!regions.containsKey(region)), this.regions still 
 holds the region info; the CPU then switches to thread A.
 Thread A removes the region from the sets this.regions and 
 regionsInTransition, then control switches back to thread B. Thread B 
 continues and throws an exception with the message "Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
 RegionState from regionsInTransition, so it can never be removed.
  public void unassign(HRegionInfo region, 

[jira] [Commented] (HBASE-3933) Hmaster throw NullPointerException

2011-07-13 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13064379#comment-13064379
 ] 

gaojinchao commented on HBASE-3933:
---

OK, thanks.
It happens rarely; I can't come up with a better change now.

 Hmaster throw NullPointerException
 --

 Key: HBASE-3933
 URL: https://issues.apache.org/jira/browse/HBASE-3933
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Attachments: Hmastersetup0.90


 NullPointerException while hmaster starting.
 {code}
   java.lang.NullPointerException
 at java.util.TreeMap.getEntry(TreeMap.java:324)
 at java.util.TreeMap.get(TreeMap.java:255)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.addToServers(AssignmentManager.java:1512)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:606)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
 at 
 org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:402)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
 {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3995) HBASE-3946 broke TestMasterFailover

2011-06-27 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055406#comment-13055406
 ] 

gaojinchao commented on HBASE-3995:
---

Hi, stack.
The following code snippet repeats the check if (storedInfo == null); a 
collapsed form is sketched after the block:

 if (storedInfo == null) {
   ...
   if (storedInfo == null) {
     storedInfo = this.onlineServers.get(info.getServerName());
   }
 }
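
For comparison, a minimal sketch of the collapsed form; the first lookup is 
hypothetical, since the surrounding method is not quoted here:

{code}
// Illustrative only: fold the duplicated null-check into a single fallback.
HServerInfo storedInfo = lookupServerInfo(info);  // hypothetical first lookup
if (storedInfo == null) {
  storedInfo = this.onlineServers.get(info.getServerName());
}
{code}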

 HBASE-3946 broke TestMasterFailover
 ---

 Key: HBASE-3995
 URL: https://issues.apache.org/jira/browse/HBASE-3995
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: stack
Priority: Blocker
 Fix For: 0.90.4

 Attachments: am.txt


 TestMasterFailover is all about a new master coming up on an existing 
 cluster.  Previous to HBASE-3946, the new master joining a cluster processing 
 any dead servers would assign all regions found on the dead server even if 
 they were split parents.  We don't want that.
 But TestMasterFailover mocks up some pretty interesting conditions.  The one 
 we were failing on was that while the master was offine, we'd manually add a 
 region to zk that was in CLOSING state.  We'd then go and disable the table 
 up in zk (while master was offline).  Finally, we'd' kill the server that was 
 supposed to be hosting the region from the disabled table in CLOSING state. 
 Then we'd have the master join the cluster.  It had to figure it out.
 Before HBASE-3946, we'd just force offline every region that had been on the 
 dead server.  This would call all to be assigned only on assign, regions from 
 disabled tables are skipped, so it all worked (except would online parent 
 of a split should there be one).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-26 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055248#comment-13055248
 ] 

gaojinchao commented on HBASE-4028:
---

Ted, thanks for your work.

 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4028-0.90V2, Screenshot-2.png, Verifiedresult.png, 
 hbase-root-master-157-5-100-8.rar


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-25 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054846#comment-13054846
 ] 

gaojinchao commented on HBASE-4028:
---

Oh, my god! There is another bug, and it is hidden. :)
Following code snippet:

protected AtomicReference<Throwable> thrown = new AtomicReference<Throwable>();

It is thrown.get() that can be null, not thrown itself, so the condition below 
is wrong:

 while (totalBuffered > maxHeapUsage && thrown == null) 

I have made a new patch. Please review it. Thanks.
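
A minimal, self-contained sketch of the loop in question, with simplified 
fields mirroring the snippets quoted in this issue (not the actual patch):

{code}
import java.util.concurrent.atomic.AtomicReference;

class EntryBuffersSketch {
  private final Object dataAvailable = new Object();
  private final AtomicReference<Throwable> thrown =
      new AtomicReference<Throwable>();
  private long totalBuffered;
  private final long maxHeapUsage = 128 * 1024 * 1024;

  void waitForCapacity() throws InterruptedException {
    synchronized (dataAvailable) {
      // Broken form: 'thrown == null' tests the AtomicReference itself,
      // which is never null, so the loop never waits and buffered edits
      // grow without bound. The fix is to test the contained value:
      while (totalBuffered > maxHeapUsage && thrown.get() == null) {
        dataAvailable.wait(3000);
      }
    }
  }
}
{code}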


 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4028-0.90V1.patch, Screenshot-2.png, 
 hbase-root-master-157-5-100-8.rar


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-25 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4028:
--

Attachment: HBASE-4028-0.90V2

 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4028-0.90V1.patch, HBASE-4028-0.90V2, 
 Screenshot-2.png, hbase-root-master-157-5-100-8.rar


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-25 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054847#comment-13054847
 ] 

gaojinchao commented on HBASE-4028:
---

The verified result:
hbase-root-master-157-5-111-22.log:2011-06-25 17:18:53,768 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:18:56,768 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:18:59,768 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:02,768 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:05,769 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:08,769 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:11,769 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:14,769 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:17,769 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:20,770 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:23,770 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...

 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4028-0.90V2, Screenshot-2.png, 
 hbase-root-master-157-5-100-8.rar


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-25 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4028:
--

Attachment: (was: HBASE-4028-0.90V1.patch)

 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4028-0.90V2, Screenshot-2.png, 
 hbase-root-master-157-5-100-8.rar


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Assigned] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-24 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao reassigned HBASE-4028:
-

Assignee: gaojinchao

 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-24 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4028:
--

Attachment: Screenshot-2.png

 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: Screenshot-2.png


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-24 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4028:
--

Attachment: hbase-root-master-157-5-100-8.rar

 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: Screenshot-2.png, hbase-root-master-157-5-100-8.rar


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-24 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4028:
--

Attachment: HBASE-4028-0.90V1.patch

 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4028-0.90V1.patch, Screenshot-2.png, 
 hbase-root-master-157-5-100-8.rar


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3892) Table can't disable

2011-06-12 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13048448#comment-13048448
 ] 

gaojinchao commented on HBASE-3892:
---

TRUNK doesn't need it; it has been modified to use a zk watcher.

The code below should protect against this case (a sketch of the guard follows):
// RegionState must be null, or SPLITTING or PENDING_CLOSE.
if (!isInStateForSplitting(regionState)) break;
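
Purely as an illustration of that contract (the real trunk method may 
differ), the guard could look like:

{code}
// Hypothetical sketch of the guard above: only proceed with split handling
// when the RegionState is null, already SPLITTING, or PENDING_CLOSE.
private boolean isInStateForSplitting(RegionState rs) {
  return rs == null || rs.isSplitting() || rs.isPendingClose();
}
{code}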

 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90v3.patch, 
 AssignmentManager_90v4.patch, logs.rar


 In TimeoutMonitor: 
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should send the zk message again when closing the region times out;
 in this case, some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // disable table and master sent Close message to region server, Region state 
 was set PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // received splitting message and cleared Region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 

[jira] [Commented] (HBASE-3892) Table can't disable

2011-06-10 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13047117#comment-13047117
 ] 

gaojinchao commented on HBASE-3892:
---

It didn't reproduce, so my guess is that J-D is right. The logs below show 
that the region server repeated the message at an interval of 60s, so it 
should be an IPC timeout.
2011-05-08 17:43:45,507 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:44:45,521 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:45:45,524 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:46:45,528 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6

The HMaster logs show that the regionServerReport IPC had been closed, which 
also points to an IPC timeout:

2011-05-08 17:52:47,703 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server 
Responder, call regionServerReport(serverName=C4C3.site,60020,1304820199474, 
load=(requests=0, regions=55, usedHeap=1058, maxHeap=8175), 
[Lorg.apache.hadoop.hbase.HMsg;@1453ecec, 
[Lorg.apache.hadoop.hbase.HRegionInfo;@11e78461) from 157.5.100.3:37518: output 
error
2011-05-08 17:52:47,704 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 7 on 6 caught: java.nio.channels.ClosedChannelException
at 
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at 
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
at 
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at 
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at 
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

But I can't dig out the root cause. 


 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90v3.patch, 
 AssignmentManager_90v4.patch, logs.rar


 In TimeoutMonitor: 
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should send the zk message again when closing the region times out;
 in this case, some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // disable table and master sent Close message to region server, Region state 
 was set PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // received splitting message and cleared Region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from 

[jira] [Commented] (HBASE-3892) Table can't disable

2011-06-08 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13045889#comment-13045889
 ] 

gaojinchao commented on HBASE-3892:
---

No, it still needs review and merge. 

 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90v2.patch, 
 AssignmentManager_90v3.patch, logs.rar


 In TimeoutMonitor: 
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should send the zk message again when closing the region times out;
 in this case, some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // disable table and master sent Close message to region server, Region state 
 was set PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // received splitting message and cleared Region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 

[jira] [Updated] (HBASE-3892) Table can't disable

2011-06-08 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-3892:
--

Attachment: AssignmentManager_90v4.patch

 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90v3.patch, 
 AssignmentManager_90v4.patch, logs.rar


 In TimeoutMonitor: 
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should send the zk message again when closing the region times out;
 in this case, some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // disable table and master sent Close message to region server, Region state 
 was set PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // received splitting message and cleared Region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 

[jira] [Commented] (HBASE-3892) Table can't disable

2011-06-02 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042715#comment-13042715
 ] 

gaojinchao commented on HBASE-3892:
---

I am not familiar with the unit tests (it seems difficult to send a double 
split report and test the cluster function), so I verified it by modifying 
the region server code.

Logs:
2011-06-02 19:57:49,056 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr2,8613816508215#1308,1307014802589.c5a8a47d9c84f417b9fcc4c8019e7c7e.: 
Daughters; 
ufdr2,8613816508215#1308,1307015867020.37481173e31ea469bcaa310cf8d7d980., 
ufdr2,8613816595415#3432,1307015867020.afbf02ef235cabe66026f7c393d79bc0. from 
C4C4.site,60020,1307015130114
2011-06-02 19:57:49,057 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x30502c278f Unable to get data of znode 
/hbase/unassigned/c5a8a47d9c84f417b9fcc4c8019e7c7e because node does not exist 
(not necessarily an error)
2011-06-02 19:57:49,081 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr2,8613816508215#1308,1307014802589.c5a8a47d9c84f417b9fcc4c8019e7c7e.: 
Daughters; 
ufdr2,8613816508215#1308,1307015867020.37481173e31ea469bcaa310cf8d7d980., 
ufdr2,8613816595415#3432,1307015867020.afbf02ef235cabe66026f7c393d79bc0. from 
C4C4.site,60020,1307015130114
2011-06-02 19:57:49,083 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x30502c278f Unable to get data of znode 
/hbase/unassigned/c5a8a47d9c84f417b9fcc4c8019e7c7e because node does not exist 
(not necessarily an error)
2011-06-02 19:57:49,083 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Trying to process the split of 37481173e31ea469bcaa310cf8d7d980, but it was 
already done and one daughter is on region server 
serverName=C4C4.site,60020,1307015130114, load=(requests=0, regions=0, 
usedHeap=0, maxHeap=0)
2011-06-02 19:57:56,468 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr2,8613819021840#1446,1307014756068.f865d41d918297f30b576b9ea3ccea07.: 
Daughters; 
ufdr2,8613819021840#1446,1307015873554.baa21e4f0cfa5840f009d0fac8e83d15., 
ufdr2,8613819104397#3916,1307015873554.fb63f608e5e37f5e85d71c925bc78010. from 
C4C3.site,60020,1307015129703
2011-06-02 19:57:56,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x30502c278f Unable to get data of znode 
/hbase/unassigned/f865d41d918297f30b576b9ea3ccea07 because node does not exist 
(not necessarily an error)
2011-06-02 19:57:56,472 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr2,8613819021840#1446,1307014756068.f865d41d918297f30b576b9ea3ccea07.: 
Daughters; 
ufdr2,8613819021840#1446,1307015873554.baa21e4f0cfa5840f009d0fac8e83d15., 
ufdr2,8613819104397#3916,1307015873554.fb63f608e5e37f5e85d71c925bc78010. from 
C4C3.site,60020,1307015129703
2011-06-02 19:57:56,474 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x30502c278f Unable to get data of znode 
/hbase/unassigned/f865d41d918297f30b576b9ea3ccea07 because node does not exist 
(not necessarily an error)
2011-06-02 19:57:56,474 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Trying to process the split of baa21e4f0cfa5840f009d0fac8e83d15, but it was 
already done and one daughter is on region server 
serverName=C4C3.site,60020,1307015129703, load=(requests=0, regions=0, 
usedHeap=0, maxHeap=0)


Thanks for your hint. It should be a 60-second timeout; the region server 
repeated the message about every 60s.

2011-05-08 17:43:45,507 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:44:45,521 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6

It seems to be a race on regionsInTransition, so the IPC was blocked. I will 
try to reproduce it; a sketch of that kind of contention follows.
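
To make that suspicion concrete, a minimal sketch of the kind of monitor 
contention being described (names invented for illustration; not the actual 
master code):

{code}
// Illustration: if a handler thread holds the regionsInTransition monitor
// for a long time (e.g. while processing repeated split reports), every IPC
// handler needing the same monitor blocks, and the region server's
// regionServerReport call times out after ~60s and is retried.
class ContentionSketch {
  private final Object regionsInTransition = new Object();

  void processSplitReport() {
    synchronized (regionsInTransition) {
      // long-running work while holding the monitor ...
    }
  }

  void regionServerReport() {
    synchronized (regionsInTransition) {  // blocks until the handler exits
      // update server load and in-transition state ...
    }
  }
}
{code}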

HMaster logs: it received many RS_ZK_REGION_CLOSED messages. 
2011-05-08 17:43:45,157 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x22fcd582836003d Retrieved 125 byte(s) of data from znode 
/hbase/unassigned/83c05d9ead23d9a260edf30dc8739cf7 and set watcher; 
region=ufdr,2011050802#8613815394007#0610,1304847545412.83c05d9ead23d9a260edf30dc8739cf7.,
 server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSING
2011-05-08 17:43:45,525 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:43:48,943 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x22fcd582836003d Retrieved 125 byte(s) of data from znode 
/hbase/unassigned/5e3bacf3f43b6bad874e80c2f971e632 and set watcher; 

[jira] [Updated] (HBASE-3892) Table can't disable

2011-06-02 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-3892:
--

Attachment: AssignmentManager_90v3.patch

 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90.patch, 
 AssignmentManager_90v2.patch, AssignmentManager_90v3.patch, logs.rar


 In TimeoutMonitor: 
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should send the zk message again when closing the region times out;
 in this case, some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // disable table and master sent Close message to region server, Region state 
 was set PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // received splitting message and cleared Region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 

[jira] [Updated] (HBASE-3892) Table can't disable

2011-06-02 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-3892:
--

Attachment: (was: AssignmentManager_90.patch)

 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90v2.patch, 
 AssignmentManager_90v3.patch, logs.rar


 In TimeoutMonitor: 
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should send the zk message again when closing the region times out;
 in this case, some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // disable table and master sent Close message to region server, Region state 
 was set PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // Received the splitting message and cleared the region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
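
The TimeoutMonitor change suggested above could look roughly like the sketch
below. This is a minimal, self-contained sketch of the idea only, not the
attached patch; all class, interface, and method names here are hypothetical.

// Minimal sketch (hypothetical names): on a close timeout, re-drive the
// CLOSED handling if the ZK node already reports RS_ZK_REGION_CLOSED,
// otherwise resend the CLOSE message so it cannot be lost for good.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TimeoutMonitorSketch {
  interface Zk { String nodeState(String region); }   // stand-in for ZK access
  interface Rpc { void sendClose(String region); }    // stand-in for master->RS RPC

  private final Zk zk;
  private final Rpc rpc;
  // Regions currently in PENDING_CLOSE, with the time the CLOSE was sent.
  private final Map<String, Long> pendingClose = new ConcurrentHashMap<String, Long>();

  TimeoutMonitorSketch(Zk zk, Rpc rpc) { this.zk = zk; this.rpc = rpc; }

  void onTimeout(String region) {
    if (!pendingClose.containsKey(region)) {
      return;                                         // region is not PENDING_CLOSE
    }
    if ("RS_ZK_REGION_CLOSED".equals(zk.nodeState(region))) {
      handleClosed(region);                           // CLOSED event was lost: re-drive it
    } else {
      rpc.sendClose(region);                          // resend the CLOSE message
    }
  }

  private void handleClosed(String region) {
    pendingClose.remove(region);                      // region is now closed
  }
}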
 

[jira] [Commented] (HBASE-3892) Table can't disable

2011-06-01 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042050#comment-13042050
 ] 

gaojinchao commented on HBASE-3892:
---

I kept digging and found that the repeated message was sent by the region
server.

If regionServerReport throws an exception, the region server reconnects to the
HMaster and sends the same message again.

//region server logs.
2011-05-08 17:43:45,507 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:44:45,521 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:45:45,524 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:46:45,528 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:47:45,531 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:48:45,535 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:49:46,091 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:50:46,096 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:51:46,099 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:52:46,104 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6

//region server code.
List<HMsg> tryRegionServerReport(final List<HMsg> outboundMessages)
    throws IOException {
  this.serverInfo.setLoad(buildServerLoad());
  this.requestCount.set(0);
  addOutboundMsgs(outboundMessages);
  HMsg[] msgs = null;
  while (!this.stopped) {
    try {
      msgs = this.hbaseMaster.regionServerReport(this.serverInfo,
          outboundMessages.toArray(HMsg.EMPTY_HMSG_ARRAY),
          getMostLoadedRegions());
      break;
    } catch (IOException ioe) {
      if (ioe instanceof RemoteException) {
        ioe = ((RemoteException) ioe).unwrapRemoteException();
      }
      if (ioe instanceof YouAreDeadException) {
        // This will be caught and handled as a fatal error in run()
        throw ioe;
      }
      // Couldn't connect to the master, get location from zk and reconnect
      // Method blocks until new master is found or we are stopped
      getMaster();
    }
  }

Why did regionServerReport throw an exception?

It seems the HMaster was busy and its IPC responder was blocked.

HMaster logs:
2011-05-08 17:44:25,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server 
Responder, call regionServerReport(serverName=C4C4.site,60020,1304820199467, 
load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175), 
[Lorg.apache.hadoop.hbase.HMsg;@520ed128, 
[Lorg.apache.hadoop.hbase.HRegionInfo;@4ac5c32e) from 157.5.100.4:50187: output 
error
2011-05-08 17:44:25,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 11 on 6 caught: java.nio.channels.ClosedChannelException
at 
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at 
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
at 
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at 
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at 
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
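
In other words, the master may have processed the report, but the response
write failed, so the region server cannot tell success from failure and
resends the same HMsgs. The report RPC is effectively at-least-once, and the
master has to tolerate duplicate messages. A toy sketch of that failure mode
(hypothetical names, not HBase code):

import java.io.IOException;
import java.util.List;

class AtLeastOnceReportSketch {
  interface Master { void report(List<String> msgs) throws IOException; }

  // Resend the same messages until a response arrives. If the master handled
  // the call but the response was lost (e.g. a ClosedChannelException on the
  // responder), the same messages are delivered more than once.
  static void reportUntilAcked(Master master, List<String> msgs)
      throws InterruptedException {
    while (true) {
      try {
        master.report(msgs);   // may succeed server-side yet fail client-side
        return;
      } catch (IOException lostResponse) {
        Thread.sleep(1000);    // reconnect/back off, then resend the same msgs
      }
    }
  }
}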



 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90.patch, 
 AssignmentManager_90v2.patch, logs.rar


 In TimeoutMonitor:
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should resend the ZK message when the region close times out,
 because in this case some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // Disable table: the master sent a CLOSE message to the region server, and the
 region state was set to PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 

[jira] [Commented] (HBASE-3892) Table can't disable

2011-06-01 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042062#comment-13042062
 ] 

gaojinchao commented on HBASE-3892:
---

The patch (AssignmentManager_90v2) looks beneficial. Thanks.

 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90.patch, 
 AssignmentManager_90v2.patch, logs.rar


 In TimeoutMonitor:
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should resend the ZK message when the region close times out,
 because in this case some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // Disable table: the master sent a CLOSE message to the region server, and the
 region state was set to PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // Received the splitting message and cleared the region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: 
 

[jira] [Commented] (HBASE-3892) Table can't disable

2011-05-31 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13041536#comment-13041536
 ] 

gaojinchao commented on HBASE-3892:
---

Hi, Stack. I made a mistake in the above analysis.
I read the code again; the root cause is that the splitting message was repeated.

2011-05-08 17:42:45,514 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467

//The master started unassigning (closing) the region.
2011-05-08 17:43:37,599 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Starting unassignment of region 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 (offlining)

2011-05-08 17:43:45,525 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467

// Set the RIT state and sent a CLOSE message
2011-05-08 17:44:25,745 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Sent CLOSE to serverName=C4C4.site,60020,1304820199467, load=(requests=0, 
regions=123, usedHeap=4097, maxHeap=8175) for region 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.

//Received the split message again, and the RIT entry was deleted, so the
closed event could not be processed (see the sketch at the end of this message).

2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467

//The repeated split message overwrote the region state:
2011-05-08 17:46:45,303 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Overwriting 4418fb197685a21f77e151e401cf8b66 on 
serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
usedHeap=4097, maxHeap=8175)
2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
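
Given that diagnosis, the master-side fix direction is to make REGION_SPLIT
handling idempotent, so that a resent split message cannot clobber a daughter
region's PENDING_CLOSE state. A minimal sketch of such a guard follows; the
names are hypothetical, and this is not the attached AssignmentManager patch:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SplitDedupSketch {
  // Parents whose split has already been processed.
  private final Map<String, Boolean> processedSplits =
      new ConcurrentHashMap<String, Boolean>();
  // Region -> state, e.g. "PENDING_CLOSE" or "OFFLINE".
  private final Map<String, String> regionsInTransition =
      new ConcurrentHashMap<String, String>();

  // Handle REGION_SPLIT at most once per parent region; a duplicate report
  // must not overwrite an in-flight state such as PENDING_CLOSE on a daughter.
  void onRegionSplit(String parent, String daughterA, String daughterB) {
    if (processedSplits.putIfAbsent(parent, Boolean.TRUE) != null) {
      return;  // duplicate split message: already handled, ignore it
    }
    // First delivery only: mark the daughters for assignment unless they are
    // already in transition (e.g. the master is closing one of them).
    regionsInTransition.putIfAbsent(daughterA, "OFFLINE");
    regionsInTransition.putIfAbsent(daughterB, "OFFLINE");
  }
}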

[jira] [Updated] (HBASE-3892) Table can't disable

2011-05-31 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-3892:
--

Attachment: AssignmentManager_90v2.patch

 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90.patch, 
 AssignmentManager_90v2.patch, logs.rar


 In TimeoutMonitor:
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should resend the ZK message when the region close times out,
 because in this case some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // Disable table: the master sent a CLOSE message to the region server, and the
 region state was set to PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // Received the splitting message and cleared the region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 
