[jira] [Commented] (HBASE-4120) isolation and allocation

2012-07-29 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424501#comment-13424501
 ] 

gaojinchao commented on HBASE-4120:
---

Hi Liu Jia,
Are you working on this issue now? When do you plan to finish it?

 isolation and allocation
 

 Key: HBASE-4120
 URL: https://issues.apache.org/jira/browse/HBASE-4120
 Project: HBase
  Issue Type: New Feature
  Components: master, regionserver
Affects Versions: 0.90.2, 0.90.3, 0.90.4, 0.92.0
Reporter: Liu Jia
Assignee: Liu Jia
 Fix For: 0.96.0

 Attachments: Design_document_for_HBase_isolation_and_allocation.pdf, 
 Design_document_for_HBase_isolation_and_allocation_Revised.pdf, 
 HBase_isolation_and_allocation_user_guide.pdf, 
 Performance_of_Table_priority.pdf, 
 Simple_YCSB_Tests_For_TablePriority_Trunk_and_0.90.4.pdf, System 
 Structure.jpg, TablePriority.patch, TablePriority_v12.patch, 
 TablePriority_v12.patch, TablePriority_v15_with_coprocessor.patch, 
 TablePriority_v16_with_coprocessor.patch, TablePriority_v17.patch, 
 TablePriority_v17.patch, TablePriority_v8.patch, TablePriority_v8.patch, 
 TablePriority_v8_for_trunk.patch, TablePrioriy_v9.patch


 The HBase isolation and allocation tool is designed to help users manage 
 cluster resources among different applications and tables.
 When we run a large-scale HBase cluster with many applications on it, there 
 will be lots of problems. At Taobao there is a cluster on which many 
 departments test the performance of their HBase-based applications. On this 
 cluster of 12 servers, only one application can run exclusively at a time, 
 and all the other applications must wait until the previous test has 
 finished.
 After we added the allocation management function to the cluster, 
 applications can share the cluster and run concurrently. If a test engineer 
 wants to make sure there is no interference, he/she can move the other 
 tables out of the group.
 Within a group we use table priority to allocate resources: when the system 
 is busy, we can make sure high-priority tables are not affected by 
 lower-priority tables.
 Different groups can have different region server configurations: groups 
 optimized for reading can have a large block cache, and groups optimized for 
 writing can have a large memstore.
 Tables and region servers can be moved easily between groups, and after 
 changing the configuration, a group can be restarted alone instead of 
 restarting the whole cluster.
 Git repository: https://github.com/ICT-Ope/HBase_allocation
 We hope our work is helpful.
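As a rough illustration of the per-group tuning described above (the group 
mechanism itself comes from the attached patches and is not shown here; the 
two keys below are stock HBase settings, and the values are only 
illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Read-optimized group: large block cache, smaller global memstore.
Configuration readGroup = HBaseConfiguration.create();
readGroup.setFloat("hfile.block.cache.size", 0.40f);
readGroup.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.25f);

// Write-optimized group: small block cache, larger global memstore.
Configuration writeGroup = HBaseConfiguration.create();
writeGroup.setFloat("hfile.block.cache.size", 0.10f);
writeGroup.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.45f);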

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4246) Cluster with too many regions cannot withstand some master failover scenarios

2012-06-20 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398201#comment-13398201
 ] 

gaojinchao commented on HBASE-4246:
---

The version is 0.90.x. I have asked the customer to raise jute.maxbuffer to 64 MB. 
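For reference, jute.maxbuffer is a plain JVM system property read by the 
ZooKeeper client, so it has to be in effect on the master's JVM before the 
client connects (typically via -Djute.maxbuffer=67108864 in HBASE_OPTS); a 
minimal sketch of the programmatic equivalent:

// Raise the ZooKeeper client packet limit to 64 MB. Must run before the
// ZooKeeper client classes are initialized.
System.setProperty("jute.maxbuffer", String.valueOf(64 * 1024 * 1024));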

 Cluster with too many regions cannot withstand some master failover scenarios
 -

 Key: HBASE-4246
 URL: https://issues.apache.org/jira/browse/HBASE-4246
 Project: HBase
  Issue Type: Bug
  Components: master, zookeeper
Affects Versions: 0.90.4
Reporter: Todd Lipcon
Priority: Critical
 Fix For: 0.96.0


 We ran into the following sequence of events:
 - master startup failed after only ROOT had been assigned (for another reason)
 - restarted the master without restarting other servers. Since there was at 
 least one region assigned, it went through the failover code path
 - master scanned META and inserted every region into /hbase/unassigned in ZK.
 - then, it called listChildren on the /hbase/unassigned znode, and crashed 
 with "Packet len6080218 is out of range!" since the IPC response was larger 
 than the default maximum.





[jira] [Commented] (HBASE-4246) Cluster with too many regions cannot withstand some master failover scenarios

2012-06-19 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397214#comment-13397214
 ] 

gaojinchao commented on HBASE-4246:
---

Hi, it also happened in our cluster when we restarted the whole cluster (it 
has 129,723 regions).

2012-06-19 19:29:00,961 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:2-0x137ed2eb936fb85 Creating (or updating) unassigned node for 
80400ccd4a1f3438cc23774ca8a88d17 with OFFLINE state
2012-06-19 19:29:00,965 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=M_ZK_REGION_OFFLINE, server=172-16-6-2:2, 
region=80400ccd4a1f3438cc23774ca8a88d17
2012-06-19 19:29:00,966 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:2-0x137ed2eb936fb85 Creating (or updating) unassigned node for 
7f1a56641906ae0a6cc6919bd927df76 with OFFLINE state
2012-06-19 19:29:00,969 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=M_ZK_REGION_OFFLINE, server=172-16-6-2:2, 
region=7f1a56641906ae0a6cc6919bd927df76
2012-06-19 19:29:01,070 WARN org.apache.zookeeper.ClientCnxn: Session 
0x137ed2eb936fb85 for server 172-16-6-1/172.16.6.1:2181, unexpected error, 
closing socket connection and attempting reconnect
2012-06-19 19:29:01,070 WARN org.apache.zookeeper.ClientCnxn: Session 
0x137ed2eb936fb85 for server 172-16-6-1/172.16.6.1:2181, unexpected error, 
closing socket connection and attempting reconnect
java.io.IOException: Packet len4670048 is out of range!
at 
org.apache.zookeeper.ClientCnxn$SendThread.readLength(ClientCnxn.java:721)
at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:880)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145)
2012-06-19 19:29:01,174 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:2-0x137ed2eb936fb85 Unable to list children of znode 
/hbase/unassigned 

 Cluster with too many regions cannot withstand some master failover scenarios
 -

 Key: HBASE-4246
 URL: https://issues.apache.org/jira/browse/HBASE-4246
 Project: HBase
  Issue Type: Bug
  Components: master, zookeeper
Affects Versions: 0.90.4
Reporter: Todd Lipcon
Priority: Critical
 Fix For: 0.96.0


 We ran into the following sequence of events:
 - master startup failed after only ROOT had been assigned (for another reason)
 - restarted the master without restarting other servers. Since there was at 
 least one region assigned, it went through the failover code path
 - master scanned META and inserted every region into /hbase/unassigned in ZK.
 - then, it called listChildren on the /hbase/unassigned znode, and crashed 
 with "Packet len6080218 is out of range!" since the IPC response was larger 
 than the default maximum.





[jira] [Commented] (HBASE-6055) Snapshots in HBase 0.96

2012-06-06 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290027#comment-13290027
 ] 

gaojinchao commented on HBASE-6055:
---

Fine, thanks. I will set aside some time for this feature.

 Snapshots in HBase 0.96
 ---

 Key: HBASE-6055
 URL: https://issues.apache.org/jira/browse/HBASE-6055
 Project: HBase
  Issue Type: New Feature
  Components: client, master, regionserver, zookeeper
Reporter: Jesse Yates
Assignee: Jesse Yates
 Fix For: 0.96.0

 Attachments: Snapshots in HBase.docx


 Continuation of HBASE-50 for the current trunk. Since the implementation has 
 drastically changed, opening as a new ticket.





[jira] [Commented] (HBASE-6055) Snapshots in HBase 0.96

2012-06-05 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289900#comment-13289900
 ] 

gaojinchao commented on HBASE-6055:
---

Hi Jesse,
I am considering a solution that doesn't use the HLog. The idea is to handle 
only the memstore and flush it to HFiles asynchronously; when the region 
server goes down, we can finish producing the HFiles by replaying the edit 
log. Do you think this is feasible?
If we can do this, there are two relatively large benefits:
1. Restoring the snapshot is easier.
2. We can achieve incremental backups at the HFile level.
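For what it's worth, a rough sketch of the flow proposed above, in 
hypothetical Java (asyncFlushMemstoreToHFile, getStoreFilePaths and 
createReference are invented names for illustration; none of this is real 
HBase API):

// Hypothetical sketch only -- these methods do not exist in HBase; they
// just illustrate snapshotting from the memstore without copying the HLog.
Path snapshotRegion(HRegion region, Path snapshotDir) throws Exception {
  // 1. Kick off an asynchronous flush of the current memstore contents
  //    into a new HFile under the snapshot directory.
  Future<Path> flushing = region.asyncFlushMemstoreToHFile(snapshotDir);
  // 2. Reference the region's existing immutable HFiles; nothing is copied.
  for (Path hfile : region.getStoreFilePaths()) {
    createReference(snapshotDir, hfile);
  }
  // 3. If the region server dies before the flush completes, recovery
  //    replays the edit log and redoes the flush, so the snapshot can
  //    still be finished -- which is what makes restore and incremental
  //    backup simpler.
  return flushing.get();
}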

 Snapshots in HBase 0.96
 ---

 Key: HBASE-6055
 URL: https://issues.apache.org/jira/browse/HBASE-6055
 Project: HBase
  Issue Type: New Feature
  Components: client, master, regionserver, zookeeper
Reporter: Jesse Yates
Assignee: Jesse Yates
 Fix For: 0.96.0

 Attachments: Snapshots in HBase.docx


 Continuation of HBASE-50 for the current trunk. Since the implementation has 
 drastically changed, opening as a new ticket.





[jira] [Commented] (HBASE-6055) Snapshots in HBase 0.96

2012-05-28 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284366#comment-13284366
 ] 

gaojinchao commented on HBASE-6055:
---

Hi Jesse, are you working on this feature? I am interested in it and will 
study your code.
One question: when we are creating snapshots, do we need to stop the 
balancer?

 Snapshots in HBase 0.96
 ---

 Key: HBASE-6055
 URL: https://issues.apache.org/jira/browse/HBASE-6055
 Project: HBase
  Issue Type: New Feature
  Components: client, master, regionserver, zookeeper
Reporter: Jesse Yates
Assignee: Jesse Yates
 Fix For: 0.96.0

 Attachments: Snapshots in HBase.docx


 Continuation of HBASE-50 for the current trunk. Since the implementation has 
 drastically changed, opening as a new ticket.





[jira] [Commented] (HBASE-6055) Snapshots in HBase 0.96

2012-05-25 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283894#comment-13283894
 ] 

gaojinchao commented on HBASE-6055:
---

This is a very useful feature. :0


 Snapshots in HBase 0.96
 ---

 Key: HBASE-6055
 URL: https://issues.apache.org/jira/browse/HBASE-6055
 Project: HBase
  Issue Type: New Feature
  Components: client, master, regionserver, zookeeper
Reporter: Jesse Yates
Assignee: Jesse Yates
 Fix For: 0.96.0

 Attachments: Snapshots in HBase.docx


 Continuation of HBASE-50 for the current trunk. Since the implementation has 
 drastically changed, opening as a new ticket.





[jira] [Commented] (HBASE-5546) Master assigns region in the original region server when opening region failed

2012-05-10 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272971#comment-13272971
 ] 

gaojinchao commented on HBASE-5546:
---

+1, Good job!
 

 Master assigns region in the original region server when opening region 
 failed  
 

 Key: HBASE-5546
 URL: https://issues.apache.org/jira/browse/HBASE-5546
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.92.0
Reporter: gaojinchao
Assignee: Ashutosh Jindal
Priority: Minor
 Fix For: 0.96.0

 Attachments: hbase-5546.patch, hbase-5546_1.patch


 Master assigns the region to the original region server when an 
 RS_ZK_REGION_FAILED_OPEN event arrives.
 Maybe we should choose another region server.
 [2012-03-07 10:14:21,750] [DEBUG] [main-EventThread] 
 [org.apache.hadoop.hbase.master.AssignmentManager 553] Handling 
 transition=RS_ZK_REGION_FAILED_OPEN, server=158-1-130-11,20020,1331108408232, 
 region=c70e98bdca98a0657a56436741523053
 [... the identical RS_ZK_REGION_FAILED_OPEN transition repeats about every 
 10 seconds ...]
 [2012-03-07 10:16:22,676] [DEBUG] [main-EventThread] 
 [org.apache.hadoop.hbase.master.AssignmentManager 553] Handling 
 transition=RS_ZK_REGION_FAILED_OPEN, server=158-1-130-11,20020,1331108408232, 
 region=c70e98bdca98a0657a56436741523053





[jira] [Commented] (HBASE-4340) Hbase can't balance if ServerShutdownHandler encountered exception

2011-09-10 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102203#comment-13102203
 ] 

gaojinchao commented on HBASE-4340:
---

Thanks for your work, Ted.
I will put the patch up for review and then make a trunk patch. Running all 
the test cases takes two hours. :)

 Hbase can't balance if ServerShutdownHandler encountered exception
 --

 Key: HBASE-4340
 URL: https://issues.apache.org/jira/browse/HBASE-4340
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4340_branch90.patch


 Version: 0.90.4
 Cluster: 40 boxes
 As the logs below show, the balancer couldn't run because of a dead RS.
 I dug deeply and found two issues:
 1.   ServerShutdownHandler didn't clear numProcessing when handling some 
 exceptions. It seems that whatever the exception, we should either clear the 
 flag or shut down the master.
 2.   "dead regionserver(s): [158-1-130-12,20020,1314971097929]" is 
 inaccurate. The dead server should be 158-1-130-10,20020,1315068597979.
 //master logs:
 2011-09-05 00:28:00,487 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
 running balancer because processing dead regionserver(s): 
 [158-1-130-12,20020,1314971097929]
 [... the identical message repeats every 5 minutes; the mail is truncated 
 partway through the 2011-09-05 02:13:00,539 entry ...]

[jira] [Commented] (HBASE-4340) Hbase can't balance.

2011-09-09 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101180#comment-13101180
 ] 

gaojinchao commented on HBASE-4340:
---

Yes, all test cases have passed.

 Hbase can't balance.
 

 Key: HBASE-4340
 URL: https://issues.apache.org/jira/browse/HBASE-4340
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4340_branch90.patch


 Version: 0.90.4
 Cluster: 40 boxes
 As the logs below show, the balancer couldn't run because of a dead RS.
 I dug deeply and found two issues:
 1.   ServerShutdownHandler didn't clear numProcessing when handling some 
 exceptions. It seems that whatever the exception, we should either clear the 
 flag or shut down the master.
 2.   "dead regionserver(s): [158-1-130-12,20020,1314971097929]" is 
 inaccurate. The dead server should be 158-1-130-10,20020,1315068597979.
 //master logs:
 2011-09-05 00:28:00,487 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
 running balancer because processing dead regionserver(s): 
 [158-1-130-12,20020,1314971097929]
 [... the identical message repeats every 5 minutes; the mail is truncated 
 partway through the 2011-09-05 02:18:00,543 entry ...]

[jira] [Commented] (HBASE-4212) TestMasterFailover fails occasionally

2011-09-09 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101213#comment-13101213
 ] 

gaojinchao commented on HBASE-4212:
---

@Stack, thanks for your review. 
In our environment it often fails, so we skip this case (in our setup, all 
test cases are run automatically every day). 

The steps for opening the root region:
Step A: Master tells the region server to open the root region.
Step B: Region server opens the root region and sets the zk node 
(rootServerZNode). Once this is done, the catalog tracker can work.
Step C: Region server updates the zk node (assignmentZNode) to tell the 
master that root has opened (this step may fail, but we have already 
advertised that root can be used).
Step D: Master deletes the zk node (assignmentZNode) and adds the root 
region to the online set.

In my case, the master skipped step D because the transition was delayed: 
the master forced the root region online in processFailover, so the zk node 
was never deleted and the failover test case failed.

finishInitialization code:
// Make sure root and meta assigned before proceeding.
assignRootAndMeta();

// Is this fresh start with no regions assigned or are we a master joining
// an already-running cluster?  If regionsCount == 0, then for sure a
// fresh start.  TODO: Be fancier.  If regionsCount == 2, perhaps the
// 2 are .META. and -ROOT- and we should fall into the fresh startup
// branch below.  For now, do processFailover.
if (regionCount == 0) {
  LOG.info("Master startup proceeding: cluster startup");
  this.assignmentManager.cleanoutUnassigned();
  this.assignmentManager.assignAllUserRegions();
} else {
  LOG.info("Master startup proceeding: master failover");
  this.assignmentManager.processFailover();
}

processFailover code:
HServerInfo hsi =
  this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
regionOnline(HRegionInfo.FIRST_META_REGIONINFO, hsi);
hsi = 
this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
regionOnline(HRegionInfo.ROOT_REGIONINFO, hsi);


 TestMasterFailover fails occasionally
 -

 Key: HBASE-4212
 URL: https://issues.apache.org/jira/browse/HBASE-4212
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4212_TrunkV1.patch, HBASE-4212_branch90V1.patch


 It seems to be a bug: the root region in RIT can't be moved.
 In the failover process, the master enforces root online but does not clean 
 up the zk node, so the test will wait forever.
   void processFailover() throws KeeperException, IOException, 
   InterruptedException {
     ...
     // we enforce on-line root.
     HServerInfo hsi = this.serverManager.getHServerInfo(
         this.catalogTracker.getMetaLocation());
     regionOnline(HRegionInfo.FIRST_META_REGIONINFO, hsi);
     hsi = this.serverManager.getHServerInfo(
         this.catalogTracker.getRootLocation());
     regionOnline(HRegionInfo.ROOT_REGIONINFO, hsi);
 It seems that we should wait for the assignment to finish, as is done for 
 the meta region:
   int assignRootAndMeta()
       throws InterruptedException, IOException, KeeperException {
     int assigned = 0;
     long timeout =
         this.conf.getLong("hbase.catalog.verification.timeout", 1000);
     // Work on ROOT region.  Is it in zk in transition?
     boolean rit = this.assignmentManager.
         processRegionInTransitionAndBlockUntilAssigned(
             HRegionInfo.ROOT_REGIONINFO);
     if (!catalogTracker.verifyRootRegionLocation(timeout)) {
       this.assignmentManager.assignRoot();
       this.catalogTracker.waitForRoot();
       // we need to add this code to guarantee the transition has completed
       this.assignmentManager.waitForAssignment(HRegionInfo.ROOT_REGIONINFO);
       assigned++;
     }
 logs:
 2011-08-16 07:45:40,715 DEBUG 
 [RegionServer:0;C4S2.site,47710,1313495126115-EventThread] 
 zookeeper.ZooKeeperWatcher(252): regionserver:47710-0x131d2690f780004 
 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, 
 path=/hbase/unassigned/70236052
 2011-08-16 07:45:40,715 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] 
 zookeeper.ZKAssign(712): regionserver:47710-0x131d2690f780004 Successfully 
 transitioned node 70236052 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
 2011-08-16 07:45:40,715 DEBUG [Thread-760-EventThread] 
 zookeeper.ZooKeeperWatcher(252): master:60701-0x131d2690f780009 Received 
 ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, 
 path=/hbase/unassigned/70236052
 2011-08-16 07:45:40,716 INFO  [PostOpenDeployTasks:70236052] 
 catalog.RootLocationEditor(62): Setting ROOT region location in ZooKeeper as 
 C4S2.site:47710
 2011-08-16 07:45:40,716 DEBUG [Thread-760-EventThread] 
 zookeeper.ZKUtil(1109): master:60701-0x131d2690f780009 Retrieved 

[jira] [Updated] (HBASE-4340) Hbase can't balance.

2011-09-08 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4340:
--

Attachment: HBASE-4340_branch90.patch

 Hbase can't balance.
 

 Key: HBASE-4340
 URL: https://issues.apache.org/jira/browse/HBASE-4340
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4340_branch90.patch


 Version: 0.90.4
 Cluster: 40 boxes
 As the logs below show, the balancer couldn't run because of a dead RS.
 I dug deeply and found two issues:
 1.   ServerShutdownHandler didn't clear numProcessing when handling some 
 exceptions. It seems that whatever the exception, we should either clear the 
 flag or shut down the master.
 2.   "dead regionserver(s): [158-1-130-12,20020,1314971097929]" is 
 inaccurate. The dead server should be 158-1-130-10,20020,1315068597979.
 //master logs:
 2011-09-05 00:28:00,487 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
 running balancer because processing dead regionserver(s): 
 [158-1-130-12,20020,1314971097929]
 [... the identical message repeats every 5 minutes; the mail is truncated 
 partway through the 2011-09-05 02:18:00,543 entry ...]

[jira] [Commented] (HBASE-4340) Hbase can't balance.

2011-09-08 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100207#comment-13100207
 ] 

gaojinchao commented on HBASE-4340:
---

I have made a patch; please review.
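(Not the actual attached patch -- just a minimal sketch of the shape such a 
fix typically takes, with hypothetical names: release the numProcessing 
bookkeeping on every exit path of ServerShutdownHandler.)

// Hypothetical sketch; method and field names are illustrative and not
// copied from HBASE-4340_branch90.patch.
@Override
public void process() throws IOException {
  try {
    // Split logs and reassign the dead server's regions; several of
    // these steps can throw.
    handleDeadServer();
  } finally {
    // Clear the flag no matter what happened above; otherwise one
    // exception leaves numProcessing non-zero and the balancer never
    // runs again.
    serverManager.finishProcessingDeadServer(serverName);
  }
}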

 Hbase can't balance.
 

 Key: HBASE-4340
 URL: https://issues.apache.org/jira/browse/HBASE-4340
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4340_branch90.patch


 Version: 0.90.4
 Cluster: 40 boxes
 As the logs below show, the balancer couldn't run because of a dead RS.
 I dug deeply and found two issues:
 1.   ServerShutdownHandler didn't clear numProcessing when handling some 
 exceptions. It seems that whatever the exception, we should either clear the 
 flag or shut down the master.
 2.   "dead regionserver(s): [158-1-130-12,20020,1314971097929]" is 
 inaccurate. The dead server should be 158-1-130-10,20020,1315068597979.
 //master logs:
 2011-09-05 00:28:00,487 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
 running balancer because processing dead regionserver(s): 
 [158-1-130-12,20020,1314971097929]
 [... the identical message repeats every 5 minutes; the mail is truncated 
 partway through the 2011-09-05 02:18:00,543 entry ...]

[jira] [Updated] (HBASE-4340) Hbase can't balance.

2011-09-08 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4340:
--

Status: Patch Available  (was: Open)

 Hbase can't balance.
 

 Key: HBASE-4340
 URL: https://issues.apache.org/jira/browse/HBASE-4340
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4340_branch90.patch


 Version: 0.90.4
 Cluster: 40 boxes
 As the logs below show, the balancer couldn't run because of a dead RS.
 I dug deeply and found two issues:
 1.   ServerShutdownHandler didn't clear numProcessing when handling some 
 exceptions. It seems that whatever the exception, we should either clear the 
 flag or shut down the master.
 2.   "dead regionserver(s): [158-1-130-12,20020,1314971097929]" is 
 inaccurate. The dead server should be 158-1-130-10,20020,1315068597979.
 //master logs:
 2011-09-05 00:28:00,487 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
 running balancer because processing dead regionserver(s): 
 [158-1-130-12,20020,1314971097929]
 [... the identical message repeats every 5 minutes; the mail is truncated 
 partway through the 2011-09-05 02:18:00,543 entry ...]

[jira] [Commented] (HBASE-2158) Change how high/low global limit works; start taking on writes as soon as we dip below high limit rather than block until low limit as we currently do.

2011-09-07 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098645#comment-13098645
 ] 

gaojinchao commented on HBASE-2158:
---

When memory reaches the low limit, we start an emergency flusher, so I think 
it is difficult to reach the high limit; and if we do reach it, we will flush 
regions one by one.

if (fqe == null || fqe instanceof WakeupFlushThread) {
  if (isAboveLowWaterMark()) {
    LOG.info("Flush thread woke up with memory above low water.");
    if (!flushOneForGlobalPressure()) {
      // Wasn't able to flush any region, but we're above low water mark
      // This is unlikely to happen, but might happen when closing the
      // entire server - another thread is flushing regions. We'll just
      // sleep a little bit to avoid spinning, and then pretend that
      // we flushed one, so anyone blocked will check again
      lock.lock();
      try {
        Thread.sleep(1000);
        flushOccurred.signalAll();
      } finally {
        lock.unlock();
      }
    }
    // Enqueue another one of these tokens so we'll wake up again
    wakeupFlushThread();
  }
  continue;
}
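For context, a sketch of where the two watermarks come from in a 0.90-era 
setup (the configuration keys and defaults are the stock ones; the exact 
field names inside MemStoreFlusher may differ):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Global memstore watermarks as fractions of the region server heap.
Configuration conf = HBaseConfiguration.create();
long maxHeap = Runtime.getRuntime().maxMemory();
long highWater = (long) (maxHeap *
    conf.getFloat("hbase.regionserver.global.memstore.upperLimit", 0.4f));
long lowWater = (long) (maxHeap *
    conf.getFloat("hbase.regionserver.global.memstore.lowerLimit", 0.35f));
// Writes block once usage passes highWater; the flush thread keeps
// flushing until usage drops back under lowWater -- the behavior this
// issue proposes to relax.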

 Change how high/low global limit works; start taking on writes as soon as we 
 dip below high limit rather than block until low limit as we currently do.
 ---

 Key: HBASE-2158
 URL: https://issues.apache.org/jira/browse/HBASE-2158
 Project: HBase
  Issue Type: Improvement
Reporter: stack

 A Ryan Rawson suggestion.  See HBASE-2149 for more context.





[jira] [Created] (HBASE-4340) Hbase can't balance.

2011-09-06 Thread gaojinchao (JIRA)
Hbase can't balance.


 Key: HBASE-4340
 URL: https://issues.apache.org/jira/browse/HBASE-4340
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: gaojinchao
 Fix For: 0.90.5


Version: 0.90.4
Cluster: 40 boxes
As the logs below show, the balancer couldn't run because of a dead RS.
I dug deeply and found two issues:

1.   ServerShutdownHandler didn't clear numProcessing when handling some 
exceptions. It seems that whatever the exception, we should either clear the 
flag or shut down the master.

2.   "dead regionserver(s): [158-1-130-12,20020,1314971097929]" is 
inaccurate. The dead server should be 158-1-130-10,20020,1315068597979.

//master logs:
2011-09-05 00:28:00,487 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
running balancer because processing dead regionserver(s): 
[158-1-130-12,20020,1314971097929]
[... the identical message repeats every 5 minutes ...]
2011-09-05 02:18:00,543 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
running balancer because processing dead regionserver(s): 
[158-1-130-12,20020,1314971097929]

// the exception logs:
2011-09-03 18:13:27,550 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=RS_ZK_REGION_OPENING, 
server=158-1-133-11,20020,1315069437236, region=0db4088d75c58dd22f93f389d90ba6cc
2011-09-03 18:13:27,550 ERROR org.apache.hadoop.hbase.executor.EventHandler: 
Caught throwable while processing event 

[jira] [Assigned] (HBASE-4340) Hbase can't balance.

2011-09-06 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao reassigned HBASE-4340:
-

Assignee: gaojinchao

 Hbase can't balance.
 

 Key: HBASE-4340
 URL: https://issues.apache.org/jira/browse/HBASE-4340
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5


 Version: 0.90.4
 Cluster: 40 boxes
 As the logs below show, the balancer couldn't run because of a dead RS.
 I dug deeply and found two issues:
 1.   ServerShutdownHandler didn't clear numProcessing when handling some 
 exceptions. It seems that whatever the exception, we should either clear the 
 flag or shut down the master.
 2.   "dead regionserver(s): [158-1-130-12,20020,1314971097929]" is 
 inaccurate. The dead server should be 158-1-130-10,20020,1315068597979.
 //master logs:
 2011-09-05 00:28:00,487 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
 running balancer because processing dead regionserver(s): 
 [158-1-130-12,20020,1314971097929]
 [... the identical message repeats every 5 minutes ...]
 2011-09-05 02:18:00,543 DEBUG org.apache.hadoop.hbase.master.HMaster: Not 
 running balancer because processing dead regionserver(s): 
 [158-1-130-12,20020,1314971097929]
 // the exception logs 

[jira] [Assigned] (HBASE-3521) region be merged with others automatically when all data in the region has expired and removed, or region gets too small.

2011-09-06 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao reassigned HBASE-3521:
-

Assignee: gaojinchao

 region be merged with others automatically when all data in the region has 
 expired and removed, or region gets too small.
 -

 Key: HBASE-3521
 URL: https://issues.apache.org/jira/browse/HBASE-3521
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver, scripts
Affects Versions: 0.90.0
Reporter: zhoushuaifeng
Assignee: gaojinchao
Priority: Minor

 We have tested a cluster with more than 30,000 regions; the max size of a 
 region is 512MB. At this point the data is no longer growing, but old data 
 is removed and new data inserted, so the regions become more and more 
 numerous, and some of them may be very small or empty. This occupies too 
 much heap, and gets worse if regions cannot be merged; it limits how long 
 HBase can keep running. 
 A script that surveys the table to remove empty regions, or picks out 
 adjacent small regions and merges them online, seems like it would be 
 useful (a rough sketch follows). 
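
A rough survey sketch under these assumptions: region sizes are read from the per-region directories under the table directory in HDFS, the root dir is the default /hbase, and the 64MB threshold is arbitrary; 0.90 has no public online-merge API, so this only reports merge candidates:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists region directories of a table whose on-disk size is below a
// threshold, as candidates for merging. Illustrative, not a tested tool.
public class SmallRegionSurvey {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path tableDir = new Path("/hbase/test1");   // assumed root dir and table
    long threshold = 64L * 1024 * 1024;         // flag regions under 64MB
    for (FileStatus region : fs.listStatus(tableDir)) {
      if (!region.isDir()) {
        continue;                               // skip table-level files
      }
      long size = fs.getContentSummary(region.getPath()).getLength();
      if (size < threshold) {
        System.out.println(region.getPath().getName() + " " + size + " bytes");
      }
    }
  }
}
{code}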

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3521) region be merged with others automatically when all data in the region has expired and removed, or region gets too small.

2011-09-06 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13098560#comment-13098560
 ] 

gaojinchao commented on HBASE-3521:
---

Thanks a lot. This is what I wanted. :)

 region be merged with others automatically when all data in the region has 
 expired and removed, or region gets too small.
 -

 Key: HBASE-3521
 URL: https://issues.apache.org/jira/browse/HBASE-3521
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver, scripts
Affects Versions: 0.90.0
Reporter: zhoushuaifeng
Assignee: gaojinchao
Priority: Minor

 We have tested a cluster with more than 30,000 regions; the max size of a 
 region is 512MB. At this point the data is no longer growing, but old data 
 is removed and new data inserted, so the regions become more and more 
 numerous, and some of them may be very small or empty. This occupies too 
 much heap, and gets worse if regions cannot be merged; it limits how long 
 HBase can keep running. 
 A script that surveys the table to remove empty regions, or picks out 
 adjacent small regions and merges them online, seems like it would be 
 useful. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-2158) Change how high/low global limit works; start taking on writes as soon as we dip below high limit rather than block until low limit as we currently do.

2011-09-06 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13098586#comment-13098586
 ] 

gaojinchao commented on HBASE-2158:
---

I agree, this issue should be closed.

 Change how high/low global limit works; start taking on writes as soon as we 
 dip below high limit rather than block until low limit as we currently do.
 ---

 Key: HBASE-2158
 URL: https://issues.apache.org/jira/browse/HBASE-2158
 Project: HBase
  Issue Type: Improvement
Reporter: stack

 A Ryan Rawson suggestion.  See HBASE-2149 for more context.
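
The two policies, sketched side by side (the watermark values are made up; this is not the actual MemStoreFlusher code):

{code}
// Illustrative comparison of the two unblocking policies for global
// memstore pressure.
class WatermarkSketch {
  final long high = 900;  // stand-in for the upper global memstore limit
  final long low = 700;   // stand-in for the lower global memstore limit
  boolean blocked = false;

  // Current behavior: once blocked at the high limit, writes stay blocked
  // until usage falls all the way below the LOW limit.
  boolean blockWritesCurrent(long used) {
    if (used >= high) {
      blocked = true;
    } else if (used < low) {
      blocked = false;
    }
    return blocked;
  }

  // Suggested behavior: start taking writes again as soon as usage dips
  // below the HIGH limit.
  boolean blockWritesProposed(long used) {
    return used >= high;
  }
}
{code}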

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4132) Extend the WALActionsListener API to accomodate log archival

2011-08-29 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13093403#comment-13093403
 ] 

gaojinchao commented on HBASE-4132:
---

Could we add the following API?
  
 /**
   * The WAL needs to be archived. It is going to be moved from oldPath to
   * newPath.
   * 
   * @param oldPath
   *  the path to the old hlog
   * @param newPath
   *  the path to the new hlog
   * @return true if default behavior should be bypassed, false otherwise
   */
  boolean preArchiveLog(Path oldPath, Path newPath) throws IOException;

  /**
   * The WAL has been archived. It is moved from oldPath to newPath.
   * 
   * @param oldPath
   *  the path to the old hlog
   * @param newPath
   *  the path to the new hlog
   * @param archivalWasSuccessful
   *  true, if the archival was successful
   */
  void postArchiveLog(Path oldPath, Path newPath,
  boolean archivalWasSuccessful) throws IOException;
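
If the hooks were added, a listener might use them like this (a sketch against the proposed signatures only; the other WALActionsListener methods are omitted, and the class name is made up):

{code}
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.Path;

// Sketch of just the two proposed callbacks.
class ArchiveAuditListener {
  private static final Log LOG = LogFactory.getLog(ArchiveAuditListener.class);

  // Returning false keeps the default archival behavior.
  public boolean preArchiveLog(Path oldPath, Path newPath) throws IOException {
    LOG.info("about to archive " + oldPath + " to " + newPath);
    return false;
  }

  public void postArchiveLog(Path oldPath, Path newPath,
      boolean archivalWasSuccessful) throws IOException {
    LOG.info("archival of " + oldPath + " to " + newPath
        + (archivalWasSuccessful ? " succeeded" : " failed"));
  }
}
{code}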

 Extend the WALActionsListener API to accomodate log archival
 

 Key: HBASE-4132
 URL: https://issues.apache.org/jira/browse/HBASE-4132
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: dhruba borthakur
 Fix For: 0.92.0

 Attachments: walArchive.txt


 The WALObserver interface exposes the log roll events. It would be nice to 
 extend it to accommodate log archival events as well.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4124) ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'.

2011-08-28 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13092583#comment-13092583
 ] 

gaojinchao commented on HBASE-4124:
---

@Ted thanks for your work. 
sn has already been null-checked in the statement above:

if (sn == null) {
  LOG.warn("Region in transition " + regionInfo.getEncodedName() +
    " references a null server; letting RIT timeout so will be " +
    "assigned elsewhere");
  break;
}

 ZK restarted while a region is being assigned, new active HM re-assigns it 
 but the RS warns 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: 4124-trunk.v2, HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, 
 HBASE-4124_Branch90V4.patch, HBASE-4124_TrunkV1.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'.

2011-08-28 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: HBASE-4124_TrunkV2.patch

I am running all the test cases. My new modification is clearer.

 ZK restarted while a region is being assigned, new active HM re-assigns it 
 but the RS warns 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, 
 HBASE-4124_Branch90V4.patch, HBASE-4124_TrunkV1.patch, 
 HBASE-4124_TrunkV2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4134) The total number of regions was more than the actual region count after the hbck fix

2011-08-28 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4134:
--

Fix Version/s: (was: 0.92.0)
   0.94.0

 The total number of regions was more than the actual region count after the 
 hbck fix
 

 Key: HBASE-4134
 URL: https://issues.apache.org/jira/browse/HBASE-4134
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: feng xu
 Fix For: 0.94.0


 1. I found the problem (some regions were multiply assigned) while running 
 hbck to check the cluster's health. Here's the result:
 {noformat}
 ERROR: Region test1,230778,1311216270050.fff783529fcd983043610eaa1cc5c2fe. is 
 listed in META on region server 158-1-91-101:20020 but is multiply assigned 
 to region servers 158-1-91-101:20020, 158-1-91-105:20020 
 ERROR: Region test1,252103,1311216293671.fff9ed2cb69bdce535451a07686c0db5. is 
 listed in META on region server 158-1-91-101:20020 but is multiply assigned 
 to region servers 158-1-91-101:20020, 158-1-91-105:20020 
 ERROR: Region test1,282187,1311216322104.52782c0241a598b3e37ca8729da0. is 
 listed in META on region server 158-1-91-103:20020 but is multiply assigned 
 to region servers 158-1-91-103:20020, 158-1-91-105:20020 
 Summary: 
   -ROOT- is okay. 
 Number of regions: 1 
 Deployed on: 158-1-91-105:20020 
   .META. is okay. 
 Number of regions: 1 
 Deployed on: 158-1-91-103:20020 
   test1 is okay. 
 Number of regions: 25297 
 Deployed on: 158-1-91-101:20020 158-1-91-103:20020 158-1-91-105:20020 
 14829 inconsistencies detected. 
 Status: INCONSISTENT 
 {noformat}
 2. Then I tried to use hbck -fix to fix the problem. Everything seemed ok, 
 but I found that the total number of regions reported by the load balancer 
 (35029) was more than the actual region count (25299) after the fix.
 Here's the related logs snippet:
 {noformat}
 2011-07-22 02:19:02,866 INFO org.apache.hadoop.hbase.master.LoadBalancer: 
 Skipping load balancing.  servers=3 regions=25299 average=8433.0 
 mostloaded=8433 
 2011-07-22 03:06:11,832 INFO org.apache.hadoop.hbase.master.LoadBalancer: 
 Skipping load balancing.  servers=3 regions=35029 average=11676.333 
 mostloaded=11677 leastloaded=11676
 {noformat}
 3. I tracked one region's behavior during that time, taking the region 
 test1,282187,1311216322104.52782c0241a598b3e37ca8729da0. as an example:
 (1) It was assigned to 158-1-91-101 at first. 
 (2) HBCK sent a closing request to the RegionServer, and the RegionServer 
 closed it silently without notifying the HMaster.
 (3) The region was still carried by RS 158-1-91-103 as far as the HMaster 
 knew.
 (4) HBCK then triggered a new assignment.
 In fact the region was assigned again, but the old assignment information 
 still remained in AM#regions and AM#servers.
 That's why the reported region count was larger than the actual number.  
 {noformat}
 Line 178967: 2011-07-22 02:47:51,247 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase/unassigned/52782c0241a598b3e37ca8729da0 
 (region=test1,282187,1311216322104.52782c0241a598b3e37ca8729da0., 
 server=HBCKServerName, state=M_ZK_REGION_OFFLINE)
 Line 178968: 2011-07-22 02:47:51,247 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling HBCK triggered 
 transition=M_ZK_REGION_OFFLINE, server=HBCKServerName, 
 region=52782c0241a598b3e37ca8729da0
 Line 178969: 2011-07-22 02:47:51,248 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: HBCK repair is triggering 
 assignment of 
 region=test1,282187,1311216322104.52782c0241a598b3e37ca8729da0.
 Line 178970: 2011-07-22 02:47:51,248 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan 
 was found (or we are ignoring an existing plan) for 
 test1,282187,1311216322104.52782c0241a598b3e37ca8729da0. so generated a 
 random one; hri=test1,282187,1311216322104.52782c0241a598b3e37ca8729da0., 
 src=, dest=158-1-91-101,20020,1311231878544; 3 (online=3, exclude=null) 
 available servers
 Line 178971: 2011-07-22 02:47:51,248 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 test1,282187,1311216322104.52782c0241a598b3e37ca8729da0. to 
 158-1-91-101,20020,1311231878544
 Line 178983: 2011-07-22 02:47:51,285 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=158-1-91-101,20020,1311231878544, 
 region=52782c0241a598b3e37ca8729da0
 Line 179001: 2011-07-22 02:47:51,318 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, server=158-1-91-101,20020,1311231878544, 
 region=52782c0241a598b3e37ca8729da0
 Line 179002: 2011-07-22 02:47:51,319 DEBUG 
 

[jira] [Commented] (HBASE-4124) ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'.

2011-08-28 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13092618#comment-13092618
 ] 

gaojinchao commented on HBASE-4124:
---

All test cases passed. Thanks.


 ZK restarted while a region is being assigned, new active HM re-assigns it 
 but the RS warns 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, 
 HBASE-4124_Branch90V4.patch, HBASE-4124_TrunkV1.patch, 
 HBASE-4124_TrunkV2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'.

2011-08-26 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: HBASE-4124_TrunkV1.patch

I have made a patch. I found two test cases (TestAdmin and RollLoging) that 
don't pass; they also fail on the raw trunk.

 ZK restarted while a region is being assigned, new active HM re-assigns it 
 but the RS warns 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, 
 HBASE-4124_Branch90V4.patch, HBASE-4124_TrunkV1.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-25 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: HBASE-4124_Branch90V4.patch

Modified the comments per the review.
Thanks to Ted for the careful review.

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, 
 HBASE-4124_Branch90V4.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-3845) data loss because lastSeqWritten can miss memstore edits

2011-08-25 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-3845:
--

Attachment: HBASE-3845_branch90V2.patch

Modified the code per the review.

 data loss because lastSeqWritten can miss memstore edits
 

 Key: HBASE-3845
 URL: https://issues.apache.org/jira/browse/HBASE-3845
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: Prakash Khemani
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-3845-data-loss-because-lastSeqWritten-can-miss.patch, 
 HBASE-3845-fix-TestResettingCounters-test.txt, HBASE-3845_1.patch, 
 HBASE-3845_2.patch, HBASE-3845_4.patch, HBASE-3845_5.patch, 
 HBASE-3845_6.patch, HBASE-3845__trunk.patch, HBASE-3845_branch90V1.patch, 
 HBASE-3845_branch90V2.patch, HBASE-3845_trunk_2.patch, 
 HBASE-3845_trunk_3.patch


 (I don't have a test case to prove this yet but I have run it by Dhruba and 
 Kannan internally and wanted to put this up for some feedback.)
 In this discussion let us assume that the region has only one column family. 
 That way I can use region/memstore interchangeably.
 After a memstore flush it is possible for lastSeqWritten to have a 
 log-sequence-id for a region that is not the earliest log-sequence-id for 
 that region's memstore.
 HLog.append() does a putIfAbsent into lastSequenceWritten. This is to ensure 
 that we only keep track  of the earliest log-sequence-number that is present 
 in the memstore.
 Every time the memstore is flushed we remove the region's entry in 
 lastSequenceWritten and wait for the next append to populate this entry 
 again. This is where the problem happens.
 step 1:
 flusher.prepare() snapshots the memstore under 
 HRegion.updatesLock.writeLock().
 step 2 :
 as soon as the updatesLock.writeLock() is released new entries will be added 
 into the memstore.
 step 3 :
 wal.completeCacheFlush() is called. This method removes the region's entry 
 from lastSeqWritten.
 step 4:
 the next append will create a new entry for the region in lastSeqWritten(). 
 But this will be the log seq id of the current append. All the edits that 
 were added in step 2 are missing.
 ==
 as a temporary measure, instead of removing the region's entry in step 3 I 
 will replace it with the log-seq-id of the region-flush-event.
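
The temporary measure in that last step, sketched (assuming lastSeqWritten is a ConcurrentSkipListMap keyed by encoded region name, as in HLog; illustrative, not the committed patch):

{code}
import java.util.concurrent.ConcurrentSkipListMap;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative sketch of the proposed completeCacheFlush() change: instead
// of removing the region's entry (which lets the step-2 edits lose their
// earliest sequence id), replace it with the seq id of the flush event.
class LastSeqSketch {
  final ConcurrentSkipListMap<byte[], Long> lastSeqWritten =
      new ConcurrentSkipListMap<byte[], Long>(Bytes.BYTES_COMPARATOR);

  void completeCacheFlush(byte[] encodedRegionName, long flushSeqId) {
    // Old behavior: lastSeqWritten.remove(encodedRegionName);
    lastSeqWritten.put(encodedRegionName, flushSeqId);
  }
}
{code}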

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Work started] (HBASE-3933) Hmaster throw NullPointerException

2011-08-25 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-3933 started by gaojinchao.

 Hmaster throw NullPointerException
 --

 Key: HBASE-3933
 URL: https://issues.apache.org/jira/browse/HBASE-3933
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Attachments: Hmastersetup0.90


 NullPointerException while the hmaster is starting.
 {code}
   java.lang.NullPointerException
 at java.util.TreeMap.getEntry(TreeMap.java:324)
 at java.util.TreeMap.get(TreeMap.java:255)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.addToServers(AssignmentManager.java:1512)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:606)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
 at 
 org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:402)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
 {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3933) Hmaster throw NullPointerException

2011-08-25 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090853#comment-13090853
 ] 

gaojinchao commented on HBASE-3933:
---

I studied the trunk; it has been fixed there, so we can close this issue.

Trunk code:
// Wait for region servers to report in.
this.serverManager.waitForRegionServers(status);
// Check zk for regionservers that are up but didn't register
for (ServerName sn: this.regionServerTracker.getOnlineServers()) {
  if (!this.serverManager.isServerOnline(sn)) {
// Not registered; add it.
LOG.info("Registering server found up in zk: " + sn);
this.serverManager.recordNewServer(sn, HServerLoad.EMPTY_HSERVERLOAD);
  }
}

 Hmaster throw NullPointerException
 --

 Key: HBASE-3933
 URL: https://issues.apache.org/jira/browse/HBASE-3933
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Attachments: Hmastersetup0.90


 NullPointerException while the hmaster is starting.
 {code}
   java.lang.NullPointerException
 at java.util.TreeMap.getEntry(TreeMap.java:324)
 at java.util.TreeMap.get(TreeMap.java:255)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.addToServers(AssignmentManager.java:1512)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:606)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
 at 
 org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:402)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
 {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3845) data loss because lastSeqWritten can miss memstore edits

2011-08-25 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090882#comment-13090882
 ] 

gaojinchao commented on HBASE-3845:
---

@Stack
Please review the patch and give some suggestions. :)

 data loss because lastSeqWritten can miss memstore edits
 

 Key: HBASE-3845
 URL: https://issues.apache.org/jira/browse/HBASE-3845
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: Prakash Khemani
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-3845-data-loss-because-lastSeqWritten-can-miss.patch, 
 HBASE-3845-fix-TestResettingCounters-test.txt, HBASE-3845_1.patch, 
 HBASE-3845_2.patch, HBASE-3845_4.patch, HBASE-3845_5.patch, 
 HBASE-3845_6.patch, HBASE-3845__trunk.patch, HBASE-3845_branch90V1.patch, 
 HBASE-3845_branch90V2.patch, HBASE-3845_trunk_2.patch, 
 HBASE-3845_trunk_3.patch


 (I don't have a test case to prove this yet but I have run it by Dhruba and 
 Kannan internally and wanted to put this up for some feedback.)
 In this discussion let us assume that the region has only one column family. 
 That way I can use region/memstore interchangeably.
 After a memstore flush it is possible for lastSeqWritten to have a 
 log-sequence-id for a region that is not the earliest log-sequence-id for 
 that region's memstore.
 HLog.append() does a putIfAbsent into lastSequenceWritten. This is to ensure 
 that we only keep track  of the earliest log-sequence-number that is present 
 in the memstore.
 Every time the memstore is flushed we remove the region's entry in 
 lastSequenceWritten and wait for the next append to populate this entry 
 again. This is where the problem happens.
 step 1:
 flusher.prepare() snapshots the memstore under 
 HRegion.updatesLock.writeLock().
 step 2 :
 as soon as the updatesLock.writeLock() is released new entries will be added 
 into the memstore.
 step 3 :
 wal.completeCacheFlush() is called. This method removes the region's entry 
 from lastSeqWritten.
 step 4:
 the next append will create a new entry for the region in lastSeqWritten(). 
 But this will be the log seq id of the current append. All the edits that 
 were added in step 2 are missing.
 ==
 as a temporary measure, instead of removing the region's entry in step 3 I 
 will replace it with the log-seq-id of the region-flush-event.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4124) ZK restarted while a region is being assigned, new active HM re-assigns it but the RS warns 'already online on this server'.

2011-08-25 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13091474#comment-13091474
 ] 

gaojinchao commented on HBASE-4124:
---

@Ted
I am making a patch for trunk, but I have some questions about the trunk 
code; it seems to be a bug.
In the assign function, when we get the return value ALREADY_OPENED, should 
we update the meta table, or should the region server do it?

hmaster code:
  RegionOpeningState regionOpenState = serverManager.sendRegionOpen(
      plan.getDestination(), state.getRegion());
if (regionOpenState == RegionOpeningState.ALREADY_OPENED) {

region server code (if we don't update meta, the client may keep going to 
the old server; a toy sketch of the suggested handling follows the snippet):

HRegion onlineRegion = this.getFromOnlineRegions(region.getEncodedName());
if (null != onlineRegion) {
  LOG.warn("Attempted open of " + region.getEncodedName()
      + " but already online on this server");
  return RegionOpeningState.ALREADY_OPENED;
}
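
A self-contained toy of the suggested master-side handling; every type here (RegionOpeningState, Catalog) is a stand-in, not the real HBase API:

{code}
// Toy illustration: when the region server answers ALREADY_OPENED, the
// master could refresh the catalog entry itself instead of leaving the
// stale location in .META. for clients to follow.
class AlreadyOpenedSketch {
  enum RegionOpeningState { OPENED, ALREADY_OPENED, FAILED_OPENING }

  interface Catalog {
    void updateLocation(String regionName, String server);
  }

  void onOpenResponse(RegionOpeningState state, String regionName,
      String destination, Catalog meta) {
    if (state == RegionOpeningState.ALREADY_OPENED) {
      // Without this, clients keep following the old location in .META.
      meta.updateLocation(regionName, destination);
    }
  }
}
{code}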

 ZK restarted while a region is being assigned, new active HM re-assigns it 
 but the RS warns 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, 
 HBASE-4124_Branch90V4.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-24 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: HBASE-4124_Branch90V3.patch

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-24 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090141#comment-13090141
 ] 

gaojinchao commented on HBASE-4124:
---

@Ted
Does trunk need a patch too? 
The trunk code has changed a lot; I need some time to study it.

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-24 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090677#comment-13090677
 ] 

gaojinchao commented on HBASE-4124:
---

@Ted 
I have run all the tests. Thanks for your work.

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-24 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090698#comment-13090698
 ] 

gaojinchao commented on HBASE-4124:
---

@ram
How come we have a dead RS if we don't kill the RS?

gao: If you stop the cluster, the meta still holds the old server information.

If the master is also killed, how can the regions be assigned to some other RS?

gao: When the master starts up, it collects the regions belonging to the same 
region server and calls sendRegionOpen(destination, regions). If the number 
of regions is relatively large, the region server needs a long time to open 
them; if the master crashes during that window, the new master may reopen 
the regions on another region server.

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, HBASE-4124_Branch90V3.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-3845) data loss because lastSeqWritten can miss memstore edits

2011-08-24 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-3845:
--

Attachment: HBASE-3845_branch90V1.patch

 data loss because lastSeqWritten can miss memstore edits
 

 Key: HBASE-3845
 URL: https://issues.apache.org/jira/browse/HBASE-3845
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: Prakash Khemani
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-3845-data-loss-because-lastSeqWritten-can-miss.patch, 
 HBASE-3845-fix-TestResettingCounters-test.txt, HBASE-3845_1.patch, 
 HBASE-3845_2.patch, HBASE-3845_4.patch, HBASE-3845_5.patch, 
 HBASE-3845_6.patch, HBASE-3845__trunk.patch, HBASE-3845_branch90V1.patch, 
 HBASE-3845_trunk_2.patch, HBASE-3845_trunk_3.patch


 (I don't have a test case to prove this yet but I have run it by Dhruba and 
 Kannan internally and wanted to put this up for some feedback.)
 In this discussion let us assume that the region has only one column family. 
 That way I can use region/memstore interchangeably.
 After a memstore flush it is possible for lastSeqWritten to have a 
 log-sequence-id for a region that is not the earliest log-sequence-id for 
 that region's memstore.
 HLog.append() does a putIfAbsent into lastSequenceWritten. This is to ensure 
 that we only keep track  of the earliest log-sequence-number that is present 
 in the memstore.
 Every time the memstore is flushed we remove the region's entry in 
 lastSequenceWritten and wait for the next append to populate this entry 
 again. This is where the problem happens.
 step 1:
 flusher.prepare() snapshots the memstore under 
 HRegion.updatesLock.writeLock().
 step 2 :
 as soon as the updatesLock.writeLock() is released new entries will be added 
 into the memstore.
 step 3 :
 wal.completeCacheFlush() is called. This method removes the region's entry 
 from lastSeqWritten.
 step 4:
 the next append will create a new entry for the region in lastSeqWritten(). 
 But this will be the log seq id of the current append. All the edits that 
 were added in step 2 are missing.
 ==
 as a temporary measure, instead of removing the region's entry in step 3 I 
 will replace it with the log-seq-id of the region-flush-event.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3845) data loss because lastSeqWritten can miss memstore edits

2011-08-24 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090740#comment-13090740
 ] 

gaojinchao commented on HBASE-3845:
---

@RAM
I have run all the unit tests; please help review the patch first. Thanks.

I will construct the scenario today to verify it.

 data loss because lastSeqWritten can miss memstore edits
 

 Key: HBASE-3845
 URL: https://issues.apache.org/jira/browse/HBASE-3845
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: Prakash Khemani
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-3845-data-loss-because-lastSeqWritten-can-miss.patch, 
 HBASE-3845-fix-TestResettingCounters-test.txt, HBASE-3845_1.patch, 
 HBASE-3845_2.patch, HBASE-3845_4.patch, HBASE-3845_5.patch, 
 HBASE-3845_6.patch, HBASE-3845__trunk.patch, HBASE-3845_branch90V1.patch, 
 HBASE-3845_trunk_2.patch, HBASE-3845_trunk_3.patch


 (I don't have a test case to prove this yet but I have run it by Dhruba and 
 Kannan internally and wanted to put this up for some feedback.)
 In this discussion let us assume that the region has only one column family. 
 That way I can use region/memstore interchangeably.
 After a memstore flush it is possible for lastSeqWritten to have a 
 log-sequence-id for a region that is not the earliest log-sequence-id for 
 that region's memstore.
 HLog.append() does a putIfAbsent into lastSequenceWritten. This is to ensure 
 that we only keep track  of the earliest log-sequence-number that is present 
 in the memstore.
 Every time the memstore is flushed we remove the region's entry in 
 lastSequenceWritten and wait for the next append to populate this entry 
 again. This is where the problem happens.
 step 1:
 flusher.prepare() snapshots the memstore under 
 HRegion.updatesLock.writeLock().
 step 2 :
 as soon as the updatesLock.writeLock() is released new entries will be added 
 into the memstore.
 step 3 :
 wal.completeCacheFlush() is called. This method removes the region's entry 
 from lastSeqWritten.
 step 4:
 the next append will create a new entry for the region in lastSeqWritten(). 
 But this will be the log seq id of the current append. All the edits that 
 were added in step 2 are missing.
 ==
 as a temporary measure, instead of removing the region's entry in step 3 I 
 will replace it with the log-seq-id of the region-flush-event.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Assigned] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-22 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao reassigned HBASE-4124:
-

Assignee: gaojinchao

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-22 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Fix Version/s: 0.90.5

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Fix For: 0.90.5

 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-20 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: HBASE-4124_Branch90V2.patch

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-20 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13088146#comment-13088146
 ] 

gaojinchao commented on HBASE-4124:
---

I have finished the test. Here is the scenario:
step 1: start up the cluster 
step 2: abort the master right after it finishes calling sendRegionOpen(destination, regions)
step 3: start up the cluster again.

The above steps reproduce the issue: during master failover, the meta 
records the dead server, but the region is actually being opened on a 
living region server.


 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-20 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13088147#comment-13088147
 ] 

gaojinchao commented on HBASE-4124:
---

Sorry, step 3 should be: start up the master again.

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-20 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: HBASE-4124_Branch90V2.patch

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-20 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: (was: HBASE-4124_Branch90V2.patch)

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-20 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: HBASE-4124_Branch90V2.patch

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-20 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: (was: HBASE-4124_Branch90V2.patch)

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-20 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13088173#comment-13088173
 ] 

gaojinchao commented on HBASE-4124:
---

I have added a test case for opening a region.

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, 
 HBASE-4124_Branch90V2.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-4212) TestMasterFailover fails occasionally

2011-08-17 Thread gaojinchao (JIRA)
TestMasterFailover fails occasionally
-

 Key: HBASE-4212
 URL: https://issues.apache.org/jira/browse/HBASE-4212
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.4
Reporter: gaojinchao
 Fix For: 0.90.5


It seems to be a bug: the root region in RIT can't be moved.
In the failover process the master enforces root online but does not clean 
the zk node, so the test waits forever.

  void processFailover() throws KeeperException, IOException, 
      InterruptedException {

    // we enforce on-line root.
    HServerInfo hsi =
      this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
    regionOnline(HRegionInfo.FIRST_META_REGIONINFO, hsi);
    hsi =
      this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
    regionOnline(HRegionInfo.ROOT_REGIONINFO, hsi);

It seems we should wait for the assignment to finish, as is done for the meta region:

  int assignRootAndMeta()
      throws InterruptedException, IOException, KeeperException {
    int assigned = 0;
    long timeout =
      this.conf.getLong("hbase.catalog.verification.timeout", 1000);

    // Work on ROOT region.  Is it in zk in transition?
    boolean rit = this.assignmentManager.
      processRegionInTransitionAndBlockUntilAssigned(HRegionInfo.ROOT_REGIONINFO);
    if (!catalogTracker.verifyRootRegionLocation(timeout)) {
      this.assignmentManager.assignRoot();
      this.catalogTracker.waitForRoot();

      // we need to add this line and guarantee that the transition has completed
      this.assignmentManager.waitForAssignment(HRegionInfo.ROOT_REGIONINFO);
      assigned++;
    }

logs:
2011-08-16 07:45:40,715 DEBUG 
[RegionServer:0;C4S2.site,47710,1313495126115-EventThread] 
zookeeper.ZooKeeperWatcher(252): regionserver:47710-0x131d2690f780004 Received 
ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, 
path=/hbase/unassigned/70236052
2011-08-16 07:45:40,715 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] 
zookeeper.ZKAssign(712): regionserver:47710-0x131d2690f780004 Successfully 
transitioned node 70236052 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
2011-08-16 07:45:40,715 DEBUG [Thread-760-EventThread] 
zookeeper.ZooKeeperWatcher(252): master:60701-0x131d2690f780009 Received 
ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, 
path=/hbase/unassigned/70236052
2011-08-16 07:45:40,716 INFO  [PostOpenDeployTasks:70236052] 
catalog.RootLocationEditor(62): Setting ROOT region location in ZooKeeper as 
C4S2.site:47710
2011-08-16 07:45:40,716 DEBUG [Thread-760-EventThread] zookeeper.ZKUtil(1109): 
master:60701-0x131d2690f780009 Retrieved 52 byte(s) of data from znode 
/hbase/unassigned/70236052 and set watcher; region=-ROOT-,,0, 
server=C4S2.site,47710,1313495126115, state=RS_ZK_REGION_OPENING
2011-08-16 07:45:40,717 DEBUG [Thread-760-EventThread] 
master.AssignmentManager(477): Handling transition=RS_ZK_REGION_OPENING, 
server=C4S2.site,47710,1313495126115, region=70236052/-ROOT-
2011-08-16 07:45:40,725 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] 
zookeeper.ZKAssign(661): regionserver:47710-0x131d2690f780004 Attempting to 
transition node 70236052/-ROOT- from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
2011-08-16 07:45:40,727 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] 
zookeeper.ZKUtil(1109): regionserver:47710-0x131d2690f780004 Retrieved 52 
byte(s) of data from znode /hbase/unassigned/70236052; data=region=-ROOT-,,0, 
server=C4S2.site,47710,1313495126115, state=RS_ZK_REGION_OPENING
2011-08-16 07:45:40,740 DEBUG 
[RegionServer:0;C4S2.site,47710,1313495126115-EventThread] 
zookeeper.ZooKeeperWatcher(252): regionserver:47710-0x131d2690f780004 Received 
ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, 
path=/hbase/unassigned/70236052
2011-08-16 07:45:40,740 DEBUG [Thread-760-EventThread] 
zookeeper.ZooKeeperWatcher(252): master:60701-0x131d2690f780009 Received 
ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, 
path=/hbase/unassigned/70236052
2011-08-16 07:45:40,740 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] 
zookeeper.ZKAssign(712): regionserver:47710-0x131d2690f780004 Successfully 
transitioned node 70236052 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
2011-08-16 07:45:40,741 DEBUG [RS_OPEN_ROOT-C4S2.site,47710,1313495126115-0] 
handler.OpenRegionHandler(121): Opened -ROOT-,,0.70236052
2011-08-16 07:45:40,741 DEBUG [Thread-760-EventThread] zookeeper.ZKUtil(1109): 
master:60701-0x131d2690f780009 Retrieved 52 byte(s) of data from znode 
/hbase/unassigned/70236052 and set watcher; region=-ROOT-,,0, 
server=C4S2.site,47710,1313495126115, state=RS_ZK_REGION_OPENED
2011-08-16 07:45:40,741 DEBUG [Thread-760-EventThread] 
master.AssignmentManager(477): Handling transition=RS_ZK_REGION_OPENED, 
server=C4S2.site,47710,1313495126115, region=70236052/-ROOT-

// ... It said that the zk node can't be cleaned because we have 

[jira] [Updated] (HBASE-4212) TestMasterFailover fails occasionally

2011-08-17 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4212:
--

Attachment: HBASE-4212_branch90V1.patch

 TestMasterFailover fails occasionally
 -

 Key: HBASE-4212
 URL: https://issues.apache.org/jira/browse/HBASE-4212
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.4
Reporter: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4212_branch90V1.patch


 It seems to be a bug: the ROOT region in RIT can't be moved. In the failover 
 process the master forces ROOT on-line but does not clean the zk node, so the 
 test will wait forever.

[jira] [Commented] (HBASE-4212) TestMasterFailover fails occasionally

2011-08-17 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086199#comment-13086199
 ] 

gaojinchao commented on HBASE-4212:
---

I have made a patch. Please review it. Thanks.

 TestMasterFailover fails occasionally
 -

 Key: HBASE-4212
 URL: https://issues.apache.org/jira/browse/HBASE-4212
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.4
Reporter: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4212_branch90V1.patch



[jira] [Commented] (HBASE-4212) TestMasterFailover fails occasionally

2011-08-17 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086202#comment-13086202
 ] 

gaojinchao commented on HBASE-4212:
---

I tested 10 times, and the logs show that META is assigned after ROOT has finished.

2011-08-17 05:06:51,419 DEBUG [MASTER_OPEN_REGION-C4S2.site:47578-0] 
zookeeper.ZKUtil(1109): master:47578-0x131d6fe02e50009 Retrieved 52 byte(s) of 
data from znode /hbase/unassigned/70236052; data=region=-ROOT-,,0, 
server=C4S2.site,60960,1313571996605, state=RS_ZK_REGION_OPENED
2011-08-17 05:06:51,425 DEBUG [Thread-755-EventThread] 
zookeeper.ZooKeeperWatcher(252): master:47578-0x131d6fe02e50009 Received 
ZooKeeper Event, type=NodeDeleted, state=SyncConnected, 
path=/hbase/unassigned/70236052
2011-08-17 05:06:51,425 DEBUG [MASTER_OPEN_REGION-C4S2.site:47578-0] 
zookeeper.ZKAssign(420): master:47578-0x131d6fe02e50009 Successfully deleted 
unassigned node for region 70236052 in expected state RS_ZK_REGION_OPENED
2011-08-17 05:06:51,426 INFO  [Master:0;C4S2.site:47578] master.HMaster(437): 
-ROOT- assigned=1, rit=false, location=C4S2.site:60960
2011-08-17 05:06:51,426 DEBUG [MASTER_OPEN_REGION-C4S2.site:47578-0] 
handler.OpenedRegionHandler(108): Opened region -ROOT-,,0.70236052 on 
C4S2.site,60960,1313571996605
2011-08-17 05:06:51,427 DEBUG [Master:0;C4S2.site:47578] zookeeper.ZKUtil(553): 
master:47578-0x131d6fe02e50009 Unable to get data of znode 
/hbase/unassigned/1028785192 because node does not exist (not an error)
2011-08-17 05:06:51,429 INFO  [Master:0;C4S2.site:47578] 
catalog.CatalogTracker(421): Passed metaserver is null

 TestMasterFailover fails occasionally
 -

 Key: HBASE-4212
 URL: https://issues.apache.org/jira/browse/HBASE-4212
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.4
Reporter: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4212_branch90V1.patch



[jira] [Updated] (HBASE-4212) TestMasterFailover fails occasionally

2011-08-17 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4212:
--

Assignee: gaojinchao
  Status: Patch Available  (was: Open)

 TestMasterFailover fails occasionally
 -

 Key: HBASE-4212
 URL: https://issues.apache.org/jira/browse/HBASE-4212
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4212_branch90V1.patch



[jira] [Updated] (HBASE-4212) TestMasterFailover fails occasionally

2011-08-17 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4212:
--

Attachment: HBASE-4212_TrunkV1.patch

 TestMasterFailover fails occasionally
 -

 Key: HBASE-4212
 URL: https://issues.apache.org/jira/browse/HBASE-4212
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.5

 Attachments: HBASE-4212_TrunkV1.patch, HBASE-4212_branch90V1.patch



[jira] [Updated] (HBASE-4124) ZK restarted while assigning a region, new active HM re-assign it but the RS warned 'already online on this server'.

2011-08-17 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4124:
--

Attachment: HBASE-4124_Branch90V1_trial.patch

I have made a patch that tries to fix this issue, but so far I have only run 
the UT tests. Please review it first and give me some suggestions; I will 
test it tomorrow. Thanks.

 ZK restarted while assigning a region, new active HM re-assign it but the RS 
 warned 'already online on this server'.
 

 Key: HBASE-4124
 URL: https://issues.apache.org/jira/browse/HBASE-4124
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: fulin wang
 Attachments: HBASE-4124_Branch90V1_trial.patch, log.txt

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 ZK restarted while assigning a region; the new active HM re-assigned it, but 
 the RS warned 'already online on this server'.
 Issue:
 The RS failed because of 'already online on this server' and returned; the HM 
 cannot receive the message and reports 'Regions in transition timed out'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3845) data loss because lastSeqWritten can miss memstore edits

2011-08-17 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086728#comment-13086728
 ] 

gaojinchao commented on HBASE-3845:
---

Hi, has the patch been applied to the branch yet?

 data loss because lastSeqWritten can miss memstore edits
 

 Key: HBASE-3845
 URL: https://issues.apache.org/jira/browse/HBASE-3845
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: Prakash Khemani
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Fix For: 0.90.5

 Attachments: 
 0001-HBASE-3845-data-loss-because-lastSeqWritten-can-miss.patch, 
 HBASE-3845-fix-TestResettingCounters-test.txt, HBASE-3845_1.patch, 
 HBASE-3845_2.patch, HBASE-3845_4.patch, HBASE-3845_5.patch, 
 HBASE-3845_6.patch, HBASE-3845__trunk.patch, HBASE-3845_trunk_2.patch, 
 HBASE-3845_trunk_3.patch


 (I don't have a test case to prove this yet but I have run it by Dhruba and 
 Kannan internally and wanted to put this up for some feedback.)
 In this discussion let us assume that the region has only one column family. 
 That way I can use region/memstore interchangeably.
 After a memstore flush it is possible for lastSeqWritten to have a 
 log-sequence-id for a region that is not the earliest log-sequence-id for 
 that region's memstore.
 HLog.append() does a putIfAbsent into lastSequenceWritten. This is to ensure 
 that we only keep track  of the earliest log-sequence-number that is present 
 in the memstore.
 Every time the memstore is flushed we remove the region's entry in 
 lastSequenceWritten and wait for the next append to populate this entry 
 again. This is where the problem happens.
 step 1:
 flusher.prepare() snapshots the memstore under 
 HRegion.updatesLock.writeLock().
 step 2 :
 as soon as the updatesLock.writeLock() is released new entries will be added 
 into the memstore.
 step 3 :
 wal.completeCacheFlush() is called. This method removes the region's entry 
 from lastSeqWritten.
 step 4:
 the next append will create a new entry for the region in lastSeqWritten(). 
 But this will be the log seq id of the current append. All the edits that 
 were added in step 2 are missing.
 ==
 as a temporary measure, instead of removing the region's entry in step 3 I 
 will replace it with the log-seq-id of the region-flush-event.
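 A minimal sketch of this temporary measure (assuming, as in HLog, that 
 lastSeqWritten maps region name to the earliest log seq id still in the 
 memstore; the method shape is illustrative, not the committed patch):
 {code}
 void completeCacheFlush(byte[] regionName, long flushSeqId) {
   // Before: this.lastSeqWritten.remove(regionName) -- this loses track of
   // edits appended between the snapshot (step 2) and this call (step 3),
   // because the next append() repopulates the entry with its own, too-new
   // seq id (step 4).
   // After: pin the region to the flush event's seq id instead, so edits
   // added during the flush can never look older than what was persisted.
   this.lastSeqWritten.put(regionName, Long.valueOf(flushSeqId));
 }
 {code}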

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3933) Hmaster throw NullPointerException

2011-08-14 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13084957#comment-13084957
 ] 

gaojinchao commented on HBASE-3933:
---

Hi all, I have a new idea for this issue: why don't we get the region server 
list from ZK during failover? 
That way we can avoid the case where an HLog has been split while its region 
server is still serving.
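
A minimal sketch of that idea using the plain ZooKeeper client API (the 
/hbase/rs path and the wiring are assumptions, not a patch):
{code}
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Each live region server registers an ephemeral znode under /hbase/rs, so
// reading the children during failover yields the servers that are actually
// alive, instead of a possibly stale ServerManager view.
static List<String> liveRegionServers(ZooKeeper zk)
    throws KeeperException, InterruptedException {
  return zk.getChildren("/hbase/rs", false); // no watch needed for one pass
}
{code}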


 Hmaster throw NullPointerException
 --

 Key: HBASE-3933
 URL: https://issues.apache.org/jira/browse/HBASE-3933
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Attachments: Hmastersetup0.90


 NullPointerException while hmaster starting.
 {code}
   java.lang.NullPointerException
 at java.util.TreeMap.getEntry(TreeMap.java:324)
 at java.util.TreeMap.get(TreeMap.java:255)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.addToServers(AssignmentManager.java:1512)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:606)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
 at 
 org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:402)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
 {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-08-10 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13082212#comment-13082212
 ] 

gaojinchao commented on HBASE-4064:
---

I will study the trunk code and confirm whether this has been fixed there.

 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 disableflow.png


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition (the RegionState was left behind by some exception and 
 should have been removed, which is why I call it a rubbish object), but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls for the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully; the master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state in the master. E.g.,
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 When thread B evaluates if (!regions.containsKey(region)), this.regions 
 still holds the region info; the CPU then switches to thread A.
 Thread A removes the region from both this.regions and regionsInTransition, 
 then execution switches back to thread B. Thread B continues and throws an 
 exception with the msg of "Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
 RegionState from regionsInTransition, so it can never be removed.
  public void unassign(HRegionInfo region, boolean force) {
 LOG.debug("Starting unassignment of region " +
   region.getRegionNameAsString() + 
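 A minimal sketch of a guard against this race (an illustration under the 
 assumptions above, not the attached patch): do the membership check and the 
 RIT cleanup while holding the same locks ClosedRegionHandler uses, so 
 thread B cannot strand a PENDING_CLOSE entry:
 {code}
 public void unassign(HRegionInfo region, boolean force) {
   LOG.debug("Starting unassignment of region " +
       region.getRegionNameAsString() + " (offlining)");
   synchronized (this.regions) {
     // Re-check membership under the lock ClosedRegionHandler holds when
     // it removes the region, so the check cannot go stale mid-flight.
     if (!this.regions.containsKey(region)) {
       LOG.debug("Region " + region.getRegionNameAsString() +
           " is not currently assigned anywhere");
       synchronized (this.regionsInTransition) {
         // Drop any leftover RegionState so TimeoutMonitor stops looping.
         this.regionsInTransition.remove(region.getEncodedName());
       }
       return;
     }
     // ... continue with the normal unassign path ...
   }
 }
 {code}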

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-08-01 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13076000#comment-13076000
 ] 

gaojinchao commented on HBASE-4064:
---

Do we need to fix this issue? If so, I will test it; otherwise I will close 
it.

 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 disableflow.png



[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-25 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070331#comment-13070331
 ] 

gaojinchao commented on HBASE-4064:
---

The master may crash because the pool shutdown is asynchronous. 

The master show :
2011-07-22 13:33:27,806 INFO 
org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 
regions of which 2156 are online.

2011-07-22 13:34:28,646 INFO 
org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 
regions of which 982 are online.
2011-07-22 13:34:31,079 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,080 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:6-0x31502ef4f0 Creating (or updating) unassigned node for 
c9b1c97ac6c00033ceb1890e45e66229 with OFFLINE state
2011-07-22 13:34:31,104 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Forcing OFFLINE; 
was=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. 
state=OFFLINE, ts=1311312871080
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
No previous transition plan was found (or we are ignoring an existing plan) for 
ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. so generated a 
random one; 
hri=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229., src=, 
dest=C4C2.site,60020,1311310281335; 3 (online=3, exclude=null) available servers
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Assigning region 
ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. to 
C4C2.site,60020,1311310281335
2011-07-22 13:34:31,122 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,123 FATAL org.apache.hadoop.hbase.master.HMaster: 
Unexpected state trying to OFFLINE; 
ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. 
state=PENDING_OPEN, ts=1311312871121
java.lang.IllegalStateException
at 
org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1081)
at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1036)
at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:864)
at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:844)
at java.lang.Thread.run(Thread.java:662)
2011-07-22 13:34:31,125 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
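
A minimal sketch of the synchronous alternative (assuming the bulk-assign 
pool is a java.util.concurrent ExecutorService; the timeout is illustrative):
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Block until in-flight bulk assigns have finished before the next pass
// forces regions OFFLINE, so no late assign can race the OFFLINE transition
// seen in the FATAL log above.
static void shutdownPoolAndWait(ExecutorService pool)
    throws InterruptedException {
  pool.shutdown();                      // stop accepting new assign tasks
  if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
    pool.shutdownNow();                 // interrupt stragglers as last resort
  }
}
{code}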


 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 HBASE-4064_branch90V2.patch, disableflow.png



[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-25 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4064:
--

Attachment: (was: HBASE-4064_branch90V2.patch)

 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 disableflow.png



[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-24 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4064:
--

Attachment: disableflow.png

 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 disableflow.png



[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-24 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070304#comment-13070304
 ] 

gaojinchao commented on HBASE-4064:
---

!disableflow.png!

 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 disableflow.png


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition(The RegionState was remained by some exception which 
 should be removed, that's why I called it as rubbish object), but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls on the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully. The master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state on the master, e.g.:
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 While thread B evaluates "if (!regions.containsKey(region))", this.regions 
 still holds the region info; now the CPU switches to thread A.
 Thread A removes the region from both this.regions and regionsInTransition, 
 then execution switches back to thread B. Thread B continues and throws an 
 exception with the message "Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
 RegionState from regionsInTransition, so it can never be removed.
  public void unassign(HRegionInfo region, boolean force) {
    LOG.debug("Starting unassignment of region " +
      region.getRegionNameAsString() + " (offlining)");
    synchronized (this.regions) {

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-24 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070306#comment-13070306
 ] 

gaojinchao commented on HBASE-4064:
---

The patch can't solve J-D's issue, but it is an improvement for disabling a 
table.

I made a flow chart (A -> B -> C -> D). We can see there is a window between 
"Remove region from RIT" and "Remove region from region collections", so my 
patch changes the order of those two steps.
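
For illustration, here is a minimal sketch of that reordering idea. It is 
not the attached patch; it assumes the 0.90-era AssignmentManager method and 
field names (setOffline, clearRegionPlan, regionsInTransition) that appear 
in the code pasted later in this thread:
{noformat}
// Hypothetical sketch, not HBASE-4064's actual patch: take the region
// out of the online-regions bookkeeping *before* clearing it from
// regionsInTransition, so a concurrent disable thread can no longer
// pick up a region whose close is still being finalized.
public void regionOffline(final HRegionInfo regionInfo) {
  setOffline(regionInfo);        // 1. remove from this.regions first
  clearRegionPlan(regionInfo);   // 2. drop any stale region plan
  synchronized (this.regionsInTransition) {
    // 3. only now leave RIT; the window described above is gone
    if (this.regionsInTransition.remove(regionInfo.getEncodedName()) != null) {
      this.regionsInTransition.notifyAll();
    }
  }
}
{noformat}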


 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 disableflow.png


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition (the RegionState was left behind by some exception and 
 should have been removed; that's why I call it a rubbish object) but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls on the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully. The master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state on the master, e.g.:
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 While thread B evaluates "if (!regions.containsKey(region))", this.regions 
 still holds the region info; now the CPU switches to thread A.
 Thread A removes the region from both this.regions and regionsInTransition, 
 then execution switches back to thread B. Thread B continues and throws an 
 exception with the message "Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
 RegionState from regionsInTransition, so it can never be removed.

[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-24 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4064:
--

Attachment: HBASE-4064_branch90V2.patch

 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 HBASE-4064_branch90V2.patch, disableflow.png


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition (the RegionState was left behind by some exception and 
 should have been removed; that's why I call it a rubbish object) but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls on the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully. The master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state on the master, e.g.:
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 While thread B evaluates "if (!regions.containsKey(region))", this.regions 
 still holds the region info; now the CPU switches to thread A.
 Thread A removes the region from both this.regions and regionsInTransition, 
 then execution switches back to thread B. Thread B continues and throws an 
 exception with the message "Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
 RegionState from regionsInTransition, so it can never be removed.
  public void unassign(HRegionInfo region, boolean force) {
    LOG.debug("Starting unassignment of region " +
      region.getRegionNameAsString() + " (offlining)");
    synchronized (this.regions) {

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-24 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13070310#comment-13070310
 ] 

gaojinchao commented on HBASE-4064:
---

I have made a patch, but I haven't verified it yet. I want to review whether 
it is reasonable first, then do the verification.

In my cluster I had changed the parameter 
(hbase.bulk.assignment.waiton.empty.rit) to avoid this issue.
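
As a hedged illustration only (the timeout value below is a placeholder, not 
a recommendation), the parameter can be overridden in code as well as in 
hbase-site.xml:
{noformat}
// Illustrative only: raise the wait-on-empty-RIT period to 10 minutes.
// Uses org.apache.hadoop.conf.Configuration and
// org.apache.hadoop.hbase.HBaseConfiguration.
Configuration conf = HBaseConfiguration.create();
conf.setLong("hbase.bulk.assignment.waiton.empty.rit", 10 * 60 * 1000L);
{noformat}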


 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
 HBASE-4064_branch90V2.patch, disableflow.png


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition (the RegionState was left behind by some exception and 
 should have been removed; that's why I call it a rubbish object) but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls on the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully. The master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state on the master, e.g.:
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 While thread B evaluates "if (!regions.containsKey(region))", this.regions 
 still holds the region info; now the CPU switches to thread A.
 Thread A removes the region from both this.regions and regionsInTransition, 
 then execution switches back to thread B. Thread B continues and throws an 
 exception with the message "Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
 RegionState from regionsInTransition, so it can never be removed.

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-22 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13069442#comment-13069442
 ] 

gaojinchao commented on HBASE-4064:
---

@J-D Thanks for your reply. :)

I got it. In my case, the race is between the disable threads and the 
ClosedRegionHandler threads.

 1. The disable thread gets the regions from the regions collection (see 
getRegionsOfTable).

 2. The thread pool gets a region and sends the close request to the region 
server; at the same time it puts the region into RIT (regionsInTransition), 
which indicates that the region is being processed.

 3. The region server finishes closing the region, changes the ZK state, and 
notifies the master.

 4. When the master receives the watcher event, it removes the region from 
RIT and then removes it from the regions collection.
   There is a short window here: if the disable table operation can't finish 
within one period, the region may be unassigned again.

My patch tries to fix the above case: remove the region from the regions 
collection first, so the disable thread can't pick up a region that is still 
being processed.
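
A timeline sketch of the window in steps 1-4 (the interleaving is 
illustrative, reconstructed from the description above):
{noformat}
// Disable thread                        ClosedRegionHandler thread
// --------------                        --------------------------
// getRegionsOfTable(t)
//   -> region R is still in this.regions
//                                       remove R from regionsInTransition
// unassign(R) -> R re-enters RIT
//                                       remove R from this.regions
// Result: R sits in RIT but is assigned nowhere, so TimeoutMonitor
// re-runs the forced unassign forever ("PENDING_CLOSE for too long").
{noformat}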


I found this issue yesterday: the enable threads are also in a race 
condition. (I changed the period to 1 minute in order to reproduce the 
issue.) It seems the pool couldn't finish before a new enable process 
started; we need a sleep after an enable period finishes.

The master logs:
2011-07-22 13:33:27,806 INFO 
org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 
regions of which 2156 are online.

2011-07-22 13:34:28,646 INFO 
org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 
regions of which 982 are online.
2011-07-22 13:34:31,079 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,080 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:6-0x31502ef4f0 Creating (or updating) unassigned node for 
c9b1c97ac6c00033ceb1890e45e66229 with OFFLINE state
2011-07-22 13:34:31,104 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Forcing OFFLINE; 
was=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. 
state=OFFLINE, ts=1311312871080
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
No previous transition plan was found (or we are ignoring an existing plan) for 
ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. so generated a 
random one; 
hri=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229., src=, 
dest=C4C2.site,60020,1311310281335; 3 (online=3, exclude=null) available servers
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Assigning region 
ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. to 
C4C2.site,60020,1311310281335
2011-07-22 13:34:31,122 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,123 FATAL org.apache.hadoop.hbase.master.HMaster: 
Unexpected state trying to OFFLINE; 
ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. 
state=PENDING_OPEN, ts=1311312871121
java.lang.IllegalStateException
at 
org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1081)
at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1036)
at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:864)
at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:844)
at 
org.apache.hadoop.hbase.master.handler.EnableTableHandler$BulkEnabler$1.run(EnableTableHandler.java:154)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
2011-07-22 13:34:31,125 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
2011-07-22 13:34:31,482 DEBUG 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: 
master:6-0x31502ef4f0 Received ZooKeeper Event, type=NodeDataChanged, 
state=SyncConnected, path=/hbase/unassigned/c9b1c97ac6c00033ceb1890e45e66229
2011-07-22 13:34:31,482 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x31502ef4f0 Unable to get data of znode 
/hbase/unassigned/c9b1c97ac6c00033ceb1890e45e66229
 


 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: 

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-21 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13068876#comment-13068876
 ] 

gaojinchao commented on HBASE-4064:
---

Please don't merge the patch; I found another issue and need to dig into 
whether it is related to this patch. Thanks.


 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition (the RegionState was left behind by some exception and 
 should have been removed; that's why I call it a rubbish object) but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls on the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully. The master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state on the master, e.g.:
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 While thread B evaluates "if (!regions.containsKey(region))", this.regions 
 still holds the region info; now the CPU switches to thread A.
 Thread A removes the region from both this.regions and regionsInTransition, 
 then execution switches back to thread B. Thread B continues and throws an 
 exception with the message "Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
 RegionState from regionsInTransition, so it can never be removed.
  public void unassign(HRegionInfo region, boolean force) {
    LOG.debug("Starting unassignment of region " +
      region.getRegionNameAsString() + " (offlining)");

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

2011-07-20 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13068747#comment-13068747
 ] 

gaojinchao commented on HBASE-4095:
---

I added some logging and found that initialReplication is zero. When we 
create a file in HDFS and don't write any data, the reported replication is 
zero, so the solution has an issue.

2011-07-20 19:38:20,517 WARN  [RegionServer:1;C4C3.site,41763,1311161899551] 
wal.HLog(478): gjc:rollWriter start1311161900517
2011-07-20 19:38:20,650 WARN  [RegionServer:0;C4C3.site,35697,1311161899494] 
wal.HLog(478): gjc:rollWriter start1311161900650
2011-07-20 19:38:20,707 WARN  [RegionServer:1;C4C3.site,41763,1311161899551] 
wal.HLog(518): gjc:updateLock start1311161900707
2011-07-20 19:38:20,707 WARN  [RegionServer:1;C4C3.site,41763,1311161899551] 
wal.HLog(532): gjc:initialReplication start0
2011-07-20 19:38:21,238 WARN  [RegionServer:0;C4C3.site,35697,1311161899494] 
wal.HLog(518): gjc:updateLock start1311161901238
2011-07-20 19:38:21,239 WARN  [RegionServer:0;C4C3.site,35697,1311161899494] 
wal.HLog(532): gjc:initialReplication start0
2011-07-20 19:38:41,726 WARN  [IPC Server handler 4 on 37616] wal.HLog(478): 
gjc:rollWriter start1311161921726
2011-07-20 19:38:41,769 WARN  [IPC Server handler 4 on 37616] wal.HLog(518): 
gjc:updateLock start1311161921769
2011-07-20 19:38:41,769 WARN  [IPC Server handler 4 on 37616] wal.HLog(532): 
gjc:initialReplication start0
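
One possible direction, sketched here only as an assumption (this is not the 
attached patch): initialize initialReplication lazily on the first sync that 
reports a non-zero pipeline, so a value captured as 0 at file-creation time 
can never permanently disable the low-replication check. Field and method 
names follow the snippet quoted below:
{noformat}
private void checkLowReplication() {
  try {
    int numCurrentReplicas = getLogReplication();
    // Sketch: if initialReplication was captured as 0 at creation time
    // (no data written yet), adopt the first real pipeline size instead.
    if (this.initialReplication == 0 && numCurrentReplicas > 0) {
      this.initialReplication = numCurrentReplicas;
    }
    if (numCurrentReplicas != 0 &&
        numCurrentReplicas < this.initialReplication) {
      requestLogRoll();
      logRollRequested = true;
    }
  } catch (Exception e) {
    LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas " + e +
      " still proceeding ahead...");
  }
}
{noformat}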


 Hlog may not be rolled in a long time if checkLowReplication's request of 
 LogRoll is blocked
 

 Key: HBASE-4095
 URL: https://issues.apache.org/jira/browse/HBASE-4095
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.90.3
Reporter: Jieshan Bean
Assignee: Jieshan Bean
 Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, 
 HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif


 Some large HLog files (larger than 10G) appeared in our environment, and I 
 found the reason why they got so huge:
 1. The replica count is less than the expected number, so 
 checkLowReplication will be called on each sync.
 2. checkLowReplication requests a log roll first and sets 
 logRollRequested to true: 
 {noformat}
 private void checkLowReplication() {
   // if the number of replicas in HDFS has fallen below the initial
   // value, then roll logs.
   try {
     int numCurrentReplicas = getLogReplication();
     if (numCurrentReplicas != 0 &&
         numCurrentReplicas < this.initialReplication) {
       LOG.warn("HDFS pipeline error detected. " +
         "Found " + numCurrentReplicas + " replicas but expecting " +
         this.initialReplication + " replicas. " +
         " Requesting close of hlog.");
       requestLogRoll();
       logRollRequested = true;
     }
   } catch (Exception e) {
     LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
       " still proceeding ahead...");
   }
 }
 {noformat}
 3. requestLogRoll() just submits the roll request. It may not execute in 
 time, because it must acquire the unfair cacheFlushLock, and that lock may 
 be held by the cache-flush threads.
 4. logRollRequested stays true until the log roll executes, so during that 
 time every log-roll request made from sync() is skipped.
 Here are the logs from when the problem happened (please note the file size 
 of hlog 193-195-5-111%3A20020.1309937386639 in the last row):
 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: 
 HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  
 Requesting close of hlog.
 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: 
 Roll 
 /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119,
  entries=32434, filesize=239589754. New hlog 
 /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: 
 HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  
 Requesting close of hlog.
 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: 
 Renaming flushed file at 
 hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847
  to 
 hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: 
 Added 
 hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983,
  entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Finished memstore 

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-20 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13068796#comment-13068796
 ] 

gaojinchao commented on HBASE-4064:
---

Hi, I verified the issue by adding a sleep in regionOffline. I think V2 is OK.

Below is the code (the sleep is test-only, to widen the race window):
 public void regionOffline(final HRegionInfo regionInfo) {
   synchronized (this.regionsInTransition) {
     if (this.regionsInTransition.remove(regionInfo.getEncodedName()) != null) {
       this.regionsInTransition.notifyAll();
     }
   }
   // test-only sleep: widens the window between the RIT removal above
   // and the regions-collection removal below
   try {
     Thread.sleep(1000);
   } catch (Throwable e) {
     // ignored
   }
   // remove the region plan as well just in case.
   clearRegionPlan(regionInfo);
   setOffline(regionInfo);
 }

 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition (the RegionState was left behind by some exception and 
 should have been removed; that's why I call it a rubbish object) but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls on the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully. The master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state on the master, e.g.:
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 While thread B evaluates "if (!regions.containsKey(region))", this.regions 
 still holds the region info; now the CPU switches to thread A.
 Thread A removes the region from both this.regions and regionsInTransition, 
 then execution switches back to thread B.

[jira] [Updated] (HBASE-4112) Creating table may throw NullPointerException

2011-07-19 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4112:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

 Creating table may throw NullPointerException
 -

 Key: HBASE-4112
 URL: https://issues.apache.org/jira/browse/HBASE-4112
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4112_Trunk.patch, HBASE-4112_branch90V1.patch


  It happened in the latest branch 0.90, but I can't reproduce it.
 
  It seems using the API getHRegionInfoOrNull is better, or checking the 
  input parameter before calling getHRegionInfo.
 
  Code:
   public static Writable getWritable(final byte [] bytes, final Writable w)
   throws IOException {
     return getWritable(bytes, 0, bytes.length, w);
   }
   return getWritable(bytes, 0, bytes.length, w);  // It seems the input 
   parameter "bytes" is null
 
  logs:
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Socket connection 
  established to C4C3.site/157.5.100.3:2181, initiating session
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Session establishment 
  complete on server C4C3.site/157.5.100.3:2181, sessionid = 
  0x2312b8e3f72, negotiated timeout = 18
  [INFO] Create : ufdr111 222!
  [INFO] Create : ufdr111 start!
  java.lang.NullPointerException
 at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
 at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
 at org.apache.hadoop.hbase.client.HBaseAdmin$1.processRow(HBaseAdmin.java:306)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:190)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:95)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:73)
 at org.apache.hadoop.hbase.client.HBaseAdmin.createTable(HBaseAdmin.java:325)
 at createTable.main(createTable.java:96)
 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4112) Creating table threw NullPointerException

2011-07-18 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4112:
--

Attachment: HBASE-4112_branch90V1.patch

 Creating table threw NullPointerException
 -

 Key: HBASE-4112
 URL: https://issues.apache.org/jira/browse/HBASE-4112
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4112_branch90V1.patch


  It happened in the latest branch 0.90, but I can't reproduce it.
 
  It seems using the API getHRegionInfoOrNull is better, or checking the 
  input parameter before calling getHRegionInfo.
 
  Code:
   public static Writable getWritable(final byte [] bytes, final Writable w)
   throws IOException {
     return getWritable(bytes, 0, bytes.length, w);
   }
   return getWritable(bytes, 0, bytes.length, w);  // It seems the input 
   parameter "bytes" is null
 
  logs:
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Socket connection 
  established to C4C3.site/157.5.100.3:2181, initiating session
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Session establishment 
  complete on server C4C3.site/157.5.100.3:2181, sessionid = 
  0x2312b8e3f72, negotiated timeout = 18
  [INFO] Create : ufdr111 222!
  [INFO] Create : ufdr111 start!
  java.lang.NullPointerException
 at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
 at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
 at org.apache.hadoop.hbase.client.HBaseAdmin$1.processRow(HBaseAdmin.java:306)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:190)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:95)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:73)
 at org.apache.hadoop.hbase.client.HBaseAdmin.createTable(HBaseAdmin.java:325)
 at createTable.main(createTable.java:96)
 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4112) Creating table threw NullPointerException

2011-07-18 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13066918#comment-13066918
 ] 

gaojinchao commented on HBASE-4112:
---

The reason is that the META table had some dirty data (e.g. the 
column=info:server cell), so recreating the table threw the exception.
I have made a patch and verified it. Please review it. Thanks.


All tests passed.

 Creating table threw NullPointerException
 -

 Key: HBASE-4112
 URL: https://issues.apache.org/jira/browse/HBASE-4112
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4112_branch90V1.patch


  It happened in the latest branch 0.90, but I can't reproduce it.
 
  It seems using the API getHRegionInfoOrNull is better, or checking the 
  input parameter before calling getHRegionInfo.
 
  Code:
   public static Writable getWritable(final byte [] bytes, final Writable w)
   throws IOException {
     return getWritable(bytes, 0, bytes.length, w);
   }
   return getWritable(bytes, 0, bytes.length, w);  // It seems the input 
   parameter "bytes" is null
 
  logs:
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Socket connection 
  established to C4C3.site/157.5.100.3:2181, initiating session
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Session establishment 
  complete on server C4C3.site/157.5.100.3:2181, sessionid = 
  0x2312b8e3f72, negotiated timeout = 18
  [INFO] Create : ufdr111 222!
  [INFO] Create : ufdr111 start!
  java.lang.NullPointerException
 at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
 at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
 at org.apache.hadoop.hbase.client.HBaseAdmin$1.processRow(HBaseAdmin.java:306)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:190)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:95)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:73)
 at org.apache.hadoop.hbase.client.HBaseAdmin.createTable(HBaseAdmin.java:325)
 at createTable.main(createTable.java:96)
 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4112) Creating table threw NullPointerException

2011-07-18 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13067444#comment-13067444
 ] 

gaojinchao commented on HBASE-4112:
---

False means the scan is finished; true means continue and process the next 
record. In this case, true is better (my test shows the same).

// the code segment from metaScan:
  for (Result rr : rrs) {
    if (processedRows >= rowUpperLimit) {
      break done;
    }
    if (!visitor.processRow(rr))
      break done; // exit completely
    processedRows++;
  }
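
For illustration, a visitor in the spirit of the fix, sketched under the 
assumption that the 0.90 client exposes MetaScanner.MetaScannerVisitor and 
Writables.getHRegionInfoOrNull as discussed in this thread:
{noformat}
// Sketch: skip dirty META rows (no usable info:regioninfo cell) by
// returning true, so the scan continues instead of hitting the NPE.
MetaScannerVisitor visitor = new MetaScannerVisitor() {
  public boolean processRow(Result rowResult) throws IOException {
    byte[] bytes = rowResult.getValue(HConstants.CATALOG_FAMILY,
        HConstants.REGIONINFO_QUALIFIER);
    HRegionInfo info = Writables.getHRegionInfoOrNull(bytes);
    if (info == null) {
      return true;  // dirty row: continue with the next record
    }
    // ... examine info here ...
    return true;
  }
};
{noformat}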

 Creating table threw NullPointerException
 -

 Key: HBASE-4112
 URL: https://issues.apache.org/jira/browse/HBASE-4112
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4112_branch90V1.patch


  It happened in the latest branch 0.90, but I can't reproduce it.
 
  It seems using the API getHRegionInfoOrNull is better, or checking the 
  input parameter before calling getHRegionInfo.
 
  Code:
   public static Writable getWritable(final byte [] bytes, final Writable w)
   throws IOException {
     return getWritable(bytes, 0, bytes.length, w);
   }
   return getWritable(bytes, 0, bytes.length, w);  // It seems the input 
   parameter "bytes" is null
 
  logs:
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Socket connection 
  established to C4C3.site/157.5.100.3:2181, initiating session
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Session establishment 
  complete on server C4C3.site/157.5.100.3:2181, sessionid = 
  0x2312b8e3f72, negotiated timeout = 18
  [INFO] Create : ufdr111 222!
  [INFO] Create : ufdr111 start!
  java.lang.NullPointerException
 at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
 at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
 at org.apache.hadoop.hbase.client.HBaseAdmin$1.processRow(HBaseAdmin.java:306)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:190)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:95)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:73)
 at org.apache.hadoop.hbase.client.HBaseAdmin.createTable(HBaseAdmin.java:325)
 at createTable.main(createTable.java:96)
 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4112) Creating table threw NullPointerException

2011-07-18 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13067449#comment-13067449
 ] 

gaojinchao commented on HBASE-4112:
---

OK, I'll try to make a patch for TRUNK.

 Creating table threw NullPointerException
 -

 Key: HBASE-4112
 URL: https://issues.apache.org/jira/browse/HBASE-4112
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4112_branch90V1.patch


  It happened in the latest branch 0.90, but I can't reproduce it.
 
  It seems using the API getHRegionInfoOrNull is better, or checking the 
  input parameter before calling getHRegionInfo.
 
  Code:
   public static Writable getWritable(final byte [] bytes, final Writable w)
   throws IOException {
     return getWritable(bytes, 0, bytes.length, w);
   }
   return getWritable(bytes, 0, bytes.length, w);  // It seems the input 
   parameter "bytes" is null
 
  logs:
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Socket connection 
  established to C4C3.site/157.5.100.3:2181, initiating session
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Session establishment 
  complete on server C4C3.site/157.5.100.3:2181, sessionid = 
  0x2312b8e3f72, negotiated timeout = 18
  [INFO] Create : ufdr111 222!
  [INFO] Create : ufdr111 start!
  java.lang.NullPointerException
 at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
 at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
 at org.apache.hadoop.hbase.client.HBaseAdmin$1.processRow(HBaseAdmin.java:306)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:190)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:95)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:73)
 at org.apache.hadoop.hbase.client.HBaseAdmin.createTable(HBaseAdmin.java:325)
 at createTable.main(createTable.java:96)
 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4112) Creating table threw NullPointerException

2011-07-18 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4112:
--

Attachment: HBASE-4112_Trunk.patch

 Creating table threw NullPointerException
 -

 Key: HBASE-4112
 URL: https://issues.apache.org/jira/browse/HBASE-4112
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4112_Trunk.patch, HBASE-4112_branch90V1.patch


  It happened in the latest branch 0.90, but I can't reproduce it.
 
  It seems using the API getHRegionInfoOrNull is better, or checking the 
  input parameter before calling getHRegionInfo.
 
  Code:
   public static Writable getWritable(final byte [] bytes, final Writable w)
   throws IOException {
     return getWritable(bytes, 0, bytes.length, w);
   }
   return getWritable(bytes, 0, bytes.length, w);  // It seems the input 
   parameter "bytes" is null
 
  logs:
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Socket connection 
  established to C4C3.site/157.5.100.3:2181, initiating session
  11/07/15 10:15:42 INFO zookeeper.ClientCnxn: Session establishment 
  complete on server C4C3.site/157.5.100.3:2181, sessionid = 
  0x2312b8e3f72, negotiated timeout = 18
  [INFO] Create : ufdr111 222!
  [INFO] Create : ufdr111 start!
  java.lang.NullPointerException
 at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
 at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
 at org.apache.hadoop.hbase.client.HBaseAdmin$1.processRow(HBaseAdmin.java:306)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:190)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:95)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:73)
 at org.apache.hadoop.hbase.client.HBaseAdmin.createTable(HBaseAdmin.java:325)
 at createTable.main(createTable.java:96)
 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-16 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13066387#comment-13066387
 ] 

gaojinchao commented on HBASE-4064:
---

@Stack:
I will reproduce and verify it after finishing the review, because it may 
take a lot of time.


 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition (the RegionState was left behind by some exception and 
 should have been removed; that's why I call it a rubbish object) but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls on the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully. The master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state on the master, e.g.:
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 While thread B evaluates "if (!regions.containsKey(region))", this.regions 
 still holds the region info; now the CPU switches to thread A.
 Thread A removes the region from both this.regions and regionsInTransition, 
 then execution switches back to thread B. Thread B continues and throws an 
 exception with the message "Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
 RegionState from regionsInTransition, so it can never be removed.
  public void unassign(HRegionInfo region, boolean force) {
    LOG.debug("Starting unassignment of region " +
      region.getRegionNameAsString() + " (offlining)");

[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of Region has been PENDING_CLOSE for too long...

2011-07-14 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4064:
--

Attachment: HBASE-4064_branch90V2.patch

I tried to make a patch: if the region is in RIT, it shouldn't be unassigned 
again, so it seems changing the code position can solve this issue.
All tests passed. Please review and give some suggestions.
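
A minimal sketch of that idea (assuming the 0.90 unassign() and 
regionsInTransition map shown elsewhere in this thread; this is not the 
attached patch):
{noformat}
// Sketch: if the region is already in transition, bail out instead of
// force-unassigning it a second time.
public void unassign(HRegionInfo region, boolean force) {
  synchronized (this.regionsInTransition) {
    if (this.regionsInTransition.containsKey(region.getEncodedName())) {
      return;  // a close/open is already in flight; don't start another
    }
  }
  // ... continue with the normal unassign path ...
}
{noformat}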

 Two concurrent unassigning of the same region caused the endless loop of 
 Region has been PENDING_CLOSE for too long...
 

 Key: HBASE-4064
 URL: https://issues.apache.org/jira/browse/HBASE-4064
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
 Fix For: 0.90.5

 Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch


 1. If there is a rubbish RegionState object with PENDING_CLOSE in 
 regionsInTransition (the RegionState was left behind by some exception and 
 should have been removed; that's why I call it a rubbish object), but the 
 region is not currently assigned anywhere, TimeoutMonitor will fall into an 
 endless loop:
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:21,326 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:21,438 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:21,441 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:31,207 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:31,215 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 state=PENDING_CLOSE, ts=1309141555301
 2011-06-27 10:32:41,164 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
 (offlining)
 2011-06-27 10:32:41,172 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
 region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
 not currently assigned anywhere
 .
 2. In the following scenario, two concurrent unassign calls for the same 
 region may lead to the above problem:
 The first unassign call sends its RPC successfully; the master watches the 
 RS_ZK_REGION_CLOSED event and, while processing it, creates a 
 ClosedRegionHandler to remove the region's state in the master, e.g.
 while the ClosedRegionHandler is running in an 
 hbase.master.executor.closeregion.threads thread (A), another unassign call 
 for the same region runs in another thread (B).
 When thread B evaluates if (!regions.containsKey(region)), this.regions still 
 holds the region info; the CPU then switches to thread A.
 Thread A removes the region from the sets this.regions and 
 regionsInTransition, then control switches back to thread B. Thread B 
 continues and throws an exception with the message "Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
 RegionState from regionsInTransition, so it can never be removed.
  public void unassign(HRegionInfo region, 

[jira] [Commented] (HBASE-3933) Hmaster throw NullPointerException

2011-07-13 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13064379#comment-13064379
 ] 

gaojinchao commented on HBASE-3933:
---

OK, thanks.
It happens rarely; I can't come up with a better change now.

 Hmaster throw NullPointerException
 --

 Key: HBASE-3933
 URL: https://issues.apache.org/jira/browse/HBASE-3933
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Attachments: Hmastersetup0.90


 NullPointerException while hmaster starting.
 {code}
   java.lang.NullPointerException
 at java.util.TreeMap.getEntry(TreeMap.java:324)
 at java.util.TreeMap.get(TreeMap.java:255)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.addToServers(AssignmentManager.java:1512)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:606)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
 at 
 org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:402)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
 {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3995) HBASE-3946 broke TestMasterFailover

2011-06-27 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055406#comment-13055406
 ] 

gaojinchao commented on HBASE-3995:
---

Hi, stack.
The following code snippet repeats the check if (storedInfo == null); a 
collapsed form is sketched after the block:

 if (storedInfo == null) {
   ...
   if (storedInfo == null) {
     storedInfo = this.onlineServers.get(info.getServerName());
   }
 }
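
For comparison, a minimal sketch of the collapsed form; the first lookup is 
hypothetical, since the surrounding method is not quoted here:

{code}
// Illustrative only: fold the duplicated null-check into a single fallback.
HServerInfo storedInfo = lookupServerInfo(info);  // hypothetical first lookup
if (storedInfo == null) {
  storedInfo = this.onlineServers.get(info.getServerName());
}
{code}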

 HBASE-3946 broke TestMasterFailover
 ---

 Key: HBASE-3995
 URL: https://issues.apache.org/jira/browse/HBASE-3995
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: stack
Priority: Blocker
 Fix For: 0.90.4

 Attachments: am.txt


 TestMasterFailover is all about a new master coming up on an existing 
 cluster.  Previous to HBASE-3946, the new master joining a cluster processing 
 any dead servers would assign all regions found on the dead server even if 
 they were split parents.  We don't want that.
 But TestMasterFailover mocks up some pretty interesting conditions.  The one 
 we were failing on was that while the master was offine, we'd manually add a 
 region to zk that was in CLOSING state.  We'd then go and disable the table 
 up in zk (while master was offline).  Finally, we'd' kill the server that was 
 supposed to be hosting the region from the disabled table in CLOSING state. 
 Then we'd have the master join the cluster.  It had to figure it out.
 Before HBASE-3946, we'd just force offline every region that had been on the 
 dead server.  This would call all to be assigned only on assign, regions from 
 disabled tables are skipped, so it all worked (except would online parent 
 of a split should there be one).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-26 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055248#comment-13055248
 ] 

gaojinchao commented on HBASE-4028:
---

Ted, thanks for your work.

 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4028-0.90V2, Screenshot-2.png, Verifiedresult.png, 
 hbase-root-master-157-5-100-8.rar


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-25 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054846#comment-13054846
 ] 

gaojinchao commented on HBASE-4028:
---

Oh, my god! There is another bug, and it is hidden. :)
Following code snippet:

protected AtomicReference<Throwable> thrown = new AtomicReference<Throwable>();

It is thrown.get() that can be null, not thrown itself, so the condition below 
is wrong:

 while (totalBuffered > maxHeapUsage && thrown == null) 

I have made a new patch. Please review it. Thanks.
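
A minimal, self-contained sketch of the loop in question, with simplified 
fields mirroring the snippets quoted in this issue (not the actual patch):

{code}
import java.util.concurrent.atomic.AtomicReference;

class EntryBuffersSketch {
  private final Object dataAvailable = new Object();
  private final AtomicReference<Throwable> thrown =
      new AtomicReference<Throwable>();
  private long totalBuffered;
  private final long maxHeapUsage = 128 * 1024 * 1024;

  void waitForCapacity() throws InterruptedException {
    synchronized (dataAvailable) {
      // Broken form: 'thrown == null' tests the AtomicReference itself,
      // which is never null, so the loop never waits and buffered edits
      // grow without bound. The fix is to test the contained value:
      while (totalBuffered > maxHeapUsage && thrown.get() == null) {
        dataAvailable.wait(3000);
      }
    }
  }
}
{code}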


 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4028-0.90V1.patch, Screenshot-2.png, 
 hbase-root-master-157-5-100-8.rar


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-25 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4028:
--

Attachment: HBASE-4028-0.90V2

 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4028-0.90V1.patch, HBASE-4028-0.90V2, 
 Screenshot-2.png, hbase-root-master-157-5-100-8.rar


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-25 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054847#comment-13054847
 ] 

gaojinchao commented on HBASE-4028:
---

The verified result:
hbase-root-master-157-5-111-22.log:2011-06-25 17:18:53,768 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:18:56,768 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:18:59,768 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:02,768 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:05,769 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:08,769 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:11,769 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:14,769 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:17,769 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:20,770 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...
hbase-root-master-157-5-111-22.log:2011-06-25 17:19:23,770 WARN 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Used 134314752 bytes of 
buffered edits, waiting for IO threads...

 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4028-0.90V2, Screenshot-2.png, 
 hbase-root-master-157-5-100-8.rar


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-25 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4028:
--

Attachment: (was: HBASE-4028-0.90V1.patch)

 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4028-0.90V2, Screenshot-2.png, 
 hbase-root-master-157-5-100-8.rar


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Assigned] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-24 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao reassigned HBASE-4028:
-

Assignee: gaojinchao

 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-24 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4028:
--

Attachment: Screenshot-2.png

 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: Screenshot-2.png


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-24 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4028:
--

Attachment: hbase-root-master-157-5-100-8.rar

 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: Screenshot-2.png, hbase-root-master-157-5-100-8.rar


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4028) Hmaster crashes caused by splitting log.

2011-06-24 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4028:
--

Attachment: HBASE-4028-0.90V1.patch

 Hmaster crashes caused by splitting log.
 

 Key: HBASE-4028
 URL: https://issues.apache.org/jira/browse/HBASE-4028
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.90.4

 Attachments: HBASE-4028-0.90V1.patch, Screenshot-2.png, 
 hbase-root-master-157-5-100-8.rar


 In my performance cluster (0.90.3), the HMaster memory grew from 100 MB up to 
 4 GB when one region server crashed.
 I added some prints in the doneWriting function and found that the value of 
 totalBuffered is negative.
 10:29:52,119 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: 
 gjc:release Used -565832
 hbase-root-master-157-5-111-21.log:2011-06-24 10:29:52,119 WARN 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: gjc:release Used 
 -565832release size25168
 void doneWriting(RegionEntryBuffer buffer) {
   synchronized (this) {
     LOG.warn("gjc1: relase currentlyWriting +biggestBufferKey " +
         buffer.encodedRegionName);
     boolean removed = currentlyWriting.remove(buffer.encodedRegionName);
     assert removed;
   }
   long size = buffer.heapSize();
   synchronized (dataAvailable) {
     totalBuffered -= size;
     LOG.warn("gjc:release Used " + totalBuffered);
     // We may unblock writers
     dataAvailable.notifyAll();
   }
   LOG.warn("gjc:release Used " + totalBuffered + "release size" + size);
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3892) Table can't disable

2011-06-12 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13048448#comment-13048448
 ] 

gaojinchao commented on HBASE-3892:
---

TRUNK doesn't need it; it has been modified to use a zk watcher.

The code below should protect against this case (a sketch of the guard follows):
// RegionState must be null, or SPLITTING or PENDING_CLOSE.
if (!isInStateForSplitting(regionState)) break;
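
Purely as an illustration of that contract (the real trunk method may 
differ), the guard could look like:

{code}
// Hypothetical sketch of the guard above: only proceed with split handling
// when the RegionState is null, already SPLITTING, or PENDING_CLOSE.
private boolean isInStateForSplitting(RegionState rs) {
  return rs == null || rs.isSplitting() || rs.isPendingClose();
}
{code}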

 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90v3.patch, 
 AssignmentManager_90v4.patch, logs.rar


 In TimeoutMonitor: 
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should send the zk message again when closing the region times out;
 in this case, some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // disable table and master sent Close message to region server, Region state 
 was set PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // received splitting message and cleared Region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 

[jira] [Commented] (HBASE-3892) Table can't disable

2011-06-10 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13047117#comment-13047117
 ] 

gaojinchao commented on HBASE-3892:
---

It didn't reproduce, so my guess is that J-D is right. The logs below show 
that the region server repeated the message at an interval of 60s, so it 
should be an IPC timeout.
2011-05-08 17:43:45,507 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:44:45,521 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:45:45,524 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:46:45,528 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6

The HMaster logs show that the regionServerReport IPC had been closed, which 
also points to an IPC timeout:

2011-05-08 17:52:47,703 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server 
Responder, call regionServerReport(serverName=C4C3.site,60020,1304820199474, 
load=(requests=0, regions=55, usedHeap=1058, maxHeap=8175), 
[Lorg.apache.hadoop.hbase.HMsg;@1453ecec, 
[Lorg.apache.hadoop.hbase.HRegionInfo;@11e78461) from 157.5.100.3:37518: output 
error
2011-05-08 17:52:47,704 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 7 on 6 caught: java.nio.channels.ClosedChannelException
at 
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at 
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
at 
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at 
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at 
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

But I can't dig out the root cause. 


 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90v3.patch, 
 AssignmentManager_90v4.patch, logs.rar


 In TimeoutMonitor: 
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should send the zk message again when closing the region times out;
 in this case, some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // disable table and master sent Close message to region server, Region state 
 was set PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // received splitting message and cleared Region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from 

[jira] [Commented] (HBASE-3892) Table can't disable

2011-06-08 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13045889#comment-13045889
 ] 

gaojinchao commented on HBASE-3892:
---

No, it still needs review and merge. 

 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90v2.patch, 
 AssignmentManager_90v3.patch, logs.rar


 In TimeoutMonitor: 
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should send the zk message again when closing the region times out;
 in this case, some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // disable table and master sent Close message to region server, Region state 
 was set PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // received splitting message and cleared Region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 

[jira] [Updated] (HBASE-3892) Table can't disable

2011-06-08 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-3892:
--

Attachment: AssignmentManager_90v4.patch

 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90v3.patch, 
 AssignmentManager_90v4.patch, logs.rar


 In TimeoutMonitor: 
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should send the zk message again when closing the region times out;
 in this case, some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // disable table and master sent Close message to region server, Region state 
 was set PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // received splitting message and cleared Region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 

[jira] [Commented] (HBASE-3892) Table can't disable

2011-06-02 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042715#comment-13042715
 ] 

gaojinchao commented on HBASE-3892:
---

I am not familiar with the unit tests (it seems difficult to send a double 
split report and test the cluster function), so I verified it by modifying 
the region server code.

Logs:
2011-06-02 19:57:49,056 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr2,8613816508215#1308,1307014802589.c5a8a47d9c84f417b9fcc4c8019e7c7e.: 
Daughters; 
ufdr2,8613816508215#1308,1307015867020.37481173e31ea469bcaa310cf8d7d980., 
ufdr2,8613816595415#3432,1307015867020.afbf02ef235cabe66026f7c393d79bc0. from 
C4C4.site,60020,1307015130114
2011-06-02 19:57:49,057 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x30502c278f Unable to get data of znode 
/hbase/unassigned/c5a8a47d9c84f417b9fcc4c8019e7c7e because node does not exist 
(not necessarily an error)
2011-06-02 19:57:49,081 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr2,8613816508215#1308,1307014802589.c5a8a47d9c84f417b9fcc4c8019e7c7e.: 
Daughters; 
ufdr2,8613816508215#1308,1307015867020.37481173e31ea469bcaa310cf8d7d980., 
ufdr2,8613816595415#3432,1307015867020.afbf02ef235cabe66026f7c393d79bc0. from 
C4C4.site,60020,1307015130114
2011-06-02 19:57:49,083 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x30502c278f Unable to get data of znode 
/hbase/unassigned/c5a8a47d9c84f417b9fcc4c8019e7c7e because node does not exist 
(not necessarily an error)
2011-06-02 19:57:49,083 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Trying to process the split of 37481173e31ea469bcaa310cf8d7d980, but it was 
already done and one daughter is on region server 
serverName=C4C4.site,60020,1307015130114, load=(requests=0, regions=0, 
usedHeap=0, maxHeap=0)
2011-06-02 19:57:56,468 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr2,8613819021840#1446,1307014756068.f865d41d918297f30b576b9ea3ccea07.: 
Daughters; 
ufdr2,8613819021840#1446,1307015873554.baa21e4f0cfa5840f009d0fac8e83d15., 
ufdr2,8613819104397#3916,1307015873554.fb63f608e5e37f5e85d71c925bc78010. from 
C4C3.site,60020,1307015129703
2011-06-02 19:57:56,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x30502c278f Unable to get data of znode 
/hbase/unassigned/f865d41d918297f30b576b9ea3ccea07 because node does not exist 
(not necessarily an error)
2011-06-02 19:57:56,472 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr2,8613819021840#1446,1307014756068.f865d41d918297f30b576b9ea3ccea07.: 
Daughters; 
ufdr2,8613819021840#1446,1307015873554.baa21e4f0cfa5840f009d0fac8e83d15., 
ufdr2,8613819104397#3916,1307015873554.fb63f608e5e37f5e85d71c925bc78010. from 
C4C3.site,60020,1307015129703
2011-06-02 19:57:56,474 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x30502c278f Unable to get data of znode 
/hbase/unassigned/f865d41d918297f30b576b9ea3ccea07 because node does not exist 
(not necessarily an error)
2011-06-02 19:57:56,474 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Trying to process the split of baa21e4f0cfa5840f009d0fac8e83d15, but it was 
already done and one daughter is on region server 
serverName=C4C3.site,60020,1307015129703, load=(requests=0, regions=0, 
usedHeap=0, maxHeap=0)


Thanks for your hint. It should be a 60-second timeout; the region server 
repeated the message about every 60s.

2011-05-08 17:43:45,507 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:44:45,521 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6

It seems to be a race on regionsInTransition, so the IPC was blocked. I will 
try to reproduce it; a sketch of that kind of contention follows.
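
To make that suspicion concrete, a minimal sketch of the kind of monitor 
contention being described (names invented for illustration; not the actual 
master code):

{code}
// Illustration: if a handler thread holds the regionsInTransition monitor
// for a long time (e.g. while processing repeated split reports), every IPC
// handler needing the same monitor blocks, and the region server's
// regionServerReport call times out after ~60s and is retried.
class ContentionSketch {
  private final Object regionsInTransition = new Object();

  void processSplitReport() {
    synchronized (regionsInTransition) {
      // long-running work while holding the monitor ...
    }
  }

  void regionServerReport() {
    synchronized (regionsInTransition) {  // blocks until the handler exits
      // update server load and in-transition state ...
    }
  }
}
{code}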

HMaster logs: it received many RS_ZK_REGION_CLOSED messages. 
2011-05-08 17:43:45,157 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x22fcd582836003d Retrieved 125 byte(s) of data from znode 
/hbase/unassigned/83c05d9ead23d9a260edf30dc8739cf7 and set watcher; 
region=ufdr,2011050802#8613815394007#0610,1304847545412.83c05d9ead23d9a260edf30dc8739cf7.,
 server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSING
2011-05-08 17:43:45,525 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:43:48,943 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x22fcd582836003d Retrieved 125 byte(s) of data from znode 
/hbase/unassigned/5e3bacf3f43b6bad874e80c2f971e632 and set watcher; 

[jira] [Updated] (HBASE-3892) Table can't disable

2011-06-02 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-3892:
--

Attachment: AssignmentManager_90v3.patch

 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90.patch, 
 AssignmentManager_90v2.patch, AssignmentManager_90v3.patch, logs.rar


 In TimeoutMonitor: 
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should send the zk message again when closing the region times out;
 in this case, some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // disable table and master sent Close message to region server, Region state 
 was set PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // received splitting message and cleared Region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 

[jira] [Updated] (HBASE-3892) Table can't disable

2011-06-02 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-3892:
--

Attachment: (was: AssignmentManager_90.patch)

 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90v2.patch, 
 AssignmentManager_90v3.patch, logs.rar


 In TimeoutMonitor: 
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should send the zk message again when closing the region times out;
 in this case, some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // disable table and master sent Close message to region server, Region state 
 was set PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // Received the splitting message and cleared the region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
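
The TimeoutMonitor change suggested above could look roughly like the sketch
below. This is a minimal, self-contained sketch of the idea only, not the
attached patch; all class, interface, and method names here are hypothetical.

// Minimal sketch (hypothetical names): on a close timeout, re-drive the
// CLOSED handling if the ZK node already reports RS_ZK_REGION_CLOSED,
// otherwise resend the CLOSE message so it cannot be lost for good.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TimeoutMonitorSketch {
  interface Zk { String nodeState(String region); }   // stand-in for ZK access
  interface Rpc { void sendClose(String region); }    // stand-in for master->RS RPC

  private final Zk zk;
  private final Rpc rpc;
  // Regions currently in PENDING_CLOSE, with the time the CLOSE was sent.
  private final Map<String, Long> pendingClose = new ConcurrentHashMap<String, Long>();

  TimeoutMonitorSketch(Zk zk, Rpc rpc) { this.zk = zk; this.rpc = rpc; }

  void onTimeout(String region) {
    if (!pendingClose.containsKey(region)) {
      return;                                         // region is not PENDING_CLOSE
    }
    if ("RS_ZK_REGION_CLOSED".equals(zk.nodeState(region))) {
      handleClosed(region);                           // CLOSED event was lost: re-drive it
    } else {
      rpc.sendClose(region);                          // resend the CLOSE message
    }
  }

  private void handleClosed(String region) {
    pendingClose.remove(region);                      // region is now closed
  }
}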
 

[jira] [Commented] (HBASE-3892) Table can't disable

2011-06-01 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042050#comment-13042050
 ] 

gaojinchao commented on HBASE-3892:
---

I kept digging and found that the repeated message was sent by the region
server.

If regionServerReport throws an exception, the region server reconnects to the
HMaster and sends the same message again.

//region server logs.
2011-05-08 17:43:45,507 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:44:45,521 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:45:45,524 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:46:45,528 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:47:45,531 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:48:45,535 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:49:46,091 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:50:46,096 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:51:46,099 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6
2011-05-08 17:52:46,104 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 
C4C1.site:6

//region server code.
List<HMsg> tryRegionServerReport(final List<HMsg> outboundMessages)
    throws IOException {
  this.serverInfo.setLoad(buildServerLoad());
  this.requestCount.set(0);
  addOutboundMsgs(outboundMessages);
  HMsg[] msgs = null;
  while (!this.stopped) {
    try {
      msgs = this.hbaseMaster.regionServerReport(this.serverInfo,
          outboundMessages.toArray(HMsg.EMPTY_HMSG_ARRAY),
          getMostLoadedRegions());
      break;
    } catch (IOException ioe) {
      if (ioe instanceof RemoteException) {
        ioe = ((RemoteException) ioe).unwrapRemoteException();
      }
      if (ioe instanceof YouAreDeadException) {
        // This will be caught and handled as a fatal error in run()
        throw ioe;
      }
      // Couldn't connect to the master, get location from zk and reconnect
      // Method blocks until new master is found or we are stopped
      getMaster();
    }
  }

Why did regionServerReport throw an exception?

It seems the HMaster was busy and its IPC responder was blocked.

HMaster logs:
2011-05-08 17:44:25,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server 
Responder, call regionServerReport(serverName=C4C4.site,60020,1304820199467, 
load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175), 
[Lorg.apache.hadoop.hbase.HMsg;@520ed128, 
[Lorg.apache.hadoop.hbase.HRegionInfo;@4ac5c32e) from 157.5.100.4:50187: output 
error
2011-05-08 17:44:25,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 11 on 6 caught: java.nio.channels.ClosedChannelException
at 
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at 
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
at 
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at 
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at 
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
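
In other words, the master may have processed the report, but the response
write failed, so the region server cannot tell success from failure and
resends the same HMsgs. The report RPC is effectively at-least-once, and the
master has to tolerate duplicate messages. A toy sketch of that failure mode
(hypothetical names, not HBase code):

import java.io.IOException;
import java.util.List;

class AtLeastOnceReportSketch {
  interface Master { void report(List<String> msgs) throws IOException; }

  // Resend the same messages until a response arrives. If the master handled
  // the call but the response was lost (e.g. a ClosedChannelException on the
  // responder), the same messages are delivered more than once.
  static void reportUntilAcked(Master master, List<String> msgs)
      throws InterruptedException {
    while (true) {
      try {
        master.report(msgs);   // may succeed server-side yet fail client-side
        return;
      } catch (IOException lostResponse) {
        Thread.sleep(1000);    // reconnect/back off, then resend the same msgs
      }
    }
  }
}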



 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90.patch, 
 AssignmentManager_90v2.patch, logs.rar


 In TimeoutMonitor:
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should resend the ZK message when the region close times out,
 because in this case some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // Disable table: the master sent a CLOSE message to the region server, and the
 region state was set to PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 

[jira] [Commented] (HBASE-3892) Table can't disable

2011-06-01 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042062#comment-13042062
 ] 

gaojinchao commented on HBASE-3892:
---

The patch (AssignmentManager_90v2) looks beneficial. Thanks.

 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90.patch, 
 AssignmentManager_90v2.patch, logs.rar


 In TimeoutMonitor:
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should resend the ZK message when the region close times out,
 because in this case some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // Disable table: the master sent a CLOSE message to the region server, and the
 region state was set to PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // Received the splitting message and cleared the region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: 
 

[jira] [Commented] (HBASE-3892) Table can't disable

2011-05-31 Thread gaojinchao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13041536#comment-13041536
 ] 

gaojinchao commented on HBASE-3892:
---

Hi, Stack. I made a mistake in the above analysis.
I read the code again; the root cause is that the splitting message was repeated.

2011-05-08 17:42:45,514 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467

//The master started unassigning (closing) the region.
2011-05-08 17:43:37,599 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Starting unassignment of region 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 (offlining)

2011-05-08 17:43:45,525 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467

// Set the RIT state and sent a CLOSE message
2011-05-08 17:44:25,745 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Sent CLOSE to serverName=C4C4.site,60020,1304820199467, load=(requests=0, 
regions=123, usedHeap=4097, maxHeap=8175) for region 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.

//Received the split message again, and the RIT entry was deleted, so the
closed event could not be processed (see the sketch at the end of this message).

2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467

//The repeated split message overwrote the region state:
2011-05-08 17:46:45,303 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Overwriting 4418fb197685a21f77e151e401cf8b66 on 
serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
usedHeap=4097, maxHeap=8175)
2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
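
Given that diagnosis, the master-side fix direction is to make REGION_SPLIT
handling idempotent, so that a resent split message cannot clobber a daughter
region's PENDING_CLOSE state. A minimal sketch of such a guard follows; the
names are hypothetical, and this is not the attached AssignmentManager patch:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SplitDedupSketch {
  // Parents whose split has already been processed.
  private final Map<String, Boolean> processedSplits =
      new ConcurrentHashMap<String, Boolean>();
  // Region -> state, e.g. "PENDING_CLOSE" or "OFFLINE".
  private final Map<String, String> regionsInTransition =
      new ConcurrentHashMap<String, String>();

  // Handle REGION_SPLIT at most once per parent region; a duplicate report
  // must not overwrite an in-flight state such as PENDING_CLOSE on a daughter.
  void onRegionSplit(String parent, String daughterA, String daughterB) {
    if (processedSplits.putIfAbsent(parent, Boolean.TRUE) != null) {
      return;  // duplicate split message: already handled, ignore it
    }
    // First delivery only: mark the daughters for assignment unless they are
    // already in transition (e.g. the master is closing one of them).
    regionsInTransition.putIfAbsent(daughterA, "OFFLINE");
    regionsInTransition.putIfAbsent(daughterB, "OFFLINE");
  }
}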

[jira] [Updated] (HBASE-3892) Table can't disable

2011-05-31 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-3892:
--

Attachment: AssignmentManager_90v2.patch

 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: AssignmentManager_90.patch, 
 AssignmentManager_90v2.patch, logs.rar


 In TimeoutMonitor:
 if the node exists and the node state is RS_ZK_REGION_CLOSED,
 we should resend the ZK message when the region close times out,
 because in this case some messages may be lost.
 I see. It seems like a bug. This is my analysis.
 // Disable table: the master sent a CLOSE message to the region server, and the
 region state was set to PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // Received the splitting message and cleared the region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 
