[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-09-04 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-9387:
--

Fix Version/s: 0.98.0
 Hadoop Flags: Reviewed

Integrated to trunk.

Thanks for the reviews.

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Critical
 Fix For: 0.98.0

 Attachments: 9387-v1.txt, 9387-v3.txt, 9387-v4.2.txt, 9387-v4.3.txt, 
 9387-v4.4.txt, 9387-v4.txt, 9387-v5.txt, 9387-v6.txt, 9387-v7.txt, 
 9387-v8.txt, 9387-v9.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt


 I observed a test timeout when running against Hadoop 2.1.0 with distributed 
 log replay turned on.
 It looks like the region state for 1588230740 became inconsistent between the 
 master and the surviving region server:
 {code}
 2013-08-29 22:15:34,180 INFO  [AM.ZK.Worker-pool2-t4] master.RegionStates(299): Onlined 1588230740 on kiyo.gq1.ygridcore.net,57016,1377814510039
 ...
 2013-08-29 22:15:34,587 DEBUG [Thread-221] client.HConnectionManager$HConnectionImplementation(1269): locateRegionInMeta parentTable=hbase:meta, metaLocation={region=hbase:meta,,1.1588230740, hostname=kiyo.gq1.ygridcore.net,57016,1377814510039, seqNum=0}, attempt=2 of 35 failed; retrying after sleep of 302 because: org.apache.hadoop.hbase.exceptions.RegionOpeningException: Region is being opened: 1588230740
   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2574)
   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3949)
   at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2733)
   at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:26965)
   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2063)
   at org.apache.hadoop.hbase.ipc.RpcServer$CallRunner.run(RpcServer.java:1800)
   at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:165)
   at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:41)
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-09-04 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-9387:
--

Fix Version/s: 0.96.0

Integrated to 0.96 as well.

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Critical
 Fix For: 0.98.0, 0.96.0

 Attachments: 9387-v1.txt, 9387-v3.txt, 9387-v4.2.txt, 9387-v4.3.txt, 
 9387-v4.4.txt, 9387-v4.txt, 9387-v5.txt, 9387-v6.txt, 9387-v7.txt, 
 9387-v8.txt, 9387-v9.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-09-04 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-9387:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to 0.96 and trunk so resolving.

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Critical
 Fix For: 0.98.0, 0.96.0

 Attachments: 9387-v1.txt, 9387-v3.txt, 9387-v4.2.txt, 9387-v4.3.txt, 
 9387-v4.4.txt, 9387-v4.txt, 9387-v5.txt, 9387-v6.txt, 9387-v7.txt, 
 9387-v8.txt, 9387-v9.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-09-03 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-9387:
--

Attachment: 9387-v8.txt

Patch v8 moves the znode existence check and subsequent abortion to 
transitionToOpened().

This is to avoid unnecessary region server abortion.

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Critical
 Attachments: 9387-v1.txt, 9387-v3.txt, 9387-v4.2.txt, 9387-v4.3.txt, 
 9387-v4.4.txt, 9387-v4.txt, 9387-v5.txt, 9387-v6.txt, 9387-v7.txt, 
 9387-v8.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-09-03 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-9387:
--

Attachment: 9387-v9.txt

Patch v9 uses different messages for znode version mismatch and znode 
disappearance.
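
The distinction patch v9 draws can be pictured with a small sketch. This is not the patch itself: the class, the method, and the idea of passing the current znode version as an OptionalInt are all assumptions made for illustration, and none of these names are HBase or ZooKeeper APIs.

```java
import java.util.OptionalInt;

/** Illustrative only: hypothetical names, not HBase or ZooKeeper APIs. */
final class TransitionFailureMessages {

    /**
     * Builds a log message for a znode transition that has already failed.
     *
     * @param region          encoded region name
     * @param expectedVersion znode version the region server last saw
     * @param actualVersion   current znode version, or empty if the znode is gone
     */
    static String describe(String region, int expectedVersion, OptionalInt actualVersion) {
        if (!actualVersion.isPresent()) {
            // The znode no longer exists at all.
            return "znode for region " + region + " disappeared during the transition";
        }
        // The znode exists but was modified by someone else in the meantime.
        return "znode version mismatch for region " + region
            + ": expected " + expectedVersion + ", found " + actualVersion.getAsInt();
    }

    public static void main(String[] args) {
        System.out.println(describe("1588230740", 2, OptionalInt.empty()));
        System.out.println(describe("1588230740", 2, OptionalInt.of(3)));
    }
}
```

Separate messages make it possible to tell from the log alone which of the two failure modes was hit.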

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Critical
 Attachments: 9387-v1.txt, 9387-v3.txt, 9387-v4.2.txt, 9387-v4.3.txt, 
 9387-v4.4.txt, 9387-v4.txt, 9387-v5.txt, 9387-v6.txt, 9387-v7.txt, 
 9387-v8.txt, 9387-v9.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-30 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-9387:
--

Attachment: 9387-v3.txt

Thanks for the comments.

Patch v3 also handles the failure scenario in 
tryTransitionFromOfflineToFailedOpen().

Overnight I looped TestFullLogReconstruction 100 times, with patch v1, on the 
same machine where this issue was first produced.
All runs passed.

A follow-on JIRA can be filed to make region transition handling better.

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Critical
 Attachments: 9387-v1.txt, 9387-v3.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-30 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-9387:
-

Priority: Critical  (was: Major)

Suggested workaround would be to review all transitions and, on edges such as 
this one (going from OPENING to OPENED), do a radical abort if the transition 
fails (not just for the meta region).

Then, in another issue, step back and revisit our system for managing region 
manipulation.  It is way too complex, consuming way too many hours of eng. 
time, and there are holes.

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Critical
 Attachments: 9387-v1.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-30 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-9387:
--

Attachment: 9387-v4.txt

Patch v4 removes the LOG.warn() statements.

Let me see if I can write a test.

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Critical
 Attachments: 9387-v1.txt, 9387-v3.txt, 9387-v4.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-30 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-9387:
--

Attachment: 9387-v5.txt

Patch v5 checks whether the znode exists.
If the znode doesn't exist, the region server is aborted.
Otherwise a warning is logged.
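
The decision described for patch v5 can be sketched roughly as follows. It assumes the check runs only after a transition attempt has failed, which the comment does not state explicitly. The class, enum, and method names are hypothetical and do not come from the patch or from HBase.

```java
/** Illustrative only: hypothetical names, not taken from the patch or from HBase. */
final class OpenedTransitionSketch {

    /** What the region server does after attempting the transition. */
    enum Action { PROCEED, LOG_WARNING, ABORT_SERVER }

    /**
     * @param transitionSucceeded whether the znode transition went through
     * @param znodeExists         whether the region's znode still exists
     */
    static Action decide(boolean transitionSucceeded, boolean znodeExists) {
        if (transitionSucceeded) {
            return Action.PROCEED;
        }
        // Assumption: the existence check is only consulted on failure.
        return znodeExists ? Action.LOG_WARNING : Action.ABORT_SERVER;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, true));
        System.out.println(decide(false, true));
        System.out.println(decide(false, false));
    }
}
```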

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Critical
 Attachments: 9387-v1.txt, 9387-v3.txt, 9387-v4.txt, 9387-v5.txt, 
 hbase-9387.patch, org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-30 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-9387:
--

Attachment: 9387-v4.2.txt

The abort() call in tryTransitionFromOfflineToFailedOpen() made 
TestRegionServerNoMaster fail.

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Critical
 Attachments: 9387-v1.txt, 9387-v3.txt, 9387-v4.2.txt, 9387-v4.txt, 
 9387-v5.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-30 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-9387:
--

Attachment: 9387-v4.3.txt

Reused TestOpenRegionHandler#testYankingRegionFromUnderIt() for verification of 
region server abortion.

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Critical
 Attachments: 9387-v1.txt, 9387-v3.txt, 9387-v4.2.txt, 9387-v4.3.txt, 
 9387-v4.txt, 9387-v5.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-30 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-9387:
--

Attachment: 9387-v4.4.txt

Patch v4.4 removes unnecessary change.

MockRegionServer#abort() calls stop() - HRegionServer#abort() does the same.

Added comment in TestOpenRegionHandler#testYankingRegionFromUnderIt() 
explaining why region server abortion is expected.
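
The abort-delegates-to-stop pattern mentioned above can be sketched with a stand-in class. This is not the real MockRegionServer or HRegionServer: the class, its fields, and its method signatures are hypothetical and only show how a test could observe an abort through the stopped flag.

```java
/** Hypothetical stand-in for a mock server; not the real MockRegionServer. */
final class AbortDelegatesToStopSketch {
    private boolean stopped;
    private String reason;

    /** Marks the server as stopped and remembers why. */
    void stop(String why) {
        this.stopped = true;
        this.reason = why;
    }

    /** abort() adds the cause to the reason and then delegates to stop(). */
    void abort(String why, Throwable cause) {
        String detail = (cause == null) ? "" : " (" + cause.getMessage() + ")";
        stop("Aborted: " + why + detail);
    }

    boolean isStopped() {
        return stopped;
    }

    String getReason() {
        return reason;
    }

    public static void main(String[] args) {
        AbortDelegatesToStopSketch server = new AbortDelegatesToStopSketch();
        server.abort("znode disappeared", null);
        System.out.println(server.isStopped() + " " + server.getReason());
    }
}
```

Because abort() funnels into stop(), a test only needs to check the stopped state to confirm that the abort path was taken.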

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Critical
 Attachments: 9387-v1.txt, 9387-v3.txt, 9387-v4.2.txt, 9387-v4.3.txt, 
 9387-v4.4.txt, 9387-v4.txt, 9387-v5.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-30 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-9387:
--

Attachment: 9387-v6.txt

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Critical
 Attachments: 9387-v1.txt, 9387-v3.txt, 9387-v4.2.txt, 9387-v4.3.txt, 
 9387-v4.4.txt, 9387-v4.txt, 9387-v5.txt, 9387-v6.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-30 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-9387:
--

Attachment: 9387-v7.txt

Patch v7 adds testRegionServerAbortionDueToFailureTransitioningToOpened to 
TestOpenRegionHandler, which simulates the scenario described in this JIRA.
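
For readers without the patch at hand, here is a rough sketch of how such a scenario might be simulated by fault injection. It is not the code from 9387-v7.txt. SketchOpenHandler, its two transition methods and handlerWithFailingTransitions() are hypothetical names that only model the control flow, assuming the handler first tries to move the region to OPENED and falls back to FAILED_OPEN.

```java
public class OpenHandlerFaultInjectionSketch {

    /** Hypothetical handler; models only the control flow, not HBase's OpenRegionHandler. */
    public static class SketchOpenHandler {
        private boolean serverAborted = false;
        private boolean regionOnline = false;

        /** Overridable step: report OPENING -> OPENED. Succeeds by default. */
        protected boolean transitionToOpened() {
            return true;
        }

        /** Overridable step: report OPENING -> FAILED_OPEN. Succeeds by default. */
        protected boolean transitionToFailedOpen() {
            return true;
        }

        public void process() {
            if (transitionToOpened()) {
                regionOnline = true;
                return;
            }
            // The open could not be reported; try to report the failure instead.
            if (!transitionToFailedOpen()) {
                // Neither outcome could be recorded, so give up on this server.
                serverAborted = true;
            }
        }

        public boolean isServerAborted() {
            return serverAborted;
        }

        public boolean isRegionOnline() {
            return regionOnline;
        }
    }

    /** Builds a handler whose two transitions are both forced to fail. */
    public static SketchOpenHandler handlerWithFailingTransitions() {
        return new SketchOpenHandler() {
            @Override
            protected boolean transitionToOpened() {
                return false;
            }

            @Override
            protected boolean transitionToFailedOpen() {
                return false;
            }
        };
    }

    public static void main(String[] args) {
        SketchOpenHandler handler = handlerWithFailingTransitions();
        handler.process();
        if (!handler.isServerAborted() || handler.isRegionOnline()) {
            throw new AssertionError("expected an abort and no online region");
        }
        System.out.println("server aborted as expected");
    }
}
```

Overriding the transition steps in an anonymous subclass keeps the sketch free of any ZooKeeper or cluster setup. The real test may inject the failure differently.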

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Critical
 Attachments: 9387-v1.txt, 9387-v3.txt, 9387-v4.2.txt, 9387-v4.3.txt, 
 9387-v4.4.txt, 9387-v4.txt, 9387-v5.txt, 9387-v6.txt, 9387-v7.txt, 
 hbase-9387.patch, org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-29 Thread Jeffrey Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey Zhong updated HBASE-9387:
-

Summary: Region could get lost during assignment  (was: 
TestFullLogReconstruction#testReconstruction occasionally fails when 
distributed log replay is turned on)

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.95.2
Reporter: Ted Yu
 Attachments: 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-29 Thread Jeffrey Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey Zhong updated HBASE-9387:
-

Component/s: Region Assignment

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
 Attachments: 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-29 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-9387:
--

Attachment: 9387-v1.txt

First attempt at fixing the bug.

If OpenRegionHandler#tryTransitionFromOpeningToFailedOpen() cannot transition 
the region to FAILED_OPEN, the region server aborts.

In the test, one more region server is added.
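
The rule described above can be sketched in isolation as follows. This is a simplified model, not the contents of 9387-v1.txt. RegionStateStore, FakeRegionServer and handleFailedOpen() are hypothetical names, and the assumption is that aborting lets the master's handling of the dead server reassign the region.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class AbortOnFailedTransitionSketch {

    /** Hypothetical stand-in for the ZooKeeper-backed region state (not an HBase type). */
    public interface RegionStateStore {
        /** Returns true only if the OPENING -> FAILED_OPEN change was recorded. */
        boolean transitionOpeningToFailedOpen(String encodedRegionName);
    }

    /** Hypothetical stand-in for the region server's abort hook. */
    public static final class FakeRegionServer {
        private final AtomicBoolean aborted = new AtomicBoolean(false);

        public void abort(String reason) {
            aborted.set(true);
        }

        public boolean isAborted() {
            return aborted.get();
        }
    }

    /**
     * Called after a region open has failed. If the failure cannot be
     * recorded, the server aborts instead of leaving the region in limbo.
     */
    public static void handleFailedOpen(RegionStateStore store,
                                        FakeRegionServer server,
                                        String encodedRegionName) {
        if (!store.transitionOpeningToFailedOpen(encodedRegionName)) {
            server.abort("Unable to move " + encodedRegionName + " to FAILED_OPEN");
        }
    }

    public static void main(String[] args) {
        // Transition recorded: the server keeps running.
        FakeRegionServer healthy = new FakeRegionServer();
        handleFailedOpen(name -> true, healthy, "1588230740");
        System.out.println("aborted after recorded failure: " + healthy.isAborted());

        // Transition not recorded: the server aborts.
        FakeRegionServer stuck = new FakeRegionServer();
        handleFailedOpen(name -> false, stuck, "1588230740");
        System.out.println("aborted after unrecorded failure: " + stuck.isAborted());
    }
}
```

Aborting is a blunt remedy, which is presumably why the test needs an additional region server to take over the region.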

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
 Attachments: 9387-v1.txt, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-29 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-9387:
--

Assignee: Ted Yu
  Status: Patch Available  (was: Open)

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
 Attachments: 9387-v1.txt, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-29 Thread Jeffrey Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey Zhong updated HBASE-9387:
-

Attachment: hbase-9387.patch

[~te...@apache.org] I posted a patch to minimize the situations in which the RS aborts.

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
 Attachments: 9387-v1.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-29 Thread Jeffrey Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey Zhong updated HBASE-9387:
-

Attachment: (was: hbase-9387.patch)

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
 Attachments: 9387-v1.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-29 Thread Jeffrey Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey Zhong updated HBASE-9387:
-

Attachment: hbase-9387.patch

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
 Attachments: 9387-v1.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-29 Thread Jeffrey Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey Zhong updated HBASE-9387:
-

Attachment: (was: hbase-9387.patch)

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
 Attachments: 9387-v1.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt




[jira] [Updated] (HBASE-9387) Region could get lost during assignment

2013-08-29 Thread Jeffrey Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey Zhong updated HBASE-9387:
-

Attachment: hbase-9387.patch

 Region could get lost during assignment
 ---

 Key: HBASE-9387
 URL: https://issues.apache.org/jira/browse/HBASE-9387
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 0.95.2
Reporter: Ted Yu
Assignee: Ted Yu
 Attachments: 9387-v1.txt, hbase-9387.patch, 
 org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt

