[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15855356#comment-15855356
 ] 

Hadoop QA commented on HBASE-17275:
-----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 18s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 
4s {color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s 
{color} | {color:green} branch-1 passed with JDK v1.8.0_121 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s 
{color} | {color:green} branch-1 passed with JDK v1.7.0_80 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
56s {color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
18s {color} | {color:green} branch-1 passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 56s 
{color} | {color:red} hbase-server in branch-1 has 2 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s 
{color} | {color:green} branch-1 passed with JDK v1.8.0_121 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s 
{color} | {color:green} branch-1 passed with JDK v1.7.0_80 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
43s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 33s 
{color} | {color:green} the patch passed with JDK v1.8.0_121 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 33s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s 
{color} | {color:green} the patch passed with JDK v1.7.0_80 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 36s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
56s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
17s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
15m 22s {color} | {color:green} The patch does not cause any errors with Hadoop 
2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green} 0m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 
19s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s 
{color} | {color:green} the patch passed with JDK v1.8.0_121 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 36s 
{color} | {color:green} the patch passed with JDK v1.7.0_80 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 86m 41s 
{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
19s {color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 117m 25s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:e01ee2f |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12851294/HBASE-17275-branch-1.v2.patch
 |
| JIRA Issue | HBASE-17275 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  
hbaseanti  checkstyle  compile  |
| uname | Linux 1f842921c430 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 
15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/hbase.sh |
| git revision | branch-1 / f708d98 |
| Default Java | 1.7.0_80 |
| Multi-JDK versions |  /usr/lib/jvm/java-8-oracle:1.8.0_121 
/usr/lib/jvm/java-7-oracle:1.7.0_80 |
| findbugs | v3.0.0 |
| findbugs | 
https://builds.apache.org/job/PreCommit-HBASE-Build/5601/artifact/patchprocess/branch-findbugs-hbase-server-warnings.html
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/5601/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/5601/console |
| Powered by | Apache Yetus 0.3.0   http://yetus.apache.org |


This message was automatically generated.



> Assign timeout cause region unassign forever
> --------------------------------------------
>
>                 Key: HBASE-17275
>                 URL: https://issues.apache.org/jira/browse/HBASE-17275
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 1.2.3, 1.1.7
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>         Attachments: HBASE-17275-branch-1.patch, HBASE-17275-branch-1.v2.patch
>
>
> This is a real cased happened in my test cluster.
> I have more 8000 regions to assign when I restart a cluster, but I only 
> started one regionserver. That means master need to assign these 8000 regions 
> to a single server(I know it is not right, but just for testing).
> The rs recevied the open region rpc and began to open regions. But the due to 
> the hugh number of regions, , master timeout the rpc call(but actually some 
> region had already opened) after 1 mins, as you can see from log 1.
> {noformat}
> 1. 2016-11-22 10:17:32,285 INFO  [example.org:30001.activeMasterManager] 
> master.AssignmentManager: Unable to communicate with 
> example.org,30003,1479780976834 in order to assign regions,
> java.io.IOException: Call to /example.org:30003 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1, waitTime=60001, 
> operationTimeout=60000 expired.
>         at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1338)
>         at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1272)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:290)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.openRegion(AdminProtos.java:30177)
>         at 
> org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:1000)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1719)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2828)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2775)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assignAllUserRegions(AssignmentManager.java:2876)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.processDeadServersAndRegionsInTransition(AssignmentManager.java:646)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.joinCluster(AssignmentManager.java:493)
>         at 
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:796)
>         at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:188)
>         at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1711)
>         at java.lang.Thread.run(Thread.java:756)
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1, 
> waitTime=60001, operationTimeout=60000 expired.
>         at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:81)
>         at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1246)
>         ... 14 more  
> {noformat}
> for the region 7e9aee32eb98a6fc9d503b99fc5f9615(like many others), after 
> timeout, master use a pool to re-assign them, as in 2
> {noformat}
> 2. 2016-11-22 10:17:32,303 DEBUG [AM.-pool1-t26] master.AssignmentManager: 
> Force region state offline {7e9aee32eb98a6fc9d503b99fc5f9615 
> state=PENDING_OPEN, ts=1479780992078, server=example.org,30003,1479780976834} 
>  
> {noformat}
> But, this region was actually opened on the rs, but (maybe) due to the hugh 
> pressure, the OPENED zk event recevied by master , as you can tell from 3, 
> "which is more than 15 seconds late"
> {noformat}
> 3. 2016-11-22 10:17:32,304 DEBUG [AM.ZK.Worker-pool2-t3] 
> master.AssignmentManager: Handling RS_ZK_REGION_OPENED, 
> server=example.org,30003,1479780976834, 
> region=7e9aee32eb98a6fc9d503b99fc5f9615, which is more than 15 seconds late, 
> current_state={7e9aee32eb98a6fc9d503b99fc5f9615 state=PENDING_OPEN, 
> ts=1479780992078, server=example.org,30003,1479780976834}
> {noformat}
> In the meantime, master still try to re-assign this region in another thread. 
> Master first close this region in case of multi assign, then change the state 
> of this region change from PENDING_OPEN >OFFLINE>PENDING_OPEN. Its RIT node 
> in zk was also transitioned to OFFLINE, as in 4,5,6,7
> {noformat}
> 4. 2016-11-22 10:17:32,321 DEBUG [AM.-pool1-t26] master.AssignmentManager: 
> Sent CLOSE to example.org,30003,1479780976834 for region 
> test,P7HQ55,1475985973151.7e9aee32eb98a6fc9d503b99fc5f9615.
> 5. 2016-11-22 10:17:32,461 INFO  [AM.-pool1-t26] master.RegionStates: 
> Transition {7e9aee32eb98a6fc9d503b99fc5f9615 state=PENDING_OPEN, 
> ts=1479781052344, server=example.org,30003,1479780976834} to 
> {7e9aee32eb98a6fc9d503b99fc5f9615 state=OFFLINE, ts=1479781052461, 
> server=example.org,30003,1479780976834}
> 6. 2016-11-22 10:17:32,469 DEBUG [AM.-pool1-t26] zookeeper.ZKAssign: 
> master:30001-0x15810b5f633015e, 
> quorum=hbase4dev04.et2sqa:2181,hbase4dev05.et2sqa:2181,hbase4dev06.et2sqa:2181,
>  baseZNode=/test-hbase11-func2 Creating (or updating) unassigned node 
> 7e9aee32eb98a6fc9d503b99fc5f9615 with OFFLINE state
> 7. 2016-11-22 10:17:32,546 INFO  [AM.-pool1-t26] master.AssignmentManager: 
> Assigning test,P7HQ55,1475985973151.7e9aee32eb98a6fc9d503b99fc5f9615. to 
> example.org,30003,1479780976834
> {noformat}
> The thread handling the OPENED zk event still operating on this region too. 
> IT change the state the region state from PENDING_OPEN to OPENED, but when 
> trying to delete the RIT zk node, it found out this zk node was in a illeagl 
> state OFFLINE, which was set in 6. So the RIT node can't be deleted. Now, the 
> inconsistent case appeared. For this region,  its state shows OPENED, but in 
> RIT node, its state shows OFFLINE 
> {noformat}
> 8. 2016-11-22 10:17:32,563 INFO  [AM.ZK.Worker-pool2-t3] master.RegionStates: 
> Transition {7e9aee32eb98a6fc9d503b99fc5f9615 state=PENDING_OPEN, 
> ts=1479781052546, server=example.org,30003,1479780976834} to 
> {7e9aee32eb98a6fc9d503b99fc5f9615 state=OPEN, ts=1479781052563, 
> server=example.org,30003,1479780976834}
> 9. 2016-11-22 10:17:32,595 WARN  [AM.ZK.Worker-pool2-t3] zookeeper.ZKAssign: 
> master:30001-0x15810b5f633015e, 
> quorum=hbase4dev04.et2sqa:2181,hbase4dev05.et2sqa:2181,hbase4dev06.et2sqa:2181,
>  baseZNode=/test-hbase11-func2 Attempting to delete unassigned node 
> 7e9aee32eb98a6fc9d503b99fc5f9615 in RS_ZK_REGION_OPENED state but node is in 
> M_ZK_REGION_OFFLINE state
> 10. 2016-11-22 10:17:32,595 WARN  [AM.ZK.Worker-pool2-t12] 
> zookeeper.ZKAssign: master:30001-0x15810b5f633015e, 
> quorum=hbase4dev04.et2sqa:2181,hbase4dev05.et2sqa:2181,hbase4dev06.et2sqa:2181,
>  baseZNode=/test-hbase11-func2 Attempting to delete unassigned node 
> 4fb51c279d694f8808739ea632673584 in RS_ZK_REGION_OPENED state but node is in 
> M_ZK_REGION_OFFLINE state
> 11. 2016-11-22 10:17:32,595 ERROR [AM.ZK.Worker-pool2-t3] 
> coordination.ZkOpenRegionCoordination: Znode of region 
> 7e9aee32eb98a6fc9d503b99fc5f9615 could not be deleted.
> {noformat}
> After a while, the second assign's zk event arrived, due to the inconsistent 
> state(expect PENDING_OPEN but OPENED), master unassign this region, then this 
> region will remain unassigned
> {noformat}
> 12. 2016-11-22 10:17:51,305 DEBUG [AM.ZK.Worker-pool2-t20] 
> master.AssignmentManager: Handling RS_ZK_REGION_OPENED, 
> server=example.org,30003,1479780976834, 
> region=7e9aee32eb98a6fc9d503b99fc5f9615, which is more than 15 seconds late, 
> current_state={7e9aee32eb98a6fc9d503b99fc5f9615 state=OPEN, ts=1479781052563, 
> server=example.org,30003,1479780976834}
> 13. 2016-11-22 10:17:51,305 WARN  [AM.ZK.Worker-pool2-t20] 
> master.AssignmentManager: Received OPENED for 
> 7e9aee32eb98a6fc9d503b99fc5f9615 from example.org,30003,1479780976834 but the 
> region isn't PENDING_OPEN/OPENING here: {7e9aee32eb98a6fc9d503b99fc5f9615 
> state=OPEN, ts=1479781052563, server=example.org,30003,1479780976834}
> 14. 2016-11-22 10:17:51,308 DEBUG [AM.ZK.Worker-pool2-t20] 
> master.AssignmentManager: Sent CLOSE to example.org,30003,1479780976834 for 
> region test,P7HQ55,1475985973151.7e9aee32eb98a6fc9d503b99fc5f9615.
> {noformat}
> Restart master may solve this issue, but due to HBASE-17264 and HBASE-17265, 
> restart master may still not helping.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to