[jira] [Commented] (HBASE-17275) Assign timeout cause region unassign forever

2017-02-07 Thread Stephen Yuan Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15856386#comment-15856386
 ] 

Stephen Yuan Jiang commented on HBASE-17275:


+1 V3 patch looks good.

> Assign timeout cause region unassign forever
> 
>
> Key: HBASE-17275
> URL: https://issues.apache.org/jira/browse/HBASE-17275
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 1.2.3, 1.1.7
> Reporter: Allan Yang
> Assignee: Allan Yang
> Attachments: HBASE-17275-branch-1.patch, HBASE-17275-branch-1.v2.patch, HBASE-17275-branch-1.v3.patch
>
>
> This is a real case that happened in my test cluster.
> I had more than 8000 regions to assign when I restarted the cluster, but I only 
> started one regionserver. That means the master needed to assign these 8000 regions 
> to a single server (I know it is not right, but it was just for testing).
> The RS received the open-region RPC and began to open regions. But due to 
> the huge number of regions, the master timed out the RPC call after 1 minute (although some 
> regions had actually already opened), as you can see from log 1.
> {noformat}
> 1. 2016-11-22 10:17:32,285 INFO  [example.org:30001.activeMasterManager] 
> master.AssignmentManager: Unable to communicate with 
> example.org,30003,1479780976834 in order to assign regions,
> java.io.IOException: Call to /example.org:30003 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1, waitTime=60001, 
> operationTimeout=6 expired.
> at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1338)
> at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1272)
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:290)
> at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.openRegion(AdminProtos.java:30177)
> at 
> org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:1000)
> at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1719)
> at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2828)
> at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2775)
> at 
> org.apache.hadoop.hbase.master.AssignmentManager.assignAllUserRegions(AssignmentManager.java:2876)
> at 
> org.apache.hadoop.hbase.master.AssignmentManager.processDeadServersAndRegionsInTransition(AssignmentManager.java:646)
> at 
> org.apache.hadoop.hbase.master.AssignmentManager.joinCluster(AssignmentManager.java:493)
> at 
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:796)
> at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:188)
> at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1711)
> at java.lang.Thread.run(Thread.java:756)
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1, 
> waitTime=60001, operationTimeout=6 expired.
> at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:81)
> at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1246)
> ... 14 more  
> {noformat}
> For the region 7e9aee32eb98a6fc9d503b99fc5f9615 (like many others), after the 
> timeout, the master used a pool to re-assign it, as in log 2.
> {noformat}
> 2. 2016-11-22 10:17:32,303 DEBUG [AM.-pool1-t26] master.AssignmentManager: 
> Force region state offline {7e9aee32eb98a6fc9d503b99fc5f9615 
> state=PENDING_OPEN, ts=1479780992078, server=example.org,30003,1479780976834} 
>  
> {noformat}
> But this region had actually been opened on the RS. (Maybe) due to the huge 
> pressure, the OPENED zk event was received late by the master, as you can tell from 
> "which is more than 15 seconds late" in log 3.
> {noformat}
> 3. 2016-11-22 10:17:32,304 DEBUG [AM.ZK.Worker-pool2-t3] 
> master.AssignmentManager: Handling RS_ZK_REGION_OPENED, 
> server=example.org,30003,1479780976834, 
> region=7e9aee32eb98a6fc9d503b99fc5f9615, which is more than 15 seconds late, 
> current_state={7e9aee32eb98a6fc9d503b99fc5f9615 state=PENDING_OPEN, 
> ts=1479780992078, server=example.org,30003,1479780976834}
> {noformat}
> In the meantime, the master still tried to re-assign this region in another thread. 
> The master first closed this region to guard against multiple assignment, then changed the state 
> of this region from PENDING_OPEN > OFFLINE > PENDING_OPEN. Its RIT node 
> in zk was also transitioned to OFFLINE, as in logs 4, 5, 6 and 7.
> {noformat}
> 4. 2016-11-22 10:17:32,321 DEBUG 

[jira] [Commented] (HBASE-17275) Assign timeout cause region unassign forever

2017-02-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855865#comment-15855865
 ] 

Hadoop QA commented on HBASE-17275:
---

| (x) *-1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 20s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 1m 42s | branch-1 passed |
| +1 | compile | 0m 32s | branch-1 passed with JDK v1.8.0_121 |
| +1 | compile | 0m 38s | branch-1 passed with JDK v1.7.0_80 |
| +1 | checkstyle | 0m 55s | branch-1 passed |
| +1 | mvneclipse | 0m 17s | branch-1 passed |
| -1 | findbugs | 1m 57s | hbase-server in branch-1 has 2 extant Findbugs warnings. |
| +1 | javadoc | 0m 28s | branch-1 passed with JDK v1.8.0_121 |
| +1 | javadoc | 0m 33s | branch-1 passed with JDK v1.7.0_80 |
| +1 | mvninstall | 0m 41s | the patch passed |
| +1 | compile | 0m 33s | the patch passed with JDK v1.8.0_121 |
| +1 | javac | 0m 33s | the patch passed |
| +1 | compile | 0m 37s | the patch passed with JDK v1.7.0_80 |
| +1 | javac | 0m 37s | the patch passed |
| +1 | checkstyle | 0m 58s | the patch passed |
| +1 | mvneclipse | 0m 19s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | hadoopcheck | 15m 24s | The patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. |
| +1 | hbaseprotoc | 0m 15s | the patch passed |
| +1 | findbugs | 2m 8s | the patch passed |
| +1 | javadoc | 0m 28s | the patch passed with JDK v1.8.0_121 |
| +1 | javadoc | 0m 33s | the patch passed with JDK v1.7.0_80 |
| -1 | unit | 122m 18s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 20s | The patch does not generate ASF License warnings. |
| | | 152m 29s | |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.master.TestMasterBalanceThrottling |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.12.6 Server=1.12.6 Image:yetus/hbase:e01ee2f |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12851333/HBASE-17275-branch-1.v3.patch |
| JIRA Issue | HBASE-17275 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 233ac2c10c04 3.13.0-100-generic #147-Ubuntu SMP Tue Oct 18 16:48:51 UTC 2016 x86_64 x86_64 x86_64 

[jira] [Commented] (HBASE-17275) Assign timeout cause region unassign forever

2017-02-07 Thread Stephen Yuan Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855525#comment-15855525
 ] 

Stephen Yuan Jiang commented on HBASE-17275:


[~allan163], if you don't want the V2 patch, at least use my first suggestion 
in your V1 patch (put the {{if(regionState.isOpened() && 
regionState.getServerName().equals(sn)) {}} inside the {{if(regionState != 
null) {}}).  It makes the code cleaner.


[jira] [Commented] (HBASE-17275) Assign timeout cause region unassign forever

2017-02-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855356#comment-15855356
 ] 

Hadoop QA commented on HBASE-17275:
---

| (x) *-1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 18s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 2m 4s | branch-1 passed |
| +1 | compile | 0m 36s | branch-1 passed with JDK v1.8.0_121 |
| +1 | compile | 0m 35s | branch-1 passed with JDK v1.7.0_80 |
| +1 | checkstyle | 0m 56s | branch-1 passed |
| +1 | mvneclipse | 0m 18s | branch-1 passed |
| -1 | findbugs | 1m 56s | hbase-server in branch-1 has 2 extant Findbugs warnings. |
| +1 | javadoc | 0m 29s | branch-1 passed with JDK v1.8.0_121 |
| +1 | javadoc | 0m 33s | branch-1 passed with JDK v1.7.0_80 |
| +1 | mvninstall | 0m 43s | the patch passed |
| +1 | compile | 0m 33s | the patch passed with JDK v1.8.0_121 |
| +1 | javac | 0m 33s | the patch passed |
| +1 | compile | 0m 36s | the patch passed with JDK v1.7.0_80 |
| +1 | javac | 0m 36s | the patch passed |
| +1 | checkstyle | 0m 56s | the patch passed |
| +1 | mvneclipse | 0m 17s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | hadoopcheck | 15m 22s | The patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. |
| +1 | hbaseprotoc | 0m 16s | the patch passed |
| +1 | findbugs | 2m 19s | the patch passed |
| +1 | javadoc | 0m 29s | the patch passed with JDK v1.8.0_121 |
| +1 | javadoc | 0m 36s | the patch passed with JDK v1.7.0_80 |
| +1 | unit | 86m 41s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 19s | The patch does not generate ASF License warnings. |
| | | 117m 25s | |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:e01ee2f |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12851294/HBASE-17275-branch-1.v2.patch |
| JIRA Issue | HBASE-17275 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 1f842921c430 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 

[jira] [Commented] (HBASE-17275) Assign timeout cause region unassign forever

2017-02-06 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855254#comment-15855254
 ] 

Ted Yu commented on HBASE-17275:


I am fine with v1.


[jira] [Commented] (HBASE-17275) Assign timeout cause region unassign forever

2017-02-06 Thread Allan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855213#comment-15855213
 ] 

Allan Yang commented on HBASE-17275:


Thanks, [~syuanjiang]. Your proposal gets rid of the duplicated code, so I 
uploaded a v2 patch that follows it for review.
But I will +1 the v1 patch: despite the duplicated code, it is easier to 
understand, and we have already deployed this code to our production env, where it 
has worked fine so far.
What do you think, [~tedyu]? 


[jira] [Commented] (HBASE-17275) Assign timeout cause region unassign forever

2017-02-06 Thread Stephen Yuan Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854913#comment-15854913
 ] 

Stephen Yuan Jiang commented on HBASE-17275:


The logic is clear (the region is already opened on the same RS, so just remove the zk 
RIT node).  

For the actual implementation, I think the patch could be written more clearly, by 
doing something like:
{code}
if (regionState != null) {
  if (regionState.isOpened() && regionState.getServerName().equals(sn)) {
    // If this region was already opened on the same RS, we don't have to
    // unassign it. It won't cause a double assign. One possible scenario of
    // what happened is HBASE-17275.
    // >>> the new code; a small worry is that it duplicates code from the other place <<<
  } else {
    // >>> the existing close region code <<<
  }
}
{code}

To solve the code duplication problem, I have a proposal (the code is a little 
hard to follow): 
{code}
 case RS_ZK_REGION_OPENED:
   // Should see OPENED after OPENING but possible after PENDING_OPEN.
   if (regionState == null
       || !regionState.isPendingOpenOrOpeningOnServer(sn)) {
     LOG.warn("Received OPENED for " + prettyPrintedRegionName
       + " from " + sn + " but the region isn't PENDING_OPEN/OPENING here: "
       + regionStates.getRegionState(encodedName));

-    if (regionState != null) {
+    if (regionState != null &&
+        (!regionState.isOpened() || !regionState.getServerName().equals(sn))) {
       // Close it without updating the internal region states,
       // so as not to create double assignments in unlucky scenarios
       // mentioned in OpenRegionHandler#process
       unassign(regionState.getRegion(), null, -1, null, false, sn);
     }
     return;
   }
   // Handle OPENED by removing from transition and deleted zk node
-  regionState =
+  // We deal with two situations here: either the region is pending open/opening in
+  // the target RS; or the region has already opened in the target RS, we just need
+  // to clean up the RIT state.
+  if (regionState.isPendingOpenOrOpeningOnServer(sn)) {
+    regionState =
       regionStates.transitionOpenFromPendingOpenOrOpeningOnServer(rt,regionState, sn);
+  }
   if (regionState != null) {
     failedOpenTracker.remove(encodedName); // reset the count, if any
     new OpenedRegionHandler(
       server, this, regionState.getRegion(), coordination, ord).process();
     updateOpenedRegionHandlerTracker(regionState.getRegion());
   }
   break;

{code}

I am ok with either approach.
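
As a rough check that the nested form and the flattened condition above decide to close the region in the same cases, here is a minimal Java sketch. The names `RegionView`, `shouldUnassignNested`, and `shouldUnassignFlat` are hypothetical and exist only for this illustration; they are not HBase classes or methods, and the sketch models only the unassign decision, not the RIT cleanup.

```java
public class OpenedEventDecision {
    // Hypothetical stand-in for the master's view of a region (not HBase's RegionState).
    static final class RegionView {
        final boolean opened;
        final String serverName;

        RegionView(boolean opened, String serverName) {
            this.opened = opened;
            this.serverName = serverName;
        }
    }

    // Nested form: returns true when the region should be closed (unassigned).
    static boolean shouldUnassignNested(RegionView state, String sn) {
        if (state != null) {
            if (state.opened && state.serverName.equals(sn)) {
                return false; // already open on the reporting server: nothing to close
            } else {
                return true;  // the existing close-region path
            }
        }
        return false; // no known state: nothing to close
    }

    // Flattened form: the same decision as a single condition.
    static boolean shouldUnassignFlat(RegionView state, String sn) {
        return state != null && (!state.opened || !state.serverName.equals(sn));
    }

    public static void main(String[] args) {
        String sn = "rs1";
        RegionView[] cases = {
            null,
            new RegionView(true, "rs1"),
            new RegionView(true, "rs2"),
            new RegionView(false, "rs1"),
            new RegionView(false, "rs2"),
        };
        // Compare both forms over every combination of (null, opened, same server).
        for (RegionView c : cases) {
            if (shouldUnassignNested(c, sn) != shouldUnassignFlat(c, sn)) {
                throw new AssertionError("forms disagree");
            }
        }
        System.out.println("both forms agree on all cases");
    }
}
```

Under these assumptions the only case that skips the close is a region already opened on the server that reported OPENED, which is the scenario described in this issue.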


[jira] [Commented] (HBASE-17275) Assign timeout cause region unassign forever

2017-02-03 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852477#comment-15852477
 ] 

Ted Yu commented on HBASE-17275:


The patch looks good to me.

[~syuanjiang]:
What do you think?


[jira] [Commented] (HBASE-17275) Assign timeout cause region unassign forever

2017-02-03 Thread Allan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852451#comment-15852451
 ] 

Allan Yang commented on HBASE-17275:


{quote}
Isn't there something that needs to be done for "which is more than 15 seconds 
late" log ?
{quote}
Since I assigned more than 8,000 regions to a single RS, the delay may be due to 
the heavy load on the ZooKeeper server, or to the serialization of the event thread 
on the client side. Yes, it definitely needs some investigation.
{quote}
i.e. we need to know whether the late event is for the first or second region 
assignment.
{quote}
It is from the first region assignment. As you can tell from the log, when the 
master received the ZK event, the second region assignment had still not been 
sent to the RS.




[jira] [Commented] (HBASE-17275) Assign timeout cause region unassign forever

2017-02-01 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849051#comment-15849051
 ] 

Ted Yu commented on HBASE-17275:


Isn't there something that needs to be done for "which is more than 15 seconds 
late" log ?
i.e. we need to know whether the late event is for the first or second region 
assignment.


[jira] [Commented] (HBASE-17275) Assign timeout cause region unassign forever

2016-12-16 Thread Stephen Yuan Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755887#comment-15755887
 ] 

Stephen Yuan Jiang commented on HBASE-17275:


I will look at all the AM-related JIRAs that [~allan163] mentioned above.


[jira] [Commented] (HBASE-17275) Assign timeout cause region unassign forever

2016-12-15 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753179#comment-15753179
 ] 

Ted Yu commented on HBASE-17275:


[~syuanjiang] is more familiar with region assignment.

Stephen:
Can you take a look at the JIRAs?


[jira] [Commented] (HBASE-17275) Assign timeout cause region unassign forever

2016-12-15 Thread Allan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753130#comment-15753130
 ] 

Allan Yang commented on HBASE-17275:


[~tedyu], can you, or someone you can find, take a look at HBASE-17264, HBASE-17265 and 
HBASE-17275? They are all related, and the fixes really do take effect in our 
environment. 


[jira] [Commented] (HBASE-17275) Assign timeout cause region unassign forever

2016-12-07 Thread Allan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15730761#comment-15730761
 ] 

Allan Yang commented on HBASE-17275:


hadoop.hbase.replication.TestSerialReplication passed locally

> Assign timeout cause region unassign forever
> 
>
> Key: HBASE-17275
> URL: https://issues.apache.org/jira/browse/HBASE-17275
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 1.2.3, 1.1.7
>Reporter: Allan Yang
>Assignee: Allan Yang
> Attachments: HBASE-17275-branch-1.patch
>
>
> This is a real case that happened in my test cluster.
> I had more than 8000 regions to assign when I restarted the cluster, but I only 
> started one regionserver. That means the master needed to assign these 8000 
> regions to a single server (I know it is not right, but it was just for testing).
> The RS received the open region RPC and began to open regions. But due to 
> the huge number of regions, the master timed out the RPC call after 1 minute 
> (although some regions had actually already opened), as you can see from log 1.
> {noformat}
> 1. 2016-11-22 10:17:32,285 INFO  [example.org:30001.activeMasterManager] master.AssignmentManager: Unable to communicate with example.org,30003,1479780976834 in order to assign regions,
> java.io.IOException: Call to /example.org:30003 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1, waitTime=60001, operationTimeout=6 expired.
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1338)
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1272)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:290)
>     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.openRegion(AdminProtos.java:30177)
>     at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:1000)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1719)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2828)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2775)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assignAllUserRegions(AssignmentManager.java:2876)
>     at org.apache.hadoop.hbase.master.AssignmentManager.processDeadServersAndRegionsInTransition(AssignmentManager.java:646)
>     at org.apache.hadoop.hbase.master.AssignmentManager.joinCluster(AssignmentManager.java:493)
>     at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:796)
>     at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:188)
>     at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1711)
>     at java.lang.Thread.run(Thread.java:756)
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1, waitTime=60001, operationTimeout=6 expired.
>     at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:81)
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1246)
>     ... 14 more
> {noformat}
> For the region 7e9aee32eb98a6fc9d503b99fc5f9615 (like many others), after the 
> timeout, the master used a pool to re-assign it, as in log 2:
> {noformat}
> 2. 2016-11-22 10:17:32,303 DEBUG [AM.-pool1-t26] master.AssignmentManager: 
> Force region state offline {7e9aee32eb98a6fc9d503b99fc5f9615 
> state=PENDING_OPEN, ts=1479780992078, server=example.org,30003,1479780976834} 
>  
> {noformat}
> But this region had actually been opened on the rs. Maybe due to the huge 
> pressure, the OPENED zk event was received late by the master, as you can tell from 
> "which is more than 15 seconds late" in log 3:
> {noformat}
> 3. 2016-11-22 10:17:32,304 DEBUG [AM.ZK.Worker-pool2-t3] 
> master.AssignmentManager: Handling RS_ZK_REGION_OPENED, 
> server=example.org,30003,1479780976834, 
> region=7e9aee32eb98a6fc9d503b99fc5f9615, which is more than 15 seconds late, 
> current_state={7e9aee32eb98a6fc9d503b99fc5f9615 state=PENDING_OPEN, 
> ts=1479780992078, server=example.org,30003,1479780976834}
> {noformat}
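
As a rough cross-check of log 3, the sketch below compares the wall-clock time of the log line with the ts field of the region state. It assumes that the master logged in UTC+8 (the log does not state a time zone) and that ts is epoch milliseconds marking the last state update; elapsed_ms is a hypothetical helper, not part of HBase. Under those assumptions the OPENED event was handled roughly 60 seconds after the region went to PENDING_OPEN, which lines up with the 1 minute rpc timeout in log 1.

```python
from datetime import datetime, timedelta, timezone

# Values copied from log 3 above.
LOG_TIME = "2016-11-22 10:17:32,304"   # when the master handled RS_ZK_REGION_OPENED
STATE_TS_MS = 1479780992078            # ts= field of the PENDING_OPEN region state

# Assumption: the master's log timestamps are in UTC+8; the log itself does not say.
ASSUMED_LOG_TZ = timezone(timedelta(hours=8))

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def elapsed_ms(log_time: str, state_ts_ms: int, log_tz: timezone) -> int:
    """Milliseconds between the state timestamp and the log line (hypothetical helper)."""
    # Log4j-style timestamp: the part after the comma is milliseconds.
    handled = datetime.strptime(log_time, "%Y-%m-%d %H:%M:%S,%f").replace(tzinfo=log_tz)
    # Integer arithmetic on timedeltas avoids float rounding of the epoch value.
    entered = EPOCH + timedelta(milliseconds=state_ts_ms)
    return (handled - entered) // timedelta(milliseconds=1)


if __name__ == "__main__":
    # 60226 under the UTC+8 assumption
    print(elapsed_ms(LOG_TIME, STATE_TS_MS, ASSUMED_LOG_TZ))
```

If the log time zone were different, the result would shift by whole hours, so the near-exact match with the 60 second timeout is itself weak evidence for the UTC+8 assumption.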
> In the meantime, the master still tried to re-assign this region in another thread. 
> The master first closed this region in case of multiple assignment, then changed the state 
> of this region from PENDING_OPEN > OFFLINE > PENDING_OPEN. Its RIT node 
> in zk was also transitioned to OFFLINE, as in logs 4, 5, 6 and 7:
> {noformat}
> 4. 2016-11-22 10:17:32,321 DEBUG [AM.-pool1-t26] master.AssignmentManager: 

[jira] [Commented] (HBASE-17275) Assign timeout cause region unassign forever

2016-12-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728628#comment-15728628
 ] 

Hadoop QA commented on HBASE-17275:
---

| (x) *-1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 16s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 2m 2s | branch-1 passed |
| +1 | compile | 0m 35s | branch-1 passed with JDK v1.8.0_111 |
| +1 | compile | 0m 39s | branch-1 passed with JDK v1.7.0_80 |
| +1 | checkstyle | 0m 58s | branch-1 passed |
| +1 | mvneclipse | 0m 18s | branch-1 passed |
| -1 | findbugs | 2m 11s | hbase-server in branch-1 has 2 extant Findbugs warnings. |
| +1 | javadoc | 0m 29s | branch-1 passed with JDK v1.8.0_111 |
| +1 | javadoc | 0m 37s | branch-1 passed with JDK v1.7.0_80 |
| +1 | mvninstall | 0m 46s | the patch passed |
| +1 | compile | 0m 35s | the patch passed with JDK v1.8.0_111 |
| +1 | javac | 0m 35s | the patch passed |
| +1 | compile | 0m 38s | the patch passed with JDK v1.7.0_80 |
| +1 | javac | 0m 38s | the patch passed |
| +1 | checkstyle | 0m 57s | the patch passed |
| +1 | mvneclipse | 0m 18s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | hadoopcheck | 16m 39s | The patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. |
| +1 | hbaseprotoc | 0m 17s | the patch passed |
| +1 | findbugs | 2m 40s | the patch passed |
| +1 | javadoc | 0m 35s | the patch passed with JDK v1.8.0_111 |
| +1 | javadoc | 0m 47s | the patch passed with JDK v1.7.0_80 |
| -1 | unit | 95m 57s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 18s | The patch does not generate ASF License warnings. |
| | | 129m 4s | |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.replication.TestSerialReplication |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.11.2 Server=1.11.2 Image:yetus/hbase:e01ee2f |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12842134/HBASE-17275-branch-1.patch |
| JIRA Issue | HBASE-17275 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 1f768afb3e18 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |