[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-06-20 Thread Stephen Yuan Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056681#comment-16056681
 ] 

Stephen Yuan Jiang commented on HBASE-18036:


[~enis], with Proc-V2 AM, the current change is no longer applicable.  
Currently, with the initial commit of the new AM, SSH calls 
AM.createAssignProcedures() with forceNewPlan=true.  Even if forceNewPlan were 
false, when we compare the existing plan's ServerName, it would not equal the 
dead server's due to the timestamp change (ServerName is hostname+port+timestamp), 
and hence a new plan/server would be used for the region assignment.  Hence, 
locality is not guaranteed to be retained.  The potential change would be more 
involved than what we have now in the 1.x code base.  I opened HBASE-18246 to 
track it (FYI, [~stack]).  
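The ServerName point above can be illustrated with a minimal sketch. This is a hypothetical stand-in class, not the real org.apache.hadoop.hbase.ServerName; it only models the hostname+port+timestamp identity to show why a restarted server never compares equal to its pre-restart self:

```java
// Simplified, hypothetical model of HBase's ServerName identity
// (hostname + port + startcode). Not the actual HBase class.
public final class ServerNameSketch {
    final String hostname;
    final int port;
    final long startcode; // timestamp taken when the region server starts

    ServerNameSketch(String hostname, int port, long startcode) {
        this.hostname = hostname;
        this.port = port;
        this.startcode = startcode;
    }

    // Full equality includes the startcode, so after a restart the "same"
    // server is a different ServerName and plan comparisons miss it.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ServerNameSketch)) return false;
        ServerNameSketch s = (ServerNameSketch) o;
        return hostname.equals(s.hostname) && port == s.port
                && startcode == s.startcode;
    }

    @Override
    public int hashCode() {
        return java.util.Objects.hash(hostname, port, startcode);
    }

    // Comparing only hostname+port would recognize the same physical host.
    boolean sameHostAndPort(ServerNameSketch s) {
        return hostname.equals(s.hostname) && port == s.port;
    }

    public static void main(String[] args) {
        ServerNameSketch before = new ServerNameSketch("rs1.example.com", 16020, 1000L);
        ServerNameSketch after  = new ServerNameSketch("rs1.example.com", 16020, 2000L);
        System.out.println(before.equals(after));          // false: startcode differs
        System.out.println(before.sameHostAndPort(after)); // true: same physical host
    }
}
```

Retaining locality across restarts would therefore require matching on hostname+port rather than the full ServerName, which is part of why the fix is more involved in the new AM.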

> Data locality is not maintained after cluster restart or SSH
> 
>
> Key: HBASE-18036
> URL: https://issues.apache.org/jira/browse/HBASE-18036
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 1.4.0, 1.3.1, 1.2.5, 1.1.10
>Reporter: Stephen Yuan Jiang
>Assignee: Stephen Yuan Jiang
> Fix For: 1.3.2, 1.1.11, 1.2.7
>
> Attachments: HBASE-18036.v0-branch-1.1.patch, 
> HBASE-18036.v0-branch-1.patch, HBASE-18036.v1-branch-1.1.patch, 
> HBASE-18036.v2-branch-1.1.patch
>
>
> After HBASE-2896 / HBASE-4402, we think data locality is maintained after 
> cluster restart.  However, we have seen some complaints about data locality 
> loss when the cluster restarts (e.g., HBASE-17963).  
> Examining the AssignmentManager#processDeadServersAndRegionsInTransition() 
> code, for cluster start I expected to hit the following code path:
> {code}
> if (!failover) {
>   // Fresh cluster startup.
>   LOG.info("Clean cluster startup. Assigning user regions");
>   assignAllUserRegions(allRegions);
> }
> {code}
> where assignAllUserRegions() would use the retainAssignment() call in the 
> LoadBalancer; however, from the master log, we usually hit the failover code path:
> {code}
> // If we found user regions out on cluster, its a failover.
> if (failover) {
>   LOG.info("Found regions out on cluster or in RIT; presuming failover");
>   // Process list of dead servers and regions in RIT.
>   // See HBASE-4580 for more information.
>   processDeadServersAndRecoverLostRegions(deadServers);
> }
> {code}
> where processDeadServersAndRecoverLostRegions() would put dead servers into SSH, 
> and SSH uses roundRobinAssignment() in the LoadBalancer.  That is why we would 
> more often see locality lost than retained during cluster restart.
> Note: the code I was looking at is close to branch-1 and branch-1.1.
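The locality difference between the two balancer strategies described above can be sketched with a toy model. The class, method signatures, and server names here are hypothetical simplifications; the real LoadBalancer retainAssignment()/roundRobinAssignment() methods operate on HRegionInfo and ServerName and handle many more cases:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two assignment strategies, using plain strings
// for regions and servers instead of HBase's real types.
public final class AssignmentSketch {
    // retainAssignment: keep each region on its previous host when that host
    // is among the live servers again, preserving HDFS block locality.
    static Map<String, String> retainAssignment(Map<String, String> lastHost,
                                                List<String> liveServers) {
        Map<String, String> plan = new LinkedHashMap<>();
        int next = 0;
        for (Map.Entry<String, String> e : lastHost.entrySet()) {
            if (liveServers.contains(e.getValue())) {
                plan.put(e.getKey(), e.getValue()); // old host alive: retain it
            } else {
                // old host gone: fall back to spreading across live servers
                plan.put(e.getKey(), liveServers.get(next++ % liveServers.size()));
            }
        }
        return plan;
    }

    // roundRobinAssignment: ignore assignment history entirely, so any
    // locality a region had on its old host is lost.
    static Map<String, String> roundRobinAssignment(List<String> regions,
                                                    List<String> liveServers) {
        Map<String, String> plan = new LinkedHashMap<>();
        for (int i = 0; i < regions.size(); i++) {
            plan.put(regions.get(i), liveServers.get(i % liveServers.size()));
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, String> lastHost = new LinkedHashMap<>();
        lastHost.put("region-a", "rs2");
        lastHost.put("region-b", "rs3");
        List<String> live = Arrays.asList("rs1", "rs2", "rs3");
        // Retained plan keeps region-a on rs2 and region-b on rs3.
        System.out.println(retainAssignment(lastHost, live));
        // Round-robin starts over from rs1, ignoring where regions lived.
        System.out.println(roundRobinAssignment(new ArrayList<>(lastHost.keySet()), live));
    }
}
```

This is why hitting the failover path (and hence SSH's round-robin assignment) on what is really a clean restart loses locality.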



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-06-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056638#comment-16056638
 ] 

Hudson commented on HBASE-18036:


FAILURE: Integrated in Jenkins build HBase-1.4 #780 (See 
[https://builds.apache.org/job/HBase-1.4/780/])
HBASE-18036 Data locality is not maintained after cluster restart or SSH 
(syuanjiangdev: rev 532e0dda16f3c5034aa337201bf6d733cc0a1c7b)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java




[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-06-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056573#comment-16056573
 ] 

Hudson commented on HBASE-18036:


SUCCESS: Integrated in Jenkins build HBase-1.2-JDK7 #154 (See 
[https://builds.apache.org/job/HBase-1.2-JDK7/154/])
HBASE-18036 Data locality is not maintained after cluster restart or SSH 
(syuanjiangdev: rev 3f9ba2f247ef0fb7cebf35a4501bd7cfa36197bc)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java




[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-06-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056556#comment-16056556
 ] 

Hudson commented on HBASE-18036:


SUCCESS: Integrated in Jenkins build HBase-1.2-JDK8 #150 (See 
[https://builds.apache.org/job/HBase-1.2-JDK8/150/])
HBASE-18036 Data locality is not maintained after cluster restart or SSH 
(syuanjiangdev: rev 3f9ba2f247ef0fb7cebf35a4501bd7cfa36197bc)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java




[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-06-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056474#comment-16056474
 ] 

Hudson commented on HBASE-18036:


FAILURE: Integrated in Jenkins build HBase-1.3-JDK7 #184 (See 
[https://builds.apache.org/job/HBase-1.3-JDK7/184/])
HBASE-18036 Data locality is not maintained after cluster restart or SSH 
(syuanjiangdev: rev 2fb68f5046a5c5dd54070148a80882ece5c9b8a1)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java




[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-06-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056428#comment-16056428
 ] 

Hudson commented on HBASE-18036:


SUCCESS: Integrated in Jenkins build HBase-1.3-IT #66 (See 
[https://builds.apache.org/job/HBase-1.3-IT/66/])
HBASE-18036 Data locality is not maintained after cluster restart or SSH 
(syuanjiangdev: rev 2fb68f5046a5c5dd54070148a80882ece5c9b8a1)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java




[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-06-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056395#comment-16056395
 ] 

Hudson commented on HBASE-18036:


SUCCESS: Integrated in Jenkins build HBase-1.2-IT #887 (See 
[https://builds.apache.org/job/HBase-1.2-IT/887/])
HBASE-18036 Data locality is not maintained after cluster restart or SSH 
(syuanjiangdev: rev 3f9ba2f247ef0fb7cebf35a4501bd7cfa36197bc)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java




[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-05-15 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011161#comment-16011161
 ] 

Enis Soztutar commented on HBASE-18036:
---

Thanks Stephen, sorry to come in late. You should file a follow-up jira to do 
the same fix in master as well, no? Or are you saying that the AMv2 code already 
handles this? 



[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-05-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009987#comment-16009987
 ] 

Hudson commented on HBASE-18036:


SUCCESS: Integrated in Jenkins build HBase-1.1-JDK8 #1952 (See 
[https://builds.apache.org/job/HBase-1.1-JDK8/1952/])
HBASE-18036 Data locality is not maintained after cluster restart or SSH 
(syuanjiangdev: rev 26cb211e1dc8f5011238de40308965f0e16a)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/BaseLoadBalancer.java
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java




[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-05-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009973#comment-16009973
 ] 

Hudson commented on HBASE-18036:


SUCCESS: Integrated in Jenkins build HBase-1.1-JDK7 #1869 (See 
[https://builds.apache.org/job/HBase-1.1-JDK7/1869/])
HBASE-18036 Data locality is not maintained after cluster restart or SSH 
(syuanjiangdev: rev 26cb211e1dc8f5011238de40308965f0e16a)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/BaseLoadBalancer.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java




[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-05-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009825#comment-16009825
 ] 

Hadoop QA commented on HBASE-18036:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 26s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
5s {color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 6s 
{color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
30s {color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
23s {color} | {color:green} branch-1 passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 3m 36s 
{color} | {color:red} hbase-server in branch-1 has 1 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 47s 
{color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
14s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 5s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
33s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
22s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
28m 8s {color} | {color:green} The patch does not cause any errors with Hadoop 
2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green} 0m 
22s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 1s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 48s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 141m 1s {color} 
| {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
22s {color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 190m 29s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.master.balancer.TestStochasticLoadBalancer2 
|
|   | hadoop.hbase.regionserver.TestRSKilledWhenInitializing |
|   | hadoop.hbase.replication.TestReplicationKillSlaveRS |
|   | hadoop.hbase.regionserver.TestCompactionInDeadRegionServer |
|   | hadoop.hbase.regionserver.TestScannerHeartbeatMessages |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.03.0-ce Server=17.03.0-ce Image:yetus/hbase:58c504e |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12867990/HBASE-18036.v0-branch-1.patch
 |
| JIRA Issue | HBASE-18036 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  
hbaseanti  checkstyle  compile  |
| uname | Linux f4eec39326f4 4.8.3-std-1 #1 SMP Fri Oct 21 11:15:43 UTC 2016 
x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/hbase.sh |
| git revision | branch-1 / 0a45282 |
| Default Java | 1.8.0_131 |
| findbugs | v3.0.0 |
| findbugs | 
https://builds.apache.org/job/PreCommit-HBASE-Build/6785/artifact/patchprocess/branch-findbugs-hbase-server-warnings.html
 |
| unit | 
https://builds.apache.org/job/PreCommit-HBASE-Build/6785/artifact/patchprocess/patch-unit-hbase-server.txt
 |
| unit test logs |  

[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-05-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009228#comment-16009228
 ] 

Hadoop QA commented on HBASE-18036:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 34s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 
46s {color} | {color:green} branch-1.1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 1s 
{color} | {color:green} branch-1.1 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
35s {color} | {color:green} branch-1.1 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
21s {color} | {color:green} branch-1.1 passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 3m 20s 
{color} | {color:red} hbase-server in branch-1.1 has 80 extant Findbugs 
warnings. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 47s 
{color} | {color:red} hbase-server in branch-1.1 failed. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
10s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 1s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 1s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
34s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
22s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
25m 0s {color} | {color:green} The patch does not cause any errors with Hadoop 
2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green} 0m 
19s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
42s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 44s 
{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 148m 59s 
{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
27s {color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 192m 22s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.regionserver.TestSplitWalDataLoss |
|   | hadoop.hbase.master.handler.TestEnableTableHandler |
| Timed out junit tests | 
org.apache.hadoop.hbase.mapreduce.TestTableInputFormat |
|   | org.apache.hadoop.hbase.snapshot.TestExportSnapshot |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.03.0-ce Server=17.03.0-ce Image:yetus/hbase:de9b245 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12867929/HBASE-18036.v2-branch-1.1.patch
 |
| JIRA Issue | HBASE-18036 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  
hbaseanti  checkstyle  compile  |
| uname | Linux 47644c8f9a24 4.8.3-std-1 #1 SMP Fri Oct 21 11:15:43 UTC 2016 
x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/hbase.sh |
| git revision | branch-1.1 / 7d820db |
| Default Java | 1.8.0_131 |
| findbugs | v3.0.0 |
| findbugs | 
https://builds.apache.org/job/PreCommit-HBASE-Build/6780/artifact/patchprocess/branch-findbugs-hbase-server-warnings.html
 |
| javadoc | 
https://builds.apache.org/job/PreCommit-HBASE-Build/6780/artifact/patchprocess/branch-javadoc-hbase-server.txt
 |
| javadoc | 
https://builds.apache.org/job/PreCommit-HBASE-Build/6780/artifact/patchprocess/patch-javadoc-hbase-server.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-HBASE-Build/6780/artifact/patchprocess/patch-unit-hbase-server.txt
 |
| unit 

[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-05-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008967#comment-16008967
 ] 

Hadoop QA commented on HBASE-18036:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 39s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 11m 
30s {color} | {color:green} branch-1.1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 13s 
{color} | {color:green} branch-1.1 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
48s {color} | {color:green} branch-1.1 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
32s {color} | {color:green} branch-1.1 passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 4m 3s 
{color} | {color:red} hbase-server in branch-1.1 has 80 extant Findbugs 
warnings. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 1m 2s 
{color} | {color:red} hbase-server in branch-1.1 failed. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
20s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 9s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 9s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
41s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
26s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
25m 49s {color} | {color:green} The patch does not cause any errors with Hadoop 
2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green} 0m 
19s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
45s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 45s 
{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 158m 27s 
{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
35s {color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 213m 47s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.replication.TestReplicationSmallTests |
|   | hadoop.hbase.replication.TestReplicationEndpoint |
|   | hadoop.hbase.client.TestMultiParallel |
|   | hadoop.hbase.master.TestAssignmentManager |
| Timed out junit tests | org.apache.hadoop.hbase.mapreduce.TestRowCounter |
|   | org.apache.hadoop.hbase.snapshot.TestExportSnapshot |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.03.0-ce Server=17.03.0-ce Image:yetus/hbase:de9b245 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12867851/HBASE-18036.v1-branch-1.1.patch
 |
| JIRA Issue | HBASE-18036 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  
hbaseanti  checkstyle  compile  |
| uname | Linux 101aa0c262d2 4.8.3-std-1 #1 SMP Fri Oct 21 11:15:43 UTC 2016 
x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/hbase.sh |
| git revision | branch-1.1 / 7d820db |
| Default Java | 1.8.0_131 |
| findbugs | v3.0.0 |
| findbugs | 
https://builds.apache.org/job/PreCommit-HBASE-Build/6773/artifact/patchprocess/branch-findbugs-hbase-server-warnings.html
 |
| javadoc | 
https://builds.apache.org/job/PreCommit-HBASE-Build/6773/artifact/patchprocess/branch-javadoc-hbase-server.txt
 |
| javadoc | 
https://builds.apache.org/job/PreCommit-HBASE-Build/6773/artifact/patchprocess/patch-javadoc-hbase-server.txt
 |
| unit | 

[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-05-12 Thread Stephen Yuan Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008684#comment-16008684
 ] 

Stephen Yuan Jiang commented on HBASE-18036:


The V1 patch has a minor change based on [~elserj]'s feedback.  It also adds 
some logging to make the change clearer.

Next up: I will apply the same logic to branch-1 and the other child branches.  
Based on [~devaraj]'s offline feedback, I will remove the newly introduced 
"hbase.master.retain.assignment" config in branch-1, but keep it in the other 
branches (the config exists so that, in case of a regression, users have a way 
to revert to the original round-robin behavior, since patch releases usually 
don't get full testing).
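For context, the escape hatch described above amounts to a boolean guard around the assignment strategy, defaulting to retaining assignments.  A minimal sketch (plain java.util.Properties standing in for HBase's Configuration; the class and method names here are invented for illustration, not the real API):

```java
import java.util.Properties;

// Illustrative sketch only: Properties stands in for HBase's Configuration,
// and the class/method names are invented for this example.
final class AssignmentChoice {
  /** Pick the balancer strategy; defaults to retaining assignments. */
  static String chooseStrategy(Properties conf) {
    boolean retain = Boolean.parseBoolean(
        conf.getProperty("hbase.master.retain.assignment", "true"));
    // true  -> hand regions back to the restarted server (keeps locality)
    // false -> original round-robin behavior (the regression escape hatch)
    return retain ? "retainAssignment" : "roundRobinAssignment";
  }
}
```

Flipping the property to "false" restores the pre-patch round-robin behavior, which is the point of keeping the config in branches that get less testing.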

> Data locality is not maintained after cluster restart or SSH
> 
>
> Key: HBASE-18036
> URL: https://issues.apache.org/jira/browse/HBASE-18036
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 1.4.0, 1.3.1, 1.2.5, 1.1.10
>Reporter: Stephen Yuan Jiang
>Assignee: Stephen Yuan Jiang
> Attachments: HBASE-18036.v0-branch-1.1.patch, 
> HBASE-18036.v1-branch-1.1.patch
>
>
> After HBASE-2896 / HBASE-4402, we think data locality is maintained after 
> a cluster restart.  However, we have seen some complaints about data locality 
> loss when the cluster restarts (e.g. HBASE-17963).  
> Examining the AssignmentManager#processDeadServersAndRegionsInTransition() 
> code, for cluster startup, I expected to hit the following code path:
> {code}
> if (!failover) {
>   // Fresh cluster startup.
>   LOG.info("Clean cluster startup. Assigning user regions");
>   assignAllUserRegions(allRegions);
> }
> {code}
> where assignAllUserRegions would use the retainAssignment() call in the 
> LoadBalancer; however, from the master log, we usually hit the failover code path:
> {code}
> // If we found user regions out on cluster, its a failover.
> if (failover) {
>   LOG.info("Found regions out on cluster or in RIT; presuming failover");
>   // Process list of dead servers and regions in RIT.
>   // See HBASE-4580 for more information.
>   processDeadServersAndRecoverLostRegions(deadServers);
> }
> {code}
> where processDeadServersAndRecoverLostRegions() would put dead servers in SSH, 
> and SSH uses roundRobinAssignment() in the LoadBalancer.  That is why we 
> lose locality more often than we retain it during a cluster restart.
> Note: the code I was looking at is close to branch-1 and branch-1.1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-05-12 Thread Stephen Yuan Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008681#comment-16008681
 ] 

Stephen Yuan Jiang commented on HBASE-18036:


[~stack], thanks for the review.  For master, I am not going to make any 
change, as the proc-v2 work would overwrite it anyway.  I plan to make the same 
change in ServerCrashProcedure in branch-1 and the other child branches.


> Data locality is not maintained after cluster restart or SSH
> 
>
> Key: HBASE-18036
> URL: https://issues.apache.org/jira/browse/HBASE-18036
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 1.4.0, 1.3.1, 1.2.5, 1.1.10
>Reporter: Stephen Yuan Jiang
>Assignee: Stephen Yuan Jiang
> Attachments: HBASE-18036.v0-branch-1.1.patch, 
> HBASE-18036.v1-branch-1.1.patch
>
>


[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-05-12 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008562#comment-16008562
 ] 

stack commented on HBASE-18036:
---

[~syuanjiang] +1 on patch. It is an improvement.  HBASE-17791 is a description 
of the more general case. We need to fix it too. For master and versions of 
hbase newer than what you were looking at, what are you thinking? Thanks.

> Data locality is not maintained after cluster restart or SSH
> 
>
> Key: HBASE-18036
> URL: https://issues.apache.org/jira/browse/HBASE-18036
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 1.4.0, 1.3.1, 1.2.5, 1.1.10
>Reporter: Stephen Yuan Jiang
>Assignee: Stephen Yuan Jiang
> Attachments: HBASE-18036.v0-branch-1.1.patch
>
>


[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-05-12 Thread Stephen Yuan Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008538#comment-16008538
 ] 

Stephen Yuan Jiang commented on HBASE-18036:


[~stack], could you help review the change?

> Data locality is not maintained after cluster restart or SSH
> 
>
> Key: HBASE-18036
> URL: https://issues.apache.org/jira/browse/HBASE-18036
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 1.4.0, 1.3.1, 1.2.5, 1.1.10
>Reporter: Stephen Yuan Jiang
>Assignee: Stephen Yuan Jiang
> Attachments: HBASE-18036.v0-branch-1.1.patch
>
>


[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-05-12 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008248#comment-16008248
 ] 

Josh Elser commented on HBASE-18036:


Looks OK to me, but I know enough to defer to someone who has more recently 
looked at SSH :)

{code}
+  public boolean isServerWithSameHostnamePortOnline(final ServerName serverName) {
+    return (findServerWithSameHostnamePortWithLock(serverName) != null);
+  }
{code}

nit: remove the unnecessary parens

{code}
+  boolean retainAssignment =
+      server.getConfiguration().getBoolean("hbase.master.retain.assignment", true);
{code}

Nice to expose the config property just in case.

> Data locality is not maintained after cluster restart or SSH
> 
>
> Key: HBASE-18036
> URL: https://issues.apache.org/jira/browse/HBASE-18036
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 1.4.0, 1.3.1, 1.2.5, 1.1.10
>Reporter: Stephen Yuan Jiang
>Assignee: Stephen Yuan Jiang
> Attachments: HBASE-18036.v0-branch-1.1.patch
>
>


[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH

2017-05-11 Thread Stephen Yuan Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007393#comment-16007393
 ] 

Stephen Yuan Jiang commented on HBASE-18036:


The V0 patch attached is my first attempt to resolve this issue - the change is 
in SSH.  By the time SSH runs, if the dead region server has already restarted 
(it will have the same hostname and port, but a different start code in its 
ServerName), SSH will try to retain locality by assigning the regions back to 
the same region server.  I introduced a config for anyone who wants to keep the 
round-robin assignment behavior.  

I forced the existing TestAssignmentManagerOnCluster tests to use the new code 
path in SSH and did not see any problems.  What is still missing is a new UT in 
TestAssignmentManagerOnCluster that tests the retain-assignment code path in 
SSH.  

For now, I'd like to post this V0 patch to get some feedback.
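The hostname/port check at the heart of the approach above can be sketched as follows (a minimal illustration with invented class and method names; the real patch works against HBase's ServerName and the ServerManager's online-server list):

```java
import java.util.List;

// Illustrative sketch only: this stand-in ServerName mirrors HBase's
// hostname + port + startcode triple; names are invented for the example.
final class RetainSketch {
  static final class ServerName {
    final String host; final int port; final long startCode;
    ServerName(String host, int port, long startCode) {
      this.host = host; this.port = port; this.startCode = startCode;
    }
  }

  /** Find an online server that is the dead server restarted: same
   *  hostname and port, but a different start code.  Returns null if
   *  none, in which case SSH would fall back to round-robin assignment. */
  static ServerName findRestartedInstance(ServerName dead, List<ServerName> online) {
    for (ServerName sn : online) {
      if (sn.host.equals(dead.host) && sn.port == dead.port
          && sn.startCode != dead.startCode) {
        return sn;
      }
    }
    return null;
  }
}
```

Because a ServerName includes the start code, the restarted instance never compares equal to the dead one, which is why plain ServerName equality cannot be used to retain locality here.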

> Data locality is not maintained after cluster restart or SSH
> 
>
> Key: HBASE-18036
> URL: https://issues.apache.org/jira/browse/HBASE-18036
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 1.4.0, 1.3.1, 1.2.5, 1.1.10
>Reporter: Stephen Yuan Jiang
>Assignee: Stephen Yuan Jiang
> Attachments: HBASE-18036.v0-branch-1.1.patch
>
>