[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056681#comment-16056681 ]

Stephen Yuan Jiang commented on HBASE-18036:
--------------------------------------------

[~enis], with the Proc-V2 AM, the current change no longer applies. With the initial commit of the new AM, SSH calls AM.createAssignProcedures() with forceNewPlan=true. Even if forceNewPlan were false, when we compare the existing plan's ServerName, it will not equal the dead server's because of the timestamp change (a ServerName is hostname + port + timestamp), and hence a new plan/server would be used for the region assignment. So locality is not guaranteed to be retained. The potential fix would be more involved than what we have now in the 1.x code base. I opened HBASE-18246 to track it (FYI, [~stack]).

> Data locality is not maintained after cluster restart or SSH
> ------------------------------------------------------------
>
>                 Key: HBASE-18036
>                 URL: https://issues.apache.org/jira/browse/HBASE-18036
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 1.4.0, 1.3.1, 1.2.5, 1.1.10
>            Reporter: Stephen Yuan Jiang
>            Assignee: Stephen Yuan Jiang
>             Fix For: 1.3.2, 1.1.11, 1.2.7
>
>         Attachments: HBASE-18036.v0-branch-1.1.patch, HBASE-18036.v0-branch-1.patch, HBASE-18036.v1-branch-1.1.patch, HBASE-18036.v2-branch-1.1.patch
>
>
> After HBASE-2896 / HBASE-4402, we think data locality is maintained after cluster restart. However, we have seen some complaints about data-locality loss when the cluster restarts (e.g. HBASE-17963).
> Examining the AssignmentManager#processDeadServersAndRegionsInTransition() code, for cluster start I expected to hit the following code path:
> {code}
> if (!failover) {
>   // Fresh cluster startup.
>   LOG.info("Clean cluster startup. Assigning user regions");
>   assignAllUserRegions(allRegions);
> }
> {code}
> where assignAllUserRegions() would use the retainAssignment() call in the LoadBalancer; however, from the master log, we usually hit the failover code path:
> {code}
> // If we found user regions out on cluster, its a failover.
> if (failover) {
>   LOG.info("Found regions out on cluster or in RIT; presuming failover");
>   // Process list of dead servers and regions in RIT.
>   // See HBASE-4580 for more information.
>   processDeadServersAndRecoverLostRegions(deadServers);
> }
> {code}
> where processDeadServersAndRecoverLostRegions() would put the dead servers in SSH, and SSH uses roundRobinAssignment() in the LoadBalancer. That is why we see locality lost more often than retained during cluster restart.
> Note: the code I was looking at is close to branch-1 and branch-1.1.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056638#comment-16056638 ]

Hudson commented on HBASE-18036:
--------------------------------

FAILURE: Integrated in Jenkins build HBase-1.4 #780 (See [https://builds.apache.org/job/HBase-1.4/780/])
HBASE-18036 Data locality is not maintained after cluster restart or SSH (syuanjiangdev: rev 532e0dda16f3c5034aa337201bf6d733cc0a1c7b)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056573#comment-16056573 ]

Hudson commented on HBASE-18036:
--------------------------------

SUCCESS: Integrated in Jenkins build HBase-1.2-JDK7 #154 (See [https://builds.apache.org/job/HBase-1.2-JDK7/154/])
HBASE-18036 Data locality is not maintained after cluster restart or SSH (syuanjiangdev: rev 3f9ba2f247ef0fb7cebf35a4501bd7cfa36197bc)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056556#comment-16056556 ]

Hudson commented on HBASE-18036:
--------------------------------

SUCCESS: Integrated in Jenkins build HBase-1.2-JDK8 #150 (See [https://builds.apache.org/job/HBase-1.2-JDK8/150/])
HBASE-18036 Data locality is not maintained after cluster restart or SSH (syuanjiangdev: rev 3f9ba2f247ef0fb7cebf35a4501bd7cfa36197bc)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056474#comment-16056474 ]

Hudson commented on HBASE-18036:
--------------------------------

FAILURE: Integrated in Jenkins build HBase-1.3-JDK7 #184 (See [https://builds.apache.org/job/HBase-1.3-JDK7/184/])
HBASE-18036 Data locality is not maintained after cluster restart or SSH (syuanjiangdev: rev 2fb68f5046a5c5dd54070148a80882ece5c9b8a1)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056428#comment-16056428 ]

Hudson commented on HBASE-18036:
--------------------------------

SUCCESS: Integrated in Jenkins build HBase-1.3-IT #66 (See [https://builds.apache.org/job/HBase-1.3-IT/66/])
HBASE-18036 Data locality is not maintained after cluster restart or SSH (syuanjiangdev: rev 2fb68f5046a5c5dd54070148a80882ece5c9b8a1)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056395#comment-16056395 ]

Hudson commented on HBASE-18036:
--------------------------------

SUCCESS: Integrated in Jenkins build HBase-1.2-IT #887 (See [https://builds.apache.org/job/HBase-1.2-IT/887/])
HBASE-18036 Data locality is not maintained after cluster restart or SSH (syuanjiangdev: rev 3f9ba2f247ef0fb7cebf35a4501bd7cfa36197bc)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011161#comment-16011161 ]

Enis Soztutar commented on HBASE-18036:
---------------------------------------

Thanks Stephen, sorry to come in late. You should file a follow-up jira to do the same fix in master as well, no? Or are you saying that the AMv2 code already handles this?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009987#comment-16009987 ]

Hudson commented on HBASE-18036:
--------------------------------

SUCCESS: Integrated in Jenkins build HBase-1.1-JDK8 #1952 (See [https://builds.apache.org/job/HBase-1.1-JDK8/1952/])
HBASE-18036 Data locality is not maintained after cluster restart or SSH (syuanjiangdev: rev 26cb211e1dc8f5011238de40308965f0e16a)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/BaseLoadBalancer.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009973#comment-16009973 ]

Hudson commented on HBASE-18036:
--------------------------------

SUCCESS: Integrated in Jenkins build HBase-1.1-JDK7 #1869 (See [https://builds.apache.org/job/HBase-1.1-JDK7/1869/])
HBASE-18036 Data locality is not maintained after cluster restart or SSH (syuanjiangdev: rev 26cb211e1dc8f5011238de40308965f0e16a)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/BaseLoadBalancer.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009825#comment-16009825 ]

Hadoop QA commented on HBASE-18036:
-----------------------------------

| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 26s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 5s {color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 6s {color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 30s {color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 23s {color} | {color:green} branch-1 passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 3m 36s {color} | {color:red} hbase-server in branch-1 has 1 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 47s {color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 14s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 5s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 33s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 22s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 28m 8s {color} | {color:green} The patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green} 0m 22s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 1s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 48s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 141m 1s {color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 22s {color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 190m 29s {color} | {color:black} {color} |

|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.master.balancer.TestStochasticLoadBalancer2 |
| | hadoop.hbase.regionserver.TestRSKilledWhenInitializing |
| | hadoop.hbase.replication.TestReplicationKillSlaveRS |
| | hadoop.hbase.regionserver.TestCompactionInDeadRegionServer |
| | hadoop.hbase.regionserver.TestScannerHeartbeatMessages |

|| Subsystem || Report/Notes ||
| Docker | Client=17.03.0-ce Server=17.03.0-ce Image:yetus/hbase:58c504e |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12867990/HBASE-18036.v0-branch-1.patch |
| JIRA Issue | HBASE-18036 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux f4eec39326f4 4.8.3-std-1 #1 SMP Fri Oct 21 11:15:43 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/hbase.sh |
| git revision | branch-1 / 0a45282 |
| Default Java | 1.8.0_131 |
| findbugs | v3.0.0 |
| findbugs | https://builds.apache.org/job/PreCommit-HBASE-Build/6785/artifact/patchprocess/branch-findbugs-hbase-server-warnings.html |
| unit | https://builds.apache.org/job/PreCommit-HBASE-Build/6785/artifact/patchprocess/patch-unit-hbase-server.txt |
| unit test logs |
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009228#comment-16009228 ] Hadoop QA commented on HBASE-18036:
---
| (x) -1 overall |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 34s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 2m 46s | branch-1.1 passed |
| +1 | compile | 1m 1s | branch-1.1 passed |
| +1 | checkstyle | 0m 35s | branch-1.1 passed |
| +1 | mvneclipse | 0m 21s | branch-1.1 passed |
| -1 | findbugs | 3m 20s | hbase-server in branch-1.1 has 80 extant Findbugs warnings. |
| -1 | javadoc | 0m 47s | hbase-server in branch-1.1 failed. |
| +1 | mvninstall | 1m 10s | the patch passed |
| +1 | compile | 1m 1s | the patch passed |
| +1 | javac | 1m 1s | the patch passed |
| +1 | checkstyle | 0m 34s | the patch passed |
| +1 | mvneclipse | 0m 22s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | hadoopcheck | 25m 0s | The patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. |
| +1 | hbaseprotoc | 0m 19s | the patch passed |
| +1 | findbugs | 3m 42s | the patch passed |
| -1 | javadoc | 0m 44s | hbase-server in the patch failed. |
| -1 | unit | 148m 59s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 27s | The patch does not generate ASF License warnings. |
| | | 192m 22s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.regionserver.TestSplitWalDataLoss |
| | hadoop.hbase.master.handler.TestEnableTableHandler |
| Timed out junit tests | org.apache.hadoop.hbase.mapreduce.TestTableInputFormat |
| | org.apache.hadoop.hbase.snapshot.TestExportSnapshot |

|| Subsystem || Report/Notes ||
| Docker | Client=17.03.0-ce Server=17.03.0-ce Image:yetus/hbase:de9b245 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12867929/HBASE-18036.v2-branch-1.1.patch |
| JIRA Issue | HBASE-18036 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 47644c8f9a24 4.8.3-std-1 #1 SMP Fri Oct 21 11:15:43 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/hbase.sh |
| git revision | branch-1.1 / 7d820db |
| Default Java | 1.8.0_131 |
| findbugs | v3.0.0 |
| findbugs | https://builds.apache.org/job/PreCommit-HBASE-Build/6780/artifact/patchprocess/branch-findbugs-hbase-server-warnings.html |
| javadoc | https://builds.apache.org/job/PreCommit-HBASE-Build/6780/artifact/patchprocess/branch-javadoc-hbase-server.txt |
| javadoc | https://builds.apache.org/job/PreCommit-HBASE-Build/6780/artifact/patchprocess/patch-javadoc-hbase-server.txt |
| unit | https://builds.apache.org/job/PreCommit-HBASE-Build/6780/artifact/patchprocess/patch-unit-hbase-server.txt |
| unit
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008967#comment-16008967 ] Hadoop QA commented on HBASE-18036:
---
| (x) -1 overall |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 39s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 11m 30s | branch-1.1 passed |
| +1 | compile | 1m 13s | branch-1.1 passed |
| +1 | checkstyle | 0m 48s | branch-1.1 passed |
| +1 | mvneclipse | 0m 32s | branch-1.1 passed |
| -1 | findbugs | 4m 3s | hbase-server in branch-1.1 has 80 extant Findbugs warnings. |
| -1 | javadoc | 1m 2s | hbase-server in branch-1.1 failed. |
| +1 | mvninstall | 1m 20s | the patch passed |
| +1 | compile | 1m 9s | the patch passed |
| +1 | javac | 1m 9s | the patch passed |
| +1 | checkstyle | 0m 41s | the patch passed |
| +1 | mvneclipse | 0m 26s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | hadoopcheck | 25m 49s | The patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. |
| +1 | hbaseprotoc | 0m 19s | the patch passed |
| +1 | findbugs | 3m 45s | the patch passed |
| -1 | javadoc | 0m 45s | hbase-server in the patch failed. |
| -1 | unit | 158m 27s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 35s | The patch does not generate ASF License warnings. |
| | | 213m 47s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.replication.TestReplicationSmallTests |
| | hadoop.hbase.replication.TestReplicationEndpoint |
| | hadoop.hbase.client.TestMultiParallel |
| | hadoop.hbase.master.TestAssignmentManager |
| Timed out junit tests | org.apache.hadoop.hbase.mapreduce.TestRowCounter |
| | org.apache.hadoop.hbase.snapshot.TestExportSnapshot |

|| Subsystem || Report/Notes ||
| Docker | Client=17.03.0-ce Server=17.03.0-ce Image:yetus/hbase:de9b245 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12867851/HBASE-18036.v1-branch-1.1.patch |
| JIRA Issue | HBASE-18036 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 101aa0c262d2 4.8.3-std-1 #1 SMP Fri Oct 21 11:15:43 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/hbase.sh |
| git revision | branch-1.1 / 7d820db |
| Default Java | 1.8.0_131 |
| findbugs | v3.0.0 |
| findbugs | https://builds.apache.org/job/PreCommit-HBASE-Build/6773/artifact/patchprocess/branch-findbugs-hbase-server-warnings.html |
| javadoc | https://builds.apache.org/job/PreCommit-HBASE-Build/6773/artifact/patchprocess/branch-javadoc-hbase-server.txt |
| javadoc | https://builds.apache.org/job/PreCommit-HBASE-Build/6773/artifact/patchprocess/patch-javadoc-hbase-server.txt |
| unit |
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008684#comment-16008684 ] Stephen Yuan Jiang commented on HBASE-18036:
The V1 patch has a minor change based on [~elserj]'s feedback, and also adds some logging to make the change clear. Next up: I will use the same logic in branch-1 and the other child branches. Based on [~devaraj]'s offline feedback, I will remove the newly introduced "hbase.master.retain.assignment" config in branch-1, but keep it in the other branches (the config exists so that, in case of a regression, users have a way to revert to the original round-robin behavior, since patch releases usually don't get full testing).
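As a sketch of what that escape hatch looks like for an operator: the property name and its default of true come from the patch itself, and setting it to false falls back to the round-robin behavior.

```xml
<!-- hbase-site.xml: opt out of retained assignment in SSH and fall back
     to the original round-robin behavior (the default is true). -->
<property>
  <name>hbase.master.retain.assignment</name>
  <value>false</value>
</property>
```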
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008681#comment-16008681 ] Stephen Yuan Jiang commented on HBASE-18036:
[~stack], thanks for the review. For master, I am not going to make any change, since the proc-v2 work would overwrite it anyway. I plan to make the same change in ServerCrashProcedure in branch-1 and the other child branches.
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008562#comment-16008562 ] stack commented on HBASE-18036:
---
[~syuanjiang] +1 on patch. It is an improvement. HBASE-17791 is a description of the more general case; we need to fix that too. For master and versions of HBase newer than the ones you were looking at, what are you thinking? Thanks.
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008538#comment-16008538 ] Stephen Yuan Jiang commented on HBASE-18036:
[~stack], could you help review the change?
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008248#comment-16008248 ] Josh Elser commented on HBASE-18036:
Looks OK to me, but I know enough to defer to someone who has looked at SSH more recently :)

{code}
+  public boolean isServerWithSameHostnamePortOnline(final ServerName serverName) {
+    return (findServerWithSameHostnamePortWithLock(serverName) != null);
+  }
{code}

nit: remove the unnecessary parens

{code}
+    boolean retainAssignment =
+        server.getConfiguration().getBoolean("hbase.master.retain.assignment", true);
{code}

Nice to expose the config property just in case.
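The check being reviewed hinges on ServerName semantics: hostname and port identify a region server host, while the start code (a timestamp taken at process start) changes on every restart. A minimal, hypothetical sketch of that distinction — ServerNameSketch and its methods are illustrative stand-ins, not the actual org.apache.hadoop.hbase.ServerName API:

```java
// Hypothetical stand-in for HBase's ServerName (hostname + port + start code).
// A restarted region server gets a new start code, so its full ServerName
// differs even though it is the same host:port.
final class ServerNameSketch {
    final String hostname;
    final int port;
    final long startCode; // timestamp assigned when the process starts

    ServerNameSketch(String hostname, int port, long startCode) {
        this.hostname = hostname;
        this.port = port;
        this.startCode = startCode;
    }

    // Strict equality: a restart (new start code) makes the names differ.
    boolean fullEquals(ServerNameSketch other) {
        return sameHostnamePort(other) && startCode == other.startCode;
    }

    // The relaxed check the patch relies on: ignore the start code, so a
    // freshly restarted server still counts as "the same server".
    boolean sameHostnamePort(ServerNameSketch other) {
        return hostname.equals(other.hostname) && port == other.port;
    }
}
```

This is why comparing an existing plan's ServerName to the dead server's name with plain equality fails after a restart, while the hostname+port comparison succeeds.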
[jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
[ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007393#comment-16007393 ] Stephen Yuan Jiang commented on HBASE-18036:
The attached V0 patch is my first attempt to resolve this issue. The change is in SSH: by the time SSH runs, if the dead region server has already restarted (it will have the same hostname and port, but a different start code in ServerName), SSH will try to retain locality by assigning the regions back to that same region server. I introduced a config for anyone who wants to keep the round-robin assignment behavior. I forced the existing TestAssignmentManagerOnCluster tests to use the new code path in SSH and did not see any problems. What is still missing is a new UT in TestAssignmentManagerOnCluster that exercises the retained-assignment code path in SSH. For now, I'd like to post this V0 patch to get some feedback.
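The decision described above can be sketched as follows. This is a simplified illustration of the behavior, not the patch's actual code: chooseTarget, HostPort, and the round-robin cursor are all hypothetical names; the real logic lives in HBase's ServerCrashProcedure and LoadBalancer.

```java
import java.util.List;
import java.util.Optional;

// Simplified sketch of the SSH assignment decision: prefer the restarted
// server with the same hostname:port (retaining locality), otherwise fall
// back to round-robin over the live servers.
final class SshAssignmentSketch {

    /** host:port of a region server, deliberately omitting the start code. */
    record HostPort(String hostname, int port) {}

    static HostPort chooseTarget(HostPort deadServer,
                                 List<HostPort> onlineServers,
                                 boolean retainAssignment,
                                 int roundRobinCursor) {
        if (retainAssignment) {
            // Has a server with the same hostname:port re-registered?
            Optional<HostPort> restarted = onlineServers.stream()
                .filter(s -> s.equals(deadServer))
                .findFirst();
            if (restarted.isPresent()) {
                return restarted.get(); // locality retained
            }
        }
        // Round-robin fallback across the remaining live servers.
        return onlineServers.get(roundRobinCursor % onlineServers.size());
    }
}
```

Under this sketch, the "hbase.master.retain.assignment" config from the patch would feed the retainAssignment flag, and disabling it restores the old round-robin-only path.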