[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598505#comment-14598505 ]

Hudson commented on HBASE-13937:

SUCCESS: Integrated in HBase-0.98 #1035 (See [https://builds.apache.org/job/HBase-0.98/1035/])
HBASE-13937 Partially revert HBASE-13172 (enis: rev 13773d8e27104df45dbaf536f8ddf9399337ad21)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java

> Partially revert HBASE-13172
> ----------------------------
>
> Key: HBASE-13937
> URL: https://issues.apache.org/jira/browse/HBASE-13937
> Project: HBase
> Issue Type: Sub-task
> Components: Region Assignment
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
> Fix For: 0.98.14, 1.0.2, 1.2.0, 1.1.1, 1.3.0
>
> Attachments: hbase-13937_v1.patch, hbase-13937_v2.patch, hbase-13937_v3-branch-1.1.patch, hbase-13937_v3.patch, hbase-13937_v3.patch
>
> HBASE-13172 is supposed to fix a UT issue, but causes other problems that parent jira (HBASE-13605) is attempting to fix.
> However, HBASE-13605 patch v4 uncovers at least 2 different issues which are, to put it mildly, major design flaws in AM / RS.
> Regardless of 13605, the issue with 13172 is that we catch {{ServerNotRunningYetException}} from {{isServerReachable()}} and return false, which then puts the Server to the {{RegionStates.deadServers}} list. Once it is in that list, we can still assign and unassign regions to the RS after it has started (because regular assignment does not check whether the server is in {{RegionStates.deadServers}}). However, after the first assign and unassign, we cannot assign the region again since then the check for the lastServer will think that the server is dead.
> It turns out that a proper patch for 13605 is very hard without fixing the rest of broken AM assumptions (see HBASE-13605, HBASE-13877 and HBASE-13895 for a colorful history). For 1.1.1, I think we should just revert parts of HBASE-13172 for now.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
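
The failure mode in the issue description above can be illustrated with a small, self-contained model. This is only a sketch: `NotRunningYet`, `StartingServer` and `DeadListSketch` are hypothetical stand-ins rather than HBase classes, and it assumes that once the special-case catch is removed, the remaining error handling simply retries the ping.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for HBase's ServerNotRunningYetException. */
class NotRunningYet extends Exception {}

/** Hypothetical region server that is still starting: the first few pings fail. */
class StartingServer {
  private int pingsUntilUp;

  StartingServer(int pingsUntilUp) {
    this.pingsUntilUp = pingsUntilUp;
  }

  boolean ping() throws NotRunningYet {
    if (pingsUntilUp > 0) {
      pingsUntilUp--;
      throw new NotRunningYet();
    }
    return true;
  }
}

class DeadListSketch {
  /** Behaviour attributed to HBASE-13172: give up on the first "not running yet". */
  static boolean reachableGivingUp(StartingServer s, int maxAttempts) {
    for (int i = 0; i < maxAttempts; i++) {
      try {
        return s.ping();
      } catch (NotRunningYet e) {
        break; // treated as unreachable although the server is merely starting
      }
    }
    return false;
  }

  /** Assumed behaviour after the partial revert: keep retrying while it starts. */
  static boolean reachableRetrying(StartingServer s, int maxAttempts) {
    for (int i = 0; i < maxAttempts; i++) {
      try {
        return s.ping();
      } catch (NotRunningYet e) {
        // still starting; fall through to the next attempt
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Set<String> deadServers = new HashSet<>();
    if (!reachableGivingUp(new StartingServer(2), 5)) {
      deadServers.add("rs1"); // a live server ends up on the dead list
    }
    if (!reachableRetrying(new StartingServer(2), 5)) {
      deadServers.add("rs2");
    }
    System.out.println(deadServers); // prints [rs1]
  }
}
```

In this model only the server probed by the give-up variant lands on the dead list, which mirrors the description's point that a server that is still starting should not be classified as dead.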
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598440#comment-14598440 ]

Hudson commented on HBASE-13937:

SUCCESS: Integrated in HBase-1.3-IT #3 (See [https://builds.apache.org/job/HBase-1.3-IT/3/])
HBASE-13937 Partially revert HBASE-13172 (enis: rev 0271afc1b7558c85c293675b25ff77d405f39a37)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598410#comment-14598410 ]

Hudson commented on HBASE-13937:

FAILURE: Integrated in HBase-1.2 #26 (See [https://builds.apache.org/job/HBase-1.2/26/])
HBASE-13937 Partially revert HBASE-13172 (enis: rev 582099424dad2644e99a8c7588616e3f50e9b00c)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598405#comment-14598405 ]

Hudson commented on HBASE-13937:

SUCCESS: Integrated in HBase-1.2-IT #18 (See [https://builds.apache.org/job/HBase-1.2-IT/18/])
HBASE-13937 Partially revert HBASE-13172 (enis: rev 582099424dad2644e99a8c7588616e3f50e9b00c)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598307#comment-14598307 ]

Hudson commented on HBASE-13937:

SUCCESS: Integrated in HBase-1.1 #553 (See [https://builds.apache.org/job/HBase-1.1/553/])
HBASE-13937 Partially revert HBASE-13172 (enis: rev c88804cc7cc503a8cafc1d7b1f0b51f2349c3c3f)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598293#comment-14598293 ]

Hudson commented on HBASE-13937:

FAILURE: Integrated in HBase-1.3 #12 (See [https://builds.apache.org/job/HBase-1.3/12/])
HBASE-13937 Partially revert HBASE-13172 (enis: rev 0271afc1b7558c85c293675b25ff77d405f39a37)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598119#comment-14598119 ]

Hudson commented on HBASE-13937:

FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #988 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/988/])
HBASE-13937 Partially revert HBASE-13172 (enis: rev 13773d8e27104df45dbaf536f8ddf9399337ad21)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598043#comment-14598043 ]

Enis Soztutar commented on HBASE-13937:

bq. Hey Enis Soztutar your master patch really doesn't apply; the code you're aiming to remove is already not there. I think you should try re-creating it.

Sorry for the confusion. v3 patch is badly named. It should have been v3-branch-1. HBASE-13172 is not committed to master, so this is not needed there. Let me commit this shortly.
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597809#comment-14597809 ]

Nick Dimiduk commented on HBASE-13937:

Where are we with this one? I think 1.1.1 and 1.2.0 should include it (FYI [~busbey])
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596887#comment-14596887 ]

Nick Dimiduk commented on HBASE-13937:

With patch v3, same loop of {{TestDistributedLogSplitting}} on branch-1.1 is passing consistently for me; +1 stands.
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596780#comment-14596780 ]

Nick Dimiduk commented on HBASE-13937:

Hey [~enis] your master patch really doesn't apply; the code you're aiming to remove is already not there. I think you should try re-creating it.
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594898#comment-14594898 ]

Hadoop QA commented on HBASE-13937:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12740882/hbase-13937_v3-branch-1.1.patch
against branch-1.1 branch at commit 04c25e0f355aaa6ded37b0477ce126a693756b81.
ATTACHMENT ID: 12740882

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

{color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 protoc{color}. The applied patch does not increase the total number of protoc compiler warnings.

{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

{color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100

{color:green}+1 site{color}. The mvn post-site goal succeeds with this patch.

{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14488//testReport/
Release Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14488//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14488//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14488//console

This message is automatically generated.
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594386#comment-14594386 ]

Hadoop QA commented on HBASE-13937:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12740771/hbase-13937_v3.patch
against master branch at commit db08013ebeeaa85802d9795cc72b4c29c5338a47.
ATTACHMENT ID: 12740771

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14480//console

This message is automatically generated.
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594157#comment-14594157 ]

Duo Zhang commented on HBASE-13937:

{quote}
thus guaranteeing that the region server cannot accept any more writes.
{quote}
But it is still readable, right? Does HBase guarantee this level of consistency?
1. A writes a row to HBase.
2. A tells B to read the row.
3. B can read the row from HBase.

If not, then I think removing recoverLease is enough here. Otherwise we still need to make sure that the regionserver cannot process any request.

And I think this discussion should be in the parent issue, so +1 on patch v3 :)
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594118#comment-14594118 ]

Andrew Purtell commented on HBASE-13937:

bq. Do you mind re-review according to above?

The check for isDeadServer is moved out of the retry loop:
{code}
@@ -903,13 +901,14 @@ public class ServerManager {
   public boolean isServerReachable(ServerName server) {
     if (server == null) throw new NullPointerException("Passed server is null");
 
+    synchronized (this.onlineServers) {
+      if (this.deadservers.isDeadServer(server)) {
+        return false;
+      }
+    }
+
     RetryCounter retryCounter = pingRetryCounterFactory.create();
     while (retryCounter.shouldRetry()) {
-      synchronized (this.onlineServers) {
-        if (this.deadservers.isDeadServer(server)) {
-          return false;
-        }
-      }
       try {
         AdminService.BlockingInterface admin = getRsAdmin(server);
         if (admin != null) {
{code}
Yes, I was wrong about the second part. The patch under review here does:
{code}
@@ -917,11 +916,6 @@ public class ServerManager {
           return info != null && info.hasServerName()
             && server.getStartcode() == info.getServerName().getStartCode();
         }
-      } catch (RegionServerStoppedException | ServerNotRunningYetException e) {
-        if (LOG.isDebugEnabled()) {
-          LOG.debug("Couldn't reach " + server, e);
-        }
-        break;
{code}
I must have gone back to the wrong tab and looked at the patch on the original issue.

lgtm, FWIW
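For readers following the two hunks above, the resulting control flow can be sketched roughly as below. This is not the ServerManager code: StartCodePinger, ReachabilitySketch, the string server names and the maxAttempts parameter are all hypothetical stand-ins, the sleep between attempts is omitted, and the sketch synchronizes on the dead-server set itself rather than on onlineServers. It only illustrates the shape discussed in the comment: the dead-server check runs once before the loop, and no exception type ends the loop early.

```java
import java.io.IOException;
import java.util.Set;

/** Hypothetical stand-in for the RPC that asks a region server for its start code. */
interface StartCodePinger {
  long ping(String server) throws IOException;
}

class ReachabilitySketch {
  /**
   * Returns true only if the server answers a ping with the expected start code.
   * The dead-server check happens once, before the loop; every IOException is retried.
   */
  static boolean isServerReachable(String server, long expectedStartCode,
      Set<String> deadServers, StartCodePinger pinger, int maxAttempts) {
    if (server == null) throw new NullPointerException("Passed server is null");
    synchronized (deadServers) {
      if (deadServers.contains(server)) {
        return false; // checked once, outside the retry loop
      }
    }
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return pinger.ping(server) == expectedStartCode;
      } catch (IOException ioe) {
        // No exception type is special-cased here: fall through and ping again.
        // (A real implementation would sleep between attempts.)
      }
    }
    return false; // attempts exhausted without a successful ping
  }
}
```

With this shape, a server that throws "not running yet" on the first couple of pings is still reported reachable once it starts answering, as long as the attempts are not exhausted.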
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594090#comment-14594090 ]

Enis Soztutar commented on HBASE-13937:

bq. Looking at the V2 patch. So we check once if the server is in the dead list and then proceed to ping. This patch hoists out this check:

I think the patch does the exact opposite. It keeps the {{synchronized (this.onlineServers)}} part, but removes the {{catch (RegionServerStoppedException | ServerNotRunningYetException e)}} part. The intent is to apply v2 directly without reverting the previous patch.

bq. This lgtm for application to 0.98, modulo the multicatch (Java 7+ only) will need to be converted to equivalent Java 6 idiom.

Do you mind re-review according to above?
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593947#comment-14593947 ]

Nick Dimiduk commented on HBASE-13937:

{{TestDistributedLogSplitting}} is passing consistently on my side as well. +1
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593936#comment-14593936 ]

Devaraj Das commented on HBASE-13937:

+1
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593935#comment-14593935 ]

Andrew Purtell commented on HBASE-13937:

bq. We will not treat exceptions coming from server ping differently, but instead will keep retrying to ping.

Looking at the V2 patch. So we check once if the server is in the dead list and then proceed to ping. This patch hoists out this check:
{code}
+      synchronized (this.onlineServers) {
+        if (this.deadservers.isDeadServer(server)) {
+          return false;
+        }
+      }
{code}
that HBASE-13172 put into the ping loop. We retain this change from HBASE-13172:
{code}
@@ -851,13 +858,21 @@ public class ServerManager {
           return info != null && info.hasServerName()
             && server.getStartcode() == info.getServerName().getStartCode();
         }
+      } catch (RegionServerStoppedException | ServerNotRunningYetException e) {
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("Couldn't reach " + server, e);
+        }
+        break;
       } catch (IOException ioe) {
-        LOG.debug("Couldn't reach " + server + ", try=" + retryCounter.getAttemptTimes()
-          + " of " + retryCounter.getMaxAttempts(), ioe);
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("Couldn't reach " + server + ", try=" + retryCounter.getAttemptTimes() + " of "
+              + retryCounter.getMaxAttempts(), ioe);
+        }
         try {
           retryCounter.sleepUntilNextRetry();
         } catch(InterruptedException ie) {
           Thread.currentThread().interrupt();
+          break;
         }
       }
     }
{code}
that breaks out of the ping loop if we catch RegionServerStoppedException or ServerNotRunningYetException.

This lgtm for application to 0.98, modulo the multicatch (Java 7+ only) will need to be converted to equivalent Java 6 idiom.
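As an illustration of the Java 6 remark above: a multicatch clause can be rewritten as one catch clause per exception type, usually delegating to a shared helper. The sketch below assumes nothing about HBase itself; StoppedException and NotRunningYetException are hypothetical stand-ins for RegionServerStoppedException and ServerNotRunningYetException, and the returned strings merely label which branch ran.

```java
import java.io.IOException;

class Java6CatchSketch {
  // Hypothetical stand-ins for the two HBase exception types named in the thread.
  static class StoppedException extends IOException {
    StoppedException(String message) { super(message); }
  }

  static class NotRunningYetException extends IOException {
    NotRunningYetException(String message) { super(message); }
  }

  /** Hypothetical unit of work that may fail with an IOException. */
  interface Call {
    void run() throws IOException;
  }

  /** Java 7+ form: a single multicatch clause covers both types. */
  static String java7Style(Call call) {
    try {
      call.run();
      return "ok";
    } catch (StoppedException | NotRunningYetException e) {
      return handleUnavailable(e);
    } catch (IOException ioe) {
      return "retry";
    }
  }

  /** Java 6 form: one catch clause per type, both delegating to the same helper. */
  static String java6Style(Call call) {
    try {
      call.run();
      return "ok";
    } catch (StoppedException e) {
      return handleUnavailable(e);
    } catch (NotRunningYetException e) {
      return handleUnavailable(e);
    } catch (IOException ioe) {
      return "retry";
    }
  }

  // Shared handling keeps the two Java 6 clauses from drifting apart.
  private static String handleUnavailable(IOException e) {
    return "break";
  }
}
```

Both methods behave identically; the only difference is whether the source compiles on a Java 6 toolchain.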
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593797#comment-14593797 ]

Enis Soztutar commented on HBASE-13937:

[~Apache9] we have a fencing mechanism for region servers already via using HDFS {{recoverLease()}}. Once the zk session expiry happens, master renames the WAL directory for the RS and also starts recoverLease on all WAL files, thus guaranteeing that the region server cannot accept any more writes.
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592928#comment-14592928 ]

Duo Zhang commented on HBASE-13937:

{quote}
In theory we should not have isServerReachable() at all. It is a parallel cluster membership mechanism to the already existing zk based one
{quote}
I'm not familiar with the AM and SM code, but I think zk alone is not enough to keep things consistent. The disappearance of the EPHEMERAL node on ZooKeeper does not mean the server is really dead. We still need a method like isServerReachable to decide whether the server is really dead. One example is HDFS HA, where fencing is needed before transitioning the standby namenode to active.

{quote}
If we get a connection exception or smt, we cannot assume the server is dead.
{quote}
Agreed (and excuse me, what is smt?). In general, we can only be sure a server is dead when the server itself tells us so. After some googling, I think connection refused is also not that reliable (a firewall or a full backlog queue can also cause connection refused). So I think we need fencing like what HDFS HA does?

Yes, you may object: if the machine has crashed, how can we make sure the server is dead? Honestly, I do not have a perfect solution. Maybe we could introduce a DeadServerManager with several levels of consistency: the lowest level does no fencing at all, the medium level fences with a timeout, and the highest level keeps fencing forever (until a person stops it, maybe).

Thanks.
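The three consistency levels floated in the comment above could look something like the following. This is purely a sketch of the idea: no DeadServerManager of this kind is confirmed to exist in HBase, FencingLevel and DeadServerManagerSketch are invented names, the fencing action is a hypothetical callback, and the "timeout" is modeled as a bounded number of attempts rather than wall-clock time.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical consistency levels, from no fencing to fencing until it succeeds. */
enum FencingLevel { NONE, TIMEOUT, FOREVER }

class DeadServerManagerSketch {
  /**
   * Returns true once the server may be treated as dead under the given level.
   * fenceAttempt is a hypothetical callback that returns true when fencing succeeded.
   */
  static boolean confirmDead(FencingLevel level, BooleanSupplier fenceAttempt, int maxAttempts) {
    switch (level) {
      case NONE:
        // Lowest level: trust the caller's suspicion without fencing.
        return true;
      case TIMEOUT:
        // Medium level: give up after a bounded number of fencing attempts.
        for (int i = 0; i < maxAttempts; i++) {
          if (fenceAttempt.getAsBoolean()) {
            return true;
          }
        }
        return false;
      case FOREVER:
        // Highest level: keep fencing until it succeeds or someone interrupts the thread.
        while (!fenceAttempt.getAsBoolean()) {
          if (Thread.currentThread().isInterrupted()) {
            return false;
          }
        }
        return true;
      default:
        throw new IllegalArgumentException("Unknown level: " + level);
    }
  }
}
```

The trade-off is the usual one: NONE never blocks but may act on a live server, while FOREVER never acts on a live server but may block assignment until an operator steps in.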
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592858#comment-14592858 ]

Enis Soztutar commented on HBASE-13937:

The problems with exiting early are the same problems that 13605 is all about. In theory we should not have isServerReachable() at all. It is a parallel cluster membership mechanism to the already existing zk based one. If we get a connection exception or smt, we cannot assume the server is dead. It may be a temporary network partition for the master or something else. If we return early here, the rest of AM assumes that the server is dead and, for example, that region unassign can safely continue. The patch goes back to the more conservative approach in terms of semantics.
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592814#comment-14592814 ]

Duo Zhang commented on HBASE-13937:

For isServerReachable, I think we can say a server is 'dead' if one of the following conditions is satisfied:
1. The server tells us it is dead (I'm not sure whether a RegionServerStoppedException is enough; maybe it can be thrown before the regionserver has completely shut down?)
2. It is a server with another start code.
3. We get a connection refused (not a connect timeout).

So I think removing the code is reasonable, but then we should catch a connection refused exception (does our rpc framework throw this exception out? We also do not need to retry in that case). Otherwise, if we do not restart a regionserver, we will be stuck in the loop for a long time. Also, I do not think it is safe to return 'not reachable' on a timeout.

Thanks.
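The three conditions proposed in the comment above can be written down as a small classification function. This is a sketch of the proposal only, not of anything in HBase: PingVerdict, DeadCheckSketch and ServerStoppedException are hypothetical names, and java.net.ConnectException is used as a rough proxy for "connection refused", which is an assumption (the JDK can raise it for other connect failures too). Anything else, including timeouts, is deliberately left as UNKNOWN rather than DEAD.

```java
import java.io.IOException;
import java.net.ConnectException;

/** Hypothetical outcome of a single ping attempt. */
enum PingVerdict { ALIVE, DEAD, UNKNOWN }

class DeadCheckSketch {
  /** Hypothetical stand-in for RegionServerStoppedException. */
  static class ServerStoppedException extends IOException {
    ServerStoppedException(String message) { super(message); }
  }

  /**
   * Classifies a ping result.
   * failure is null when the ping returned; reportedStartCode is null when it did not.
   */
  static PingVerdict classify(long expectedStartCode, Long reportedStartCode, Exception failure) {
    if (failure instanceof ServerStoppedException) {
      return PingVerdict.DEAD; // condition 1: the server says it is stopped
    }
    if (failure instanceof ConnectException) {
      return PingVerdict.DEAD; // condition 3: treated here as "connection refused"
    }
    if (failure != null || reportedStartCode == null) {
      return PingVerdict.UNKNOWN; // timeouts and other errors prove nothing
    }
    // condition 2: a different start code means a different server instance
    return reportedStartCode == expectedStartCode ? PingVerdict.ALIVE : PingVerdict.DEAD;
  }
}
```

A caller would retry on UNKNOWN and stop on either of the other two verdicts.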
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592810#comment-14592810 ]

Enis Soztutar commented on HBASE-13937:

The behavior will revert to what it was before 13172 was committed (1.0.1, 1.1.0 and 0.98.12). We will not treat exceptions coming from server ping differently, but instead will keep retrying to ping.
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592734#comment-14592734 ]

stack commented on HBASE-13937:

What will the change in behavior be [~enis] with this patch in place?
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592720#comment-14592720 ]

Enis Soztutar commented on HBASE-13937:

[~Apache9] FYI.