[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598505#comment-14598505
 ] 

Hudson commented on HBASE-13937:


SUCCESS: Integrated in HBase-0.98 #1035 (See 
[https://builds.apache.org/job/HBase-0.98/1035/])
HBASE-13937 Partially revert HBASE-13172 (enis: rev 
13773d8e27104df45dbaf536f8ddf9399337ad21)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java


> Partially revert HBASE-13172
> ----------------------------
>
> Key: HBASE-13937
> URL: https://issues.apache.org/jira/browse/HBASE-13937
> Project: HBase
> Issue Type: Sub-task
> Components: Region Assignment
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
> Fix For: 0.98.14, 1.0.2, 1.2.0, 1.1.1, 1.3.0
>
> Attachments: hbase-13937_v1.patch, hbase-13937_v2.patch, 
> hbase-13937_v3-branch-1.1.patch, hbase-13937_v3.patch, hbase-13937_v3.patch
>
>
> HBASE-13172 was supposed to fix a UT issue, but it causes other problems that 
> the parent jira (HBASE-13605) is attempting to fix. 
> However, HBASE-13605 patch v4 uncovers at least 2 different issues which are, 
> to put it mildly, major design flaws in AM / RS. 
> Regardless of 13605, the issue with 13172 is that we catch 
> {{ServerNotRunningYetException}} in {{isServerReachable()}} and return 
> false, which then puts the server on the {{RegionStates.deadServers}} list. 
> Once it is in that list, we can still assign and unassign regions to the RS 
> after it has started (because regular assignment does not check whether the 
> server is in {{RegionStates.deadServers}}). However, after the first assign 
> and unassign, we cannot assign the region again, since the check for the 
> lastServer will then think that the server is dead. 
> It turns out that a proper patch for 13605 is very hard without fixing the rest 
> of the broken AM assumptions (see HBASE-13605, HBASE-13877 and HBASE-13895 for a 
> colorful history). For 1.1.1, I think we should just revert parts of 
> HBASE-13172 for now. 
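
The deadServers behaviour in the description above can be pictured with a toy model. This is only a sketch under stated assumptions: the class DeadServerSketch and all of its fields and methods are hypothetical names made up for illustration, not HBase's AssignmentManager or RegionStates code. It assumes, as the description says, that assignment does not check the target server against the dead list, but does check the region's last server.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical toy model of the described state; not the HBase source. */
public class DeadServerSketch {
  // Servers that a failed reachability check has marked as dead.
  private final Set<String> deadServers = new HashSet<>();
  // Region -> server currently hosting it.
  private final Map<String, String> current = new HashMap<>();
  // Region -> server that hosted it most recently (kept after unassign).
  private final Map<String, String> lastServer = new HashMap<>();

  /** Models isServerReachable() returning false for a server. */
  public void markDead(String server) {
    deadServers.add(server);
  }

  /** Returns true if the assignment goes through. */
  public boolean assign(String region, String server) {
    String previous = lastServer.get(region);
    // Assumption from the description: only the region's last server is
    // checked against deadServers, never the assignment target itself.
    if (previous != null && deadServers.contains(previous)) {
      return false;
    }
    current.put(region, server);
    lastServer.put(region, server);
    return true;
  }

  /** Takes the region offline but remembers where it last lived. */
  public void unassign(String region) {
    current.remove(region);
  }

  public static void main(String[] args) {
    DeadServerSketch am = new DeadServerSketch();
    am.markDead("rs1");                           // marked dead while still starting
    System.out.println(am.assign("r1", "rs1"));   // true: target is not checked
    am.unassign("r1");
    System.out.println(am.assign("r1", "rs1"));   // false: lastServer looks dead
  }
}
```

In this model the first assignment succeeds and the second is refused, which is the sequence the description reports.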



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598440#comment-14598440
 ] 

Hudson commented on HBASE-13937:


SUCCESS: Integrated in HBase-1.3-IT #3 (See 
[https://builds.apache.org/job/HBase-1.3-IT/3/])
HBASE-13937 Partially revert HBASE-13172 (enis: rev 
0271afc1b7558c85c293675b25ff77d405f39a37)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java




[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598410#comment-14598410
 ] 

Hudson commented on HBASE-13937:


FAILURE: Integrated in HBase-1.2 #26 (See 
[https://builds.apache.org/job/HBase-1.2/26/])
HBASE-13937 Partially revert HBASE-13172 (enis: rev 
582099424dad2644e99a8c7588616e3f50e9b00c)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java




[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598405#comment-14598405
 ] 

Hudson commented on HBASE-13937:


SUCCESS: Integrated in HBase-1.2-IT #18 (See 
[https://builds.apache.org/job/HBase-1.2-IT/18/])
HBASE-13937 Partially revert HBASE-13172 (enis: rev 
582099424dad2644e99a8c7588616e3f50e9b00c)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java




[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598307#comment-14598307
 ] 

Hudson commented on HBASE-13937:


SUCCESS: Integrated in HBase-1.1 #553 (See 
[https://builds.apache.org/job/HBase-1.1/553/])
HBASE-13937 Partially revert HBASE-13172 (enis: rev 
c88804cc7cc503a8cafc1d7b1f0b51f2349c3c3f)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java




[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598293#comment-14598293
 ] 

Hudson commented on HBASE-13937:


FAILURE: Integrated in HBase-1.3 #12 (See 
[https://builds.apache.org/job/HBase-1.3/12/])
HBASE-13937 Partially revert HBASE-13172 (enis: rev 
0271afc1b7558c85c293675b25ff77d405f39a37)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java




[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598119#comment-14598119
 ] 

Hudson commented on HBASE-13937:


FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #988 (See 
[https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/988/])
HBASE-13937 Partially revert HBASE-13172 (enis: rev 
13773d8e27104df45dbaf536f8ddf9399337ad21)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java




[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-23 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598043#comment-14598043
 ] 

Enis Soztutar commented on HBASE-13937:
---

bq. Hey Enis Soztutar your master patch really doesn't apply; the code you're aiming to remove is already not there. I think you should try re-creating it.

Sorry for the confusion. The v3 patch is badly named; it should have been 
v3-branch-1. HBASE-13172 is not committed to master, so this is not needed 
there. 

Let me commit this shortly. 



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-23 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597809#comment-14597809
 ] 

Nick Dimiduk commented on HBASE-13937:
--

Where are we with this one? I think 1.1.1 and 1.2.0 should include it (FYI 
[~busbey])



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-22 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596887#comment-14596887
 ] 

Nick Dimiduk commented on HBASE-13937:
--

With patch v3, the same loop of {{TestDistributedLogSplitting}} on branch-1.1 is 
passing consistently for me; +1 stands.



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-22 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596780#comment-14596780
 ] 

Nick Dimiduk commented on HBASE-13937:
--

Hey [~enis] your master patch really doesn't apply; the code you're aiming to 
remove is already not there. I think you should try re-creating it.



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594898#comment-14594898
 ] 

Hadoop QA commented on HBASE-13937:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12740882/hbase-13937_v3-branch-1.1.patch
  against branch-1.1 branch at commit 04c25e0f355aaa6ded37b0477ce126a693756b81.
  ATTACHMENT ID: 12740882

{color:green}+1 @author{color}.  The patch does not contain any @author tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include any new or modified tests.
  Please justify why no new tests are needed for this patch.
  Also please list what manual steps were performed to verify this patch.

{color:green}+1 hadoop versions{color}.  The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any warning messages.

{color:green}+1 checkstyle{color}.  The applied patch does not increase the total number of checkstyle errors

{color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines longer than 100

{color:green}+1 site{color}.  The mvn post-site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14488//testReport/
Release Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14488//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14488//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14488//console

This message is automatically generated.



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594386#comment-14594386
 ] 

Hadoop QA commented on HBASE-13937:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12740771/hbase-13937_v3.patch
  against master branch at commit db08013ebeeaa85802d9795cc72b4c29c5338a47.
  ATTACHMENT ID: 12740771

{color:green}+1 @author{color}.  The patch does not contain any @author tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include any new or modified tests.
  Please justify why no new tests are needed for this patch.
  Also please list what manual steps were performed to verify this patch.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14480//console

This message is automatically generated.



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-19 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594157#comment-14594157
 ] 

Duo Zhang commented on HBASE-13937:
---

{quote}
thus guaranteeing that the region server cannot accept any more writes.
{quote}
But it is still readable, right? Does HBase guarantee this level of consistency?
1. A writes a row to HBase.
2. A tells B to read the row.
3. B can read the row from HBase.

If not, then I think removing recoverLease is enough here. Otherwise we still 
need to make sure that the regionserver cannot process any request.

And I think this discussion should be in the parent issue, so +1 on patch v3 :)



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-19 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594118#comment-14594118
 ] 

Andrew Purtell commented on HBASE-13937:


bq. Do you mind re-review according to above? 

The check for isDeadServer is moved out of the retry loop:
{code}
@@ -903,13 +901,14 @@ public class ServerManager {
   public boolean isServerReachable(ServerName server) {
     if (server == null) throw new NullPointerException("Passed server is null");
 
+    synchronized (this.onlineServers) {
+      if (this.deadservers.isDeadServer(server)) {
+        return false;
+      }
+    }
+
     RetryCounter retryCounter = pingRetryCounterFactory.create();
     while (retryCounter.shouldRetry()) {
-      synchronized (this.onlineServers) {
-        if (this.deadservers.isDeadServer(server)) {
-          return false;
-        }
-      }
       try {
         AdminService.BlockingInterface admin = getRsAdmin(server);
         if (admin != null) {
{code}

Yes, I was wrong about the second part. The patch under review here does:
{code}
@@ -917,11 +916,6 @@ public class ServerManager {
           return info != null && info.hasServerName()
             && server.getStartcode() == info.getServerName().getStartCode();
         }
-      } catch (RegionServerStoppedException | ServerNotRunningYetException e) {
-        if (LOG.isDebugEnabled()) {
-          LOG.debug("Couldn't reach " + server, e);
-        }
-        break;
{code}
I must have gone back to the wrong tab and looked at the patch on the original 
issue.

lgtm, FWIW



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-19 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594090#comment-14594090
 ] 

Enis Soztutar commented on HBASE-13937:
---

bq. Looking at the V2 patch. So we check once if the server is in the dead list 
and then proceed to ping. This patch hoists out this check:
I think the patch does the exact opposite. It keeps the 
{{synchronized (this.onlineServers)}} part, but removes the 
{{catch (RegionServerStoppedException | ServerNotRunningYetException e)}} part. 
The intent is to apply v2 directly without reverting the previous patch. 

bq. This lgtm for application to 0.98, modulo the multicatch (Java 7+ only) 
will need to be converted to equivalent Java 6 idiom.
Do you mind re-review according to above? 



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-19 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593947#comment-14593947
 ] 

Nick Dimiduk commented on HBASE-13937:
--

{{TestDistributedLogSplitting}} is passing consistently on my side as well. +1



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-19 Thread Devaraj Das (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593936#comment-14593936
 ] 

Devaraj Das commented on HBASE-13937:
-

+1 



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-19 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593935#comment-14593935
 ] 

Andrew Purtell commented on HBASE-13937:


bq. We will not treat exceptions coming from server ping differently, but 
instead will keep retrying to ping.

Looking at the V2 patch. So we check once if the server is in the dead list and 
then proceed to ping. This patch hoists out this check:
{code}
+  synchronized (this.onlineServers) {
+if (this.deadservers.isDeadServer(server)) {
+  return false;
+}
+  }
{code}
that HBASE-13172 put into the ping loop. We retain this change from HBASE-13172:
{code}
@@ -851,13 +858,21 @@ public class ServerManager {
           return info != null && info.hasServerName()
             && server.getStartcode() == info.getServerName().getStartCode();
         }
+      } catch (RegionServerStoppedException | ServerNotRunningYetException e) {
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("Couldn't reach " + server, e);
+        }
+        break;
       } catch (IOException ioe) {
-        LOG.debug("Couldn't reach " + server + ", try=" + retryCounter.getAttemptTimes()
-          + " of " + retryCounter.getMaxAttempts(), ioe);
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("Couldn't reach " + server + ", try=" + retryCounter.getAttemptTimes() + " of "
+              + retryCounter.getMaxAttempts(), ioe);
+        }
         try {
           retryCounter.sleepUntilNextRetry();
         } catch(InterruptedException ie) {
           Thread.currentThread().interrupt();
+          break;
         }
       }
     }
{code}
that breaks out of the ping loop if we catch RegionServerStoppedException or 
ServerNotRunningYetException.

This lgtm for application to 0.98, modulo the multicatch (Java 7+ only) will 
need to be converted to equivalent Java 6 idiom.
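
For reference, a minimal sketch of what the Java 6 form of a multicatch could 
look like: one catch clause per exception type, with the shared body pulled 
into a helper. This is not the 0.98 patch itself. The two exception classes 
below are local stand-ins (not the HBase classes), and {{handle}} and 
{{stopPinging}} are hypothetical names used only for illustration.

```java
import java.io.IOException;

/** Sketch only: the exception classes below are local stand-ins, not the HBase ones. */
final class Java6CatchSketch {
  static class RegionServerStoppedException extends IOException {
    RegionServerStoppedException(String msg) { super(msg); }
  }

  static class ServerNotRunningYetException extends IOException {
    ServerNotRunningYetException(String msg) { super(msg); }
  }

  /** Hypothetical helper holding the body that the multicatch clause shared. */
  static String stopPinging(IOException e) {
    return "break: " + e.getMessage();
  }

  /** Throws the given exception (if any) and reports which handler ran. */
  static String handle(IOException toThrow) {
    try {
      if (toThrow != null) {
        throw toThrow;
      }
      return "reachable";
    } catch (RegionServerStoppedException e) {
      // Java 6: one catch clause per type instead of "A | B".
      return stopPinging(e);
    } catch (ServerNotRunningYetException e) {
      return stopPinging(e);
    } catch (IOException ioe) {
      // The generic handler has to stay last, after the subclasses.
      return "retry: " + ioe.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(handle(new ServerNotRunningYetException("starting")));
    System.out.println(handle(new IOException("timeout")));
  }
}
```

The two specific clauses must come before the {{IOException}} clause, since 
both stand-in types extend {{IOException}} here.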



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-19 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593797#comment-14593797
 ] 

Enis Soztutar commented on HBASE-13937:
---

[~Apache9] we already have a fencing mechanism for region servers via 
HDFS {{recoverLease()}}. Once the zk session expires, the master renames the 
WAL directory for the RS and also starts recoverLease on all WAL files, thus 
guaranteeing that the region server cannot accept any more writes.



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-18 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592928#comment-14592928
 ] 

Duo Zhang commented on HBASE-13937:
---

{quote}
In theory we should not have isServerReachable() at all. It is a parallel 
cluster membership mechanism to the already existing zk based one
{quote}
I'm not familiar with the AM and SM code, but I think zk alone is not enough to 
keep things consistent. The disappearance of the EPHEMERAL node on zookeeper 
does not mean the server is really dead. We still need a method like 
isServerReachable to decide whether the server is really dead. One example is 
HDFS HA, where fencing is needed before promoting the standby namenode to active.

{quote}
If we get a connection exception or smt, we cannot assume the server is dead.
{quote}
Agreed (and excuse me, what is smt?). In general, only when the server itself 
tells us it is dead can we be sure that it is dead. And after googling, I think 
connection refused is also not that reliable (a firewall or a full backlog queue 
can also cause connection refused). So I think we need fencing like what HDFS HA 
does?

Yeah, you may object that if the machine has crashed, how can we make sure 
the server is dead... Honestly, I do not have a perfect solution. Maybe we could 
introduce a DeadServerManager with several levels of consistency: the lowest 
level does no fencing at all, the medium level does fencing with a timeout, and 
the highest level keeps fencing forever (maybe until a person stops it).

Thanks.



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-18 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592858#comment-14592858
 ] 

Enis Soztutar commented on HBASE-13937:
---

The problems with exiting early are the same problems that 13605 is all about. 
In theory we should not have isServerReachable() at all. It is a parallel 
cluster membership mechanism to the already existing zk based one. If we get a 
connection exception or smt, we cannot assume the server is dead. It may be a 
temporary network partition for the master or something else. If we return 
early here, the rest of AM assumes that the server is dead and, for example, 
that region unassign can safely continue. 

The patch goes back to the more conservative approach in terms of the 
semantics. 



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-18 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592814#comment-14592814
 ] 

Duo Zhang commented on HBASE-13937:
---

For isServerReachable, I think we can say a server is 'dead' if one of the 
following conditions is satisfied:

1. The server tells us it is dead (I'm not sure whether a 
RegionServerStoppedException is enough; maybe it can be thrown before the 
regionserver has completely shut down?).
2. It is a server with another start code.
3. We get a connection refused (not a connect timeout).

So I think removing the code is reasonable, but then we should catch a 
connection refused exception (does our rpc framework throw this exception out? 
Also, we do not need to retry in that case...). Otherwise, if we do not restart 
a regionserver, we will be stuck in the loop for a long time...

And also I do not think it is safe to return 'not reachable' on a timeout...
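
To make the three conditions concrete, here is a rough sketch of how a ping 
result could be classified. It is an illustration under assumptions, not 
ServerManager code: {{PingOutcomeSketch}}, {{Verdict}} and {{classify}} are 
hypothetical names, {{RegionServerStoppedException}} is a local stand-in, and 
it assumes the rpc layer surfaces a refused connection as 
{{java.net.ConnectException}}, which is exactly the open question above. Note 
that an OS-level connect timeout may also show up as a {{ConnectException}}, so 
a real check would need to be stricter.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

/** Sketch only: classifies a ping result using the three conditions listed above. */
final class PingOutcomeSketch {
  enum Verdict { ALIVE, DEAD, UNKNOWN }

  /** Local stand-in for the HBase exception of the same name. */
  static class RegionServerStoppedException extends IOException {
    RegionServerStoppedException(String msg) { super(msg); }
  }

  /**
   * @param expectedStartCode start code of the server instance we are asking about
   * @param reportedStartCode start code returned by the ping, or null if there was no reply
   * @param failure exception raised by the ping, or null if it succeeded
   */
  static Verdict classify(long expectedStartCode, Long reportedStartCode, IOException failure) {
    if (failure instanceof RegionServerStoppedException) {
      return Verdict.DEAD;      // condition 1: the server says it is stopped
    }
    if (failure instanceof ConnectException) {
      return Verdict.DEAD;      // condition 3: assumed to mean "connection refused"
    }
    if (failure != null) {
      return Verdict.UNKNOWN;   // timeouts and anything else prove nothing
    }
    if (reportedStartCode == null) {
      return Verdict.UNKNOWN;   // no exception, but no answer either
    }
    // condition 2: a different start code means a different server instance
    return reportedStartCode.longValue() == expectedStartCode ? Verdict.ALIVE : Verdict.DEAD;
  }

  public static void main(String[] args) {
    System.out.println(classify(5L, null, new SocketTimeoutException("timed out")));
    System.out.println(classify(5L, 6L, null));
  }
}
```

The {{UNKNOWN}} verdict is the point of the sketch: a timeout is kept apart 
from both outcomes, so the caller can keep retrying instead of declaring the 
server dead.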

Thanks.



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-18 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592810#comment-14592810
 ] 

Enis Soztutar commented on HBASE-13937:
---

The behavior will be reverted to what it was before 13172 was committed 
(1.0.1, 1.1.0 and 0.98.12). We will not treat exceptions coming from server 
ping differently, but instead will keep retrying to ping. 
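
As a rough illustration of that behavior (a simplified sketch, not the actual 
{{isServerReachable()}} code): every {{IOException}} from the ping is handled 
the same way, and the loop only gives up once the retry budget is used up. The 
{{Ping}} interface, {{isReachable}} and {{maxAttempts}} are hypothetical names, 
and the sleep between attempts is left out to keep the sketch short.

```java
import java.io.IOException;

/** Sketch only: a ping loop that treats every IOException alike and keeps retrying. */
final class PingRetrySketch {
  /** Hypothetical stand-in for the admin RPC used to ping a region server. */
  interface Ping {
    boolean ping() throws IOException;
  }

  static boolean isReachable(Ping ping, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        // A reply, positive or negative, ends the loop.
        return ping.ping();
      } catch (IOException ioe) {
        // No special case for "stopped" or "not running yet":
        // fall through and try again until the attempts run out.
      }
    }
    return false; // never got an answer within the retry budget
  }

  public static void main(String[] args) {
    final int[] calls = {0};
    boolean reachable = isReachable(new Ping() {
      @Override
      public boolean ping() throws IOException {
        if (++calls[0] < 3) {
          throw new IOException("server not running yet");
        }
        return true;
      }
    }, 5);
    System.out.println(reachable + " after " + calls[0] + " attempts");
  }
}
```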



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-18 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592734#comment-14592734
 ] 

stack commented on HBASE-13937:
---

What will the change in behavior be [~enis] with this patch in place?



[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172

2015-06-18 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592720#comment-14592720
 ] 

Enis Soztutar commented on HBASE-13937:
---

[~Apache9] FYI. 
