[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2015-01-09 Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272369#comment-14272369
 ] 

Hudson commented on HBASE-12028:


FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #751 (See 
[https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/751/])
Amend HBASE-12787 Backport HBASE-12028 (Abort the RegionServer when it's 
handler threads die) to 0.98 (Alicia Ying Shu); Fix Hadoop 1 build (apurtell: 
rev 81e6831af812a02742a9ae76d0fa184eb7255719)
* hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcExecutor.java


> Abort the RegionServer, when it's handler threads die
> -
>
> Key: HBASE-12028
> URL: https://issues.apache.org/jira/browse/HBASE-12028
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Reporter: Sudarshan Kadambi
>Assignee: Alicia Ying Shu
> Fix For: 1.0.0, 2.0.0, 1.1.0
>
> Attachments: Hbase-12028-v3.patch, Hbase-12028.patch, 
> hbase-12028-v4.patch, hbase-12028-v5-branch-1.patch, 
> hbase-12028-v5-master.patch, hbase-12028-v5.patch
>
>
> Over in HBASE-11813, a user identified an issue wherein all the RPC handler 
> threads would exit with StackOverflowError due to an unchecked 
> recursion-terminating condition. Our clusters demonstrated the same trace. 
> While the patch posted for HBASE-11813 got our clusters to be merry again, 
> the breakdown surfaced some larger issues.
>
> When the RegionServer had all its RPC handler threads dead, it continued to 
> have regions assigned to it. Clearly, it wouldn't be able to serve reads and 
> writes on those regions. A second issue was that when a user tried to disable 
> or drop a table, the master would try to communicate with the RegionServer for 
> region unassignment. Since the same handler threads seem to be used for 
> master <-> RS communication as well, the master ended up hanging on the RS 
> indefinitely. Eventually, the master stopped responding to all table 
> meta-operations.
>
> A handler thread should never exit, and if it does, it seems like the more 
> prudent thing to do would be for the RS to abort. This way, at least recovery 
> can be undertaken and the regions could be reassigned elsewhere. I also think 
> that master <-> RS communication should get its own exclusive thread pool, 
> but I'll wait until this issue has been sufficiently discussed before opening 
> a ticket for that.
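A minimal Java sketch of the behaviour requested above: a handler thread that catches any Throwable escaping a call (such as the StackOverflowError from HBASE-11813) and aborts the server instead of silently exiting. The Abortable interface is declared locally and only mirrors the shape of HBase's org.apache.hadoop.hbase.Abortable; Handler, FakeServer and HandlerDemo are hypothetical names, not classes from the committed patch.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not the HBASE-12028 patch. Declared locally; only mirrors
// the two methods of HBase's org.apache.hadoop.hbase.Abortable.
interface Abortable {
  void abort(String why, Throwable cause);
  boolean isAborted();
}

// Hypothetical stand-in for a RegionServer that records the first abort reason.
class FakeServer implements Abortable {
  private final AtomicReference<String> reason = new AtomicReference<>();

  @Override public void abort(String why, Throwable cause) { reason.compareAndSet(null, why); }
  @Override public boolean isAborted() { return reason.get() != null; }
  String abortReason() { return reason.get(); }
}

// Hypothetical handler thread: runs queued calls until the server is aborted.
class Handler extends Thread {
  private final BlockingQueue<Runnable> calls;
  private final Abortable server;

  Handler(String name, BlockingQueue<Runnable> calls, Abortable server) {
    super(name);
    this.calls = calls;
    this.server = server;
  }

  @Override public void run() {
    try {
      while (!server.isAborted()) {
        calls.take().run(); // an Error such as StackOverflowError escapes from here
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // interrupted during shutdown: exit quietly
    } catch (Throwable t) {
      // The handler is about to die: abort the server so its regions can be
      // reassigned, rather than leaving a server that holds regions it cannot serve.
      server.abort(getName() + " died: " + t, t);
    }
  }
}

class HandlerDemo {
  // Feeds one call that throws StackOverflowError and returns the recorded abort reason.
  static String crashOneHandler() {
    FakeServer server = new FakeServer();
    BlockingQueue<Runnable> calls = new LinkedBlockingQueue<>();
    Handler handler = new Handler("handler-0", calls, server);
    handler.start();
    calls.add(() -> { throw new StackOverflowError(); });
    try {
      handler.join(5_000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return server.abortReason();
  }
}
```

Here HandlerDemo.crashOneHandler() returns "handler-0 died: java.lang.StackOverflowError": the dying handler turns a silent thread exit into an explicit server abort.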



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2015-01-09 Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272363#comment-14272363
 ] 

Hudson commented on HBASE-12028:


SUCCESS: Integrated in HBase-0.98 #786 (See 
[https://builds.apache.org/job/HBase-0.98/786/])
Amend HBASE-12787 Backport HBASE-12028 (Abort the RegionServer when it's 
handler threads die) to 0.98 (Alicia Ying Shu); Fix Hadoop 1 build (apurtell: 
rev 81e6831af812a02742a9ae76d0fa184eb7255719)
* hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcExecutor.java


[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2015-01-09 Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272330#comment-14272330
 ] 

Hudson commented on HBASE-12028:


SUCCESS: Integrated in HBase-0.98 #785 (See 
[https://builds.apache.org/job/HBase-0.98/785/])
HBASE-12787 Backport HBASE-12028 (Abort the RegionServer when it's handler 
threads die) to 0.98 (Alicia Ying Shu) (apurtell: rev 
b4b1b9c46308747b14620d1010526562a3fc4ff5)
* hbase-server/src/test/java/org/apache/hadoop/hbase/ipc/TestSimpleRpcScheduler.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleRpcScheduler.java
* hbase-common/src/main/resources/hbase-default.xml
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RWQueueRpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SimpleRpcSchedulerFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/BalancedQueueRpcExecutor.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java


[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2015-01-09 Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272288#comment-14272288
 ] 

Hudson commented on HBASE-12028:


FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #750 (See 
[https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/750/])
HBASE-12787 Backport HBASE-12028 (Abort the RegionServer when it's handler 
threads die) to 0.98 (Alicia Ying Shu) (apurtell: rev 
b4b1b9c46308747b14620d1010526562a3fc4ff5)
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleRpcScheduler.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RWQueueRpcExecutor.java
* hbase-common/src/main/resources/hbase-default.xml
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/BalancedQueueRpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/ipc/TestSimpleRpcScheduler.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SimpleRpcSchedulerFactory.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java


[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2015-01-02 Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263348#comment-14263348
 ] 

Hudson commented on HBASE-12028:


SUCCESS: Integrated in HBase-1.0 #626 (See 
[https://builds.apache.org/job/HBase-1.0/626/])
HBASE-12028 Abort the RegionServer, when it's handler threads die (Alicia Ying 
Shu) (enis: rev f960f2a9062a4ab3bccdcd2718f001eed54c9d18)
* hbase-common/src/main/resources/hbase-default.xml
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SimpleRpcSchedulerFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RpcSchedulerFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/BalancedQueueRpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleRpcScheduler.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/ipc/TestRpcHandlerException.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RWQueueRpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java


[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2015-01-02 Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263316#comment-14263316
 ] 

Hudson commented on HBASE-12028:


FAILURE: Integrated in HBase-1.1 #45 (See 
[https://builds.apache.org/job/HBase-1.1/45/])
HBASE-12028 Abort the RegionServer, when it's handler threads die (Alicia Ying 
Shu) (enis: rev ecbdc45d3d68d83ee001a56b2735b5f5dc63b3e2)
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RWQueueRpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SimpleRpcSchedulerFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleRpcScheduler.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/ipc/TestRpcHandlerException.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RpcSchedulerFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcExecutor.java
* hbase-common/src/main/resources/hbase-default.xml
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/BalancedQueueRpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java


[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2015-01-02 Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263306#comment-14263306
 ] 

Hudson commented on HBASE-12028:


SUCCESS: Integrated in HBase-TRUNK #5984 (See 
[https://builds.apache.org/job/HBase-TRUNK/5984/])
HBASE-12028 Abort the RegionServer, when it's handler threads die (Alicia Ying 
Shu) (enis: rev 820f629423f21fbd1dcc7a383955443a2595fd5d)
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SimpleRpcSchedulerFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java
* hbase-common/src/main/resources/hbase-default.xml
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RpcSchedulerFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleRpcScheduler.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/BalancedQueueRpcExecutor.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RWQueueRpcExecutor.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/ipc/TestRpcHandlerException.java


> Abort the RegionServer, when it's handler threads die
> -
>
> Key: HBASE-12028
> URL: https://issues.apache.org/jira/browse/HBASE-12028
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Reporter: Sudarshan Kadambi
>Assignee: Alicia Ying Shu
> Fix For: 1.0.0, 2.0.0, 1.1.0
>
> Attachments: Hbase-12028-v3.patch, Hbase-12028.patch, 
> hbase-12028-v4.patch, hbase-12028-v5.patch


[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2014-12-30 Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261725#comment-14261725
 ] 

Andrew Purtell commented on HBASE-12028:


In the discussion "Considering a RpcSchedulerFactory change in 0.98 for 
HBASE-12028" on dev@phoenix, James would like binary compatibility for their 
4.2 release if possible. We can do that with reflection I think, but let's do 
it in a backport issue instead of here, or decide not to do it there. See 
HBASE-12787. 
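One way the reflection-based compatibility mentioned above could look, sketched under assumptions: the caller probes the factory's class for a create method that also takes an Abortable and falls back to the old signature when it is missing, so a factory compiled against the older API keeps working without recompilation. LegacyFactory, AbortAwareFactory, ReflectiveCreate, and the String standing in for the configuration are all hypothetical; this is not the HBASE-12787 code.

```java
import java.lang.reflect.Method;

// Declared locally for the sketch; mirrors the shape of HBase's Abortable.
interface Abortable {
  void abort(String why, Throwable cause);
  boolean isAborted();
}

// Hypothetical factory compiled against the old API: no Abortable parameter.
class LegacyFactory {
  public Object create(String conf) {
    return "legacy scheduler for " + conf;
  }
}

// Hypothetical factory written against the new API.
class AbortAwareFactory {
  public Object create(String conf, Abortable server) {
    return "abort-aware scheduler for " + conf + (server == null ? " (no server)" : " (with server)");
  }
}

// Hypothetical caller-side helper (the "server" in this sketch).
class ReflectiveCreate {
  // Prefers create(String, Abortable); falls back to create(String) for old factories.
  static Object createScheduler(Object factory, String conf, Abortable server) {
    Class<?> cls = factory.getClass();
    try {
      try {
        Method withAbort = cls.getMethod("create", String.class, Abortable.class);
        return withAbort.invoke(factory, conf, server);
      } catch (NoSuchMethodException e) {
        Method legacy = cls.getMethod("create", String.class);
        return legacy.invoke(factory, conf); // old factory: it never learns about the server
      }
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("cannot create a scheduler from " + cls.getName(), e);
    }
  }
}
```

The cost of this approach is that a missing or misspelled method is only discovered at run time, which is part of why it is "doable but ugly" compared with a plain interface change.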

> Abort the RegionServer, when it's handler threads die
> -
>
> Key: HBASE-12028
> URL: https://issues.apache.org/jira/browse/HBASE-12028
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Reporter: Sudarshan Kadambi
>Assignee: Alicia Ying Shu
> Attachments: Hbase-12028-v3.patch, Hbase-12028.patch, 
> hbase-12028-v4.patch, hbase-12028-v5.patch


[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2014-12-30 Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261591#comment-14261591
 ] 

Andrew Purtell commented on HBASE-12028:


I mailed dev@phoenix and copied dev@hbase. Let's see what the response is. 


[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2014-12-30 Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261578#comment-14261578
 ] 

Andrew Purtell commented on HBASE-12028:


I was planning to raise this with the Phoenix devs because even if we drop the 
interface change (doable but ugly) they would want to receive a useful 
Abortable. 


[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2014-12-30 Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261522#comment-14261522
 ] 

Enis Soztutar commented on HBASE-12028:
---

I think we can commit to 0.98 as well, if [~ayingshu] provides a 0.98 patch 
with changed default behavior (we want it disabled by default in 0.98). Since 
the new method is in an interface rather than a base class, it will break 
Phoenix compilation against newer 0.98.x versions unless we make a change in 
Phoenix. 
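Why the location of the new method matters: adding an abstract method to an interface breaks every existing implementor at compile time, while a concrete method added to a base class does not. A minimal sketch with hypothetical names (BaseSchedulerFactory, DownstreamFactory), in which the new overload simply falls back to the old method and ignores the Abortable:

```java
// Declared locally; mirrors the shape of HBase's Abortable.
interface Abortable {
  void abort(String why, Throwable cause);
  boolean isAborted();
}

// Hypothetical base class. The original abstract method is unchanged; the new
// overload is concrete, so subclasses written before it existed still compile.
abstract class BaseSchedulerFactory {
  public abstract String create(String conf);

  public String create(String conf, Abortable server) {
    return create(conf); // default: ignore the Abortable and keep the old behavior
  }
}

// Hypothetical downstream subclass that only knows about the old method.
class DownstreamFactory extends BaseSchedulerFactory {
  @Override public String create(String conf) {
    return "downstream scheduler for " + conf;
  }
}
```

Had create(String, Abortable) been declared abstractly in an interface instead, DownstreamFactory would stop compiling until it implemented the new method. Java 8 default methods give the same effect as the base class, where the target JDK allows them.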


[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2014-12-30 Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261377#comment-14261377
 ] 

Andrew Purtell commented on HBASE-12028:


Are you intending a commit to 0.98 also? The key change is new constructors for 
passing an Abortable to RPC schedulers, with the existing constructors retained 
and deprecated. This seems fine as long as the default configuration preserves 
current behavior. 
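The constructor pattern described above, sketched with a hypothetical SketchScheduler: the new constructor takes an Abortable, and the deprecated old one delegates with null, so existing callers keep compiling and keep the old, non-aborting behavior. The class names and the onHandlerDeath hook are illustrative assumptions, not the actual RPC scheduler API.

```java
// Declared locally; mirrors the shape of HBase's Abortable.
interface Abortable {
  void abort(String why, Throwable cause);
  boolean isAborted();
}

// Hypothetical server that just remembers whether it was aborted.
class RecordingServer implements Abortable {
  private volatile boolean aborted;

  @Override public void abort(String why, Throwable cause) { aborted = true; }
  @Override public boolean isAborted() { return aborted; }
}

// Hypothetical scheduler showing the constructor-compatibility pattern.
class SketchScheduler {
  private final int handlerCount;
  private final Abortable server; // null when built through the old constructor

  // New constructor: callers that have a server pass it so handlers can abort it.
  SketchScheduler(int handlerCount, Abortable server) {
    this.handlerCount = handlerCount;
    this.server = server;
  }

  // Old constructor, retained for existing callers; there is no server to abort.
  @Deprecated
  SketchScheduler(int handlerCount) {
    this(handlerCount, null);
  }

  int handlerCount() { return handlerCount; }

  // Called when a handler dies; returns true only if the server was asked to abort.
  boolean onHandlerDeath(String handlerName, Throwable t) {
    if (server == null) {
      return false; // legacy path: behave as before and do not abort anything
    }
    server.abort(handlerName + " died", t);
    return true;
  }
}
```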


[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2014-12-30 Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261361#comment-14261361
 ] 

Enis Soztutar commented on HBASE-12028:
---

Following some offline discussion, Alicia reverted the 
RpcSchedulerFactory.Context change in v5 so that Phoenix can compile against 
both 0.98 and 1.1+ versions. 
+1 for the patch. Added some release notes to the issue. This will be ON by 
default. Will commit to branch-1+ unless there are objections.  
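Being ON by default in branch-1+ but disabled by default in 0.98 implies a configurable policy. A hypothetical sketch: abort once the fraction of dead handlers exceeds a threshold, with a negative threshold meaning "never abort". The class name, the fraction-based rule and the negative-disables convention are assumptions for illustration only; the real setting is whatever the committed patch added to hbase-default.xml.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical policy, not the HBase implementation: decides whether handler
// deaths should abort the server.
class HandlerFailurePolicy {
  private final int handlerCount;
  private final double abortFraction; // negative = never abort (the "disabled" setting)
  private final AtomicInteger failed = new AtomicInteger();

  HandlerFailurePolicy(int handlerCount, double abortFraction) {
    if (handlerCount <= 0) {
      throw new IllegalArgumentException("handlerCount must be positive");
    }
    this.handlerCount = handlerCount;
    this.abortFraction = abortFraction;
  }

  // Records one dead handler; returns true if the server should now abort.
  boolean recordFailure() {
    int dead = failed.incrementAndGet(); // thread-safe: handlers may die concurrently
    if (abortFraction < 0) {
      return false;
    }
    return (double) dead / handlerCount > abortFraction;
  }
}
```

With 4 handlers and a threshold of 0.5, the first two deaths (0.25 and 0.5) do not trigger an abort and the third (0.75) does; with a negative threshold no number of deaths ever does.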
