[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die
[ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272369#comment-14272369 ]

Hudson commented on HBASE-12028:

FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #751 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/751/])
Amend HBASE-12787 Backport HBASE-12028 (Abort the RegionServer when it's handler threads die) to 0.98 (Alicia Ying Shu); Fix Hadoop 1 build (apurtell: rev 81e6831af812a02742a9ae76d0fa184eb7255719)
* hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcExecutor.java

> Abort the RegionServer, when it's handler threads die
> -----------------------------------------------------
>
> Key: HBASE-12028
> URL: https://issues.apache.org/jira/browse/HBASE-12028
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Reporter: Sudarshan Kadambi
> Assignee: Alicia Ying Shu
> Fix For: 1.0.0, 2.0.0, 1.1.0
>
> Attachments: Hbase-12028-v3.patch, Hbase-12028.patch, hbase-12028-v4.patch, hbase-12028-v5-branch-1.patch, hbase-12028-v5-master.patch, hbase-12028-v5.patch
>
>
> Over in HBASE-11813, a user identified an issue wherein all the RPC handler
> threads would exit with StackOverflow errors due to an unchecked
> recursion-terminating condition. Our clusters demonstrated the same trace.
> While the patch posted for HBASE-11813 got our clusters to be merry again,
> the breakdown surfaced some larger issues.
>
> When the RegionServer had all its RPC handler threads dead, it continued to
> have regions assigned to it. Clearly, it wouldn't be able to serve reads and
> writes on those regions. A second issue was that when a user tried to disable
> or drop a table, the master would try to communicate with the regionserver for
> region unassignment. Since the same handler threads seem to be used for
> master <-> RS communication as well, the master ended up hanging on the RS
> indefinitely. Eventually, the master stopped responding to all table
> meta-operations.
>
> A handler thread should never exit, and if it does, it seems like the more
> prudent thing to do would be for the RS to abort. This way, at least recovery
> can be undertaken and the regions could be reassigned elsewhere. I also think
> that the master <-> RS communication should get its own exclusive threadpool,
> but I'll wait until this issue has been sufficiently discussed before opening
> an issue ticket for that.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
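The behaviour the description asks for (a dying handler thread should lead to a server abort rather than a silently degraded server) can be sketched roughly as follows. This is a minimal illustration and not the code from the attached patches: `HandlerPoolSketch`, `AbortHook`, and the `abortFraction` threshold are all hypothetical names and parameters chosen for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: none of these names come from the HBase patch.
class HandlerPoolSketch {
  // Hypothetical stand-in for whatever lets the pool abort its server.
  interface AbortHook {
    void abort(String why, Throwable cause);
  }

  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final AtomicInteger failedHandlers = new AtomicInteger();
  private final int handlerCount;
  private final double abortFraction; // assumed knob; a negative value disables the abort
  private final AbortHook abortHook;

  HandlerPoolSketch(int handlerCount, double abortFraction, AbortHook abortHook) {
    this.handlerCount = handlerCount;
    this.abortFraction = abortFraction;
    this.abortHook = abortHook;
  }

  void submit(Runnable call) {
    queue.add(call);
  }

  // Body of one handler thread: run queued calls until interrupted.
  void consumerLoop() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        queue.take().run();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // ordinary shutdown, not a failure
    } catch (Throwable t) {
      // Reached for anything a call lets escape, including StackOverflowError.
      int failed = failedHandlers.incrementAndGet();
      if (abortFraction >= 0 && failed >= abortFraction * handlerCount) {
        abortHook.abort(failed + " of " + handlerCount + " handlers died", t);
      }
      // Below the threshold this thread still exits; a fuller version would log that.
    }
  }
}
```

Catching `Throwable` rather than `Exception` matters here because the failure reported in the description was a `StackOverflowError`, which is an `Error` and would slip past a `catch (Exception e)` clause.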
[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die
[ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272363#comment-14272363 ]

Hudson commented on HBASE-12028:

SUCCESS: Integrated in HBase-0.98 #786 (See [https://builds.apache.org/job/HBase-0.98/786/])
Amend HBASE-12787 Backport HBASE-12028 (Abort the RegionServer when it's handler threads die) to 0.98 (Alicia Ying Shu); Fix Hadoop 1 build (apurtell: rev 81e6831af812a02742a9ae76d0fa184eb7255719)
* hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcExecutor.java
[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die
[ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272330#comment-14272330 ]

Hudson commented on HBASE-12028:

SUCCESS: Integrated in HBase-0.98 #785 (See [https://builds.apache.org/job/HBase-0.98/785/])
HBASE-12787 Backport HBASE-12028 (Abort the RegionServer when it's handler threads die) to 0.98 (Alicia Ying Shu) (apurtell: rev b4b1b9c46308747b14620d1010526562a3fc4ff5)
* hbase-server/src/test/java/org/apache/hadoop/hbase/ipc/TestSimpleRpcScheduler.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleRpcScheduler.java
* hbase-common/src/main/resources/hbase-default.xml
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RWQueueRpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SimpleRpcSchedulerFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/BalancedQueueRpcExecutor.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die
[ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272288#comment-14272288 ]

Hudson commented on HBASE-12028:

FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #750 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/750/])
HBASE-12787 Backport HBASE-12028 (Abort the RegionServer when it's handler threads die) to 0.98 (Alicia Ying Shu) (apurtell: rev b4b1b9c46308747b14620d1010526562a3fc4ff5)
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleRpcScheduler.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RWQueueRpcExecutor.java
* hbase-common/src/main/resources/hbase-default.xml
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/BalancedQueueRpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/ipc/TestSimpleRpcScheduler.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SimpleRpcSchedulerFactory.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die
[ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263348#comment-14263348 ]

Hudson commented on HBASE-12028:

SUCCESS: Integrated in HBase-1.0 #626 (See [https://builds.apache.org/job/HBase-1.0/626/])
HBASE-12028 Abort the RegionServer, when it's handler threads die (Alicia Ying Shu) (enis: rev f960f2a9062a4ab3bccdcd2718f001eed54c9d18)
* hbase-common/src/main/resources/hbase-default.xml
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SimpleRpcSchedulerFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RpcSchedulerFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/BalancedQueueRpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleRpcScheduler.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/ipc/TestRpcHandlerException.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RWQueueRpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die
[ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263316#comment-14263316 ]

Hudson commented on HBASE-12028:

FAILURE: Integrated in HBase-1.1 #45 (See [https://builds.apache.org/job/HBase-1.1/45/])
HBASE-12028 Abort the RegionServer, when it's handler threads die (Alicia Ying Shu) (enis: rev ecbdc45d3d68d83ee001a56b2735b5f5dc63b3e2)
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RWQueueRpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SimpleRpcSchedulerFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleRpcScheduler.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/ipc/TestRpcHandlerException.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RpcSchedulerFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcExecutor.java
* hbase-common/src/main/resources/hbase-default.xml
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/BalancedQueueRpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java
[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die
[ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263306#comment-14263306 ]

Hudson commented on HBASE-12028:

SUCCESS: Integrated in HBase-TRUNK #5984 (See [https://builds.apache.org/job/HBase-TRUNK/5984/])
HBASE-12028 Abort the RegionServer, when it's handler threads die (Alicia Ying Shu) (enis: rev 820f629423f21fbd1dcc7a383955443a2595fd5d)
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcExecutor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SimpleRpcSchedulerFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java
* hbase-common/src/main/resources/hbase-default.xml
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RpcSchedulerFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleRpcScheduler.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/BalancedQueueRpcExecutor.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RWQueueRpcExecutor.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/ipc/TestRpcHandlerException.java
[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die
[ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261725#comment-14261725 ]

Andrew Purtell commented on HBASE-12028:

In the discussion "Considering a RpcSchedulerFactory change in 0.98 for HBASE-12028" on dev@phoenix, James would like binary compatibility for their 4.2 release if possible. We can do that with reflection I think, but let's do it in a backport issue instead of here, or decide not to do it there. See HBASE-12787.
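The comment above does not say how reflection would be used or on which side. One common pattern, sketched here purely as an assumption, is for the caller to look up the newer method signature at run time and fall back to the older one when it is absent. `ReflectiveCreateSketch`, `OldFactory`, `NewFactory`, `AbortHook`, and `createScheduler` are hypothetical names, and the factories return strings only to keep the example small.

```java
import java.lang.reflect.Method;

// Illustrative only: the class and method names are invented for this sketch.
class ReflectiveCreateSketch {
  // Hypothetical stand-in for an abort callback type.
  public interface AbortHook {
    void abort(String why, Throwable cause);
  }

  // A factory built before the two-argument method existed.
  public static class OldFactory {
    public String create(Integer priority) {
      return "old:" + priority;
    }
  }

  // A factory that also offers the newer two-argument method.
  public static class NewFactory {
    public String create(Integer priority) {
      return "new-without-hook:" + priority;
    }

    public String create(Integer priority, AbortHook hook) {
      return "new:" + priority;
    }
  }

  // Prefer the two-argument method when the factory has it, else use the old one.
  static String createScheduler(Object factory, int priority, AbortHook hook) {
    try {
      try {
        Method withHook = factory.getClass().getMethod("create", Integer.class, AbortHook.class);
        return (String) withHook.invoke(factory, priority, hook);
      } catch (NoSuchMethodException e) {
        Method legacy = factory.getClass().getMethod("create", Integer.class);
        return (String) legacy.invoke(factory, priority);
      }
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("could not invoke create()", e);
    }
  }
}
```

Because the lookup happens at run time, the calling code links against neither signature directly, so it keeps working against a factory compiled before or after the change.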
[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die
[ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261591#comment-14261591 ]

Andrew Purtell commented on HBASE-12028:

I mailed dev@phoenix and copied dev@hbase. Let's see what the response is.
[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die
[ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261578#comment-14261578 ]

Andrew Purtell commented on HBASE-12028:

I was planning to raise this with the Phoenix devs because even if we drop the interface change (doable but ugly) they would want to receive a useful Abortable.
[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die
[ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261522#comment-14261522 ]

Enis Soztutar commented on HBASE-12028:

I think we can commit to 0.98 as well, if [~ayingshu] provides a 0.98 patch with changed default behavior (we want this to be disabled by default in 0.98). It will break Phoenix compilation against newer 0.98.x versions, though, since the new method is in an interface and not in a base class, unless we make a change in Phoenix.
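The interface-versus-base-class point can be illustrated with a small sketch. When a new abstract method is added to a plain interface, every existing implementer stops compiling until it adds that method. If the extension point were an abstract base class instead, the new overload could carry a body that delegates to the old method, and subclasses written earlier would still compile. The names below (`FactoryCompatSketch`, `BaseFactory`, `LegacyFactory`, `AbortHook`) are hypothetical and are not the HBase or Phoenix classes.

```java
// Illustrative only: hypothetical names, not the HBase or Phoenix types.
class FactoryCompatSketch {
  // Hypothetical stand-in for an abort callback type.
  interface AbortHook {
    void abort(String why, Throwable cause);
  }

  // Base-class style extension point: the newer overload has a body that
  // delegates to the older method, so subclasses need not override it.
  abstract static class BaseFactory {
    abstract String create(int priority);

    String create(int priority, AbortHook hook) {
      return create(priority);
    }
  }

  // A "third-party" subclass written before the overload existed.
  // It overrides only the old method and still compiles unchanged.
  static class LegacyFactory extends BaseFactory {
    @Override
    String create(int priority) {
      return "legacy:" + priority;
    }
  }
}
```

Calling `new FactoryCompatSketch.LegacyFactory().create(3, null)` falls through to the one-argument method. The trade-off is that the legacy subclass silently ignores the hook, which is the concern raised elsewhere in this thread about downstream code wanting to receive a useful Abortable.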
[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die
[ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261377#comment-14261377 ]

Andrew Purtell commented on HBASE-12028:

Are you intending a commit to 0.98 also? The key change is new constructors for passing in an Abortable to RPC schedulers, and existing constructors are retained and deprecated. This seems fine as long as the default configuration preserves current behavior.
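The constructor pattern described in that comment (a new constructor that accepts an abort callback, with the old constructor retained and deprecated) might look like the sketch below. `SchedulerSketch` and its nested `AbortHook` are hypothetical names used for illustration, and treating a missing hook as "never abort" is an assumption of this example rather than something the thread states.

```java
// Illustrative only: hypothetical names, not the actual HBase scheduler classes.
class SchedulerSketch {
  // Hypothetical stand-in for an abort callback type.
  interface AbortHook {
    void abort(String why, Throwable cause);
  }

  private final int handlerCount;
  private final AbortHook abortHook; // null means "never abort" in this sketch

  /** Retained so existing callers keep compiling; behaves as before (no abort). */
  @Deprecated
  SchedulerSketch(int handlerCount) {
    this(handlerCount, null);
  }

  /** Newer constructor: the caller supplies the hook used when handlers die. */
  SchedulerSketch(int handlerCount, AbortHook abortHook) {
    this.handlerCount = handlerCount;
    this.abortHook = abortHook;
  }

  int handlerCount() {
    return handlerCount;
  }

  boolean canAbort() {
    return abortHook != null;
  }
}
```

Routing the deprecated constructor through the new one keeps a single initialization path, and callers that never pass a hook see no change in behavior.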
[jira] [Commented] (HBASE-12028) Abort the RegionServer, when it's handler threads die
[ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261361#comment-14261361 ]

Enis Soztutar commented on HBASE-12028:

After some offline discussion with Alicia, she reverted the RpcSchedulerFactory.Context change in v5 so that Phoenix can compile against both 0.98 and 1.1+ versions. +1 for the patch. Added some release notes to the issue. This will be ON by default. Will commit to branch-1+ unless there are objections.