[jira] [Updated] (HBASE-12028) Abort the RegionServer, when it's handler threads die
[ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-12028:
----------------------------------
    Attachment: hbase-12028-v5-branch-1.patch
                hbase-12028-v5-master.patch

Attaching the final patch as committed (with some whitespace/formatting changes).

> Abort the RegionServer, when it's handler threads die
> ------------------------------------------------------
>
>                 Key: HBASE-12028
>                 URL: https://issues.apache.org/jira/browse/HBASE-12028
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>            Reporter: Sudarshan Kadambi
>            Assignee: Alicia Ying Shu
>             Fix For: 1.0.0, 2.0.0, 1.1.0
>
>         Attachments: Hbase-12028-v3.patch, Hbase-12028.patch,
>                      hbase-12028-v4.patch, hbase-12028-v5-branch-1.patch,
>                      hbase-12028-v5-master.patch, hbase-12028-v5.patch
>
>
> Over in HBASE-11813, a user identified an issue wherein all the RPC handler
> threads would exit with StackOverflow errors due to an unchecked
> recursion-terminating condition. Our clusters demonstrated the same trace.
> While the patch posted for HBASE-11813 got our clusters to be merry again,
> the breakdown surfaced some larger issues.
>
> When the RegionServer had all its RPC handler threads dead, it continued to
> have regions assigned to it. Clearly, it wouldn't be able to serve reads and
> writes on those regions. A second issue was that when a user tried to disable
> or drop a table, the master would try to communicate with the regionserver for
> region unassignment. Since the same handler threads seem to be used for
> master <-> RS communication as well, the master ended up hanging on the RS
> indefinitely. Eventually, the master stopped responding to all table
> meta-operations.
>
> A handler thread should never exit, and if it does, the more prudent thing
> would be for the RS to abort. This way, at least recovery can be undertaken
> and the regions can be reassigned elsewhere. I also think that the
> master <-> RS communication should get its own exclusive threadpool, but I'll
> wait until this issue has been sufficiently discussed before opening a
> ticket for that.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12028) Abort the RegionServer, when it's handler threads die
Enis Soztutar updated HBASE-12028:
----------------------------------
      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

I've pushed this to 1.0+. Thanks Alicia for the patch.
[jira] [Updated] (HBASE-12028) Abort the RegionServer, when it's handler threads die
Enis Soztutar updated HBASE-12028:
----------------------------------
    Fix Version/s: 1.1.0
                   2.0.0
                   1.0.0
[jira] [Updated] (HBASE-12028) Abort the RegionServer, when it's handler threads die
Enis Soztutar updated HBASE-12028:
----------------------------------
    Release Note:
Adds a configuration property "hbase.regionserver.handler.abort.on.error.percent" for aborting the region server when some of its handler threads die. The default value is 0.5, which causes the RS to abort when half of its handler threads have died. A handler thread dies only in case of a serious software bug, or when a non-recoverable Error (StackOverflow, OOM, etc.) is thrown.

These are the possible values for the configuration:
 * -1  => Disable aborting
 * 0   => Abort if even a single handler has died
 * 0.x => Abort only when this fraction of the handlers has died (e.g. 0.5 is 50%)
 * 1   => Abort only when all of the handlers have died
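The threshold values listed in the release note can be illustrated with a small sketch. This is only one reading of the release note, not the logic of the committed patch: the class `HandlerAbortPolicy` and the method `shouldAbort` are hypothetical names, and the choice of ">=" at the boundary is an assumption made so that 0.5 aborts at exactly half and 1 aborts only when every handler is dead.

```java
/**
 * Hypothetical helper (not part of HBase) that interprets the values of
 * "hbase.regionserver.handler.abort.on.error.percent" as the release note
 * describes them.
 */
final class HandlerAbortPolicy {
    /** Property name taken from the release note. */
    static final String KEY = "hbase.regionserver.handler.abort.on.error.percent";
    /** Default stated in the release note. */
    static final double DEFAULT_THRESHOLD = 0.5;

    private HandlerAbortPolicy() {}

    /**
     * @param threshold     configured value: -1 disables, 0..1 is a fraction
     * @param deadHandlers  number of handler threads that have died
     * @param totalHandlers number of handler threads that were started
     * @return true if, under this reading, the region server should abort
     */
    static boolean shouldAbort(double threshold, int deadHandlers, int totalHandlers) {
        if (totalHandlers <= 0 || deadHandlers < 0 || deadHandlers > totalHandlers) {
            throw new IllegalArgumentException("invalid handler counts");
        }
        if (threshold < 0) {
            return false; // -1 => aborting is disabled
        }
        if (deadHandlers == 0) {
            return false; // nothing has died, so there is nothing to react to
        }
        // 0 => any dead handler; 0.x => that fraction; 1 => all handlers.
        // Assumption: the boundary is inclusive (>=).
        return deadHandlers >= threshold * totalHandlers;
    }

    public static void main(String[] args) {
        int total = 10;
        double[] thresholds = {-1, 0, DEFAULT_THRESHOLD, 1};
        for (double t : thresholds) {
            for (int dead = 0; dead <= total; dead += 5) {
                System.out.println(KEY + "=" + t + ", dead=" + dead + "/" + total
                        + " -> abort=" + shouldAbort(t, dead, total));
            }
        }
    }
}
```

Keeping the decision in a pure function makes the boundary cases easy to check in isolation; whether the actual patch treats the boundary inclusively should be confirmed against the attached hbase-12028-v5 patches.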
[jira] [Updated] (HBASE-12028) Abort the RegionServer, when it's handler threads die
Enis Soztutar updated HBASE-12028:
----------------------------------
    Summary: Abort the RegionServer, when it's handler threads die  (was: Abort the RegionServer, when one of it's handler threads die)