[jira] [Updated] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2015-01-02 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-12028:
--
Attachment: hbase-12028-v5-branch-1.patch
hbase-12028-v5-master.patch

Attaching the final patches as committed (with some whitespace/formatting changes).

> Abort the RegionServer, when it's handler threads die
> -
>
> Key: HBASE-12028
> URL: https://issues.apache.org/jira/browse/HBASE-12028
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Reporter: Sudarshan Kadambi
>Assignee: Alicia Ying Shu
> Fix For: 1.0.0, 2.0.0, 1.1.0
>
> Attachments: Hbase-12028-v3.patch, Hbase-12028.patch, 
> hbase-12028-v4.patch, hbase-12028-v5-branch-1.patch, 
> hbase-12028-v5-master.patch, hbase-12028-v5.patch
>
>
> Over in HBASE-11813, a user identified an issue wherein all the RPC handler 
> threads would exit with StackOverflow errors due to an unchecked 
> recursion-terminating condition. Our clusters demonstrated the same trace. 
> While the patch posted for HBASE-11813 got our clusters to be merry again, 
> the breakdown surfaced some larger issues.
> When the RegionServer had all its RPC handler threads dead, it continued to 
> have regions assigned to it. Clearly, it wouldn't be able to serve reads and 
> writes on those regions. A second issue was that when a user tried to disable 
> or drop a table, the master would try to communicate with the regionserver for 
> region unassignment. Since the same handler threads seem to be used for 
> master <-> RS communication as well, the master ended up hanging on the RS 
> indefinitely. Eventually, the master stopped responding to all table 
> meta-operations.
> A handler thread should never exit, and if it does, it seems like the more 
> prudent thing to do would be for the RS to abort. This way, at least recovery 
> can be undertaken and the regions could be reassigned elsewhere. I also think 
> that the master<->RS communication should get its own exclusive threadpool, 
> but I'll wait until this issue has been sufficiently discussed before opening 
> an issue ticket for that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2015-01-02 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-12028:
--
  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

I've pushed this to 1.0+. Thanks Alicia for the patch. 



[jira] [Updated] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2014-12-30 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-12028:
--
Fix Version/s: 1.1.0
   2.0.0
   1.0.0



[jira] [Updated] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2014-12-30 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-12028:
--
Release Note: 
Adds a configuration property 
"hbase.regionserver.handler.abort.on.error.percent" for aborting the region 
server when some of its handler threads die. The default value is 0.5, which 
aborts the RS when half of its handler threads have died. A handler thread 
dies only in case of a serious software bug, or when a non-recoverable Error 
(StackOverflowError, OutOfMemoryError, etc.) is thrown. 
These are the possible values for the configuration:
   * -1  => Disable aborting
   * 0   => Abort if even a single handler has died
   * 0.x => Abort only when this fraction of the handlers has died
   * 1   => Abort only when all of the handlers have died
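
The four cases above can be read as one threshold check. The following is a minimal sketch of that reading, not the committed patch: the class HandlerAbortPolicy and the method shouldAbort are hypothetical names, and the ">=" comparison is an assumption taken from the wording "when half of its handler threads die".

```java
// Hypothetical illustration of the threshold semantics in the release note.
// This is NOT the HBase implementation; names and comparison are assumptions.
final class HandlerAbortPolicy {
    private HandlerAbortPolicy() {}

    /**
     * @param threshold      configured value: negative disables aborting,
     *                       otherwise a fraction in [0, 1]
     * @param totalHandlers  number of handler threads that were started
     * @param failedHandlers number of handler threads that have died so far
     * @return true if the region server should abort
     */
    static boolean shouldAbort(double threshold, int totalHandlers, int failedHandlers) {
        // A negative value (-1) disables aborting; with no failures there is nothing to do.
        if (threshold < 0 || failedHandlers <= 0) {
            return false;
        }
        // threshold 0 aborts on the first failure, 1 only when every handler has died.
        return failedHandlers >= threshold * totalHandlers;
    }

    public static void main(String[] args) {
        int total = 30; // example handler count, chosen only for illustration
        System.out.println(shouldAbort(-1, total, 30));  // false: aborting disabled
        System.out.println(shouldAbort(0, total, 1));    // true: a single handler died
        System.out.println(shouldAbort(0.5, total, 14)); // false: fewer than half died
        System.out.println(shouldAbort(0.5, total, 15)); // true: half died
        System.out.println(shouldAbort(1, total, 29));   // false: one handler still alive
        System.out.println(shouldAbort(1, total, 30));   // true: all handlers died
    }
}
```

Under this reading, the default of 0.5 tolerates isolated handler deaths but still aborts well before the server is left with no handlers at all, which is the situation the issue description reports.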




[jira] [Updated] (HBASE-12028) Abort the RegionServer, when it's handler threads die

2014-12-30 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-12028:
--
Summary: Abort the RegionServer, when it's handler threads die  (was: Abort 
the RegionServer, when one of it's handler threads die)
