[jira] [Commented] (HBASE-14059) We should add a RS to the dead servers list if admin calls fail more than a threshold
[ https://issues.apache.org/jira/browse/HBASE-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633048#comment-14633048 ]

Duo Zhang commented on HBASE-14059:
-----------------------------------

{quote}
In the issue I ran into it was a bad region causing the RS to be blocked for a long time.
{quote}
More details? What does 'bad' mean? And you said the region is back to normal when you kill the RS, so I think there may be another bug?

In general, I agree with you: we should offline a RS if admin calls always fail. But it should be used to fix a 'bad' RS, not a 'bad' region. If there is a 'bad' region that cannot be fixed by reassignment, then, as [~chenheng] said, the 'bad' region will kill all regionservers in your cluster...

Thanks.

> We should add a RS to the dead servers list if admin calls fail more than a threshold
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-14059
>                 URL: https://issues.apache.org/jira/browse/HBASE-14059
>             Project: HBase
>          Issue Type: Bug
>          Components: master, regionserver, rpc
>    Affects Versions: 0.98.13
>            Reporter: Esteban Gutierrez
>            Assignee: Esteban Gutierrez
>            Priority: Critical
>
> I ran into this problem twice this week: calls from the HBase master to a RS
> can time out because the RS call queue has been maxed out. However, since
> the RS is not dead (its ephemeral znode is still present), the master keeps
> attempting to perform admin tasks such as opening or closing a region, but
> those operations eventually fail after we run out of retries or the
> assignment manager attempts to re-assign to other RSs. As side effects
> of this, I've noticed master operations being fully blocked, or RITs, since we
> cannot close the region and open it in a new location while the RS is not
> dead.
> A potential solution for this is to add the RS to the list of dead RSs after
> a certain number of calls from the master to the RS fail.
> I've only noticed the problem in 0.98.x, but it should be present in all
> versions.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14059) We should add a RS to the dead servers list if admin calls fail more than a threshold
[ https://issues.apache.org/jira/browse/HBASE-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631524#comment-14631524 ]

Esteban Gutierrez commented on HBASE-14059:
-------------------------------------------

I don't think that's a good idea because, even if we blacklist this region, other regions served by this RS might be unavailable too. The safe option is to add the RS to the list of dead servers. One way could be to extend the Canary to remove the ephemeral znode for this RS, which would trigger recovery of its regions. We could make the threshold configurable (e.g. time-based, number of failed admin ops, etc.), or we could do this from the master in a much more coordinated way.
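
To make the "configurable threshold" idea concrete, here is a minimal sketch of how failed admin calls might be counted per region server over a time window. It is only an illustration under stated assumptions: the class `AdminCallFailureTracker`, its methods, and its parameters are hypothetical names, not HBase APIs, and nothing in this thread says a patch would look like this.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of HBase): tracks failed master-to-RS admin calls. */
class AdminCallFailureTracker {
    private final int maxFailures;   // count-based part of the threshold
    private final long windowMillis; // time-based part of the threshold
    private final Map<String, Deque<Long>> failures = new HashMap<>();

    AdminCallFailureTracker(int maxFailures, long windowMillis) {
        if (maxFailures < 1 || windowMillis < 1) {
            throw new IllegalArgumentException("thresholds must be positive");
        }
        this.maxFailures = maxFailures;
        this.windowMillis = windowMillis;
    }

    /**
     * Records one failed admin call at nowMillis.
     * Returns true once more than maxFailures failures fall inside the window,
     * i.e. when the caller might treat the server as a dead-server candidate.
     */
    synchronized boolean recordFailure(String serverName, long nowMillis) {
        Deque<Long> times = failures.computeIfAbsent(serverName, k -> new ArrayDeque<>());
        times.addLast(nowMillis);
        // Forget failures that are older than the configured window.
        while (!times.isEmpty() && nowMillis - times.peekFirst() > windowMillis) {
            times.removeFirst();
        }
        return times.size() > maxFailures;
    }

    /** A successful admin call clears the failure history of that server. */
    synchronized void recordSuccess(String serverName) {
        failures.remove(serverName);
    }
}
```

With `new AdminCallFailureTracker(2, 1000L)`, the third failure within one second makes `recordFailure` return true; what the master does at that point (expire the server, remove the znode, or something else) is deliberately left out of the sketch.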
[jira] [Commented] (HBASE-14059) We should add a RS to the dead servers list if admin calls fail more than a threshold
[ https://issues.apache.org/jira/browse/HBASE-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630851#comment-14630851 ]

Heng Chen commented on HBASE-14059:
-----------------------------------

So I think just adding this RS to the dead server list can't solve the problem, because the bad region could be moved to another RS and max out its call queue too. I think a better solution is to add the bad region to a blacklist and skip requests on this region, to avoid the RS being blocked.
[jira] [Commented] (HBASE-14059) We should add a RS to the dead servers list if admin calls fail more than a threshold
[ https://issues.apache.org/jira/browse/HBASE-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630811#comment-14630811 ]

Esteban Gutierrez commented on HBASE-14059:
-------------------------------------------

It could happen, but it should be very rare. It would require some kind of corruption in the store files that we wouldn't be able to detect at the time the RS opens the region. However, in the specific issue I ran into, killing the RS successfully restored cluster stability.
[jira] [Commented] (HBASE-14059) We should add a RS to the dead servers list if admin calls fail more than a threshold
[ https://issues.apache.org/jira/browse/HBASE-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630802#comment-14630802 ]

Heng Chen commented on HBASE-14059:
-----------------------------------

If the region server is shut down, is it possible that the bad region moves to another RS and maxes out that RS's call queue as well?
[jira] [Commented] (HBASE-14059) We should add a RS to the dead servers list if admin calls fail more than a threshold
[ https://issues.apache.org/jira/browse/HBASE-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630771#comment-14630771 ]

Esteban Gutierrez commented on HBASE-14059:
-------------------------------------------

In the issue I ran into, it was a bad region causing the RS to be blocked for a long time. From the point of view of the master, the RS was doing fine, since the master was getting region load info and the RS ephemeral znode was present. However, since the call queue length was maxed out, some operations, like closing or opening a region, were not successful. Hence this cluster ended up with regions in transition very frequently due to assignment issues on this RS.
[jira] [Commented] (HBASE-14059) We should add a RS to the dead servers list if admin calls fail more than a threshold
[ https://issues.apache.org/jira/browse/HBASE-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630745#comment-14630745 ]

Heng Chen commented on HBASE-14059:
-----------------------------------

Why was the region server's call queue full? Which operation blocked?
[jira] [Commented] (HBASE-14059) We should add a RS to the dead servers list if admin calls fail more than a threshold
[ https://issues.apache.org/jira/browse/HBASE-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623013#comment-14623013 ]

Esteban Gutierrez commented on HBASE-14059:
-------------------------------------------

Initially we should quarantine the RS by adding it to the list of draining RSs.
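
One possible reading of "initially" is a two-stage escalation: quarantine (drain) the RS at a lower failure count, and only consider it dead at a higher one. The sketch below illustrates that reading only. `QuarantinePolicy`, `Action`, and both thresholds are hypothetical names chosen for the example, not HBase code, and the comment above may equally mean "as a first implementation step".

```java
/** Hypothetical two-stage policy (not HBase code): drain first, declare dead later. */
class QuarantinePolicy {
    enum Action { NONE, DRAIN, MARK_DEAD }

    private final int drainAfter; // failures before the RS is quarantined
    private final int deadAfter;  // failures before the RS is treated as dead

    QuarantinePolicy(int drainAfter, int deadAfter) {
        if (drainAfter < 1 || deadAfter <= drainAfter) {
            throw new IllegalArgumentException("need 1 <= drainAfter < deadAfter");
        }
        this.drainAfter = drainAfter;
        this.deadAfter = deadAfter;
    }

    /** Maps a count of consecutive failed admin calls to the action to take. */
    Action actionFor(int consecutiveFailures) {
        if (consecutiveFailures >= deadAfter) {
            return Action.MARK_DEAD;
        }
        if (consecutiveFailures >= drainAfter) {
            return Action.DRAIN;
        }
        return Action.NONE;
    }
}
```

Draining first is the less disruptive choice: the RS stops receiving new regions but keeps serving the ones it has, so a false positive costs much less than a full region recovery.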