[jira] [Created] (HBASE-25815) RSGroupBasedLoadBalancer online status never updates after being set to true for the first time
Caroline created HBASE-25815: Summary: RSGroupBasedLoadBalancer online status never updates after being set to true for the first time Key: HBASE-25815 URL: https://issues.apache.org/jira/browse/HBASE-25815 Project: HBase Issue Type: Bug Reporter: Caroline Once the RSGroupBasedLoadBalancer is “online” (it has found the hbase:meta and hbase:rsgroup tables), it never updates that status again. That means if hbase:meta or hbase:rsgroup ever goes offline, the balancer doesn’t update its status to “offline,” so some callers still take the “online” code path even though the catalog tables aren’t available to be read from or written to (in particular, anything that calls RSGroupInfoManagerImpl#flushConfig). Also, in the RSGroupInfoManagerImpl#flushConfig code path, the write to hbase:rsgroup comes before the update to the rsGroupMap and tableMap, which are stored in memory (see the order of [these lines of code|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/rsgroup/RSGroupInfoManagerImpl.java#L664-L670]). So if hbase:rsgroup goes offline after the RSGroupBasedLoadBalancer is already marked as “online,” exceptions thrown while trying to write to the offline hbase:rsgroup table prevent the in-memory rsGroupMap and tableMap from being updated. The in-memory state should be updated first. -- This message was sent by Atlassian Jira (v8.3.4#803005)
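The ordering concern above can be illustrated with a minimal, self-contained Java sketch. It does not reproduce RSGroupInfoManagerImpl: the class FlushOrderSketch, the GroupTableWriter interface, and both flush method names are hypothetical stand-ins, and only the field name rsGroupMap is borrowed from the report.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class FlushOrderSketch {
    /** Hypothetical stand-in for the write to hbase:rsgroup. */
    interface GroupTableWriter {
        void write(Map<String, String> groups) throws IOException;
    }

    // In-memory view, named after the rsGroupMap mentioned in the report.
    private Map<String, String> rsGroupMap = new HashMap<>();
    private final GroupTableWriter writer;

    FlushOrderSketch(GroupTableWriter writer) {
        this.writer = writer;
    }

    /** Order described in the report: table write first, memory second. */
    void flushTableFirst(Map<String, String> newGroups) throws IOException {
        writer.write(newGroups);               // throws while the table is offline
        rsGroupMap = new HashMap<>(newGroups); // skipped when the write throws
    }

    /** Order proposed in the report: memory first, table write second. */
    void flushMemoryFirst(Map<String, String> newGroups) throws IOException {
        rsGroupMap = new HashMap<>(newGroups); // applied even if the write fails
        writer.write(newGroups);
    }

    /** Defensive copy of the in-memory state. */
    Map<String, String> snapshot() {
        return new HashMap<>(rsGroupMap);
    }

    public static void main(String[] args) {
        // A writer that always fails, simulating an offline table.
        GroupTableWriter offline = groups -> {
            throw new IOException("hbase:rsgroup is offline");
        };
        Map<String, String> update = new HashMap<>();
        update.put("table1", "groupA");

        FlushOrderSketch tableFirst = new FlushOrderSketch(offline);
        try {
            tableFirst.flushTableFirst(update);
        } catch (IOException e) {
            System.out.println("table-first in-memory state: " + tableFirst.snapshot());
        }

        FlushOrderSketch memoryFirst = new FlushOrderSketch(offline);
        try {
            memoryFirst.flushMemoryFirst(update);
        } catch (IOException e) {
            System.out.println("memory-first in-memory state: " + memoryFirst.snapshot());
        }
    }
}
```

With a writer that always throws, the table-first variant leaves the in-memory map empty, while the memory-first variant keeps the update. Memory-first also means the in-memory state and the table can diverge until a later write succeeds, so an actual fix would still need to decide how to reconcile the two.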
[jira] [Created] (HBASE-25730) HBase RegionServer Canary reports failure if an online regionserver is hosting 0 regions
Caroline created HBASE-25730: Summary: HBase RegionServer Canary reports failure if an online regionserver is hosting 0 regions Key: HBASE-25730 URL: https://issues.apache.org/jira/browse/HBASE-25730 Project: HBase Issue Type: Bug Components: canary Reporter: Caroline Assignee: Caroline An online regionserver hosting 0 regions becomes more likely when a system rsgroup is enabled (i.e. an rsgroup dedicated to system tables, which have few regions) and table-level load balancing is in use. As long as the server is alive and able to accept and serve regions, we should not treat it as an error if it is currently serving no regions.
[jira] [Created] (HBASE-25469) Create RIT servlet in HMaster to track more detailed RIT info not captured in metrics
Caroline created HBASE-25469: Summary: Create RIT servlet in HMaster to track more detailed RIT info not captured in metrics Key: HBASE-25469 URL: https://issues.apache.org/jira/browse/HBASE-25469 Project: HBase Issue Type: Improvement Reporter: Caroline
[jira] [Created] (HBASE-25329) Dump region hashes in logs for the regions that are stuck in transition for more than a configured amount of time
Caroline created HBASE-25329: Summary: Dump region hashes in logs for the regions that are stuck in transition for more than a configured amount of time Key: HBASE-25329 URL: https://issues.apache.org/jira/browse/HBASE-25329 Project: HBase Issue Type: Improvement Reporter: Caroline We have metrics for the number of RITs as well as the number of RITs above a certain threshold, but we don't have any way of keeping track of the region hashes of those RITs. It would be beneficial to emit those region hashes as a metric, as well as log them, so that we don't accidentally lose this information for debugging the RIT at a later time.
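A rough Java sketch of the selection step described above, under stated assumptions: the class StuckRitSketch, the method stuckRegionHashes, and the map of region hash to transition start time are all hypothetical and are not taken from HMaster code. It only shows filtering regions in transition by a configured threshold so their hashes can be logged.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StuckRitSketch {
    /**
     * Returns the hashes of regions that have been in transition for longer
     * than thresholdMs. Keys of ritStartMs are region hashes; values are the
     * times (in ms) at which each transition started.
     */
    static List<String> stuckRegionHashes(Map<String, Long> ritStartMs,
                                          long nowMs, long thresholdMs) {
        List<String> stuck = new ArrayList<>();
        for (Map.Entry<String, Long> entry : ritStartMs.entrySet()) {
            if (nowMs - entry.getValue() > thresholdMs) {
                stuck.add(entry.getKey());
            }
        }
        return stuck;
    }

    public static void main(String[] args) {
        // Made-up region hashes and timestamps, for illustration only.
        Map<String, Long> rit = new LinkedHashMap<>();
        rit.put("regionHashA", 0L);      // in transition for 60 s
        rit.put("regionHashB", 50_000L); // in transition for 10 s
        List<String> stuck = stuckRegionHashes(rit, 60_000L, 30_000L);
        System.out.println("Regions in transition above threshold: " + stuck);
    }
}
```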
[jira] [Created] (HBASE-23172) HBase Canary region success count metrics reflect column family successes, not region successes
Caroline created HBASE-23172: Summary: HBase Canary region success count metrics reflect column family successes, not region successes Key: HBASE-23172 URL: https://issues.apache.org/jira/browse/HBASE-23172 Project: HBase Issue Type: Improvement Components: canary Affects Versions: 2.2.1, 2.1.5, 2.0.0, 1.5.0, 1.4.0, 1.3.0, 3.0.0 Reporter: Caroline Assignee: Caroline HBase Canary reads once per column family per region. The current "region success count" should actually be "column family success count," which means we need another metric that actually reflects region success count. Additionally, the region read and write latencies store only the latency of the last column family read for each region. Instead of mapping each region to a single latency value and success value, we should map each region to a list of such values.
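The proposed data structure could look roughly like the following Java sketch. The names RegionReadResults and ReadResult and all method names are hypothetical, not Canary's own types. The sketch also assumes that a region counts as successful only when every one of its column family reads succeeded; the report does not define this.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RegionReadResults {
    /** One read of one column family; hypothetical type. */
    static final class ReadResult {
        final String columnFamily;
        final long latencyMs;
        final boolean success;

        ReadResult(String columnFamily, long latencyMs, boolean success) {
            this.columnFamily = columnFamily;
            this.latencyMs = latencyMs;
            this.success = success;
        }
    }

    // Each region maps to a list of results, one per column family read.
    private final Map<String, List<ReadResult>> resultsByRegion = new LinkedHashMap<>();

    void record(String region, String columnFamily, long latencyMs, boolean success) {
        resultsByRegion.computeIfAbsent(region, r -> new ArrayList<>())
                .add(new ReadResult(columnFamily, latencyMs, success));
    }

    /** Counts successful reads across all column families of all regions. */
    long columnFamilySuccessCount() {
        long count = 0;
        for (List<ReadResult> reads : resultsByRegion.values()) {
            for (ReadResult read : reads) {
                if (read.success) {
                    count++;
                }
            }
        }
        return count;
    }

    /** Counts regions whose reads all succeeded (assumed definition). */
    long regionSuccessCount() {
        long count = 0;
        for (List<ReadResult> reads : resultsByRegion.values()) {
            boolean allOk = true;
            for (ReadResult read : reads) {
                allOk &= read.success;
            }
            if (allOk) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        RegionReadResults results = new RegionReadResults();
        results.record("region1", "cf1", 12, true);
        results.record("region1", "cf2", 15, true);
        results.record("region2", "cf1", 9, true);
        results.record("region2", "cf2", 40, false);
        System.out.println("column family successes: " + results.columnFamilySuccessCount());
        System.out.println("region successes: " + results.regionSuccessCount());
    }
}
```

In this example the two counts differ (3 column family successes, 1 region success), which is the discrepancy the report describes.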
[jira] [Created] (HBASE-22804) Provide an API to get list of successful regions and total expected regions in Canary
Caroline created HBASE-22804: Summary: Provide an API to get list of successful regions and total expected regions in Canary Key: HBASE-22804 URL: https://issues.apache.org/jira/browse/HBASE-22804 Project: HBase Issue Type: Improvement Components: canary Affects Versions: 2.1.5, 2.0.0, 1.4.0, 1.3.0, 3.0.0, 1.5.0, 2.2.1 Reporter: Caroline Assignee: Caroline At present, the HBase Canary tool only prints the successes as part of its logs. Providing an API to get the list of successes, as well as the total number of expected regions, will make it easier to get a more accurate availability estimate.
[jira] [Created] (HBASE-22378) HBase Canary fails with TableNotFoundException when table deleted during Canary run
Caroline created HBASE-22378: Summary: HBase Canary fails with TableNotFoundException when table deleted during Canary run Key: HBASE-22378 URL: https://issues.apache.org/jira/browse/HBASE-22378 Project: HBase Issue Type: Bug Components: canary Affects Versions: 1.4.0, 1.3.0, 1.5.0 Reporter: Caroline In 1.3.2 branch-1, we saw a drastic increase in TableNotFoundExceptions thrown by HBase Canary. We traced the issue back to Canary trying to call isTableEnabled() on temporary tables that were deleted in the middle of the Canary run. In this version of HBase Canary, Canary throws TableNotFoundException (and then fails) if a table is deleted between the admin.listTables() and admin.tableExists() calls in RegionMonitor's sniff() method. The goal of sniff() is to query all existing tables, so to reduce noise we should skip a table (i.e. not check whether it is enabled, or do anything else with it) if it was returned by listTables() but deleted before Canary could query it. Temporary tables that are not meant to be kept should not cause TableNotFoundExceptions that fail the Canary.
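The skip-on-delete behavior proposed above can be sketched in self-contained Java. This is not RegionMonitor's code: AdminLike, TableGoneException, tablesToSniff, and fakeAdmin are hypothetical stand-ins for the HBase admin client and TableNotFoundException, used only to show the control flow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SniffSkipSketch {
    /** Hypothetical stand-in for TableNotFoundException. */
    static class TableGoneException extends Exception {
        TableGoneException(String table) {
            super("table not found: " + table);
        }
    }

    /** Hypothetical stand-in for the admin calls named in the report. */
    interface AdminLike {
        List<String> listTables();
        boolean isTableEnabled(String table) throws TableGoneException;
    }

    /** Returns the enabled tables, skipping any deleted after the listing. */
    static List<String> tablesToSniff(AdminLike admin) {
        List<String> result = new ArrayList<>();
        for (String table : admin.listTables()) {
            try {
                if (admin.isTableEnabled(table)) {
                    result.add(table);
                }
            } catch (TableGoneException e) {
                // Deleted between the listing and this check: skip it
                // instead of failing the whole run.
            }
        }
        return result;
    }

    /** Fake admin whose listing still contains since-deleted tables. */
    static AdminLike fakeAdmin(List<String> listed, Set<String> deleted) {
        return new AdminLike() {
            @Override
            public List<String> listTables() {
                return listed;
            }

            @Override
            public boolean isTableEnabled(String table) throws TableGoneException {
                if (deleted.contains(table)) {
                    throw new TableGoneException(table);
                }
                return true;
            }
        };
    }

    public static void main(String[] args) {
        AdminLike admin = fakeAdmin(
                Arrays.asList("users", "tmp_job_1", "orders"),
                new HashSet<>(Arrays.asList("tmp_job_1")));
        System.out.println("tables to sniff: " + tablesToSniff(admin));
    }
}
```

Catching the exception per table, rather than around the whole loop, is what keeps one vanished temporary table from aborting the checks on the remaining tables.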