[jira] [Created] (HBASE-25815) RSGroupBasedLoadBalancer online status never updates after being set to true for the first time

2021-04-26 Thread Caroline (Jira)
Caroline created HBASE-25815:


 Summary: RSGroupBasedLoadBalancer online status never updates 
after being set to true for the first time
 Key: HBASE-25815
 URL: https://issues.apache.org/jira/browse/HBASE-25815
 Project: HBase
  Issue Type: Bug
Reporter: Caroline


Once the RSGroupBasedLoadBalancer is “online” (it has found the hbase:meta and 
hbase:rsgroup tables), it never updates that status again. That means if 
hbase:meta or hbase:rsgroup ever goes offline, the balancer doesn’t update its 
status to “offline,” so some operations still take the “online” 
code path even though the catalog tables aren’t available to be read from or 
written to (in particular, anything that calls 
RSGroupInfoManagerImpl#flushConfig).
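
To illustrate the difference between a status that latches and one that is re-evaluated, here is a minimal sketch. It is not the RSGroupBasedLoadBalancer code: the class names and the BooleanSupplier standing in for "hbase:meta and hbase:rsgroup are available" are assumptions made only for this example.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical illustration only; not the RSGroupBasedLoadBalancer code. */
class OnlineStatusSketch {

  /** Latches to true and never goes back, mirroring the reported behavior. */
  static final class LatchedStatus {
    // Assumed stand-in for "the catalog tables can be read and written".
    private final BooleanSupplier catalogTablesAvailable;
    private volatile boolean online;

    LatchedStatus(BooleanSupplier catalogTablesAvailable) {
      this.catalogTablesAvailable = catalogTablesAvailable;
    }

    boolean isOnline() {
      if (!online) {
        // Availability is only consulted until the first success.
        online = catalogTablesAvailable.getAsBoolean();
      }
      return online;
    }
  }

  /** Re-evaluates availability on every call, so it can return to offline. */
  static final class RefreshedStatus {
    private final BooleanSupplier catalogTablesAvailable;

    RefreshedStatus(BooleanSupplier catalogTablesAvailable) {
      this.catalogTablesAvailable = catalogTablesAvailable;
    }

    boolean isOnline() {
      return catalogTablesAvailable.getAsBoolean();
    }
  }
}
```

With an availability source that flips from true to false, LatchedStatus keeps reporting online while RefreshedStatus reports offline. How often a real balancer could afford to re-check is left open here.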

Also, in the RSGroupInfoManagerImpl#flushConfig code path, the write to 
hbase:rsgroup comes before the update to the in-memory rsGroupMap and tableMap 
(see the order of [these lines of 
code|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/rsgroup/RSGroupInfoManagerImpl.java#L664-L670]),
 so if hbase:rsgroup goes offline after the RSGroupBasedLoadBalancer is already 
marked as “online,” exceptions thrown while trying to write to the offline 
hbase:rsgroup table prevent the in-memory rsGroupMap and tableMap from being 
updated. The order should be reversed: in-memory state should be 
updated first.
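
The ordering problem can be sketched as follows. This is not the RSGroupInfoManagerImpl code: GroupTableWriter is a hypothetical stand-in for the write to hbase:rsgroup, and a single map stands in for rsGroupMap and tableMap.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not the RSGroupInfoManagerImpl code. */
class FlushOrderSketch {

  /** Assumed stand-in for the write to hbase:rsgroup. */
  interface GroupTableWriter {
    void write(Map<String, String> groups) throws IOException;
  }

  // Stands in for the in-memory rsGroupMap and tableMap.
  private final Map<String, String> inMemoryGroups = new HashMap<>();
  private final GroupTableWriter writer;

  FlushOrderSketch(GroupTableWriter writer) {
    this.writer = writer;
  }

  /** Order described in the report: a failed write leaves memory stale. */
  void flushTableFirst(Map<String, String> newGroups) throws IOException {
    writer.write(newGroups);
    replaceInMemory(newGroups);
  }

  /** Suggested order: memory is updated even if the write then fails. */
  void flushMemoryFirst(Map<String, String> newGroups) throws IOException {
    replaceInMemory(newGroups);
    writer.write(newGroups);
  }

  private void replaceInMemory(Map<String, String> newGroups) {
    inMemoryGroups.clear();
    inMemoryGroups.putAll(newGroups);
  }

  /** Returns a copy of the current in-memory state. */
  Map<String, String> snapshot() {
    return new HashMap<>(inMemoryGroups);
  }
}
```

With a writer that always throws, flushTableFirst leaves the snapshot empty, while flushMemoryFirst leaves it holding the new groups. Whether a failed write should then be retried or rolled back is not covered by this sketch.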



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25730) HBase RegionServer Canary reports failure if an online regionserver is hosting 0 regions

2021-04-02 Thread Caroline (Jira)
Caroline created HBASE-25730:


 Summary: HBase RegionServer Canary reports failure if an online 
regionserver is hosting 0 regions
 Key: HBASE-25730
 URL: https://issues.apache.org/jira/browse/HBASE-25730
 Project: HBase
  Issue Type: Bug
  Components: canary
Reporter: Caroline
Assignee: Caroline


This is more likely to happen when a system rsgroup is enabled (i.e. 
an rsgroup dedicated to system tables, which have few regions) together with 
table-level load balancing. As long as the server is alive and able to 
accept/serve regions, we should not consider it an error case if it is 
currently serving no regions.
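
A minimal sketch of the intended behavior, assuming the input map already contains only online regionservers. The class, enum, and method names are hypothetical and are not taken from the Canary code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the Canary code. */
class ZeroRegionServerSketch {

  enum Outcome {
    OK_NO_REGIONS, // online but hosting nothing: not an error
    NEEDS_PROBE    // hosting regions: the canary should read from one
  }

  /**
   * Classifies each online regionserver by the regions it hosts.
   * Assumption: every key in the map is a server known to be online.
   */
  static Map<String, Outcome> classify(Map<String, List<String>> regionsByServer) {
    Map<String, Outcome> result = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> entry : regionsByServer.entrySet()) {
      Outcome outcome =
          entry.getValue().isEmpty() ? Outcome.OK_NO_REGIONS : Outcome.NEEDS_PROBE;
      result.put(entry.getKey(), outcome);
    }
    return result;
  }
}
```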





[jira] [Created] (HBASE-25469) Create RIT servlet in HMaster to track more detailed RIT info not captured in metrics

2021-01-06 Thread Caroline (Jira)
Caroline created HBASE-25469:


 Summary: Create RIT servlet in HMaster to track more detailed RIT 
info not captured in metrics
 Key: HBASE-25469
 URL: https://issues.apache.org/jira/browse/HBASE-25469
 Project: HBase
  Issue Type: Improvement
Reporter: Caroline








[jira] [Created] (HBASE-25329) Dump region hashes in logs for the regions that are stuck in transition for more than a configured amount of time

2020-11-24 Thread Caroline (Jira)
Caroline created HBASE-25329:


 Summary: Dump region hashes in logs for the regions that are stuck 
in transition for more than a configured amount of time
 Key: HBASE-25329
 URL: https://issues.apache.org/jira/browse/HBASE-25329
 Project: HBase
  Issue Type: Improvement
Reporter: Caroline


We have metrics for the number of RITs as well as the number of RITs above a certain 
threshold, but we don't have any way of keeping track of the region hashes of 
those RITs. It would be beneficial to emit those region hashes as a metric, as 
well as log them, so that we don't lose this information when 
debugging the RIT at a later time.
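
As a rough sketch of the idea, the following collects and logs the hashes of regions that have been in transition longer than a threshold. It is not HMaster code: the RegionInTransition class, its fields, and the use of java.util.logging are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical illustration only; not the HMaster code. */
class StuckRitSketch {
  private static final Logger LOG = Logger.getLogger(StuckRitSketch.class.getName());

  /** Assumed minimal view of a region in transition. */
  static final class RegionInTransition {
    final String regionHash;
    final long startMillis;

    RegionInTransition(String regionHash, long startMillis) {
      this.regionHash = regionHash;
      this.startMillis = startMillis;
    }
  }

  /** Returns hashes of regions in transition for longer than thresholdMillis. */
  static List<String> stuckRegionHashes(
      List<RegionInTransition> rits, long nowMillis, long thresholdMillis) {
    List<String> stuck = new ArrayList<>();
    for (RegionInTransition rit : rits) {
      if (nowMillis - rit.startMillis > thresholdMillis) {
        stuck.add(rit.regionHash);
      }
    }
    return stuck;
  }

  /** Logs the stuck hashes so they remain available for later debugging. */
  static void logStuck(List<String> stuckHashes, long thresholdMillis) {
    if (!stuckHashes.isEmpty()) {
      LOG.warning("Regions in transition for more than " + thresholdMillis
          + " ms: " + stuckHashes);
    }
  }
}
```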





[jira] [Created] (HBASE-23172) HBase Canary region success count metrics reflect column family successes, not region successes

2019-10-14 Thread Caroline (Jira)
Caroline created HBASE-23172:


 Summary: HBase Canary region success count metrics reflect column 
family successes, not region successes
 Key: HBASE-23172
 URL: https://issues.apache.org/jira/browse/HBASE-23172
 Project: HBase
  Issue Type: Improvement
  Components: canary
Affects Versions: 2.2.1, 2.1.5, 2.0.0, 1.5.0, 1.4.0, 1.3.0, 3.0.0
Reporter: Caroline
Assignee: Caroline


HBase Canary reads once per column family per region. The current "region 
success count" should actually be "column family success count," which means we 
need another metric that actually reflects region success count. Additionally, 
the region read and write latencies only store the latencies of the last column 
family of the region read. Instead of a map of regions to a single latency 
value and success value, we should map each region to a list of such values.
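
The proposed data structure could look roughly like the sketch below. It is not the Canary code, and all names are hypothetical. It also assumes that a region counts as successful only when every column family read for it succeeded, which the description above does not spell out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the Canary code. */
class RegionResultSketch {

  /** One read of one column family of one region. */
  static final class FamilyRead {
    final String family;
    final boolean success;
    final long latencyMillis;

    FamilyRead(String family, boolean success, long latencyMillis) {
      this.family = family;
      this.success = success;
      this.latencyMillis = latencyMillis;
    }
  }

  // Each region maps to a list of results, one per column family read.
  private final Map<String, List<FamilyRead>> readsByRegion = new HashMap<>();

  void record(String region, FamilyRead read) {
    readsByRegion.computeIfAbsent(region, k -> new ArrayList<>()).add(read);
  }

  /** What the current "region success count" effectively measures. */
  long familySuccessCount() {
    return readsByRegion.values().stream()
        .flatMap(List::stream)
        .filter(read -> read.success)
        .count();
  }

  /** Assumption: a region succeeds only if all of its family reads succeed. */
  long regionSuccessCount() {
    return readsByRegion.values().stream()
        .filter(reads -> reads.stream().allMatch(read -> read.success))
        .count();
  }

  /** All recorded reads for a region, not just the last column family. */
  List<FamilyRead> readsFor(String region) {
    return readsByRegion.getOrDefault(region, Collections.emptyList());
  }
}
```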





[jira] [Created] (HBASE-22804) Provide an API to get list of successful regions and total expected regions in Canary

2019-08-06 Thread Caroline (JIRA)
Caroline created HBASE-22804:


 Summary: Provide an API to get list of successful regions and 
total expected regions in Canary
 Key: HBASE-22804
 URL: https://issues.apache.org/jira/browse/HBASE-22804
 Project: HBase
  Issue Type: Improvement
  Components: canary
Affects Versions: 2.1.5, 2.0.0, 1.4.0, 1.3.0, 3.0.0, 1.5.0, 2.2.1
Reporter: Caroline
Assignee: Caroline


At present, the HBase Canary tool only prints the successes as part of its logs. 
Providing an API to get the list of successes, as well as the total number of 
expected regions, will make it easier to get a more accurate availability 
estimate.
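
For illustration, a caller with access to both values could compute an availability estimate along these lines. The method name and signature are hypothetical and do not describe an existing Canary API.

```java
import java.util.Set;

/** Hypothetical illustration only; not an existing Canary API. */
class CanaryAvailabilitySketch {

  /**
   * Fraction of expected regions that were read successfully.
   * Both inputs are assumed to come from the proposed API.
   */
  static double availability(Set<String> successfulRegions, int totalExpectedRegions) {
    if (totalExpectedRegions <= 0) {
      throw new IllegalArgumentException("totalExpectedRegions must be positive");
    }
    return (double) successfulRegions.size() / totalExpectedRegions;
  }
}
```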
 





[jira] [Created] (HBASE-22378) HBase Canary fails with TableNotFoundException when table deleted during Canary run

2019-05-07 Thread Caroline (JIRA)
Caroline created HBASE-22378:


 Summary: HBase Canary fails with TableNotFoundException when table 
deleted during Canary run
 Key: HBASE-22378
 URL: https://issues.apache.org/jira/browse/HBASE-22378
 Project: HBase
  Issue Type: Bug
  Components: canary
Affects Versions: 1.4.0, 1.3.0, 1.5.0
Reporter: Caroline


In 1.3.2 branch-1, we saw a drastic increase in TableNotFoundExceptions thrown 
by HBase Canary. We traced the issue back to Canary trying to call 
isTableEnabled() on temporary tables that were deleted in the middle of the 
Canary run.

In this version of HBase Canary, Canary throws TableNotFoundException (and then 
fails) if a table is deleted between the admin.listTables() and admin.tableExists() 
calls in RegionMonitor's sniff() method. The goal of sniff() is to query all 
existing tables, so to reduce noise we should skip a table (i.e. not check whether 
it is enabled, or do anything else with it) if it was returned by listTables() but 
deleted before Canary could query it. Deleting temporary tables that were never 
meant to be kept should not produce TableNotFoundExceptions that fail the Canary.
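
The skip-on-delete behavior can be sketched as below. To stay self-contained, the example does not use the HBase client: TableAdmin and TableMissingException are hypothetical stand-ins for the Admin calls and for TableNotFoundException, and tablesToSniff is not a method of RegionMonitor.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the RegionMonitor code. */
class SkipDeletedTableSketch {

  /** Local stand-in for HBase's TableNotFoundException. */
  static final class TableMissingException extends Exception {
    TableMissingException(String table) {
      super("Table not found: " + table);
    }
  }

  /** Assumed minimal view of the admin calls used by sniff(). */
  interface TableAdmin {
    List<String> listTables();

    boolean isTableEnabled(String table) throws TableMissingException;
  }

  /** Returns the enabled tables, skipping any deleted after listTables(). */
  static List<String> tablesToSniff(TableAdmin admin) {
    List<String> enabled = new ArrayList<>();
    for (String table : admin.listTables()) {
      try {
        if (admin.isTableEnabled(table)) {
          enabled.add(table);
        }
      } catch (TableMissingException e) {
        // The table was listed but deleted before it could be queried:
        // skip it instead of failing the whole run.
      }
    }
    return enabled;
  }
}
```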


