Tushar Ahuja created HBASE-29409:
------------------------------------

             Summary: Server level meta cache clearing frequently on IO exceptions
                 Key: HBASE-29409
                 URL: https://issues.apache.org/jira/browse/HBASE-29409
             Project: HBase
          Issue Type: Improvement
    Affects Versions: 2.3.0
            Reporter: Tushar Ahuja

HBase client version: 2.3.0
HBase version: 2.1.7
Java version: 8
HBase client repo: [https://github.com/apache/hbase/tree/master/hbase-client]
Tag used: rel/2.3.0

From my application, I'm making two types of HBase calls:
1. Single get
2. Bulk gets

Coming to bulk gets first:
{code:java}
public Result[] get(List<Get> gets) throws IOException
{code}

Intermittently, I saw latency spikes in my metrics. On enabling the metrics flag (hbase.client.metrics.enable), I noticed a high count for this metric:
{noformat}
MetricsConnection_metaCacheNumClearServer
{noformat}

Upon enabling trace logs for the MetaCache class (
{noformat}
org/apache/hadoop/hbase/client/MetaCache.java
{noformat}
), I noticed a pattern: when a CallTimeoutException occurs during a bulk get call to HBase, the region cache for the entire server is cleared.
{code:java}
Caused by: java.lang.RuntimeException: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: CallTimeoutException: 1 time, servers with issues: datanode2-az-prod-ci,16020,1747212636218
23-06-2025 12:11:53.123 [pool-7-thread-13] TRACE o.a.hadoop.hbase.client.MetaCache - Removed all cached region locations that map to datanode2-az-prod-ci,16020,1747212636218
{code}

Shortly after, the meta cache for the evicted regions is repopulated as requests come in. But in the time window between clearing and repopulating the cache, I notice an increased number of timeouts in my application.

Upon looking at the HBase client code, I noticed these code blocks:
{code:java}
private void cleanServerCache(ServerName server, Throwable regionException) {
  if (ClientExceptionsUtil.isMetaClearingException(regionException)) {
    // We want to make sure to clear the cache in case there were location-related exceptions.
    // We don't to clear the cache for every possible exception that comes through, however.
    asyncProcess.connection.clearCaches(server);
  }
}
{code}
{code:java}
public static boolean isMetaClearingException(Throwable cur) {
  cur = findException(cur);
  if (cur == null) {
    return true;
  }
  return !isSpecialException(cur) || (cur instanceof RegionMovedException)
    || cur instanceof NotServingRegionException;
}
{code}
{code:java}
public static boolean isSpecialException(Throwable cur) {
  return (cur instanceof RegionMovedException || cur instanceof RegionOpeningException
    || cur instanceof RegionTooBusyException || cur instanceof RpcThrottlingException
    || cur instanceof MultiActionResultTooLarge || cur instanceof RetryImmediatelyException
    || cur instanceof CallQueueTooBigException || cur instanceof CallDroppedException
    || cur instanceof NotServingRegionException || cur instanceof RequestTooBigException);
}
{code}

Since CallTimeoutException is not treated as a special exception, isMetaClearingException returns true for it and the cache for the whole server is cleared. This leads to cache misses for the row keys and to timeouts in my application until the cache is repopulated.

I have a couple of questions here:
 * Since intermittent network issues / timeouts are expected, why is the cache for the complete server cleared in this case? Is this a bug or a deliberate design choice?
 * I can also see some other tickets regarding MetaCache issues: HBASE-28941, HBASE-27531, HBASE-27521.
 * What can I do to fix this issue?
 * Will upgrading the client help me fix this in any way? A client upgrade would be relatively simpler for me than a complete HBase version upgrade. I'm using Java 8, so I would need a client compatible with HBase 2.1.7 and Java 8.

Similarly, for single get calls (not bulk), I see logs for region-level meta cache clearing. The volume is very low, so that is not a cause for immediate concern, but I assume similar reasoning should hold there as well.
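
To make the classification described above easier to reason about without an HBase dependency, here is a minimal, self-contained sketch of the same decision logic. The exception classes are hypothetical stand-ins that only share names with the HBase types, the special-exception list is a subset of the one quoted above, and findException is a simplified assumption (it just walks the cause chain); the real client code may differ in details.

```java
import java.io.IOException;

public class MetaClearingSketch {
    // Hypothetical stand-ins: they only mirror the names of the HBase exception types.
    public static class RegionMovedException extends IOException {}
    public static class NotServingRegionException extends IOException {}
    public static class RegionTooBusyException extends IOException {}
    public static class CallQueueTooBigException extends IOException {}
    public static class CallTimeoutException extends IOException {}

    // Subset of the isSpecialException list quoted in this report.
    public static boolean isSpecialException(Throwable cur) {
        return cur instanceof RegionMovedException
            || cur instanceof NotServingRegionException
            || cur instanceof RegionTooBusyException
            || cur instanceof CallQueueTooBigException;
    }

    // Simplified assumption: return the first special exception in the cause
    // chain, or the root cause if none is special. Returns null for null input.
    public static Throwable findException(Throwable t) {
        Throwable cur = t;
        while (cur != null) {
            if (isSpecialException(cur) || cur.getCause() == null) {
                return cur;
            }
            cur = cur.getCause();
        }
        return null;
    }

    // Same boolean expression as the isMetaClearingException snippet quoted above.
    public static boolean isMetaClearingException(Throwable t) {
        Throwable cur = findException(t);
        if (cur == null) {
            return true;
        }
        return !isSpecialException(cur)
            || cur instanceof RegionMovedException
            || cur instanceof NotServingRegionException;
    }

    public static void main(String[] args) {
        Throwable wrappedTimeout = new RuntimeException(new CallTimeoutException());
        System.out.println("CallTimeoutException clears server cache: "
            + isMetaClearingException(new CallTimeoutException()));
        System.out.println("Wrapped CallTimeoutException clears server cache: "
            + isMetaClearingException(wrappedTimeout));
        System.out.println("RegionTooBusyException clears server cache: "
            + isMetaClearingException(new RegionTooBusyException()));
        System.out.println("NotServingRegionException clears server cache: "
            + isMetaClearingException(new NotServingRegionException()));
    }
}
```

Under these assumptions, the timeout (bare or wrapped) is classified as meta-clearing because it is not in the special list, while RegionTooBusyException is not, which matches the trace log shown earlier in this report.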