Tushar Ahuja created HBASE-29409:
------------------------------------

             Summary: Server level meta cache clearing frequently on IO 
exceptions
                 Key: HBASE-29409
                 URL: https://issues.apache.org/jira/browse/HBASE-29409
             Project: HBase
          Issue Type: Improvement
    Affects Versions: 2.3.0
            Reporter: Tushar Ahuja


HBase client version: 2.3.0

HBase server version: 2.1.7

Java version: 8

HBase client repo: [https://github.com/apache/hbase/tree/master/hbase-client]

Tag used: rel/2.3.0

 

From my application, I'm making two types of HBase calls:

1. Single get

2. Bulk gets

 

Coming to bulk gets first:

 
{code:java}
public Result[] get(List<Get> gets) throws IOException {code}
 

 

Intermittently, I see latency spikes in my metrics. After enabling the client 
metrics flag (hbase.client.metrics.enable), I noticed a high count for this 
metric:
{noformat}
MetricsConnection_metaCacheNumClearServer{noformat}
 
After enabling trace logging for the MetaCache class
{noformat}
org/apache/hadoop/hbase/client/MetaCache.java{noformat}
I noticed a pattern:

 

 

When a bulk get call to HBase fails with a CallTimeoutException, the region 
cache for the entire server is cleared:

 
{code:java}
Caused by: java.lang.RuntimeException: 
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 
action: CallTimeoutException: 1 time, servers with issues: 
datanode2-az-prod-ci,16020,1747212636218

23-06-2025 12:11:53.123   [pool-7-thread-13] TRACE 
o.a.hadoop.hbase.client.MetaCache - Removed all cached region locations that 
map to datanode2-az-prod-ci,16020,1747212636218{code}
 

Shortly after, the meta cache for the evicted regions is repopulated as 
requests come in. But in the time window between clearing and repopulating the 
cache, I see an increased number of timeouts in my application.
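
To illustrate why a server-level clear is so much more disruptive than a region-level one, here is a toy model. It is only a sketch: the class ServerCacheClearSketch and its methods are hypothetical and do not reflect how MetaCache is actually implemented. It just counts how many cached locations are lost in each case.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: a hypothetical stand-in for the client's region location cache.
class ServerCacheClearSketch {
    // region name -> server name
    private final Map<String, String> locations = new HashMap<>();

    void cache(String region, String server) {
        locations.put(region, server);
    }

    // Server-level clear: drops every region mapped to the server.
    // Returns the number of evicted entries.
    int clearServer(String server) {
        int before = locations.size();
        locations.values().removeIf(server::equals);
        return before - locations.size();
    }

    // Region-level clear: drops a single entry. Returns true if it was cached.
    boolean clearRegion(String region) {
        return locations.remove(region) != null;
    }

    int size() {
        return locations.size();
    }

    public static void main(String[] args) {
        ServerCacheClearSketch cache = new ServerCacheClearSketch();
        for (int i = 0; i < 100; i++) {
            cache.cache("region-a-" + i, "rs1");
            cache.cache("region-b-" + i, "rs2");
        }
        // One timed-out call against rs1 evicts all 100 of its regions.
        System.out.println("evicted by server clear: " + cache.clearServer("rs1"));
        // A region-level clear on rs2 evicts exactly one entry.
        cache.clearRegion("region-b-0");
        System.out.println("entries left: " + cache.size());
    }
}
```

In this model, every evicted region needs a fresh location lookup on its next request, which is consistent with the extra latency I see in the window after a clear.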

 

Looking at the HBase client code, I found these code blocks:
 
{code:java}
private void cleanServerCache(ServerName server, Throwable regionException) {
  if (ClientExceptionsUtil.isMetaClearingException(regionException)) {
    // We want to make sure to clear the cache in case there were location-related exceptions.
    // We don't to clear the cache for every possible exception that comes through, however.
    asyncProcess.connection.clearCaches(server);
  }
}{code}
{code:java}
public static boolean isMetaClearingException(Throwable cur) {
  cur = findException(cur);

  if (cur == null) {
    return true;
  }
  return !isSpecialException(cur) || (cur instanceof RegionMovedException)
    || cur instanceof NotServingRegionException;
}{code}
{code:java}
public static boolean isSpecialException(Throwable cur) {
  return (cur instanceof RegionMovedException
    || cur instanceof RegionOpeningException
    || cur instanceof RegionTooBusyException
    || cur instanceof RpcThrottlingException
    || cur instanceof MultiActionResultTooLarge
    || cur instanceof RetryImmediatelyException
    || cur instanceof CallQueueTooBigException
    || cur instanceof CallDroppedException
    || cur instanceof NotServingRegionException
    || cur instanceof RequestTooBigException);
}{code}
 

Since CallTimeoutException is not treated as a special exception, 
isMetaClearingException returns true for it and the cache for the whole server 
is cleared. This leads to cache misses for the row keys and to timeouts in my 
application until the cache is repopulated.
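
The classification quoted above can be checked in isolation. The following is a minimal sketch, not the HBase code: the exception classes are hypothetical stand-ins with the same names, only a few of the special exception types are modelled, and the findException unwrapping step is left out. It only reproduces the boolean logic.

```java
import java.io.IOException;

// Stand-in model of the quoted logic; these are NOT the real HBase classes.
class MetaClearingSketch {
    // Hypothetical stand-ins for the HBase exception types of the same name.
    static class RegionMovedException extends IOException {}
    static class NotServingRegionException extends IOException {}
    static class RegionTooBusyException extends IOException {}
    static class CallQueueTooBigException extends IOException {}
    static class CallTimeoutException extends IOException {}

    // Abbreviated isSpecialException: only a subset of the listed types.
    static boolean isSpecialException(Throwable cur) {
        return cur instanceof RegionMovedException
            || cur instanceof RegionTooBusyException
            || cur instanceof CallQueueTooBigException
            || cur instanceof NotServingRegionException;
    }

    // Same boolean structure as the quoted isMetaClearingException,
    // without the findException unwrapping.
    static boolean isMetaClearingException(Throwable cur) {
        if (cur == null) {
            return true;
        }
        return !isSpecialException(cur)
            || cur instanceof RegionMovedException
            || cur instanceof NotServingRegionException;
    }

    public static void main(String[] args) {
        // Not in the special list, so it is treated as meta-clearing.
        System.out.println(isMetaClearingException(new CallTimeoutException()));      // true
        // Special and not location-related, so the cache is kept.
        System.out.println(isMetaClearingException(new RegionTooBusyException()));    // false
        // Special but location-related, so the cache is cleared.
        System.out.println(isMetaClearingException(new NotServingRegionException())); // true
    }
}
```

In other words, under this logic any exception type that is not explicitly listed as special, including a plain timeout, falls into the cache-clearing branch.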

 

I have a couple of questions here:
 * Since intermittent network issues / timeouts are expected, why is the cache 
for the complete server cleared in this case? Is this a bug or a deliberate 
design choice?
 * I can also see some other tickets regarding MetaCache issues: HBASE-28941, 
HBASE-27531, HBASE-27521.
 * What can I do to fix this issue?
 * Will upgrading the client help me fix this in any way? A client upgrade would 
be relatively simpler for me than a complete HBase version upgrade. I'm using 
Java 8, so I would need a client compatible with HBase 2.1.7 and Java 8.

 

Similarly, for single get calls (not bulk), I see logs for region-level meta 
cache clearing. The volume is very low, so that is not a cause for immediate 
concern. But I assume the same reasoning should hold there as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
