Tushar Ahuja created HBASE-29409:
------------------------------------
Summary: Server level meta cache clearing frequently on IO
exceptions
Key: HBASE-29409
URL: https://issues.apache.org/jira/browse/HBASE-29409
Project: HBase
Issue Type: Improvement
Affects Versions: 2.3.0
Reporter: Tushar Ahuja
HBase client version: 2.3.0
HBase version: 2.1.7
Java version: 8
HBase client repo: [https://github.com/apache/hbase/tree/master/hbase-client]
Tag used: rel/2.3.0
From my application, I'm making 2 types of HBase calls:
1. Single get
2. Bulk gets
Coming to bulk gets first:
{code:java}
public Result[] get(List<Get> gets) throws IOException {code}
Intermittently, I saw latency spikes in my metrics. After enabling the client
metrics flag (hbase.client.metrics.enable), I noticed a high count for this metric:
{noformat}
MetricsConnection_metaCacheNumClearServer{noformat}
Upon enabling trace logs for the MetaCache class
(org/apache/hadoop/hbase/client/MetaCache.java), I noticed a pattern:
When a CallTimeoutException occurs during a bulk get call to HBase, the
region cache for the entire server is cleared:
{code:java}
Caused by: java.lang.RuntimeException: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: CallTimeoutException: 1 time, servers with issues: datanode2-az-prod-ci,16020,1747212636218
23-06-2025 12:11:53.123 [pool-7-thread-13] TRACE o.a.hadoop.hbase.client.MetaCache - Removed all cached region locations that map to datanode2-az-prod-ci,16020,1747212636218{code}
Shortly after, the meta cache for the evicted regions is repopulated as
requests come in. But in the time window between clearing and repopulating the
cache, I see an increased number of timeouts in my application.
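To make the blast radius concrete, here is a toy model of region-level versus server-level clearing. ToyRegionCache and its methods are hypothetical names for illustration only; this is not the actual MetaCache implementation, just a sketch of why one server-level clear can evict many region locations at once.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Toy stand-in for a client-side region location cache (hypothetical, not HBase's MetaCache).
final class ToyRegionCache {
    // region name -> server currently believed to host it
    private final Map<String, String> locations = new HashMap<>();

    void cache(String region, String server) {
        locations.put(region, server);
    }

    // Region-level clearing: drops at most one entry.
    int clearRegion(String region) {
        return locations.remove(region) != null ? 1 : 0;
    }

    // Server-level clearing: drops every entry that maps to the given server.
    int clearServer(String server) {
        int removed = 0;
        Iterator<Map.Entry<String, String>> it = locations.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue().equals(server)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    int size() {
        return locations.size();
    }

    public static void main(String[] args) {
        ToyRegionCache cache = new ToyRegionCache();
        // 80 regions on server-a, 20 on server-b (made-up numbers).
        for (int i = 0; i < 100; i++) {
            cache.cache("region-" + i, i < 80 ? "server-a" : "server-b");
        }
        int evicted = cache.clearServer("server-a");
        System.out.println("evicted=" + evicted + " remaining=" + cache.size());
    }
}
```

In this model a single failed call against server-a forces the next request for each of its 80 regions to look the location up again, which matches the window of elevated latency described above.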
Looking at the HBase client code, I found these code blocks:
{code:java}
private void cleanServerCache(ServerName server, Throwable regionException) {
  if (ClientExceptionsUtil.isMetaClearingException(regionException)) {
    // We want to make sure to clear the cache in case there were location-related exceptions.
    // We don't to clear the cache for every possible exception that comes through, however.
    asyncProcess.connection.clearCaches(server);
  }
} {code}
{code:java}
public static boolean isMetaClearingException(Throwable cur) {
  cur = findException(cur);
  if (cur == null) {
    return true;
  }
  return !isSpecialException(cur) || (cur instanceof RegionMovedException)
    || cur instanceof NotServingRegionException;
} {code}
{code:java}
public static boolean isSpecialException(Throwable cur) {
  return (cur instanceof RegionMovedException || cur instanceof RegionOpeningException
    || cur instanceof RegionTooBusyException || cur instanceof RpcThrottlingException
    || cur instanceof MultiActionResultTooLarge || cur instanceof RetryImmediatelyException
    || cur instanceof CallQueueTooBigException || cur instanceof CallDroppedException
    || cur instanceof NotServingRegionException || cur instanceof RequestTooBigException);
} {code}
Since CallTimeoutException is not treated as a special exception, the cache for
the server is cleared. This leads to cache misses for the row keys and
timeouts in my application until the cache is repopulated.
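The classification can be reproduced in isolation with a small sketch. The exception classes below are local stand-ins declared only so the example compiles without hbase-client, the special-exception list is abbreviated to three of the ten types, and the findException unwrapping step is omitted; only the boolean structure follows the quoted methods.

```java
import java.io.IOException;

// Standalone sketch of the quoted classification logic (stand-in types, not the HBase classes).
final class MetaClearingSketch {
    // Local stand-ins for the HBase exception types named in the quoted code.
    static class CallTimeoutException extends IOException {}
    static class RegionMovedException extends IOException {}
    static class NotServingRegionException extends IOException {}
    static class RegionTooBusyException extends IOException {}

    // Abbreviated version of isSpecialException: the quoted method checks ten types.
    static boolean isSpecialException(Throwable cur) {
        return cur instanceof RegionMovedException
            || cur instanceof NotServingRegionException
            || cur instanceof RegionTooBusyException;
    }

    // Same boolean structure as the quoted isMetaClearingException,
    // minus the findException(cur) unwrapping step.
    static boolean isMetaClearingException(Throwable cur) {
        if (cur == null) {
            return true;
        }
        return !isSpecialException(cur)
            || cur instanceof RegionMovedException
            || cur instanceof NotServingRegionException;
    }

    public static void main(String[] args) {
        // Not in the special list, so it falls through to "clear the cache".
        System.out.println("CallTimeoutException -> "
            + isMetaClearingException(new CallTimeoutException()));
        // Special and not location-related, so the cache is kept.
        System.out.println("RegionTooBusyException -> "
            + isMetaClearingException(new RegionTooBusyException()));
        // Special but location-related, so the cache is cleared.
        System.out.println("RegionMovedException -> "
            + isMetaClearingException(new RegionMovedException()));
    }
}
```

Under these assumptions, any exception type that is absent from the special list, including a timeout, is classified as meta-clearing by default.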
I have a couple of questions here:
* Since intermittent network issues / timeouts are expected, why is the cache
for the complete server cleared in this case? Is this a bug or a deliberate
design choice?
* I can also see some other tickets regarding MetaCache issues: HBASE-28941,
HBASE-27531, HBASE-27521.
* What can I do to fix this issue?
* Will upgrading the client help me fix this in any way? A client upgrade would
be relatively simpler for me than a complete HBase version upgrade. I'm
using Java 8, so I would need a client compatible with HBase 2.1.7 and Java 8.
Similarly, for single get calls (not bulk), I see logs for region-level meta
cache clearing. The volume is very low, so that is not a cause for immediate
concern. But I assume similar reasoning should hold there as well.