[
https://issues.apache.org/jira/browse/HBASE-29409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17985756#comment-17985756
]
Sergey Soldatov commented on HBASE-29409:
-----------------------------------------
I would say that HBASE-27531 resolved the issue. It restored the logic that
HBASE-21775 broke.
> Server level meta cache clearing frequently on IO exceptions
> ------------------------------------------------------------
>
> Key: HBASE-29409
> URL: https://issues.apache.org/jira/browse/HBASE-29409
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 2.3.0
> Reporter: Tushar Ahuja
> Priority: Major
>
> {code:java}
> Hbase client version: 2.3.0
> Hbase version: 2.1.7
> Java version: 8
> Hbase client repo: https://github.com/apache/hbase/tree/master/hbase-client
> Tag used: rel/2.3.0
> {code}
> From my application, I'm making two types of HBase calls:
> 1. Single get
> 2. Bulk gets
>
> Coming to bulk gets first:
> {code:java}
> public Result[] get(List<Get> gets) throws IOException {code}
> Intermittently, I saw latency spikes in my metrics. On enabling the
> metrics flag (hbase.client.metrics.enable), I noticed a higher count of
> this metric:
> {noformat}
> MetricsConnection_metaCacheNumClearServer{noformat}
>
> Upon enabling trace logs over the MetaCache class (
> {noformat}
> org/apache/hadoop/hbase/client/MetaCache.java{noformat}
> ), I noticed a pattern:
> When a CallTimeoutException occurs during a bulk get call to HBase, the
> region cache for the entire server is cleared:
> {code:java}
> Caused by: java.lang.RuntimeException:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1
> action: CallTimeoutException: 1 time, servers with issues:
> datanode2-az-prod-ci,16020,1747212636218
> 23-06-2025 12:11:53.123 [pool-7-thread-13] TRACE
> o.a.hadoop.hbase.client.MetaCache - Removed all cached region locations that
> map to datanode2-az-prod-ci,16020,1747212636218{code}
>
> Shortly after, the meta cache for the evicted regions is repopulated as
> requests come in. But in the time window between clearing and repopulating
> the cache, I notice an increased number of timeouts in my application.
> Looking at the HBase client code, I noticed this code block:
> {code:java}
> private void cleanServerCache(ServerName server, Throwable regionException) {
>   if (ClientExceptionsUtil.isMetaClearingException(regionException)) {
>     // We want to make sure to clear the cache in case there were location-related exceptions.
>     // We don't to clear the cache for every possible exception that comes through, however.
>     asyncProcess.connection.clearCaches(server);
>   }
> } {code}
> {code:java}
> public static boolean isMetaClearingException(Throwable cur) {
>   cur = findException(cur);
>   if (cur == null) {
>     return true;
>   }
>   return !isSpecialException(cur) || (cur instanceof RegionMovedException)
>     || cur instanceof NotServingRegionException;
> } {code}
> {code:java}
> public static boolean isSpecialException(Throwable cur) {
>   return (cur instanceof RegionMovedException || cur instanceof RegionOpeningException
>     || cur instanceof RegionTooBusyException || cur instanceof RpcThrottlingException
>     || cur instanceof MultiActionResultTooLarge || cur instanceof RetryImmediatelyException
>     || cur instanceof CallQueueTooBigException || cur instanceof CallDroppedException
>     || cur instanceof NotServingRegionException || cur instanceof RequestTooBigException);
> } {code}
>
> Since CallTimeoutException is not treated as a special exception, the cache
> for the server is cleared. This leads to cache misses for the row keys and
> timeouts in my application until the cache is repopulated.
> I have a couple of questions here:
> * Since intermittent network issues / timeouts are expected, why is the
> cache for the complete server cleared in this case? Is this a bug or a
> deliberate design choice?
> * I can also see some other tickets regarding MetaCache issues: HBASE-28941,
> HBASE-27531, HBASE-27521.
> * Since my client version is relatively old, is this handled in recent
> clients?
>
> Similarly, for single get calls (not bulk), I see logs for region-level meta
> cache clearing. The volume is very low, so it is not a cause for immediate
> concern, but I assume similar reasoning should hold there as well.
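
The classification described in the issue can be reproduced in isolation. The following is a minimal sketch, not the HBase client code. The nested exception classes are hypothetical stand-ins for the real HBase types, isSpecialException is abbreviated to three of the quoted cases, the findException unwrapping step is omitted, and the cache is modelled as a plain map from region name to server name. Under those assumptions, a CallTimeoutException is classified as meta-clearing and evicts every cached location for the server, while a RegionTooBusyException does not.

```java
import java.util.HashMap;
import java.util.Map;

public class MetaClearingSketch {
    // Hypothetical stand-ins for the HBase exception types (not the real classes).
    public static class CallTimeoutException extends RuntimeException {}
    public static class RegionTooBusyException extends RuntimeException {}
    public static class RegionMovedException extends RuntimeException {}
    public static class NotServingRegionException extends RuntimeException {}

    // Abbreviated form of the quoted isSpecialException: only the stand-ins above.
    public static boolean isSpecialException(Throwable cur) {
        return cur instanceof RegionMovedException
            || cur instanceof RegionTooBusyException
            || cur instanceof NotServingRegionException;
    }

    // Same boolean structure as the quoted isMetaClearingException,
    // without the findException unwrapping step.
    public static boolean isMetaClearingException(Throwable cur) {
        if (cur == null) {
            return true;
        }
        return !isSpecialException(cur)
            || cur instanceof RegionMovedException
            || cur instanceof NotServingRegionException;
    }

    // Toy cache (region name -> server name): drops every entry that maps to
    // the given server and returns how many entries were evicted.
    public static int clearServer(Map<String, String> cache, String server) {
        int before = cache.size();
        cache.values().removeIf(server::equals);
        return before - cache.size();
    }

    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>();
        cache.put("region-a", "server-1");
        cache.put("region-b", "server-1");
        cache.put("region-c", "server-2");

        // A timeout is not in the "special" list, so it is meta-clearing.
        if (isMetaClearingException(new CallTimeoutException())) {
            int evicted = clearServer(cache, "server-1");
            System.out.println("evicted " + evicted + ", remaining " + cache.size());
        }

        // A "special" exception that is neither RegionMoved nor NotServingRegion
        // leaves the cache alone.
        System.out.println(isMetaClearingException(new RegionTooBusyException()));
    }
}
```

In this model one timeout against server-1 removes both of its cached regions at once, which matches the server-wide eviction seen in the trace log quoted above.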