[jira] [Commented] (HBASE-29409) Server level meta cache clearing frequently on IO exceptions

Sergey Soldatov (Jira) Tue, 24 Jun 2025 00:31:49 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-29409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17985756#comment-17985756
 ]


Sergey Soldatov commented on HBASE-29409:
-----------------------------------------

I would say that HBASE-27531 resolved the issue. It restored the logic that 
HBASE-21775 broke.

> Server level meta cache clearing frequently on IO exceptions
> ------------------------------------------------------------
>
>                 Key: HBASE-29409
>                 URL: https://issues.apache.org/jira/browse/HBASE-29409
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 2.3.0
>            Reporter: Tushar Ahuja
>            Priority: Major
>
> {code:java}
> Hbase client version: 2.3.0
> Hbase version: 2.1.7
> Java version: 8
> Hbase client repo: https://github.com/apache/hbase/tree/master/hbase-client
> Tag used: rel/2.3.0
> {code}
> From my application, I make two types of HBase calls:
> 1. Single get
> 2. Bulk gets
>  
> Coming to bulk gets first:
> {code:java}
> public Result[] get(List<Get> gets) throws IOException {code}
> Intermittently, I saw latency spikes in my metrics. After enabling client 
> metrics (hbase.client.metrics.enable), I noticed an elevated count for this 
> metric:
> {noformat}
> MetricsConnection_metaCacheNumClearServer{noformat}
>  
> Upon enabling trace logs for the MetaCache class 
> (org/apache/hadoop/hbase/client/MetaCache.java), I noticed a pattern: 
> when a bulk get call to HBase fails with a CallTimeoutException, the region 
> location cache for the entire server is cleared:
> {code:java}
> Caused by: java.lang.RuntimeException: 
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 
> action: CallTimeoutException: 1 time, servers with issues: 
> datanode2-az-prod-ci,16020,1747212636218
> 23-06-2025 12:11:53.123   [pool-7-thread-13] TRACE 
> o.a.hadoop.hbase.client.MetaCache - Removed all cached region locations that 
> map to datanode2-az-prod-ci,16020,1747212636218{code}
>  
> Shortly after, the meta cache for the evicted regions is repopulated as 
> requests come in. But in the window between clearing and repopulating the 
> cache, I see an increased number of timeouts in my application.
> Looking at the HBase client code, I found this block:
> {code:java}
> private void cleanServerCache(ServerName server, Throwable regionException) {
>   if (ClientExceptionsUtil.isMetaClearingException(regionException)) {
>     // We want to make sure to clear the cache in case there were
>     // location-related exceptions. We don't want to clear the cache for
>     // every possible exception that comes through, however.
>     asyncProcess.connection.clearCaches(server);
>   }
> }
> {code}
> {code:java}
> public static boolean isMetaClearingException(Throwable cur) {
>   cur = findException(cur);
>   if (cur == null) {
>     return true;
>   }
>   return !isSpecialException(cur) || (cur instanceof RegionMovedException)
>     || cur instanceof NotServingRegionException;
> }
> {code}
> {code:java}
> public static boolean isSpecialException(Throwable cur) {
>   return (cur instanceof RegionMovedException
>     || cur instanceof RegionOpeningException
>     || cur instanceof RegionTooBusyException
>     || cur instanceof RpcThrottlingException
>     || cur instanceof MultiActionResultTooLarge
>     || cur instanceof RetryImmediatelyException
>     || cur instanceof CallQueueTooBigException
>     || cur instanceof CallDroppedException
>     || cur instanceof NotServingRegionException
>     || cur instanceof RequestTooBigException);
> }
> {code}
>  
> Since CallTimeoutException is not treated as a special exception, the cache 
> for the server is cleared. This leads to cache misses for the row keys and 
> timeouts in my application until the cache is repopulated.
> I have a couple of questions here:
>  * Since intermittent network issues and timeouts are expected, why is the 
> cache for the complete server cleared in this case? Is this a bug or a 
> deliberate design choice?
>  * I can also see some other tickets regarding MetaCache issues: HBASE-28941, 
> HBASE-27531, HBASE-27521.
>  * Since my client version is relatively old, is this handled in more recent 
> clients?
>  
> Similarly, for single (non-bulk) get calls, I see logs for region-level meta 
> cache clearing. The volume is very low, so it is not an immediate concern, 
> but I assume similar reasoning applies there as well.
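
As a minimal sketch of the classification the quoted report describes, the
following stand-alone Java program reproduces the boolean logic of
isMetaClearingException using hypothetical stand-in exception classes. These
are not the real HBase types: only three of the ten listed "special" types
are modeled, and the findException unwrapping step is omitted. Under those
assumptions it illustrates why a timeout falls through to a server-level
cache clear while RegionTooBusyException does not.

```java
import java.io.IOException;

public class MetaClearingSketch {
    // Hypothetical stand-ins for the HBase exception types named in the
    // report; the real classes live in the HBase client library.
    public static class RegionMovedException extends IOException {}
    public static class NotServingRegionException extends IOException {}
    public static class RegionTooBusyException extends IOException {}
    public static class CallTimeoutException extends IOException {}

    // Reduced form of the quoted isSpecialException: models only three of
    // the ten listed types (an assumption made to keep the sketch short).
    public static boolean isSpecialException(Throwable cur) {
        return cur instanceof RegionMovedException
            || cur instanceof NotServingRegionException
            || cur instanceof RegionTooBusyException;
    }

    // Same boolean shape as the quoted isMetaClearingException, without the
    // findException(cur) unwrapping step.
    public static boolean isMetaClearingException(Throwable cur) {
        if (cur == null) {
            return true;
        }
        return !isSpecialException(cur)
            || cur instanceof RegionMovedException
            || cur instanceof NotServingRegionException;
    }

    public static void main(String[] args) {
        // Not "special", so it is classified as meta-clearing.
        System.out.println("CallTimeoutException: "
            + isMetaClearingException(new CallTimeoutException()));
        // "Special" and not location-related, so the cache is kept.
        System.out.println("RegionTooBusyException: "
            + isMetaClearingException(new RegionTooBusyException()));
        // "Special" but location-related, so the cache is still cleared.
        System.out.println("RegionMovedException: "
            + isMetaClearingException(new RegionMovedException()));
    }
}
```

With these stand-ins, the timeout and the region-moved cases evaluate to true
and the too-busy case to false, which matches the behavior described in the
report for the 2.3.0 client.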



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
