[jira] [Comment Edited] (HBASE-18005) read replica: handle the case that region server hosting both primary replica and meta region is down

Lei Chen (JIRA) Thu, 11 May 2017 08:19:42 -0700

    [ https://issues.apache.org/jira/browse/HBASE-18005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006621#comment-16006621 ]


Lei Chen edited comment on HBASE-18005 at 5/11/17 3:18 PM:
-----------------------------------------------------------

Thanks for the explanation and update.
Yes, there is a gap between the primary meta region and its replica, defined by 
hbase.regionserver.meta.storefile.refresh.period, and there is no notification 
mechanism at present. Setting hbase.meta.replica.count to 2 or 3 is therefore 
not a complete solution, only an improvement.
Meanwhile, there is also a gap between the primary meta region and the one 
cached on the client side.
The two gaps differ in how they are closed: the first is refreshed at a fixed 
interval, while the second is updated only when a miss is encountered.
Please correct me if I'm wrong, but the worst case I can imagine is:
1. The locations of a primary region p1 and its replica r1 have changed.
2. The primary meta is updated but its replica is not, because of the fixed 
refresh interval.
3. The region server that serves both primaries (p1 and the primary meta) goes 
down.
4. A client that has not updated its meta cache since p1 and r1 were relocated 
now makes a get request to p1.

That being said, I agree with you that the cached locations of the replicas are 
still worth trying and should be exempt from the meta cache clearing, as you 
have proposed in the patch.
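
To make the point about sparing the replica entries concrete, here is a minimal 
sketch in plain Java. It is only a toy model, not the HBase client code: the 
class ReplicaLocationCache, its methods, and the server names rs1 and rs2 are 
all hypothetical. The only HBase convention it relies on is that replicaId 0 
denotes the primary. It contrasts clearing every cached replica location with 
clearing only the entry for the replica whose server failed.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Toy model of a client-side cache holding one location per replica of a region. */
final class ReplicaLocationCache {
    // replicaId -> server name; replicaId 0 is the primary.
    private final Map<Integer, String> locations = new TreeMap<>();

    void put(int replicaId, String server) {
        locations.put(replicaId, server);
    }

    /** Drops every cached replica location (the behaviour the issue reports). */
    void clearAll() {
        locations.clear();
    }

    /** Drops only the entry for the given replica (the behaviour the patch proposes). */
    void clearReplica(int replicaId) {
        locations.remove(replicaId);
    }

    /** Returns the cached location with the lowest replicaId, if any is left. */
    Optional<String> firstAvailable() {
        return locations.values().stream().findFirst();
    }

    public static void main(String[] args) {
        ReplicaLocationCache cache = new ReplicaLocationCache();
        cache.put(0, "rs1"); // primary p1, assumed to be co-hosted with meta on rs1
        cache.put(1, "rs2"); // secondary r1

        // rs1 goes down: forget only the primary. Meta cannot be read until it is
        // reassigned, so the cached secondary is the only location left to try.
        cache.clearReplica(0);
        System.out.println(cache.firstAvailable().orElse("no cached location")); // prints "rs2"
    }
}
```

With clearAll() in place of clearReplica(0), the sketch would print "no cached 
location", which corresponds to the client having nothing to fall back on while 
meta is being reassigned. In the worst case listed above the cached r1 location 
may itself be stale, so the read can still fail; the sketch only shows why the 
entry is worth keeping.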


was (Author: leochen4891):
Thanks for the explanation and update.
Yes, there is a gap between the primary meta region and its replica, defined by 
hbase.regionserver.meta.storefile.refresh.period, plus there is no notification 
mechanism at present, setting the hbase.meta.replica.count to 2 or 3 is indeed 
not a complete solution but an improve.
Meanwhile, there is also a gap between the primary meta region and the one 
cached on the client side.
The difference between the two gaps is how the gap is closed. The first one 
refreshes with a fixed interval while the second one updates see a miss.
Please correct me if I'm wrong, the worst case I can imagine is 
1. The locations of a primary region p1 and its replica r1 have changed. 
2. The primary meta updates but its replica is not, due to the fixed interval
3. The region server that serves both primaries  goes down
4. A client has not updated its meta cache after p1 and r1 was relocated, and 
now makes a get request to p1

That being said, I agree with you that the cached location of the replicas is 
still worth trying, and should be pardoned from clearing the meta cache, as you 
have proposed in the patch.

> read replica: handle the case that region server hosting both primary replica 
> and meta region is down
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-18005
>                 URL: https://issues.apache.org/jira/browse/HBASE-18005
>             Project: HBase
>          Issue Type: Bug
>            Reporter: huaxiang sun
>            Assignee: huaxiang sun
>         Attachments: HBASE-18005-master-001.patch
>
>
> Testing identified a corner case: when the region server hosting both the 
> primary replica and the meta region is down, the client tries to reload the 
> primary replica location from the meta table. It is supposed to clear only 
> the cached location for the specific replicaId, but it clears the cached 
> locations of all replicas. Please see
> https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java#L813
> Since it takes some time for regions (including the meta region) to be 
> reassigned, the following may throw an exception:
> https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCallerWithReadReplicas.java#L173
> This exception needs to be caught, and the client needs to fall back to the 
> cached locations (in this case, the primary replica's location is not 
> available). If there are cached locations for other replicas, the client can 
> still go ahead and get stale values from the secondary replicas.
> With meta replicas, it still helps not to clear the caches for all replicas, 
> as the info from the primary meta replica is up to date.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
