[ https://issues.apache.org/jira/browse/HBASE-12534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224992#comment-14224992 ]
stack commented on HBASE-12534:
-------------------------------
bq. .... It does make sense to have it then.

Yeah. Configurable though, as [~nkeywal] suggests.

> Wrong region location cache in client after regions are moved
> -------------------------------------------------------------
>
>                 Key: HBASE-12534
>                 URL: https://issues.apache.org/jira/browse/HBASE-12534
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Liu Shaohui
>            Assignee: Liu Shaohui
>            Priority: Critical
>              Labels: client
>         Attachments: HBASE-12534-0.94-v1.diff, HBASE-12534-v1.diff
>
>
> In our 0.94 HBase cluster, we found that the client held a wrong region location cache and did not update it after a region was moved to another regionserver. The cause is a combination of a wrong client config and a bug in RpcRetryingCaller in the HBase client.
> The rpc configs are the following:
> {code}
> hbase.rpc.timeout=1000
> hbase.client.pause=200
> hbase.client.operation.timeout=1200
> {code}
> But the client retry number is 3:
> {code}
> hbase.client.retries.number=3
> {code}
> Assume a region is at regionserver A and is then moved to regionserver B. The client makes a call to regionserver A and gets a NotServingRegionException. Because the retry number is not 1, the region location cache is not cleaned. See RpcRetryingCaller.java#141 and RegionServerCallable.java#127:
> {code}
> @Override
> public void throwable(Throwable t, boolean retrying) {
>   if (t instanceof SocketTimeoutException ||
>       ....
>   } else if (t instanceof NotServingRegionException && !retrying) {
>     // Purge cache entries for this specific region from hbase:meta cache
>     // since we don't call connect(true) when number of retries is 1.
>     getConnection().deleteCachedRegionLocation(location);
>   }
> }
> {code}
> But the call does not actually retry: it throws a SocketTimeoutException because the time the next attempt would take is larger than the operation timeout. See RpcRetryingCaller.java#152:
> {code}
> expectedSleep = callable.sleep(pause, tries + 1);
> // If, after the planned sleep, there won't be enough time left, we stop now.
> long duration = singleCallDuration(expectedSleep);
> if (duration > callTimeout) {
>   String msg = "callTimeout=" + callTimeout + ", callDuration=" + duration +
>       ": " + callable.getExceptionMessageAdditionalDetail();
>   throw (SocketTimeoutException)(new SocketTimeoutException(msg).initCause(t));
> }
> {code}
> As a result, the wrong region location is never cleaned up.
> [~lhofhansl]
> In HBase 0.94, MIN_RPC_TIMEOUT in singleCallDuration is 2000 by default, which triggers this bug:
> {code}
> private long singleCallDuration(final long expectedSleep) {
>   return (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime)
>       + MIN_RPC_TIMEOUT + expectedSleep;
> }
> {code}
> But there is the same risk in master code too.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
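To make the failure mode concrete, here is a minimal sketch of the timeout arithmetic described above, using the configuration values quoted in the issue. This is a simplified model of the 0.94 singleCallDuration check, not the actual HBase code; the 100 ms elapsed time for the first failed attempt is an illustrative assumption.

```java
// Minimal sketch (assumption: simplified model of the 0.94
// RpcRetryingCaller timeout check; not actual HBase code).
public class RetryTimeoutSketch {
    // 0.94 default cited in the issue description
    static final long MIN_RPC_TIMEOUT = 2000;

    // Mirrors singleCallDuration(): elapsed time so far, plus the
    // minimum rpc timeout, plus the planned backoff sleep.
    static long singleCallDuration(long elapsedMs, long expectedSleepMs) {
        return elapsedMs + MIN_RPC_TIMEOUT + expectedSleepMs;
    }

    public static void main(String[] args) {
        long callTimeout = 1200;    // hbase.client.operation.timeout
        long pause = 200;           // hbase.client.pause
        long elapsed = 100;         // hypothetical time spent on the first failed attempt
        long expectedSleep = pause; // first retry uses a backoff multiplier of 1

        long duration = singleCallDuration(elapsed, expectedSleep);
        // 100 + 2000 + 200 = 2300 > 1200: the caller gives up with a
        // SocketTimeoutException instead of retrying. Because the first
        // NotServingRegionException was seen with retrying == true, the
        // stale location is never purged from the meta cache.
        System.out.println("duration=" + duration
            + " > callTimeout=" + callTimeout
            + " : " + (duration > callTimeout));
    }
}
```

In other words, any hbase.client.retries.number greater than 1 combined with an operation timeout below MIN_RPC_TIMEOUT can reproduce the situation: the cache-purge branch requires !retrying, but the timeout check aborts the call before a non-retrying attempt ever happens.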