Guangxu Cheng created KYLIN-4711:
------------------------------------
Summary: Change default value to 3 for kylin.metadata.hbase-client-retries-number
Key: KYLIN-4711
URL: https://issues.apache.org/jira/browse/KYLIN-4711
Project: Kylin
Issue Type: Improvement
Affects Versions: v3.1.0
Reporter: Guangxu Cheng
Assignee: Guangxu Cheng
```shell
java.lang.RuntimeException: org.apache.kylin.job.exception.PersistentException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=1, exceptions:
Thu Aug 20 21:06:01 GMT+08:00 2020, RpcRetryingCaller{globalStartTime=1597928761253, pause=1000, retries=1}, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region kylin_production_metadata,/execute_output/3adc92f2-edcd-2705-5a9c-ad0afe4a0808-01,1594348337103.48b9e5e9c3c7891750236fcec84b38d5. is not online on xxx.xxx.xxx.xxx,16031,1558009276096
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3033)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1110)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2064)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33857)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2189)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
    at java.lang.Thread.run(Thread.java:745)
on xxx.xxx.xxx.xxx,16031,1558009276096
    at org.apache.kylin.job.execution.ExecutableManager.getOutput(ExecutableManager.java:174)
    at org.apache.kylin.job.execution.AbstractExecutable.getOutput(AbstractExecutable.java:450)
    at org.apache.kylin.job.execution.AbstractExecutable.isDiscarded(AbstractExecutable.java:561)
    at org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:165)
    at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:191)
    at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
    at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:191)
    at org.apache.kylin.job.impl.threadpool.DistributedScheduler$JobRunner.run(DistributedScheduler.java:110)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```
Recently, our build jobs have been failing occasionally. Analysis showed that
the failures were caused by errors when accessing the MetaStore. We use HBase
as the MetaStore.
When accessing HBase, the client caches the table's region location
information. When a region is moved, the client does not proactively update
its cache, so the next request goes to the old region server and fails with a
NotServingRegionException. The client refreshes the cached location only when
it retries. But the number of retries in Kylin is 1 (the trace above reports
attempts=1), which means that the client never tries again.
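
The failure mode described above can be reproduced with a small self-contained simulation. This is only a sketch and does not use the real HBase client: the `Cluster` and `Client` classes, the `readSucceedsAfterMove` helper, and the local `NotServingRegionException` are hypothetical stand-ins, and the attempt counting is simplified to "total attempts per request".

```java
import java.util.HashMap;
import java.util.Map;

public class RegionCacheRetryDemo {

    /** Simplified stand-in for HBase's NotServingRegionException (not the real class). */
    static class NotServingRegionException extends RuntimeException {
        NotServingRegionException(String message) {
            super(message);
        }
    }

    /** Toy cluster holding the authoritative region -> server assignment. */
    static class Cluster {
        private final Map<String, String> regionToServer = new HashMap<>();

        void assign(String region, String server) {
            regionToServer.put(region, server);
        }

        String locate(String region) {
            return regionToServer.get(region);
        }

        /** Serves the read only if the request reaches the server that hosts the region. */
        String get(String region, String server) {
            if (!server.equals(regionToServer.get(region))) {
                throw new NotServingRegionException(
                        "Region " + region + " is not online on " + server);
            }
            return "value-from-" + server;
        }
    }

    /** Toy client that caches region locations and refreshes them only after a failure. */
    static class Client {
        private final Cluster cluster;
        private final int maxAttempts;
        private final Map<String, String> locationCache = new HashMap<>();

        Client(Cluster cluster, int maxAttempts) {
            this.cluster = cluster;
            this.maxAttempts = maxAttempts;
        }

        String get(String region) {
            NotServingRegionException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                // Use the cached location if present, otherwise look it up.
                String server = locationCache.computeIfAbsent(region, cluster::locate);
                try {
                    return cluster.get(region, server);
                } catch (NotServingRegionException e) {
                    last = e;
                    // Drop the stale entry so that the next attempt re-locates the region.
                    locationCache.remove(region);
                }
            }
            throw new IllegalStateException("Failed after attempts=" + maxAttempts, last);
        }
    }

    /** Returns true if a read issued after a region move succeeds with the given attempt limit. */
    static boolean readSucceedsAfterMove(int maxAttempts) {
        Cluster cluster = new Cluster();
        cluster.assign("r1", "serverA");
        Client client = new Client(cluster, maxAttempts);
        client.get("r1");                 // warms the cache with serverA
        cluster.assign("r1", "serverB");  // the region moves; the client is not notified
        try {
            client.get("r1");
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("attempts=1 -> " + readSucceedsAfterMove(1));
        System.out.println("attempts=3 -> " + readSucceedsAfterMove(3));
    }
}
```

With a single attempt the stale cache entry is discarded but never re-read, so the request fails. With three attempts the second attempt re-locates the region and succeeds, which is the motivation for raising kylin.metadata.hbase-client-retries-number to 3.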