Guangxu Cheng created KYLIN-4711:
------------------------------------

             Summary: Change default value to 3 for kylin.metadata.hbase-client-retries-number
                 Key: KYLIN-4711
                 URL: https://issues.apache.org/jira/browse/KYLIN-4711
             Project: Kylin
          Issue Type: Improvement
    Affects Versions: v3.1.0
            Reporter: Guangxu Cheng
            Assignee: Guangxu Cheng


```shell
java.lang.RuntimeException: org.apache.kylin.job.exception.PersistentException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=1, exceptions:
Thu Aug 20 21:06:01 GMT+08:00 2020, RpcRetryingCaller{globalStartTime=1597928761253, pause=1000, retries=1}, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region kylin_production_metadata,/execute_output/3adc92f2-edcd-2705-5a9c-ad0afe4a0808-01,1594348337103.48b9e5e9c3c7891750236fcec84b38d5. is not online on xxx.xxx.xxx.xxx,16031,1558009276096
 at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3033)
 at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1110)
 at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2064)
 at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33857)
 at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2189)
 at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
 at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
 at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
 at java.lang.Thread.run(Thread.java:745)
 on xxx.xxx.xxx.xxx,16031,1558009276096
 at org.apache.kylin.job.execution.ExecutableManager.getOutput(ExecutableManager.java:174)
 at org.apache.kylin.job.execution.AbstractExecutable.getOutput(AbstractExecutable.java:450)
 at org.apache.kylin.job.execution.AbstractExecutable.isDiscarded(AbstractExecutable.java:561)
 at org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:165)
 at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:191)
 at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
 at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:191)
 at org.apache.kylin.job.impl.threadpool.DistributedScheduler$JobRunner.run(DistributedScheduler.java:110)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 ```
Recently, our build jobs have been failing occasionally. Analysis showed that
the failures were caused by errors when accessing the MetaStore. We use HBase
as the MetaStore.

When accessing HBase, the client caches the table's region information
locally. When a region is moved, the client does not proactively update its
cache, so its next request still goes to the old region server and it receives
a NotServingRegionException. The client refreshes the cached region
information only when it retries. But the number of retries in Kylin is 1,
which means that the client will not try again.
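
To illustrate why a single attempt is not enough, here is a minimal toy model
of the behaviour described above. It is not HBase or Kylin code: the names
ToyCluster, ToyClient, NotServingRegionError, max_attempts and run are all
hypothetical, and the model only assumes that a stale cache entry is refreshed
after a failed attempt.

```python
class NotServingRegionError(Exception):
    """Toy stand-in for HBase's NotServingRegionException."""


class ToyCluster:
    """Hypothetical cluster that tracks which server hosts the region."""

    def __init__(self, host):
        self.host = host

    def get(self, server, key):
        # A request sent to a server that no longer hosts the region fails.
        if server != self.host:
            raise NotServingRegionError(f"region is not online on {server}")
        return f"value-of-{key}"


class ToyClient:
    """Hypothetical client that caches the region location."""

    def __init__(self, cluster, max_attempts):
        self.cluster = cluster
        self.max_attempts = max_attempts
        self.cached_host = cluster.host  # location cached at first lookup

    def get(self, key):
        errors = []
        for _ in range(self.max_attempts):
            try:
                return self.cluster.get(self.cached_host, key)
            except NotServingRegionError as exc:
                errors.append(exc)
                # Assumption: the cache is refreshed only after a failure.
                self.cached_host = self.cluster.host
        raise RuntimeError(
            f"Failed after attempts={self.max_attempts}: {errors}"
        )


def run(max_attempts):
    """Move the region after the client cached it, then issue one get."""
    cluster = ToyCluster("rs1")
    client = ToyClient(cluster, max_attempts)
    cluster.host = "rs2"  # region moves; the client's cache is now stale
    try:
        return client.get("job-output")
    except RuntimeError:
        return None
```

In this model, run(1) returns None because the only attempt hits the stale
location, while run(3) succeeds on the second attempt once the cache has been
refreshed.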



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
