Liang Geng created KYLIN-4307:
---------------------------------

             Summary: ExponentialBackoffRetry implemented a incorrect retry 
policy
                 Key: KYLIN-4307
                 URL: https://issues.apache.org/jira/browse/KYLIN-4307
             Project: Kylin
          Issue Type: Bug
          Components: Storage - HBase
    Affects Versions: v2.6.4
         Environment: HBase 1.1.5
            Reporter: Liang Geng
         Attachments: Screen Shot 2019-12-19 at 12.16.05.png

Today we had encountered a build error. The step "Build N-Dimension Cuboid : 
level 1" failed caused by WriteConflictException.

 

 
{code:java}
2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job 
4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update 
row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 
1576696795196, to newTs: 1576696799048, operation result: true2019-12-19 
03:19:59,049 TRACE [Scheduler 1528161485 Job 
4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update 
row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 
1576696795196, to newTs: 1576696799048, operation result: true2019-12-19 
03:20:02,775 INFO  [BadQueryDetector] service.BadQueryDetector:147 : Detect bad 
query.2019-12-19 03:20:03,254 TRACE [Scheduler 1528161485 Job 
d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : Update 
row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 
1576696793142, to newTs: 1576696803253, operation result: true2019-12-19 
03:20:03,256 TRACE [Scheduler 1528161485 Job 
d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : Update 
row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 
1576696803253, to newTs: 1576696803255, operation result: true2019-12-19 
03:20:04,187 INFO  [FetcherRunner 2062551616-143] 
threadpool.DefaultFetcherRunner:85 : Job Fetcher: 2 should running, 2 actual 
running, 0 stopped, 0 ready, 1754 already succeed, 7 error, 10 discarded, 0 
others2019-12-19 03:20:09,124 TRACE [Scheduler 1528161485 Job 
4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update 
row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 
1576696799048, to newTs: 1576696809122, operation result: true2019-12-19 
03:20:09,125 TRACE [Scheduler 1528161485 Job 
4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update 
row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 
1576696809122, to newTs: 1576696809124, operation result: true2019-12-19 
03:20:13,263 TRACE [Scheduler 1528161485 Job 
d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : Update 
row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 
1576696803255, to newTs: 1576696813262, operation result: true2019-12-19 
03:20:24,133 INFO  [Scheduler 1528161485 Job 
4759f195-9443-2604-5358-d5804d9aaefc-879] 
persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1 
seconds.2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job 
4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update 
row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 
1576696809124, to newTs: 1576696819131, operation result: false2019-12-19 
03:20:25,284 TRACE [Scheduler 1528161485 Job 
d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : Update 
row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 
1576696813262, to newTs: 1576696825245, operation result: true2019-12-19 
03:20:25,299 ERROR [Scheduler 1528161485 Job 
4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 : 
error execute MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, 
name=Build N-Dimension Cuboid : level 1, 
state=RUNNING}org.apache.kylin.common.persistence.WriteConflictException: 
Overwriting conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, 
expect old TS 1576696809124, but it is 1576696819131 at 
org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
 at 
org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423)
 at 
org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
 at 
org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423)
 at 
org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411)
 at 
org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405)
 at 
org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388)
 at 
org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368)
 at 
org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252)
 at 
org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426) 
at 
org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566)
 at 
org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163)
 at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
 at 
org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
 at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
 at 
org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745)
{code}
 

At 03:20:24,133, a thread "Scheduler 1528161485 Job 
4759f195-9443-2604-5358-d5804d9aaefc-879" is trying to write content to the 
path "/execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08" but the 
exception was raised for some reasons. We don't know what kind of exception 
happened because the exception was eaten by doWithRetry.

ExponentialBackoffRetry will retry if the following exceptions happened: 
SocketTimeoutException, or ConnectException, or RetriesExhaustedException or 
any other exceptions configured in "kylin.resourcestore.connection-exceptions".

By default the retry policy is disabled, but we have enabled it by setting 
"kylin.resourcestore.reconnect-enabled". Because failed to call 
checkAndPutResourceImpl, the thread sleeps 1 second to back-off.

 
{code:java}
// A successful written, 1576696809122 -> 1576696809124// A successful written, 
1576696809122 -> 15766968091242019-12-19 03:20:09,125 TRACE [Scheduler 
1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] 
hbase.HBaseResourceStore:335 : Update row 
/execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 
1576696809122, to newTs: 1576696809124, operation result: true
// A written with exception but actually persistence to HBase, 1576696809124 -> 
1576696819131// ExponentialBackoffRetry.doWithRetry catched the exception, and 
then back-off2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job 
4759f195-9443-2604-5358-d5804d9aaefc-879] 
persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1 seconds.
// Retry as back-off time reached. But this time, the invoking of checkAndPut 
failed because the content is already written.2019-12-19 03:20:25,228 TRACE 
[Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] 
hbase.HBaseResourceStore:335 : Update row 
/execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 
1576696809124, to newTs: 1576696819131, operation result: false
// checkAndPutResourceImpl throws WriteConflictException2019-12-19 03:20:25,299 
ERROR [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] 
common.MapReduceExecutable:198 : error execute 
MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, name=Build 
N-Dimension Cuboid : level 1, 
state=RUNNING}org.apache.kylin.common.persistence.WriteConflictException: 
Overwriting conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, 
expect old TS 1576696809124, but it is 1576696819131 at 
org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
 at 
org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423)
 at 
org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
 at 
org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423)
 at 
org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411)
 at 
org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405)
 at 
org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388)
 at 
org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368)
 at 
org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252)
 at 
org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426) 
at 
org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566)
 at 
org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163)
 at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
 at 
org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
 at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
 at 
org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745)
{code}
 

As shown in the log, "expect old TS 1576696809124, but it is 1576696819131". 
The thread is trying to update with the same timestamp. That shows what? That 
shows the first written is successful but throws exceptions. And then, that 
exception was caught by doWithRetry. And a retry happened, cause the 
WriteConflictException.

I'm not a HBase expert, the picture above shows my rough guess: !Screen Shot 
2019-12-19 at 12.16.05.png!

Thus, we think ExponentialBackoffRetry implemented an incorrect retry policy. 
Determining whether to retry only by the type of exception is not reliable. At 
least, we should pull the last modified time to verify the written is 
successful or not. Right?
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to