[ 
https://issues.apache.org/jira/browse/KYLIN-4263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Geng updated KYLIN-4263:
------------------------------
    Description: 
    Recently, I notice there are several jobs stuck on running status at step 
3: "Extract Fact Table Distinct Columns".
 After deep dive the kylin.log, I found that the RetriesExhaustedException was 
raised when interactive with HBase.
 When the sub-task - MapReduceExecutable.doWork is invoked, the method 
"isDiscarded" (org/apache/kylin/engine/mr/common/MapReduceExecutable.java:151) 
will check the status of the task which will trigger RetriesExhaustedException. 
{code:java}
2019-11-15 03:03:00,827 ERROR [Scheduler 1760924857 Job 
758cd282-4b8a-8c07-2a37-d42186b74fba-174] dao.ExecutableDao:400 : error get job 
output id:758cd282-4b8a-8c07-2a37-d42186b74fba-02
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=1, exceptions:
Fri Nov 15 03:03:00 GMT+08:00 2019, 
RpcRetryingCaller{globalStartTime=1573758175808, pause=100, retries=1}, 
java.io.IOException: Call to HOSTNAME failed on local exception: 
org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=143230, 
waitTime=5001, operationTimeout=5000 expired.
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:157)
 at org.apache.hadoop.hbase.client.HTable.get(HTable.java:865)
 at org.apache.hadoop.hbase.client.HTable.get(HTable.java:831)
 at 
org.apache.kylin.storage.hbase.HBaseResourceStore.internalGetFromHTable(HBaseResourceStore.java:472)
 at 
org.apache.kylin.storage.hbase.HBaseResourceStore.getFromHTable(HBaseResourceStore.java:450)
 at 
org.apache.kylin.storage.hbase.HBaseResourceStore.getResourceImpl(HBaseResourceStore.java:273)
 at 
org.apache.kylin.common.persistence.ResourceStore.lambda$getResourceWithRetry$0(ResourceStore.java:289)
 at 
org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
 at 
org.apache.kylin.common.persistence.ResourceStore.getResourceWithRetry(ResourceStore.java:289)
 at 
org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:266)
 at 
org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:257)
 at 
org.apache.kylin.job.dao.ExecutableDao.readJobOutputResource(ExecutableDao.java:248)
 at org.apache.kylin.job.dao.ExecutableDao.getJobOutput(ExecutableDao.java:392)
 at 
org.apache.kylin.job.execution.ExecutableManager.getOutput(ExecutableManager.java:169)
 at 
org.apache.kylin.job.execution.AbstractExecutable.getOutput(AbstractExecutable.java:373)
 at 
org.apache.kylin.job.execution.AbstractExecutable.isDiscarded(AbstractExecutable.java:499)
 at 
org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:151)
 at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
 at 
org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
 at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
 at 
org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
{code}
     Ideally, the exception will be caught by "catch (Exception e) 
(org/apache/kylin/engine/mr/common/MapReduceExecutable.java:197)". Indeed, the 
exception was caught, but the "logger.error("error execute " + this.toString(), 
e)org/apache/kylin/engine/mr/common/MapReduceExecutable.java:198" raises a new 
exception.
 Let's take a look inside the "toString" method. 
{code:java}
public String toString() {
   return Objects.toStringHelper(this).add("id", getId()).add("name", 
getName()).add("state", getStatus()).toString();
}{code}
    The method "getStatus" is invoked which will cause RuntimeException again 
(cause HBase timeout again). And then, the "RuntimeException" is caught by 
"catch (Exception e) 
org/apache/kylin/job/execution/AbstractExecutable.java:189" and raises a new 
exception - "ExecuteException". Let's OMIT one step AbstractExecutable.execute, 
it's too hard to describe cause the call stack is too deep. The 
"org.apache.kylin.job.execution.AbstractExecutable#execute" is invoked by 
"org.apache.kylin.job.execution.DefaultChainedExecutable#doWork", but there's 
no exception handling inside the "doWork" method.
{code:java}
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:157)
 at org.apache.hadoop.hbase.client.HTable.get(HTable.java:865)
 at org.apache.hadoop.hbase.client.HTable.get(HTable.java:831)
 at 
org.apache.kylin.storage.hbase.HBaseResourceStore.internalGetFromHTable(HBaseResourceStore.java:472)
 at 
org.apache.kylin.storage.hbase.HBaseResourceStore.getFromHTable(HBaseResourceStore.java:450)
 at 
org.apache.kylin.storage.hbase.HBaseResourceStore.getResourceImpl(HBaseResourceStore.java:273)
 at 
org.apache.kylin.common.persistence.ResourceStore.lambda$getResourceWithRetry$0(ResourceStore.java:289)
 at 
org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
 at 
org.apache.kylin.common.persistence.ResourceStore.getResourceWithRetry(ResourceStore.java:289)
 at 
org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:266)
 at 
org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:257)
 at 
org.apache.kylin.job.dao.ExecutableDao.readJobOutputResource(ExecutableDao.java:248)
 at org.apache.kylin.job.dao.ExecutableDao.getJobOutput(ExecutableDao.java:392)
 at 
org.apache.kylin.job.execution.ExecutableManager.getOutput(ExecutableManager.java:169)
 at 
org.apache.kylin.job.execution.AbstractExecutable.getStatus(AbstractExecutable.java:267)
 at 
org.apache.kylin.job.execution.AbstractExecutable.toString(AbstractExecutable.java:527)
 at 
org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:198)
 at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
 at 
org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
 at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
 at 
org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745){code}
    Finally, the "ExecuteException" is passed to 
"org.apache.kylin.job.impl.threadpool.DefaultScheduler.JobRunner#run". The 
"ExecuteException" is caught by "catch (ExecuteException e)
 " and prints out error log.
{code:java}
2019-11-15 03:03:04,667 ERROR [pool-9-thread-10] 
threadpool.DefaultScheduler:116 : ExecuteException 
job:52d3840d-02c7-f5cc-cb34-43ce3015d191
org.apache.kylin.job.exception.ExecuteException: java.lang.RuntimeException: 
org.apache.kylin.job.exception.PersistentException: 
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=1, exceptions:
Fri Nov 15 03:03:04 GMT+08:00 2019, 
RpcRetryingCaller{globalStartTime=1573758184665, pause=100, retries=1}, 
org.apache.hadoop.hbase.CallQueueTooBigException: Call queue is full on 
/0:0:0:0:0:0:0:0:16020, too many items queued ?
at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:189)
 at 
org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
{code}
     And then the "MapReduceExecutable" sub-task is removed by 
"context.removeRunningJob(executable)". However the task is removed, BUT its 
status still RUNNING. In this situation, the status of the task is inconsistent.
{code:java}
2019-11-15 03:03:04,669 INFO [FetcherRunner 2113735343-72] 
threadpool.DefaultFetcherRunner:85 : Job Fetcher: 3 should running, 3 actual 
running, 0 stopped, 0 ready, 367 already succeed, 9 error, 5 discarded, 1 
others{code}
     I think the is a BUG, caused by inapproprate exception handling. For a 
quick fix, we should add a try-catch statement in the toString method. Or just 
kill this kind of tasks?

  was:
Recently, I notice there are several jobs stuck on running status at step 3: 
"Extract Fact Table Distinct Columns".
After deep dive the kylin.log, I found that the RetriesExhaustedException was 
raised when interactive with HBase.
When the sub-task - MapReduceExecutable.doWork is invoked, the method 
"isDiscarded" (org/apache/kylin/engine/mr/common/MapReduceExecutable.java:151) 
will check the status of the task which will trigger RetriesExhaustedException.

 
{code:java}
2019-11-15 03:03:00,827 ERROR [Scheduler 1760924857 Job 
758cd282-4b8a-8c07-2a37-d42186b74fba-174] dao.ExecutableDao:400 : error get job 
output id:758cd282-4b8a-8c07-2a37-d42186b74fba-02
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=1, exceptions:
Fri Nov 15 03:03:00 GMT+08:00 2019, 
RpcRetryingCaller{globalStartTime=1573758175808, pause=100, retries=1}, 
java.io.IOException: Call to HOSTNAME failed on local exception: 
org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=143230, 
waitTime=5001, operationTimeout=5000 expired.
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:157)
 at org.apache.hadoop.hbase.client.HTable.get(HTable.java:865)
 at org.apache.hadoop.hbase.client.HTable.get(HTable.java:831)
 at 
org.apache.kylin.storage.hbase.HBaseResourceStore.internalGetFromHTable(HBaseResourceStore.java:472)
 at 
org.apache.kylin.storage.hbase.HBaseResourceStore.getFromHTable(HBaseResourceStore.java:450)
 at 
org.apache.kylin.storage.hbase.HBaseResourceStore.getResourceImpl(HBaseResourceStore.java:273)
 at 
org.apache.kylin.common.persistence.ResourceStore.lambda$getResourceWithRetry$0(ResourceStore.java:289)
 at 
org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
 at 
org.apache.kylin.common.persistence.ResourceStore.getResourceWithRetry(ResourceStore.java:289)
 at 
org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:266)
 at 
org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:257)
 at 
org.apache.kylin.job.dao.ExecutableDao.readJobOutputResource(ExecutableDao.java:248)
 at org.apache.kylin.job.dao.ExecutableDao.getJobOutput(ExecutableDao.java:392)
 at 
org.apache.kylin.job.execution.ExecutableManager.getOutput(ExecutableManager.java:169)
 at 
org.apache.kylin.job.execution.AbstractExecutable.getOutput(AbstractExecutable.java:373)
 at 
org.apache.kylin.job.execution.AbstractExecutable.isDiscarded(AbstractExecutable.java:499)
 at 
org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:151)
 at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
 at 
org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
 at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
 at 
org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
{code}
 


Ideally, the exception will be caught by "catch (Exception e) 
(org/apache/kylin/engine/mr/common/MapReduceExecutable.java:197)". Indeed, the 
exception was caught, but the "logger.error("error execute " + this.toString(), 
e)org/apache/kylin/engine/mr/common/MapReduceExecutable.java:198" raises a new 
exception.
Let's take a look inside the "toString" method.

 

 
{code:java}
public String toString() {
   return Objects.toStringHelper(this).add("id", getId()).add("name", 
getName()).add("state", getStatus()).toString();
}{code}
 

 

The method "getStatus" is invoked which will cause RuntimeException again 
(cause HBase timeout again). And then, the "RuntimeException" is caught by 
"catch (Exception e) 
org/apache/kylin/job/execution/AbstractExecutable.java:189" and raises a new 
exception - "ExecuteException". Let's OMIT one step AbstractExecutable.execute, 
it's too hard to describe cause the call stack is too deep. The 
"org.apache.kylin.job.execution.AbstractExecutable#execute" is invoked by 
"org.apache.kylin.job.execution.DefaultChainedExecutable#doWork", but there's 
no exception handling inside the "doWork" method.

 
{code:java}
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:157)
 at org.apache.hadoop.hbase.client.HTable.get(HTable.java:865)
 at org.apache.hadoop.hbase.client.HTable.get(HTable.java:831)
 at 
org.apache.kylin.storage.hbase.HBaseResourceStore.internalGetFromHTable(HBaseResourceStore.java:472)
 at 
org.apache.kylin.storage.hbase.HBaseResourceStore.getFromHTable(HBaseResourceStore.java:450)
 at 
org.apache.kylin.storage.hbase.HBaseResourceStore.getResourceImpl(HBaseResourceStore.java:273)
 at 
org.apache.kylin.common.persistence.ResourceStore.lambda$getResourceWithRetry$0(ResourceStore.java:289)
 at 
org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
 at 
org.apache.kylin.common.persistence.ResourceStore.getResourceWithRetry(ResourceStore.java:289)
 at 
org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:266)
 at 
org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:257)
 at 
org.apache.kylin.job.dao.ExecutableDao.readJobOutputResource(ExecutableDao.java:248)
 at org.apache.kylin.job.dao.ExecutableDao.getJobOutput(ExecutableDao.java:392)
 at 
org.apache.kylin.job.execution.ExecutableManager.getOutput(ExecutableManager.java:169)
 at 
org.apache.kylin.job.execution.AbstractExecutable.getStatus(AbstractExecutable.java:267)
 at 
org.apache.kylin.job.execution.AbstractExecutable.toString(AbstractExecutable.java:527)
 at 
org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:198)
 at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
 at 
org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
 at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
 at 
org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745){code}
 

Finally, the "ExecuteException" is passed to 
"org.apache.kylin.job.impl.threadpool.DefaultScheduler.JobRunner#run". The 
"ExecuteException" is caught by "catch (ExecuteException e)
" and prints out error log.

 
{code:java}
2019-11-15 03:03:04,667 ERROR [pool-9-thread-10] 
threadpool.DefaultScheduler:116 : ExecuteException 
job:52d3840d-02c7-f5cc-cb34-43ce3015d191
org.apache.kylin.job.exception.ExecuteException: java.lang.RuntimeException: 
org.apache.kylin.job.exception.PersistentException: 
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=1, exceptions:
Fri Nov 15 03:03:04 GMT+08:00 2019, 
RpcRetryingCaller{globalStartTime=1573758184665, pause=100, retries=1}, 
org.apache.hadoop.hbase.CallQueueTooBigException: Call queue is full on 
/0:0:0:0:0:0:0:0:16020, too many items queued ?
at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:189)
 at 
org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
{code}
 

And then the "MapReduceExecutable" sub-task is removed by 
"context.removeRunningJob(executable)". However the task is removed, BUT its 
status still RUNNING. In this situation, the status of the task is inconsistent.

 
{code:java}
2019-11-15 03:03:04,669 INFO [FetcherRunner 2113735343-72] 
threadpool.DefaultFetcherRunner:85 : Job Fetcher: 3 should running, 3 actual 
running, 0 stopped, 0 ready, 367 already succeed, 9 error, 5 discarded, 1 
others{code}
 

I think the is a BUG, caused by inapproprate exception handling. For a quick 
fix, we should add a try-catch statement in the toString method. Or just kill 
this kind of tasks?


> Inapproprate exception handling causes job stuck on running status
> ------------------------------------------------------------------
>
>                 Key: KYLIN-4263
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4263
>             Project: Kylin
>          Issue Type: Bug
>          Components: Job Engine
>    Affects Versions: v2.6.4
>         Environment: Kylin-2.6.4-release, HBase-1.1.5
>            Reporter: Liang Geng
>            Priority: Major
>
>     Recently, I notice there are several jobs stuck on running status at step 
> 3: "Extract Fact Table Distinct Columns".
>  After deep dive the kylin.log, I found that the RetriesExhaustedException 
> was raised when interactive with HBase.
>  When the sub-task - MapReduceExecutable.doWork is invoked, the method 
> "isDiscarded" 
> (org/apache/kylin/engine/mr/common/MapReduceExecutable.java:151) will check 
> the status of the task which will trigger RetriesExhaustedException. 
> {code:java}
> 2019-11-15 03:03:00,827 ERROR [Scheduler 1760924857 Job 
> 758cd282-4b8a-8c07-2a37-d42186b74fba-174] dao.ExecutableDao:400 : error get 
> job output id:758cd282-4b8a-8c07-2a37-d42186b74fba-02
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=1, exceptions:
> Fri Nov 15 03:03:00 GMT+08:00 2019, 
> RpcRetryingCaller{globalStartTime=1573758175808, pause=100, retries=1}, 
> java.io.IOException: Call to HOSTNAME failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=143230, 
> waitTime=5001, operationTimeout=5000 expired.
> at 
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:157)
>  at org.apache.hadoop.hbase.client.HTable.get(HTable.java:865)
>  at org.apache.hadoop.hbase.client.HTable.get(HTable.java:831)
>  at 
> org.apache.kylin.storage.hbase.HBaseResourceStore.internalGetFromHTable(HBaseResourceStore.java:472)
>  at 
> org.apache.kylin.storage.hbase.HBaseResourceStore.getFromHTable(HBaseResourceStore.java:450)
>  at 
> org.apache.kylin.storage.hbase.HBaseResourceStore.getResourceImpl(HBaseResourceStore.java:273)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.lambda$getResourceWithRetry$0(ResourceStore.java:289)
>  at 
> org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.getResourceWithRetry(ResourceStore.java:289)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:266)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:257)
>  at 
> org.apache.kylin.job.dao.ExecutableDao.readJobOutputResource(ExecutableDao.java:248)
>  at 
> org.apache.kylin.job.dao.ExecutableDao.getJobOutput(ExecutableDao.java:392)
>  at 
> org.apache.kylin.job.execution.ExecutableManager.getOutput(ExecutableManager.java:169)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.getOutput(AbstractExecutable.java:373)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.isDiscarded(AbstractExecutable.java:499)
>  at 
> org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:151)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>  at 
> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>  at 
> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {code}
>      Ideally, the exception will be caught by "catch (Exception e) 
> (org/apache/kylin/engine/mr/common/MapReduceExecutable.java:197)". Indeed, 
> the exception was caught, but the "logger.error("error execute " + 
> this.toString(), 
> e)org/apache/kylin/engine/mr/common/MapReduceExecutable.java:198" raises a 
> new exception.
>  Let's take a look inside the "toString" method. 
> {code:java}
> public String toString() {
>    return Objects.toStringHelper(this).add("id", getId()).add("name", 
> getName()).add("state", getStatus()).toString();
> }{code}
>     The method "getStatus" is invoked which will cause RuntimeException again 
> (cause HBase timeout again). And then, the "RuntimeException" is caught by 
> "catch (Exception e) 
> org/apache/kylin/job/execution/AbstractExecutable.java:189" and raises a new 
> exception - "ExecuteException". Let's OMIT one step 
> AbstractExecutable.execute, it's too hard to describe cause the call stack is 
> too deep. The "org.apache.kylin.job.execution.AbstractExecutable#execute" is 
> invoked by "org.apache.kylin.job.execution.DefaultChainedExecutable#doWork", 
> but there's no exception handling inside the "doWork" method.
> {code:java}
> at 
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:157)
>  at org.apache.hadoop.hbase.client.HTable.get(HTable.java:865)
>  at org.apache.hadoop.hbase.client.HTable.get(HTable.java:831)
>  at 
> org.apache.kylin.storage.hbase.HBaseResourceStore.internalGetFromHTable(HBaseResourceStore.java:472)
>  at 
> org.apache.kylin.storage.hbase.HBaseResourceStore.getFromHTable(HBaseResourceStore.java:450)
>  at 
> org.apache.kylin.storage.hbase.HBaseResourceStore.getResourceImpl(HBaseResourceStore.java:273)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.lambda$getResourceWithRetry$0(ResourceStore.java:289)
>  at 
> org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.getResourceWithRetry(ResourceStore.java:289)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:266)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:257)
>  at 
> org.apache.kylin.job.dao.ExecutableDao.readJobOutputResource(ExecutableDao.java:248)
>  at 
> org.apache.kylin.job.dao.ExecutableDao.getJobOutput(ExecutableDao.java:392)
>  at 
> org.apache.kylin.job.execution.ExecutableManager.getOutput(ExecutableManager.java:169)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.getStatus(AbstractExecutable.java:267)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.toString(AbstractExecutable.java:527)
>  at 
> org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:198)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>  at 
> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>  at 
> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745){code}
>     Finally, the "ExecuteException" is passed to 
> "org.apache.kylin.job.impl.threadpool.DefaultScheduler.JobRunner#run". The 
> "ExecuteException" is caught by "catch (ExecuteException e)
>  " and prints out error log.
> {code:java}
> 2019-11-15 03:03:04,667 ERROR [pool-9-thread-10] 
> threadpool.DefaultScheduler:116 : ExecuteException 
> job:52d3840d-02c7-f5cc-cb34-43ce3015d191
> org.apache.kylin.job.exception.ExecuteException: java.lang.RuntimeException: 
> org.apache.kylin.job.exception.PersistentException: 
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=1, exceptions:
> Fri Nov 15 03:03:04 GMT+08:00 2019, 
> RpcRetryingCaller{globalStartTime=1573758184665, pause=100, retries=1}, 
> org.apache.hadoop.hbase.CallQueueTooBigException: Call queue is full on 
> /0:0:0:0:0:0:0:0:16020, too many items queued ?
> at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:189)
>  at 
> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {code}
>      And then the "MapReduceExecutable" sub-task is removed by 
> "context.removeRunningJob(executable)". However the task is removed, BUT its 
> status still RUNNING. In this situation, the status of the task is 
> inconsistent.
> {code:java}
> 2019-11-15 03:03:04,669 INFO [FetcherRunner 2113735343-72] 
> threadpool.DefaultFetcherRunner:85 : Job Fetcher: 3 should running, 3 actual 
> running, 0 stopped, 0 ready, 367 already succeed, 9 error, 5 discarded, 1 
> others{code}
>      I think the is a BUG, caused by inapproprate exception handling. For a 
> quick fix, we should add a try-catch statement in the toString method. Or 
> just kill this kind of tasks?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to