ZHANGHONGJIA created KYLIN-5008:
-----------------------------------

             Summary: backend Spark job failed, but the corresponding job status is shown as finished in the WebUI
                 Key: KYLIN-5008
                 URL: https://issues.apache.org/jira/browse/KYLIN-5008
             Project: Kylin
          Issue Type: Bug
    Affects Versions: v4.0.0-beta
            Reporter: ZHANGHONGJIA
         Attachments: image-2021-06-10-16-46-35-919.png, merge-job.log

According to the log below, the Spark job failed because its container was killed by YARN for exceeding memory limits, yet in the Kylin WebUI the status of the merge job is shown as finished. In addition, the amount of data in the merged segment is about three times the actual amount of data. It seems that Kylin did not detect the failure of this merge job.
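As the YARN message in the log suggests, the immediate workaround for the OOM kill itself is to raise the executor memory overhead. In Kylin 4 the Spark settings are normally passed through the `kylin.engine.spark-conf.` prefix in `kylin.properties`; the value below is illustrative and should be sized to the cluster:

```properties
# Illustrative value, size to your cluster. Passed through to Spark as
# spark.executor.memoryOverhead (the log quotes the older key name
# spark.yarn.executor.memoryOverhead).
kylin.engine.spark-conf.spark.executor.memoryOverhead=8192m
```

This only avoids the OOM kill; the reported bug (a failed job shown as finished) is independent of the memory sizing.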

 

Here is the merge job log :

===============================================================
 at org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate$1.call(BuildLayoutWithUpdate.java:43)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 ... 3 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 244 in stage 1108.0 failed 4 times, most recent failure: Lost task 244.3 in stage 1108.0 (TID 78736, r4200h1-app.travelsky.com, executor 109): ExecutorLostFailure (executor 109 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 39.0 GB of 36 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
 at scala.Option.foreach(Option.scala:257)
 at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
 at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
 ... 34 more

}
RetryInfo{
 overrideConf : {spark.executor.memory=36618MB, spark.executor.memoryOverhead=7323MB},
 throwable : java.lang.RuntimeException: Error execute org.apache.kylin.engine.spark.job.CubeMergeJob
 at org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:92)
 at org.apache.spark.application.JobWorker$$anon$2.run(JobWorker.scala:55)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted.
 at org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate.updateLayout(BuildLayoutWithUpdate.java:70)
 at org.apache.kylin.engine.spark.job.CubeMergeJob.mergeSegments(CubeMergeJob.java:122)
 at org.apache.kylin.engine.spark.job.CubeMergeJob.doExecute(CubeMergeJob.java:82)
 at org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:298)
 at org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:89)
 ... 4 more
Caused by: org.apache.spark.SparkException: Job aborted.
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
 at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
 at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
 at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
 at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
 at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
 at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
 at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
 at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
 at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
 at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
 at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
 at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
 at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
 at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
 at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
 at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:677)
 at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:286)
 at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:272)
 at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:230)
 at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:567)
 at org.apache.kylin.engine.spark.storage.ParquetStorage.saveTo(ParquetStorage.scala:28)
 at org.apache.kylin.engine.spark.job.CubeMergeJob.saveAndUpdateCuboid(CubeMergeJob.java:171)
 at org.apache.kylin.engine.spark.job.CubeMergeJob.access$000(CubeMergeJob.java:59)
 at org.apache.kylin.engine.spark.job.CubeMergeJob$1.build(CubeMergeJob.java:118)
 at org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate$1.call(BuildLayoutWithUpdate.java:51)
 at org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate$1.call(BuildLayoutWithUpdate.java:43)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 ... 3 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 428 in stage 360.0 failed 4 times, most recent failure: Lost task 428.3 in stage 360.0 (TID 26130, umetrip40-hdp2.6-140.travelsky.com, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 48.4 GB of 46 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
 at scala.Option.foreach(Option.scala:257)
 at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
 at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
 ... 34 more

}
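The RetryInfo block above shows that the job was retried with larger memory settings and failed again, so for the job to end up "finished" the final failure must be getting lost somewhere between the retry wrapper and the job status tracker. A minimal, hypothetical sketch of that failure mode (class and method names here are illustrative, not Kylin's actual code):

```java
// Hypothetical sketch: a retry wrapper whose final status must reflect
// the last failure. These names are illustrative, not Kylin's code.
public class RetrySketch {

    // Simulates a Spark stage that is killed by YARN on every attempt.
    static void failingSparkJob(int attempt) throws Exception {
        throw new Exception("Container killed by YARN (attempt " + attempt + ")");
    }

    // If this method returned "FINISHED" after exhausting retries, the
    // driver would exit 0 and the WebUI would show the job as finished
    // even though every attempt failed - the symptom reported here.
    public static String runWithRetries(int maxRetries) {
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                failingSparkJob(attempt);
                return "FINISHED";
            } catch (Exception e) {
                last = e; // remember the failure and retry
            }
        }
        return "ERROR: " + last.getMessage(); // surface the final failure
    }

    public static void main(String[] args) {
        // prints "ERROR: Container killed by YARN (attempt 1)"
        System.out.println(runWithRetries(1));
    }
}
```

Whether the real defect is in the retry wrapper or in how the step status is persisted would need confirmation from the Kylin code; the log alone only shows that the failure did not reach the job status.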

 

The WebUI monitor:

!image-2021-06-10-16-46-35-919.png!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
