Waleed Fateem created KUDU-3099:
-----------------------------------

             Summary: KuduBackup/KuduRestore System.exit(0) results in Spark on 
YARN failure with exitCode: 16
                 Key: KUDU-3099
                 URL: https://issues.apache.org/jira/browse/KUDU-3099
             Project: Kudu
          Issue Type: Bug
          Components: backup, spark
    Affects Versions: 1.11.0, 1.10.0
            Reporter: Waleed Fateem


When KuduBackup/KuduRestore runs on YARN, the underlying Spark application can 
be reported as failed even when the backup/restore tasks complete successfully. 
The following is from the Spark driver log:


{code:java}
INFO spark.SparkContext: Submitted application: Kudu Table Backup
..
INFO spark.SparkContext: Starting job: save at KuduBackup.scala:90
INFO scheduler.DAGScheduler: Got job 0 (save at KuduBackup.scala:90) with 200 
output partitions
scheduler.DAGScheduler: Final stage: ResultStage 0 (save at KuduBackup.scala:90)
..
INFO scheduler.DAGScheduler: Submitting 200 missing tasks from ResultStage 0 
(MapPartitionsRDD[2] at save at KuduBackup.scala:90) (first 15 tasks are for 
partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
INFO cluster.YarnClusterScheduler: Adding task set 0.0 with 200 tasks
..

INFO cluster.YarnClusterScheduler: Removed TaskSet 0.0, whose tasks have all 
completed, from pool 
INFO scheduler.DAGScheduler: Job 0 finished: save at KuduBackup.scala:90, took 
20.007488 s
..
INFO spark.SparkContext: Invoking stop() from shutdown hook
..
INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
..
INFO spark.SparkContext: Successfully stopped SparkContext
INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: 
Shutdown hook called before final status was reported.)
INFO util.ShutdownHookManager: Shutdown hook called
{code}
Spark explicitly adds this shutdown hook to catch System.exit() calls. If the 
hook runs before the final application status has been reported, the 
application is considered a failure:
[https://github.com/apache/spark/blob/branch-2.3/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L299]
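
To illustrate the behaviour described above, here is a minimal, simplified 
model of a "final status" shutdown hook. It is a sketch, not Spark's actual 
code: the class name, method names, and status strings are hypothetical, and 
only the exit code 16 and the FAILED status are taken from the driver log.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Simplified model of a "final status" shutdown hook; not Spark's actual code. */
public class EarlyExitHookSketch {
    // Exit code reported in the driver log for the early-shutdown case.
    public static final int EXIT_EARLY = 16;
    public static final int EXIT_SUCCESS = 0;

    // Becomes true once a final status has been recorded.
    private final AtomicBoolean finished = new AtomicBoolean(false);
    private volatile String finalStatus = "UNDEFINED";
    private volatile int exitCode = EXIT_SUCCESS;

    /** Models the normal path: the user class returned from main(). */
    public void reportSuccess() {
        if (finished.compareAndSet(false, true)) {
            finalStatus = "SUCCEEDED";
            exitCode = EXIT_SUCCESS;
        }
    }

    /** Models the hook body, which runs on any JVM exit, including System.exit(0). */
    public void onShutdown() {
        // Only the first reporter wins; a late hook does not overwrite a real status.
        if (finished.compareAndSet(false, true)) {
            finalStatus = "FAILED";
            exitCode = EXIT_EARLY;
        }
    }

    public String finalStatus() { return finalStatus; }

    public int exitCode() { return exitCode; }

    public static void main(String[] args) {
        // Case 1: status is reported first, then the JVM shuts down.
        EarlyExitHookSketch normal = new EarlyExitHookSketch();
        normal.reportSuccess();
        normal.onShutdown();
        System.out.println(normal.finalStatus() + " " + normal.exitCode());

        // Case 2: the hook fires before any status was reported.
        EarlyExitHookSketch early = new EarlyExitHookSketch();
        early.onShutdown();
        System.out.println(early.finalStatus() + " " + early.exitCode());
    }
}
```

In this model the second case ends as FAILED with exit code 16 even though no 
work failed, which matches the pattern seen in the log.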

The System.exit() call added as part of KUDU-2787 can trigger this race 
condition. That change was merged into the 1.10.x and 1.11.x branches.
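
One possible way to avoid the race, shown only as a sketch and not necessarily 
the fix adopted for this issue, is to let main() return normally on success and 
signal failure with an exception instead of calling System.exit(). The class 
name, the run() helper, and the --fail flag below are hypothetical.

```java
/** Sketch of an entry point that never calls System.exit(); names are hypothetical. */
public class BackupMainSketch {

    /** Hypothetical stand-in for the backup/restore job; returns 0 on success. */
    public static int run(String[] args) {
        boolean simulateFailure = args.length > 0 && args[0].equals("--fail");
        return simulateFailure ? 1 : 0;
    }

    public static void main(String[] args) {
        int status = run(args);
        if (status != 0) {
            // Surface the failure as an exception rather than forcing a JVM exit.
            throw new RuntimeException("Job failed with status " + status);
        }
        // On success, simply return so the caller can report the final status
        // before any shutdown hook runs.
    }
}
```

With this shape, a successful run leaves JVM shutdown to the framework that 
invoked main(), so no shutdown hook is triggered by the job itself.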

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
