[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yu-Jhe Li updated SPARK-24492:
------------------------------
    Description: 
Hi, when we run a Spark application under spark-2.2.0 on AWS spot instances and write the output files to S3, some tasks retry endlessly and all of them fail with a TaskCommitDenied exception. This happens when the application runs on instances with network issues (it runs well on healthy spot instances). Sorry, I can't find an easy way to reproduce this issue; here is all I can provide.

The Spark UI shows that one task of stage 112 failed due to a FetchFailedException (a network issue) and a new attempt of stage 112 (retry 1) was started. But in stage 112 (retry 1), all tasks fail with a TaskCommitDenied exception and keep being retried (they never succeed, and the retries cause lots of S3 requests).

!retry_stage.png!
!螢幕快照 2018-05-16 上午11.10.57.png!
!螢幕快照 2018-05-16 上午11.10.46.png!

On the other side, the driver logs show:
# task 123.0 in stage 112.0 failed due to a FetchFailedException (the network issue caused a corrupted shuffle file)
# warning messages from OutputCommitCoordinator
# task 92.0 in stage 112.1 failed while writing rows
# the failed tasks keep being retried, but never succeed

{noformat}
2018-05-16 02:38:055 WARN TaskSetManager:66 - Lost task 123.0 in stage 112.0 (TID 42909, 10.47.20.17, executor 64): FetchFailed(BlockManagerId(137, 10.235.164.113, 60758, None), shuffleId=39, mapId=59, reduceId=123, message=
org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:403)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191)
    at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:80)
    at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
    at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Stream is corrupted
    at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:211)
    at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
    at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:137)
    at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:340)
    at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:327)
    at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:327)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
    at org.apache.spark.util.Utils$.copyStream(Utils.scala:348)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:395)
    ... 31 more
Caused by: net.jpountz.lz4.LZ4Exception: Error decoding offset 11718 of input buffer
    at net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:39)
    at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:206)
    ... 39 more
)
2018-05-16 02:39:058 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=10; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:39:059 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=3; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:002 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=4; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:002 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=9; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:004 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=15; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:005 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=12; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:005 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=20; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:005 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=22; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:005 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=25; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:006 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=16; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:006 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=7; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:007 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=0; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:007 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=17; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:008 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=1; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:008 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=13; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:008 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=5; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:008 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=34; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:008 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=6; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:008 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=8; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:009 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=26; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:009 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=19; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:009 WARN OutputCommitCoordinator:66 - Authorizing duplicate request to commit for attemptNumber=0 to commit for stage=112, partition=14; existingCommitter = 0. This can indicate dropped network traffic.
2018-05-16 02:40:026 WARN TaskSetManager:66 - Lost task 92.0 in stage 112.1 (TID 43815, 10.124.39.217, executor 29): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: No such file or directory: s3a://xxx/1526437970755/_temporary/0/_temporary/attempt_20180516023940_0112_m_000092_0
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:993)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:734)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:426)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:539)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:502)
    at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
    at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:76)
    at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:153)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:260)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
    ... 8 more
2018-05-16 02:40:026 WARN JobProgressListener:66 - Task start for unknown stage 112
2018-05-16 02:40:026 WARN TaskSetManager:66 - Lost task 51.0 in stage 112.1 (TID 43774, 10.26.158.82, executor 8): TaskCommitDenied (Driver denied task commit) for job: 112, partition: 51, attemptNumber: 0
2018-05-16 02:40:026 WARN TaskSetManager:66 - Lost task 125.0 in stage 112.1 (TID 43848, 10.124.42.170, executor 84): TaskCommitDenied (Driver denied task commit) for job: 112, partition: 125, attemptNumber: 0
2018-05-16 02:40:026 WARN TaskSetManager:66 - Lost task 122.0 in stage 112.1 (TID 43845, 172.31.18.157, executor 134): TaskCommitDenied (Driver denied task commit) for job: 112, partition: 122, attemptNumber: 0
2018-05-16 02:40:027 WARN TaskSetManager:66 - Lost task 118.0 in stage 112.1 (TID 43841, 10.95.1.104, executor 100): TaskCommitDenied (Driver denied task commit) for job: 112, partition: 118, attemptNumber: 0
2018-05-16 02:40:027 WARN TaskSetManager:66 - Lost task 79.0 in stage 112.1 (TID 43802, 172.31.22.115, executor 94): TaskCommitDenied (Driver denied task commit) for job: 112, partition: 79, attemptNumber: 0
2018-05-16 02:40:027 WARN TaskSetManager:66 - Lost task 58.0 in stage 112.1 (TID 43781, 10.26.158.82, executor 8): TaskCommitDenied (Driver denied task commit) for job: 112, partition: 58, attemptNumber: 0
2018-05-16 02:41:008 WARN TaskSetManager:66 - Lost task 51.1 in stage 112.1 (TID 46941, 172.31.26.185, executor 57): TaskCommitDenied (Driver denied task commit) for job: 112, partition: 51, attemptNumber: 1
2018-05-16 02:41:010 WARN TaskSetManager:66 - Lost task 92.1 in stage 112.1 (TID 46937, 10.92.157.108, executor 153): TaskCommitDenied (Driver denied task commit) for job: 112, partition: 92, attemptNumber: 1
2018-05-16 02:41:010 WARN TaskSetManager:66 - Lost task 125.1 in stage 112.1 (TID 46951, 10.31.223.165, executor 149): TaskCommitDenied (Driver denied task commit) for job: 112, partition: 125, attemptNumber: 1
2018-05-16 02:41:011 WARN TaskSetManager:66 - Lost task 79.1 in stage 112.1 (TID 46975, 10.26.158.82, executor 8): TaskCommitDenied (Driver denied task commit) for job: 112, partition: 79, attemptNumber: 1
2018-05-16 02:41:013 WARN TaskSetManager:66 - Lost task 58.1 in stage 112.1 (TID 46976, 10.29.28.124, executor 118): TaskCommitDenied (Driver denied task commit) for job: 112, partition: 58, attemptNumber: 1
{noformat}
I think we should have a way to avoid these endless attempts.
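From the OutputCommitCoordinator warnings above, my guess (not verified against the Spark source) is that the commit authorization granted to the old attempt of stage 112 is never released after the executor is lost, so every task attempt in stage 112 (retry 1) is denied; and as far as I can tell a TaskCommitDenied failure is not counted toward spark.task.maxFailures, so nothing ever aborts the retry loop. The sketch below is only a simplified, hypothetical model of that driver-side bookkeeping (it is NOT Spark's actual OutputCommitCoordinator code) to illustrate why the retried tasks keep getting denied:

{code:scala}
import scala.collection.mutable

// Hypothetical, simplified model of driver-side commit arbitration.
// It only illustrates the failure mode described above: once a stale task
// attempt holds the commit authorization for a (stage, partition), every
// later attempt is denied and the scheduler keeps retrying it.
object CommitCoordinatorModel {
  // (stageId, partitionId) -> task attempt number authorized to commit
  private val authorized = mutable.Map.empty[(Int, Int), Int]

  def canCommit(stage: Int, partition: Int, attemptNumber: Int): Boolean =
    authorized.get((stage, partition)) match {
      case None =>
        authorized((stage, partition)) = attemptNumber // first requester wins
        true
      case Some(winner) => winner == attemptNumber     // anyone else is denied
    }

  def main(args: Array[String]): Unit = {
    // Stage 112, partition 92: attempt 0 gets the authorization, then the
    // spot instance / network drops before the commit actually completes.
    println(canCommit(112, 92, attemptNumber = 0)) // true
    // Stage 112 is re-run after the FetchFailedException, but the stale
    // authorization is never cleared, so every new attempt is denied
    // (TaskCommitDenied) and is simply scheduled again.
    println(canCommit(112, 92, attemptNumber = 1)) // false
    println(canCommit(112, 92, attemptNumber = 2)) // false, and so on
  }
}
{code}

If that reading is right, the stage can never finish on its own; the deny/retry loop only stops when something else kills the job, which matches the endless retries and the large number of S3 requests we saw.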
> Endless attempted task when TaskCommitDenied exception
> ------------------------------------------------------
>
>                 Key: SPARK-24492
>                 URL: https://issues.apache.org/jira/browse/SPARK-24492
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>         Environment: Spark version: spark-2.2.0
>            Reporter: Yu-Jhe Li
>            Priority: Critical
>         Attachments: retry_stage.png, 螢幕快照 2018-05-16 上午11.10.46.png, 螢幕快照 2018-05-16 上午11.10.57.png
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org