[jira] [Updated] (SPARK-24492) Endless attempted task when TaskCommitDenied exception writing to S3A
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated SPARK-24492:
-----------------------------------
    Summary: Endless attempted task when TaskCommitDenied exception writing to S3A  (was: Endless attempted task when TaskCommitDenied exception)

> Endless attempted task when TaskCommitDenied exception writing to S3A
> ---------------------------------------------------------------------
>
>                 Key: SPARK-24492
>                 URL: https://issues.apache.org/jira/browse/SPARK-24492
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Yu-Jhe Li
>            Priority: Critical
>         Attachments: retry_stage.png, 螢幕快照 2018-05-16 上午11.10.46.png, 螢幕快照 2018-05-16 上午11.10.57.png
>
> Hi, when we run a Spark application under spark-2.2.0 on AWS spot instances and write output files to S3, some tasks retry endlessly and all of them fail with a TaskCommitDenied exception. This happens when the application runs on instances with network issues (it runs well on healthy spot instances).
> Sorry, I can't find an easy way to reproduce this issue; here is all I can provide.
> The Spark UI (see attachments) shows that one task of stage 112 failed due to FetchFailedException (a network issue) and a new stage 112 (retry 1) was attempted. But in stage 112 (retry 1), all tasks failed due to TaskCommitDenied exceptions and keep retrying (they never succeed, and the retries generate a large number of S3 requests).
> !retry_stage.png!
> !螢幕快照 2018-05-16 上午11.10.57.png!
> !螢幕快照 2018-05-16 上午11.10.46.png!
> On the driver side, the logs show:
> # task 123.0 in stage 112.0 failed due to FetchFailedException (the network issue caused a corrupted file)
> # a warning message from OutputCommitCoordinator
> # task 92.0 in stage 112.1 failed when writing rows
> # the failed tasks keep retrying, but never succeed
> {noformat}
> 2018-05-16 02:38:055 WARN TaskSetManager:66 - Lost task 123.0 in stage 112.0 (TID 42909, 10.47.20.17, executor 64): FetchFailed(BlockManagerId(137, 10.235.164.113, 60758, None), shuffleId=39, mapId=59, reduceId=123, message=
> org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
> 	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> 	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:403)
> 	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> 	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> 	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> 	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> 	at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191)
> 	at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:80)
> 	at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
> 	at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
> 	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> 	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:108)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Stream is corrupted
> 	at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:211)
> 	at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
> 	at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:137)
> 	at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:340)
> 	at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:327)
> 	at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:327)
> {noformat}
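The failure pattern reported above, where every task in the retried stage is denied its commit, is consistent with a coordinator that still holds the commit authorization granted during the failed stage attempt. The following is a minimal, illustrative model of per-partition commit authorization; it is an assumption-laden sketch, not Spark's actual OutputCommitCoordinator code, and the class and method names are hypothetical:

```python
# Illustrative model only: NOT Spark's OutputCommitCoordinator, just a
# sketch of per-partition commit authorization that can produce the
# endless-denial behaviour described in this issue.
class CommitCoordinator:
    def __init__(self):
        # partition -> the first task attempt that was granted the commit
        self.authorized = {}

    def can_commit(self, partition, attempt):
        # Grant the commit to the first requester; after that, only the
        # original holder is allowed to commit this partition.
        holder = self.authorized.setdefault(partition, attempt)
        return holder == attempt

coord = CommitCoordinator()
# A task from the original stage attempt is authorized to commit partition 92:
assert coord.can_commit(partition=92, attempt=0)
# The stage later fails for an unrelated reason (FetchFailedException) and is
# retried, but the authorization is never cleared, so every new task attempt
# for the same partition is denied (Spark surfaces this as TaskCommitDenied):
assert not coord.can_commit(partition=92, attempt=1)
assert not coord.can_commit(partition=92, attempt=2)
```

If the stale authorization were cleared when the stage attempt failed, the retried task would be granted the commit; the issue reports that it is not, so the retried stage can never make progress.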
[jira] [Updated] (SPARK-24492) Endless attempted task when TaskCommitDenied exception
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu-Jhe Li updated SPARK-24492:
------------------------------
    Description: (see the issue description above)
[jira] [Updated] (SPARK-24492) Endless attempted task when TaskCommitDenied exception
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu-Jhe Li updated SPARK-24492:
------------------------------
    Environment: (was: Spark version: spark-2.2.0)
[jira] [Updated] (SPARK-24492) Endless attempted task when TaskCommitDenied exception
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu-Jhe Li updated SPARK-24492:
------------------------------
    Shepherd: (was: Jiang Xingbo)
[jira] [Updated] (SPARK-24492) Endless attempted task when TaskCommitDenied exception
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu-Jhe Li updated SPARK-24492: -- Description: Hi, when we run Spark application under spark-2.2.0 on AWS spot instance and output file to S3, some tasks endless retry and all of them failed with TaskCommitDenied exception. This happened when we run Spark application on some network issue instances. (it runs well on healthy spot instances) Sorry, I can find a easy way to reproduce this issue, here's all I can provide. The Spark UI shows one task of stage 112 failed due to FetchFailedException (it is network issue) and attempt to retry a new stage 112 (retry 1). But in stage 112 (retry 1), all task failed due to TaskCommitDenied exception, and keep retry (it never succeed and cause lots of S3 requests). !retry_stage.png! !螢幕快照 2018-05-16 上午11.10.57.png! !螢幕快照 2018-05-16 上午11.10.46.png! On the other side, driver logs shows: # task 123.0 in stage 112.0 failed due to FetchFailedException (network issue cause corrupted file) # warning message from OutputCommitCoordinator # task 92.0 in stage 112.1 failed when writing rows # keep retry the failed tasks, but never succeed {noformat} 2018-05-16 02:38:055 WARN TaskSetManager:66 - Lost task 123.0 in stage 112.0 (TID 42909, 10.47.20.17, executor 64): FetchFailed(BlockManagerId(137, 10.235.164.113, 60758, None), shuffleId=39, mapId=59, reduceId=123, message= org.apache.spark.shuffle.FetchFailedException: Stream is corrupted at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:403) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Stream is corrupted at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:211) at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125) at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:137) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:340) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:327) at
[jira] [Updated] (SPARK-24492) Endless attempted task when TaskCommitDenied exception
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu-Jhe Li updated SPARK-24492: -- Description: Hi, when we run Spark application under spark-2.2.0 on AWS spot instance and output file to S3, some tasks endless retry and all of them failed with TaskCommitDenied exception. This happened when we run Spark application on some network issue instances. (it runs well on healthy spot instances) Sorry, I can find a easy way to reproduce this issue, here's all I can provide. The Spark UI shows one task of stage 112 failed due to FetchFailedException (it is network issue) and attempt to retry a new stage 112 (retry 1). But in stage 112 (retry 1), all task failed due to TaskCommitDenied exception, and keep retry (it never succeed and cause lots of S3 requests). !retry_stage.png! !螢幕快照 2018-05-16 上午11.10.57.png! !螢幕快照 2018-05-16 上午11.10.46.png! The other side, driver logs shows: # task 123.0 in stage 112.0 failed due to FetchFailedException (network issue cause corrupted file) # warning message from OutputCommitCoordinator # task 92.0 in stage 112.1 failed when writing rows # keep retry the failed tasks, but never succeed {noformat} 2018-05-16 02:38:055 WARN TaskSetManager:66 - Lost task 123.0 in stage 112.0 (TID 42909, 10.47.20.17, executor 64): FetchFailed(BlockManagerId(137, 10.235.164.113, 60758, None), shuffleId=39, mapId=59, reduceId=123, message= org.apache.spark.shuffle.FetchFailedException: Stream is corrupted at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:403) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Stream is corrupted at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:211) at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125) at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:137) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:340) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:327) at or
[jira] [Updated] (SPARK-24492) Endless attempted task when TaskCommitDenied exception
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu-Jhe Li updated SPARK-24492: -- Description: Hi, when we run Spark application under spark-2.2.0 on AWS spot instance and output file to S3, some tasks endless retry and all of them failed with TaskCommitDenied exception. This happened when we run Spark application on some network issue instances. (it runs well on healthy spot instances) Sorry, I can find a easy way to reproduce this issue, here's all I can provide. The Spark UI shows one task of stage 112 failed due to FetchFailedException (it is network issue) and attempt to retry a new stage 112 (retry 1). But in stage 112 (retry 1), all task failed due to TaskCommitDenied exception, and keep retry (it never succeed and cause lots of S3 requests). !retry_stage.png! !螢幕快照 2018-05-16 上午11.10.57.png! !螢幕快照 2018-05-16 上午11.10.46.png! The other side, driver logs shows: # task 123.0 in stage 112.0 failed due to FetchFailedException (network issue cause corrupted file) # warning message from OutputCommitCoordinator # task 92.0 in stage 112.1 failed when writing rows # keep retry the failed tasks, but never succeed {noformat} 2018-05-16 02:38:055 WARN TaskSetManager:66 - Lost task 123.0 in stage 112.0 (TID 42909, 10.47.20.17, executor 64): FetchFailed(BlockManagerId(137, 10.235.164.113, 60758, None), shuffleId=39, mapId=59, reduceId=123, message= org.apache.spark.shuffle.FetchFailedException: Stream is corrupted at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:403) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Stream is corrupted at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:211) at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125) at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:137) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:340) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:327) at org
[jira] [Updated] (SPARK-24492) Endless attempted task when TaskCommitDenied exception
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu-Jhe Li updated SPARK-24492: -- Description: (duplicate of the description and stack trace above)
[jira] [Updated] (SPARK-24492) Endless attempted task when TaskCommitDenied exception
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu-Jhe Li updated SPARK-24492: -- Attachment: retry_stage.png, 螢幕快照 2018-05-16 上午11.10.57.png, 螢幕快照 2018-05-16 上午11.10.46.png
[jira] [Updated] (SPARK-24492) Endless attempted task when TaskCommitDenied exception
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu-Jhe Li updated SPARK-24492: -- Description: (duplicate of the description and stack trace above)
[jira] [Updated] (SPARK-24492) Endless attempted task when TaskCommitDenied exception
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu-Jhe Li updated SPARK-24492: -- Description: (duplicate of the description and stack trace above)
[jira] [Updated] (SPARK-24492) Endless attempted task when TaskCommitDenied exception
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu-Jhe Li updated SPARK-24492: -- Description: (duplicate of the description and stack trace above)