[ https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
xuqing updated SPARK-18113:
---------------------------
Description:
The executor sends *AskPermissionToCommitOutput* to the driver, and the driver receives and handles this message, but the executor does not get a response within *spark.rpc.askTimeout*. The TaskAttemptNumber for this partition in authorizedCommittersByStage stays locked forever, and the driver enters an infinite retry loop.

h4. Driver Log:

16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
...
16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, partition: 24, attemptNumber: 0
...
16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, stage: 2, partition: 24, attempt: 0
...
16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 28732, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
...
16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 28732, cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, partition: 24, attemptNumber: 1
16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, stage: 2, partition: 24, attempt: 1
...
16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
...

h4. Executor Log:

...
16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
...
16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = AskPermissionToCommitOutput(2,24,0)] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.rpc.askTimeout
	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
	at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
	at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
	at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
	at org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
	at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
	at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:86)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(Thread.java:785)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
	at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
	at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
	at scala.concurrent.Await$.result(package.scala:190)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
	... 13 more
...
16/10/25 05:39:16 INFO Executor: Running task 24.1 in stage 2.0 (TID 119)
...
16/10/25 05:39:24 INFO SparkHadoopMapRedUtil: attempt_201610250536_0002_m_000024_119: Not committed because the driver did not authorize commit
...

was:
Executor sends *AskPermissionToCommitOutput* to driver, and driver receive this message and handle it. But executor does not get response in *spark.rpc.askTimeout*. The TaskAttemptNumber for this partition in authorizedCommittersByStage is locked forever. The driver enters into infinite loop.
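The lock-up described above can be illustrated with a minimal, hypothetical model in plain Java. This is not Spark's actual OutputCommitCoordinator code; it only sketches the reported behavior: the driver records the first asker as the authorized committer for the partition, the reply to that ask is lost, the attempt fails with TaskCommitDenied, and because a denial never frees the slot, every later attempt (1, 2, ... 28654, ...) is also denied.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the commit-authorization state that wedges:
// once a partition's slot is taken, it is never released, so the
// stage retries the task forever.
public class CommitCoordinatorModel {
    // partition -> TaskAttemptNumber that holds the commit lock
    private final Map<Integer, Integer> authorizedCommitters = new HashMap<>();

    // Driver-side handling of AskPermissionToCommitOutput (simplified).
    public boolean canCommit(int partition, int attemptNumber) {
        if (!authorizedCommitters.containsKey(partition)) {
            authorizedCommitters.put(partition, attemptNumber);
            return true; // first asker is authorized
        }
        // Slot already held. In the reported scenario nothing ever
        // clears it, because the holder's reply was lost and its
        // failure (TaskCommitDenied) does not release the lock.
        return false;
    }

    public static void main(String[] args) {
        CommitCoordinatorModel coord = new CommitCoordinatorModel();

        // Attempt 0 asks; the driver authorizes it, but the reply is
        // lost, so the executor times out and the task fails.
        System.out.println("attempt 0: " + coord.canCommit(24, 0));

        // Every retry for partition 24 is now denied, forever:
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + ": "
                    + coord.canCommit(24, attempt));
        }
    }
}
```

Under this model the driver never makes progress on partition 24: each retry is denied, the TaskSetManager schedules another attempt, and the loop continues indefinitely, matching the `24.28654` attempt number in the driver log.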
> AskPermissionToCommitOutput response lost, driver enters into task dead loop
> --------------------------------------------------------------------------
>
>                 Key: SPARK-18113
>                 URL: https://issues.apache.org/jira/browse/SPARK-18113
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.0.1
>         Environment: # cat /etc/redhat-release
>                      Red Hat Enterprise Linux Server release 7.2 (Maipo)
>            Reporter: xuqing
>            Priority: Minor
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)