[ 
https://issues.apache.org/jira/browse/SPARK-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102960#comment-14102960
 ] 

Josh Rosen commented on SPARK-3139:
-----------------------------------

I used pssh + grep to search through the application logs on the workers and I 
couldn't find any ERRORs or Exceptions (I'm sure that I was searching the right 
log directories, since other searches return matches).

> Akka timeouts from ContextCleaner when cleaning shuffles
> --------------------------------------------------------
>
>                 Key: SPARK-3139
>                 URL: https://issues.apache.org/jira/browse/SPARK-3139
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>         Environment: 10 r3.2xlarge tests on EC2, running the 
> scala-agg-by-key-int spark-perf test against master commit 
> d7e80c2597d4a9cae2e0cb35a86f7889323f4cbb.
>            Reporter: Josh Rosen
>            Priority: Blocker
>
> When running spark-perf tests on EC2, I have a job that's consistently 
> logging the following Akka exceptions:
> {code}
> 14/08/19 22:07:12 ERROR spark.ContextCleaner: Error cleaning shuffle 0
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.storage.BlockManagerMaster.removeShuffle(BlockManagerMaster.scala:118)
>   at org.apache.spark.ContextCleaner.doCleanupShuffle(ContextCleaner.scala:159)
>   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:131)
>   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:124)
>   at scala.Option.foreach(Option.scala:236)
>   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:124)
>   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:120)
>   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:120)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1252)
>   at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:119)
>   at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65)
> {code}
> and
> {code}
> 14/08/19 22:07:12 ERROR storage.BlockManagerMaster: Failed to remove shuffle 0
> akka.pattern.AskTimeoutException: Timed out
>   at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
>   at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118)
>   at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
>   at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
>   at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:455)
>   at akka.actor.LightArrayRevolverScheduler$$anon$12.executeBucket$1(Scheduler.scala:407)
>   at akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:411)
>   at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This doesn't seem to prevent the job from completing successfully, but it's a 
> serious issue because it means that resources aren't being cleaned up.  The 
> test script, ScalaAggByKeyInt, runs each test 10 times, and I see the same 
> error after each test, so this seems deterministically reproducible.
> I'll look at the executor logs to see if I can find more info there.



