[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161142#comment-14161142 ]
Matei Zaharia commented on SPARK-3633: -------------------------------------- In that case though, the problem might be that these maps are allocating more memory without that patch, and exceeding the spark.shuffle.memoryFraction, which would lead to other bugs. That is also consistent with it running faster. It would be good to investigate why exactly this happened (e.g. was this job just at the edge of exceeding its ulimit). It sounds like PageRank has exactly the problem described in SPARK-2711 (you have reduce tasks that are also writing map output files for the next stage, so they contain two ExternalAppendOnlyMaps, so the old version of ExternalAppendOnlyMap does not properly account for memory). That's going to lead to trouble in jobs that are also doing caching. > Fetches failure observed after SPARK-2711 > ----------------------------------------- > > Key: SPARK-3633 > URL: https://issues.apache.org/jira/browse/SPARK-3633 > Project: Spark > Issue Type: Bug > Components: Block Manager > Affects Versions: 1.1.0 > Reporter: Nishkam Ravi > Priority: Critical > > Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. > Recently upgraded to Spark 1.1. The workload fails with the following error > message(s): > {code} > 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, > c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, > c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120) > 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages > {code} > In order to identify the problem, I carried out change set analysis. As I go > back in time, the error message changes to: > {code} > 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, > c1706.halxg.cloudera.com): java.io.FileNotFoundException: > /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 > (Too many open files) > java.io.FileOutputStream.open(Native Method) > java.io.FileOutputStream.<init>(FileOutputStream.java:221) > > org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117) > > org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145) > org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) > > org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51) > > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > org.apache.spark.scheduler.Task.run(Task.scala:54) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {code} > All the way until Aug 4th. Turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org