Hi, Ryan

We have run into similar errors, and increasing executor memory resolved them.
I am not sure of the exact root cause, but it might be worth a try.
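
For example, something like this before the SparkContext is created (the 60g
value is just a guess for your workload; it is the same setting as the
--executor-memory flag you are already passing):

    import org.apache.spark.{SparkConf, SparkContext}

    // bump the executor heap; adjust the value to whatever your nodes can actually provide
    val conf = new SparkConf().set("spark.executor.memory", "60g")
    val sc = new SparkContext(conf)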

On Wed, Oct 29, 2014 at 1:34 PM, Ryan Williams [via Apache Spark User List]
<ml-node+s1001560n17605...@n3.nabble.com> wrote:

> My job is failing with the following error:
>
> 14/10/29 02:59:14 WARN scheduler.TaskSetManager: Lost task 1543.0 in stage 3.0
> (TID 6266, demeter-csmau08-19.demeter.hpc.mssm.edu): java.io.FileNotFoundException:
> /data/05/dfs/dn/yarn/nm/usercache/willir31/appcache/application_1413512480649_0108/spark-local-20141028214722-43f1/26/shuffle_0_312_0.index
> (No such file or directory)
>         java.io.FileOutputStream.open(Native Method)
>         java.io.FileOutputStream.<init>(FileOutputStream.java:221)
>         org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
>         org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
>         org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:733)
>         org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:732)
>         scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:790)
>         org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:732)
>         org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:728)
>         scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:728)
>         org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)
>         org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>         org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         org.apache.spark.scheduler.Task.run(Task.scala:56)
>         org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         java.lang.Thread.run(Thread.java:744)
>
>
> I get 4 of those on task 1543 before the job aborts. Interspersed among the
> four task-1543 failures are a few instances of the same failure on another
> task. Here is the entire App Master stdout dump
> <https://www.dropbox.com/s/m8c4o73o0bh7kf8/adam.108?dl=0>[1] (~2MB; stack
> traces towards the bottom, of course). I am running {Spark 1.1, Hadoop
> 2.3.0}.
>
> Here's a summary of the RDD manipulations I've done up to the point of
> failure (a condensed code sketch follows the list):
>
>    - val A = [read a file in 1419 shards]
>       - the file is 177GB compressed but ends up being ~5TB uncompressed /
>         hydrated into Scala objects (I think; see below for more discussion
>         on this point).
>       - some relevant Spark options:
>          - spark.default.parallelism=2000
>          - --master yarn-client
>          - --executor-memory 50g
>          - --driver-memory 10g
>          - --num-executors 100
>          - --executor-cores 4
>
>
>    - A.repartition(3000)
>       - 3000 was chosen in an attempt to mitigate shuffle-disk-spillage
>       that previous job attempts with 1000 or 1419 shards were mired in
>
>
>    - A.persist()
>
>
>    - A.count()  // succeeds
>       - screenshot of web UI with stats: http://cl.ly/image/3e130w3J1B2v
>       - I don't know why each task reports "8 TB" of "Input"; that metric
>         always seems ludicrously high, and I typically don't pay attention
>         to it.
>       - Each task shuffle-writes 3.5GB, for a total of 4.9TB (arithmetic
>         spelled out below this list)
>          - Does that mean that 4.9TB is the uncompressed size of the file
>          that A was read from?
>          - 4.9TB is pretty close to the total amount of memory I've
>          configured the job to use: (50GB/executor) * (100 executors) ~= 5TB.
>          - Is that a coincidence, or are my executors shuffle-writing an
>          amount equal to all of their memory for some reason?
>
>
>    - val B = A.groupBy(...).filter(_._2.size == 2).map(_._2).flatMap(x =>
>    x).persist()
>       - my expectation is that ~all elements pass the filter step, so B
>         should be roughly equal to A; I mention this just to give a sense of
>         the expected memory blowup.
>
>
>    - B.count()
>       - this *fails* while executing .groupBy(...) above
>
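> In code form, the above is roughly the following (paraphrased; readShards and
> keyOf are placeholders for the real input-reading call and grouping key, which
> I've omitted):
>
>     // submitted with: --master yarn-client --executor-memory 50g --driver-memory 10g
>     //                 --num-executors 100 --executor-cores 4, spark.default.parallelism=2000
>     val A = readShards(sc)          // "[read a file in 1419 shards]", 177GB compressed
>       .repartition(3000)            // spread the shuffle over more, smaller tasks
>       .persist()
>     A.count()                       // succeeds
>
>     val B = A.groupBy(keyOf)        // B.count() below fails during this groupBy's shuffle
>       .filter(_._2.size == 2)       // expect ~all elements to pass
>       .map(_._2)
>       .flatMap(x => x)
>       .persist()
>     B.count()                       // fails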
>
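> And spelling out the arithmetic behind the shuffle-write total and the memory
> comparison above (assuming the 3.5GB figure is the per-task shuffle write
> across all 1419 map tasks):
>
>     1419 tasks    * 3.5 GB/task     ~= 4.9 TB   (total shuffle write)
>     100 executors * 50 GB/executor  ~= 5.0 TB   (total configured executor memory)
>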
> I've found a few discussions of issues whose manifestations look *like*
> this, but nothing that is obviously the same issue. The closest hit I've
> seen is "Stage failure in BlockManager...
> <http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3ccangvg8qtk57frws+kaqtiuz9jsls5qjkxxjxttq9eh2-gsr...@mail.gmail.com%3E>"[2]
> on this list on 8/20; some key excerpts:
>
>    - "likely due to a bug in shuffle file consolidation"
>    - "hopefully fixed in 1.1 with this patch:
>      https://github.com/apache/spark/commit/78f2af582286b81e6dc9fa9d455ed2b369d933bd"
>       - 78f2af5 <https://github.com/apache/spark/commit/78f2af5>[3] implements
>         pieces of #1609 <https://github.com/apache/spark/pull/1609>[4], on
>         which mridulm has a comment
>         <https://github.com/apache/spark/pull/1609#issuecomment-54393908>[5]
>         saying: "it got split into four issues, two of which got committed, not
>         sure of the other other two .... And the first one was regressed upon in
>         1.1.already."
>    - "Until 1.0.3 or 1.1 are released, the simplest solution is to disable
>      spark.shuffle.consolidateFiles." (see the snippet after this list)
>       - I've not tried this yet as I'm waiting on a re-run with some other
>         parameters tweaked first.
>       - Also, I can't tell if it's expected that this was fixed, known that
>         it subsequently regressed, etc., so hoping for some guidance there.
>
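> If I do end up trying that workaround, I assume it's just a matter of flipping
> the boolean conf before the SparkContext is created, something like:
>
>     sparkConf.set("spark.shuffle.consolidateFiles", "false")   // sparkConf: this job's SparkConf
>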
> So! Anyone else seen this? Is this related to the "bug in shuffle file
> consolidation"? Was it fixed? Did it regress? Are my confs or other steps
> unreasonable in some way? Any assistance would be appreciated, thanks.
>
> -Ryan
>
>
> [1] https://www.dropbox.com/s/m8c4o73o0bh7kf8/adam.108?dl=0
> [2] http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCANGvG8qtK57frWS+kaqTiUZ9jSLs5qJKXXjXTTQ9eh2-GsrmpA@...%3E
> [3] https://github.com/apache/spark/commit/78f2af5
> [4] https://github.com/apache/spark/pull/1609
> [5] https://github.com/apache/spark/pull/1609#issuecomment-54393908
>



