Ok, so I don't think the workers on the data nodes will be able to see my output directory on the edge node. I don't think stdout will work either, so I'll write to HDFS via rdd.saveAsTextFile(...)
On Wed, Sep 10, 2014 at 3:51 PM, Daniil Osipov <[email protected]> wrote: > Try providing full path to the file you want to write, and make sure the > directory exists and is writable by the Spark process. > > On Wed, Sep 10, 2014 at 3:46 PM, Arun Luthra <[email protected]> > wrote: > >> I have a spark program that worked in local mode, but throws an error in >> yarn-client mode on a cluster. On the edge node in my home directory, I >> have an output directory (called transout) which is ready to receive files. >> The spark job I'm running is supposed to write a few hundred files into >> that directory, once for each iteration of a foreach function. This works >> in local mode, and my only guess as to why this would fail in yarn-client >> mode is that the RDD is distributed across many nodes and the program is >> trying to use the PrintWriter on the datanodes, where the output directory >> doesn't exist. Is this what's happening? Any proposed solution? >> >> abbreviation of the code: >> >> import java.io.PrintWriter >> ... >> rdd.foreach { >> val outFile = new PrintWriter("transoutput/output.%s".format(id)) >> outFile.println("test") >> outFile.close() >> } >> >> Error: >> >> 14/09/10 16:57:09 WARN TaskSetManager: Lost TID 1826 (task 0.0:26) >> 14/09/10 16:57:09 WARN TaskSetManager: Loss was due to >> java.io.FileNotFoundException >> java.io.FileNotFoundException: transoutput/input.598718 (No such file or >> directory) >> at java.io.FileOutputStream.open(Native Method) >> at java.io.FileOutputStream.<init>(FileOutputStream.java:194) >> at java.io.FileOutputStream.<init>(FileOutputStream.java:84) >> at java.io.PrintWriter.<init>(PrintWriter.java:146) >> at >> com.att.bdcoe.cip.ooh.TransformationLayer$$anonfun$main$3.apply(TransformLayer.scala:98) >> at >> com.att.bdcoe.cip.ooh.TransformationLayer$$anonfun$main$3.apply(TransformLayer.scala:95) >> at scala.collection.Iterator$class.foreach(Iterator.scala:727) >> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) >> at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:703) >> at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:703) >> at >> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083) >> at >> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083) >> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) >> at org.apache.spark.scheduler.Task.run(Task.scala:51) >> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >> at java.lang.Thread.run(Thread.java:662) >> > >
