I'm guessing it's a documentation issue, but certainly something could have broken.
- What version of Spark? -- 0.8.1
- What mode are you running in? (local, standalone, Mesos, YARN) -- local, on Windows
- Are you using the shell or an application? -- shell
- What language (Scala / Java / Python)? -- Scala

Can you provide a deeper error stack trace from the executor? Look in the web UI (port 4040) and in the stdout/stderr files. Also, give it a shot on the Linux box to see if that works.

Cheers!
Andrew

On Thu, Jan 2, 2014 at 1:31 PM, Philip Ogren <philip.og...@oracle.com> wrote:

> Yep - that works great and is what I normally do.
>
> I perhaps should have framed my email as a bug report. The documentation
> for saveAsTextFile says you can write results out to a local file, but it
> doesn't work for me per the described behavior. It also worked before and
> now it doesn't, so it seems like a bug. Should I file a Jira issue? I
> haven't done that yet for this project but would be happy to.
>
> Thanks,
> Philip
>
>
> On 1/2/2014 11:23 AM, Andrew Ash wrote:
>
> For testing, maybe try using .collect and doing the comparison between
> expected and actual in memory rather than on disk?
>
>
> On Thu, Jan 2, 2014 at 12:54 PM, Philip Ogren <philip.og...@oracle.com> wrote:
>
>> I just tried your suggestion and get the same results with the
>> _temporary directory. Thanks though.
>>
>>
>> On 1/2/2014 10:28 AM, Andrew Ash wrote:
>>
>> You want to write it to a local file on the machine? Try using
>> "file:///path/to/target/mydir/" instead.
>>
>> I'm not sure what the behavior would be if you did this on a multi-machine
>> cluster, though -- you may get a bit of data on each machine in that local
>> directory.
>>
>>
>> On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren <philip.og...@oracle.com> wrote:
>>
>>> I have a very simple Spark application that looks like the following:
>>>
>>> var myRdd: RDD[Array[String]] = initMyRdd()
>>> println(myRdd.first.mkString(", "))
>>> println(myRdd.count)
>>>
>>> myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")
>>> myRdd.saveAsTextFile("target/mydir/")
>>>
>>> The println statements work as expected. The first saveAsTextFile
>>> statement also works as expected. The second saveAsTextFile statement
>>> does not (even if the first is commented out). I get the exception pasted
>>> below. If I inspect "target/mydir" I see that there is a directory called
>>> _temporary/0/_temporary/attempt_201401020953_0000_m_000000_1, which
>>> contains an empty part-00000 file. It's curious because this code worked
>>> before with Spark 0.8.0, and now I am running on Spark 0.8.1. I happen to
>>> be running this on Windows in "local" mode at the moment. Perhaps I
>>> should try running it on my Linux box.
>>>
>>> Thanks,
>>> Philip
>>>
>>>
>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted:
>>> Task 2.0:0 failed more than 0 times; aborting job
>>> java.lang.NullPointerException
>>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
>>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
>>>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
>>>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
>>>     at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
>>>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
>>>     at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)
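[Editor's note] The .collect-based testing approach suggested in the thread can be sketched in plain Scala as below. Since this sketch assumes no running Spark context, an ordinary Seq stands in for the rows that `myRdd.collect()` would return; the variable names are illustrative only.

```scala
// In-memory check instead of saveAsTextFile: collect the RDD and compare
// against expected rows. With Spark this would start from
//   val collected = myRdd.collect().toSeq
// Here a plain Seq stands in for the collected rows (illustrative only).
val collected: Seq[Array[String]] = Seq(Array("a", "b"), Array("c", "d"))

// Render each row the same way the println in the original program does.
val actual = collected.map(_.mkString(","))
val expected = Seq("a,b", "c,d")

// Sort before comparing, since RDD partition order is not guaranteed.
assert(actual.sorted == expected.sorted, s"mismatch: $actual vs $expected")
```

This avoids the local-filesystem write path entirely, which sidesteps the _temporary-directory issue during testing.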
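[Editor's note] For the "file:///path/to/target/mydir/" suggestion in the thread, a small helper can build that URI from a relative path on both Windows and Linux. This is a sketch; `LocalPaths` and `localUri` are hypothetical names, not part of the Spark API.

```scala
// Hypothetical helper (not Spark API) for building the file:// URI form
// that the thread suggests passing to saveAsTextFile for local output.
object LocalPaths {
  def localUri(path: String): String = {
    // Normalize Windows backslashes and ensure a leading slash, so the
    // result is file:///C:/... on Windows and file:///home/... on Linux.
    val abs = new java.io.File(path).getAbsolutePath.replace('\\', '/')
    val slashed = if (abs.startsWith("/")) abs else "/" + abs
    "file://" + slashed
  }
}
```

Under these assumptions the second save in the original program would become `myRdd.saveAsTextFile(LocalPaths.localUri("target/mydir"))`.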