Hi Ravi,

I have seen a similar issue before. You can try setting fs.hdfs.impl.disable.cache to true in your Hadoop configuration; this stops HDFS FileSystem instances from being shared across tasks, which is what leads to the "Filesystem closed" exception when one task closes the shared instance. For example, if your Hadoop Configuration object is hadoopConf, you can call hadoopConf.setBoolean("fs.hdfs.impl.disable.cache", true).
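As a minimal sketch of where that call would go (assuming an existing SparkContext named sc; adjust to wherever you build your Hadoop Configuration):

```scala
// Sketch, assuming a SparkContext named sc is already created.
// With fs.hdfs.impl.disable.cache = true, each FileSystem.get() call
// returns a fresh HDFS client instead of a cached shared one, so a
// task closing its FileSystem cannot invalidate another task's handle.
val hadoopConf = sc.hadoopConfiguration
hadoopConf.setBoolean("fs.hdfs.impl.disable.cache", true)
```

This needs to run before the jobs that read from HDFS are submitted, so the setting is in place when the record readers open their streams.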
Let me know if that helps.

Best,
Liquan

On Wed, Jul 16, 2014 at 4:56 PM, rpandya <r...@iecommerce.com> wrote:
> Matei - I tried using coalesce(numNodes, true), but it then seemed to run
> too few SNAP tasks - only 2 or 3 when I had specified 46. The job failed,
> perhaps for unrelated reasons, with some odd exceptions in the log (at the
> end of this message). But I really don't want to force data movement
> between nodes. The input data is in HDFS and should already be somewhat
> balanced among the nodes. We've run this scenario using the simple
> "hadoop jar" runner and a custom format jar to break the input into
> 8-line chunks (paired FASTQ). Ideally I'd like Spark to do the minimum
> data movement to balance the work, feeding each task mostly from data
> local to that node.
>
> Daniel - that's a good thought, I could invoke a small stub for each task
> that talks to a single local daemon process over a socket, and serializes
> all the tasks on a given machine.
>
> Thanks,
>
> Ravi
>
> P.S. Log exceptions:
>
> 14/07/15 17:02:00 WARN yarn.ApplicationMaster: Unable to retrieve
> SparkContext in spite of waiting for 100000, maxNumTries = 10
> Exception in thread "main" java.lang.NullPointerException
>         at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized(ApplicationMaster.scala:233)
>         at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:110)
>
> ...and later...
>
> 14/07/15 17:11:07 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
> 14/07/15 17:11:07 INFO yarn.ApplicationMaster: AppMaster received a signal.
> 14/07/15 17:11:07 WARN rdd.NewHadoopRDD: Exception in RecordReader.close()
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
>         at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Memory-compute-intensive-tasks-tp9643p9991.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst