I have a Spark standalone cluster with 2 workers. The master and one slave run on a single machine (Machine 1), and another slave runs on a separate machine (Machine 2).
I am running a spark-shell on the 2nd machine that reads a file from HDFS, does some calculations on it, and stores the result in HDFS.

This is how I read the file in the spark-shell -

    val file = sc.textFile("hdfs://localhost:9000/user/root/table.csv")

And this is how I write the result back to a file -

    finalRDD.saveAsTextFile("hdfs://localhost:9000/user/root/output_file")

When I run the code, it runs on the cluster and the job succeeds, with each worker processing roughly half of the input file. I can also see the records processed in the web UI. But when I check HDFS on the 2nd machine, I find only one part of the output file. The other part is stored in the HDFS on the 1st machine. And even that part is not in the proper HDFS location; it is instead stored in a _temporary directory.

On machine 2 -

    root@worker:~# hadoop fs -ls ./output_file
    Found 2 items
    -rw-r--r--   3 root supergroup       0 2015-07-06 16:12 output_file/_SUCCESS
    -rw-r--r--   3 root supergroup  984337 2015-07-06 16:12 output_file/part-00000

On machine 1 -

    root@spark:~# hadoop fs -ls ./output_file/_temporary/0/task_201507061612_0003_m_000001
    -rw-r--r--   3 root supergroup  971824 2015-07-06 16:12 output_file/_temporary/0/task_201507061612_0003_m_000001/part-00001

I have a couple of questions -

1. Shouldn't both parts be on worker 2 (since the HDFS referred to in saveAsTextFile is the local HDFS)? Or will the output always be split across the workers?
2. Why is the output stored in a _temporary directory on machine 1?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-cluster-Output-file-stored-in-temporary-directory-in-worker-tp23653.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
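P.S. To make the setup concrete: the authority part of the path ("localhost:9000") is resolved independently on whichever machine a task runs, so each worker may end up talking to its own local namenode. The sketch below, in plain Scala with java.net.URI, just illustrates how the host component of the two kinds of path differs; "namenode-host" is a placeholder hostname of my own, not something from the cluster config.

```scala
import java.net.URI

// The path used in the spark-shell session above. "localhost" is
// resolved on every worker separately, so Machine 1 and Machine 2
// can each end up writing to their own local HDFS.
val localPath = new URI("hdfs://localhost:9000/user/root/table.csv")
println(localPath.getHost)   // localhost

// Hypothetical alternative: name one namenode explicitly so every
// worker targets the same HDFS ("namenode-host" is a placeholder).
val sharedPath = new URI("hdfs://namenode-host:9000/user/root/output_file")
println(sharedPath.getHost)  // namenode-host
```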