If you don't want part-xxx files in the output but a single file, then you should repartition (or coalesce) the RDD into 1 partition. This will be a bottleneck, since you are disabling parallelism; it's like giving everything to one machine to process. You are better off letting Spark write the part-xxx files to HDFS and merging them afterwards (use hadoop fs -getmerge). A minimal sketch of both options follows (the input and output paths, and the object name, are hypothetical):
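
    import org.apache.spark.{SparkConf, SparkContext}

    object SingleFileOutput {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("single-file-output"))
        val rdd = sc.textFile("hdfs:///input/data") // hypothetical input path

        // Option 1: collapse to one partition so Spark emits a single part file.
        // Every record funnels through one task, so the write is serialized.
        rdd.coalesce(1).saveAsTextFile("hdfs:///output/single")

        // Option 2 (usually better): keep full parallelism for the write,
        // then merge the part-xxxxx files into one local file afterwards:
        rdd.saveAsTextFile("hdfs:///output/parts")
        //   $ hadoop fs -getmerge /output/parts /tmp/merged.txt

        sc.stop()
      }
    }
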
Thanks
Best Regards

On Mon, Oct 20, 2014 at 10:01 AM, Rishi Yadav <ri...@infoobjects.com> wrote:

> Write to HDFS and then get one file locally by using "hdfs dfs
> -getmerge...".
>
> On Friday, October 17, 2014, Sean Owen <so...@cloudera.com> wrote:
>
>> You can save to a local file. What are you trying, and what doesn't work?
>>
>> You can output one file by repartitioning to 1 partition, but this is
>> probably not a good idea, as you are bottlenecking the output and some
>> upstream computation by disabling parallelism.
>>
>> How about just combining the files on HDFS afterwards? Or just reading
>> all the files instead of 1? You can hdfs dfs -cat a bunch of files at
>> once.
>>
>> On Fri, Oct 17, 2014 at 6:46 PM, Parthus <peng.wei....@gmail.com> wrote:
>> > Hi,
>> >
>> > I have a Spark MapReduce task which requires me to write the final RDD
>> > to an existing local file (appending to this file). I tried two ways,
>> > but neither works well:
>> >
>> > 1. Use the saveAsTextFile() API. Spark 1.1.0 claims that this API can
>> > write to local files, but I never made it work. Moreover, the result
>> > is not one file but a series of part-xxxxx files, which is not what I
>> > hoped to get.
>> >
>> > 2. Collect the RDD to an array and write it to the driver node using
>> > Java's file IO. There are also two problems: 1) my RDD is huge (1 TB),
>> > which cannot fit into the memory of one driver node, so I have to
>> > split the task into small pieces and collect and write them part by
>> > part; 2) while the Java IO is writing, the Spark MapReduce task has to
>> > wait, which is not efficient.
>> >
>> > Could anybody provide an efficient way to solve this problem? I wish
>> > the solution could be something like: appending a huge RDD to a local
>> > file without pausing the MapReduce during writing.
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-tp16720.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> --
> - Rishi
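
For the original question of appending a huge RDD to a local file without collecting it all at once, one option not mentioned above is RDD.toLocalIterator, which pulls one partition at a time to the driver, so the driver only needs enough memory for the largest partition rather than the whole 1 TB. A minimal sketch, assuming a hypothetical input path and local output file:

    import java.io.{BufferedWriter, FileWriter}

    import org.apache.spark.{SparkConf, SparkContext}

    object AppendToLocalFile {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("append-to-local-file"))
        val rdd = sc.textFile("hdfs:///input/data") // hypothetical input path

        // Open the local file in append mode, as the original poster wants.
        val writer = new BufferedWriter(new FileWriter("/tmp/output.txt", true))
        try {
          // toLocalIterator fetches one partition at a time, avoiding the
          // memory pressure of collect() on the driver.
          rdd.toLocalIterator.foreach { line =>
            writer.write(line)
            writer.newLine()
          }
        } finally {
          writer.close()
        }

        sc.stop()
      }
    }

Note that toLocalIterator runs a separate job per partition, so the write still serializes at the driver; it removes the memory problem, not the pause.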