Write to HDFS and then fetch a single file locally by using "hdfs dfs -getmerge ..."
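For example (a sketch only — the HDFS output directory and local destination path below are placeholders, not from the thread):

```shell
# After the Spark job has written its part-xxxxx files to an HDFS directory,
# e.g. via rdd.saveAsTextFile("hdfs:///tmp/out"), merge them into one local file:
hdfs dfs -getmerge /tmp/out /tmp/result.txt

# To append to an existing local file instead, stream the parts with -cat:
hdfs dfs -cat '/tmp/out/part-*' >> /tmp/existing.txt
```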

On Friday, October 17, 2014, Sean Owen <so...@cloudera.com> wrote:

> You can save to a local file. What are you trying and what doesn't work?
>
> You can output one file by repartitioning to 1 partition but this is
> probably not a good idea as you are bottlenecking the output and some
> upstream computation by disabling parallelism.
>
> How about just combining the files on HDFS afterwards? or just reading
> all the files instead of 1? You can hdfs dfs -cat a bunch of files at
> once.
>
> On Fri, Oct 17, 2014 at 6:46 PM, Parthus <peng.wei....@gmail.com> wrote:
> > Hi,
> >
> > I have a Spark task which requires me to write the final RDD to an
> > existing local file (appending to this file). I tried two ways, but
> > neither works well:
> >
> > 1. Use the saveAsTextFile() API. Spark 1.1.0 claims that this API can
> > write to local files, but I could never make it work. Moreover, the
> > result is not one file but a series of part-xxxxx files, which is not
> > what I hope to get.
> >
> > 2. Collect the RDD to an array and write it on the driver node using
> > Java's file IO. There are two problems here: 1) my RDD is huge (1 TB)
> > and cannot fit into the memory of the driver node, so I have to split
> > the task into small pieces, collecting and writing them part by part;
> > 2) while the Java IO is writing, the Spark task has to wait, which is
> > not efficient.
> >
> > Could anybody suggest an efficient way to solve this problem? Ideally,
> > the solution would append a huge RDD to a local file without pausing
> > the computation during the write.
> >
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-tp16720.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
>

-- 
- Rishi