Re: How to write a RDD into One Local Existing File?

2014-10-20 Thread Akhil Das
If you don't want the part-xxxxx files in the output but a single file, you can
repartition (or coalesce) the RDD into 1 partition before writing. This will be
a bottleneck, though, since you are disabling the parallelism - it's like giving
everything to one machine to process. You are better off writing in parallel and
merging those part-xxxxx files afterwards in HDFS (use hadoop fs -getmerge).
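
A rough sketch of both options (assuming a SparkContext named sc, as in
spark-shell, and made-up HDFS paths):

// Whatever RDD you actually computed.
val rdd = sc.textFile("hdfs:///data/input")

// Option 1: one part file, but every record funnels through a single task.
rdd.coalesce(1).saveAsTextFile("hdfs:///data/out-single")

// Option 2: keep the parallel write and merge the part files afterwards,
// e.g. from the shell: hadoop fs -getmerge /data/out-parallel merged.txt
rdd.saveAsTextFile("hdfs:///data/out-parallel")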

Thanks
Best Regards



Re: How to write a RDD into One Local Existing File?

2014-10-19 Thread Rishi Yadav
Write to HDFS and then get one file locally by using "hdfs dfs -getmerge..."
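
If you'd rather do the merge from the driver program than from the shell,
Hadoop 2.x has FileUtil.copyMerge; a rough sketch with made-up paths:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Concatenate the part files of a saveAsTextFile output directory into a
// single local file. Paths are hypothetical.
val conf = new Configuration()
val hdfs = FileSystem.get(conf)
val localFs = FileSystem.getLocal(conf)
FileUtil.copyMerge(
  hdfs, new Path("hdfs:///data/out-parallel"),      // source directory
  localFs, new Path("/tmp/merged-output.txt"),      // single destination file
  false,                                            // keep the source files
  conf, null)                                       // no separator between files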


-- 
- Rishi


Re: How to write a RDD into One Local Existing File?

2014-10-17 Thread Sean Owen
You can save to a local file. What are you trying and what doesn't work?

You can output one file by repartitioning to 1 partition, but this is
probably not a good idea, as you are bottlenecking the output and some
upstream computation by disabling parallelism.

How about just combining the files on HDFS afterwards? Or just reading
all the files instead of one? You can "hdfs dfs -cat" a bunch of files at
once.
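
For the read-side option, note that Spark itself can also consume the whole
output directory; a small sketch with an assumed path:

// textFile accepts a directory or a glob, so downstream jobs can treat all
// the part files as one RDD without merging them first.
val all = sc.textFile("hdfs:///data/out-parallel")
// or explicitly: sc.textFile("hdfs:///data/out-parallel/part-*")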





How to write a RDD into One Local Existing File?

2014-10-17 Thread Parthus
Hi,

I have a Spark MapReduce task which requires me to write the final RDD to an
existing local file (appending to this file). I tried two ways, but neither
works well:

1. Use the saveAsTextFile() API. Spark 1.1.0 claims that this API can write to
a local path, but I never got it to work. Moreover, the result is not one file
but a series of part-xxxxx files, which is not what I hope to get.

2. Collect the RDD to an array and write it on the driver node using Java's
file IO. There are two problems with this: 1) my RDD is huge (1TB) and cannot
fit into the memory of one driver node, so I have to split the task into small
pieces and collect and write them part by part; 2) while the Java IO write is
running, the Spark MapReduce task has to wait, which is not efficient.
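
In code, the part-by-part approach looks roughly like this (a sketch only,
assuming the final RDD is an RDD[String] named finalRdd and a made-up local
path; toLocalIterator pulls one partition at a time to the driver, so only a
single partition has to fit in driver memory):

import java.io.{BufferedWriter, FileWriter}

// Open the existing local file in append mode.
val writer = new BufferedWriter(new FileWriter("/data/local/output.txt", true))
try {
  finalRdd.toLocalIterator.foreach { line =>
    writer.write(line)
    writer.newLine()
  }
} finally {
  writer.close()
}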

Could anybody suggest an efficient way to solve this problem? Ideally, the
solution would let me append a huge RDD to a local file without pausing the
MapReduce job during the write.






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-tp16720.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org