From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: February-25-14 3:02 PM
To: user@spark.apache.org
Cc: u...@spark.incubator.apache.org
Subject: Re: ETL on pyspark
It will only move a file to the final directory when it has successfully
finished writing it, so the file shouldn't have any duplicates.
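
To illustrate the write-then-move idea on a single machine, here is a rough sketch in plain Python. This is not Spark's actual code. The function name write_then_move and the file names are made up for the example. The point is only that a failed attempt leaves nothing at the final path, so a retry starts clean instead of appending.

```python
import os
import tempfile


def write_then_move(lines, final_path):
    """Write lines to a temporary file, then move it into place in one step.

    Hypothetical helper for illustration only; not part of Spark.
    """
    final_dir = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file in the same directory as the target so the
    # final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=final_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                tmp.write(line + "\n")
        # Only reached if every write succeeded.
        os.replace(tmp_path, final_path)
    except BaseException:
        # A failed attempt leaves nothing at final_path; remove the
        # partial temp file and let the caller retry from scratch.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

A retry simply calls write_then_move again with the same final_path, so readers of the final directory only ever see complete files.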
> ... after recovery from the failure does
> it continue where it left off or will there be duplicates in the file?
>
> -A
> From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
> Sent: February-24-14 4:20 PM
> To: u...@spark.incubator.apache.org
> Subject: Re: ETL on pyspark
>
collect() means to bring all the data back to the master node, and there might
just be too much of it for that. How big is your file? If you can't bring it
back to the master node try saveAsTextFile to write it out to a filesystem (in
parallel).
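
A minimal PySpark sketch of that suggestion, assuming the data is read with textFile. The paths, the app name, the map step, and the function names write_out and run_job are placeholders for illustration, not anything from this thread.

```python
def write_out(rdd, out_dir):
    """Write an RDD to out_dir in parallel instead of collecting it."""
    # Each partition is written by the worker that holds it, as its own
    # part-NNNNN file under out_dir, so the full dataset never has to
    # fit in the master's memory. out_dir must not already exist.
    rdd.saveAsTextFile(out_dir)
    return out_dir


def run_job(input_path, output_path):
    """Hypothetical ETL driver; requires a working PySpark installation."""
    from pyspark import SparkContext

    sc = SparkContext(appName="etl-sketch")  # placeholder app name
    try:
        lines = sc.textFile(input_path)
        # Placeholder transformation standing in for the real ETL step.
        cleaned = lines.map(lambda line: line.strip())

        # cleaned.collect() would pull every record back to the master.
        # take(n) fetches only the first n records, which is usually
        # enough to sanity-check the result.
        print(cleaned.take(5))

        write_out(cleaned, output_path)
    finally:
        sc.stop()
```

Calling run_job("hdfs:///path/to/input", "hdfs:///path/to/output") from a driver script would run it; both paths here are placeholders.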
Matei
On Feb 24, 2014, at 1: