Without (C), what is the best practice to implement the following scenario?

1. rdd = sc.textFile(FileA)
2. rdd = rdd.map(...)  // actually modifying the rdd
3. rdd.saveAsTextFile(FileA)

Since RDD transformations are lazy, the RDD is not materialized until
saveAsTextFile(), so FileA must still exist at that point; yet it would also
need to be deleted before saveAsTextFile() can write to the same path.

What I can think of is:

3. rdd.saveAsTextFile(TempFile)
4. delete FileA
5. rename TempFile to FileA

This is not very convenient... (a rough sketch of this workaround is below).
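
A minimal sketch of the temp-directory workaround, assuming Scala and the
Hadoop FileSystem API (the path names and the map function are placeholders,
not the real job):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val inputPath = new Path("FileA")
    val tempPath  = new Path("FileA_tmp")   // placeholder temporary location

    // Steps 1-2: read and transform (toUpperCase stands in for the real logic)
    val rdd = sc.textFile(inputPath.toString).map(_.toUpperCase)

    // Step 3: materialize into a temporary directory, so FileA is still
    // readable while the lazy pipeline actually runs
    rdd.saveAsTextFile(tempPath.toString)

    // Steps 4-5: only after the save succeeds, swap the new output into place
    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.delete(inputPath, true)       // recursive delete of the old data
    fs.rename(tempPath, inputPath)   // move the temp output to FileA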

Thanks.

-----Original Message-----
From: Patrick Wendell [mailto:pwend...@gmail.com] 
Sent: Tuesday, June 03, 2014 11:40 AM
To: user@spark.apache.org
Subject: Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

(A) Semantics in Spark 0.9 and earlier: Spark will ignore Hadoop's output
format check and overwrite files in the destination directory.
But it won't clobber the directory entirely. I.e., if the directory already
had "part1" "part2" "part3" "part4" and you write a new job outputting only
two files ("part1", "part2"), then it would leave the other two files intact,
confusingly.

(B) Semantics in Spark 1.0 and later: Runs the Hadoop OutputFormat check, which
means the directory must not already exist or an exception is thrown.

(C) Semantics proposed by Nicholas Chammas in this thread (AFAIK):
Spark will delete/clobber the destination directory if it already exists,
then fully overwrite it with new data.

I'm fine to add a flag that allows (A) for backwards-compatibility reasons,
but my point was I'd prefer not to have (C) even though I see some cases
where it would be useful.

- Patrick

On Mon, Jun 2, 2014 at 4:25 PM, Sean Owen <so...@cloudera.com> wrote:
> Is there a third way? Unless I'm missing something. Hadoop's OutputFormat 
> wants the target dir to not exist no matter what, so it's just a 
> question of whether Spark deletes it for you or errors.
>
> On Tue, Jun 3, 2014 at 12:22 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>> We can just add back a flag to make it backwards compatible - it was 
>> just missed during the original PR.
>>
>> Adding a *third* set of "clobber" semantics, I'm slightly -1 on that 
>> for the following reasons:
>>
>> 1. It's scary to have Spark recursively deleting user files; it could 
>> easily lead to users deleting data by mistake if they don't 
>> understand the exact semantics.
>> 2. It would introduce a third set of semantics here for saveAsXX...
>> 3. It's trivial for users to implement this with two lines of code 
>> (if output dir exists, delete it) before calling saveAsHadoopFile.
>>
>> - Patrick
>>
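
For completeness, Patrick's point 3 above (if the output dir exists, delete
it, then save) could look roughly like the following sketch, assuming Scala
and the Hadoop FileSystem API; the output path is a placeholder. Note that
this does not help the same-path scenario at the top of this message, since
the input must still exist when the save actually runs:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val outputPath = new Path("hdfs:///tmp/my-output")   // placeholder location
    val fs = FileSystem.get(outputPath.toUri, sc.hadoopConfiguration)
    if (fs.exists(outputPath)) {
      fs.delete(outputPath, true)   // recursive delete, use with care
    }
    rdd.saveAsTextFile(outputPath.toString)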
