(A) Semantics in Spark 0.9 and earlier: Spark will ignore Hadoop's
output format check and overwrite files in the destination directory.
But it won't clobber the directory entirely. I.e. if the directory
already had "part1", "part2", "part3", "part4" and you write a new job
outputting only two files ("part1", "part2"), then it would leave the
other two files intact, confusingly.

(B) Semantics in Spark 1.0 and later: Runs Hadoop's OutputFormat check,
which means the directory must not already exist or an exception is
thrown.

(C) Semantics proposed by Nicholas Chammas in this thread (AFAIK):
Spark will delete/clobber the destination directory if it already
exists, then fully overwrite it with new data.

I'm fine with adding a flag that allows (A) for backwards-compatibility
reasons, but my point was that I'd prefer not to have (C), even though
I see some cases where it would be useful.
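
For reference, the two-line workaround I mention in point 3 of my
earlier message quoted below is roughly the following (just a sketch,
not tested; it assumes an existing SparkContext named sc and an RDD
named rdd, and the output path is made up):

  import org.apache.hadoop.fs.{FileSystem, Path}

  // Delete the output directory if it already exists, then write.
  // Note: delete(path, true) is a recursive delete, so use with care.
  val outputPath = new Path("hdfs:///tmp/my-output")
  val fs = FileSystem.get(sc.hadoopConfiguration)
  if (fs.exists(outputPath)) {
    fs.delete(outputPath, true)
  }
  rdd.saveAsTextFile(outputPath.toString)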

- Patrick

On Mon, Jun 2, 2014 at 4:25 PM, Sean Owen <so...@cloudera.com> wrote:
> Is there a third way? Unless I'm missing something, Hadoop's OutputFormat
> wants the target dir to not exist no matter what, so it's just a
> question of whether Spark deletes it for you or errors.
>
> On Tue, Jun 3, 2014 at 12:22 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>> We can just add back a flag to make it backwards compatible - it was
>> just missed during the original PR.
>>
>> I'm slightly -1 on adding a *third* set of "clobber" semantics, for
>> the following reasons:
>>
>> 1. It's scary to have Spark recursively deleting user files; it could
>> easily lead to users deleting data by mistake if they don't understand
>> the exact semantics.
>> 2. It would introduce a third set of semantics here for saveAsXX...
>> 3. It's trivial for users to implement this with two lines of code (if
>> output dir exists, delete it) before calling saveAsHadoopFile.
>>
>> - Patrick
>>
