I remember that in the earlier version of that PR, I deleted files by calling the HDFS API.

We discussed it and concluded that it's a bit scary to have something in Spark directly deleting a user's files.

Best,

--
Nan Zhu

On Monday, June 2, 2014 at 10:39 PM, Patrick Wendell wrote:

> (A) Semantics in Spark 0.9 and earlier: Spark will ignore Hadoop's
> output format check and overwrite files in the destination directory.
> But it won't clobber the directory entirely. I.e. if the directory
> already had "part1" "part2" "part3" "part4" and you write a new job
> outputting only two files ("part1", "part2") then it would leave the
> other two files intact, confusingly.
>
> (B) Semantics in Spark 1.0: Runs Hadoop OutputFormat check
> which means the directory must not exist already or an exception is
> thrown.
>
> (C) Semantics proposed by Nicholas Chammas in this thread (AFAIK):
> Spark will delete/clobber an existing destination directory if it
> exists, then fully over-write it with new data.
>
> I'm fine to add a flag that allows (B) for backwards-compatibility
> reasons, but my point was I'd prefer not to have (C) even though I see
> some cases where it would be useful.
>
> - Patrick
>
> On Mon, Jun 2, 2014 at 4:25 PM, Sean Owen <so...@cloudera.com> wrote:
> > Is there a third way? Unless I miss something. Hadoop's OutputFormat
> > wants the target dir to not exist no matter what, so it's just a
> > question of whether Spark deletes it for you or errors.
> >
> > On Tue, Jun 3, 2014 at 12:22 AM, Patrick Wendell <pwend...@gmail.com> wrote:
> > > We can just add back a flag to make it backwards compatible - it was
> > > just missed during the original PR.
> > >
> > > Adding a *third* set of "clobber" semantics, I'm slightly -1 on that
> > > for the following reasons:
> > >
> > > 1. It's scary to have Spark recursively deleting user files, could
> > > easily lead to users deleting data by mistake if they don't understand
> > > the exact semantics.
> > > 2. It would introduce a third set of semantics here for saveAsXX...
> > > 3. It's trivial for users to implement this with two lines of code (if
> > > output dir exists, delete it) before calling saveAsHadoopFile.
> > >
> > > - Patrick
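
For anyone skimming the thread, here is a rough sketch of how the three behaviors (A), (B) and (C) differ, using the "part1" to "part4" example from Patrick's message. It is only a model on the local filesystem using the Python standard library, not Spark or Hadoop code; the function names (`write_parts`, `save_overwrite_files`, `save_require_absent`, `save_clobber`, `demo`) are made up for illustration.

```python
import os
import shutil
import tempfile


def write_parts(out_dir, parts):
    """Write one small file per part name into out_dir (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)
    for name in parts:
        with open(os.path.join(out_dir, name), "w") as f:
            f.write("data for " + name + "\n")


def save_overwrite_files(out_dir, parts):
    """(A) Overwrite files with matching names; leave any other files in place."""
    write_parts(out_dir, parts)


def save_require_absent(out_dir, parts):
    """(B) Refuse to write if the destination directory already exists."""
    if os.path.exists(out_dir):
        raise FileExistsError("output directory already exists: " + out_dir)
    write_parts(out_dir, parts)


def save_clobber(out_dir, parts):
    """(C) Delete the whole destination directory first, then write."""
    # These two lines are the user-side "if output dir exists, delete it" step.
    if os.path.exists(out_dir):
        shutil.rmtree(out_dir)
    write_parts(out_dir, parts)


def demo():
    """Run the three behaviors against a directory that already has four parts."""
    base = tempfile.mkdtemp()
    out = os.path.join(base, "out")
    try:
        write_parts(out, ["part1", "part2", "part3", "part4"])

        # (A): part3 and part4 from the old job are still there afterwards.
        save_overwrite_files(out, ["part1", "part2"])
        after_a = sorted(os.listdir(out))

        # (B): the directory exists, so the save is rejected.
        try:
            save_require_absent(out, ["part1", "part2"])
            b_raised = False
        except FileExistsError:
            b_raised = True

        # (C): only the newly written parts remain.
        save_clobber(out, ["part1", "part2"])
        after_c = sorted(os.listdir(out))

        return after_a, b_raised, after_c
    finally:
        shutil.rmtree(base, ignore_errors=True)


if __name__ == "__main__":
    print(demo())
```

On HDFS the user-side equivalent of the check-and-delete step in `save_clobber` would presumably go through Hadoop's `FileSystem` (`exists` followed by a recursive `delete`) before calling saveAsHadoopFile; the sketch above only models the directory semantics.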