I remember that in the earlier version of that PR, I deleted files by calling 
HDFS API

we discussed and concluded that, it’s a bit scary to have something directly 
deleting user’s files in Spark

Best,  

--  
Nan Zhu


On Monday, June 2, 2014 at 10:39 PM, Patrick Wendell wrote:

> (A) Semantics in Spark 0.9 and earlier: Spark will ignore Hadoo's
> output format check and overwrite files in the destination directory.
> But it won't clobber the directory entirely. I.e. if the directory
> already had "part1" "part2" "part3" "part4" and you write a new job
> outputing only two files ("part1", "part2") then it would leave the
> other two files intact, confusingly.
>  
> (B) Semantics in Spark 1.0 and earlier: Runs Hadoop OutputFormat check
> which means the directory must not exist already or an excpetion is
> thrown.
>  
> (C) Semantics proposed by Nicholas Chammas in this thread (AFAIK):
> Spark will delete/clobber an existing destination directory if it
> exists, then fully over-write it with new data.
>  
> I'm fine to add a flag that allows (B) for backwards-compatibility
> reasons, but my point was I'd prefer not to have (C) even though I see
> some cases where it would be useful.
>  
> - Patrick
>  
> On Mon, Jun 2, 2014 at 4:25 PM, Sean Owen <so...@cloudera.com 
> (mailto:so...@cloudera.com)> wrote:
> > Is there a third way? Unless I miss something. Hadoop's OutputFormat
> > wants the target dir to not exist no matter what, so it's just a
> > question of whether Spark deletes it for you or errors.
> >  
> > On Tue, Jun 3, 2014 at 12:22 AM, Patrick Wendell <pwend...@gmail.com 
> > (mailto:pwend...@gmail.com)> wrote:
> > > We can just add back a flag to make it backwards compatible - it was
> > > just missed during the original PR.
> > >  
> > > Adding a *third* set of "clobber" semantics, I'm slightly -1 on that
> > > for the following reasons:
> > >  
> > > 1. It's scary to have Spark recursively deleting user files, could
> > > easily lead to users deleting data by mistake if they don't understand
> > > the exact semantics.
> > > 2. It would introduce a third set of semantics here for saveAsXX...
> > > 3. It's trivial for users to implement this with two lines of code (if
> > > output dir exists, delete it) before calling saveAsHadoopFile.
> > >  
> > > - Patrick  

Reply via email to