[ 
https://issues.apache.org/jira/browse/SPARK-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1100:
-----------------------------------

    Assignee: Patrick Wendell  (was: Patrick Cogan)

> saveAsTextFile shouldn't clobber by default
> -------------------------------------------
>
>                 Key: SPARK-1100
>                 URL: https://issues.apache.org/jira/browse/SPARK-1100
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 0.9.0
>            Reporter: Diana Carroll
>            Assignee: Patrick Wendell
>             Fix For: 1.0.0
>
>
> If I call rdd.saveAsTextFile with an existing directory, it will cheerfully 
> and silently overwrite the files in there.  This is bad enough if it means 
> I've accidentally blown away the results of a job that might have taken 
> minutes or hours to run.  But it's worse if the second job happens to have 
> fewer partitions than the first... in that case, my output directory now 
> contains some "part" files from the earlier job, and some "part" files from 
> the later job.  The only way to know the difference is timestamp.
> I wonder if Spark's saveAsTextFile shouldn't work more like Hadoop MapReduce 
> which insists that the output directory not exist before the job starts.  
> Similarly HDFS won't overwrite files by default.  Perhaps there could be an 
> optional argument for saveAsTextFile that indicates if it should delete the 
> existing directory before starting.  (I can't see any time I'd want to allow 
> writing to an existing directory with data already in it.  Would the mix of 
> output from different tasks ever be desirable?)
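The behavior proposed above (fail fast when the output directory already exists, with an explicit opt-in to delete it first) can be sketched outside Spark with plain filesystem checks. The `save_text_file` helper and its `overwrite` flag below are hypothetical names for illustration, not part of Spark's API:

```python
import shutil
from pathlib import Path


def save_text_file(partitions, out_dir, overwrite=False):
    """Write one 'part' file per partition, refusing to clobber an
    existing output directory unless overwrite=True.

    Hypothetical sketch of the semantics proposed in SPARK-1100; the
    real saveAsTextFile in Spark 0.9.0 silently writes into the
    existing directory.
    """
    path = Path(out_dir)
    if path.exists():
        if not overwrite:
            # Mirror Hadoop MapReduce: fail before the job starts
            # instead of mixing old and new part files.
            raise FileExistsError(f"Output directory {out_dir} already exists")
        # Explicit opt-in: remove stale output, including any extra
        # part files left over from an earlier job with more partitions.
        shutil.rmtree(path)
    path.mkdir(parents=True)
    for i, partition in enumerate(partitions):
        (path / f"part-{i:05d}").write_text("\n".join(partition) + "\n")
```

With `overwrite=False`, a second job that happens to have fewer partitions fails immediately rather than leaving a directory containing part files from two different runs, distinguishable only by timestamp.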



--
This message was sent by Atlassian JIRA
(v6.2#6252)