GitHub user CodingCat opened a pull request: https://github.com/apache/incubator-spark/pull/626
[SPARK-1100] prevent Spark from overwriting directory silently and leaving dirty directory

Thanks to Diana Carroll for reporting this issue.

The current saveAsTextFile/SequenceFile silently overwrites the output directory if it already exists. This behaviour is undesirable because:

1. Overwriting data silently is not user-friendly.
2. If the number of partitions differs between two write operations, the output directory ends up containing results generated by both runs.

My fix includes:

1. Added new APIs with a flag that lets users specify whether to overwrite the directory. If the flag is set to true, the output directory is deleted first and the new data is then written, so the directory never contains results from multiple runs. If the flag is set to false, Spark throws an exception if the output directory already exists.
2. I didn't change saveNewHadoopAPI, because in the new API the overwrite behaviour is defined by the implementation of RecordWriter, so we don't need to control it in Spark.
3. Changed the Java API accordingly.
4. The default behaviour is overwriting.

-----

Two questions:

1. Should we deprecate the old APIs that lack such a flag?
2. I noticed that Spark Streaming also calls these APIs. I think we don't need to change the related part in streaming?
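For readers who want to see the two problems and the proposed flag semantics concretely, here is a minimal sketch in plain Python. It does not use Spark or Hadoop: local directories stand in for the output path, and `save_parts_naive`, `save_parts`, and `list_parts` are hypothetical helpers invented for illustration, not APIs from this pull request.

```python
import os
import shutil
import tempfile


def _write_parts(records, num_partitions, out_dir):
    # One file per partition, named in the Hadoop "part-NNNNN" style.
    for p in range(num_partitions):
        with open(os.path.join(out_dir, f"part-{p:05d}"), "w") as f:
            # Round-robin assignment of records to partitions.
            f.writelines(f"{r}\n" for r in records[p::num_partitions])


def save_parts_naive(records, num_partitions, out_dir):
    """Mimic the old behaviour: write into the directory without clearing it."""
    os.makedirs(out_dir, exist_ok=True)
    _write_parts(records, num_partitions, out_dir)


def save_parts(records, num_partitions, out_dir, overwrite=True):
    """Mimic the proposed flag (hypothetical helper, not a Spark API)."""
    if os.path.exists(out_dir):
        if not overwrite:
            # Flag is false: refuse to touch an existing directory.
            raise FileExistsError(f"output directory {out_dir} already exists")
        # Flag is true: delete first so no stale part files survive.
        shutil.rmtree(out_dir)
    os.makedirs(out_dir)
    _write_parts(records, num_partitions, out_dir)


def list_parts(out_dir):
    return sorted(os.listdir(out_dir))


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    out = os.path.join(base, "out")
    data = list(range(8))

    save_parts_naive(data, 4, out)
    save_parts_naive(data, 2, out)
    print(list_parts(out))  # four part files; the last two are left over from the first run

    save_parts(data, 2, out, overwrite=True)
    print(list_parts(out))  # only the two part files from this run

    shutil.rmtree(base)
```

The second naive write rewrites `part-00000` and `part-00001` but leaves `part-00002` and `part-00003` in place, which is the "dirty directory" case described above. Deleting the directory before writing avoids it, and `overwrite=False` turns the silent overwrite into an explicit error.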
@tdas

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/CodingCat/incubator-spark SPARK-1100

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-spark/pull/626.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #626

----
commit 2ec87a1f63b4650036691e5bf5d484aae4e6d470
Author: CodingCat <zhunans...@gmail.com>
Date:   2014-02-21T05:32:17Z

    add new APIs to enable users define whether to overwrite the output directory

----