GitHub user CodingCat opened a pull request:
https://github.com/apache/incubator-spark/pull/626
[SPARK-1100] prevent Spark from overwriting directory silently and leaving
dirty directory
Thanks to Diana Carroll for reporting this issue.
Currently, saveAsTextFile/saveAsSequenceFile silently overwrites the output
directory if it already exists. This behaviour is undesirable because:
1. silently overwriting data is not user-friendly
2. if the number of partitions differs between two write operations, the
output directory will contain a mix of results from both runs
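To illustrate point 2, here is a minimal sketch that uses only java.nio on a local temporary directory. It is not Spark or Hadoop code; the class and method names (StalePartFilesDemo, writeParts, runTwice) are made up for illustration, and the part-NNNNN naming is assumed to mirror the one-file-per-partition layout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Hypothetical demo class, not part of Spark.
class StalePartFilesDemo {
    // Writes one part-NNNNN file per partition. Same-named files are
    // replaced, but any other files already in the directory are left alone.
    static void writeParts(Path dir, int numPartitions, String tag) throws IOException {
        Files.createDirectories(dir);
        for (int i = 0; i < numPartitions; i++) {
            Path part = dir.resolve(String.format("part-%05d", i));
            Files.write(part, tag.getBytes(StandardCharsets.UTF_8));
        }
    }

    // Runs a 4-partition write followed by a 2-partition write into the
    // same directory and returns file name -> content, sorted by name.
    static Map<String, String> runTwice() throws IOException {
        Path dir = Files.createTempDirectory("spark-1100-demo");
        writeParts(dir, 4, "run1");
        writeParts(dir, 2, "run2");
        Map<String, String> contents = new TreeMap<>();
        try (Stream<Path> files = Files.list(dir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                contents.put(p.getFileName().toString(),
                             new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
            }
        }
        return contents;
    }

    public static void main(String[] args) throws IOException {
        // prints {part-00000=run2, part-00001=run2, part-00002=run1, part-00003=run1}
        System.out.println(runTwice());
    }
}
```

The last two files are leftovers from the first run, so a reader of the directory would see data from both runs.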
My fix includes:
1. added new APIs with a flag that lets users choose whether to overwrite
the directory:
if the flag is set to true, the output directory is deleted first and the
new data is then written, so that the output directory never contains results
from multiple runs;
if the flag is set to false, Spark throws an exception if the output
directory already exists
2. I didn't change saveAsNewAPIHadoopFile because in the new API the overwrite
behaviour is defined by the implementation of RecordWriter, so we don't need to
control that in Spark
3. changed the Java API accordingly
4. the default behaviour is overwriting
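The flag semantics in item 1 can be sketched roughly as follows. This is only an illustration on the local filesystem with java.nio, not the code in this pull request: the class OverwriteFlagSketch, the method prepareOutputDir and the choice of FileAlreadyExistsException are assumptions, and the actual change works against Hadoop's filesystem layer rather than java.nio.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Spark.
class OverwriteFlagSketch {
    // Prepares an output directory before any data is written.
    // overwrite == true : delete an existing directory so no stale files survive.
    // overwrite == false: refuse to write into an existing directory.
    static void prepareOutputDir(Path dir, boolean overwrite) throws IOException {
        if (Files.exists(dir)) {
            if (!overwrite) {
                throw new FileAlreadyExistsException(dir.toString());
            }
            List<Path> toDelete;
            try (Stream<Path> walk = Files.walk(dir)) {
                // Reverse order puts children before their parent directories.
                toDelete = walk.sorted(Comparator.reverseOrder())
                               .collect(Collectors.toList());
            }
            for (Path p : toDelete) {
                Files.delete(p);
            }
        }
        Files.createDirectories(dir);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("spark-1100-flag");
        Files.write(dir.resolve("part-00003"), new byte[0]); // stale file from an earlier run
        prepareOutputDir(dir, true);                          // directory is emptied
        System.out.println(Files.exists(dir.resolve("part-00003"))); // prints false
    }
}
```

Deleting the whole directory up front, rather than overwriting file by file, is what guarantees that a run with fewer partitions leaves no leftovers behind.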
-----
Two questions:
1. should we deprecate the old APIs that lack this flag?
2. I noticed that Spark Streaming also calls these APIs; I think we
don't need to change the related part in streaming? @tdas
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/CodingCat/incubator-spark SPARK-1100
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-spark/pull/626.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #626
----
commit 2ec87a1f63b4650036691e5bf5d484aae4e6d470
Author: CodingCat <[email protected]>
Date: 2014-02-21T05:32:17Z
add new APIs to enable users define whether to overwrite the output
directory
----