GitHub user CodingCat opened a pull request:

    https://github.com/apache/incubator-spark/pull/626

    [SPARK-1100] prevent Spark from overwriting directory silently and leaving 
dirty directory

    Thanks to Diana Carroll for reporting this issue.
    
    The current saveAsTextFile/saveAsSequenceFile silently overwrites the 
output directory if it already exists. This behaviour is not desirable 
because:
    
    1. overwriting the data silently is not user-friendly
    
    2. if the number of partitions differs between two write operations, the 
output directory will contain a mix of results generated by both runs
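
    The second point can be reproduced outside Spark. The following is a 
minimal sketch on the local filesystem, assuming one part-NNNNN file per 
partition; write_parts and read_all are hypothetical helpers for illustration 
and are not part of Spark:

```python
import os
import tempfile


def write_parts(out_dir, partitions):
    """Write one part-NNNNN file per partition WITHOUT cleaning out_dir first.

    Hypothetical helper: it only mimics the one-file-per-partition layout.
    """
    os.makedirs(out_dir, exist_ok=True)
    for i, lines in enumerate(partitions):
        with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
            f.write("\n".join(lines) + "\n")


def read_all(out_dir):
    """Read every part file back, as a job consuming the directory would."""
    rows = []
    for name in sorted(os.listdir(out_dir)):
        with open(os.path.join(out_dir, name)) as f:
            rows.extend(f.read().splitlines())
    return rows


out_dir = os.path.join(tempfile.mkdtemp(), "output")
write_parts(out_dir, [["a1"], ["a2"], ["a3"]])  # first run: 3 partitions
write_parts(out_dir, [["b1"], ["b2"]])          # second run: 2 partitions

leftover = sorted(os.listdir(out_dir))
mixed = read_all(out_dir)
print(leftover)  # part-00002 from the first run is still there
print(mixed)     # rows from both runs are mixed together
```

    Here the second run replaces part-00000 and part-00001 but never touches 
part-00002, so anything reading the directory sees rows from both runs.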
    
    My fix includes:
    
    1. add new APIs with a flag that lets users specify whether to overwrite 
the directory:
    
    if the flag is set to true, the output directory is deleted first and the 
new data is then written, so that the output directory cannot contain results 
from multiple runs; 
    
    if the flag is set to false, Spark will throw an exception if the output 
directory already exists
    
    2. I didn't change saveNewHadoopAPI because, in the new API, the overwrite 
flag is defined by the implementation of RecordWriter, so we don't need to 
control that in Spark
    
    3. updated the Java API accordingly
    
    4. the default behaviour is to overwrite
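
    The flag semantics described in point 1 can be sketched as follows. This 
is an illustration on the local filesystem only, not the code in this pull 
request: save_parts and its overwrite parameter are hypothetical names, and 
the default of True reflects point 4.

```python
import os
import shutil
import tempfile


def save_parts(out_dir, partitions, overwrite=True):
    """Hypothetical helper mirroring the described flag; not Spark's API."""
    if os.path.exists(out_dir):
        if not overwrite:
            # flag is false: refuse to touch an existing directory
            raise FileExistsError(f"output directory {out_dir} already exists")
        # flag is true: delete first so no stale part files survive
        shutil.rmtree(out_dir)
    os.makedirs(out_dir)
    for i, lines in enumerate(partitions):
        with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
            f.write("\n".join(lines) + "\n")


out_dir = os.path.join(tempfile.mkdtemp(), "output")
save_parts(out_dir, [["a1"], ["a2"], ["a3"]])          # 3 partitions
save_parts(out_dir, [["b1"], ["b2"]], overwrite=True)  # 2 partitions, clean
files_after_overwrite = sorted(os.listdir(out_dir))

try:
    save_parts(out_dir, [["c1"]], overwrite=False)
    refused = False
except FileExistsError:
    refused = True

files_after_refusal = sorted(os.listdir(out_dir))
```

    Deleting before writing is what keeps a run with fewer partitions from 
leaving older part files behind; with the flag set to false the existing 
directory is left untouched.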
    
    -----
    
    Two questions
    
    1. should we deprecate the old APIs without such a flag?
    
    2. I noticed that Spark Streaming also calls these APIs. I think we don't 
need to change the related part in Streaming; is that right? @tdas 


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/CodingCat/incubator-spark SPARK-1100

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-spark/pull/626.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #626
    
----
commit 2ec87a1f63b4650036691e5bf5d484aae4e6d470
Author: CodingCat <zhunans...@gmail.com>
Date:   2014-02-21T05:32:17Z

    add new APIs to enable users define whether to overwrite the output 
directory

----

