GitHub user HeartSaVioR commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22952#discussion_r231429484
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -530,6 +530,8 @@ Here are the details of all the sources in Spark.
             "s3://a/dataset.txt"<br/>
             "s3n://a/b/dataset.txt"<br/>
             "s3a://a/b/c/dataset.txt"<br/>
    +        <br/>
    +        <code>renameCompletedFiles</code>: whether to rename completed 
files in previous batch (default: false). If the option is enabled, input file 
will be renamed with additional postfix "_COMPLETED_". This is useful to clean 
up old input files to save space in storage.
    --- End diff --
    
    Hi @dongjoon-hyun , thanks for raising a good point! I was only 
considering the local filesystem/HDFS case and am not familiar with cloud 
environments.
    
    I think we have a few possible options here:
    
    1. Rename in a background thread.
    
    For option 1, we may want to cap the number of files that can be enqueued, 
and handle some of them synchronously once the cap is reached. We may also need 
to postpone JVM shutdown until all enqueued files are renamed.
    
    2. Provide an additional option: delete (mutually exclusive with rename).
    
    In practice, end users are expected to do one of two things with completed 
files: 1. move them to an archive directory (with or without compression), or 
2. delete them periodically. If moving/renaming has a non-trivial cost, end 
users may prefer to delete the files directly without backing them up.
    
    3. Document the overhead in the description of the option.
    
    While we cannot say precisely how large the cost is, we can explain that 
the cleanup operation may affect batch processing.
    
    The three options above are not mutually exclusive.
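    
    To make option 1 more concrete, here is a minimal Java sketch of a bounded 
background renamer. It is only an illustration under several assumptions: 
`BackgroundRenamer`, `submit`, `shutdownAndWait` and `registerShutdownHook` are 
hypothetical names that do not exist in Spark, and it uses `java.nio.file` on a 
local filesystem, whereas a real implementation would presumably go through the 
Hadoop `FileSystem` API. The cap on enqueued files is a bounded queue, and the 
synchronous fallback comes from `ThreadPoolExecutor.CallerRunsPolicy`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Spark): renames completed files off the batch thread. */
final class BackgroundRenamer {
    // Postfix taken from the option description in the diff above.
    static final String SUFFIX = "_COMPLETED_";

    private final ThreadPoolExecutor executor;

    BackgroundRenamer(int maxQueuedFiles) {
        // One worker thread with a bounded queue. When the queue is full,
        // CallerRunsPolicy makes the submitting thread perform the rename itself,
        // so the overflow is handled synchronously.
        this.executor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(maxQueuedFiles),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /** Enqueues a best-effort rename of file to file + SUFFIX. */
    void submit(Path file) {
        executor.execute(() -> {
            try {
                Files.move(file, file.resolveSibling(file.getFileName() + SUFFIX));
            } catch (IOException e) {
                // Cleanup is best effort in this sketch: report and continue.
                System.err.println("Failed to rename " + file + ": " + e);
            }
        });
    }

    /** Stops accepting work and waits for already enqueued renames; returns false on timeout. */
    boolean shutdownAndWait(long timeout, TimeUnit unit) throws InterruptedException {
        executor.shutdown();
        return executor.awaitTermination(timeout, unit);
    }

    /** Delays JVM exit (up to the timeout) until enqueued renames have been processed. */
    void registerShutdownHook(long timeout, TimeUnit unit) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                shutdownAndWait(timeout, unit);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
    }
}
```

    With this shape, the batch thread only pays the rename cost when the queue 
is already full. Note that `CallerRunsPolicy` silently drops tasks submitted 
after shutdown, so files completed during shutdown would stay unrenamed.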
    
    cc @steveloughran - I think you're an expert on cloud storage: could you 
share your thoughts on this?
    Also cc @zsxwing in case this was missed.


---
