Github user HeartSaVioR commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22952#discussion_r231695749

    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -530,6 +530,8 @@ Here are the details of all the sources in Spark.
             "s3://a/dataset.txt"<br/>
             "s3n://a/b/dataset.txt"<br/>
             "s3a://a/b/c/dataset.txt"<br/>
    +        <br/>
    +        <code>renameCompletedFiles</code>: whether to rename completed files from the previous batch (default: false). If this option is enabled, input files will be renamed with the additional postfix "_COMPLETED_". This is useful for cleaning up old input files to save storage space.
    --- End diff --

    @dongjoon-hyun For Storm, the HDFS spout renames an input file twice: 1. while it is in process, 2. when it is completed (the second step is actually not a rename but a move to an archive directory). The HDFS spout was created in 2015, so I don't expect there was deep consideration of cloud storage. For Flink I have no idea; I'll explore how they handle it.

    I think the feature is simply essential in ETL situations: a comment in the JIRA clearly shows why the feature is needed.
    https://issues.apache.org/jira/browse/SPARK-20568?focusedCommentId=16356929&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16356929
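    For readers following along, here is a minimal sketch of how the proposed option would be set on a file stream reader, assuming the option is named `renameCompletedFiles` exactly as in the diff above. The option is part of this pull request, not a released Spark API, so the snippet is illustrative only; the input path is a placeholder.

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("RenameCompletedFilesSketch")
      .getOrCreate()

    // Read a text file stream; "renameCompletedFiles" is the option proposed
    // in this PR (default false). With it enabled, input files from a finished
    // batch would be renamed with the "_COMPLETED_" postfix so they can be
    // identified and cleaned up later.
    val lines = spark.readStream
      .format("text")
      .option("renameCompletedFiles", "true")
      .load("s3a://a/b/c/")   // placeholder input directory
    ```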