Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22952#discussion_r237314690

    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -530,6 +530,12 @@ Here are the details of all the sources in Spark.
             "s3://a/dataset.txt"<br/>
             "s3n://a/b/dataset.txt"<br/>
             "s3a://a/b/c/dataset.txt"<br/>
    +        <code>cleanSource</code>: option to clean up completed files after processing.<br/>
    +        Available options are "archive", "delete", "no_op". If the option is not provided, the default value is "no_op".<br/>
    +        When "archive" is provided, the additional option <code>sourceArchiveDir</code> must be provided as well. The value of <code>sourceArchiveDir</code> must be outside of the source path, to ensure archived files are never included as new source files again.<br/>
    +        Spark will move source files while preserving their own path. For example, if the path of a source file is "/a/b/dataset.txt" and the path of the archive directory is "/archived/here", the file will be moved to "/archived/here/a/b/dataset.txt".<br/>
    +        NOTE: Both archiving (via moving) and deleting completed files introduce overhead (a slowdown) in each micro-batch, so you need to understand the cost of each operation in your file system before enabling this option. On the other hand, enabling this option reduces the cost of listing source files, which is considered a heavy operation.<br/>
    +        NOTE 2: The source path should not be used by multiple queries when this option is enabled, because source files will be moved or deleted, which may impact the other queries.

    --- End diff --

    NOTE 3: Both delete and move actions are best effort. Failing to delete or move files will not fail the streaming query.
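For illustration, the options discussed in this diff could be set on a file source roughly as follows. This is a sketch using the option names as proposed in the PR under review ("cleanSource", "sourceArchiveDir", values "archive"/"delete"/"no_op"); the final merged API may differ, and the paths and `SparkSession` are assumed:

```scala
// Sketch only: assumes an existing SparkSession `spark` and the option
// names/values exactly as proposed in this PR (they may change before merge).
val stream = spark.readStream
  .format("text")
  // Clean up completed files after processing: "archive", "delete",
  // or "no_op" (the default when the option is not provided).
  .option("cleanSource", "archive")
  // Required when cleanSource = "archive"; must be outside the source
  // path so archived files are never picked up as new source files.
  .option("sourceArchiveDir", "/archived/here")
  .load("/a/b")
```

Per NOTE 3 above, cleanup is best effort: a failure to delete or move a completed file would not fail the streaming query itself.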