Github user HeartSaVioR commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22952#discussion_r231429484

    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -530,6 +530,8 @@ Here are the details of all the sources in Spark.
            "s3://a/dataset.txt"<br/>
            "s3n://a/b/dataset.txt"<br/>
            "s3a://a/b/c/dataset.txt"<br/>
    +        <br/>
    +        <code>renameCompletedFiles</code>: whether to rename completed files from the previous batch (default: false). When this option is enabled, input files are renamed with the additional suffix "_COMPLETED_". This is useful for cleaning up old input files to save storage space.
    --- End diff --

    Hi @dongjoon-hyun, thanks for raising a good point! I was only considering the filesystem/HDFS case and am not familiar with cloud environments. I think we have a few possible options here:

    1. Rename in a background thread.
       For this option, we may want to cap the number of files queued for renaming, and handle some of them synchronously once the cap is reached. We may also need to postpone JVM shutdown until all enqueued files have been renamed.

    2. Provide an additional option to delete instead (the two options would be mutually exclusive).
       The actions end users are expected to take are either (a) moving files to an archive directory (compressed or not) or (b) deleting them periodically. If moving/renaming incurs non-trivial cost, end users may prefer to just delete files directly without backing them up.

    3. Document the overhead in the option's description.
       While we cannot say precisely how large the cost is, we can note the fact that the cleanup operation may affect the processing of a batch.

    The approaches above are not mutually exclusive.

    cc @steveloughran - I think you're an expert on cloud storage: could you share your thoughts on this? Also cc @zsxwing in case I'm missing something.
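    Option 1 could be sketched roughly as below. This is a minimal, hypothetical illustration only, not the proposed implementation: the `CompletedFileRenamer` class name, the cap value, and the in-memory `renamed` log (standing in for a real `FileSystem.rename` call) are all assumptions for the sake of the example.

    ```java
    import java.util.Queue;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch of option 1: rename completed source files on a
    // background thread, bounded by a queue; when the queue is full, the
    // caller renames synchronously so pending work cannot grow without limit.
    public class CompletedFileRenamer {
        private final BlockingQueue<String> pending;
        // In-memory log standing in for a real FileSystem.rename() call.
        public final Queue<String> renamed = new ConcurrentLinkedQueue<>();

        public CompletedFileRenamer(int maxPending) {
            pending = new ArrayBlockingQueue<>(maxPending);
            Thread worker = new Thread(() -> {
                while (true) {
                    try {
                        String path = pending.poll(100, TimeUnit.MILLISECONDS);
                        if (path != null) rename(path);
                    } catch (InterruptedException e) {
                        return; // stop the worker when interrupted
                    }
                }
            });
            // A real implementation would instead drain the queue on JVM
            // shutdown (e.g. via a shutdown hook) so no renames are lost.
            worker.setDaemon(true);
            worker.start();
        }

        private void rename(String path) {
            // Stand-in for fs.rename(path, path + "_COMPLETED_").
            renamed.add(path + "_COMPLETED_");
        }

        public void submit(String path) {
            if (!pending.offer(path)) {
                // Queue at capacity: fall back to a synchronous rename on
                // the calling (batch) thread, which is the overhead that
                // option 3 suggests documenting.
                rename(path);
            }
        }
    }
    ```

    The bounded queue is what keeps memory usage predictable; the trade-off is that once the cap is hit, batch processing pays the rename cost directly, which is exactly the cloud-storage concern raised above.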