[ 
https://issues.apache.org/jira/browse/SPARK-20568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672805#comment-16672805
 ] 

Jungtaek Lim commented on SPARK-20568:
--------------------------------------

[~zsxwing]
I've thought about it a bit. I'm not familiar with the file stream source, but 
if I'm not missing something, a file has no "in-progress" state: once a file is 
included in a batch, it is fully processed in that batch.

So we have two options here:

1. Delete (or move out) files that are listed in the log files of finished 
batches under the "sources" directory in the checkpoint.
2. Delete (or move out) files that are included in the "current" batch as soon 
as that batch completes.
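
To make option 1 a bit more concrete, here is a rough sketch in plain Python 
(not Spark code). It assumes a hypothetical log layout: one file per batch, 
named by batch id, with a version line followed by one JSON object per line 
carrying a "path" field. The function name and this layout are assumptions for 
illustration only; the real on-disk format of the checkpoint may differ.

```python
import json
from pathlib import Path


def files_in_finished_batches(log_dir, last_finished_batch):
    """Return input paths recorded for batches 0..last_finished_batch.

    Assumed (hypothetical) layout: log_dir holds one file per batch, named by
    its batch id. Each file starts with a version line, followed by one JSON
    object per line that has a "path" field.
    """
    batch_files = [
        p for p in Path(log_dir).iterdir()
        if p.name.isdigit() and int(p.name) <= last_finished_batch
    ]
    # Sort numerically so that batch 10 comes after batch 2.
    batch_files.sort(key=lambda p: int(p.name))

    paths = []
    for batch_file in batch_files:
        lines = batch_file.read_text().splitlines()
        for line in lines[1:]:  # skip the assumed version header
            if line.strip():
                paths.append(json.loads(line)["path"])
    return paths
```

The returned paths would be the candidates for deletion or archiving. Option 2 
would not need this lookup at all, since the file list of the current batch is 
already known when the batch completes.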

If we move files out to some directory like "archive", I guess option 2 is 
safe. Moved files can be moved back to re-run a previous batch if end users 
really want that. Actually, I haven't heard of real cases where someone removes 
batches from the checkpoint directory to re-run a previous batch.
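
A minimal sketch of the "move to archive, move back if needed" idea, again in 
plain Python rather than Spark code. The function names and the choice to keep 
each file's relative path under the archive directory are my assumptions, not 
an existing Spark API.

```python
import shutil
from pathlib import Path


def archive_batch_files(batch_files, source_dir, archive_dir):
    """Move files of a completed batch from source_dir into archive_dir.

    The path relative to source_dir is preserved under archive_dir so the
    files can be moved back to their original location later.
    """
    source_dir, archive_dir = Path(source_dir), Path(archive_dir)
    moved = []
    for f in map(Path, batch_files):
        target = archive_dir / f.relative_to(source_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(f), str(target))
        moved.append(target)
    return moved


def restore_batch_files(archived_files, source_dir, archive_dir):
    """Move archived files back under source_dir, e.g. to re-run a batch."""
    source_dir, archive_dir = Path(source_dir), Path(archive_dir)
    restored = []
    for f in map(Path, archived_files):
        target = source_dir / f.relative_to(archive_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(f), str(target))
        restored.append(target)
    return restored
```

With option 2, something like archive_batch_files would be called right after 
a batch completes, with the list of files that batch read.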

What do you think about the options?

> Delete files after processing in structured streaming
> -----------------------------------------------------
>
>                 Key: SPARK-20568
>                 URL: https://issues.apache.org/jira/browse/SPARK-20568
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 2.1.0, 2.2.1
>            Reporter: Saul Shanabrook
>            Priority: Major
>
> It would be great to be able to delete files after processing them with 
> structured streaming.
> For example, I am reading in a bunch of JSON files and converting them into 
> Parquet. If the JSON files are not deleted after they are processed, they 
> quickly fill up my hard drive. I originally [posted this on Stack 
> Overflow|http://stackoverflow.com/q/43671757/907060] and was recommended to 
> make a feature request for it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
