[ https://issues.apache.org/jira/browse/SPARK-20568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16581979#comment-16581979 ]
Jungtaek Lim commented on SPARK-20568:
--------------------------------------

To me, this feature looks like a missing piece for streaming queries. Unlike a batch query, a streaming query has checkpoints that refer to specific batches, and in theory we should be able to remove files that were processed before the earliest available batch in the checkpoint, because the query cannot be rolled back to any point earlier than that batch. Stream processing assumes that events arrive indefinitely, and under that assumption Spark should provide a safe way to move or remove old events that will never be accessed again. I believe the Spark Kafka connector can already deal with Kafka retention; the text source should support this as well.

[~srowen] [~zsxwing] Could we revisit this? If we see a benefit in supporting it, I'll think about how to do it and provide a patch. For reference, a rough sketch of what users can do today is included after the quoted issue description below.

> Delete files after processing in structured streaming
> -----------------------------------------------------
>
>                 Key: SPARK-20568
>                 URL: https://issues.apache.org/jira/browse/SPARK-20568
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 2.1.0, 2.2.1
>            Reporter: Saul Shanabrook
>            Priority: Major
>
> It would be great to be able to delete files after processing them with structured streaming.
> For example, I am reading in a bunch of JSON files and converting them into Parquet. If the JSON files are not deleted after they are processed, they quickly fill up my hard drive. I originally [posted this on Stack Overflow|http://stackoverflow.com/q/43671757/907060] and was recommended to make a feature request for it.
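Below is a minimal, hedged sketch of such a user-level cleanup, based on the JSON-to-Parquet scenario from the description. It assumes foreachBatch (available since Spark 2.4) and the built-in input_file_name() function; all paths here are hypothetical. This is not the proposed built-in mechanism: the write and the delete are not atomic, so a crash between them can leave already-converted JSON behind for reprocessing, which is exactly the gap a checkpoint-aware cleanup inside the source would close.

{code:scala}
// Hypothetical paths throughout; adjust for your environment.
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.input_file_name

val spark = SparkSession.builder().appName("json-to-parquet-cleanup").getOrCreate()
import spark.implicits._

// File stream sources need a schema up front; infer it once from existing files.
val schema = spark.read.json("/data/incoming").schema

// Runs on the driver for every micro-batch: persist the batch as Parquet
// first, then delete the input files that fed it. Not atomic; see caveat above.
def processBatch(batch: DataFrame, batchId: Long): Unit = {
  // Capture which source files contributed rows to this micro-batch.
  val files = batch.select(input_file_name()).distinct().as[String].collect()
  batch.write.mode("append").parquet("/data/parquet")
  val conf = spark.sparkContext.hadoopConfiguration
  files.foreach { f =>
    val p = new Path(f)
    p.getFileSystem(conf).delete(p, false)
  }
}

val query = spark.readStream
  .schema(schema)
  .json("/data/incoming")
  .writeStream
  .option("checkpointLocation", "/data/checkpoints/json-to-parquet")
  .foreachBatch(processBatch _)
  .start()

query.awaitTermination()
{code}

A built-in mechanism could instead consult the file source's own checkpoint metadata and archive or delete files only once they fall behind the earliest batch that can still be replayed, keeping the cleanup consistent with recovery.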