[ https://issues.apache.org/jira/browse/SPARK-20568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16581979#comment-16581979 ]
Jungtaek Lim commented on SPARK-20568:
--------------------------------------

To me, this feature looks like a missing piece for streaming queries. Unlike a batch query, a streaming query has checkpoints that refer to specific batches, and in theory we should be able to remove files that were processed before the earliest available batch in the checkpoint, because the query cannot be rolled back to any point earlier than that batch. Stream processing assumes that events arrive indefinitely, and under that assumption Spark should provide a safe way to move or remove old events that will never be accessed again. I believe the Spark Kafka connector can already deal with Kafka retention; the text source should support this as well.

[~srowen] [~zsxwing] Could we revisit this? If we see a benefit in supporting it, I'll think about how to do it and provide a patch. For reference, a rough sketch of what users can do today is included after the quoted issue description below.

> Delete files after processing in structured streaming
> -----------------------------------------------------
>
>                 Key: SPARK-20568
>                 URL: https://issues.apache.org/jira/browse/SPARK-20568
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 2.1.0, 2.2.1
>            Reporter: Saul Shanabrook
>            Priority: Major
>
> It would be great to be able to delete files after processing them with structured streaming.
> For example, I am reading in a bunch of JSON files and converting them into Parquet. If the JSON files are not deleted after they are processed, they quickly fill up my hard drive. I originally [posted this on Stack Overflow|http://stackoverflow.com/q/43671757/907060] and was recommended to make a feature request for it.
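Below is a minimal, hedged sketch of such a user-level cleanup, based on the JSON-to-Parquet scenario from the description. It assumes foreachBatch (available since Spark 2.4) and the built-in input_file_name() function; all paths here are hypothetical. This is not the proposed built-in mechanism: the write and the delete are not atomic, so a crash between them can leave already-converted JSON behind for reprocessing, which is exactly the gap a checkpoint-aware cleanup inside the source would close.

{code:scala}
// Hypothetical paths throughout; adjust for your environment.
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.input_file_name

val spark = SparkSession.builder().appName("json-to-parquet-cleanup").getOrCreate()
import spark.implicits._

// File stream sources need a schema up front; infer it once from existing files.
val schema = spark.read.json("/data/incoming").schema

// Runs on the driver for every micro-batch: persist the batch as Parquet
// first, then delete the input files that fed it. Not atomic; see caveat above.
def processBatch(batch: DataFrame, batchId: Long): Unit = {
  // Capture which source files contributed rows to this micro-batch.
  val files = batch.select(input_file_name()).distinct().as[String].collect()
  batch.write.mode("append").parquet("/data/parquet")
  val conf = spark.sparkContext.hadoopConfiguration
  files.foreach { f =>
    val p = new Path(f)
    p.getFileSystem(conf).delete(p, false)
  }
}

val query = spark.readStream
  .schema(schema)
  .json("/data/incoming")
  .writeStream
  .option("checkpointLocation", "/data/checkpoints/json-to-parquet")
  .foreachBatch(processBatch _)
  .start()

query.awaitTermination()
{code}

A built-in mechanism could instead consult the file source's own checkpoint metadata and archive or delete files only once they fall behind the earliest batch that can still be replayed, keeping the cleanup consistent with recovery.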