[ https://issues.apache.org/jira/browse/SPARK-20568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356929#comment-16356929 ]

Julian commented on SPARK-20568:
--------------------------------

I've started with data ingestion using Structured Streaming, where we will be 
processing large amounts of CSV data (later XML via Kafka, at which point I 
hope to switch to the Kafka structured streaming source). In short, about 
6+ GB per minute that we need to process/transform through Spark. On 
smaller-scale user data sets I can understand wanting to keep the input data, 
but in large-scale ELT/ETL and streaming flows we typically want to archive 
only the last N hours/days for recovery purposes; the raw data is simply too 
large to keep (and the above is just one of the 30 data sources we have 
connected so far, with many more coming). Upstream systems can often re-push 
the data as well, so retention is not required for every source. Being able 
to move the data once it is processed would be very useful for us. For now I 
have no choice but to implement a solution myself, but at least I know I need 
to build something. I can imagine some simple "hdfs dfs -mv" commands 
achieving this, though I don't yet fully understand the relationship between 
the input files, each writer's close() method, and the parallel execution on 
the HDP cluster. I also notice that if the process dies and restarts, it 
currently reads the data again, which would be a disaster at this volume; I 
need to figure that out too. A sketch of the archival idea follows.
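
To make the archival idea concrete, here is a rough sketch of the kind of 
sweep I have in mind, written against the Hadoop FileSystem API rather than 
shelling out to "hdfs dfs -mv" (the paths and the 24-hour threshold are 
placeholders, not our actual layout). On the restart problem: as far as I 
understand, giving the sink a checkpointLocation should stop the query from 
re-reading files it has already committed, but I still need to verify that at 
our volumes.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ArchiveSweep {

  // Move input files older than maxAgeHours into an archive directory.
  // Same effect as "hdfs dfs -mv", but callable from a scheduled driver job.
  def sweep(inputDir: String, archiveDir: String, maxAgeHours: Int): Unit = {
    val fs      = FileSystem.get(new Configuration())
    val cutoff  = System.currentTimeMillis() - maxAgeHours * 3600L * 1000L
    val archive = new Path(archiveDir)
    if (!fs.exists(archive)) fs.mkdirs(archive)

    fs.listStatus(new Path(inputDir))
      .filter(s => s.isFile && s.getModificationTime < cutoff)
      .foreach { s =>
        // rename() is a metadata-only move within the same HDFS filesystem
        fs.rename(s.getPath, new Path(archive, s.getPath.getName))
      }
  }

  def main(args: Array[String]): Unit =
    sweep("hdfs:///data/incoming", "hdfs:///data/archive", maxAgeHours = 24)
}
{code}

The obvious risk is sweeping a file the source has not picked up yet, so the 
age threshold has to be comfortably larger than our worst-case end-to-end 
latency.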

> Delete files after processing in structured streaming
> -----------------------------------------------------
>
>                 Key: SPARK-20568
>                 URL: https://issues.apache.org/jira/browse/SPARK-20568
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Saul Shanabrook
>            Priority: Major
>
> It would be great to be able to delete files after processing them with 
> structured streaming.
> For example, I am reading in a bunch of JSON files and converting them into 
> Parquet. If the JSON files are not deleted after they are processed, it 
> quickly fills up my hard drive. I originally [posted this on Stack 
> Overflow|http://stackoverflow.com/q/43671757/907060] and was recommended to 
> make a feature request for it. 
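
For context, here is a minimal version of the JSON-to-Parquet pipeline 
described above (paths and schema are illustrative, not from the original 
report). With a checkpointLocation set, a restarted query resumes from its 
commit log rather than re-reading old files, but nothing in the API removes 
the already-ingested source files, which is what this issue asks for:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object JsonToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("json-to-parquet").getOrCreate()

    // Streaming file sources require an explicit schema.
    val schema = new StructType()
      .add("id", LongType)
      .add("body", StringType)

    val query = spark.readStream
      .schema(schema)
      .json("/data/json-in")                      // illustrative input dir
      .writeStream
      .format("parquet")
      .option("path", "/data/parquet-out")        // illustrative output dir
      .option("checkpointLocation", "/data/chk")  // restart resumes from here
      .start()

    query.awaitTermination()
  }
}
{code}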


