[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786569#comment-16786569 ]

Jungtaek Lim commented on SPARK-24295:
--------------------------------------

[~iqbal_khattra] [~alfredo-gimenez-bv]

Would we be happy if the File Stream Sink periodically checked the existence of 
output files (perhaps starting from the older entries?) and marked the missing 
ones as deleted, so that the next compaction would filter them out? If this 
sounds good, I'll take a step forward.
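
To illustrate the idea, here is a minimal sketch, not Spark's internal API: 
SinkEntry and the "add"/"delete" action strings are hypothetical stand-ins for 
the sink log's entry type (the real one is SinkFileStatus). Entries whose 
output files no longer exist on disk are rewritten with a delete action, and 
compaction then drops them, so the compact file stops accumulating stale 
entries.

{code:scala}
import java.nio.file.{Files, Paths}

// Hypothetical stand-in for the sink log entry type.
case class SinkEntry(path: String, action: String)

object SinkLogRetentionSketch {
  val AddAction = "add"
  val DeleteAction = "delete"

  // Periodic check: any "add" entry whose output file no longer exists
  // is rewritten as a "delete" entry.
  def markMissingAsDeleted(entries: Seq[SinkEntry]): Seq[SinkEntry] =
    entries.map { e =>
      if (e.action == AddAction && !Files.exists(Paths.get(e.path))) {
        e.copy(action = DeleteAction)
      } else e
    }

  // Compaction then filters the deleted entries out, keeping the compact
  // file bounded by the set of files that still exist.
  def compact(entries: Seq[SinkEntry]): Seq[SinkEntry] =
    entries.filterNot(_.action == DeleteAction)

  def main(args: Array[String]): Unit = {
    val log = Seq(
      SinkEntry("/tmp/out/part-00000", AddAction),
      SinkEntry("/tmp/out/part-00001", AddAction))
    compact(markMissingAsDeleted(log)).foreach(println)
  }
}
{code}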

cc. to [~tdas] [~zsxwing] [~joseph.torres] since this looks like an ongoing 
issue that has affected multiple end users.

> Purge Structured streaming FileStreamSinkLog metadata compact file data.
> ------------------------------------------------------------------------
>
>                 Key: SPARK-24295
>                 URL: https://issues.apache.org/jira/browse/SPARK-24295
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Iqbal Singh
>            Priority: Major
>         Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz
>
>
> FileStreamSinkLog metadata logs are concatenated into a single compact file 
> after a defined compact interval.
> For long-running jobs, the compact file can grow to tens of GBs, causing 
> slowness while reading the data from the FileStreamSinkLog dir, since Spark 
> defaults to the "_spark_metadata" dir for the read.
> We need functionality to purge data from the compact file.
>  
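
For context on the quoted report: how often the sink log is compacted is 
governed by spark.sql.streaming.fileSink.log.compactInterval (default 10 
batches). A minimal sketch of setting it follows; note this only changes the 
compaction cadence, not the unbounded growth of the compact file itself, which 
is the problem described here.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("file-sink-log-demo")
  .master("local[*]")
  // Compact the file sink metadata log every 20 batches instead of the
  // default 10; the compact file still retains every "add" entry ever
  // written, which is why it keeps growing on long-running jobs.
  .config("spark.sql.streaming.fileSink.log.compactInterval", "20")
  .getOrCreate()
{code}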


