[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772296#comment-16772296 ]

Jungtaek Lim commented on SPARK-24295:
--------------------------------------

Please correct me if I'm missing something here. I just skimmed the codebase to 
see how FileStreamSinkLog is used, and it looks like it exists so that 
downstream queries can quickly read the list of source files when queries are 
chained.
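To illustrate the chaining I mean, here is a minimal sketch (the paths, the 
schema, and the toy rate source are illustrative only):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StructType, TimestampType}

val spark = SparkSession.builder().appName("chained-example").getOrCreate()
val outDir = "/tmp/sink-out"   // illustrative path

// Query A writes its output as parquet; the file sink records every file it
// commits in <outDir>/_spark_metadata (the FileStreamSinkLog).
val queryA = spark.readStream
  .format("rate").load()       // toy source, just for illustration
  .writeStream
  .format("parquet")
  .option("path", outDir)
  .option("checkpointLocation", "/tmp/ckpt-a")
  .start()

// Query B uses the same directory as a file source. Because _spark_metadata
// is present, B discovers new files from A's sink log instead of listing the
// directory - that is the speed-up, and also the reason the log can't simply
// be purged while any downstream reader may still need it.
val rateSchema = new StructType()
  .add("timestamp", TimestampType)
  .add("value", LongType)
val queryB = spark.readStream
  .schema(rateSchema)
  .parquet(outDir)
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/ckpt-b")
  .start()
{code}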

If I'm not mistaken, Spark can't safely purge this metadata, because it can't 
determine which files have been processed by all queries; in fact that point 
never arrives, since a new query can be started at any time.

If end users periodically delete files according to their data retention 
policy, the metadata should be purged as well, but Spark cannot do that because 
it doesn't even know the deletion is happening. So IMHO what [~iqbal_khattra] 
is doing actually seems to be the right approach - it's just not ideal, because 
it requires end users to understand the metadata format and modify it 
themselves.
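For reference, a sketch of what that manual modification could look like. It 
assumes the v1 compact-file layout (a version header line followed by one JSON 
entry per tracked file, each carrying a "path" field), which is an internal 
detail rather than a public API - verify it against your Spark version and back 
up _spark_metadata before trying anything like this:

{code:scala}
import scala.collection.JavaConverters._
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path => HadoopPath}
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

implicit val formats: DefaultFormats.type = DefaultFormats

// Hypothetical location of the latest compact file.
val compact = Paths.get("/data/out/_spark_metadata/19.compact")
val lines   = Files.readAllLines(compact, StandardCharsets.UTF_8).asScala.toList
val header  = lines.head                  // expected version marker, e.g. "v1"
val conf    = new Configuration()

// Keep only entries whose referenced output file still exists, dropping the
// ones already removed by the external retention job.
val kept = lines.tail.filter(_.trim.nonEmpty).filter { line =>
  val p = new HadoopPath((parse(line) \ "path").extract[String])
  p.getFileSystem(conf).exists(p)
}

Files.write(compact, (header +: kept).asJava, StandardCharsets.UTF_8)
{code}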

I can see the benefit of the file sink metadata (it avoids listing files, which 
can take too long), but given that it can only grow, and that it falls out of 
sync when separate processes (like data retention jobs) delete part of the sink 
output, we may need to either implement data retention in the file sink itself 
(though it would feel very strange for a sink to remove its own output) and 
purge metadata as files are deleted, or just not rely on sink metadata at all.
(I feel we may need an explicit option to skip the file stream sink metadata - 
for both source and sink.)

[~iqbal_khattra]
Could the in-memory file index help in your case in the long run? Passing a 
glob path would skip the file sink metadata.
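Something like this, sketched with batch reads (paths are illustrative; the 
same path resolution applies when the directory is used as a source):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("glob-read").getOrCreate()

// Plain directory path: Spark sees _spark_metadata under the dir and reads
// only the files recorded in the sink log.
val viaSinkLog = spark.read.parquet("/tmp/sink-out")

// Glob path: Spark falls back to an in-memory file index built from a normal
// directory listing, so the sink log is skipped entirely.
val viaListing = spark.read.parquet("/tmp/sink-out/part-*")
{code}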

> Purge Structured streaming FileStreamSinkLog metadata compact file data.
> ------------------------------------------------------------------------
>
>                 Key: SPARK-24295
>                 URL: https://issues.apache.org/jira/browse/SPARK-24295
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Iqbal Singh
>            Priority: Major
>
> FileStreamSinkLog metadata files are concatenated into a single compact file 
> after each configured compact interval.
> For long-running jobs, the compact file can grow to tens of GBs, causing 
> slowness when reading data from the sink output dir, since Spark defaults to 
> reading through the "_spark_metadata" dir.
> We need functionality to purge the compact file data.
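For reference, the compact interval mentioned in the description is 
configurable; a minimal sketch of the relevant settings (defaults shown):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sink-log-settings").getOrCreate()

// Every compactInterval batches, the per-batch sink log files are folded into
// a single .compact file; old per-batch files are then eligible for deletion,
// but the .compact file itself is never purged - the growth this ticket
// describes. Values shown are the defaults.
spark.conf.set("spark.sql.streaming.fileSink.log.compactInterval", "10")
spark.conf.set("spark.sql.streaming.fileSink.log.deletion", "true")
{code}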


