Jungtaek Lim created SPARK-27188:
------------------------------------

             Summary: FileStreamSink: provide a new option to disable metadata log
                 Key: SPARK-27188
                 URL: https://issues.apache.org/jira/browse/SPARK-27188
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 3.0.0
            Reporter: Jungtaek Lim


In SPARK-24295 we noted that various end users are struggling with a huge 
FileStreamSink metadata log. Unfortunately, because arbitrary readers rely on 
the metadata log to determine which files are safe to read (to ensure 
'exactly-once'), pruning the metadata log is not trivial to implement.

While we may be able to check for deleted output files in FileStreamSink and 
drop them when compacting the metadata, that operation would add overhead to 
the running query. (I'll try to address this in another issue, though.)

Back to this issue: 'exactly-once' via the metadata log is only possible when 
the output directory is read by Spark; for other readers the guarantee is 
weaker. I think we could provide an option to disable the metadata log as a 
workaround to mitigate the problem.
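
To illustrate why only log-aware readers benefit from the metadata log, here is 
a minimal, self-contained sketch in plain Python. It is a simplified model, not 
Spark's implementation: the `_metadata_log` directory name, the JSON entry 
format, and the helpers `commit_batch`, `committed_files` and `listed_files` 
are all hypothetical and exist only for this example.

```python
import json
import os
import tempfile

LOG_DIR = "_metadata_log"  # hypothetical name; not Spark's actual on-disk layout


def commit_batch(out_dir, batch_id, files):
    """Write the data files, then record their names in a per-batch log entry."""
    log_dir = os.path.join(out_dir, LOG_DIR)
    os.makedirs(log_dir, exist_ok=True)
    for name, content in files.items():
        with open(os.path.join(out_dir, name), "w") as f:
            f.write(content)
    # The log entry is written last; the batch counts as committed only after this.
    with open(os.path.join(log_dir, f"{batch_id}.json"), "w") as f:
        json.dump(sorted(files), f)


def committed_files(out_dir):
    """Log-aware reader: return only the files named in some log entry."""
    log_dir = os.path.join(out_dir, LOG_DIR)
    if not os.path.isdir(log_dir):
        return []
    names = []
    for entry in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, entry)) as f:
            names.extend(json.load(f))
    return sorted(names)


def listed_files(out_dir):
    """Log-unaware reader: return every regular file found in the directory."""
    return sorted(
        name for name in os.listdir(out_dir)
        if os.path.isfile(os.path.join(out_dir, name))
    )


def demo():
    with tempfile.TemporaryDirectory() as out_dir:
        commit_batch(out_dir, 0, {"part-0000.txt": "a", "part-0001.txt": "b"})
        # Simulate a batch that failed midway: a file exists but was never logged.
        with open(os.path.join(out_dir, "part-0002.txt"), "w") as f:
            f.write("partial")
        return committed_files(out_dir), listed_files(out_dir)
```

In this model the log-aware reader skips the unlogged `part-0002.txt`, while a 
reader that simply lists the directory picks it up. The log also has to name 
every file ever committed, which is why it keeps growing as the query runs.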




