[ https://issues.apache.org/jira/browse/SPARK-29995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16981406#comment-16981406 ]

Jungtaek Lim commented on SPARK-29995:
--------------------------------------

The thing is, "exactly-once" on the file stream sink is achieved only when
the downstream query reads the metadata in the output directory. In other
words, if you delete some of the metadata, the query writing to that output
directory may crash if it's still running, and a downstream query will miss
reading quite a number of files from the output directory. That would be OK
if you're not reading the output from another Spark query.
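
A minimal sketch of both sides, assuming a parquet sink; the paths, source,
and trigger interval are placeholders, not from the issue:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("file-sink-sketch").getOrCreate()

// Writer side: the file sink records each committed batch's files in
// <output>/_spark_metadata, keyed by the checkpoint's batch IDs.
val query = spark.readStream
  .format("rate")  // placeholder source, just to have a stream
  .load()
  .writeStream
  .format("parquet")
  .option("path", "/data/out")                 // hypothetical output dir
  .option("checkpointLocation", "/data/ckpt")  // hypothetical checkpoint
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

// Reader side: a Spark batch read of the same directory notices
// _spark_metadata and lists only the files recorded there; that is where
// exactly-once comes from. Delete entries from _spark_metadata and this
// read silently skips the corresponding files.
val df = spark.read.parquet("/data/out")
{code}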

> Structured Streaming file-sink log grow indefinitely
> ----------------------------------------------------
>
>                 Key: SPARK-29995
>                 URL: https://issues.apache.org/jira/browse/SPARK-29995
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.4.0
>            Reporter: zhang liming
>            Priority: Major
>         Attachments: file.png, task.png
>
>
> When I use the structured streaming parquet sink, I've noticed that the
> file-sink log files keep getting bigger. They live in
> {$checkpoint/_spark_metadata/}, and I don't think this is reasonable.
> And when the log files are compacted, task batches take longer to run,
> as in the screenshot below.
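
On the quoted report: the cadence of that compaction is governed by
internal, undocumented configs. A sketch below, assuming the names as they
appear in Spark's SQLConf (not public API, and they may change); note the
compacted log still carries the full file history, so it grows regardless:

{code:scala}
// Assumption: internal knobs around the file sink metadata log.
// Compact every N batches (superseded log files become deletable):
spark.conf.set("spark.sql.streaming.fileSink.log.compactInterval", "10")
// Whether superseded (pre-compaction) log files are deleted at all:
spark.conf.set("spark.sql.streaming.fileSink.log.deletion", "true")
// How long superseded log files are kept before cleanup:
spark.conf.set("spark.sql.streaming.fileSink.log.cleanupDelay", "10m")
{code}

Raising the compact interval only spaces out the slow batches; it doesn't
bound the size of the compact file itself, which is the growth this issue
is about.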


