Hi!

I would like to raise an issue we are currently facing:
The Structured Streaming default CheckpointFileManager leaks .crc files by
leaving them behind after users of this class (like
HDFSBackedStateStoreProvider) apply their cleanup methods.

This results in the unbounded creation of tiny files, each of which
consumes at least one storage block, and, in our case, this degrades file
system performance.

We correlated the processedRowsPerSecond reported by the
StreamingQueryProgress against a count of the .crc files in the storage
directory (checkpoint + state store). The performance impact we observe is
dramatic.
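
For anyone who wants to check their own deployment, the count can be taken with a small script along these lines. This is only a minimal sketch, not the tooling we used: the function name `count_crc_files` is made up for illustration, and it assumes the storage directory is mounted as a regular local path and that checksum files follow the Hadoop convention of `.<data file>.crc` sitting next to `<data file>`.

```python
import os


def count_crc_files(root):
    """Walk `root` and return (total, orphaned) counts of .crc files.

    A .crc file is counted as orphaned when the data file it would
    belong to is no longer present in the same directory.
    """
    total = 0
    orphaned = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        names = set(filenames)
        for name in filenames:
            if not name.endswith(".crc"):
                continue
            total += 1
            # Assumption: Hadoop-style naming, ".<data file>.crc".
            if name.startswith("."):
                data_name = name[1:-len(".crc")]
                if data_name not in names:
                    orphaned += 1
    return total, orphaned
```

Calling it on the checkpoint location at regular intervals and logging the result next to processedRowsPerSecond should be enough to reproduce the kind of comparison described above.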

We are running on Kubernetes, using GlusterFS as the shared storage
provider.

[image: out processedRowsPerSecond vs. files in storage_process.png]

I have created a JIRA ticket with additional detail:

https://issues.apache.org/jira/browse/SPARK-17475

This is also related to an earlier discussion about the state store
unbounded disk-size growth, which was left unresolved back then:
http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-State-Store-storage-behavior-for-the-Stream-Deduplication-function-td34883.html

If there's any additional detail I should add/research, please let me know.

kind regards, Gerard.
