Hi! I would like to raise an issue we are currently facing: the default Structured Streaming CheckpointFileManager leaks .crc files. They are left behind after users of this class (such as HDFSBackedStateStoreProvider) run their cleanup methods.
This results in an unbounded number of tiny files that each consume at least one storage block and, in our case, degrade file system performance. We correlated the processedRowsPerSecond reported by StreamingQueryProgress against a count of the .crc files in the storage directory (checkpoint + state store). The performance impact we observe is dramatic. We are running on Kubernetes, using GlusterFS as the shared storage provider.

[image: out processedRowsPerSecond vs. files in storage_process.png]

I have created a JIRA ticket with additional detail: https://issues.apache.org/jira/browse/SPARK-17475

This is also related to an earlier discussion about unbounded disk-size growth of the state store, which was left unresolved at the time: http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-State-Store-storage-behavior-for-the-Stream-Deduplication-function-td34883.html

If there is any additional detail I should add or research, please let me know.

Kind regards,
Gerard.
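
P.S. For anyone who wants to check their own deployment, here is a minimal sketch of how the .crc file count could be taken. It assumes the checkpoint and state store directory is reachable as a local path (for example through a mounted shared volume). The function name count_crc_files and the directory layout in the demo are hypothetical, made up for illustration, and are not part of Spark.

```python
import os
import tempfile


def count_crc_files(root):
    """Recursively count files whose name ends in .crc under root."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += sum(1 for name in filenames if name.endswith(".crc"))
    return total


if __name__ == "__main__":
    # Demo on a throwaway directory; the layout below is only an
    # illustrative assumption, not the exact structure Spark writes.
    with tempfile.TemporaryDirectory() as root:
        state_dir = os.path.join(root, "state", "0", "0")
        os.makedirs(state_dir)
        for name in ("1.delta", ".1.delta.crc", ".2.delta.crc"):
            open(os.path.join(state_dir, name), "w").close()
        print(count_crc_files(root))
```

Sampling this count periodically and logging it next to processedRowsPerSecond is one way to reproduce the kind of correlation described above.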