Nice finding!

Since you already pointed to a previous issue that fixed a similar problem,
it should be easy for you to craft the patch and verify whether the fix
resolves your issue. Looking forward to seeing your patch.
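
To check whether a patch actually stops the leak, one option is to count the
leftover .crc files under the checkpoint directory before and after running
the query. Below is a minimal sketch in Python, assuming the checkpoint
location is reachable as a local or mounted path (for example a GlusterFS
mount). The function name count_crc_files and the example path are
hypothetical and not part of Spark.

```python
import os


def count_crc_files(root):
    """Count files whose name ends in .crc anywhere under root (recursive)."""
    total = 0
    # os.walk yields nothing for a missing directory, so the result is 0 then.
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += sum(1 for name in filenames if name.endswith(".crc"))
    return total


# Hypothetical usage; replace the path with your own checkpoint location:
# print(count_crc_files("/mnt/checkpoints/my-query"))
```

Sampling this count periodically next to processedRowsPerSecond from
StreamingQueryProgress should show whether the number of .crc files still
grows without bound once the fix is applied.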

Thanks,
Jungtaek Lim (HeartSaVioR)

On Wed, Jun 12, 2019 at 8:23 PM Gerard Maas <gerard.m...@gmail.com> wrote:

> Oops - I linked the wrong JIRA ticket (that other one is related). The
> correct one is:
>
> https://issues.apache.org/jira/browse/SPARK-28025
>
> On Wed, Jun 12, 2019 at 1:21 PM Gerard Maas <gerard.m...@gmail.com> wrote:
>
>> Hi!
>> I would like to socialize this issue we are currently facing:
>> The Structured Streaming default CheckpointFileManager leaks .crc files
>> by leaving them behind after users of this class (like
>> HDFSBackedStateStoreProvider) apply their cleanup methods.
>>
>> This results in the unbounded creation of tiny files that eat away storage
>> by the block and, in our case, degrade file system performance.
>>
>> We correlated the processedRowsPerSecond reported by the
>> StreamingQueryProgress against a count of the .crc files in the storage
>> directory (checkpoint + state store). The performance impact we observe is
>> dramatic.
>>
>> We are running on Kubernetes, using GlusterFS as the shared storage
>> provider.
>> [image: out processedRowsPerSecond vs. files in storage_process.png]
>> I have created a JIRA ticket with additional detail:
>>
>> https://issues.apache.org/jira/browse/SPARK-17475
>>
>> This is also related to an earlier discussion about the state store
>> unbounded disk-size growth, which was left unresolved back then:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-State-Store-storage-behavior-for-the-Stream-Deduplication-function-td34883.html
>>
>> If there's any additional detail I should add/research, please let me
>> know.
>>
>> kind regards, Gerard.
>>
>>
>>

-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior
