[StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.

2019-06-12 Thread Gerard Maas
Hi!
I would like to socialize this issue we are currently facing:
The Structured Streaming default CheckpointFileManager leaks .crc files by
leaving them behind after users of this class (like
HDFSBackedStateStoreProvider) apply their cleanup methods.

This results in the unbounded creation of tiny files that eat away storage
by the block and, in our case, degrade file system performance.
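
To make the leak concrete, here is a minimal Python sketch (not taken from
Spark) that lists orphaned checksum files under a directory tree. It assumes
the Hadoop checksum file system convention of storing the checksum for a file
`name` in a hidden sibling `.name.crc`, and that the storage is reachable as a
regular local path. The function name `find_orphaned_crc` is hypothetical and
used only for illustration.

```python
import os


def find_orphaned_crc(root):
    """Return paths of `.name.crc` files whose data file `name` is gone."""
    orphans = []
    for dirpath, _dirnames, filenames in os.walk(root):
        present = set(filenames)
        for fname in filenames:
            # Assumed sidecar naming: ".<data file name>.crc".
            # The length check skips names with an empty data part.
            if fname.startswith(".") and fname.endswith(".crc") and len(fname) > 5:
                data_name = fname[1:-4]
                if data_name not in present:
                    orphans.append(os.path.join(dirpath, fname))
    return sorted(orphans)
```

Running it against the checkpoint and state store directories after a cleanup
cycle would show whether `.crc` files outlive the files they describe.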

We correlated the processedRowsPerSecond reported by the
StreamingQueryProgress against a count of the .crc files in the storage
directory (checkpoint + state store). The performance impact we observe is
dramatic.
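
A measurement of this kind could be sketched as follows. This is a minimal
Python illustration, not necessarily the code behind the chart: it assumes the
storage directory is mounted as a local path and that the
processedRowsPerSecond values have already been collected from the
StreamingQueryProgress reports. The helper names `count_crc_files` and
`pearson` are hypothetical.

```python
import math
import os


def count_crc_files(root):
    """Count files ending in .crc anywhere under root."""
    return sum(
        1
        for _dirpath, _dirnames, filenames in os.walk(root)
        for fname in filenames
        if fname.endswith(".crc")
    )


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

Sampling `count_crc_files` once per progress report and passing the file
counts and the throughput values to `pearson` yields a coefficient close to -1
when throughput falls steadily as files accumulate.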

We are running on Kubernetes, using GlusterFS as the shared storage
provider.
[image: out processedRowsPerSecond vs. files in storage_process.png]
I have created a JIRA ticket with additional detail:

https://issues.apache.org/jira/browse/SPARK-17475

This is also related to an earlier discussion about unbounded disk-size
growth of the state store, which was left unresolved back then:
http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-State-Store-storage-behavior-for-the-Stream-Deduplication-function-td34883.html

If there's any additional detail I should add/research, please let me know.

kind regards, Gerard.


Re: [StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.

2019-06-12 Thread Gerard Maas
Oops, I linked the wrong JIRA ticket (that other one is related). The correct one is:

https://issues.apache.org/jira/browse/SPARK-28025


Re: [StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.

2019-06-12 Thread Jungtaek Lim
Nice finding!

Given that you already pointed out the previous issue, which fixed a similar
problem, it should also be easy for you to craft a patch and verify whether
the fix resolves your issue. Looking forward to seeing your patch.

Thanks,
Jungtaek Lim (HeartSaVioR)

-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior