Re: Bloom Filter for Rocksdb

2023-10-29 Thread xiangyu feng
Hi Kenan,

I would like to share our analysis of the pros and cons of enabling the
bloom filter in production.

Pros:
With the bloom filter enabled, RocksDB.get() can skip data files that
definitely do not contain the key, which saves random disk reads. How much
this helps depends on the operator's state access pattern and on the random
access performance of the disk. If the operator always reads the latest
data, or a fixed portion of the data that stays cached in RocksDB's
BlockCache, the improvement will not be significant. If the operator
accesses keys in RocksDB at random, enabling the bloom filter will help a
lot.
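
To make the "skip files that definitely do not contain the key" idea
concrete, here is a minimal sketch in Python. It is a toy, not RocksDB's
implementation: the class TinyBloomFilter, the helper files_to_read, the
SHA-256 based hashing and the file names are all assumptions made for
illustration.

```python
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter (illustrative only): k probes into an m-bit array."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions from two halves of one digest (double
        # hashing). Real implementations use much cheaper hash functions.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True only means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def files_to_read(filters: dict, key: bytes) -> list:
    """Names of the (hypothetical) files a point lookup still has to touch."""
    return [name for name, f in filters.items() if f.might_contain(key)]


# Two hypothetical SST files, each holding one key.
sst_filters = {"sst-1": TinyBloomFilter(1024, 7), "sst-2": TinyBloomFilter(1024, 7)}
sst_filters["sst-1"].add(b"device-42")
sst_filters["sst-2"].add(b"device-77")

# Always includes "sst-1"; "sst-2" shows up only on a false positive.
print(files_to_read(sst_filters, b"device-42"))
```

A filter never reports a stored key as absent, so skipping a file on a
negative answer is safe. The only cost of a false positive is one
unnecessary file read.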

Cons:
With the bloom filter enabled, RocksDB's compaction process adds bloom
filter information to newly generated SST files. This work runs
asynchronously in the background, so it does not affect RocksDB's read and
write performance, but it costs extra CPU and disk space.

Trade-offs:
The number of bits in the bloom filter determines its accuracy: more bits
mean fewer false positives, but also more CPU cost when the filter is
generated.

So in general, if your job has sufficient CPU resources and a random state
access pattern, I would recommend enabling the bloom filter with more than
10 bits per key.
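
As a rough illustration of the accuracy side of this trade-off, the sketch
below evaluates the textbook Bloom filter approximation
(1 - e^(-k/b))^k for b bits per key and k hash probes. The function names
are made up for this example, and RocksDB's actual filter formats may
deviate somewhat from the textbook numbers.

```python
import math


def optimal_num_hashes(bits_per_key: float) -> int:
    # Textbook optimum: k = b * ln(2), rounded to a whole number of probes.
    return max(1, round(bits_per_key * math.log(2)))


def false_positive_rate(bits_per_key: float, num_hashes: int) -> float:
    # Standard approximation for a classic Bloom filter: (1 - e^(-k/b))^k.
    return (1.0 - math.exp(-num_hashes / bits_per_key)) ** num_hashes


for b in (6, 10, 14, 20):
    k = optimal_num_hashes(b)
    rate = false_positive_rate(b, k)
    print(f"{b:>2} bits/key, {k:>2} probes -> ~{rate:.3%} false positives")
```

Under this approximation, 10 bits per key gives a false positive rate a
little under 1%, and every additional bit lowers it further while adding
filter size and generation cost.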

Hope this helps you.

Regards,
Xiangyu


Re: Bloom Filter for Rocksdb

2023-10-29 Thread David Anderson
I believe bloom filters are off by default because they add overhead and
aren't always helpful. For example, in workloads that are write-heavy and
have few reads, bloom filters aren't worth the overhead.

David


Re: Bloom Filter for Rocksdb

2023-10-20 Thread Mate Czagany
Hi,

There have been no reports about setting this configuration causing any
issues. I would guess it's off by default because it can increase the
memory usage by an unpredictable amount.

I would say feel free to enable it. From what you've said, I also think that
this would improve the performance of your jobs. But make sure to configure
your jobs so that they can accommodate the potential growth in memory
footprint. Also, please read the following resources to learn more about
RocksDB's bloom filter:
https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter
https://rocksdb.org/blog/2014/09/12/new-bloom-filter-format.html
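
For a back-of-the-envelope feel for the footprint growth, the sketch below
multiplies a key count by the bits per key. The helper filter_bytes is
hypothetical, and the estimate assumes exactly one filter entry per key. A
real job will usually need more, because several states and several SST
files can each hold an entry for the same key.

```python
def filter_bytes(num_keys: int, bits_per_key: float) -> float:
    """Lower-bound filter size in bytes, assuming one filter entry per key."""
    return num_keys * bits_per_key / 8


# 5 million keys (one per device, as in the question) at 10 bits per key.
size = filter_bytes(5_000_000, 10)
print(f"{size / (1024 * 1024):.1f} MiB per state")
```

This comes to about 6 MiB per state under these assumptions, so the filter
data itself is small. The harder part to predict is how many copies exist
across states and files at any given time.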

Regards,
Mate



Re: Bloom Filter for Rocksdb

2023-10-20 Thread Kartoglu, Emre
I don't know much about the performance improvements that may come from using
bloom filters, but I believe you can also improve RocksDB performance by
increasing managed memory, which RocksDB uses:
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#taskmanager-memory-managed-fraction




Bloom Filter for Rocksdb

2023-10-20 Thread Kenan Kılıçtepe
Can someone tell me the exact performance effect of enabling the bloom
filter? Could enabling it cause unpredictable performance problems?

I read what it is and how it works, and it makes sense, but I also asked
myself why the default value of state.backend.rocksdb.use-bloom-filter is
false.

We have a 5-server Flink cluster processing real-time IoT data coming
from 5 million devices, and for a lot of jobs we keep different states for
each device.

Sometimes we have performance issues, and when I check the flamegraph on the
test server I always see that rocksdb.get() is the bottleneck. I just want
to improve RocksDB performance.

Thanks