RocksDB State Backend GET returns null intermittently

2023-06-21 Thread Prabhu Joseph
Hi,

A GET call against the RocksDB state backend intermittently returns null
for a key that was PUT into the state about 100 ms earlier. The issue never
happens with the HashMap state backend. We are trying to increase the block
cache size and write buffer size and to enable the bloom filter, as per the doc:
https://flink.apache.org/2021/01/18/using-rocksdb-state-backend-in-apache-flink-when-and-how/

Any ideas on what could be wrong or how to debug this?

Thanks,
Prabhu Joseph


Re: RocksDB State Backend GET returns null intermittently

2023-06-24 Thread Hangxiang Yu
Hi, Prabhu.

This is a correctness issue. IIUC, it should not be related to the size of
the block cache, the write buffer, or whether the bloom filter is enabled.

Is your job a DataStream job? Does the job contain a custom Serializer? You
could check or share the logic of the Serializer, as this is one of the
main differences between RocksDBStateBackend and HashMapStateBackend
(HashMapStateBackend does not perform serialization and deserialization).
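
As an illustration only (not taken from the job in question), here is a
minimal plain-Java sketch of how a lookup can succeed on objects and miss on
serialized bytes. EventKey, its serialize() method, and the two maps are
hypothetical stand-ins: the first map compares keys with equals()/hashCode(),
as an on-heap backend would, and the second compares the serialized bytes,
roughly as a byte-keyed store would.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical key type: equality uses only `id`, but the bytes also include `seenAt`.
class EventKey {
    final String id;
    final long seenAt;

    EventKey(String id, long seenAt) {
        this.id = id;
        this.seenAt = seenAt;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof EventKey && ((EventKey) o).id.equals(id);
    }

    @Override
    public int hashCode() {
        return id.hashCode();
    }

    // Hypothetical serializer that also writes a field equals() ignores.
    ByteBuffer serialize() {
        byte[] idBytes = id.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(idBytes.length + Long.BYTES);
        buf.put(idBytes).putLong(seenAt);
        buf.flip();
        return buf;
    }
}

class SerializedKeySketch {
    // Object-keyed lookup: the two keys are equal according to equals()/hashCode().
    static String heapLookup() {
        Map<EventKey, String> heap = new HashMap<>();
        heap.put(new EventKey("user-1", 100L), "v");
        return heap.get(new EventKey("user-1", 200L));
    }

    // Byte-keyed lookup: the serialized forms differ, so the key is not found.
    static String byteLookup() {
        Map<ByteBuffer, String> bytes = new HashMap<>();
        bytes.put(new EventKey("user-1", 100L).serialize(), "v");
        return bytes.get(new EventKey("user-1", 200L).serialize());
    }

    public static void main(String[] args) {
        System.out.println("heap lookup: " + heapLookup());
        System.out.println("byte lookup: " + byteLookup());
    }
}
```

If a custom Serializer writes anything that equals() ignores, or writes the
same logical key in more than one byte form, this kind of mismatch is worth
ruling out.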



-- 
Best,
Hangxiang.


Re: RocksDB State Backend GET returns null intermittently

2023-06-27 Thread Alexander Fedulov
Hi Prabhu,

Make sure that the key you use is the same for both records, and try to
reproduce the issue with a parallelism of 1.

Best,
Alex



Re: RocksDB State Backend GET returns null intermittently

2023-06-27 Thread Prabhu Joseph
Thanks Hangxiang and Alex for the pointers. We added audit logs to
RocksDBValueState (GET call: value() and PUT call: update()) and found
nothing wrong on the RocksDB side: it never returns null from a GET for a
key that was PUT earlier. We then added audit logs to the CX application
and found that it keeps a cache (HashMap) on top of RocksDBValueState to
speed up lookups, which is where the issue is. The application checks the
cache for the key first and, if the key is not there, reads it from
RocksDBValueState. A race condition in the application code overwrites the
RocksDBValueState with a new entry that does not contain the previous
state, which causes the issue.
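
For anyone hitting something similar, here is a minimal plain-Java sketch of
this kind of lost update. It is not the application's code: the two static
maps stand in for the cache and for RocksDBValueState, the names load(),
store(), racy() and serialized() are made up, and the interleaving shown
(two handlers both loading before either one stores) is only an assumption
about how such a race can play out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CacheOverStateSketch {
    // Stand-in for the keyed state; a real job would use Flink's ValueState.
    static final Map<String, List<String>> state = new HashMap<>();
    // The application-level cache layered on top of the state.
    static final Map<String, List<String>> cache = new HashMap<>();

    // Step 1: check the cache, fall back to the state, else start a fresh entry.
    static List<String> load(String key) {
        List<String> entry = cache.get(key);
        if (entry == null) {
            entry = state.get(key);
        }
        return entry == null ? new ArrayList<>() : new ArrayList<>(entry);
    }

    // Step 2: append the event and write the entry back to cache and state.
    static void store(String key, List<String> entry, String event) {
        entry.add(event);
        cache.put(key, entry);
        state.put(key, entry);
    }

    // Both handlers load before either stores, so the second store
    // overwrites the first with an entry that lacks "event-1".
    static List<String> racy() {
        state.clear();
        cache.clear();
        List<String> a = load("k");
        List<String> b = load("k");
        store("k", a, "event-1");
        store("k", b, "event-2");
        return state.get("k");
    }

    // Each load sees the previous store, so nothing is lost.
    static List<String> serialized() {
        state.clear();
        cache.clear();
        store("k", load("k"), "event-1");
        store("k", load("k"), "event-2");
        return state.get("k");
    }

    public static void main(String[] args) {
        System.out.println("racy:       " + racy());
        System.out.println("serialized: " + serialized());
    }
}
```

In the racy order the state ends up holding only the second event, which from
the outside looks like an earlier PUT that went missing.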

Sorry for the confusion; it turned out to be a problem on the Flink
Application side rather than the Framework side.



