Re: Impact of RocksDB backend on the Java heap

2024-02-19 Thread Alexis Sarda-Espinosa
Hi Zakelly,

Yeah that makes sense to me, I was just curious about whether reading could
be a bottleneck or not, but I imagine user-specific logic would be better
than a generic cache from Flink that might habe a low hit rate.

Thanks again,
Alexis.

On Mon, 19 Feb 2024, 07:29 Zakelly Lan,  wrote:

> Hi Alexis,
>
> Assuming the bulk load for a batch of sequential keys performs better than
> accessing them one by one, the main problem comes to do we really need to
> access all the keys that were bulk-loaded to cache before. In other words,
> cache hit rate is the key issue. If the rate is high, even though a single
> key-value is large and loading them is slow, it is still worth it to load
> them in advance. In case of timer and iteration (which I missed in last
> mail), the cache is almost guaranteed to hit. Thus a cache is introduced to
> enhance the performance here.
>
>
> Best,
> Zakelly
>
> On Sun, Feb 18, 2024 at 7:42 PM Alexis Sarda-Espinosa <
> sarda.espin...@gmail.com> wrote:
>
>> Hi Zakelly,
>>
>> thanks for the information, that's interesting. Would you say that
>> reading a subset from RocksDB is fast enough to be pretty much negligible,
>> or could it be a bottleneck if the state of each key is "large"? Again
>> assuming the number of distinct partition keys is large.
>>
>> Regards,
>> Alexis.
>>
>> On Sun, 18 Feb 2024, 05:02 Zakelly Lan,  wrote:
>>
>>> Hi Alexis,
>>>
>>> Flink does need some heap memory to bridge requests to rocksdb and
>>> gather the results. In most cases, the memory is discarded immediately
>>> (eventually collected by GC). In case of timers, flink do cache a limited
>>> subset of key-values in heap to improve performance.
>>>
>>> In general you don't need to consider its heap consumption since it is
>>> minor.
>>>
>>>
>>> Best,
>>> Zakelly
>>>
>>> On Fri, Feb 16, 2024 at 4:43 AM Asimansu Bera 
>>> wrote:
>>>
 Hello Alexis,

 I don't think data in RocksDB resides in JVM even with function calls.

 For more details, check the link below:

 https://github.com/facebook/rocksdb/wiki/RocksDB-Overview#3-high-level-architecture

 RocksDB has three main components - memtable, sstfile and WAL(not used
 in Flink as Flink uses checkpointing). When TM starts with statebackend as
 RocksDB,TM has its own RocksDB instance and the state is managed as column
 Family by that TM. Any changes of state go into memtable --> sst-->
 persistent store. When read, data goes to the buffers and cache of RocksDB.

 In the case of RocksDB as state backend, JVM still holds threads stack
 as for high degree of parallelism, there are many
 stacks maintaining separate thread information.

 Hope this helps!!





 On Thu, Feb 15, 2024 at 11:21 AM Alexis Sarda-Espinosa <
 sarda.espin...@gmail.com> wrote:

> Hi Asimansu
>
> The memory RocksDB manages is outside the JVM, yes, but the mentioned
> subsets must be bridged to the JVM somehow so that the data can be exposed
> to the functions running inside Flink, no?
>
> Regards,
> Alexis.
>
>
> On Thu, 15 Feb 2024, 14:06 Asimansu Bera, 
> wrote:
>
>> Hello Alexis,
>>
>> RocksDB resides off-heap and outside of JVM. The small subset of data
>> ends up on the off-heap in the memory.
>>
>> For more details, check the following link:
>>
>> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/memory/mem_setup_tm/#managed-memory
>>
>> I hope this addresses your inquiry.
>>
>>
>>
>>
>> On Thu, Feb 15, 2024 at 12:52 AM Alexis Sarda-Espinosa <
>> sarda.espin...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Most info regarding RocksDB memory for Flink focuses on what's
>>> needed independently of the JVM (although the Flink process configures 
>>> its
>>> limits and so on). I'm wondering if there are additional special
>>> considerations with regards to the JVM heap in the following scenario.
>>>
>>> Assuming a key used to partition a Flink stream and its state has a
>>> high cardinality, but that the state of each key is small, when Flink
>>> prepares the state to expose to a user function during a call (with a 
>>> given
>>> partition key), I guess it loads only the required subset from RocksDB, 
>>> but
>>> does this small subset end (temporarily) up on the JVM heap? And if it
>>> does, does it stay "cached" in the JVM for some time or is it 
>>> immediately
>>> discarded after the user function completes?
>>>
>>> Maybe this isn't even under Flink's control, but I'm curious.
>>>
>>> Regards,
>>> Alexis.
>>>
>>


Re: Impact of RocksDB backend on the Java heap

2024-02-18 Thread Zakelly Lan
Hi Alexis,

Assuming the bulk load for a batch of sequential keys performs better than
accessing them one by one, the main problem comes to do we really need to
access all the keys that were bulk-loaded to cache before. In other words,
cache hit rate is the key issue. If the rate is high, even though a single
key-value is large and loading them is slow, it is still worth it to load
them in advance. In case of timer and iteration (which I missed in last
mail), the cache is almost guaranteed to hit. Thus a cache is introduced to
enhance the performance here.


Best,
Zakelly

On Sun, Feb 18, 2024 at 7:42 PM Alexis Sarda-Espinosa <
sarda.espin...@gmail.com> wrote:

> Hi Zakelly,
>
> thanks for the information, that's interesting. Would you say that reading
> a subset from RocksDB is fast enough to be pretty much negligible, or could
> it be a bottleneck if the state of each key is "large"? Again assuming the
> number of distinct partition keys is large.
>
> Regards,
> Alexis.
>
> On Sun, 18 Feb 2024, 05:02 Zakelly Lan,  wrote:
>
>> Hi Alexis,
>>
>> Flink does need some heap memory to bridge requests to rocksdb and gather
>> the results. In most cases, the memory is discarded immediately (eventually
>> collected by GC). In case of timers, flink do cache a limited subset of
>> key-values in heap to improve performance.
>>
>> In general you don't need to consider its heap consumption since it is
>> minor.
>>
>>
>> Best,
>> Zakelly
>>
>> On Fri, Feb 16, 2024 at 4:43 AM Asimansu Bera 
>> wrote:
>>
>>> Hello Alexis,
>>>
>>> I don't think data in RocksDB resides in JVM even with function calls.
>>>
>>> For more details, check the link below:
>>>
>>> https://github.com/facebook/rocksdb/wiki/RocksDB-Overview#3-high-level-architecture
>>>
>>> RocksDB has three main components - memtable, sstfile and WAL(not used
>>> in Flink as Flink uses checkpointing). When TM starts with statebackend as
>>> RocksDB,TM has its own RocksDB instance and the state is managed as column
>>> Family by that TM. Any changes of state go into memtable --> sst-->
>>> persistent store. When read, data goes to the buffers and cache of RocksDB.
>>>
>>> In the case of RocksDB as state backend, JVM still holds threads stack
>>> as for high degree of parallelism, there are many
>>> stacks maintaining separate thread information.
>>>
>>> Hope this helps!!
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Feb 15, 2024 at 11:21 AM Alexis Sarda-Espinosa <
>>> sarda.espin...@gmail.com> wrote:
>>>
 Hi Asimansu

 The memory RocksDB manages is outside the JVM, yes, but the mentioned
 subsets must be bridged to the JVM somehow so that the data can be exposed
 to the functions running inside Flink, no?

 Regards,
 Alexis.


 On Thu, 15 Feb 2024, 14:06 Asimansu Bera, 
 wrote:

> Hello Alexis,
>
> RocksDB resides off-heap and outside of JVM. The small subset of data
> ends up on the off-heap in the memory.
>
> For more details, check the following link:
>
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/memory/mem_setup_tm/#managed-memory
>
> I hope this addresses your inquiry.
>
>
>
>
> On Thu, Feb 15, 2024 at 12:52 AM Alexis Sarda-Espinosa <
> sarda.espin...@gmail.com> wrote:
>
>> Hello,
>>
>> Most info regarding RocksDB memory for Flink focuses on what's needed
>> independently of the JVM (although the Flink process configures its 
>> limits
>> and so on). I'm wondering if there are additional special considerations
>> with regards to the JVM heap in the following scenario.
>>
>> Assuming a key used to partition a Flink stream and its state has a
>> high cardinality, but that the state of each key is small, when Flink
>> prepares the state to expose to a user function during a call (with a 
>> given
>> partition key), I guess it loads only the required subset from RocksDB, 
>> but
>> does this small subset end (temporarily) up on the JVM heap? And if it
>> does, does it stay "cached" in the JVM for some time or is it immediately
>> discarded after the user function completes?
>>
>> Maybe this isn't even under Flink's control, but I'm curious.
>>
>> Regards,
>> Alexis.
>>
>


Re: Impact of RocksDB backend on the Java heap

2024-02-18 Thread Alexis Sarda-Espinosa
Hi Zakelly,

thanks for the information, that's interesting. Would you say that reading
a subset from RocksDB is fast enough to be pretty much negligible, or could
it be a bottleneck if the state of each key is "large"? Again assuming the
number of distinct partition keys is large.

Regards,
Alexis.

On Sun, 18 Feb 2024, 05:02 Zakelly Lan,  wrote:

> Hi Alexis,
>
> Flink does need some heap memory to bridge requests to rocksdb and gather
> the results. In most cases, the memory is discarded immediately (eventually
> collected by GC). In case of timers, flink do cache a limited subset of
> key-values in heap to improve performance.
>
> In general you don't need to consider its heap consumption since it is
> minor.
>
>
> Best,
> Zakelly
>
> On Fri, Feb 16, 2024 at 4:43 AM Asimansu Bera 
> wrote:
>
>> Hello Alexis,
>>
>> I don't think data in RocksDB resides in JVM even with function calls.
>>
>> For more details, check the link below:
>>
>> https://github.com/facebook/rocksdb/wiki/RocksDB-Overview#3-high-level-architecture
>>
>> RocksDB has three main components - memtable, sstfile and WAL(not used in
>> Flink as Flink uses checkpointing). When TM starts with statebackend as
>> RocksDB,TM has its own RocksDB instance and the state is managed as column
>> Family by that TM. Any changes of state go into memtable --> sst-->
>> persistent store. When read, data goes to the buffers and cache of RocksDB.
>>
>> In the case of RocksDB as state backend, JVM still holds threads stack as
>> for high degree of parallelism, there are many stacks maintaining separate
>> thread information.
>>
>> Hope this helps!!
>>
>>
>>
>>
>>
>> On Thu, Feb 15, 2024 at 11:21 AM Alexis Sarda-Espinosa <
>> sarda.espin...@gmail.com> wrote:
>>
>>> Hi Asimansu
>>>
>>> The memory RocksDB manages is outside the JVM, yes, but the mentioned
>>> subsets must be bridged to the JVM somehow so that the data can be exposed
>>> to the functions running inside Flink, no?
>>>
>>> Regards,
>>> Alexis.
>>>
>>>
>>> On Thu, 15 Feb 2024, 14:06 Asimansu Bera, 
>>> wrote:
>>>
 Hello Alexis,

 RocksDB resides off-heap and outside of JVM. The small subset of data
 ends up on the off-heap in the memory.

 For more details, check the following link:

 https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/memory/mem_setup_tm/#managed-memory

 I hope this addresses your inquiry.




 On Thu, Feb 15, 2024 at 12:52 AM Alexis Sarda-Espinosa <
 sarda.espin...@gmail.com> wrote:

> Hello,
>
> Most info regarding RocksDB memory for Flink focuses on what's needed
> independently of the JVM (although the Flink process configures its limits
> and so on). I'm wondering if there are additional special considerations
> with regards to the JVM heap in the following scenario.
>
> Assuming a key used to partition a Flink stream and its state has a
> high cardinality, but that the state of each key is small, when Flink
> prepares the state to expose to a user function during a call (with a 
> given
> partition key), I guess it loads only the required subset from RocksDB, 
> but
> does this small subset end (temporarily) up on the JVM heap? And if it
> does, does it stay "cached" in the JVM for some time or is it immediately
> discarded after the user function completes?
>
> Maybe this isn't even under Flink's control, but I'm curious.
>
> Regards,
> Alexis.
>



Re: Impact of RocksDB backend on the Java heap

2024-02-17 Thread Zakelly Lan
Hi Alexis,

Flink does need some heap memory to bridge requests to rocksdb and gather
the results. In most cases, the memory is discarded immediately (eventually
collected by GC). In case of timers, flink do cache a limited subset of
key-values in heap to improve performance.

In general you don't need to consider its heap consumption since it is
minor.


Best,
Zakelly

On Fri, Feb 16, 2024 at 4:43 AM Asimansu Bera 
wrote:

> Hello Alexis,
>
> I don't think data in RocksDB resides in JVM even with function calls.
>
> For more details, check the link below:
>
> https://github.com/facebook/rocksdb/wiki/RocksDB-Overview#3-high-level-architecture
>
> RocksDB has three main components - memtable, sstfile and WAL(not used in
> Flink as Flink uses checkpointing). When TM starts with statebackend as
> RocksDB,TM has its own RocksDB instance and the state is managed as column
> Family by that TM. Any changes of state go into memtable --> sst-->
> persistent store. When read, data goes to the buffers and cache of RocksDB.
>
> In the case of RocksDB as state backend, JVM still holds threads stack as
> for high degree of parallelism, there are many stacks maintaining separate
> thread information.
>
> Hope this helps!!
>
>
>
>
>
> On Thu, Feb 15, 2024 at 11:21 AM Alexis Sarda-Espinosa <
> sarda.espin...@gmail.com> wrote:
>
>> Hi Asimansu
>>
>> The memory RocksDB manages is outside the JVM, yes, but the mentioned
>> subsets must be bridged to the JVM somehow so that the data can be exposed
>> to the functions running inside Flink, no?
>>
>> Regards,
>> Alexis.
>>
>>
>> On Thu, 15 Feb 2024, 14:06 Asimansu Bera, 
>> wrote:
>>
>>> Hello Alexis,
>>>
>>> RocksDB resides off-heap and outside of JVM. The small subset of data
>>> ends up on the off-heap in the memory.
>>>
>>> For more details, check the following link:
>>>
>>> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/memory/mem_setup_tm/#managed-memory
>>>
>>> I hope this addresses your inquiry.
>>>
>>>
>>>
>>>
>>> On Thu, Feb 15, 2024 at 12:52 AM Alexis Sarda-Espinosa <
>>> sarda.espin...@gmail.com> wrote:
>>>
 Hello,

 Most info regarding RocksDB memory for Flink focuses on what's needed
 independently of the JVM (although the Flink process configures its limits
 and so on). I'm wondering if there are additional special considerations
 with regards to the JVM heap in the following scenario.

 Assuming a key used to partition a Flink stream and its state has a
 high cardinality, but that the state of each key is small, when Flink
 prepares the state to expose to a user function during a call (with a given
 partition key), I guess it loads only the required subset from RocksDB, but
 does this small subset end (temporarily) up on the JVM heap? And if it
 does, does it stay "cached" in the JVM for some time or is it immediately
 discarded after the user function completes?

 Maybe this isn't even under Flink's control, but I'm curious.

 Regards,
 Alexis.

>>>


Re: Impact of RocksDB backend on the Java heap

2024-02-15 Thread Asimansu Bera
Hello Alexis,

I don't think data in RocksDB resides in JVM even with function calls.

For more details, check the link below:
https://github.com/facebook/rocksdb/wiki/RocksDB-Overview#3-high-level-architecture

RocksDB has three main components - memtable, sstfile and WAL(not used in
Flink as Flink uses checkpointing). When TM starts with statebackend as
RocksDB,TM has its own RocksDB instance and the state is managed as column
Family by that TM. Any changes of state go into memtable --> sst-->
persistent store. When read, data goes to the buffers and cache of RocksDB.

In the case of RocksDB as state backend, JVM still holds threads stack as
for high degree of parallelism, there are many stacks maintaining separate
thread information.

Hope this helps!!





On Thu, Feb 15, 2024 at 11:21 AM Alexis Sarda-Espinosa <
sarda.espin...@gmail.com> wrote:

> Hi Asimansu
>
> The memory RocksDB manages is outside the JVM, yes, but the mentioned
> subsets must be bridged to the JVM somehow so that the data can be exposed
> to the functions running inside Flink, no?
>
> Regards,
> Alexis.
>
>
> On Thu, 15 Feb 2024, 14:06 Asimansu Bera,  wrote:
>
>> Hello Alexis,
>>
>> RocksDB resides off-heap and outside of JVM. The small subset of data
>> ends up on the off-heap in the memory.
>>
>> For more details, check the following link:
>>
>> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/memory/mem_setup_tm/#managed-memory
>>
>> I hope this addresses your inquiry.
>>
>>
>>
>>
>> On Thu, Feb 15, 2024 at 12:52 AM Alexis Sarda-Espinosa <
>> sarda.espin...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Most info regarding RocksDB memory for Flink focuses on what's needed
>>> independently of the JVM (although the Flink process configures its limits
>>> and so on). I'm wondering if there are additional special considerations
>>> with regards to the JVM heap in the following scenario.
>>>
>>> Assuming a key used to partition a Flink stream and its state has a high
>>> cardinality, but that the state of each key is small, when Flink prepares
>>> the state to expose to a user function during a call (with a given
>>> partition key), I guess it loads only the required subset from RocksDB, but
>>> does this small subset end (temporarily) up on the JVM heap? And if it
>>> does, does it stay "cached" in the JVM for some time or is it immediately
>>> discarded after the user function completes?
>>>
>>> Maybe this isn't even under Flink's control, but I'm curious.
>>>
>>> Regards,
>>> Alexis.
>>>
>>


Re: Impact of RocksDB backend on the Java heap

2024-02-15 Thread Alexis Sarda-Espinosa
Hi Asimansu

The memory RocksDB manages is outside the JVM, yes, but the mentioned
subsets must be bridged to the JVM somehow so that the data can be exposed
to the functions running inside Flink, no?

Regards,
Alexis.


On Thu, 15 Feb 2024, 14:06 Asimansu Bera,  wrote:

> Hello Alexis,
>
> RocksDB resides off-heap and outside of JVM. The small subset of data ends
> up on the off-heap in the memory.
>
> For more details, check the following link:
>
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/memory/mem_setup_tm/#managed-memory
>
> I hope this addresses your inquiry.
>
>
>
>
> On Thu, Feb 15, 2024 at 12:52 AM Alexis Sarda-Espinosa <
> sarda.espin...@gmail.com> wrote:
>
>> Hello,
>>
>> Most info regarding RocksDB memory for Flink focuses on what's needed
>> independently of the JVM (although the Flink process configures its limits
>> and so on). I'm wondering if there are additional special considerations
>> with regards to the JVM heap in the following scenario.
>>
>> Assuming a key used to partition a Flink stream and its state has a high
>> cardinality, but that the state of each key is small, when Flink prepares
>> the state to expose to a user function during a call (with a given
>> partition key), I guess it loads only the required subset from RocksDB, but
>> does this small subset end (temporarily) up on the JVM heap? And if it
>> does, does it stay "cached" in the JVM for some time or is it immediately
>> discarded after the user function completes?
>>
>> Maybe this isn't even under Flink's control, but I'm curious.
>>
>> Regards,
>> Alexis.
>>
>


Re: Impact of RocksDB backend on the Java heap

2024-02-15 Thread Asimansu Bera
Hello Alexis,

RocksDB resides off-heap and outside of JVM. The small subset of data ends
up on the off-heap in the memory.

For more details, check the following link:
https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/memory/mem_setup_tm/#managed-memory

I hope this addresses your inquiry.




On Thu, Feb 15, 2024 at 12:52 AM Alexis Sarda-Espinosa <
sarda.espin...@gmail.com> wrote:

> Hello,
>
> Most info regarding RocksDB memory for Flink focuses on what's needed
> independently of the JVM (although the Flink process configures its limits
> and so on). I'm wondering if there are additional special considerations
> with regards to the JVM heap in the following scenario.
>
> Assuming a key used to partition a Flink stream and its state has a high
> cardinality, but that the state of each key is small, when Flink prepares
> the state to expose to a user function during a call (with a given
> partition key), I guess it loads only the required subset from RocksDB, but
> does this small subset end (temporarily) up on the JVM heap? And if it
> does, does it stay "cached" in the JVM for some time or is it immediately
> discarded after the user function completes?
>
> Maybe this isn't even under Flink's control, but I'm curious.
>
> Regards,
> Alexis.
>