Re:Re: [DISCUSS] FLIP-427: Disaggregated State Store

Feifan Wang Wed, 27 Mar 2024 20:41:32 -0700

Thanks for this valuable proposal, Hangxiang!


> If we need to introduce a JNI call during each filesystem call, that would be 
> N times JNI cost compared with the current RocksDB state-backend's JNI cost.
What if we use JNI only for DFS access, which needs to reuse the Flink 
FileSystem, and route all local disk access through the native API? This idea 
rests on the understanding that JNI overhead is negligible compared to DFS 
access latency, so avoiding it matters mainly for the faster local disks. Since 
local disk as secondary is already under consideration [1], maybe we can 
discuss in that FLIP whether to use the native API to access local disk?
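
To make the idea concrete, here is a minimal sketch of the routing decision, 
assuming the choice is made purely by URI scheme. The class, enum, and method 
names are hypothetical and are not part of the FLIP or of ForSt. In practice 
the decision would likely live on the native side; Java is used here only to 
illustrate the logic.

```java
import java.net.URI;

/** Hypothetical sketch: choose the I/O path for a state file by URI scheme. */
public final class AccessPathRouter {

    /** Which path a file request would take (names are illustrative only). */
    public enum AccessPath { NATIVE_LOCAL, JNI_FLINK_FILESYSTEM }

    private AccessPathRouter() {}

    /**
     * Paths without a scheme, or with the "file" scheme, are treated as local
     * disk and served by the native API. Anything else is assumed to be a DFS
     * that must go through JNI to reuse the Flink FileSystem.
     */
    public static AccessPath route(String path) {
        String scheme = URI.create(path).getScheme();
        if (scheme == null || scheme.equalsIgnoreCase("file")) {
            return AccessPath.NATIVE_LOCAL;
        }
        return AccessPath.JNI_FLINK_FILESYSTEM;
    }

    public static void main(String[] args) {
        // Example paths are made up for illustration.
        System.out.println(route("/tmp/forst/000012.sst"));
        System.out.println(route("hdfs://namenode:8020/forst/000012.sst"));
    }
}
```

With this split, only the requests that already pay DFS latency would also 
pay the JNI cost.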


>I'd suggest keeping `state.backend.forSt.working-dir` as it is for now.
>Different disaggregated state storages may have their own semantics about
>this configuration, e.g. life cycle, supported file systems or storages.
I agree with deferring the question of moving this configuration up to the 
engine level until there are other disaggregated backends.


[1] https://cwiki.apache.org/confluence/x/U4p3EQ

——————————————

Best regards,

Feifan Wang




At 2024-03-28 09:55:48, "Hangxiang Yu" <master...@gmail.com> wrote:
>Hi, Yun.
>Thanks for the reply.
>
>The JNI cost you considered is right. As I replied to Yue, I agree to leave
>room for proposal 1 and consider it as a future optimization; this is
>also updated in the FLIP.
>
>> The other question is that the configuration of
>> `state.backend.forSt.working-dir` looks too coupled with the ForSt
>> state-backend, how would it be if we introduce another disaggregated state
>> storage? Thus, I think `state.backend.disaggregated.working-dir` might be a
>> better configuration name.
>
>I'd suggest keeping `state.backend.forSt.working-dir` as it is for now.
>Different disaggregated state storages may have their own semantics about
>this configuration, e.g. life cycle, supported file systems or storages.
>Maybe it's more suitable to consider it together when we introduce other
>disaggregated state storages in the future.
>
>On Thu, Mar 28, 2024 at 12:02 AM Yun Tang <myas...@live.com> wrote:
>
>> Hi Hangxiang,
>>
>> The design looks good, and I also support leaving space for proposal 1.
>>
>> As you know, loading index/filter/data blocks for querying across levels
>> would introduce high IO access within the LSM tree for old data. If we need
>> to introduce a JNI call during each filesystem call, that would be N times
>> JNI cost compared with the current RocksDB state-backend's JNI cost.
>>
>> The other question is that the configuration of
>> `state.backend.forSt.working-dir` looks too coupled with the ForSt
>> state-backend, how would it be if we introduce another disaggregated state
>> storage? Thus, I think `state.backend.disaggregated.working-dir` might be a
>> better configuration name.
>>
>>
>> Best
>> Yun Tang
>>
>> ________________________________
>> From: Hangxiang Yu <master...@gmail.com>
>> Sent: Wednesday, March 20, 2024 11:32
>> To: dev@flink.apache.org <dev@flink.apache.org>
>> Subject: Re: [DISCUSS] FLIP-427: Disaggregated State Store
>>
>> Hi, Yue.
>> Thanks for the reply.
>>
>> > If we use proposal 1, we can easily reuse these optimizations. It is even
>> > possible to discuss and review the solution together in the RocksDB
>> > community.
>>
>> We also saw these useful optimizations which could be applied to ForSt in
>> the future.
>> But IIUC, they are not tied to proposal 1, right? We could also
>> implement interfaces for temperature and secondary cache to reuse them,
>> or organize a more complex HybridEnv based on proposal 2.
>>
>> > My point is whether we should retain the potential of proposal 1 in the
>> > design.
>>
>> This is a good suggestion. We chose proposal 2 first for its
>> maintainability and scalability, especially because it can conveniently
>> leverage all the filesystems Flink supports.
>> Given the undeniable performance advantage of proposal 1, I think we could
>> also consider it as an optimization in the future.
>> For the interface on the DB side, we could also expose more Envs
>> in the future.
>>
>>
>> On Tue, Mar 19, 2024 at 9:14 PM yue ma <mayuefi...@gmail.com> wrote:
>>
>> > Hi Hangxiang,
>> >
>> > Thanks for bringing up this discussion.
>> > I have a few questions about the proposals you mentioned in the FLIP.
>> >
>> > The current conclusion is to use proposal 2, which is okay for me. My point
>> > is whether we should retain the potential of proposal 1 in the design.
>> > There are the following reasons:
>> > 1. No JNI overhead, as mentioned in the Performance part of the FLIP.
>> > 2. RocksDB already provides an Env interface, and there are also
>> > some implementations, such as HDFS-ENV, which seem to be easily scalable.
>> > 3. The RocksDB community continues to support LSM for different storage
>> > media, such as Tiered Storage
>> > <https://github.com/facebook/rocksdb/wiki/Tiered-Storage-%28Experimental%29>.
>> > Some optimizations have been made for this scenario, such as Per Key
>> > Placement Comparison
>> > <https://rocksdb.org/blog/2022/11/09/time-aware-tiered-storage.html>
>> > and Secondary Cache
>> > <https://github.com/facebook/rocksdb/wiki/SecondaryCache-%28Experimental%29>,
>> > similar to the Hybrid Block Cache mentioned in FLIP-423.
>> > If we use proposal 1, we can easily reuse these optimizations. It is even
>> > possible to discuss and review the solution together in the RocksDB
>> > community.
>> > In fact, we already use proposal 1 in production internally. We have
>> > integrated HybridEnv, Tiered Storage, and Secondary Cache on RocksDB and
>> > optimized the performance of Checkpoint and State Restore. It seems to
>> > work well for us.
>> >
>> > --
>> > Best,
>> > Yue
>> >
>>
>>
>> --
>> Best,
>> Hangxiang.
>>
>
>
>-- 
>Best,
>Hangxiang.
