Hi, Feifan. Thanks for your reply.
> What if we only use jni to access DFS that needs to reuse Flink
> FileSystem? And all local disk access through native api. This idea is
> based on the understanding that jni overhead is not worth mentioning
> compared to DFS access latency. It might make more sense to consider
> avoiding jni overhead for faster local disks. Since local disk as
> secondary is already under consideration [1], maybe we can discuss in
> that FLIP whether to use native api to access local disk?

This is a good suggestion. It's reasonable to use the native api to access
the local disk cache since it requires lower latency compared to remote
access. I also believe that the jni overhead is relatively negligible when
weighed against the latency of remote I/O, as mentioned in the FLIP.

So I think we could just go with proposal 2 and keep proposal 1 as a
potential future optimization, which could work better when there is a
higher performance requirement, or when the native libraries of some
filesystems offer significantly better performance and resource usage than
their java libs.
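To illustrate what the Java side of proposal 2 could look like: the native
ForSt side would only need a thin JNI shim over Flink's FileSystem, e.g. a
positioned read. A rough sketch (the class and method names here are only
illustrative, not the actual interface in the FLIP):

import org.apache.flink.core.fs.FSDataInputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

import java.io.IOException;

// Illustrative sketch only: a Java-side bridge that a native ForSt Env
// could call through JNI to reuse Flink's FileSystem for DFS access.
public class FlinkFileSystemBridge {

    // Positioned read invoked from native code; returns the number of
    // bytes read, or -1 at EOF. A real implementation would cache open
    // streams instead of reopening the file on every call.
    public static int pread(String uri, long offset, byte[] buffer) throws IOException {
        Path path = new Path(uri);
        // Resolves whichever Flink FileSystem is registered for the
        // scheme (s3://, hdfs://, oss://, ...).
        FileSystem fs = path.getFileSystem();
        try (FSDataInputStream in = fs.open(path)) {
            in.seek(offset);
            int total = 0;
            while (total < buffer.length) {
                int n = in.read(buffer, total, buffer.length - total);
                if (n < 0) {
                    break; // end of file
                }
                total += n;
            }
            return total == 0 ? -1 : total;
        }
    }
}

Each such call pays one JNI crossing on top of the remote read itself, so
with DFS round trips in the millisecond range the extra microseconds should
be in the noise, which is exactly the assumption above.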
On Thu, Mar 28, 2024 at 11:39 AM Feifan Wang &lt;zoltar9...@163.com&gt; wrote:

> Thanks for this valuable proposal Hangxiang!
>
> > If we need to introduce a JNI call during each filesystem call, that
> > would be N times JNI cost compared with the current RocksDB
> > state-backend's JNI cost.
>
> What if we only use jni to access DFS that needs to reuse Flink
> FileSystem? And all local disk access through native api. This idea is
> based on the understanding that jni overhead is not worth mentioning
> compared to DFS access latency. It might make more sense to consider
> avoiding jni overhead for faster local disks. Since local disk as
> secondary is already under consideration [1], maybe we can discuss in
> that FLIP whether to use native api to access local disk?
>
> > I'd suggest keeping `state.backend.forSt.working-dir` as it is for now.
> > Different disaggregated state storages may have their own semantics
> > about this configuration, e.g. life cycle, supported file systems or
> > storages.
>
> I agree with considering moving this configuration up to the engine level
> when there are other disaggregated backends.
>
> [1] https://cwiki.apache.org/confluence/x/U4p3EQ
>
> ——————————————
> Best regards,
> Feifan Wang
>
> At 2024-03-28 09:55:48, "Hangxiang Yu" &lt;master...@gmail.com&gt; wrote:
>
> > Hi, Yun.
> > Thanks for the reply.
> >
> > The JNI cost you considered is right. As replied to Yue, I agreed to
> > leave space and consider proposal 1 as an optimization in the future,
> > which is also updated in the FLIP.
> >
> >> The other question is that the configuration of
> >> `state.backend.forSt.working-dir` looks too coupled with the ForSt
> >> state-backend, how would it be if we introduce another disaggregated
> >> state storage? Thus, I think `state.backend.disaggregated.working-dir`
> >> might be a better configuration name.
> >
> > I'd suggest keeping `state.backend.forSt.working-dir` as it is for now.
> > Different disaggregated state storages may have their own semantics
> > about this configuration, e.g. life cycle, supported file systems or
> > storages. Maybe it's more suitable to consider it together when we
> > introduce other disaggregated state storages in the future.
> >
> > On Thu, Mar 28, 2024 at 12:02 AM Yun Tang &lt;myas...@live.com&gt; wrote:
> >
> >> Hi Hangxiang,
> >>
> >> The design looks good, and I also support leaving space for proposal 1.
> >>
> >> As you know, loading index/filter/data blocks for querying across
> >> levels would introduce high IO access within the LSM tree for old
> >> data. If we need to introduce a JNI call during each filesystem call,
> >> that would be N times the JNI cost compared with the current RocksDB
> >> state-backend's JNI cost.
> >>
> >> The other question is that the configuration of
> >> `state.backend.forSt.working-dir` looks too coupled with the ForSt
> >> state-backend; how would it be if we introduce another disaggregated
> >> state storage? Thus, I think `state.backend.disaggregated.working-dir`
> >> might be a better configuration name.
> >>
> >> Best
> >> Yun Tang
> >>
> >> ________________________________
> >> From: Hangxiang Yu &lt;master...@gmail.com&gt;
> >> Sent: Wednesday, March 20, 2024 11:32
> >> To: dev@flink.apache.org &lt;dev@flink.apache.org&gt;
> >> Subject: Re: [DISCUSS] FLIP-427: Disaggregated State Store
> >>
> >> Hi, Yue.
> >> Thanks for the reply.
> >>
> >> > If we use proposal 1, we can easily reuse these optimizations. It is
> >> > even possible to discuss and review the solution together in the
> >> > RocksDB community.
> >>
> >> We also saw these useful optimizations, which could be applied to ForSt
> >> in the future. But IIUC, it's not binding to proposal 1, right? We
> >> could also implement interfaces about temperature and secondary cache
> >> to reuse them, or organize a more complex HybridEnv based on proposal 2.
> >>
> >> > My point is whether we should retain the potential of proposal 1 in
> >> > the design.
> >>
> >> This is a good suggestion. We chose proposal 2 first due to its
> >> maintainability and scalability, especially because it can conveniently
> >> leverage all the filesystems Flink supports.
> >> Given the undeniable performance advantage, I think we could also
> >> consider proposal 1 as an optimization in the future.
> >> For the interface on the DB side, we could also expose more different
> >> Envs in the future.
> >>
> >> On Tue, Mar 19, 2024 at 9:14 PM yue ma &lt;mayuefi...@gmail.com&gt; wrote:
> >>
> >> > Hi Hangxiang,
> >> >
> >> > Thanks for bringing this discussion.
> >> > I have a few questions about the proposals you mentioned in the FLIP.
> >> >
> >> > The current conclusion is to use proposal 2, which is okay for me. My
> >> > point is whether we should retain the potential of proposal 1 in the
> >> > design, for the following reasons:
> >> > 1. No JNI overhead, as described in the Performance part of the FLIP.
> >> > 2. RocksDB currently also provides an interface for Env, and there
> >> > are also some implementations, such as HDFS-Env, which seem to be
> >> > easily extensible.
> >> > 3. The RocksDB community continues to support LSM trees on different
> >> > storage media, such as Tiered Storage
> >> > &lt;https://github.com/facebook/rocksdb/wiki/Tiered-Storage-%28Experimental%29&gt;,
> >> > and some optimizations have been made for this scenario, such as Per
> >> > Key Placement Compaction
> >> > &lt;https://rocksdb.org/blog/2022/11/09/time-aware-tiered-storage.html&gt;
> >> > and Secondary Cache
> >> > &lt;https://github.com/facebook/rocksdb/wiki/SecondaryCache-%28Experimental%29&gt;,
> >> > similar to the Hybrid Block Cache mentioned in FLIP-423.
> >> > If we use proposal 1, we can easily reuse these optimizations. It is
> >> > even possible to discuss and review the solution together in the
> >> > RocksDB community.
> >> > In fact, we have already implemented some production practices using
> >> > proposal 1 internally. We have integrated HybridEnv, Tiered Storage,
> >> > and Secondary Cache on RocksDB and optimized the performance of
> >> > checkpoint and state restore. It seems to work well for us.
> >> >
> >> > --
> >> > Best,
> >> > Yue
> >>
> >> --
> >> Best,
> >> Hangxiang.
> >
> > --
> > Best,
> > Hangxiang.

--
Best,
Hangxiang.