Hi, Feifan. Thanks for your reply.
> What if we only use jni to access DFS that needs to reuse Flink
> FileSystem? And all local disk access through native api. This idea is
> based on the understanding that jni overhead is not worth mentioning
> compared to DFS access latency. It might make more sense to consider
> avoiding jni overhead for faster local disks. Since local disk as
> secondary is already under consideration [1], maybe we can discuss in
> that FLIP whether to use native api to access local disk?

This is a good suggestion. It's reasonable to use the native api to access
the local disk cache since it requires lower latency compared to remote
access. I also believe that the jni overhead is relatively negligible when
weighed against the latency of remote I/O, as mentioned in the FLIP.

So I think we could just go with proposal 2 and keep proposal 1 as a
potential future optimization, which could work better when there is a
higher performance requirement, or when the native libraries of some
filesystems offer significantly better performance and resource usage than
their java libs.
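To illustrate what the Java side of proposal 2 could look like: the native
ForSt side would only need a thin JNI shim over Flink's FileSystem, e.g. a
positioned read. A rough sketch (the class and method names here are only
illustrative, not the actual interface in the FLIP):

import org.apache.flink.core.fs.FSDataInputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

import java.io.IOException;

// Illustrative sketch only: a Java-side bridge that a native ForSt Env
// could call through JNI to reuse Flink's FileSystem for DFS access.
public class FlinkFileSystemBridge {

    // Positioned read invoked from native code; returns the number of
    // bytes read, or -1 at EOF. A real implementation would cache open
    // streams instead of reopening the file on every call.
    public static int pread(String uri, long offset, byte[] buffer) throws IOException {
        Path path = new Path(uri);
        // Resolves whichever Flink FileSystem is registered for the
        // scheme (s3://, hdfs://, oss://, ...).
        FileSystem fs = path.getFileSystem();
        try (FSDataInputStream in = fs.open(path)) {
            in.seek(offset);
            int total = 0;
            while (total < buffer.length) {
                int n = in.read(buffer, total, buffer.length - total);
                if (n < 0) {
                    break; // end of file
                }
                total += n;
            }
            return total == 0 ? -1 : total;
        }
    }
}

Each such call pays one JNI crossing on top of the remote read itself, so
with DFS round trips in the millisecond range the extra microseconds should
be in the noise, which is exactly the assumption above.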
On Thu, Mar 28, 2024 at 11:39 AM Feifan Wang &lt;zoltar9...@163.com&gt; wrote:

> Thanks for this valuable proposal Hangxiang!
>
> > If we need to introduce a JNI call during each filesystem call, that
> > would be N times JNI cost compared with the current RocksDB
> > state-backend's JNI cost.
>
> What if we only use jni to access DFS that needs to reuse Flink
> FileSystem? And all local disk access through native api. This idea is
> based on the understanding that jni overhead is not worth mentioning
> compared to DFS access latency. It might make more sense to consider
> avoiding jni overhead for faster local disks. Since local disk as
> secondary is already under consideration [1], maybe we can discuss in
> that FLIP whether to use native api to access local disk?
>
> > I'd suggest keeping `state.backend.forSt.working-dir` as it is for now.
> > Different disaggregated state storages may have their own semantics
> > about this configuration, e.g. life cycle, supported file systems or
> > storages.
>
> I agree with considering moving this configuration up to the engine level
> when there are other disaggregated backends.
>
> [1] https://cwiki.apache.org/confluence/x/U4p3EQ
>
> ——————————————
> Best regards,
> Feifan Wang
>
> At 2024-03-28 09:55:48, "Hangxiang Yu" &lt;master...@gmail.com&gt; wrote:
>
> > Hi, Yun.
> > Thanks for the reply.
> >
> > The JNI cost you considered is right. As replied to Yue, I agreed to
> > leave space and consider proposal 1 as an optimization in the future,
> > which is also updated in the FLIP.
> >
> >> The other question is that the configuration of
> >> `state.backend.forSt.working-dir` looks too coupled with the ForSt
> >> state-backend, how would it be if we introduce another disaggregated
> >> state storage? Thus, I think `state.backend.disaggregated.working-dir`
> >> might be a better configuration name.
> >
> > I'd suggest keeping `state.backend.forSt.working-dir` as it is for now.
> > Different disaggregated state storages may have their own semantics
> > about this configuration, e.g. life cycle, supported file systems or
> > storages. Maybe it's more suitable to consider it together when we
> > introduce other disaggregated state storages in the future.
> >
> > On Thu, Mar 28, 2024 at 12:02 AM Yun Tang &lt;myas...@live.com&gt; wrote:
> >
> >> Hi Hangxiang,
> >>
> >> The design looks good, and I also support leaving space for proposal 1.
> >>
> >> As you know, loading index/filter/data blocks for querying across
> >> levels would introduce high IO access within the LSM tree for old
> >> data. If we need to introduce a JNI call during each filesystem call,
> >> that would be N times the JNI cost compared with the current RocksDB
> >> state-backend's JNI cost.
> >>
> >> The other question is that the configuration of
> >> `state.backend.forSt.working-dir` looks too coupled with the ForSt
> >> state-backend; how would it be if we introduce another disaggregated
> >> state storage? Thus, I think `state.backend.disaggregated.working-dir`
> >> might be a better configuration name.
> >>
> >> Best
> >> Yun Tang
> >>
> >> ________________________________
> >> From: Hangxiang Yu &lt;master...@gmail.com&gt;
> >> Sent: Wednesday, March 20, 2024 11:32
> >> To: dev@flink.apache.org &lt;dev@flink.apache.org&gt;
> >> Subject: Re: [DISCUSS] FLIP-427: Disaggregated State Store
> >>
> >> Hi, Yue.
> >> Thanks for the reply.
> >>
> >> > If we use proposal 1, we can easily reuse these optimizations. It is
> >> > even possible to discuss and review the solution together in the
> >> > RocksDB community.
> >>
> >> We also saw these useful optimizations, which could be applied to ForSt
> >> in the future. But IIUC, it's not binding to proposal 1, right? We
> >> could also implement interfaces about temperature and secondary cache
> >> to reuse them, or organize a more complex HybridEnv based on proposal 2.
> >>
> >> > My point is whether we should retain the potential of proposal 1 in
> >> > the design.
> >>
> >> This is a good suggestion. We chose proposal 2 first due to its
> >> maintainability and scalability, especially because it can conveniently
> >> leverage all the filesystems Flink supports.
> >> Given the undeniable performance advantage, I think we could also
> >> consider proposal 1 as an optimization in the future.
> >> For the interface on the DB side, we could also expose more different
> >> Envs in the future.
> >>
> >> On Tue, Mar 19, 2024 at 9:14 PM yue ma &lt;mayuefi...@gmail.com&gt; wrote:
> >>
> >> > Hi Hangxiang,
> >> >
> >> > Thanks for bringing this discussion.
> >> > I have a few questions about the proposals you mentioned in the FLIP.
> >> >
> >> > The current conclusion is to use proposal 2, which is okay for me. My
> >> > point is whether we should retain the potential of proposal 1 in the
> >> > design, for the following reasons:
> >> > 1. No JNI overhead, as described in the Performance part of the FLIP.
> >> > 2. RocksDB currently also provides an interface for Env, and there
> >> > are also some implementations, such as HDFS-Env, which seem to be
> >> > easily extensible.
> >> > 3. The RocksDB community continues to support LSM trees on different
> >> > storage media, such as Tiered Storage
> >> > &lt;https://github.com/facebook/rocksdb/wiki/Tiered-Storage-%28Experimental%29&gt;,
> >> > and some optimizations have been made for this scenario, such as Per
> >> > Key Placement Compaction
> >> > &lt;https://rocksdb.org/blog/2022/11/09/time-aware-tiered-storage.html&gt;
> >> > and Secondary Cache
> >> > &lt;https://github.com/facebook/rocksdb/wiki/SecondaryCache-%28Experimental%29&gt;,
> >> > similar to the Hybrid Block Cache mentioned in FLIP-423.
> >> > If we use proposal 1, we can easily reuse these optimizations. It is
> >> > even possible to discuss and review the solution together in the
> >> > RocksDB community.
> >> > In fact, we have already implemented some production practices using
> >> > proposal 1 internally. We have integrated HybridEnv, Tiered Storage,
> >> > and Secondary Cache on RocksDB and optimized the performance of
> >> > checkpoint and state restore. It seems to work well for us.
> >> >
> >> > --
> >> > Best,
> >> > Yue
> >>
> >> --
> >> Best,
> >> Hangxiang.
> >
> > --
> > Best,
> > Hangxiang.

--
Best,
Hangxiang.