Hi Hangxiang,

Thanks for the proposal. +1 for it.
I have a few comments.

> Proposal 2 has additional JNI overhead, but the overhead is relatively
> negligible when weighed against the latency of remote I/O.

- Is this argument true for all workloads? In particular, does it also hold
for workloads with many small files, which is quite a common case [1]?

- Also, in many workloads the engine does not need the whole file, either
because the query forces it, because the file type supports efficient
filtering (e.g. ORC, Parquet, Arrow files), or simply because one file is
"divided" among multiple workers.
In these cases, the engine issues a huge number of scan range requests to the
file system to retrieve different parts of a file.
How would the proposed solution work with these workloads?

- A similar question applies to caching as well (I know caching is the
subject of FLIP-429; I am asking here because of the related section in
this FLIP).

Regards,
Jeyhun

[1] https://blog.min.io/challenge-big-data-small-files/



On Thu, Mar 7, 2024 at 10:09 AM Hangxiang Yu <master...@gmail.com> wrote:

> Hi devs,
>
>
> I'd like to start a discussion on a sub-FLIP of FLIP-423: Disaggregated
> State Storage and Management[1], which is a joint work of Yuan Mei, Zakelly
> Lan, Jinzhong Li, Hangxiang Yu, Yanfei Lei and Feng Wang:
>
> - FLIP-427: Disaggregated State Store
>
> This FLIP introduces the initial version of the ForSt disaggregated state
> store.
>
> Please make sure you have read the FLIP-423[1] to know the whole story, and
> we'll discuss the details of FLIP-427[2] under this mail. For the
> discussion of overall architecture or topics related with multiple
> sub-FLIPs, please post in the previous mail[3].
>
> Looking forward to hearing from you!
>
> [1] https://cwiki.apache.org/confluence/x/R4p3EQ
>
> [2] https://cwiki.apache.org/confluence/x/T4p3EQ
>
> [3] https://lists.apache.org/thread/ct8smn6g9y0b8730z7rp9zfpnwmj8vf0
>
>
> Best,
>
> Hangxiang.
>
