Thanks Liebing for the update. +1 to start the vote.
Best, Jark On Tue, 3 Mar 2026 at 14:34, Liebing Yu <[email protected]> wrote: > > Hi Jark! > > Thank you for your insightful suggestions. This FIP is a small step for > Fluss towards multi-remote (or multi-cloud) storage. As you mentioned, we > envision future support for commit-level multi-pathing, similar to the > approaches taken by Paimon and Lance. > > Regarding your comments on the current FIP, I'm generally in agreement. > > 1. FileSystem#obtainSecurityToken(FsPath f) > For the current implementation, obtainSecurityToken(FsPath f) is actually > redundant and can be removed. > > 2. GetFileSystemSecurityTokenRequest/Response and Client Token Management > Your suggestion simplifies the implementation of multi-path authorization. > By deferring table-level authentication to a later stage, we can expedite > the landing of this FIP. I will update the FIP accordingly. > > Best regards, > Liebing Yu > > > On Tue, 3 Mar 2026 at 01:19, Jark Wu <[email protected]> wrote: > > > Hi Liebing, > > > > Thank you for the proposal. I believe this is an excellent initiative > > to improve throughput for large-scale clusters utilizing remote > > storage. > > > > The current design implements multi-location support at the table or > > partition level, meaning only new tables and partitions will utilize > > new remote locations. Consequently, even after upgrading the cluster > > to support multiple paths, data distribution will remain concentrated > > in a single location for an extended period, failing to achieve rapid > > traffic fan-out. In contrast, industry solutions like Paimon support > > "data-file.external-paths" [1] to distribute new data files across > > multiple paths, and Lance has recently introduced a file-level > > multi-base layout [2]. > > > > Ultimately, we need file-level multi-location support (I believe this > > approach will resolve most of the concerns raised above by Yang Guo). 
> > However, I am fine with supporting partition-level multi-location as > > an initial phase, provided we have a clear roadmap toward the final > > solution. > > > > Regarding the design details of this FIP, I have the following comments: > > > > 1. FileSystem#obtainSecurityToken(FsPath f) > > We should not add the FsPath parameter to the obtainSecurityToken > > interface for now, because in the current design, this interface only > > retrieves the security token for the entire filesystem rather than for > > a specific path. Since a filesystem is defined per authority, the > > authority does not need to be derived from an FsPath. > > > > In fact, we plan to refactor the Filesystem soon. This refactoring > > will add the FsPath parameter to obtainSecurityToken, ensuring the > > returned token is strictly scoped to that specific path. This change > > aims to address current permission leakage issues where a token > > requested for reading one table inadvertently grants access to all > > remote files of other tables. > > > > 2. GetFileSystemSecurityTokenRequest/Response and Client Token Management > > > > Current Issue: The FIP proposes maintaining a SecurityTokenManager per > > LogScanner. However, since tokens are shared at the filesystem > > granularity, tokens for the same FsKey across different tables should > > be consolidated. Therefore, the DefaultSecurityTokenManager must be > > maintained within the FlussConnection; otherwise, > > SecurityTokenManagers for different tables will overwrite each other's > > tokens. > > > > Recommendation: A straightforward approach is to leave > > GetFileSystemSecurityTokenRequest unchanged while modifying > > GetFileSystemSecurityTokenResponse to return a list of tokens. The > > server side would then return STS tokens for each FsKey configured in > > the cluster. The client-side Filesystem would subsequently retrieve > > the corresponding STS token based on the FsKey. This avoids changes to > > the LogScanner logic. 
> > > > While this approach retains the existing permission leakage issue, > > that problem is already present today. We can address it in a > > separate, dedicated FIP to simplify the scope and implementation of > > the current proposal. > > > > Best, > > Jark > > > > [1] https://paimon.apache.org/docs/1.3/maintenance/configurations/ > > [2] > > https://lancedb.com/blog/rethinking-table-file-paths-lance-multi-base-layout/ > > > > > > On Sat, 28 Feb 2026 at 20:37, Yang Guo <[email protected]> wrote: > > > > > > Hi Liebing and all, > > > > > > This is a good FIP to resolve bottlenecks in the remote storage. Thanks > > for > > > your effort. The design looks good to me and the above discussion has > > > covered some concerns in my mind. > > > > > > Now there are some further considerations I'm thinking of: > > > > > > 1. What happens if a path goes down? > > > Right now, there’s no automatic failover. If one S3 bucket (or HDFS > > path) > > > dies, every table or partition assigned to it just fails. Could we add > > > simple health checks? If a path looks dead, the remote dir selector > > > temporarily skips it until it’s back up. > > > > > > 2. New paths don't always help old data. > > > The routing only happens when a new table or new partition is created. > > And > > > it depends on the partition strategy. > > > - If the table is using time-based partitions (e.g., daily), adding new > > > paths works well because new data goes to new partitions on new paths. > > > - But for non-partitioned tables, or if it keeps writing to old > > partitions, > > > the new paths sit idle. The traffic never shifts over. > > > It requires developers to think further about partition strategy and > > input > > > data when adding remote dirs. > > > > > > 3. Managing "weights" is tricky manually for developers/maintainers. > > > Since the weighted round-robin is static: > > > - Developers/Maintainers have to determine the right weights based on > > > current traffic. 
> > > - If you skew weights to favor a path, you have to remember to > > > rebalance them later, or that path gets overloaded forever. E.g. If two > > > paths are weighted [1, 2] in the beginning to rebalance the higher > > traffic > > > in the first path. Developers/Maintainers should remember to change the > > > weights back to [1, 1] after the traffic is balanced between two paths. > > > Otherwise the traffic in the second path will keep growing. > > > - Also, setting a weight to 0 behaves differently depending on your > > > partition type (time-based paths eventually go quiet, but field-based > > ones > > > like "country=US" keep writing there forever). > > > Instead of manual tuning, could we eventually make this dynamic? Let the > > > system adjust weights based on real-time latency or throttling metrics. > > > > > > The points above are about future operational considerations—regarding > > > failover and maintenance after this solution is deployed. I think they > > > won't block this FIP. We may not need to fix these right now. Just bring > > > them into this discussion. > > > > > > Regards, > > > Yang Guo > > > > > > On Fri, Feb 27, 2026 at 5:53 PM Liebing Yu <[email protected]> wrote: > > > > > > > Hi Lorenzo, sorry for the late reply. > > > > > > > > Thanks for the AWS example! This further solidifies the case for > > multi-path > > > > support. > > > > > > > > Regarding your question about multi-cloud support: > > > > Our current design naturally supports multi-cloud object storage > > systems. > > > > Since the implementation is built upon a multi-schema filesystem > > > > abstraction (supporting schemes like s3://, oss://, abfs://, etc.), the > > > > system is inherently "cloud-agnostic." 
> > > > > > > > Best regards, > > > > Liebing Yu > > > > > > > > > > > > On Wed, 4 Feb 2026 at 23:37, Lorenzo Affetti via dev < > > [email protected] > > > > > > > > > wrote: > > > > > > > > > This is quite an interesting FIP and I think it is a significant > > > > > enhancement, especially for large-scale clusters. > > > > > > > > > > I think you can also add the AWS case in your motivation: > > > > > > > > > > > > > > > > https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-design-patterns.html#optimizing-performance-high-request-rate > > > > > AWS automatically scales if requests exceed 5,500 per second for the > > same > > > > > prefix, which results in transient 503 errors. > > > > > Your approach would eliminate this problem by providing another > > bucket. > > > > > > > > > > I was wondering if it might also provide the possibility of > > configuring > > > > the > > > > > same Fluss cluster for multi-cloud object storage systems. > > > > > From a design perspective, nothing should prevent me from storing > > remote > > > > > data on both Azure and AWS at the same time, probably resulting in > > > > > different performance numbers for different partitions/tables. > > > > > Should the design force the use of only 1 filesystem implementation? > > > > > > > > > > Thank you again! > > > > > > > > > > On Fri, Jan 30, 2026 at 7:59 AM Liebing Yu <[email protected]> > > wrote: > > > > > > > > > > > Hi Yuxia, thanks for the thoughtful response. Let me go through > > your > > > > > > questions one by one. > > > > > > > > > > > > 1. I think after we support `remote.data.dirs`, different schemas > > will > > > > be > > > > > > supported naturally. > > > > > > 2. Yes, I think we should change from `PbTablePath` to > > > > > > `PbPhysicalTablePath`. > > > > > > 3. Thanks for the reminder. I'll poc authentication in > > > > > > https://github.com/apache/fluss/issues/2518. 
But it doesn't block > > the > > > > > > multiple-paths implementation in Fluss server in > > > > > > https://github.com/apache/fluss/issues/2517. > > > > > > 4. For a partition table, the table itself has a remote data dir > > for > > > > > > metadata (such as lake offset). And each partition has its own > > remote > > > > dir > > > > > > for table data (e.g. kv or log data). > > > > > > 5. Legacy clients can access data in the new cluster. > > > > > > > > > > > > - If the permissions of the paths specified in > > `remote.data.dirs` on > > > > > the > > > > > > new cluster match those configured in `remote.data.dir`, > > seamless > > > > > > access is > > > > > > achievable. > > > > > > - If the permissions are inconsistent, access permissions must > > be > > > > > > explicitly configured. For example, when using OSS, a policy > > > > granting > > > > > > access permissions to the account identified by `fs.oss.roleArn` > > > > must > > > > > be > > > > > > configured for each bucket specified in `remote.data.dirs`. > > > > > > > > > > > > > > > > > > Best regards, > > > > > > Liebing Yu > > > > > > > > > > > > > > > > > > On Thu, 29 Jan 2026 at 10:07, Yuxia Luo <[email protected]> wrote: > > > > > > > > > > > > > Hi, Liebing > > > > > > > > > > > > > > Thanks for the detailed FIP. I have a few questions: > > > > > > > 1. Does `remote.data.dirs` support paths with different schemes? > > For > > > > > > > example: > > > > > > > ``` > > > > > > > remote.data.dirs: oss://bucket1/fluss-data, > > s3://bucket2/fluss-data > > > > > > > ``` > > > > > > > > > > > > > > 2. Should `GetFileSystemSecurityTokenRequest` include partition? > > > > > > > The FIP adds `table_path` to the request, but since different > > > > > partitions > > > > > > > may reside on different remote paths (and require different > > tokens), > > > > > > > should the request also include partition information? > > > > > > > > > > > > > > 3. 
Just a reminder that `DefaultSecurityTokenManager` will become > > more > > > > > > > complex... > > > > > > > This is not a blocker, but worth a poc to recognize any > > complexity > > > > > > > > > > > > > > 4. I want to confirm my understanding: For a partitioned table, > > does > > > > > the > > > > > > > table itself have a remote dir, AND each partition also has its > > own > > > > > > remote > > > > > > > dir? > > > > > > > > > > > > > > Or is it: > > > > > > > - Non-partitioned table → table-level remote dir > > > > > > > - Partitioned table → only partition-level remote dirs (no > > > > > table-level)? > > > > > > > > > > > > > > 5. Can old clients (without table path in token request) still > > read > > > > > data > > > > > > > from new clusters? > > > > > > > One possible solution is: For RPCs without table information, the > > > > > server > > > > > > > returns a token for the first dir in `remote.data.dirs`. Or other > > > > ways > > > > > > that > > > > > > > allow users to configure the cluster to keep compatibility > > > > > > > > > > > > > > > > > > > > > > On 2026/01/21 03:52:29 Zhe Wang wrote: > > > > > > > > Thanks for your response, now it looks good to me. > > > > > > > > > > > > > > > > Best regards, > > > > > > > > Zhe Wang > > > > > > > > > > > > > > > > Liebing Yu <[email protected]> wrote on Tue, 20 Jan 2026 at 14:29: > > > > > > > > > > > > > > > > > Hi Zhe, sorry for the late reply. > > > > > > > > > > > > > > > > > > The primary focus of this FIP is not to address read/write > > issues > > > > > at > > > > > > > the > > > > > > > > > table or partition level, but rather to overcome limitations > > at > > > > the > > > > > > > cluster > > > > > > > > > level. Given the current capabilities of object storage, > > > > read/write > > > > > > > > > performance for a single table or partition is unlikely to > > be a > > > > > > > bottleneck; > > > > > > > > > however, for a large-scale Fluss cluster, it can easily > > become > > > > one. 
> > > > > > > > > Therefore, the core objective here is to distribute the > > > > > cluster-wide > > > > > > > > > read/write traffic across multiple remote storage systems. > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > > Liebing Yu > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 14 Jan 2026 at 16:07, Zhe Wang < > > [email protected]> > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Hi Liebing, Thanks for the clarification. > > > > > > > > > > >1. To clarify, the data is currently split by partition > > level > > > > > for > > > > > > > > > > partitioned tables and by table for non-partitioned tables. > > > > > > > > > > > > > > > > > > > > Therefore the main aim of this FIP is improving the speed of > > > > > > > > > > reading data from different partitions, while the speed of storing > > > > > > > > > > data may still be limited for a single system? > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > Zhe Wang > > > > > > > > > > > > > > > > > > > > Liebing Yu <[email protected]> wrote on Tue, 13 Jan 2026 at 19:11: > > > > > > > > > > > > > > > > > > > > > Hi Zhe, Thanks for the questions! > > > > > > > > > > > > > > > > > > > > > > 1. To clarify, the data is currently split by partition > > level > > > > > for > > > > > > > > > > > partitioned tables and by table for non-partitioned > > tables. > > > > > > > > > > > > > > > > > > > > > > 2. Regarding RemoteStorageCleaner, you are absolutely > > right. > > > > > > > Supporting > > > > > > > > > > > remote.data.dirs there is necessary for a complete > > cleanup > > > > > when a > > > > > > > table > > > > > > > > > > is > > > > > > > > > > > dropped. > > > > > > > > > > > > > > > > > > > > > > Thanks for pointing that out! 
> > > > > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > > > > Liebing Yu > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, 12 Jan 2026 at 17:02, Zhe Wang < > > > > [email protected]> > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > Hi Liebing, > > > > > > > > > > > > > > > > > > > > > > > > Thanks for driving this, I think it's a really useful > > > > > feature. > > > > > > > > > > > > I have two small questions: > > > > > > > > > > > > 1. What's the scope for split data in dirs, I see > > there's a > > > > > > > > > partitionId > > > > > > > > > > > in > > > > > > > > > > > > ZK Data, so the data will be split by partition in > > different > > > > > > > directories, > > > > > > > > > > or > > > > > > > > > > > by > > > > > > > > > > > > bucket? > > > > > > > > > > > > 2. Maybe it needs to support remote.data.dirs in > > > > > > > > > RemoteStorageCleaner? > > > > > > > > > > So > > > > > > > > > > > > we can delete all remoteStorage when deleting a table. > > > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > > Zhe Wang > > > > > > > > > > > > > > > > > > > > > > > > Liebing Yu <[email protected]> wrote on Thu, 8 Jan 2026 at 20:10: > > > > > > > > > > > > > > > > > > > > > > > > > Hi devs, > > > > > > > > > > > > > > > > > > > > > > > > > > I propose initiating discussion on FIP-25[1]. Fluss > > > > > leverages > > > > > > > > > remote > > > > > > > > > > > > > storage systems—such as Amazon S3, HDFS, and Alibaba > > > > Cloud > > > > > > > OSS—to > > > > > > > > > > > > deliver a > > > > > > > > > > > > > cost-efficient, highly available, and fault-tolerant > > > > > storage > > > > > > > > > solution > > > > > > > > > > > > > compared to local disk. *However, in production > > > > > environments, > > > > > > > we > > > > > > > > > > often > > > > > > > > > > > > find > > > > > > > > > > > > > that the bandwidth of a single remote storage > > becomes a > > > > > > > bottleneck. 
> > > > > > > > > > > > *Taking > > > > > > > > > > > > > OSS[2] as an example, the typical upload bandwidth > > limit > > > > > for > > > > > > a > > > > > > > > > single > > > > > > > > > > > > > account is 20 Gbit/s (Internal) and 10 Gbit/s > > (Public). > > > > So > > > > > I > > > > > > > > > > initiated > > > > > > > > > > > > this > > > > > > > > > > > > > FIP which aims to introduce support for multiple > > remote > > > > > > storage > > > > > > > > > paths > > > > > > > > > > > and > > > > > > > > > > > > > enables the dynamic addition of new storage paths > > without > > > > > > > service > > > > > > > > > > > > > interruption. > > > > > > > > > > > > > > > > > > > > > > > > > > Any feedback and suggestions on this proposal are > > > > welcome! > > > > > > > > > > > > > > > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLUSS/FIP-25%3A+Support+Multi-Location+for+Remote+Storage > > > > > > > > > > > > > [2] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://www.alibabacloud.com/help/en/oss/user-guide/limits?spm=a2c63.l28256.help-menu-31815.d_0_0_5.2ac34d06oZYFvK > > > > > > > > > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > > > > > > Liebing Yu > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Lorenzo Affetti > > > > > Senior Software Engineer @ Flink Team > > > > > Ververica <http://www.ververica.com> > > > > > > > > > > >
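The static weighted round-robin that Yang Guo's third point discusses could be sketched roughly as below. This is an illustrative sketch only, not the actual Fluss implementation; the class and method names (WeightedRemoteDirSelector, nextRemoteDir) are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: weighted round-robin over configured remote dirs.
public class WeightedRemoteDirSelector {
    private final List<String> expanded = new ArrayList<>();
    private int cursor = 0;

    public WeightedRemoteDirSelector(List<String> dirs, List<Integer> weights) {
        // Expand each dir by its weight so a plain round-robin over the
        // expanded list yields the weighted distribution: weights [1, 2]
        // send twice as many new assignments to the second dir.
        for (int i = 0; i < dirs.size(); i++) {
            for (int w = 0; w < weights.get(i); w++) {
                expanded.add(dirs.get(i));
            }
        }
    }

    // Invoked only when a new table or partition is created; existing
    // tables/partitions keep writing to their original dir, which is
    // why skewed weights must be rebalanced manually later.
    public synchronized String nextRemoteDir() {
        String dir = expanded.get(cursor);
        cursor = (cursor + 1) % expanded.size();
        return dir;
    }
}
```

Under this sketch, a weight of 0 simply excludes a dir from new assignments, consistent with the observation above that field-based partitions already routed to that dir keep writing there.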
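Jark's recommendation in his point 2 — a single connection-scoped manager holding one token per FsKey, with the client-side filesystem resolving the token from the path's scheme and authority — could look roughly like this. The names (ConnectionTokenStore, applyResponse, tokenFor) are assumptions for illustration; the real DefaultSecurityTokenManager and response types differ:

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: connection-scoped token store keyed by FsKey
// (scheme + authority), shared by all tables on the connection so
// per-table managers cannot overwrite each other's tokens.
public class ConnectionTokenStore {
    private final Map<String, String> tokensByFsKey = new ConcurrentHashMap<>();

    // Apply a token-list response that carries one STS token per FsKey
    // configured in the cluster; a later response refreshes the entries.
    public void applyResponse(Map<String, String> tokenPerFsKey) {
        tokensByFsKey.putAll(tokenPerFsKey);
    }

    // The client-side filesystem derives the FsKey from the remote path
    // and looks up the matching token.
    public String tokenFor(URI remotePath) {
        String fsKey = remotePath.getScheme() + "://" + remotePath.getAuthority();
        return tokensByFsKey.get(fsKey);
    }
}
```

Because lookup is purely by FsKey, this keeps the LogScanner logic unchanged, at the cost of retaining the filesystem-wide permission scope that the later refactoring is meant to narrow.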
