Hi Jark! Thank you for your insightful suggestions. This FIP is a small step for Fluss towards multi-remote (or multi-cloud) storage. As you mentioned, we envision future support for commit-level multi-pathing, similar to the approaches taken by Paimon and Lance.
Regarding your comments on the current FIP, I'm generally in agreement.

1. FileSystem#obtainSecurityToken(FsPath f)
In the current implementation, obtainSecurityToken(FsPath f) is actually redundant and can be removed.

2. GetFileSystemSecurityTokenRequest/Response and Client Token Management
Your suggestion simplifies the implementation of multi-path authorization. By deferring table-level authentication to a later stage, we can expedite the landing of this FIP. I will update the FIP accordingly.

Best regards,
Liebing Yu

On Tue, 3 Mar 2026 at 01:19, Jark Wu <[email protected]> wrote:
> Hi Liebing,
>
> Thank you for the proposal. I believe this is an excellent initiative to improve throughput for large-scale clusters utilizing remote storage.
>
> The current design implements multi-location support at the table or partition level, meaning only new tables and partitions will utilize new remote locations. Consequently, even after upgrading the cluster to support multiple paths, data distribution will remain concentrated in a single location for an extended period, failing to achieve rapid traffic fan-out. In contrast, industry solutions like Paimon support "data-file.external-paths" [1] to distribute new data files across multiple paths, and Lance has recently introduced a file-level multi-base layout [2].
>
> Ultimately, we need file-level multi-location support (I believe this approach will resolve most of the concerns raised above by Yang Guo). However, I am fine with supporting partition-level multi-location as an initial phase, provided we have a clear roadmap toward the final solution.
>
> Regarding the design details of this FIP, I have the following comments:
>
> 1. FileSystem#obtainSecurityToken(FsPath f)
> We should not add the FsPath parameter to the obtainSecurityToken interface for now, because in the current design this interface only retrieves the security token for the entire filesystem rather than for a specific path.
> Since a filesystem is defined per authority, the authority does not need to be derived from an FsPath.
>
> In fact, we plan to refactor the FileSystem soon. This refactoring will add the FsPath parameter to obtainSecurityToken, ensuring the returned token is strictly scoped to that specific path. This change aims to address current permission leakage issues, where a token requested for reading one table inadvertently grants access to all remote files of other tables.
>
> 2. GetFileSystemSecurityTokenRequest/Response and Client Token Management
>
> Current issue: The FIP proposes maintaining a SecurityTokenManager per LogScanner. However, since tokens are shared at the filesystem granularity, tokens for the same FsKey across different tables should be consolidated. Therefore, the DefaultSecurityTokenManager must be maintained within the FlussConnection; otherwise, SecurityTokenManagers for different tables will overwrite each other's tokens.
>
> Recommendation: A straightforward approach is to leave GetFileSystemSecurityTokenRequest unchanged while modifying GetFileSystemSecurityTokenResponse to return a list of tokens. The server side would then return STS tokens for each FsKey configured in the cluster. The client-side FileSystem would subsequently retrieve the corresponding STS token based on the FsKey. This avoids changes to the LogScanner logic.
>
> While this approach retains the existing permission leakage issue, that problem is already present today. We can address it in a separate, dedicated FIP to simplify the scope and implementation of the current proposal.
>
> Best,
> Jark
>
> [1] https://paimon.apache.org/docs/1.3/maintenance/configurations/
> [2] https://lancedb.com/blog/rethinking-table-file-paths-lance-multi-base-layout/
>
>
> On Sat, 28 Feb 2026 at 20:37, Yang Guo <[email protected]> wrote:
> >
> > Hi Liebing and all,
> >
> > This is a good FIP to resolve bottlenecks in the remote storage.
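[Editor's note: Jark's per-FsKey consolidation above could be sketched roughly as follows. All class and method names here are hypothetical illustrations, not the actual Fluss API: a single connection-scoped cache keeps one token per FsKey, so scanners for different tables share and refresh the same entries instead of overwriting each other.]

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a connection-scoped token cache keyed by FsKey
// (e.g. "oss://bucket1"), shared by all tables/scanners on one connection.
class ConnectionTokenCache {
    // FsKey -> opaque STS token string (in Fluss this would be a richer type).
    private final Map<String, String> tokensByFsKey = new ConcurrentHashMap<>();

    // Apply the full token list from one GetFileSystemSecurityTokenResponse,
    // which (per the recommendation) carries a token for every configured FsKey.
    public void onTokenResponse(Map<String, String> tokens) {
        tokensByFsKey.putAll(tokens);
    }

    // Resolve the token for the filesystem behind a remote path.
    public String tokenFor(String fsKey) {
        String token = tokensByFsKey.get(fsKey);
        if (token == null) {
            throw new IllegalStateException("No token for FsKey: " + fsKey);
        }
        return token;
    }

    public static void main(String[] args) {
        ConnectionTokenCache cache = new ConnectionTokenCache();
        // One response covers all FsKeys configured in the cluster.
        cache.onTokenResponse(Map.of(
                "oss://bucket1", "sts-token-a",
                "s3://bucket2", "sts-token-b"));
        // Scanners for different tables resolve by FsKey; nothing is overwritten.
        System.out.println(cache.tokenFor("oss://bucket1"));
        System.out.println(cache.tokenFor("s3://bucket2"));
    }
}
```

Because the cache lives on the connection rather than on each LogScanner, a refresh triggered by one table also updates every other table that shares the same FsKey.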
> > Thanks for your effort. The design looks good to me, and the above discussion has covered some concerns in my mind.
> >
> > Now there are some further considerations I'm thinking of:
> >
> > 1. What happens if a path goes down?
> > Right now, there's no automatic failover. If one S3 bucket (or HDFS path) dies, every table or partition assigned to it just fails. Could we add simple health checks? If a path looks dead, the remote dir selector temporarily skips it until it's back up.
> >
> > 2. New paths don't always help old data.
> > The routing only happens when a new table or new partition is created, and it depends on the partition strategy.
> > - If the table is using time-based partitions (e.g., daily), adding new paths works well because new data goes to new partitions on new paths.
> > - But for non-partitioned tables, or if it keeps writing to old partitions, the new paths sit idle. The traffic never shifts over.
> > It requires developers to think further about partition strategy and input data when adding remote dirs.
> >
> > 3. Managing "weights" manually is tricky for developers/maintainers.
> > Since the weighted round-robin is static:
> > - Developers/maintainers have to determine the right weights based on current traffic.
> > - If you skew weights to favor a path, you have to remember to rebalance them later, or that path gets overloaded forever. E.g., if two paths are weighted [1, 2] in the beginning to offload the higher traffic on the first path, developers/maintainers should remember to change the weights back to [1, 1] once the traffic is balanced between the two paths. Otherwise the traffic on the second path will keep growing.
> > - Also, setting a weight to 0 behaves differently depending on your partition type (time-based paths eventually go quiet, but field-based ones like "country=US" keep writing there forever).
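[Editor's note: for concreteness, the static weighted round-robin, combined with the health-check skip suggested in point 1, might look like the sketch below. Class and method names are made up for illustration; this is not the actual Fluss selector.]

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a weighted round-robin remote dir selector that
// skips paths currently marked unhealthy (illustration only, not Fluss code).
class RemoteDirSelector {
    private final List<String> schedule = new ArrayList<>(); // each path repeated `weight` times
    private int cursor = 0;

    public RemoteDirSelector(Map<String, Integer> pathWeights) {
        // Expand weights into a flat schedule, e.g. {a=1, b=2} -> [a, b, b].
        // A weight of 0 simply never appears in the schedule.
        pathWeights.forEach((path, weight) -> {
            for (int i = 0; i < weight; i++) {
                schedule.add(path);
            }
        });
    }

    // Pick the next path, skipping any path reported as unhealthy.
    public String next(Set<String> unhealthy) {
        for (int i = 0; i < schedule.size(); i++) {
            String candidate = schedule.get(cursor);
            cursor = (cursor + 1) % schedule.size();
            if (!unhealthy.contains(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("All remote data dirs are unhealthy");
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("oss://bucket1/fluss-data", 1);
        weights.put("s3://bucket2/fluss-data", 2);
        RemoteDirSelector selector = new RemoteDirSelector(weights);
        // With weights [1, 2], bucket2 receives two of every three assignments.
        for (int i = 0; i < 6; i++) {
            System.out.println(selector.next(Set.of()));
        }
    }
}
```

This also illustrates the operational concern in point 3: a skewed weighting like [1, 2] keeps favoring the second path indefinitely until an operator changes it back.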
> > Instead of manual tuning, could we eventually make this dynamic? Let the system adjust weights based on real-time latency or throttling metrics.
> >
> > The points above are about future operational considerations regarding failover and maintenance after this solution is deployed. I think they won't block this FIP, and we may not need to fix these right now; I just want to bring them into this discussion.
> >
> > Regards,
> > Yang Guo
> >
> > On Fri, Feb 27, 2026 at 5:53 PM Liebing Yu <[email protected]> wrote:
> >
> > > Hi Lorenzo, sorry for the late reply.
> > >
> > > Thanks for the AWS example! This further solidifies the case for multi-path support.
> > >
> > > Regarding your question about multi-cloud support:
> > > Our current design naturally supports multi-cloud object storage systems. Since the implementation is built upon a multi-scheme filesystem abstraction (supporting schemes like s3://, oss://, abfs://, etc.), the system is inherently "cloud-agnostic."
> > >
> > > Best regards,
> > > Liebing Yu
> > >
> > >
> > > On Wed, 4 Feb 2026 at 23:37, Lorenzo Affetti via dev <[email protected]> wrote:
> > >
> > > > This is quite an interesting FIP and I think it is a significant enhancement, especially for large-scale clusters.
> > > >
> > > > I think you can also add the AWS case to your motivation:
> > > > https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-design-patterns.html#optimizing-performance-high-request-rate
> > > > AWS automatically scales if requests exceed 5,500 per second for the same prefix, which results in transient 503 errors. Your approach would eliminate this problem by providing another bucket.
> > > >
> > > > I was wondering if it might also provide the possibility of configuring the same Fluss cluster for multi-cloud object storage systems.
> > > > From a design perspective, nothing should prevent me from storing remote data on both Azure and AWS at the same time, probably resulting in different performance numbers for different partitions/tables.
> > > > Should the design force the use of only one filesystem implementation?
> > > >
> > > > Thank you again!
> > > >
> > > > On Fri, Jan 30, 2026 at 7:59 AM Liebing Yu <[email protected]> wrote:
> > > >
> > > > > Hi Yuxia, thanks for the thoughtful response. Let me go through your questions one by one.
> > > > >
> > > > > 1. I think after we support `remote.data.dirs`, different schemes will be supported naturally.
> > > > > 2. Yes, I think we should change from `PbTablePath` to `PbPhysicalTablePath`.
> > > > > 3. Thanks for the reminder. I'll PoC authentication in https://github.com/apache/fluss/issues/2518, but it doesn't block the multiple-paths implementation in the Fluss server in https://github.com/apache/fluss/issues/2517.
> > > > > 4. For a partitioned table, the table itself has a remote data dir for metadata (such as the lake offset), and each partition has its own remote dir for table data (e.g. kv or log data).
> > > > > 5. Legacy clients can access data in the new cluster.
> > > > >    - If the permissions of the paths specified in `remote.data.dirs` on the new cluster match those configured in `remote.data.dir`, seamless access is achievable.
> > > > >    - If the permissions are inconsistent, access permissions must be explicitly configured. For example, when using OSS, a policy granting access permissions to the account identified by `fs.oss.roleArn` must be configured for each bucket specified in `remote.data.dirs`.
> > > > >
> > > > > Best regards,
> > > > > Liebing Yu
> > > > >
> > > > >
> > > > > On Thu, 29 Jan 2026 at 10:07, Yuxia Luo <[email protected]> wrote:
> > > > >
> > > > > > Hi, Liebing
> > > > > >
> > > > > > Thanks for the detailed FIP. I have a few questions:
> > > > > > 1. Does `remote.data.dirs` support paths with different schemes? For example:
> > > > > > ```
> > > > > > remote.data.dirs: oss://bucket1/fluss-data, s3://bucket2/fluss-data
> > > > > > ```
> > > > > >
> > > > > > 2. Should `GetFileSystemSecurityTokenRequest` include the partition? The FIP adds `table_path` to the request, but since different partitions may reside on different remote paths (and require different tokens), should the request also include partition information?
> > > > > >
> > > > > > 3. Just a reminder that `DefaultSecurityTokenManager` will become more complex. This is not a blocker, but worth a PoC to recognize any complexity.
> > > > > >
> > > > > > 4. I want to confirm my understanding: For a partitioned table, does the table itself have a remote dir, AND each partition also has its own remote dir?
> > > > > > Or is it:
> > > > > > - Non-partitioned table → table-level remote dir
> > > > > > - Partitioned table → only partition-level remote dirs (no table-level)?
> > > > > >
> > > > > > 5. Can old clients (without the table path in the token request) still read data from new clusters? One possible solution: for RPCs without table information, the server returns a token for the first dir in `remote.data.dirs`. Or other ways that allow users to configure the cluster to keep compatibility.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 2026/01/21 03:52:29 Zhe Wang wrote:
> > > > > > > Thanks for your response, now it looks good to me.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Zhe Wang
> > > > > > >
> > > > > > > Liebing Yu <[email protected]> wrote on Tue, 20 Jan 2026 at 14:29:
> > > > > > >
> > > > > > > > Hi Zhe, sorry for the late reply.
> > > > > > > >
> > > > > > > > The primary focus of this FIP is not to address read/write issues at the table or partition level, but rather to overcome limitations at the cluster level. Given the current capabilities of object storage, read/write performance for a single table or partition is unlikely to be a bottleneck; however, for a large-scale Fluss cluster, it can easily become one. Therefore, the core objective here is to distribute the cluster-wide read/write traffic across multiple remote storage systems.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Liebing Yu
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, 14 Jan 2026 at 16:07, Zhe Wang <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi Liebing, thanks for the clarification.
> > > > > > > > > > 1. To clarify, the data is currently split by partition level for partitioned tables and by table for non-partitioned tables.
> > > > > > > > > So the main aim of this FIP is improving the speed of reading data from different partitions; write speed may still be limited for a single storage system?
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Zhe Wang
> > > > > > > > >
> > > > > > > > > Liebing Yu <[email protected]> wrote on Tue, 13 Jan 2026 at 19:11:
> > > > > > > > > >
> > > > > > > > > > Hi Zhe, thanks for the questions!
> > > > > > > > > >
> > > > > > > > > > 1.
> > > > > > > > > > To clarify, the data is currently split by partition level for partitioned tables and by table for non-partitioned tables.
> > > > > > > > > >
> > > > > > > > > > 2. Regarding RemoteStorageCleaner, you are absolutely right. Supporting remote.data.dirs there is necessary for a complete cleanup when a table is dropped.
> > > > > > > > > >
> > > > > > > > > > Thanks for pointing that out!
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Liebing Yu
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, 12 Jan 2026 at 17:02, Zhe Wang <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Liebing,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for driving this, I think it's a really useful feature. I have two small questions:
> > > > > > > > > > > 1. What's the scope for splitting data across dirs? I see there's a partitionId in the ZK data, so will the data be split into different directories by partition, or by bucket?
> > > > > > > > > > > 2. Maybe it needs to support remote.data.dirs in RemoteStorageCleaner, so we can delete all remote storage when deleting a table.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Zhe Wang
> > > > > > > > > > >
> > > > > > > > > > > Liebing Yu <[email protected]> wrote on Thu, 8 Jan 2026 at 20:10:
> > > > > > > > > > >
> > > > > > > > > > > > Hi devs,
> > > > > > > > > > > >
> > > > > > > > > > > > I propose initiating a discussion on FIP-25 [1].
> > > > > > > > > > > > Fluss leverages remote storage systems such as Amazon S3, HDFS, and Alibaba Cloud OSS to deliver a cost-efficient, highly available, and fault-tolerant storage solution compared to local disk. *However, in production environments, we often find that the bandwidth of a single remote storage becomes a bottleneck.* Taking OSS [2] as an example, the typical upload bandwidth limit for a single account is 20 Gbit/s (internal) and 10 Gbit/s (public). So I initiated this FIP, which aims to introduce support for multiple remote storage paths and enable the dynamic addition of new storage paths without service interruption.
> > > > > > > > > > > >
> > > > > > > > > > > > Any feedback and suggestions on this proposal are welcome!
> > > > > > > > > > > >
> > > > > > > > > > > > [1] https://cwiki.apache.org/confluence/display/FLUSS/FIP-25%3A+Support+Multi-Location+for+Remote+Storage
> > > > > > > > > > > > [2] https://www.alibabacloud.com/help/en/oss/user-guide/limits?spm=a2c63.l28256.help-menu-31815.d_0_0_5.2ac34d06oZYFvK
> > > > > > > > > > > >
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Liebing Yu
> > > >
> > > > --
> > > > Lorenzo Affetti
> > > > Senior Software Engineer @ Flink Team
> > > > Ververica <http://www.ververica.com>
