Hi Liebing,

Thank you for the proposal. I believe this is an excellent initiative to improve throughput for large-scale clusters utilizing remote storage.
The current design implements multi-location support at the table or partition level, meaning only new tables and partitions will utilize the new remote locations. Consequently, even after upgrading the cluster to support multiple paths, data distribution will remain concentrated in a single location for an extended period, failing to achieve rapid traffic fan-out. In contrast, industry solutions like Paimon support "data-file.external-paths" [1] to distribute new data files across multiple paths, and Lance has recently introduced a file-level multi-base layout [2]. Ultimately, we need file-level multi-location support (I believe this would resolve most of the concerns raised by Yang Guo below). However, I am fine with supporting partition-level multi-location as an initial phase, provided we have a clear roadmap toward the final solution.

Regarding the design details of this FIP, I have the following comments:

1. FileSystem#obtainSecurityToken(FsPath f)

We should not add the FsPath parameter to the obtainSecurityToken interface for now, because in the current design this interface retrieves the security token for the entire filesystem rather than for a specific path. Since a filesystem is defined per authority, the authority does not need to be derived from an FsPath.

In fact, we plan to refactor the FileSystem soon. That refactoring will add the FsPath parameter to obtainSecurityToken, ensuring the returned token is strictly scoped to that specific path. This change aims to address the current permission leakage issue, where a token requested for reading one table inadvertently grants access to all remote files of other tables.

2. GetFileSystemSecurityTokenRequest/Response and client token management

Current issue: the FIP proposes maintaining a SecurityTokenManager per LogScanner. However, since tokens are shared at filesystem granularity, tokens for the same FsKey across different tables should be consolidated.
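To make the consolidation concrete, here is a minimal sketch (Python for brevity; all class and method names are illustrative, not Fluss's actual API) of a token cache keyed by FsKey and shared at the connection level, so that scanners for different tables on the same filesystem reuse one token rather than overwriting each other's:

```python
import time

class SecurityToken:
    """A token scoped to one filesystem (identified by its FsKey)."""
    def __init__(self, fs_key, value, expires_at):
        self.fs_key = fs_key
        self.value = value
        self.expires_at = expires_at

class ConnectionTokenManager:
    """One instance per connection, shared by all scanners (hypothetical name)."""
    def __init__(self, fetch_tokens):
        # fetch_tokens stands in for a single RPC that returns
        # tokens for every FsKey configured in the cluster.
        self._fetch_tokens = fetch_tokens
        self._cache = {}  # fs_key -> SecurityToken

    def token_for(self, fs_key):
        tok = self._cache.get(fs_key)
        if tok is None or tok.expires_at <= time.time():
            # One refresh repopulates tokens for all FsKeys at once.
            for t in self._fetch_tokens():
                self._cache[t.fs_key] = t
            tok = self._cache[fs_key]
        return tok

calls = []
def fake_rpc():
    calls.append(1)
    now = time.time()
    return [SecurityToken("oss://bucket1", "tok-1", now + 3600),
            SecurityToken("s3://bucket2", "tok-2", now + 3600)]

mgr = ConnectionTokenManager(fake_rpc)
assert mgr.token_for("oss://bucket1").value == "tok-1"
# A scanner for a different table on another filesystem is served
# from the same cache; the first refresh already fetched its token.
assert mgr.token_for("s3://bucket2").value == "tok-2"
assert len(calls) == 1  # only one refresh RPC was needed
```

The sketch assumes a single refresh RPC returns tokens for every configured FsKey; with a per-LogScanner manager instead, two scanners hitting the same FsKey would each keep (and overwrite) their own copy.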
Therefore, the DefaultSecurityTokenManager must be maintained within the FlussConnection; otherwise, the SecurityTokenManagers for different tables will overwrite each other's tokens.

Recommendation: a straightforward approach is to leave GetFileSystemSecurityTokenRequest unchanged while modifying GetFileSystemSecurityTokenResponse to return a list of tokens. The server side would then return STS tokens for each FsKey configured in the cluster, and the client-side FileSystem would retrieve the corresponding STS token based on its FsKey. This avoids any changes to the LogScanner logic. While this approach retains the existing permission leakage issue, that problem is already present today; we can address it in a separate, dedicated FIP to keep the scope and implementation of the current proposal simple.

Best,
Jark

[1] https://paimon.apache.org/docs/1.3/maintenance/configurations/
[2] https://lancedb.com/blog/rethinking-table-file-paths-lance-multi-base-layout/

On Sat, 28 Feb 2026 at 20:37, Yang Guo <[email protected]> wrote:
>
> Hi Liebing and all,
>
> This is a good FIP to resolve bottlenecks in the remote storage. Thanks for
> your effort. The design looks good to me and the above discussion has
> covered some concerns in my mind.
>
> Now there are some further considerations I'm thinking of:
>
> 1. What happens if a path goes down?
> Right now, there’s no automatic failover. If one S3 bucket (or HDFS path)
> dies, every table or partition assigned to it just fails. Could we add
> simple health checks? If a path looks dead, the remote dir selector
> temporarily skips it until it’s back up.
>
> 2. New paths don't always help old data.
> The routing only happens when a new table or new partition is created. And
> it depends on the partition strategy.
> - If the table is using time-based partitions (e.g., daily), adding new
> paths works well because new data goes to new partitions on new paths.
> - But for non-partitioned tables, or if it keeps writing to old partitions,
> the new paths sit idle. The traffic never shifts over.
> It requires developers to think further about partition strategy and input
> data when adding remote dirs.
>
> 3. Managing "weights" is tricky manually for developers/maintainers.
> Since the weighted round-robin is static:
> - Developers/Maintainers have to determine the right weights based on
> current traffic.
> - If you skew weights to favor a path, you have to remember to
> rebalance them later, or that path gets overloaded forever. E.g. if two
> paths are weighted [1, 2] in the beginning to rebalance the higher traffic
> in the first path, developers/maintainers should remember to change the
> weights back to [1, 1] after the traffic is balanced between the two paths.
> Otherwise the traffic in the second path will keep growing.
> - Also, setting a weight to 0 behaves differently depending on your
> partition type (time-based paths eventually go quiet, but field-based ones
> like "country=US" keep writing there forever).
> Instead of manual tuning, could we eventually make this dynamic? Let the
> system adjust weights based on real-time latency or throttling metrics.
>
> The points above are about future operational considerations—regarding
> failover and maintenance after this solution is deployed. I think they
> won't block this FIP. We may not need to fix these right now. Just bringing
> them into this discussion.
>
> Regards,
> Yang Guo
>
> On Fri, Feb 27, 2026 at 5:53 PM Liebing Yu <[email protected]> wrote:
>
> > Hi Lorenzo, sorry for the late reply.
> >
> > Thanks for the AWS example! This further solidifies the case for multi-path
> > support.
> >
> > Regarding your question about multi-cloud support:
> > Our current design naturally supports multi-cloud object storage systems.
> > Since the implementation is built upon a multi-schema filesystem
> > abstraction (supporting schemes like s3://, oss://, abfs://, etc.), the
> > system is inherently "cloud-agnostic."
> >
> > Best regards,
> > Liebing Yu
> >
> > On Wed, 4 Feb 2026 at 23:37, Lorenzo Affetti via dev <[email protected]> wrote:
> >
> > > This is quite an interesting FIP and I think it is a significant
> > > enhancement, especially for large-scale clusters.
> > >
> > > I think you can also add the AWS case in your motivation:
> > >
> > > https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-design-patterns.html#optimizing-performance-high-request-rate
> > >
> > > AWS automatically scales if requests exceed 5,500 per second for the same
> > > prefix, which results in transient 503 errors.
> > > Your approach would eliminate this problem by providing another bucket.
> > >
> > > I was wondering if it might also provide the possibility of configuring the
> > > same Fluss cluster for multi-cloud object storage systems.
> > > From a design perspective, nothing should prevent me from storing remote
> > > data on both Azure and AWS at the same time, probably resulting in
> > > different performance numbers for different partitions/tables.
> > > Should the design force the use of only 1 filesystem implementation?
> > >
> > > Thank you again!
> > >
> > > On Fri, Jan 30, 2026 at 7:59 AM Liebing Yu <[email protected]> wrote:
> > >
> > > > Hi Yuxia, thanks for the thoughtful response. Let me go through your
> > > > questions one by one.
> > > >
> > > > 1. I think after we support `remote.data.dirs`, different schemes will be
> > > > supported naturally.
> > > > 2. Yes, I think we should change from `PbTablePath` to
> > > > `PbPhysicalTablePath`.
> > > > 3. Thanks for the reminder. I'll PoC authentication in
> > > > https://github.com/apache/fluss/issues/2518.
> > > > But it doesn't block the
> > > > multiple-paths implementation in the Fluss server in
> > > > https://github.com/apache/fluss/issues/2517.
> > > > 4. For a partitioned table, the table itself has a remote data dir for
> > > > metadata (such as lake offset). And each partition has its own remote dir
> > > > for table data (e.g. kv or log data).
> > > > 5. Legacy clients can access data in the new cluster.
> > > >
> > > > - If the permissions of the paths specified in `remote.data.dirs` on the
> > > > new cluster match those configured in `remote.data.dir`, seamless
> > > > access is achievable.
> > > > - If the permissions are inconsistent, access permissions must be
> > > > explicitly configured. For example, when using OSS, a policy granting
> > > > access permissions to the account identified by `fs.oss.roleArn` must be
> > > > configured for each bucket specified in `remote.data.dirs`.
> > > >
> > > > Best regards,
> > > > Liebing Yu
> > > >
> > > > On Thu, 29 Jan 2026 at 10:07, Yuxia Luo <[email protected]> wrote:
> > > >
> > > > > Hi, Liebing
> > > > >
> > > > > Thanks for the detailed FIP. I have a few questions:
> > > > > 1. Does `remote.data.dirs` support paths with different schemes? For
> > > > > example:
> > > > > ```
> > > > > remote.data.dirs: oss://bucket1/fluss-data, s3://bucket2/fluss-data
> > > > > ```
> > > > >
> > > > > 2. Should `GetFileSystemSecurityTokenRequest` include partition?
> > > > > The FIP adds `table_path` to the request, but since different partitions
> > > > > may reside on different remote paths (and require different tokens),
> > > > > should the request also include partition information?
> > > > >
> > > > > 3. Just a reminder that `DefaultSecurityTokenManager` will become more
> > > > > complex...
> > > > > This is not a blocker, but worth a PoC to recognize any complexity.
> > > > >
> > > > > 4. I want to confirm my understanding: For a partitioned table, does the
> > > > > table itself have a remote dir, AND each partition also has its own
> > > > > remote dir?
> > > > >
> > > > > Or is it:
> > > > > - Non-partitioned table → table-level remote dir
> > > > > - Partitioned table → only partition-level remote dirs (no table-level)?
> > > > >
> > > > > 5. Can old clients (without table path in the token request) still read
> > > > > data from new clusters?
> > > > > One possible solution is: for RPCs without table information, the server
> > > > > returns a token for the first dir in `remote.data.dirs`. Or other ways
> > > > > that allow users to configure the cluster to keep compatibility.
> > > > >
> > > > > On 2026/01/21 03:52:29 Zhe Wang wrote:
> > > > > > Thanks for your response, now it looks good to me.
> > > > > >
> > > > > > Best regards,
> > > > > > Zhe Wang
> > > > > >
> > > > > > Liebing Yu <[email protected]> wrote on Tue, 20 Jan 2026 at 14:29:
> > > > > >
> > > > > > > Hi Zhe, sorry for the late reply.
> > > > > > >
> > > > > > > The primary focus of this FIP is not to address read/write issues at
> > > > > > > the table or partition level, but rather to overcome limitations at
> > > > > > > the cluster level. Given the current capabilities of object storage,
> > > > > > > read/write performance for a single table or partition is unlikely to
> > > > > > > be a bottleneck; however, for a large-scale Fluss cluster, it can
> > > > > > > easily become one. Therefore, the core objective here is to
> > > > > > > distribute the cluster-wide read/write traffic across multiple
> > > > > > > remote storage systems.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Liebing Yu
> > > > > > >
> > > > > > > On Wed, 14 Jan 2026 at 16:07, Zhe Wang <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi Liebing, Thanks for the clarification.
> > > > > > > > >1. To clarify, the data is currently split by partition level for
> > > > > > > > > partitioned tables and by table for non-partitioned tables.
> > > > > > > >
> > > > > > > > Therefore the main aim of this FIP is improving the speed of reading
> > > > > > > > data from different partitions; storage write speed may still be
> > > > > > > > limited for a single system?
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Zhe Wang
> > > > > > > >
> > > > > > > > Liebing Yu <[email protected]> wrote on Tue, 13 Jan 2026 at 19:11:
> > > > > > > >
> > > > > > > > > Hi Zhe, Thanks for the questions!
> > > > > > > > >
> > > > > > > > > 1. To clarify, the data is currently split by partition level for
> > > > > > > > > partitioned tables and by table for non-partitioned tables.
> > > > > > > > >
> > > > > > > > > 2. Regarding RemoteStorageCleaner, you are absolutely right.
> > > > > > > > > Supporting remote.data.dirs there is necessary for a complete
> > > > > > > > > cleanup when a table is dropped.
> > > > > > > > >
> > > > > > > > > Thanks for pointing that out!
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Liebing Yu
> > > > > > > > >
> > > > > > > > > On Mon, 12 Jan 2026 at 17:02, Zhe Wang <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Liebing,
> > > > > > > > > >
> > > > > > > > > > Thanks for driving this, I think it's a really useful feature.
> > > > > > > > > > I have two small questions:
> > > > > > > > > > 1. What's the scope for splitting data into dirs? I see there's a
> > > > > > > > > > partitionId in the ZK data, so will the data be split by partition
> > > > > > > > > > into different directories, or by bucket?
> > > > > > > > > > 2. Maybe it needs to support remote.data.dirs in RemoteStorageCleaner?
> > > > > > > > > > So we can delete all remoteStorage when a table is deleted.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Zhe Wang
> > > > > > > > > >
> > > > > > > > > > Liebing Yu <[email protected]> wrote on Thu, 8 Jan 2026 at 20:10:
> > > > > > > > > >
> > > > > > > > > > > Hi devs,
> > > > > > > > > > >
> > > > > > > > > > > I propose initiating discussion on FIP-25[1]. Fluss leverages remote
> > > > > > > > > > > storage systems—such as Amazon S3, HDFS, and Alibaba Cloud OSS—to
> > > > > > > > > > > deliver a cost-efficient, highly available, and fault-tolerant
> > > > > > > > > > > storage solution compared to local disk. *However, in production
> > > > > > > > > > > environments, we often find that the bandwidth of a single remote
> > > > > > > > > > > storage becomes a bottleneck. *Taking OSS[2] as an example, the
> > > > > > > > > > > typical upload bandwidth limit for a single account is 20 Gbit/s
> > > > > > > > > > > (Internal) and 10 Gbit/s (Public). So I initiated this FIP, which
> > > > > > > > > > > aims to introduce support for multiple remote storage paths and
> > > > > > > > > > > enables the dynamic addition of new storage paths without service
> > > > > > > > > > > interruption.
> > > > > > > > > > >
> > > > > > > > > > > Any feedback and suggestions on this proposal are welcome!
> > > > > > > > > > > [1]
> > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLUSS/FIP-25%3A+Support+Multi-Location+for+Remote+Storage
> > > > > > > > > > > [2]
> > > > > > > > > > > https://www.alibabacloud.com/help/en/oss/user-guide/limits?spm=a2c63.l28256.help-menu-31815.d_0_0_5.2ac34d06oZYFvK
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Liebing Yu
> > >
> > > --
> > > Lorenzo Affetti
> > > Senior Software Engineer @ Flink Team
> > > Ververica <http://www.ververica.com>
