Thank you all for the healthy discussion. My view is that all automated maintenance tasks — such as compactions, data expiration, and cleanup — should be supported by the tiering service, but remain configurable.
These features can be enabled by default to support simple end-to-end use cases, without requiring users to rely on an external service. However, they should also be optional. In many cases, users may already have, or prefer to use, a dedicated maintenance service for their lakehouse tables. Such specialized services are often more capable and can manage maintenance tasks across all lake tables, not only the Fluss tiered tables.

Besides, +1 for the design proposal. I think we can kick off a vote now.

Best,
Jark

On Tue, 8 Jul 2025 at 15:24, Mehul Batra <[email protected]> wrote:

> Hi Yuxia,
>
> Thanks for the clarification.
>
> It's good to know that compaction and snapshot expiration will be addressed in the initial version, and I completely understand your point on the complexity of orphan file cleanup. I agree it's better to avoid over-design at this stage and evolve as needed based on future usage patterns and feedback.
>
> Also, thanks for confirming that snapshot expiration will be triggered explicitly via the LakeCommitter. That clears things up for me.
>
> Looking forward to working on this with you and the community, and to seeing how this evolves!
>
> Best regards,
> Mehul
>
> On Tue, Jul 8, 2025 at 7:16 AM yuxia <[email protected]> wrote:
>
> > Hi, Mehul
> >
> > "Tableflow automates table maintenance by compacting and cleaning up small files generated by continuous streaming data in object storage."
> >
> > It seems Tableflow supports compaction, which is covered in this FIP. I haven't seen orphan file cleanup in Tableflow.
> > Orphan file cleanup is not straightforward and is somewhat complex: it requires listing all files and comparing them with the files in the Iceberg manifests to find the orphan files.
> > I still prefer not to introduce that complexity in the first version of Iceberg support, since it would lead to over-design. Let's see what happens in the future.
> >
> > As for snapshot expiration.
> > Yes, the LakeCommitter should trigger the snapshot expiration action explicitly. It's a lightweight operation.
> >
> > Best regards,
> > Yuxia
> >
> > ----- Original Message -----
> > From: "Mehul Batra" <[email protected]>
> > To: "dev" <[email protected]>
> > Sent: Tuesday, July 8, 2025 4:26:19 AM
> > Subject: [SPAM]Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg
> >
> > Hi Yuxia, Cheng,
> >
> > Thank you both for the insights.
> >
> > From a user’s perspective, I believe our goal should be to abstract away as much operational complexity as possible. For example, Tableflow handles both data writing and maintenance seamlessly for the user, which avoids the burden of running separate processes.
> >
> > https://docs.confluent.io/cloud/current/topics/tableflow/overview.html#table-maintenance-and-optimizations
> >
> > In the Fluss integration, if users are expected to run a separate maintenance job (e.g., for snapshot expiration or orphan file cleanup), there's a real risk of job overlap and failure, especially due to optimistic concurrency conflicts when both jobs (tiering and maintenance) try to commit around the same time.
> >
> > Yuxia, you mentioned that the LakeCommitter will respect the history.expire.max-snapshot-age-ms property (similar to Paimon). I just wanted to clarify: while the property sets the retention policy, we still need to trigger the snapshot expiration action explicitly. Do we envision Fluss's tiering job playing that role?
> >
> > If so, that would be a great win. It could help automate snapshot expiration and indirectly clean up orphan files, making things much smoother for users.
> >
> > Please correct me if I’ve misunderstood anything.
> >
> > Best regards,
> > Mehul
> >
> > On Mon, Jul 7, 2025 at 11:32 AM Wang Cheng <[email protected]> wrote:
> >
> > > Hi Mehul,
> > >
> > > I agree with Yuxia's point.
> > > We should leave table maintenance work such as expiring snapshots and deleting orphan files to Iceberg users rather than relying on the Fluss tiering job.
> > >
> > > Regards,
> > > Cheng
> > >
> > > ------------------ Original ------------------
> > > From: "dev" <[email protected]>;
> > > Date: Sat, Jul 5, 2025 11:38 PM
> > > To: "dev"<[email protected]>;
> > > Subject: Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg
> > >
> > > Hi Yuxia,
> > > Great, that sounds good to me and will help users get better read latency.
> > > How about snapshot expiration (to regulate metadata) and removing orphan files (files that are no longer referenced, or dangling files from failed tasks)?
> > > Are we planning to introduce them as part of the automated maintenance provided by the Fluss cluster?
> > > Warm regards,
> > > Mehul Batra
> > >
> > > On Fri, Jul 4, 2025 at 5:02 PM yuxia <[email protected]> wrote:
> > >
> > > > Hi, Mehul.
> > > > Thanks for your attention. I think we don't need to introduce an extra post-commit hook to manage small files. In the design, all files that belong to the same bucket (in Iceberg, the same partition) are distributed to the same task to write, so that task can then compact the small files for the partition.
> > > > As this FIP says, when creating the IcebergLakeWriter in one round of tiering, the writer can scan the manifest to learn which files are in this bucket; if it finds that compaction is possible, it can compact these files while writing new files. We have similar logic for tiering to Paimon.
> > > >
> > > > Best regards,
> > > > Yuxia
> > > >
> > > > ----- Original Message -----
> > > > From: "Mehul Batra" <[email protected]>
> > > > To: "dev" <[email protected]>
> > > > Sent: Thursday, July 3, 2025 5:04:18 PM
> > > > Subject: Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg
> > > >
> > > > +1 This will help us address the missing table format and provide better ecosystem interoperability. Iceberg's growing adoption in the data lakehouse space makes this a valuable addition to Fluss's tiering capabilities.
> > > > Are there any plans to integrate maintenance services into tiering itself, as a post-commit hook to manage small files?
> > > > Warm regards,
> > > > Mehul Batra
> > > >
> > > > On Thu, Jul 3, 2025 at 2:24 PM yuxia <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Fluss currently supports tiering data to Apache Paimon, enabling cost-effective storage management for warm/cold data. However, the lack of native Iceberg tiering support limits flexibility and ecosystem integration for users who rely on Iceberg’s open table format.
> > > > >
> > > > > To address this gap, I’d like to propose FIP-3: Support Tiering Fluss Data to Iceberg [1], which aims to integrate Iceberg into Fluss’s tiering capabilities.
> > > > >
> > > > > I welcome your feedback and suggestions on this proposal. Looking forward to a productive discussion!
> > > > >
> > > > > [1]: https://cwiki.apache.org/confluence/display/FLUSS/FIP-3%3A+Support+tiering+Fluss+data+to+Iceberg
> > > > >
> > > > > Best regards,
> > > > > Yuxia
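
For readers following the orphan file discussion in this thread: the approach Yuxia describes (list all files, then compare them with the files referenced in the Iceberg manifests) amounts to a set difference. The Python below is only a rough sketch of that idea, not part of FIP-3 and not the Iceberg or Fluss API. The function name `find_orphan_files`, its inputs, and the 3-day minimum age are all assumptions made for illustration; the age guard is there because a recently written file may belong to a commit that has not landed yet.

```python
from datetime import datetime, timedelta, timezone


def find_orphan_files(listed_files, referenced_files, now, min_age=timedelta(days=3)):
    """Return paths present in storage but not referenced by table metadata.

    listed_files: dict of path -> last-modified time (hypothetical input,
        e.g. the result of listing the table location in object storage).
    referenced_files: set of paths reachable from the table's manifests
        (hypothetical input).
    min_age: files newer than this are skipped, because they may belong to
        a write that has not committed yet (the default is an assumption).
    """
    cutoff = now - min_age
    return sorted(
        path
        for path, modified in listed_files.items()
        if path not in referenced_files and modified <= cutoff
    )


# Illustrative data only; the paths and timestamps are made up.
now = datetime(2025, 7, 8, tzinfo=timezone.utc)
listed = {
    "data/a.parquet": now - timedelta(days=10),   # referenced, kept
    "data/b.parquet": now - timedelta(days=10),   # unreferenced and old
    "data/c.parquet": now - timedelta(hours=1),   # unreferenced but too new
}
referenced = {"data/a.parquet"}

print(find_orphan_files(listed, referenced, now))  # ['data/b.parquet']
```

The expensive part in practice is producing the two inputs (a full listing of the table location and a full scan of the manifests), which is the complexity the thread suggests deferring beyond the first version.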
