Re: [SPAM]Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg

yuxia Wed, 09 Jul 2025 23:17:40 -0700

Thanks for Jark's comments. I agree with Jark that users should be able to
disable automated maintenance while tiering to the lake. I have introduced
a new boolean option `table.datalake.auto-maintenance` (true by default) to
configure, per table, whether the lake tiering service performs maintenance
tasks. The FIP has been updated.
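
As a rough illustration only (this is not code from the FIP), a tiering service could read the per-table switch along these lines. The option name `table.datalake.auto-maintenance` and its default of true come from this message; the function name, the dictionary of table properties, and the string parsing are assumptions made for the sketch.

```python
# Hypothetical sketch: decide whether the tiering service should run
# maintenance tasks (e.g. compaction, snapshot expiration) for one table.
AUTO_MAINTENANCE_KEY = "table.datalake.auto-maintenance"  # option name from this thread
AUTO_MAINTENANCE_DEFAULT = True  # "true by default", per this thread


def auto_maintenance_enabled(table_properties: dict) -> bool:
    """Return the per-table setting, falling back to the default when unset."""
    raw = table_properties.get(AUTO_MAINTENANCE_KEY)
    if raw is None:
        return AUTO_MAINTENANCE_DEFAULT
    value = str(raw).strip().lower()
    if value not in ("true", "false"):
        raise ValueError(
            f"{AUTO_MAINTENANCE_KEY} must be 'true' or 'false', got {raw!r}"
        )
    return value == "true"
```

With this shape, a table that never sets the option keeps maintenance on, while a table that sets it to `false` leaves maintenance to an external service.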


If there are no other comments, I'll start the vote later today.

Best regards,
Yuxia

----- Original Message -----
From: "Jark Wu" <[email protected]>
To: "dev" <[email protected]>
Sent: Thursday, July 10, 2025 10:50:55 AM
Subject: Re: [SPAM]Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg

Thank you all for the healthy discussion.

My view is that all automated maintenance tasks — such as compactions, data
expiration, and cleanup — should be supported by the tiering service, but
remain configurable.

These features can be enabled by default to support simple end-to-end use
cases, without requiring users to rely on an external service.

However, they should also be optional. In many cases, users may already
have or prefer to use a dedicated maintenance service for their lakehouse
tables. Such specialized services are often more capable and can manage
maintenance tasks across all lake tables, not only the Fluss-tiered tables.


Besides, +1 for the design proposal. I think we can kick off a vote now.

Best,
Jark

On Tue, 8 Jul 2025 at 15:24, Mehul Batra <[email protected]> wrote:

> Hi Yuxia,
>
> Thanks for the clarification.
>
> It's good to know that compaction/expiring-snapshots will be addressed in
> the initial version, and I completely understand your point on the
> complexity of orphan file cleanup. I agree it's better to avoid over-design
> at this stage and evolve as needed based on future usage patterns and
> feedback.
>
> Also, thanks for confirming that snapshot expiration will be triggered
> explicitly via the LakeCommitter. That clears things up for me.
>
> Looking forward to working on this with you and the community, and to
> seeing how this evolves!
>
> Best regards,
> Mehul
> On Tue, Jul 8, 2025 at 7:16 AM yuxia <[email protected]> wrote:
>
> > Hi, Mehul
> >
> > "Tableflow automates table maintenance by compacting and cleaning up small
> > files generated by continuous streaming data in object storage."
> > It seems Tableflow supports compaction, which is covered in this FIP.
> > I haven't seen orphan file cleanup in Tableflow.
> > Orphan file cleanup is not straightforward and is somewhat complex: it
> > requires listing all files and comparing them with the files in the Iceberg
> > manifests to find the orphan files.
> > I still prefer not to introduce that complexity in the first version of
> > Iceberg support, as it would lead to over-design. Let's see what happens
> > in the future.
> >
> > As for snapshot expiration: yes, LakeCommitter should trigger the snapshot
> > expiration action explicitly. It's a lightweight operation.
> >
> > Best regards,
> > Yuxia
> >
> > ----- Original Message -----
> > From: "Mehul Batra" <[email protected]>
> > To: "dev" <[email protected]>
> > Sent: Tuesday, July 8, 2025 4:26:19 AM
> > Subject: [SPAM]Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg
> >
> > Hi Yuxia, Cheng,
> >
> > Thank you both for the insights.
> >
> > From a user’s perspective, I believe our goal should be to abstract away as
> > much operational complexity as possible. For example, TableFlow handles
> > both data writing and maintenance seamlessly for the user, which avoids the
> > burden of running separate processes.
> >
> >
> > https://docs.confluent.io/cloud/current/topics/tableflow/overview.html#table-maintenance-and-optimizations
> >
> > In the Fluss integration, if users are expected to run a separate
> > maintenance job (e.g., for snapshot expiration or orphan file cleanup),
> > there's a real risk of job overlap and failure, especially due to
> > optimistic concurrency conflicts when both jobs (tiering and maintenance)
> > try to commit around the same time.
> >
> > Yuxia, you mentioned that the LakeCommitter will respect the
> > history.expire.max-snapshot-age-ms property (similar to Paimon). I just
> > wanted to clarify: while the property sets the retention policy, we still
> > need to trigger the snapshot expiration action explicitly. Do we envision
> > Fluss's tiering job playing that role?
> >
> > If so, that would be a great win: it could help automate snapshot
> > expiration and indirectly clean up orphan files, making things much
> > smoother for users.
> >
> > Please correct me if I’ve misunderstood anything.
> >
> > Best regards,
> > Mehul
> >
> > On Mon, Jul 7, 2025 at 11:32 AM Wang Cheng <[email protected]>
> > wrote:
> >
> > > Hi Mehul,
> > >
> > >
> > > I agree with Yuxia's point. We should leave table maintenance work
> > > such as expiring snapshots and deleting orphan files to Iceberg users
> > > rather than relying on the Fluss tiering job.
> > >
> > > Regards,
> > > Cheng
> > >
> > > ------------------ Original ------------------
> > > From: "dev" <[email protected]>;
> > > Date: Sat, Jul 5, 2025 11:38 PM
> > > To: "dev" <[email protected]>;
> > > Subject: Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg
> > >
> > >
> > >
> > > Hi Yuxia,
> > > Great, that sounds good to me and will help users get better read
> > > latency.
> > > What about snapshot expiration (to regulate metadata) and removing
> > > orphan files (files that are no longer referenced, or dangling files
> > > left by failed tasks)?
> > > Are we planning to introduce them as part of the automated maintenance
> > > provided by the Fluss cluster?
> > > Warm regards,
> > > Mehul Batra
> > >
> > > On Fri, Jul 4, 2025 at 5:02 PM yuxia <[email protected]> wrote:
> > >
> > > > Hi, Mehul.
> > > > Thanks for your attention. I think we don't need to introduce an extra
> > > > post-commit hook to manage small files. In the design, all files that
> > > > belong to the same bucket (in Iceberg, the same partition) are
> > > > distributed to the same task for writing, so that task can then compact
> > > > these small files for the partition.
> > > > As the FIP says, when the IcebergLakeWriter is created in a round of
> > > > tiering, the writer can scan the manifest to learn which files are in
> > > > this bucket; if it finds that compaction is possible, it can compact
> > > > these files while writing new files. We have similar logic for tiering
> > > > to Paimon.
> > > >
> > > > Best regards,
> > > > Yuxia
> > > >
> > > > ----- Original Message -----
> > > > From: "Mehul Batra" <[email protected]>
> > > > To: "dev" <[email protected]>
> > > > Sent: Thursday, July 3, 2025 5:04:18 PM
> > > > Subject: Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg
> > > >
> > > > +1 This will help us address the missing table format and provide
> > > > better ecosystem interoperability. Iceberg's growing adoption in the
> > > > data lakehouse space makes this a valuable addition to Fluss's tiering
> > > > capabilities.
> > > > Are there any plans to integrate maintenance services into tiering
> > > > itself, as a post-commit hook to manage small files?
> > > > Warm regards,
> > > > Mehul Batra
> > > >
> > > > On Thu, Jul 3, 2025 at 2:24 PM yuxia <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Fluss currently supports tiering data to Apache Paimon, enabling
> > > > > cost-effective storage management for warm/cold data. However, the
> > > > > lack of native Iceberg tiering support limits flexibility and
> > > > > ecosystem integration for users who rely on Iceberg’s open table
> > > > > format.
> > > > >
> > > > > To address this gap, I’d like to propose FIP-3: Support Tiering Fluss
> > > > > Data to Iceberg [1], which aims to integrate Iceberg into Fluss’s
> > > > > tiering capabilities.
> > > > >
> > > > > I welcome your feedback and suggestions on this proposal. Looking
> > > > > forward to a productive discussion!
> > > > >
> > > > > [1]:
> > > > > https://cwiki.apache.org/confluence/display/FLUSS/FIP-3%3A+Support+tiering+Fluss+data+to+Iceberg
> > > > >
> > > > > Best regards,
> > > > > Yuxia
> > > > >
> > > >
> > >
> >
>
