Hi dev,

As we know, we expect to cut the Iceberg 0.10.0 release candidate this
week. I think it is time to start planning for Iceberg 0.11.0, so I
created a Java 0.11.0 Release milestone here [1].

I put the following issues into the newly created milestone:

1. Apache Flink rewrite actions in Apache Iceberg.

We are likely to run into the small-files problem when running the
Iceberg Flink sink in real production, because of the frequent
checkpoints. We have two approaches to handle the small files:

a. Following the design of the current Spark rewrite actions, Flink will
provide similar rewrite actions that run as a batch job. This approach is
suitable for periodically compacting a whole table or whole partitions,
because this kind of rewrite compacts many large files and may consume
lots of bandwidth. Currently, JunZheng and I are working on this issue,
and we've extracted the base rewrite action shared by the Spark module
and the Flink module. The next step is to implement the rewrite actions
in the Flink module.

b. Compact the small files inside the Flink streaming job while sinking
into Iceberg tables. That means we will provide a new rewrite operator
chained after the current IcebergFilesCommitter. Once an Iceberg
transaction has been committed, the newly introduced rewrite operator will
check whether a small compaction is needed. These compactions only choose
a few tiny files (maybe several KB or MB; I think we could provide a
configurable threshold) to rewrite, so they can be done at minimal cost
and with higher compaction efficiency. Currently, simonsssu from Tencent
has provided a WIP PR here [2].
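
To make the check in (b) concrete, here is a minimal sketch of how a
rewrite operator might pick its candidates after a commit. It is only an
illustration and is not taken from the WIP PR: SmallFileSelector, FileInfo,
thresholdBytes and minFileCount are all hypothetical names, and a real
operator would work on Iceberg's own data file metadata instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Iceberg or of the WIP PR.
public class SmallFileSelector {

  /** Hypothetical stand-in for a committed data file: a path and a size. */
  public static final class FileInfo {
    final String path;
    final long sizeInBytes;

    public FileInfo(String path, long sizeInBytes) {
      this.path = path;
      this.sizeInBytes = sizeInBytes;
    }
  }

  /**
   * Returns the files strictly smaller than thresholdBytes. If fewer than
   * minFileCount files qualify, returns an empty list, since rewriting a
   * single small file would not reduce the file count.
   */
  public static List<FileInfo> select(
      List<FileInfo> committed, long thresholdBytes, int minFileCount) {
    List<FileInfo> small = new ArrayList<>();
    for (FileInfo file : committed) {
      if (file.sizeInBytes < thresholdBytes) {
        small.add(file);
      }
    }
    if (small.size() < minFileCount) {
      return Collections.emptyList();
    }
    return small;
  }
}
```

With an assumed 8 MB threshold, a 4 KB file and a 2 MB file would be
selected while a 512 MB file would be left alone. Large files are never
touched here, which is what keeps this path cheap compared with the batch
rewrite in (a).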


2. Allow Flink streaming jobs to write CDC or UPSERT records.

We've almost implemented the row-level delete feature in the Iceberg
master branch, but we still lack integration with the compute engines (to
be precise, Spark/Flink can read the expected records if someone has
deleted the rows correctly, but the write path is not available). I am
preparing the patch for sinking CDC into Iceberg from a Flink streaming job
here [3]; I think it will be ready in the next few weeks.

3. Apache Flink streaming reader.

We've prepared a POC version in our internal Alibaba branch, but have not
yet contributed it to Apache Iceberg. I think it's worth getting that done
in the coming days.


The above are the issues that I think are worth merging before Iceberg
0.11.0. But I'm not quite sure what the plan is for the following:

1. I know @Anton Okolnychyi <aokolnyc...@apple.com> is working on
spark-sql extensions for Iceberg. I guess there's a high probability of
getting that in? [4]

2. @Steven Wu <steve...@netflix.com> from Netflix is working on a Flink
source based on the new FLIP-27 interface. Thoughts? [5]

3. How about the Spark row-level delete integration work?



[1]. https://github.com/apache/iceberg/milestone/12
[2]. https://github.com/apache/iceberg/pull/1669/files
[3]. https://github.com/apache/iceberg/pull/1663
[4]. https://github.com/apache/iceberg/milestone/11
[5]. https://github.com/apache/iceberg/issues/1626
