Thanks for the context about FLIP-27, Steven!

I will take a look at the patches under issue 1626.

On Sat, Oct 31, 2020 at 2:03 AM Steven Wu <stevenz...@gmail.com> wrote:

> OpenInx, thanks a lot for kicking off the discussion. Looks like my
> previous reply didn't reach the mailing list.
>
> > flink source based on the new FLIP-27 interface
>
> Yes, we shall target 0.11.0 release for the FLIP-27 flink source. I have
> updated the issue [1] with the following scopes.
>
>    - Support both static/batch and continuous/streaming enumeration modes
>    - Support only the simple assigner with no ordering/locality guarantee
>    when handing out split assignment. But make the interface flexible to plug
>    in different assigners (like the event time alignment assigner or locality
>    aware assigner)
>    - It will have @Experimental status, as nobody has run FLIP-27 sources in
>    production today. The Flink 1.12.0 release (ETA end of Nov) will have the
>    first set of sources (Kafka and file) implemented with the FLIP-27 source
>    framework. We still need to gain more production experience.
>
>
> [1] https://github.com/apache/iceberg/issues/1626
>
> On Wed, Oct 28, 2020 at 12:15 AM OpenInx <open...@gmail.com> wrote:
>
>> Hi dev,
>>
>> As we know, we will hopefully cut the iceberg 0.10.0 release candidate
>> this week.  I think now is the time to plan for iceberg 0.11.0, so I
>> created a Java 0.11.0 Release milestone here [1].
>>
>> I put the following issues into the newly created milestone:
>>
>> 1.   Apache Flink Rewrite Actions in Apache Iceberg.
>>
>> We may run into the small files issue when running the iceberg flink sink
>> in real production because of frequent checkpoints.  We have two
>> approaches to handle the small files:
>>
>> a.  Like the current spark rewrite actions, flink will provide similar
>> rewrite actions that run in a batch job.  This is suitable for
>> periodically triggering compaction of the whole table or of whole
>> partitions, because this kind of rewrite will compact many large files
>> and may consume lots of bandwidth.  Currently, JunZheng and I are working
>> on this issue, and we've extracted the base rewrite actions shared by the
>> spark module and the flink module.  The next step would be implementing
>> the rewrite actions in the flink module.
>>
>> b.  Compact those small files in the flink streaming job when sinking into
>> iceberg tables.  That means we will provide a new rewrite operator chained
>> to the current IcebergFilesCommitter.  Once an iceberg transaction has been
>> committed, the newly introduced rewrite operator will check whether a
>> small compaction is needed.  Those actions only choose a few tiny files
>> (maybe several KB or MB; I think we could provide a configurable
>> threshold) to rewrite, which makes the compaction cheap and efficient.
>> Currently, simonsssu from Tencent has provided a WIP PR here [2].
>>
>>
>> 2.  Allow flink streaming jobs to write CDC or UPSERT records.
>>
>> We've almost implemented the row-level delete feature in the iceberg
>> master branch, but still lack the ability to integrate with compute engines
>> (to be precise, spark/flink can read the expected records if someone
>> has deleted the rows correctly, but the write path is not available).  I am
>> preparing the patch for sinking CDC into iceberg by flink streaming job
>> here [3]; I think it will be ready in the next few weeks.
>>
>> 3.  Apache flink streaming reader.
>>
>> We've prepared a POC version in our alibaba internal branch, but have
>> not contributed it to apache iceberg yet.  I think it's worth accomplishing
>> that in the coming days.
>>
>>
>> The above are the issues that I think are worth merging before iceberg
>> 0.11.0.  But I'm not quite sure what the plan is for the following things:
>>
>> 1.  I know @Anton Okolnychyi <aokolnyc...@apple.com> is working on
>> spark-sql extensions for iceberg; I guess there's a high probability of
>> getting that in?  [4]
>>
>> 2.  @Steven Wu <steve...@netflix.com> from netflix is working on a flink
>> source based on the new FLIP-27 interface.  Thoughts?  [5]
>>
>> 3.  How about the Spark Row-Delete integration work?
>>
>>
>>
>> [1]. https://github.com/apache/iceberg/milestone/12
>> [2]. https://github.com/apache/iceberg/pull/1669/files
>> [3]. https://github.com/apache/iceberg/pull/1663
>> [4]. https://github.com/apache/iceberg/milestone/11
>> [5]. https://github.com/apache/iceberg/issues/1626
>>
>
