Hi everyone,

Thanks for joining the discussion in the sync-up last Friday. We reached
consensus on several items:

1. Snapshot-granularity CDC generation is useful and will cover a wide
   range of use cases. Sub-snapshot granularity is out of scope for now
   and needs a separate proposal.

2. For COW, we should treat all rows from the deleted data files as
   deleted rows. This is more efficient and, more importantly, it does
   not yield wrong results when duplicate rows exist (see the sketch
   right after this list).

3. Create a minimum viable product (MVP) according to the current
   design.
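
To make item 2 concrete, here is a minimal spark-shell-style sketch (toy
data, not Iceberg code) of why diffing the old and new data files on all
columns breaks as soon as the same row appears more than once, while the
"all deleted-file rows are deletes, all added-file rows are inserts" rule
stays correct:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // A COW commit rewrote one data file: the old file held one copy of the
    // row, the new file holds two copies, so the table actually gained a row.
    val deletedFileRows = Seq((1, "Amy", 20)).toDF("id", "name", "age")
    val addedFileRows   = Seq((1, "Amy", 20), (1, "Amy", 20)).toDF("id", "name", "age")

    // "Filter out unchanged rows" by anti-joining the files on all columns:
    // every copy matches, so no CDC record is emitted at all, which is wrong.
    addedFileRows.join(deletedFileRows, Seq("id", "name", "age"), "left_anti").count()  // 0 inserts
    deletedFileRows.join(addedFileRows, Seq("id", "name", "age"), "left_anti").count()  // 0 deletes

    // The agreed rule instead emits 1 DELETE (the row of the deleted file) and
    // 2 INSERTs (the rows of the added file), so the net +1 row is preserved.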


Thanks Anton for the comments in
https://github.com/apache/iceberg/issues/3941#issuecomment-1061153554.


Based on the sync-up and Anton's comment, here is the plan to move forward.
We split the implementation into two phases. The minimum viable product (MVP)
in phase 1 will keep most of the proposal, with the following adjustments.


*Phase 1 (MVP)*

1. Emit delete and insert CDC records only.

2. Don't join for equality deletes. Emit equality deletes directly as
   deleted rows, per Anton's suggestion. Otherwise we would need to join
   the whole table with the equality delete files, which is not scalable.
   In phase 2 we will evaluate the cost of that join and either support
   it or find another way to approach it. (See the first sketch after
   this list.)

3. COW: output all rows in the deleted data files as deleted rows and
   all rows in the added data files as inserted rows. We will figure out
   a more scalable way to filter out unchanged rows in phase 2. Joining
   on all columns has two issues:
   1. It is not scalable; think of a table with more than 100 columns.
   2. It cannot handle duplicate records, e.g. (1, Amy, 20) was in the
      data files marked as deleted, and then we got new data files with
      two identical rows (1, Amy, 20) and (1, Amy, 20). See the sketch
      under the consensus list above.

4. User interface: create an action to generate CDC records instead of a
   procedure. An action can return a DataFrame, which is more convenient
   than the array of InternalRow produced by a Spark procedure. (See the
   second sketch after this list.)
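
For item 2, a minimal sketch of the two options, assuming an equality
delete file on the identifier column id (the table name and variable
names below are placeholders, not Iceberg API):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Keys carried by an equality delete file (identifier column: id).
    val equalityDeleteKeys = Seq(3, 7).toDF("id")

    // Phase 1: emit the keys directly as DELETE records; non-key columns
    // stay unset and no table scan is needed.
    val deleteRecords = equalityDeleteKeys

    // Deferred to phase 2: recovering the full pre-image of each deleted row
    // would mean joining the keys against every data file the delete applies
    // to, i.e. effectively a join over the whole table:
    // spark.table("db.events")   // placeholder table name
    //   .join(equalityDeleteKeys, Seq("id"), "left_semi")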

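For item 4, a very rough sketch of the action-style interface; every name
below is illustrative only and not a committed API:

    import org.apache.spark.sql.DataFrame

    // The point is that execute() returns a DataFrame that composes with the
    // rest of a Spark job, unlike a procedure that returns InternalRow arrays.
    trait GenerateChangesAction {
      def fromSnapshot(snapshotId: Long): GenerateChangesAction  // exclusive start
      def toSnapshot(snapshotId: Long): GenerateChangesAction    // inclusive end
      def execute(): DataFrame                                   // delete/insert records
    }

    // e.g. generateChanges(table).fromSnapshot(s1).toSnapshot(s2).execute()
    //        .where("change_type = 'DELETE'")   // column name is a placeholder
    //        .writeTo("db.cdc_sink").append()
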
*Phase 2*

1. Enable update reconstruction to emit CDC update records (see the
   sketch after this list).

2. COW: filter out unchanged rows.

3. User interface: support the metadata table, which will enable more
   use cases, e.g. streaming.
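
For item 1, the rough idea is to pair a DELETE and an INSERT that share
the same identifier key within one snapshot into a pre-image/post-image
update. A toy sketch, assuming id is the identifier column (my own
sketch, not the final design):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Phase-1 output for one snapshot: plain deletes and inserts.
    val deletes = Seq((1, "Amy", 20)).toDF("id", "name", "age")
    val inserts = Seq((1, "Amy", 21)).toDF("id", "name", "age")

    // A delete and an insert sharing the identifier key within the same
    // snapshot fold into one update: the deleted row is the pre-image and
    // the inserted row is the post-image.
    val updates = deletes.as("before").join(inserts.as("after"), Seq("id"))
      .select($"id",
              $"before.name".as("name_before"), $"before.age".as("age_before"),
              $"after.name".as("name_after"), $"after.age".as("age_after"))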


Best,

Yufei

`This is not a contribution`


On Mon, Mar 7, 2022 at 1:30 PM Anton Okolnychyi
<[email protected]> wrote:

> Hey folks,
>
> Based on Yufei’s design doc and what we discussed during the sync, I
> shared my thoughts on what can be efficiently supported right now.
>
> https://github.com/apache/iceberg/issues/3941#issuecomment-1061153554
>
> I’d be interested to learn more about specific use cases that would
> violate the assumptions I listed in my comment. If you have such a use case
> in mind, please, comment on the issue.
>
> - Anton
>
>
> On 24 Feb 2022, at 14:57, Yufei Gu <[email protected]> wrote:
>
> Hi everyone,
>
> Move the CDC design discussion to next week's Friday(Mar 4), 9-10am PST
> due to an unexpected event. The meeting link will be the same,
> meet.google.com/vam-cmfx-feo. Thanks!
>
> Best,
>
> Yufei
>
>
> On Tue, Feb 22, 2022 at 12:25 PM Yufei Gu <[email protected]> wrote:
>
>> Hi everyone,
>>
>> It's great to see a lot of interest in the design.
>> We are planning to have a meeting to discuss Iceberg CDC design on
>> Friday(2/25) 9-10am PST. The meeting link is meet.google.com/vam-cmfx-feo.
>> We will talk about the general idea, as well as open questions. The meeting
>> will be recorded.
>>
>>
>> Best,
>> Yufei
>>
>>
>> On Fri, Feb 11, 2022 at 3:54 PM Holden Karau <[email protected]>
>> wrote:
>>
>>> Oh cool, I have not had a chance to review much of this, but I was
>>> having a conversation with a team which wanted similar features for a table
>>> so excited to see folks working on it 👍
>>>
>>> On Fri, Feb 11, 2022 at 12:40 PM Yufei Gu <[email protected]> wrote:
>>>
>>>> Hi team,
>>>>
>>>> We propose a way to generate the CDC records from the Iceberg tables.
>>>> It is an approach without table spec change and write-time logging. It will
>>>> cover the majority of CDC use cases, but no guarantee to all of them. We
>>>> believe it's a good start point to approach CDC in the Iceberg. Any
>>>> feedback is welcome!
>>>>
>>>> https://docs.google.com/document/d/1bN6rdLNcYOHnT3xVBfB33BoiPO06aKBo56SZmuU9pnY/edit?usp=sharing
>>>>
>>>> Best,
>>>>
>>>> Yufei
>>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>
