Re: RichCdcSinkBuilder with Iceberg catalog?

Giannis Polyzos Fri, 19 Jul 2024 11:52:13 -0700

Hi Andrew,
The Paimon / Iceberg snapshot compatibility was merged into the master branch
last week and, I believe, will ship with the official 0.9 release.
As noted in the thread, Paimon has done a lot of work on CDC and has
strong integration with Flink CDC.
Flink CDC added Paimon support in its latest release, so you might see
both projects offering a variant. Flink CDC itself aims to make the
process easier via YAML files, and you can use it as well.


In terms of Paimon in the context of Flink, the reason it has done so much
work on CDC is that data mutation is also what stream processing requires,
i.e. changelog streams.
It offers a more cost-efficient way (with some cost/latency trade-off) to
replace Flink operations like aggregations or expensive streaming joins
(via partial-updates).
At the same time, it can replace expensive message queues via the
bucketed append, as it provides consistent message-queue functionality.
Along with that, and with the use of deletion vectors, it also provides a
cheaper-storage environment for OLAP.
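
To illustrate the partial-update point, here is a minimal sketch of the idea in
plain Python. It is not Paimon's implementation; the function names and the
sample rows are hypothetical. It assumes the usual partial-update rule: for
the same primary key, non-null fields from a newer record overwrite the stored
value, while null fields leave it untouched. Two streams that each fill in
different columns of the same key then converge to one wide row without a
stateful streaming join.

```python
from typing import Dict, Iterable, Optional, Tuple

Row = Dict[str, Optional[object]]


def partial_update(current: Optional[Row], incoming: Row) -> Row:
    """Merge `incoming` into `current`; null (None) fields never overwrite."""
    merged = dict(current or {})
    for field, value in incoming.items():
        # Keep the stored value when the incoming field is null,
        # unless the field has never been seen for this key.
        if value is not None or field not in merged:
            merged[field] = value
    return merged


def apply_changelog(records: Iterable[Tuple[int, Row]]) -> Dict[int, Row]:
    """Apply (primary_key, row) records in order to an in-memory table."""
    table: Dict[int, Row] = {}
    for key, row in records:
        table[key] = partial_update(table.get(key), row)
    return table


# Hypothetical inputs: one stream knows the amount, the other the status.
orders = [(1, {"amount": 30, "status": None})]
shipments = [(1, {"amount": None, "status": "shipped"})]

table = apply_changelog(orders + shipments)
print(table[1])
```

Each writer only needs to know its own columns; the merge happens in the
table's storage layer rather than in Flink state, which is where the
cost/latency trade-off mentioned above comes from.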

Some rough numbers from production use cases:
- message queue functionality with 20-30 second latencies (at the moment
there are a few use cases in production with 10-second latencies, but there
are still a few challenges in keeping resource usage low)
- OLAP: 30-60 seconds data freshness, with queries taking ~1-5 seconds
And of course, there is all the CDC and stream processing functionality,
where amazing work has been done.

Overall, the recommended latency for CDC and processing is around
the minute level, which also accounts for small-file problems in case you
don't have enough data.

Hope this provides some more context on the project, so you can see whether
it fits more of your use cases.

On Fri, Jul 19, 2024 at 9:25 PM Andrew Otto <[email protected]> wrote:

> TIL about XTable.  Cool!
>
>
> On Fri, Jul 19, 2024 at 2:11 PM Kyle Weller <[email protected]> wrote:
>
>> I wonder if Apache XTable <https://xtable.apache.org/> is also a
>> viable option to consider? Data could still be written and stored natively
>> as Paimon and asynchronously generate the iceberg manifest files and sync
>> to an Iceberg catalog. It is working great between Iceberg, Hudi, Delta
>> today in production. There may be some code in that project to leverage or
>> adding paimon XTable interface would auto unlock omni directional
>> translation to all 4 table formats versus a 1 by 1 integration.
>>
>> On Fri, Jul 19, 2024 at 8:41 AM Andrew Otto <[email protected]> wrote:
>>
>>> > Another approach is to create a snapshot compatible way for Paimon
>>> to generate Iceberg, which is what we are working on.
>>> Hi, just checking in!  How is this going? Thanks!
>>>
>>> On Mon, Jun 10, 2024 at 9:17 AM Andrew Otto <[email protected]> wrote:
>>>
>>>> Awesome, I look forward to it!  Thank you!
>>>>
>>>> On Mon, Jun 10, 2024 at 2:35 AM Jingsong Li <[email protected]>
>>>> wrote:
>>>>
>>>>> We are developing a prototype internally.
>>>>>
>>>>> It will take about 2 to 3 months.
>>>>>
>>>>> On Wed, May 29, 2024 at 9:46 PM Andrew Otto <[email protected]> wrote:
>>>>>
>>>>>> > Another approach is to create a snapshot compatible way for Paimon
>>>>>> to generate Iceberg, which is what we are working on.
>>>>>>
>>>>>> Oh!  Very interesting.  Can you say more? And/or do you have links to
>>>>>> Jira or anything?
>>>>>>
>>>>>> Thanks for your response! :)
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, May 29, 2024 at 7:41 AM Jingsong Li <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> It is difficult to move this mechanism to the Iceberg sink. The table
>>>>>>> structure change in Iceberg's design requires generating a new
>>>>>>> snapshot, which poses significant challenges to schema evolution.
>>>>>>>
>>>>>>> Another approach is to create a snapshot compatible way for Paimon to
>>>>>>> generate Iceberg, which is what we are working on.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jingsong
>>>>>>>
>>>>>>> On Fri, May 24, 2024 at 8:11 PM Andrew Otto <[email protected]>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Hi!
>>>>>>> >
>>>>>>> > How coupled to Paimon catalogs and tables is the cdc part of
>>>>>>> Paimon?  RichCdcMultiplexRecord and related code seem incredibly useful
>>>>>>> even outside of the context of the Paimon table format.
>>>>>>> >
>>>>>>> > I'm asking because the database sync action feature is amazing.
>>>>>>> At the Wikimedia Foundation, we are on an all-in journey with Iceberg. I'm
>>>>>>> wondering how hard it would be to extract the CDC logic from Paimon and
>>>>>>> abstract the Sink bits.
>>>>>>> >
>>>>>>> > Could the table/database sync with schema evolution (without Flink
>>>>>>> job restarts!) potentially work with the Iceberg sink?
>>>>>>> >
>>>>>>> > Thanks!
>>>>>>> > -Andrew Otto
>>>>>>> >  Wikimedia Foundation
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>>
