Re: [ANNOUNCE] Apache Flink 1.19.0 released

2024-03-18 Thread gongzhongqiang
Congrats! Thanks to everyone involved!

Best,
Zhongqiang Gong

Lincoln Lee wrote on Mon, Mar 18, 2024 at 16:27:

> The Apache Flink community is very happy to announce the release of Apache
> Flink 1.19.0, which is the first release for the Apache Flink 1.19 series.
>
> Apache Flink® is an open-source stream processing framework for
> distributed, high-performing, always-available, and accurate data streaming
> applications.
>
> The release is available for download at:
> https://flink.apache.org/downloads.html
>
> Please check out the release blog post for an overview of the improvements
> for this release:
>
> https://flink.apache.org/2024/03/18/announcing-the-release-of-apache-flink-1.19/
>
> The full release notes are available in Jira:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353282
>
> We would like to thank all contributors of the Apache Flink community who
> made this release possible!
>
>
> Best,
> Yun, Jing, Martijn and Lincoln
>


Re: [ANNOUNCE] Donation Flink CDC into Apache Flink has Completed

2024-03-21 Thread gongzhongqiang
Congratulations! Thanks for the great work!


Best,
Zhongqiang Gong

Leonard Xu wrote on Wed, Mar 20, 2024 at 21:36:

> Hi devs and users,
>
> We are thrilled to announce that the donation of Flink CDC as a
> sub-project of Apache Flink has completed. We invite you to explore the new
> resources available:
>
> - GitHub Repository: https://github.com/apache/flink-cdc
> - Flink CDC Documentation:
> https://nightlies.apache.org/flink/flink-cdc-docs-stable
>
> After the Flink community accepted this donation [1], we completed the
> software copyright signing, code repo migration, code cleanup, website
> migration, CI migration, GitHub issues migration, etc.
> Here I am particularly grateful to Hang Ruan, Zhongqiang Gong, Qingsheng
> Ren, Jiabao Sun, LvYanquan, loserwang1024 and other contributors for their
> contributions and help during this process!
>
>
> For all previous contributors: The contribution process has slightly
> changed to align with the main Flink project. To report bugs or suggest new
> features, please open tickets in
> Apache Jira (https://issues.apache.org/jira). Note that we will no
> longer accept GitHub issues for these purposes.
>
>
> Welcome to explore the new repository and documentation. Your feedback and
> contributions are invaluable as we continue to improve Flink CDC.
>
> Thanks everyone for your support and happy exploring Flink CDC!
>
> Best,
> Leonard
> [1] https://lists.apache.org/thread/cw29fhsp99243yfo95xrkw82s5s418ob
>
>


Re: [ANNOUNCE] Apache Paimon is graduated to Top Level Project

2024-03-28 Thread gongzhongqiang
Congratulations!

Best,

Zhongqiang Gong

Yu Li wrote on Thu, Mar 28, 2024 at 15:57:

> CC the Flink user and dev mailing list.
>
> Paimon originated within the Flink community, initially known as Flink
> Table Store, and all our incubating mentors are members of the Flink
> Project Management Committee. I am confident that the bonds of
> enduring friendship and close collaboration will continue to unite the
> two communities.
>
> And congratulations all!
>
> Best Regards,
> Yu
>
> On Wed, 27 Mar 2024 at 20:35, Guojun Li  wrote:
> >
> > Congratulations!
> >
> > Best,
> > Guojun
> >
> > On Wed, Mar 27, 2024 at 5:24 PM wulin  wrote:
> >
> > > Congratulations~
> > >
> > > > On Mar 27, 2024, at 15:54, 王刚 wrote:
> > > >
> > > > Congratulations~
> > > >
> > > >> On Mar 26, 2024, at 10:25, Jingsong Li wrote:
> > > >>
> > > >> Hi Paimon community,
> > > >>
> > > >> I’m glad to announce that the ASF board has approved a resolution to
> > > >> graduate Paimon into a full Top Level Project. Thanks to everyone
> for
> > > >> your help to get to this point.
> > > >>
> > > >> I just created an issue to track the things we need to modify [2],
> > > >> please comment on it if you feel that something is missing. You can
> > > >> refer to apache documentation [1] too.
> > > >>
> > > >> And, we already completed the GitHub repo migration [3], please
> update
> > > >> your local git repo to track the new repo [4].
> > > >>
> > > >> You can run the following command to complete the remote repo
> tracking
> > > >> migration.
> > > >>
> > > >> git remote set-url origin https://github.com/apache/paimon.git
> > > >>
> > > >> If you have a different name, please change the 'origin' to your
> remote
> > > name.
> > > >>
> > > >> Please join me in celebrating!
> > > >>
> > > >> [1]
> > >
> https://incubator.apache.org/guides/transferring.html#life_after_graduation
> > > >> [2] https://github.com/apache/paimon/issues/3091
> > > >> [3] https://issues.apache.org/jira/browse/INFRA-25630
> > > >> [4] https://github.com/apache/paimon
> > > >>
> > > >> Best,
> > > >> Jingsong Lee
> > >
> > >
>


Re: One query just for curiosity

2024-03-28 Thread gongzhongqiang
Hi Ganesh,

As Zhanghao Chen mentioned, there are two solutions depending on your scenario:

1. Record processing is CPU-bound: scale up the parallelism of the task and the
Flink cluster to improve throughput.
2. Record processing is IO-bound: use Async I/O to reduce resource cost and get
better performance; a sketch follows this list.
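
For the IO-bound case, here is a minimal sketch of how Async I/O can be wired
in with the Java DataStream API. The enrich() call, the String types, and the
timeout/capacity values are placeholders for illustration only, not part of
any real job.

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncEnrichment {

    // Calls an external service without blocking the task thread.
    public static class AsyncEnrichFunction extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String record, ResultFuture<String> resultFuture) {
            // Placeholder: run the lookup on an async client or a separate thread pool.
            CompletableFuture
                    .supplyAsync(() -> enrich(record))
                    .thenAccept(result -> resultFuture.complete(Collections.singleton(result)));
        }

        private String enrich(String record) {
            return record + "-enriched"; // stand-in for the real external call
        }
    }

    public static DataStream<String> apply(DataStream<String> input) {
        // Up to 100 in-flight requests per subtask, each with a 5s timeout (example values).
        return AsyncDataStream.unorderedWait(
                input, new AsyncEnrichFunction(), 5, TimeUnit.SECONDS, 100);
    }
}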


Best,

Zhongqiang Gong

Ganesh Walse wrote on Fri, Mar 29, 2024 at 12:00:

> You mean to say we can process 32767 records in parallel. And may I know
> if this is the case then do we need to do anything for this.
>
> On Fri, 29 Mar 2024 at 8:08 AM, Zhanghao Chen 
> wrote:
>
>> Flink can be scaled up to a parallelism of 32767 at max. And if your
>> record processing is mostly IO-bound, you can further boost the throughput
>> via Async-IO [1].
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/
>>
>> Best,
>> Zhanghao Chen
>> --
>> *From:* Ganesh Walse 
>> *Sent:* Friday, March 29, 2024 4:48
>> *To:* user@flink.apache.org 
>> *Subject:* One query just for curiosity
>>
>> Hi Team,
>> If my 1 record gets processed in 1 second in a flink. Then what will be
>> the best time taken to process 1000 records in flink using maximum
>> parallelism.
>>
>


Re: FlinkCEP

2024-04-23 Thread gongzhongqiang
Hi,

Since Flink 1.5, there have been no major changes to the CEP API.

Best,
Zhongqiang Gong

Esa Heikkinen wrote on Tue, Apr 23, 2024 at 04:19:

> Hi
>
> It's been over 5 years since I last did anything with FlinkCEP and Flink.
>
> Has there been any significant development in FlinkCEP during this time?
>
> BR. Esa
>
>


Re: CSV format and hdfs

2024-04-28 Thread gongzhongqiang
Hi Artem,

I looked into this and opened an issue [1]; Rob Young, Alexander Fedulov and I
discussed it there. We also think this performance issue can be solved by
flushing manually, and I have opened a PR [2]. You can cherry-pick it, build
the jar locally, and replace the jar in your lib folder. A rough sketch of the
underlying idea follows the links.

I'd be glad to hear your feedback on this.

[1] https://issues.apache.org/jira/browse/FLINK-35240
[2] https://github.com/apache/flink/pull/24730
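
For illustration only, the no-op-flush workaround Rob describes below looks
roughly like this; it is a sketch of the idea, not the code in the PR.

import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Wraps an output stream and ignores flush() calls, so that Jackson's
 * per-record flush does not force a flush of the underlying (e.g. HDFS) stream.
 * The bulk writer can still flush the wrapped stream explicitly when needed.
 */
public class NonFlushingOutputStream extends FilterOutputStream {

    public NonFlushingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Delegate directly to avoid FilterOutputStream's byte-by-byte default.
        out.write(b, off, len);
    }

    @Override
    public void flush() {
        // Intentionally a no-op: flushing is controlled by the writer, not by Jackson.
    }
}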


Best,
Zhongqiang Gong

Robert Young wrote on Fri, Apr 26, 2024 at 13:25:

> Hi Artem,
>
> I had a debug of Flink 1.17.1 (running CsvFilesystemBatchITCase) and I see
> the same behaviour. It's the same on master too. Jackson flushes [1] the
> underlying stream after every `writeValue` call. I experimented with
> disabling the flush by disabling Jackson's FLUSH_PASSED_TO_STREAM [2]
> feature but this broke the Integration tests. This is because Jackson wraps
> the stream in its own Writer that buffers data. We depend on the flush to
> flush the Jackson writer and eventually write the bytes to the stream.
>
> One workaround I found [3] is to wrap the stream in an implementation that
> ignores flush calls, and pass that to Jackson. So Jackson will flush its
> writer buffers and write the bytes to the underlying stream, then try to
> flush the underlying stream, but that will be a no-op. The CsvBulkWriter will
> continue to flush/sync the underlying stream. Unfortunately this required
> code changes in Flink CSV, so it might not be helpful for you.
>
> 1.
> https://github.com/FasterXML/jackson-dataformats-text/blob/8700b5489090f81b4b8d2636f9298ac47dbf14a3/csv/src/main/java/com/fasterxml/jackson/dataformat/csv/CsvGenerator.java#L504
> 2.
> https://fasterxml.github.io/jackson-core/javadoc/2.13/com/fasterxml/jackson/core/JsonGenerator.Feature.html#FLUSH_PASSED_TO_STREAM
> 3.
> https://github.com/robobario/flink/commit/ae3fdb1ca9de748df791af232bba57d6d7289a79
>
> Rob Young
>


Re: Checkpointing while loading causing issues

2024-05-14 Thread gongzhongqiang
Hi Lars,

Currently, there is no configuration in Flink to trigger a checkpoint
immediately after the job starts.

However, we can mitigate the issue from several angles using the guidance in
the large-state tuning documentation [1]. A sketch of checkpoint settings that
are commonly adjusted in this situation follows the link.

[1] https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/ops/state/large_state_tuning/
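
For example, the sketch below raises the checkpoint-related limits that are
typically hit while a large-state job is still initializing; all values are
placeholders to adapt to your job.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuning {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60s (example value).
        env.enableCheckpointing(60_000L);

        // Give slow (e.g. still-initializing) tasks more time before a checkpoint is declared failed.
        env.getCheckpointConfig().setCheckpointTimeout(10 * 60_000L);

        // Tolerate more consecutive checkpoint failures before the job is restarted.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(10);

        // Leave room between checkpoints so recovery/initialization is not starved.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000L);
    }
}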



Best regards,
Zhongqiang Gong

Lars Skjærven wrote on Wed, May 15, 2024 at 05:10:

> Hello,
>
> When restarting jobs (e.g. after upgrade) with "large" state a task can
> take some time to "initialize" (depending on the state size). During this
> time I noticed that Flink attempts to checkpoint. In many cases
> checkpointing will fail repeatedly, and cause the job to hit the
> tolerable-failed-checkpoints limit and restart. The only way to overcome
> the issue seems to be to increase the checkpoint interval, but this is
> suboptimal.
>
> Could Flink wait to trigger checkpointing when one or more task is
> initializing?
>
> Lars
>


Re: SSL Kafka PyFlink

2024-05-16 Thread gongzhongqiang
Hi Phil,

The Kafka SSL configuration keys may not be correct. The Flink Kafka connector
uses the Java Kafka client, so please refer to the Kafka documentation [1] for
the client SSL configuration keys; a sketch follows the link.


[1] https://kafka.apache.org/documentation/#security_configclients
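
For illustration, here is a sketch of the DDL with Java-client-style SSL keys
(shown via the Java Table API, but the same DDL can be passed to
t_env.execute_sql in PyFlink). All paths, passwords, the topic name and the
format are placeholders; if you only have PEM files, you may need to convert
them into a truststore/keystore or use the PEM-type options of newer Kafka
clients.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class KafkaSslTable {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // All paths, passwords and names below are placeholders.
        tEnv.executeSql(
                "CREATE TABLE logs (\n"
                        + "  `user` ROW<user_id BIGINT>,\n"
                        + "  `timestamp` ROW<secs BIGINT>\n"
                        + ") WITH (\n"
                        + "  'connector' = 'kafka',\n"
                        + "  'topic' = 'logs',\n"
                        + "  'properties.bootstrap.servers' = 'broker:9093',\n"
                        + "  'properties.group.id' = 'my_group_id',\n"
                        + "  'scan.startup.mode' = 'latest-offset',\n"
                        + "  'format' = 'json',\n"
                        + "  'properties.security.protocol' = 'SSL',\n"
                        // Java client keys: truststore/keystore instead of librdkafka's ssl.ca.location style.
                        + "  'properties.ssl.truststore.location' = '/path/to/truststore.jks',\n"
                        + "  'properties.ssl.truststore.password' = 'changeit',\n"
                        + "  'properties.ssl.keystore.location' = '/path/to/keystore.jks',\n"
                        + "  'properties.ssl.keystore.password' = 'changeit',\n"
                        + "  'properties.ssl.key.password' = 'changeit',\n"
                        + "  'properties.ssl.endpoint.identification.algorithm' = ''\n"
                        + ")");
    }
}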


Best,
Zhongqiang Gong

Phil Stavridis wrote on Fri, May 17, 2024 at 01:44:

> Hi,
>
> I have a PyFlink job that needs to read from a Kafka topic and the
> communication with the Kafka broker requires SSL.
> I have connected to the Kafka cluster with something like this using just
> Python.
>
> from confluent_kafka import Consumer, KafkaException, KafkaError
>
>
>
> def get_config(bootstrap_servers, ca_file, cert_file, key_file):
> config = {
> 'bootstrap.servers': bootstrap_servers,
> 'security.protocol': 'SSL',
> 'ssl.ca.location': ca_file,
> 'ssl.certificate.location': cert_file,
> 'ssl.key.location': key_file,
> 'ssl.endpoint.identification.algorithm': 'none',
> 'enable.ssl.certificate.verification': 'false',
> 'group.id': 'my_group_id'
> }
>
>
> return config
>
>
>
> And have read messages from the Kafka topic.
>
> I am trying to set up something similar with Flink SQL:
>
> t_env.execute_sql(f"""
> CREATE TABLE logs (
> `user` ROW(`user_id` BIGINT),
> `timestamp` ROW(`secs` BIGINT)
> ) WITH (
> 'connector' = '{CONNECTOR_TYPE}',
> 'topic' = '{KAFKA_TOPIC}',
> 'properties.bootstrap.servers' = '{BOOTSTRAP_SERVERS}',
> 'properties.group.id' = '{CONSUMER_GROUP}',
> 'scan.startup.mode' = '{STARTUP_MODE_LATEST}',
> 'format' = '{MESSAGE_FORMAT}',
> 'properties.security.protocol' = 'SSL',
> 'properties.ssl.ca.location' = '{ca_file}',
> 'properties.ssl.certificate.location' = '{cert_file}',
> 'properties.ssl.key.location' = '{key_file}',
> 'properties.ssl.endpoint.identification.algorithm' = ''
> )
> """)
>
>
> But when this runs I am getting this error:
>
> Caused by: org.apache.flink.util.FlinkException: Global failure triggered
> by OperatorCoordinator for 'Source: kafka_action_logs[1] -> Calc[2]'
> (operator cbc357ccb763df2852fee8c4fc7d55f2).
> ...
> Caused by: org.apache.flink.util.FlinkRuntimeException: Failed to list
> subscribed topic partitions due to
> at
> ...
> Caused by: java.lang.RuntimeException: Failed to get metadata for topics
> [logs].
> at
> ...
> ... 3 more
> Caused by: java.util.concurrent.ExecutionException:
> org.apache.flink.kafka.shaded.org.apache.kafka.common.errors.SslAuthenticationException:
> SSL handshake failed
> at
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> at
> org.apache.flink.kafka.shaded.org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:165)
> at
> org.apache.flink.connector.kafka.source.enumerator.subscriber.KafkaSubscriberUtils.getTopicMetadata(KafkaSubscriberUtils.java:44)
> ... 10 more
> Caused by:
> org.apache.flink.kafka.shaded.org.apache.kafka.common.errors.SslAuthenticationException:
> SSL handshake failed
> Caused by: javax.net.ssl.SSLHandshakeException: PKIX path building failed:
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find
> valid certification path to requested target
> ...
> at
> org.apache.flink.kafka.shaded.org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:435)
> at
> org.apache.flink.kafka.shaded.org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:523)
> at
> org.apache.flink.kafka.shaded.org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:373)
> at
> org.apache.flink.kafka.shaded.org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:293)
> at
> org.apache.flink.kafka.shaded.org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:178)
> at
> org.apache.flink.kafka.shaded.org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:543)
> at
> org.apache.flink.kafka.shaded.org.apache.kafka.common.network.Selector.poll(Selector.java:481)
> at
> org.apache.flink.kafka.shaded.org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:560)
> at
> org.apache.flink.kafka.shaded.org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.processRequests(KafkaAdminClient.java:1413)
> at
> org.apache.flink.kafka.shaded.org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1344)
> at java.lang.Thread.run(Thread.java:750)
> Caused by: sun.security.validator.ValidatorException: PKIX path building
> failed: sun.security.provider.certpath.SunCertPathBuilderException: unable
> to find valid certification path to requested target
> at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:456)
> at
> sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:323)
> at sun.security.validator.Validator.validate(Validator.java:271)
> at
> sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:315)
> at
> sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509Tru

Re: What is the best way to aggregate data over a long window

2024-05-17 Thread gongzhongqiang
Hi Sachin,

We can optimize this in the following ways (a sketch of the first point
follows this list):
- Use org.apache.flink.streaming.api.datastream.WindowedStream#aggregate(org.apache.flink.api.common.functions.AggregateFunction)
  to reduce the amount of data kept in the window state.
- Use TTL to clean up state that is no longer needed.
- Enable incremental checkpoints.
- Use multi-level time window granularity for pre-aggregation; this can
  significantly improve performance and reduce computation latency.
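
A minimal sketch of the first point, assuming a keyed stream of (key, value)
tuples with timestamps/watermarks already assigned upstream; the average
computation is only an example.

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class LongWindowAverage {

    // The accumulator keeps only (sum, count), not the individual records.
    public static class AverageAggregate
            implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {

        @Override
        public Tuple2<Long, Long> createAccumulator() {
            return Tuple2.of(0L, 0L);
        }

        @Override
        public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> acc) {
            return Tuple2.of(acc.f0 + value.f1, acc.f1 + 1);
        }

        @Override
        public Double getResult(Tuple2<Long, Long> acc) {
            return acc.f1 == 0 ? 0.0 : ((double) acc.f0) / acc.f1;
        }

        @Override
        public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
            return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
        }
    }

    public static DataStream<Double> apply(DataStream<Tuple2<String, Long>> events) {
        return events
                .keyBy(e -> e.f0)
                .window(TumblingEventTimeWindows.of(Time.hours(1)))
                .aggregate(new AverageAggregate());
    }
}

Compared to reduce(), the accumulator type can differ from the input type, and
only the accumulator is kept in window state rather than all buffered records.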

Best,
Zhongqiang Gong

Sachin Mittal wrote on Fri, May 17, 2024 at 03:48:

> Hi,
> My pipeline step is something like this:
>
> SingleOutputStreamOperator reducedData =
> data
> .keyBy(new KeySelector())
> .window(
> TumblingEventTimeWindows.of(Time.seconds(secs)))
> .reduce(new DataReducer())
> .name("reduce");
>
>
> This works fine for secs = 300.
> However once I increase the time window to say 1 hour or 3600 the state
> size increases as now it has a lot more records to reduce.
>
> Hence I need to allocate much more memory to the task manager.
>
> However there is no upper limit to this memory allocated. If the volume of
> data increases by say 10 fold I would have no option but to again increase
> the memory.
>
> Is there a better way to perform long window aggregation so overall this
> step has a small memory footprint.
>
> Thanks
> Sachin
>
>


Re: [ANNOUNCE] Apache Flink CDC 3.1.0 released

2024-05-17 Thread gongzhongqiang
Congratulations!
Thanks to all contributors.


Best,

Zhongqiang Gong

Qingsheng Ren wrote on Fri, May 17, 2024 at 17:33:

> The Apache Flink community is very happy to announce the release of
> Apache Flink CDC 3.1.0.
>
> Apache Flink CDC is a distributed data integration tool for real time
> data and batch data, bringing the simplicity and elegance of data
> integration via YAML to describe the data movement and transformation
> in a data pipeline.
>
> Please check out the release blog post for an overview of the release:
>
> https://flink.apache.org/2024/05/17/apache-flink-cdc-3.1.0-release-announcement/
>
> The release is available for download at:
> https://flink.apache.org/downloads.html
>
> Maven artifacts for Flink CDC can be found at:
> https://search.maven.org/search?q=g:org.apache.flink%20cdc
>
> The full release notes are available in Jira:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12354387
>
> We would like to thank all contributors of the Apache Flink community
> who made this release possible!
>
> Regards,
> Qingsheng Ren
>


Re: What is the best way to aggregate data over a long window

2024-05-20 Thread gongzhongqiang
Hi Sachin,

Performing incremental aggregation with stateful processing achieves the same
result as a window with an aggregate function, but the former is more
flexible. If Flink's windows cannot meet your performance needs and your
business logic has characteristics you can exploit for custom optimization,
you can choose the stateful approach; a rough sketch follows.
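
A rough sketch of that stateful approach, matching what Sachin describes
below: one value per key is kept in ValueState, an event-time timer fires at
the end of each period, and the state is emitted and cleared. The Long types,
the sum as the reduce function, and the 1-hour period are only examples, and
it assumes timestamps/watermarks are assigned upstream.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Incrementally reduces Long values per key over fixed 1-hour event-time periods.
public class IncrementalReducer extends KeyedProcessFunction<String, Long, Long> {

    private static final long PERIOD_MS = 3600_000L; // 1 hour, example value

    private transient ValueState<Long> aggregate;

    @Override
    public void open(Configuration parameters) {
        aggregate = getRuntimeContext()
                .getState(new ValueStateDescriptor<>("aggregate", Types.LONG));
    }

    @Override
    public void processElement(Long value, Context ctx, Collector<Long> out) throws Exception {
        Long current = aggregate.value();
        // Reduce the incoming record into the single stored value (sum used as an example).
        aggregate.update(current == null ? value : current + value);

        // Fire at the end of the event-time period the current element belongs to.
        long periodEnd = (ctx.timestamp() / PERIOD_MS + 1) * PERIOD_MS - 1;
        ctx.timerService().registerEventTimeTimer(periodEnd);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Long> out) throws Exception {
        Long result = aggregate.value();
        if (result != null) {
            out.collect(result);
        }
        aggregate.clear(); // state holds at most one value per key
    }
}

You would apply it with something like stream.keyBy(...).process(new
IncrementalReducer()) in place of the window/reduce step.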

Best,
Zhongqiang Gong

Sachin Mittal wrote on Fri, May 17, 2024 at 19:39:

> Hi,
> I am doing the following
> 1. Use reduce function where the data type of output after windowing is
> the same as the input.
> 2. Where the output of data type after windowing is different from that of
> input I use the aggregate function. For example:
>
> SingleOutputStreamOperator data =
> reducedPlayerStatsData
> .keyBy(new KeySelector())
> .window(
> TumblingEventTimeWindows.of(Time.seconds(secs)))
> .aggregate(new DataAggregator())
> .name("aggregate");
>
> In this case data which is aggregated is of a different type than the
> input so I had to use aggregate function.
> However in cases where data is of the same type using reduce function is
> very simple to use.
> Is there any fundamental difference between aggregate and reduce function
> in terms of performance?
> 3. I have enable incremental checkpoints at flink conf level using:
> state.backend.type: "rocksdb"
> state.backend.incremental: "true"
>
> 4. I am really not sure how I can use TTL. I assumed that Flink would
> automatically clean the state of windows that are expired ? Is there any
> way I can use TTL in the steps I have mentioned.
> 5. When you talk about pre-aggregation is this what you mean, say first
> compute minute aggregation and use that as input for hour aggregation ? So
> my pipeline would be something like this:
>
> SingleOutputStreamOperator reducedData =
> data
> .keyBy(new KeySelector())
> .window(
> TumblingEventTimeWindows.of(Time.seconds(60)))
> .reduce(new DataReducer()).window(
>
> TumblingEventTimeWindows.of(Time.seconds(3600)))
> .reduce(new DataReducer()).name("reduce");
>
>
> I was thinking of performing incremental aggregation using stateful 
> processing.
>
> Basically read one record and reduce it and store it in state and then read 
> next and reduce that plus the current state and update the new reduced value 
> back in the state and so on.
>
> Fire the final reduced value from the state at the end of eventtime I 
> register to my event timer and then update the timer to next event time and 
> also clean the state.
>
> This way each state would always keep only one record, no matter for what 
> period we aggregate data for.
>
> Is this a better approach than windowing ?
>
>
> Thanks
> Sachin
>
>
> On Fri, May 17, 2024 at 1:14 PM gongzhongqiang 
> wrote:
>
>> Hi  Sachin,
>>
>> We can optimize this in the following ways:
>> - Use org.apache.flink.streaming.api.datastream.WindowedStream#aggregate(org.apache.flink.api.common.functions.AggregateFunction)
>> to reduce the amount of data kept in the window state.
>> - Use TTL to clean up state that is no longer needed.
>> - Enable incremental checkpoints.
>> - Use multi-level time window granularity for pre-aggregation; this can
>> significantly improve performance and reduce computation latency.
>>
>> Best,
>> Zhongqiang Gong
>>
>> Sachin Mittal wrote on Fri, May 17, 2024 at 03:48:
>>
>>> Hi,
>>> My pipeline step is something like this:
>>>
>>> SingleOutputStreamOperator reducedData =
>>> data
>>> .keyBy(new KeySelector())
>>> .window(
>>> TumblingEventTimeWindows.of(Time.seconds(secs)))
>>> .reduce(new DataReducer())
>>> .name("reduce");
>>>
>>>
>>> This works fine for secs = 300.
>>> However once I increase the time window to say 1 hour or 3600 the state
>>> size increases as now it has a lot more records to reduce.
>>>
>>> Hence I need to allocate much more memory to the task manager.
>>>
>>> However there is no upper limit to this memory allocated. If the volume
>>> of data increases by say 10 fold I would have no option but to again
>>> increase the memory.
>>>
>>> Is there a better way to perform long window aggregation so overall this
>>> step has a small memory footprint.
>>>
>>> Thanks
>>> Sachin
>>>
>>>


Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

2024-05-26 Thread gongzhongqiang
Flink CDC 3.0 focuses on data integration scenarios, so you don't need to
pay attention to the framework implementation; you just describe the data
source and target in YAML to quickly build a data synchronization task with
schema evolution. It also supports a rich set of sources and sinks.

So I think adding a pipeline sink connector for Iceberg would be the better
way to go. We can rely on the Flink CDC 3.0 framework to connect the source
and target easily.

Best,
Zhongqiang Gong


Péter Váry wrote on Sat, May 25, 2024 at 19:29:

> > Could the table/database sync with schema evolution (without Flink job
> restarts!) potentially work with the Iceberg sink?
>
> Making  this work would be a good addition to the Iceberg-Flink connector.
> It is definitely doable, but not a single PR sized task. If you want to try
> your hands on it, I will try to find time to review your plans/code, so
> your code could be incorporated into the upcoming releases.
>
> Thanks,
> Peter
>
>
>
> On Fri, May 24, 2024, 17:07 Andrew Otto  wrote:
>
>> > What is not is the automatic syncing of entire databases, with schema
>> evolution and detection of new (and dropped?) tables. :)
>> Wait.  Is it?
>>
>> > Flink CDC supports synchronizing all tables of source database
>> instance to downstream in one job by configuring the captured database list
>> and table list.
>>
>>
>> On Fri, May 24, 2024 at 11:04 AM Andrew Otto  wrote:
>>
>>> Indeed, using Flink-CDC to write to Flink Sink Tables, including
>>> Iceberg, is supported.
>>>
>>> What is not is the automatic syncing of entire databases, with schema
>>> evolution and detection of new (and dropped?) tables.  :)
>>>
>>>
>>>
>>>
>>> On Fri, May 24, 2024 at 8:58 AM Giannis Polyzos 
>>> wrote:
>>>
 https://nightlies.apache.org/flink/flink-cdc-docs-stable/
 All these features come from Flink cdc itself. Because Paimon and Flink
 cdc are projects native to Flink there is a strong integration between 
 them.
 (I believe it’s on the roadmap to support iceberg as well)

 On Fri, 24 May 2024 at 3:52 PM, Andrew Otto  wrote:

> > I’m curious if there is any reason for choosing Iceberg instead of
> Paimon
>
> No technical reason that I'm aware of.  We are using it mostly because
> of momentum.  We looked at Flink Table Store (before it was Paimon), but
> decided it was too early and the docs were too sparse at the time to 
> really
> consider it.
>
> > Especially for a use case like CDC that iceberg struggles to
> support.
>
> We aren't doing any CDC right now (for many reasons), but I have never
> seen a feature like Paimon's database sync before.  One job to sync and
> evolve an entire database?  That is amazing.
>
> If we could do this with Iceberg, we might be able to make an argument
> to product managers to push for CDC.
>
>
>
> On Fri, May 24, 2024 at 8:36 AM Giannis Polyzos 
> wrote:
>
>> I’m curious if there is any reason for choosing Iceberg instead of
>> Paimon (other than - iceberg is more popular).
>> Especially for a use case like CDC that iceberg struggles to support.
>>
>> On Fri, 24 May 2024 at 3:22 PM, Andrew Otto 
>> wrote:
>>
>>> Interesting thank you!
>>>
>>> I asked this in the Paimon users group:
>>>
>>> How coupled to Paimon catalogs and tables is the cdc part of
>>> Paimon? RichCdcMultiplexRecord and
>>> related code seem incredibly useful even outside of the context of the
>>> Paimon table format.
>>>
>>> I'm asking because the database sync action feature
>>> is amazing.  At the Wikimedia Foundation, we are on an all-in journey 
>>> with
>>> Iceberg.  I'm wondering how hard it would be to extract the CDC logic 
>>> from
>>> Paimon and abstract the Sink bits.
>>>
>>> Could the table/database sync with schema evolution (without Flink
>>> job restarts!) potentially work with the Iceberg sink?
>>>
>>>
>>>
>>>
>>> On Thu, May 23, 2024 at 4:34 PM Péter Váry <
>>> peter.vary.apa...@gmail.com> wrote:
>>>
 If I understand correctly, Paimon is sending `CdcRecord`-s [1] on
 the wire which contain not only the data, but the schema as well.
 With Iceberg we currently only send the row data, and expect to
 receive the schema on job start - this is more performant than sending 
 the
 schema all the time, but has the obvious issue that it is not able to
 handle the schema changes. Another part of the dynamic schema
 synchronization is the update of the Iceberg table schema - the

Re: Slack Invite

2024-05-30 Thread gongzhongqiang
Hi,
The invite link:
https://join.slack.com/t/apache-flink/shared_invite/zt-2jtsd06wy-31q_aELVkdc4dHsx0GMhOQ

Best,
Zhongqiang Gong

Nelson de Menezes Neto wrote on Thu, May 30, 2024 at 15:01:

> Hey guys!
>
> I want to join the Slack community, but the invite has expired.
> Can you send me a new one?
>