Iceberg connector

2024-04-15 Thread Chetas Joshi
Hello, I am running a batch flink job to read an iceberg table. I want to understand a few things. 1. How does the FlinkSplitPlanner decide which fileScanTasks (I think one task corresponds to one data file) need to be clubbed together within a single split and when to create a new split? 2.

Re: Optimize exact deduplication for tens of billions data per day

2024-04-15 Thread Alex Cruise
It may not be completely relevant to this conversation in this year, but I find myself sharing this article once or twice a year when opining about how hard deduplication at scale can be.  -0xe1a On Thu, Apr 11, 2024 at 10:22 PM Péter Váry

Re: [EXTERNAL]Re: Pyflink Performance and Benchmark

2024-04-15 Thread Niklas Wilcke
Hi Zhanghao Chen, thanks for sharing the link. This looks quite interesting! Regards, Niklas > On 15. Apr 2024, at 12:43, Zhanghao Chen wrote: > > When it comes down to the actual runtime, what really matters is the plan > optimization and the operator impl & shuffling. You might be

Re: Flink job performance

2024-04-15 Thread Kenan Kılıçtepe
How many taskmanagers and server do you have? Can you also share the task managers page of flink dashboard? On Mon, Apr 15, 2024 at 10:58 AM Oscar Perez via user wrote: > Hi community! > > We have an interesting problem with Flink after increasing parallelism in > a certain way. Here is the

Re: Flink job performance

2024-04-15 Thread Oscar Perez via user
Hi, I appreciate your comments and thank you for that. My original question still remains though. Why the very same job just by changing the settings aforementioned had this increase in cpu usage and performance degradation when we should have expected the opposite behaviour? thanks again, Oscar

Kinesis connector writes wrong sequence number at stop with savepoint

2024-04-15 Thread Vararu, Vadim
I’ve been investigating a data duplication issue in a Kinesis -> Flink -> Kafka exactly once setup. Found out that at the stop with savepoint next things happen: * The Kafka transaction is committed, the last processed events being written * The Kinesis sequence number is written in

Re: Flink job performance

2024-04-15 Thread Zhanghao Chen
The exception basically says the remote TM is unreachable, probably terminated due to some other reasons. This may not be the root cause. Is there any other exceptions in the log? Also, since the overall resource usage is almost full, could you try allocating more CPUs and see if the

Re: Flink job performance

2024-04-15 Thread Zhanghao Chen
Hi, there seems to be sth wrong with the two images attached in the latest email. I cannot open them. Best, Zhanghao Chen From: Oscar Perez via user Sent: Monday, April 15, 2024 15:57 To: Oscar Perez via user ; pi-team ; Hermes Team Subject: Flink job

Re: Understanding event time wrt watermarking strategy in flink

2024-04-15 Thread Sachin Mittal
Hi Yunfeng, So regarding the dropping of records for out of order watermark, lats say records later than T - B will be dropped by the first operator after watermarking, which is reading from the source. So then these records will never be forwarded to the step where we do event-time windowing.

Re: Pyflink Performance and Benchmark

2024-04-15 Thread Zhanghao Chen
When it comes down to the actual runtime, what really matters is the plan optimization and the operator impl & shuffling. You might be interested in this blog: https://flink.apache.org/2022/05/06/exploring-the-thread-mode-in-pyflink/, which did a benchmark on the latter with the common the

Re: Flink job performance

2024-04-15 Thread Zhanghao Chen
Hi Oscar, The rebalance operation will go over the network stack, but not necessarily involving remote data shuffle. For data shuffling between tasks of the same node, the local channel is used, but compared to chained operators, it still introduces extra data serialization overhead. For data

Data duplication at stop with savepoint

2024-04-15 Thread Vararu, Vadim
Hi community, Need your help to understand if there is a misconfiguration or it’s a Flink bug. I’ve drawn a schema for better understanding but here is the problem in few steps:

Pyflink Performance and Benchmark

2024-04-15 Thread Niklas Wilcke
Hi Flink Community, I wanted to reach out to you to get some input about Pyflink performance. Are there any resources available about Pyflink benchmarks and maybe a comparison with the Java API? I wasn't able to find something valuable, but maybe I missed something? I am aware that