Hi,
I forgot to mention that we are using Flink 1.12.0. This is a job with
only minimal components: reading from the source and printing it.
Profiling was my next step. Regarding memory, I didn't see any
bottlenecks.
I guess I will have to do some investigating in the metrics part of Flink.

BR,
BB

On Tue, 25 May 2021 at 17:12, Piotr Nowojski <pnowoj...@apache.org> wrote:

> Hi,
>
> That's a throughput of roughly 700 records/second, which should be well
> below the theoretical limits of any deserializer (from hundreds of
> thousands up to tens of millions of records/second per single operator),
> unless your records are huge or very complex.
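The quoted figure can be reproduced with simple arithmetic from the numbers in the thread below (3.8 million events consumed in one and a half hours):

```java
public class Throughput {
    // Integer records-per-second for a given record count and duration.
    static long recordsPerSecond(long records, long seconds) {
        return records / seconds;
    }

    public static void main(String[] args) {
        // 3.8 million events in 90 minutes (figures from the thread).
        System.out.println(recordsPerSecond(3_800_000L, 90L * 60L)); // prints 703
    }
}
```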
>
> Long story short, I don't know of a magic bullet to help you solve your
> problem. As always you have two options, either optimise/speed up your
> code/job, or scale up.
>
> If you choose the former, think of Flink as just another Java
> application. Check metrics and resource usage, and understand which
> resource is the problem (CPU? memory? is the machine swapping? I/O?). You
> might be able to guess your bottleneck (reading from Kafka?
> deserialisation? something else? Flink itself?) by looking at some of the
> metrics (busyTimeMsPerSecond [1] or idleTimeMsPerSecond could help with
> that), or you can simplify your job to the bare minimum and test the
> performance of independent components. You can also always attach a code
> profiler and simply look at what's happening. First identify the source
> of the bottleneck, and then try to understand what's causing it.
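One way to test the performance of an independent component is a plain-JVM micro-benchmark outside Flink. A rough JDK-only sketch; the identity function below is a placeholder for the real deserialization call (e.g. objectMapper.readValue(...)), and for serious numbers a proper harness such as JMH would be preferable:

```java
import java.util.function.Function;

public class DeserBenchmark {
    // Times `iterations` calls of the given deserializer on one sample
    // record and returns the rough throughput in records per second.
    static double recordsPerSecond(Function<byte[], ?> deserializer,
                                   byte[] sampleRecord, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            deserializer.apply(sampleRecord);
        }
        double elapsedSeconds = (System.nanoTime() - start) / 1_000_000_000.0;
        return iterations / elapsedSeconds;
    }

    public static void main(String[] args) {
        byte[] sample = "{\"id\":1}".getBytes();
        // Placeholder deserializer; swap in the real mapping, e.g.
        // bytes -> objectMapper.readValue(bytes, MyClass.class)
        double rps = recordsPerSecond(bytes -> bytes, sample, 1_000_000);
        System.out.println("records/second: " + (long) rps);
    }
}
```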
>
> Best,
> Piotrek
>
> [1] busyTimeMsPerSecond is available since Flink 1.13. Flink 1.13 also
> comes with nice tools to analyse bottlenecks in the WebUI (coloring nodes
> in the job graph based on busy/back pressured status and Flamegraph
> support)
>
> On Tue, 25 May 2021 at 15:44, B.B. <bijela.vr...@gmail.com> wrote:
>
>> Hi,
>>
>> I am in the process of optimizing my job, which we currently believe is
>> too slow.
>>
>> We are deploying the job in Kubernetes with one job manager (1 GB RAM,
>> 1 CPU) and one task manager (4 GB RAM, 2 CPUs), i.e. 2 task slots and a
>> parallelism of two.
>>
>> The main problem is one Kafka source with 3.8 million events that we
>> have to process.
>> As a test we made a simple job that connects to Kafka using a custom
>> implementation of KafkaDeserializationSchema, where we use an
>> ObjectMapper that maps the input values, e.g.
>>
>> var event = objectMapper.readValue(consumerRecord.value(), MyClass.class);
>>
>> This is then validated with Hibernate Validator, and the output of the
>> source is printed to the console.
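One thing worth ruling out on a path like this is per-record construction of the ObjectMapper: it is comparatively expensive to build and is intended to be created once and reused. A JDK-only sketch of the reuse pattern; ExpensiveParser is a hypothetical stand-in for the mapper, and the transient field with lazy initialization mirrors how a non-serializable helper is commonly held inside a serializable deserialization schema:

```java
public class ReuseSketch implements java.io.Serializable {
    // Hypothetical stand-in for an expensive-to-construct helper
    // such as Jackson's ObjectMapper.
    static class ExpensiveParser {
        static int constructed = 0; // counts instantiations, for illustration only

        ExpensiveParser() { constructed++; }

        String parse(byte[] bytes) { return new String(bytes); }
    }

    private transient ExpensiveParser parser; // not serialized with the schema

    public String deserialize(byte[] record) {
        if (parser == null) {
            parser = new ExpensiveParser(); // created once per task, not per record
        }
        return parser.parse(record);
    }

    public static void main(String[] args) {
        ReuseSketch schema = new ReuseSketch();
        for (int i = 0; i < 1_000; i++) {
            schema.deserialize("{\"id\":1}".getBytes());
        }
        System.out.println(ExpensiveParser.constructed); // prints 1
    }
}
```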
>>
>> The time needed for the job to consume all the events was one and a
>> half hours, which seems a bit long.
>> Is there a way we can speed this up?
>>
>> Would more CPU cores or memory be the solution?
>> Should we switch to an Avro deserialization schema?
>>
>>
>>
>> --
Everybody wants to be a winner
Nobody wants to lose their game
Its insane for me
Its insane for you
Its insane
