Hi John. Thanks for your response!

Point 2 is clear to me now. I had been reading a lot of documentation about
it and only wanted to confirm with you.

Regarding point 1, I know the drawbacks: we have to avoid hot partitions
(for this purpose we can define a hash function that evenly distributes
the messages across the three partitions). However, in our specific case we
are using a tool that ingests the data from transactional databases (this
tool allows us to define the partitioner function), and we want to persist
the last state of a specific database row in a target OLAP database. For
example, we could have the following events, which indicate that a specific
row was updated:

("Id": "123456 ", "name": "Juan", "last_name": "Romero", "timestamp":
"13-04-2023 13:44:00.100", "Event": "Update")
("Id": "123456 ", "name": "Juan", "last_name": "Romero2", "timestamp":
"13-04-2023 13:44:00.200", "Event": "Update")
("Id": "123456 ", "name": "Juan", "last_name": "Romero3", "timestamp":
"13-04-2023 13:44:00.300", "Event": "Update")

Suppose all these events are being processed by different workers (in this
case a worker is a consumer and a producer at the same time, and at the end
it pushes the data into another target topic). At the source we can
guarantee that updates with the same id go to the same partition (so far
the messages would be ordered). The problem is that we will have many
consumers and producers in the middle: we can read the messages in order
from the first topic in the chain (because the same key always goes to the
same partition), but after that we have to process each message and push it
into another topic that also has 3 partitions (as I said before). In this
case the producer will spread the row updates across the different
partitions, and we can potentially lose the original message order, because
the different updates referring to a specific row end up spread across
multiple partitions.
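
To make the middle stage concrete, each of our Beam stages looks roughly
like the sketch below (simplified; the broker, topic, and group names are
placeholders). Keeping the row Id as the record key on the write side keeps
same-Id updates on one partition of the target topic, but, as described
above, it does not by itself stop parallel workers from emitting them out
of order:

import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class RelayStage {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers("broker:9092")
            .withTopic("source-topic")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            // A consumer group id is required for commitOffsetsInFinalize
            .withConsumerConfigUpdates(
                Map.of(ConsumerConfig.GROUP_ID_CONFIG, "relay-stage"))
            // Commit back to the source only once the message is persisted
            .commitOffsetsInFinalize()
            .withoutMetadata())
        .apply(MapElements.via(
            new SimpleFunction<KV<String, String>, KV<String, String>>() {
              @Override
              public KV<String, String> apply(KV<String, String> kv) {
                // Transform the payload here, but keep the row Id as the
                // record key so the default partitioner routes same-Id
                // updates to the same target partition.
                return KV.of(kv.getKey(), kv.getValue());
              }
            }))
        .apply(KafkaIO.<String, String>write()
            .withBootstrapServers("broker:9092")
            .withTopic("target-topic")
            .withKeySerializer(StringSerializer.class)
            .withValueSerializer(StringSerializer.class));

    p.run();
  }
}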

Let me know if my case is clear. If you want, I can put together a
diagram to explain it.

The general idea is that we need to push the last valid event from the
transactional database into the target, and in the middle we have
different Beam pipelines which act as both consumers and producers.
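
Since what we ultimately need in the OLAP target is only the last state per
row, one option we are evaluating is to make the sink order-insensitive:
carry the source timestamp on every event and merge with last-write-wins,
so any reordering in the middle becomes harmless. A minimal self-contained
sketch (the types and names are my own, not a Beam or Kafka API):

import java.time.Instant;

public class LastWriteWins {

  record RowEvent(String id, String lastName, Instant timestamp) {}

  // Keep whichever update carries the newest source timestamp; with this
  // merge at the sink, out-of-order arrival no longer changes the result.
  static RowEvent merge(RowEvent current, RowEvent incoming) {
    return incoming.timestamp().isAfter(current.timestamp()) ? incoming : current;
  }

  public static void main(String[] args) {
    RowEvent newer = new RowEvent("123456", "Romero2", Instant.parse("2023-04-13T13:44:00.200Z"));
    RowEvent older = new RowEvent("123456", "Romero", Instant.parse("2023-04-13T13:44:00.100Z"));
    // Even though the older event arrives second, the newest state wins.
    System.out.println(merge(newer, older)); // lastName=Romero2
  }
}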

Thanks!

On Thu, Apr 13, 2023 at 1:23 PM, John Casey via user (<[email protected]>)
wrote:

> Hi Juan,
>
> Under normal usage, Kafka will maintain ordering within a partition
> without any extra work by you.
>
> For 2, you can use .commitOffsetsInFinalize to only commit back to the
> source topic once the pipeline has persisted the message, at which point it
> may not be fully processed, but it is guaranteed that it will be processed.
>
> For 1, I would generally recommend against trying to write to a specific
> partition. This undermines the load balancing that Kafka enables with
> partitioning. I'd recommend separate topics instead. We don't support
> writing to specific partitions in the current IO.
>
> John
>
> On Thu, Apr 13, 2023 at 2:15 PM Juan Romero <[email protected]> wrote:
>
>> Hi Apache Beam team. I have been working on a POC for the company I'm
>> working for, and I'm using the Apache Beam Kafka connector to read from
>> one Kafka topic and write into another Kafka topic. The source and target
>> topics have 3 partitions, and it is compulsory to keep ordering by certain
>> message keys. Regarding this I have two questions:
>>
>> 1. How can I write a specific message to a specific Kafka partition?
>> 2. How can we commit the message to the source topic if and only if
>> the pipeline has processed the message?
>>
>> I look forward to your reply and hope you can help me with these doubts.
>>
>
