Re: ConsumeKafka to PublishKafka doesn't keep the order of the messages in the destination topic

2024-01-02 Thread edi mari
Hi Mark,
I appreciate your advice. Upon examination, implementing the queue as a
"First In First Out" prioritizer and configuring the load balancing
strategy to "Partition by attribute" with the "kafka.partition" attribute
has proven effective in maintaining the order.

Thanks
Edi
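
For reference, the same ordering guarantees are easy to see in plain Kafka
client terms. The sketch below is a minimal illustration, assuming the
confluent-kafka Python package; the broker address, topic names, and group id
are placeholders. It mirrors the three points from Mark's advice: publish to
the source partition number, keep a single message in flight, and only advance
the consumer offset once the publish is acknowledged (the flush-before-commit
pairing plays the role of the "Rollback" failure strategy — a failed publish
leaves the offset uncommitted, so the message is retried in place rather than
re-ordered).

from confluent_kafka import Consumer, KafkaException, Producer

SOURCE_TOPIC = "source-topic"       # placeholder
DEST_TOPIC = "destination-topic"    # placeholder
BROKERS = "localhost:9092"          # placeholder

consumer = Consumer({
    "bootstrap.servers": BROKERS,
    "group.id": "replicator",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,    # advance offsets only after a good publish
})
producer = Producer({"bootstrap.servers": BROKERS, "enable.idempotence": True})

consumer.subscribe([SOURCE_TOPIC])
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        raise KafkaException(msg.error())
    # Publish to the same partition number on the destination topic,
    # the plain-client analogue of ${kafka.partition} on PublishKafka.
    producer.produce(DEST_TOPIC, value=msg.value(), key=msg.key(),
                     partition=msg.partition())
    producer.flush()                # block until the broker acknowledges
    consumer.commit(message=msg, asynchronous=False)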

On Fri, Dec 15, 2023 at 9:53 PM Mark Payne  wrote:

> Edi,
>
> Looking at your config again, you’ll want to also ensure that on the
> Publisher, you set the partition to `${kafka.partition}` so that it goes to
> the same partition on the destination system. You’ll also want to ensure
> that you set “Failure Strategy” to “Rollback” - otherwise any failure would
> route to ‘failure’ relationship and change the ordering. You’ll also need
> to limit the concurrent tasks on the publisher to 1 concurrent task, to
> ensure that you’re not sending multiple FlowFiles out of order.
>
> Thanks
> -Mark
>
>
> On Dec 15, 2023, at 2:26 AM, edi mari  wrote:
>
> Hi Mark,
> I tried the combination of FIFO and setting the back pressure to 10k, but
> it didn't preserve the order.
>
> Thanks
> Edi
>
> On Wed, Dec 13, 2023 at 3:47 PM Mark Payne  wrote:
>
>> Hey Edi,
>>
>> By default, NiFi doesn’t preserve ordering but you can have it do so by
>> updating the connection’s configuration and adding the First In First Out
>> Prioritizer.
>>
>> Also of note, you will want to keep the backpressure threshold set to
>> 10,000 objects rather than increasing it as shown in the image.
>>
>> Thanks
>> Mark
>>
>>
>> Sent from my iPhone
>>
>> On Dec 13, 2023, at 8:19 AM, edi mari  wrote:
>>
>>
>>
>> Hello,
>> I'm using NiFi v1.20.0 to replicate 250 million messages between Kafka
>> topics.
>> The problem is that NiFi replicates messages in a non-sequential order,
>> resulting in the destination topic storing messages differently than the
>> source topic.
>>
>> for example
>> *source topic - partition 0*
>> offset:5 key:a value:v1
>> offset:6 key:a value:v2
>> offset:7 key:a value:v3
>>
>> *destination topic - partition 0*
>> offset:5 key:a value:v2
>> offset:6 key:a value:v1
>> offset:7 key:a value:v3
>>
>> The topics are configured with a cleanup policy: compact.
>>
>> I'm using ConsumeKafka and PublishKafka processors to replicate topics.
>>
>> [inline images not preserved in the archive]
>>
>> Thanks
>> Edi
>>
>>
>
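
A note on why the compacted cleanup policy makes this ordering problem more
than cosmetic: compaction keeps only the last record per key, so any
reordering that changes which record arrives last changes the topic's final
contents. A toy illustration in Python, modelling compaction as
last-write-wins per key:

def compact(log):
    """Toy model of log compaction: the last write per key wins."""
    survivors = {}
    for key, value in log:
        survivors[key] = value
    return survivors

source    = [("a", "v1"), ("a", "v2"), ("a", "v3")]
reordered = [("a", "v1"), ("a", "v3"), ("a", "v2")]   # v2/v3 swapped in flight

print(compact(source))      # {'a': 'v3'}  -- intended final state
print(compact(reordered))   # {'a': 'v2'}  -- a stale value survives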


Re: ConsumeKafka to PublishKafka doesn't keep the order of the messages in the destination topic

2023-12-15 Thread Willem Pretorius

Hello Edi


NiFi is amazing for many use cases, but this feels like swimming
upstream, so to speak.



Have you considered Kafka Connect?

Define a Sink Connector to dump the source Kafka topic to disk/S3, and then
define a Source Connector to read it back into a destination topic.
I haven't played with the S3 connector yet, but it does seem to support
preserving Kafka partitions.
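
A rough sketch of the sink half of that idea, assuming Confluent's S3 sink
connector and a Kafka Connect worker reachable on localhost:8083 (the topic,
bucket, region, and connector name are illustrative); the source half would be
a second PUT with the matching S3 source connector config:

import requests

CONNECT_URL = "http://localhost:8083"   # placeholder Connect worker

# Confluent's S3 sink: dump the source topic to a staging bucket.
sink_config = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "source-topic",                                   # placeholder
    "s3.bucket.name": "replication-staging",                    # placeholder
    "s3.region": "us-east-1",                                   # placeholder
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "10000",
}

# PUT /connectors/<name>/config creates or updates a named connector.
resp = requests.put(f"{CONNECT_URL}/connectors/s3-sink/config", json=sink_config)
resp.raise_for_status()

Whether partition assignments survive the round trip depends on the
connector's partitioner configuration, so that is worth verifying before
committing to this route.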




On 2023/12/15 11:26, edi mari wrote:

Hi Mark,
I tried the combination of FIFO and setting the back pressure to 10k, 
but it didn't preserve the order.


Thanks
Edi

On Wed, Dec 13, 2023 at 3:47 PM Mark Payne  wrote:

Hey Edi,

By default, NiFi doesn’t preserve ordering but you can have it do
so by updating the connection’s configuration and adding the First
In First Out Prioritizer.

Also of note, you will want to keep the backpressure threshold set
to 10,000 objects rather than increasing it as shown in the image.

Thanks
Mark


Sent from my iPhone


On Dec 13, 2023, at 8:19 AM, edi mari  wrote:



Hello,
I'm using NiFi v1.20.0 to replicate 250 million messages between
Kafka topics.
The problem is that NiFi replicates messages in a non-sequential
order, resulting in the destination topic storing messages
differently than the source topic.

for example
*source topic - partition 0*
offset:5 key:a value:v1
offset:6 key:a value:v2
offset:7 key:a value:v3

*destination topic - partition 0*
offset:5 key:a value:v2
offset:6 key:a value:v1
offset:7 key:a value:v3

The topics are configured with a cleanup policy: compact.

I'm using ConsumeKafka and PublishKafka processors to replicate
topics.

[inline images not preserved in the archive]

Thanks
Edi


Re: ConsumeKafka to PublishKafka doesn't keep the order of the messages in the destination topic

2023-12-13 Thread edi mari
In my case, all the NiFi nodes functioned well during the replication, so I
don't think that was the issue here, but it's a good point to consider.

Thanks Phillip

On Wed, Dec 13, 2023 at 4:20 PM Phillip Lord  wrote:

> Perhaps try following this guidance in the docs???
>
> Consumer Partition Assignment
>
> By default, this processor will subscribe to one or more Kafka topics in
> such a way that the topics to consume from are randomly assigned to the
> nodes in the NiFi cluster. Consider a scenario where a single Kafka topic
> has 8 partitions and the consuming NiFi cluster has 3 nodes. In this
> scenario, Node 1 may be assigned partitions 0, 1, and 2. Node 2 may be
> assigned partitions 3, 4, and 5. Node 3 will then be assigned partitions 6
> and 7.
>
> In this scenario, if Node 3 somehow fails or stops pulling data from
> Kafka, partitions 6 and 7 may then be reassigned to the other two nodes.
> For most use cases, this is desirable. It provides fault tolerance and
> allows the remaining nodes to pick up the slack. However, there are cases
> where this is undesirable.
>
> One such case is when using NiFi to consume Change Data Capture (CDC) data
> from Kafka. Consider again the above scenario. Consider that Node 3 has
> pulled 1,000 messages from Kafka but has not yet delivered them to their
> final destination. NiFi is then stopped and restarted, and that takes 15
> minutes to complete. In the meantime, Partitions 6 and 7 have been
> reassigned to the other nodes. Those nodes then proceeded to pull data from
> Kafka and deliver it to the desired destination. After 15 minutes, Node 3
> rejoins the cluster and then continues to deliver its 1,000 messages that
> it has already pulled from Kafka to the destination system. Now, those
> records have been delivered out of order.
>
> The solution for this, then, is to assign partitions statically instead of
> dynamically. In this way, we can assign Partitions 6 and 7 to Node 3
> specifically. Then, if Node 3 is restarted, the other nodes will not pull
> data from Partitions 6 and 7. The data will remain queued in Kafka until
> Node 3 is restarted. By using this approach, we can ensure that the data
> that already was pulled can be processed (assuming First In First Out
> Prioritizers are used) before newer messages are handled.
>
> In order to provide a static mapping of node to Kafka partition(s), one or
> more user-defined properties must be added using the naming scheme
> partitions.<hostname>, with the value being a comma-separated list of
> Kafka partitions to use. For example: partitions.nifi-01=0, 3, 6, 9;
> partitions.nifi-02=1, 4, 7, 10; and partitions.nifi-03=2, 5, 8, 11. The
> hostname that is used
> can be the fully qualified hostname, the "simple" hostname, or the IP
> address. There must be an entry for each node in the cluster, or the
> Processor will become invalid. If it is desirable for a node to not have
> any partitions assigned to it, a Property may be added for the hostname
> with an empty string as the value.
>
> NiFi cannot readily validate that all Partitions have been assigned before
> the Processor is scheduled to run. However, it can validate that no
> partitions have been skipped. As such, if partitions 0, 1, and 3 are
> assigned but not partition 2, the Processor will not be valid. However, if
> partitions 0, 1, and 2 are assigned, the Processor will become valid, even
> if there are 4 partitions on the Topic. When the Processor is started, the
> Processor will immediately start to fail, logging errors, and avoid pulling
> any data until the Processor is updated to account for all partitions. Once
> running, if the number of partitions is changed, the Processor will
> continue to run but not pull data from the newly added partitions. Once
> stopped, it will begin to error until all partitions have been assigned.
> Additionally, if partitions that are assigned do not exist (e.g.,
> partitions 0, 1, 2, 3, 4, 5, 6, and 7 are assigned, but the Topic has only
> 4 partitions), then the Processor will begin to log errors on startup and
> will not pull data.
>
> In order to use a static mapping of Kafka partitions, the "Topic Name
> Format" must be set to "names" rather than "pattern." Additionally, all
> Topics that are to be consumed must have the same number of partitions. If
> multiple Topics are to be consumed and have a different number of
> partitions, multiple Processors must be used so that each Processor
> consumes only from Topics with the same number of partitions.
>
> On Dec 13, 2023 at 8:59 AM -0500, edi mari wrote:
>
> Hi Pierre,
> Yes, we tried the FIFO prioritizer on the queue, but it didn't help.
> Some records in the target topic are ordered differently from the source
> topic (which is critical with a compact cleanup policy).
>
> Edi
>
> On Wed, Dec 13, 2023 at 3:46 PM Pierre Villard <
> pierre.villard...@gmail.com> wrote:
>
>> Hi Edi,
>>
>> Did you try setting the FIFO prioritizer on the connection between the
>> processors?
>>
>> 

Re: ConsumeKafka to PublishKafka doesn't keep the order of the messages in the destination topic

2023-12-13 Thread Phillip Lord
 Perhaps try following this guidance in the docs???

Consumer Partition Assignment

By default, this processor will subscribe to one or more Kafka topics in
such a way that the topics to consume from are randomly assigned to the
nodes in the NiFi cluster. Consider a scenario where a single Kafka topic
has 8 partitions and the consuming NiFi cluster has 3 nodes. In this
scenario, Node 1 may be assigned partitions 0, 1, and 2. Node 2 may be
assigned partitions 3, 4, and 5. Node 3 will then be assigned partitions 6
and 7.

In this scenario, if Node 3 somehow fails or stops pulling data from Kafka,
partitions 6 and 7 may then be reassigned to the other two nodes. For most
use cases, this is desirable. It provides fault tolerance and allows the
remaining nodes to pick up the slack. However, there are cases where this
is undesirable.

One such case is when using NiFi to consume Change Data Capture (CDC) data
from Kafka. Consider again the above scenario. Consider that Node 3 has
pulled 1,000 messages from Kafka but has not yet delivered them to their
final destination. NiFi is then stopped and restarted, and that takes 15
minutes to complete. In the meantime, Partitions 6 and 7 have been
reassigned to the other nodes. Those nodes then proceeded to pull data from
Kafka and deliver it to the desired destination. After 15 minutes, Node 3
rejoins the cluster and then continues to deliver its 1,000 messages that
it has already pulled from Kafka to the destination system. Now, those
records have been delivered out of order.

The solution for this, then, is to assign partitions statically instead of
dynamically. In this way, we can assign Partitions 6 and 7 to Node 3
specifically. Then, if Node 3 is restarted, the other nodes will not pull
data from Partitions 6 and 7. The data will remain queued in Kafka until
Node 3 is restarted. By using this approach, we can ensure that the data
that already was pulled can be processed (assuming First In First Out
Prioritizers are used) before newer messages are handled.

In order to provide a static mapping of node to Kafka partition(s), one or
more user-defined properties must be added using the naming scheme
partitions.<hostname>, with the value being a comma-separated list of Kafka
partitions to use. For example: partitions.nifi-01=0, 3, 6, 9;
partitions.nifi-02=1, 4, 7, 10; and partitions.nifi-03=2, 5, 8, 11. The
hostname that is used can
be the fully qualified hostname, the "simple" hostname, or the IP address.
There must be an entry for each node in the cluster, or the Processor will
become invalid. If it is desirable for a node to not have any partitions
assigned to it, a Property may be added for the hostname with an empty
string as the value.

NiFi cannot readily validate that all Partitions have been assigned before
the Processor is scheduled to run. However, it can validate that no
partitions have been skipped. As such, if partitions 0, 1, and 3 are
assigned but not partition 2, the Processor will not be valid. However, if
partitions 0, 1, and 2 are assigned, the Processor will become valid, even
if there are 4 partitions on the Topic. When the Processor is started, the
Processor will immediately start to fail, logging errors, and avoid pulling
any data until the Processor is updated to account for all partitions. Once
running, if the number of partitions is changed, the Processor will
continue to run but not pull data from the newly added partitions. Once
stopped, it will begin to error until all partitions have been assigned.
Additionally, if partitions that are assigned do not exist (e.g.,
partitions 0, 1, 2, 3, 4, 5, 6, and 7 are assigned, but the Topic has only
4 partitions), then the Processor will begin to log errors on startup and
will not pull data.

In order to use a static mapping of Kafka partitions, the "Topic Name
Format" must be set to "names" rather than "pattern." Additionally, all
Topics that are to be consumed must have the same number of partitions. If
multiple Topics are to be consumed and have a different number of
partitions, multiple Processors must be used so that each Processor
consumes only from Topics with the same number of partitions.
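
The validation rules above (an entry per node, no skipped partitions,
assignments may undershoot but not overshoot the topic's partition count) are
compact enough to restate in code. A rough sketch of the same checks,
illustrative only and not NiFi's actual implementation:

def validate_static_assignment(mapping, cluster_nodes, topic_partitions):
    """mapping: {hostname: [partition, ...]}, e.g. {"nifi-01": [0, 3, 6, 9]}."""
    # Every node in the cluster must have an entry (possibly an empty list).
    missing = set(cluster_nodes) - set(mapping)
    if missing:
        raise ValueError(f"no partitions property for nodes: {sorted(missing)}")

    assigned = sorted(p for parts in mapping.values() for p in parts)
    if len(assigned) != len(set(assigned)):
        raise ValueError("a partition is assigned to more than one node")

    # No gaps: assigning 0, 1, and 3 without 2 is invalid.
    if assigned != list(range(len(assigned))):
        raise ValueError("partition list has gaps")

    # Assigning partitions the topic doesn't have fails at startup in NiFi;
    # here we just check it up front.
    if assigned and assigned[-1] >= topic_partitions:
        raise ValueError("assigned partitions exceed the topic's partition count")

validate_static_assignment(
    {"nifi-01": [0, 3], "nifi-02": [1, 4], "nifi-03": [2, 5]},
    cluster_nodes=["nifi-01", "nifi-02", "nifi-03"],
    topic_partitions=6,
)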

On Dec 13, 2023 at 8:59 AM -0500, edi mari wrote:

Hi Pierre,
Yes, we tried the FIFO prioritizer on the queue, but it didn't help.
Some records in the target topic are ordered differently from the source
topic (which is critical with a compact cleanup policy).

Edi

On Wed, Dec 13, 2023 at 3:46 PM Pierre Villard 
wrote:

> Hi Edi,
>
> Did you try setting the FIFO prioritizer on the connection between the
> processors?
>
> Thanks,
> Pierre
>
> On Wed, Dec 13, 2023 at 14:19, edi mari wrote:
>
>>
>> Hello,
>> I'm using NiFi v1.20.0 to replicate 250 million messages between Kafka
>> topics.
>> The problem is that NiFi replicates messages in a non-sequential order,
>> resulting in the destination topic storing messages differently than the
>> source topic.
>>
>> for example
>> *source topic - partition 0*
>> offset:5