Re: ConsumeKafka to PublishKafka doesn't keep the order of the messages in the destination topic

2023-12-15 Thread Mark Payne
Edi,

Looking at your config again, you’ll also want to ensure that on the Publisher 
you set the partition to `${kafka.partition}` so that each message goes to the 
same partition on the destination system. You’ll also want to set 
“Failure Strategy” to “Rollback”; otherwise, any failure would route the 
FlowFile to the ‘failure’ relationship and change the ordering. Finally, limit 
the publisher to 1 concurrent task, to ensure that you’re not sending multiple 
FlowFiles out of order.

Thanks
-Mark
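
For reference, the publisher-side settings listed above can be captured as a small checklist. The sketch below is only an illustration in Python: the dictionary keys and the `check_publisher_settings` helper are hypothetical names, not NiFi property identifiers or a NiFi API. Only the values (`${kafka.partition}`, Rollback, 1 concurrent task) come from the message above.

```python
# Hypothetical checklist of the PublishKafka settings recommended above.
# The keys are illustrative labels, not real NiFi property names.
RECOMMENDED = {
    "partition": "${kafka.partition}",  # publish to the same partition as the source
    "failure_strategy": "Rollback",     # do not route failures to 'failure'
    "concurrent_tasks": 1,              # a single task sends FlowFiles one at a time
}


def check_publisher_settings(settings):
    """Return a list of human-readable problems; an empty list means all match."""
    problems = []
    for name, expected in RECOMMENDED.items():
        actual = settings.get(name)
        if actual != expected:
            problems.append(f"{name}: expected {expected!r}, found {actual!r}")
    return problems
```

Calling `check_publisher_settings` with a dictionary of the values copied by hand from the processor's configuration dialog would list any setting that differs from the recommendation.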


On Dec 15, 2023, at 2:26 AM, edi mari wrote:

Hi Mark,
I tried the combination of FIFO and setting the back pressure to 10k, but it 
didn't preserve the order.

Thanks
Edi

On Wed, Dec 13, 2023 at 3:47 PM Mark Payne wrote:
Hey Edi,

By default, NiFi doesn’t preserve ordering, but you can have it do so by 
updating the connection’s configuration and adding the First In First Out 
Prioritizer.

Also of note: you will want to keep the backpressure threshold set to 10,000 
objects rather than increasing it as shown in the image.

Thanks
Mark


On Dec 13, 2023, at 8:19 AM, edi mari wrote:



Hello,
I'm using NiFi v1.20.0 to replicate 250 million messages between Kafka topics.
The problem is that NiFi replicates messages out of order, so the destination 
topic stores messages in a different order than the source topic.

For example:
source topic - partition 0
offset:5 key:a value:v1
offset:6 key:a value:v2
offset:7 key:a value:v3

destination topic - partition 0
offset:5 key:a value:v2
offset:6 key:a value:v1
offset:7 key:a value:v3

The topics are configured with a cleanup policy: compact.
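
To make the effect on a compacted topic concrete, here is a rough Python sketch. It models compaction simply as "the last value written for a key survives", which is a simplification. The helper names `first_out_of_order` and `compact` are hypothetical, and the `tail_swapped` sequence is an invented variant that does not come from the actual topics.

```python
def first_out_of_order(source, destination):
    """Return the index of the first position where the records differ, or None."""
    for i, (src, dst) in enumerate(zip(source, destination)):
        if src != dst:
            return i
    return None


def compact(records):
    """Keep only the last value seen for each key (simplified log compaction)."""
    latest = {}
    for key, value in records:
        latest[key] = value
    return latest


# Records from the example above, as (key, value) pairs in offset order.
source = [("a", "v1"), ("a", "v2"), ("a", "v3")]
destination = [("a", "v2"), ("a", "v1"), ("a", "v3")]

# Hypothetical variant (not from the thread): the reordering hits the last record.
tail_swapped = [("a", "v1"), ("a", "v3"), ("a", "v2")]
```

In the example above, the first mismatch is at index 0 (offset 5) and the compacted value for key `a` is still `v3` on both sides. In the hypothetical `tail_swapped` case, compaction would leave `v2` as the surviving value, which is why the ordering matters here.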

I'm using ConsumeKafka and PublishKafka processors to replicate topics.

Thanks
Edi



Re: ConsumeKafka to PublishKafka doesn't keep the order of the messages in the destination topic

2023-12-15 Thread Willem Pretorius

Hello Edi


NiFi is amazing for many use cases, but this feels like swimming upstream, 
so to speak.

Have you considered Kafka Connect?

Define a Sink Connector to dump the source Kafka topic to disk/S3, and then 
define a Source Connector to read it back into a destination topic.
I have not played with the S3 connector yet, but it does seem to support 
Kafka partitions.






Re: Nifi - Content-repo on AWS-EBS volumes

2023-12-15 Thread Gregory M. Foreman
Mark:

I was just discussing multiple content repos on EBS volumes with a colleague.  
I found your post from a long time ago:

https://lists.apache.org/thread/nq3mpry0wppzrodmldrcfnxwzp3n1cjv

“Re #2: I don't know that i've used any SAN to back my repositories other than 
the EBS provided by Amazon EC2. In that environment, I found that having one or 
having multiple repos was essentially equivalent.”

Does that statement still hold true today?  Essentially, is there no real 
performance benefit to having multiple content repos on multiple EBS volumes?

Thanks,
Greg



> On Dec 11, 2023, at 8:50 PM, Mark Payne wrote:
> 
> Hey Phil,
> 
> NiFi will not spread the content of a single file over multiple partitions. 
> It will write the content of FlowFile 1 to content repo 1, then write the 
> next FlowFile to repo 2, etc. so it does round-robin but does not spread a 
> single FlowFile across multiple repos.
> 
> Thanks
> -Mark
> 
>> On Dec 11, 2023, at 8:45 PM, Phillip Lord wrote:
>> 
>> 
>> Hello NiFi comrades,
>> 
>> Here's my scenario...
>> Let's say I have a NiFi cluster running on EC2 instances with attached EBS 
>> volumes serving as their repos.  They've split up their content-repos into 
>> three content-repos per node (cont1, cont2, cont3), each being a dedicated 
>> EBS volume.  My understanding is that the content-claims for a single file 
>> can potentially span more than one of these repos (correct me if I've 
>> lost my mind over the years).
>> For instance, if you have a 1 MB file, and let's say your 
>> max.content.claim.size is 100KB, that's roughly ten 100KB claims, potentially 
>> split up across the 3 EBS volumes.  So if NiFi is trying to move that file 
>> to S3, for instance, it needs to be read from each of the volumes.  
>> Whereas if it was a single EBS volume for the cont-repo, it would read 
>> from the single volume, which I would think would be more performant?  Or 
>> does spreading out any I/O contention across volumes provide more of a 
>> benefit?
>> I know there are different levels of EBS volumes, but I'm not factoring that 
>> in for right now.
>> 
>> Appreciate any insight... trying to determine the best configuration.  
>> 
>> Thanks,
>> Phil
>> 
>> 



Re: Nifi - Content-repo on AWS-EBS volumes

2023-12-15 Thread Mark Payne
Greg,

Whether or not multiple content repos will have any impact depends very much on 
where your system’s bottleneck is. If your bottleneck is disk I/O, it will 
absolutely help. If your bottleneck is CPU, it won’t. If, for example, you’re 
running on bare metal and have 48 cores on your machine and you’re running with 
spinning disks, you’ll definitely want to use multiple spinning disks. But if 
you’re running in AWS on a VM that has 4 cores and you’re using gp3 EBS 
volumes, it’s unlikely that multiple content repos will help.

Thanks
-Mark
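
As background to the answer above, the round-robin behaviour described earlier in this thread (each FlowFile's content goes to a single content repo, and the next FlowFile goes to the next repo) can be pictured with a rough Python simulation. This is a toy model of that description only, not NiFi code: `assign_round_robin` and the FlowFile names are hypothetical, and the repo names `cont1` to `cont3` are taken from Phil's question.

```python
from itertools import cycle


def assign_round_robin(flowfiles, repos):
    """Map each FlowFile to exactly one repo, cycling through the repos in order."""
    repo_cycle = cycle(repos)
    return {flowfile: next(repo_cycle) for flowfile in flowfiles}


repos = ["cont1", "cont2", "cont3"]
flowfiles = ["flowfile-1", "flowfile-2", "flowfile-3", "flowfile-4"]

# Each FlowFile lands in one repo; the fourth wraps around to the first repo.
placement = assign_round_robin(flowfiles, repos)
```

In this model no FlowFile is ever split across repos, so reading one FlowFile back touches a single volume, while the write load as a whole is spread over all three.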






Re: Nifi - Content-repo on AWS-EBS volumes

2023-12-15 Thread Gregory M. Foreman
Mark:

Got it.  Thank you for the help.

Greg




Re: Nifi - Content-repo on AWS-EBS volumes

2023-12-15 Thread Phillip Lord
I just switched a cluster using 3 EBS volumes for cont-repo from gp2 to gp3, 
which resolved definite I/O throughput issues.  The change to gp3 was significant 
enough that I might actually reduce from 3 to 2 volumes; perhaps even a single 
volume would be sufficient.

Of course every use case is unique.
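
One possible reason for a difference like this is the way baseline IOPS are allotted. The Python sketch below is only a back-of-the-envelope illustration: the figures (gp2 at 3 IOPS per GiB with a floor of 100 and a cap of 16,000, gp3 at a flat 3,000 baseline) are taken from the AWS EBS documentation rather than from this thread and should be checked against the current docs. The 200 GiB volume size is a made-up example, since no sizes were given here.

```python
GP3_BASELINE_IOPS = 3000  # gp3 baseline, independent of volume size


def gp2_baseline_iops(size_gib):
    """gp2 baseline: 3 IOPS per GiB, floored at 100 and capped at 16,000."""
    return max(100, min(16000, 3 * size_gib))


# Hypothetical 200 GiB volume: compare the gp2 baseline against the gp3 baseline.
example_size_gib = 200
gp2_iops = gp2_baseline_iops(example_size_gib)
```

For that hypothetical volume the gp2 baseline works out to 600 IOPS against 3,000 for gp3, although smaller gp2 volumes can burst above their baseline for limited periods.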