Re: ConsumeKafka to PublishKafka doesn't keep the order of the messages in the destination topic
Edi,

Looking at your config again, you’ll also want to ensure that on the Publisher, you set the partition to `${kafka.partition}` so that each message goes to the same partition on the destination system. You’ll also want to ensure that you set “Failure Strategy” to “Rollback”; otherwise any failure would route the FlowFile to the ‘failure’ relationship and change the ordering. You’ll also need to limit the publisher to 1 concurrent task, to ensure that you’re not sending multiple FlowFiles out of order.

Thanks
-Mark

On Dec 15, 2023, at 2:26 AM, edi mari wrote:

Hi Mark,
I tried the combination of FIFO and setting the back pressure to 10k, but it didn't preserve the order.

Thanks
Edi

On Wed, Dec 13, 2023 at 3:47 PM Mark Payne wrote:

Hey Edi,
By default, NiFi doesn’t preserve ordering, but you can have it do so by updating the connection’s configuration and adding the First In First Out Prioritizer. Also of note, you will want to keep the backpressure threshold set to 10,000 objects rather than increasing it as shown in the image.

Thanks
Mark
Sent from my iPhone

On Dec 13, 2023, at 8:19 AM, edi mari wrote:

Hello,
I'm using NiFi v1.20.0 to replicate 250 million messages between Kafka topics. The problem is that NiFi replicates messages in a non-sequential order, resulting in the destination topic storing messages in a different order than the source topic.

For example:

source topic - partition 0
offset:5 key:a value:v1
offset:6 key:a value:v2
offset:7 key:a value:v3

destination topic - partition 0
offset:5 key:a value:v2
offset:6 key:a value:v1
offset:7 key:a value:v3

The topics are configured with a cleanup policy: compact. I'm using ConsumeKafka and PublishKafka processors to replicate topics.

Thanks
Edi
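One way to confirm whether a given combination of settings actually preserved order is to compare the per-partition sequence of keys and values in the two topics. The following is only a minimal sketch of that comparison, not something from the thread: the function name `first_order_mismatch` and the `(partition, offset, key, value)` tuple layout are assumptions made for illustration, and fetching the records from Kafka is left out. Offsets are used only for sorting, since they need not match between clusters.

```python
from collections import defaultdict


def first_order_mismatch(source, destination):
    """Return (partition, index) of the first per-partition ordering
    difference between two record lists, or None if the order matches.

    Each record is a hypothetical (partition, offset, key, value) tuple.
    """

    def by_partition(records):
        grouped = defaultdict(list)
        # Sort by partition, then offset, so each partition's list
        # reflects the order in which the topic stores the messages.
        for partition, _offset, key, value in sorted(
            records, key=lambda r: (r[0], r[1])
        ):
            grouped[partition].append((key, value))
        return grouped

    src, dst = by_partition(source), by_partition(destination)
    for partition in sorted(set(src) | set(dst)):
        s, d = src.get(partition, []), dst.get(partition, [])
        for index, (a, b) in enumerate(zip(s, d)):
            if a != b:
                return (partition, index)
        if len(s) != len(d):
            # One side has extra or missing messages in this partition.
            return (partition, min(len(s), len(d)))
    return None


# The example from the original question: v1 and v2 are swapped.
source = [(0, 5, "a", "v1"), (0, 6, "a", "v2"), (0, 7, "a", "v3")]
destination = [(0, 5, "a", "v2"), (0, 6, "a", "v1"), (0, 7, "a", "v3")]
print(first_order_mismatch(source, destination))
```

Because the topics are compacted, a swap like the one above matters: the last value written for a key is the one that survives compaction.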
Re: ConsumeKafka to PublishKafka doesn't keep the order of the messages in the destination topic
Hello Edi,

NiFi is amazing for many use cases, but this feels like swimming upstream, so to speak. Have you considered Kafka Connect? Define a Sink Connector to dump the source Kafka topic to disk/S3, and then define a Source Connector to read it back into a destination topic. I have not played with the S3 connector yet, but it does seem to support Kafka partitions.

On 2023/12/15 11:26, edi mari wrote:

Hi Mark,
I tried the combination of FIFO and setting the back pressure to 10k, but it didn't preserve the order.

Thanks
Edi
Re: Nifi - Content-repo on AWS-EBS volumes
Mark:

I was just discussing multiple content repos on EBS volumes with a colleague. I found your post from a long time ago:

https://lists.apache.org/thread/nq3mpry0wppzrodmldrcfnxwzp3n1cjv

“Re #2: I don't know that i've used any SAN to back my repositories other than the EBS provided by Amazon EC2. In that environment, I found that having one or having multiple repos was essentially equivalent.”

Does that statement still hold true today? Essentially there is no real performance benefit to having multiple content repos on multiple EBS volumes?

Thanks,
Greg

> On Dec 11, 2023, at 8:50 PM, Mark Payne wrote:
>
> Hey Phil,
>
> NiFi will not spread the content of a single file over multiple partitions. It will write the content of FlowFile 1 to content repo 1, then write the next FlowFile to repo 2, etc. so it does round-robin but does not spread a single FlowFile across multiple repos.
>
> Thanks
> -Mark
>
> Sent from my iPhone
>
>> On Dec 11, 2023, at 8:45 PM, Phillip Lord wrote:
>>
>> Hello Nifi comrades,
>>
>> Here's my scenario...
>> Let's say I have a Nifi cluster running on EC2 instances with attached EBS volumes serving as their repos. They've split up their content-repos into three content-repos per node(cont1, cont2, cont3). Each being a dedicated EBS volume. My understanding is that the content-claims for a single file can potentially span across more than one of these repos.(correct me if I've lost my mind over the years)
>> For instance if you have a 1 MB file, and lets say your max.content.claim.size is 100KB, that's 10 - 100KB claims(ish) potentially split up across the 3 EBS volumes. So if Nifi is trying to move that file to S3 or something for instance... it needs to be read from each of the volumes.
>> Whereas if it was a single EBS volume for the cont-repo... it would read from the single volume, which I would think would be more performant? Or does spreading out any IO contention across volumes provide more of a benefit?
>> I know there's different levels of EBS volumes... but not factoring that in for right now.
>>
>> Appreciate any insight... trying to determine the best configuration.
>>
>> Thanks,
>> Phil
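The round-robin placement Mark describes can be pictured with a toy model. This is only an illustrative sketch, not NiFi's actual implementation: `assign_repos` and the FlowFile identifiers are hypothetical, and the repo names are the ones from Phil's scenario. The point it shows is that each FlowFile's content lands whole in exactly one repo, and consecutive FlowFiles rotate through the repos.

```python
from itertools import cycle


def assign_repos(flowfile_ids, repo_names):
    """Toy model: map each FlowFile to exactly one content repo,
    rotating through the repos in order (round-robin)."""
    repos = cycle(repo_names)
    return {flowfile_id: next(repos) for flowfile_id in flowfile_ids}


placement = assign_repos(
    ["ff1", "ff2", "ff3", "ff4"],       # hypothetical FlowFile ids
    ["cont1", "cont2", "cont3"],        # repo names from the scenario above
)
print(placement)
```

Under this model, reading one FlowFile back (for example, to send it to S3) touches a single volume, while a stream of many FlowFiles spreads its I/O across all of them.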
Re: Nifi - Content-repo on AWS-EBS volumes
Greg,

Whether or not multiple content repos will have any impact depends very much on where your system’s bottleneck is. If your bottleneck is disk I/O, it will absolutely help. If your bottleneck is CPU, it won’t. If, for example, you’re running on bare metal and have 48 cores on your machine and you’re running with spinning disks, you’ll definitely want to use multiple spinning disks. But if you’re running in AWS on a VM that has 4 cores and you’re using gp3 EBS volumes, it’s unlikely that multiple content repos will help.

Thanks
-Mark

> On Dec 15, 2023, at 3:25 PM, Gregory M. Foreman wrote:
>
> Mark:
>
> I was just discussing multiple content repos on EBS volumes with a colleague. I found your post from a long time ago:
>
> https://lists.apache.org/thread/nq3mpry0wppzrodmldrcfnxwzp3n1cjv
>
> “Re #2: I don't know that i've used any SAN to back my repositories other than the EBS provided by Amazon EC2. In that environment, I found that having one or having multiple repos was essentially equivalent.”
>
> Does that statement still hold true today? Essentially there is no real performance benefit to having multiple content repos on multiple EBS volumes?
>
> Thanks,
> Greg
Re: Nifi - Content-repo on AWS-EBS volumes
Mark:

Got it. Thank you for the help.

Greg

> On Dec 15, 2023, at 4:14 PM, Mark Payne wrote:
>
> Greg,
>
> Whether or not multiple content repos will have any impact depends very much on where your system’s bottleneck is. If your bottleneck is disk I/O, it will absolutely help. If your bottleneck is CPU, it won’t. If, for example, you’re running on bare metal and have 48 cores on your machine and you’re running with spinning disks, you’ll definitely want to use multiple spinning disks. But if you’re running in AWS on a VM that has 4 cores and you’re using gp3 EBS volumes, it’s unlikely that multiple content repos will help.
>
> Thanks
> -Mark
Re: Nifi - Content-repo on AWS-EBS volumes
I just switched a cluster using 3 EBS volumes for cont-repo from gp2 to gp3… resolved definite I/O throughput issues. The change to gp3 was significant enough that I might actually reduce from 3 to 2 volumes, perhaps even a single volume would be sufficient. Of course every use case is unique.

On Dec 15, 2023 at 5:37 PM -0500, Gregory M. Foreman wrote:
> Mark:
>
> Got it. Thank you for the help.
>
> Greg