Re: How to increase the processing speed of the ExtractText and ReplaceText Processor?
Lee,

Thanks for your idea. I have one doubt regarding ExecuteStreamCommand, which needs a Command Path and an Argument Delimiter. I have given this regex, (.+)[|](.+)[|](.+)[|](.+), in the ExtractText processor. How can I pass this regex to the ExecuteStreamCommand processor? Or is there any other processor with the same functionality as ExtractText?

Thanks

On Tue, Oct 18, 2016 at 11:42 AM, Lee Laim wrote:
> Prabhu,
>
> You might also try to replace ExtractText with a series of ExecuteStreamCommand processors that perform system calls (sed/awk/grep or the Windows equivalents) on the flowfile's contents. You can even write the result directly to a flowfile attribute.
>
> I suspect there are wildcards in your ExtractText regex that are taking a while to buffer and compare.
>
> Lee
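To Prabhu's question: the regex itself does not carry over to ExecuteStreamCommand; instead, the external command (awk here) re-expresses the same extraction. A minimal sketch, assuming a pipe-delimited four-field line — the sample data and the property values below are illustrative, not taken from the thread:

```shell
# ExecuteStreamCommand equivalent of the four-group ExtractText pattern.
# Hypothetical processor settings:
#   Command Path:       /usr/bin/awk
#   Command Arguments:  -F|;{print $2}
#   Argument Delimiter: ;
# awk splits each incoming line on '|' directly, with no regex backtracking,
# and the command's stdout becomes the output flowfile content.
echo 'field1|field2|field3|field4' | awk -F'|' '{print $2}'
# prints: field2
```

With "Output Destination Attribute" set, the result could instead land in a flowfile attribute, as Lee suggests.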
Re: How to increase the processing speed of the ExtractText and ReplaceText Processor?
Prabhu,

You might also try to replace ExtractText with a series of ExecuteStreamCommand processors that perform system calls (sed/awk/grep or the Windows equivalents) on the flowfile's contents. You can even write the result directly to a flowfile attribute.

I suspect there are wildcards in your ExtractText regex that are taking a while to buffer and compare.

Lee

On Oct 18, 2016, at 2:31 PM, prabhu Mahendran wrote:
> Mark,
>
> Thanks for your response. Please find the answers to your questions below.
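Lee's point about wildcards can be checked outside NiFi: (.+) has to backtrack to place each '|' delimiter, while a negated character class like [^|]+ can never cross a '|', so each field is matched in a single left-to-right pass. A quick sed sketch (the sample line is invented for illustration; both commands extract the third field):

```shell
line='field1|field2|field3|field4'

# Greedy wildcards: the engine must repeatedly retry where the '|'s fall.
echo "$line" | sed -E 's/^(.+)\|(.+)\|(.+)\|(.+)$/\3/'
# prints: field3

# Negated character class: each group stops at the next '|', no backtracking.
echo "$line" | sed -E 's/^([^|]+)\|([^|]+)\|([^|]+)\|([^|]+)$/\3/'
# prints: field3
```

The same substitution applies to the pattern inside ExtractText: replacing each (.+) with ([^|]+) yields identical capture groups on well-formed rows.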
Re: How to increase the processing speed of the ExtractText and ReplaceText Processor?
Mark,

Thanks for your response. Please find the answers to your questions below.

==> The first processor that you see that exhibits poor performance is ExtractText, correct?
Yes, ExtractText exhibits poor performance.

==> How big is your Java heap?
I have set 1 GB for the Java heap.

==> Do you have back pressure configured on the connection between ExtractText and ReplaceText?
There is no back pressure between ExtractText and ReplaceText.

==> When you say that you specify concurrent tasks, what are you configuring the concurrent tasks to be?
I have set concurrent tasks to 2 for the ExtractText processor, due to its slower processing rate. This is specified in the Concurrent Tasks text box.

==> Have you changed the maximum number of concurrent tasks available to your dataflow?
No, I haven't changed it.

==> How many CPUs are available on this machine?
Only a single CPU is available in this machine, a Core i5 CPU @ 2.20 GHz.

==> Are these the only processors in your flow, or do you have other dataflows going on in the same instance of NiFi?
Yes, this is the only flow that is running; no other dataflows are running.

Thanks

On Mon, Oct 17, 2016 at 6:08 PM, Mark Payne wrote:
Re: Calculating the theoretical throughput of a Nifi server
Ali,

Without knowing the details of the data streams, the nature of each event and the operations that will be performed against them, or how the processors themselves will work, I cannot give you a solid answer. Do I think it is possible? Absolutely. Do I think there will be hurdles to overcome to reach and sustain such a rate? Absolutely.

Thanks
Joe

On Mon, Oct 17, 2016 at 9:28 PM, Lee Laim wrote:
> Ali,
> I used the PCIe drive for all repos and the PutFile destination.
Re: Calculating the theoretical throughput of a Nifi server
Ali,

I used the PCIe drive for all repos and the PutFile destination.

> On Oct 18, 2016, at 8:38 AM, Ali Nazemian wrote:
>
> Hi Lee,
>
> I was wondering, did you use PCIe for the flowfile repo, the provenance repo, or the content repo? Or all of them?
Re: Calculating the theoretical throughput of a Nifi server
Hi Lee,

I was wondering, did you use PCIe for the flowfile repo, the provenance repo, or the content repo? Or all of them?

Joe,

The ETL is not a very complicated ETL, so do you think it isn't possible to reach 800 MB/s in production even if I use PCIe for the flowfile repo? Is it worth spending money on PCIe for the flowfile repo?

Best regards

On Tue, Oct 18, 2016 at 2:36 AM, Joe Witt wrote:
> Thanks Lee. Your response was awesome and really made me want to get hands on a set of boxes like this so we could do some testing.
>
> Thanks
> Joe

--
A.Nazemian
Re: Calculating the theoretical throughput of a Nifi server
Thanks Lee. Your response was awesome and really made me want to get hands on a set of boxes like this so we could do some testing.

Thanks
Joe

On Mon, Oct 17, 2016 at 11:32 AM, Lee Laim wrote:
> Joe,
> Good points regarding throughput on real flows and on a sustained basis. My test was only pushing one aspect of the system.
Re: Calculating the theoretical throughput of a Nifi server
Joe,

Good points regarding throughput on real flows and on a sustained basis. My test was only pushing one aspect of the system.

That said, I would be interested in discussing/developing a more comprehensive test flow to capture more real-world use cases. I'll check to see if that conversation has started.

Thanks,
Lee

Lee Laim
610-864-1657

On Oct 17, 2016, at 9:55 PM, Ali Nazemian wrote:
> Dear Joe,
> Thank you very much.
>
> Best regards
Re: Calculating the theoretical throughput of a Nifi server
Hi Ali,

I observed ~1 GB/sec on a test PutFile processor using an enterprise PCIe NVMe SSD on a single instance on desktop-class hardware. I plan to run more in-depth tests on server-class hardware, but those will likely be on a 1 Gb network. I should note that I'm not sure exactly how much provenance was being written. The nifi-0.7.0 instance was a fresh install with no major configuration changes. I was using the GenerateFlowFile processor to generate 100 MB flowfiles and writing them as fast as possible with a PutFile processor.

The SSD posted the following on the AS-SSD benchmark (completely unoptimized): 1.8 GB/sec for sequential write; 2.3 GB/sec for 4K random write (64 threads); 114 MB/sec for 4K random write (1 thread).

On the PCIe bus, you should easily surpass 800 MB/sec, especially if your flowfiles are large and you have few provenance events. The theoretical bandwidth is 985 MB/sec per lane, up to 16 lanes; I was running x4. The NVMe standard should also help with smaller flowfiles.

Hope this helps,
Lee

> On Oct 17, 2016, at 12:06 PM, Ali Nazemian wrote:
>
> Dear NiFi users/developers,
> Hi,
>
> I was wondering how I can calculate the theoretical throughput of a NiFi server. Let's suppose we can eliminate different bottlenecks, such as the flowfile repo and provenance repo bottlenecks, by using a very high-end SSD. Moreover, assume that a very high-end network infrastructure is available. In this case, is it possible to reach 800 MB of throughput per second per server? Suppose each server comes with 24 disk slots; 16 disk slots are used for creating 8 x RAID1 (SAS 10k) mount points and are dedicated to the content repo. Let's say each content repo can achieve 100 MB of throughput. May I say the total throughput per server can be 8x100 = 800 MB/s? Is it possible to reach this amount of throughput practically?
> Thank you very much.
>
> Best regards,
> Ali
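As a sanity check on the arithmetic in this thread (all figures quoted from the messages above, not measured here):

```shell
# Ali's estimate: 8 RAID1 content-repo mounts at ~100 MB/s each.
echo "content repo ceiling: $((8 * 100)) MB/s"
# prints: content repo ceiling: 800 MB/s

# Lee's PCIe figure: ~985 MB/s per lane, on a x4 link.
echo "x4 PCIe ceiling: $((985 * 4)) MB/s"
# prints: x4 PCIe ceiling: 3940 MB/s
```

So the x4 PCIe drive has roughly 5x the theoretical headroom of the 8-disk RAID1 layout, which is consistent with Lee's ~1 GB/sec observation; whether the full 800 MB/s survives real flows is the separate question Joe raises.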
Re: Calculating the theoretical throughput of a Nifi server
Dear Joe,
Thank you very much.

Best regards

On Mon, Oct 17, 2016 at 10:08 PM, Joe Witt wrote:
> Ali,
>
> I suspect bottlenecks in the software itself and the flow design will become a factor before you reach 800 MB/s. You'd likely hit CPU-efficiency issues before this, caused by the flow processors themselves and by garbage collection. Probably the most important factor, though, will be the transaction rate and whether the flow is configured to trade off some latency for higher throughput. There are many variables at play, but under idealized conditions and on a system like you describe, it is theoretically feasible to hit that value.
>
> Practically speaking, I think you'd be looking at a couple hundred MB/s per server like this on real flows on a sustained basis.
>
> Thanks
> Joe
>
> On Sun, Oct 16, 2016 at 11:06 PM, Ali Nazemian wrote:

--
A.Nazemian
Re: How to increase the processing speed of the ExtractText and ReplaceText Processor?
Prabhu, Certainly, the performance that you are seeing, taking 4-5 hours to move 3M rows into SQLServer, is far from ideal, but the good news is that it is also far from typical. You should be able to see far better results. To help us understand what is limiting the performance, and to make sure that we understand what you are seeing, I have a series of questions that would help us to understand what is going on. The first processor that you see that exhibits poor performance is ExtractText, correct? Can you share the configuration that you have for that processor? How big is your Java heap? This is configured in conf/bootstrap.conf; by default it is configured as: java.arg.2=-Xms512m java.arg.3=-Xmx512m Do you have backpressure configured on the connection between ExtractText and ReplaceText? Also, when you say that you specify concurrent tasks, what are you configuring the concurrent tasks to be? Have you changed the maximum number of concurrent tasks available to your dataflow? By default, NiFi will use only 10 threads max. How many CPUs are available on this machine? And finally, are these the only processors in your flow, or do you have other dataflows going on in the same instance of NiFi? Thanks -Mark > On Oct 17, 2016, at 3:35 AM, prabhu Mahendran wrote: > > Hi All, > > I have tried to perform the below operation. > > dat file (input) --> JSON --> SQL --> SQLServer > > > GetFile-->SplitText-->SplitText-->ExtractText-->ReplaceText-->ConvertJsonToSQL-->PutSQL. > > My input file (.dat) has 300,000 (3 lakh) rows. > > Objective: Move the data from the '.dat' file into SQLServer. > > I am able to store the data in SQL Server using the combination of > processors above, but it takes almost 4-5 hrs to move the complete data into SQLServer.
> > The combination of SplitTexts reads the data quickly, but ExtractText takes > a long time to match the data against the user-defined expression. The input > is 107 MB, but ExtractText sends output only in KB-sized chunks, and the > ReplaceText processor also processes data only in KB sizes. > > This slow processing leads to more time taken to move the data into > SQLServer. > > ExtractText, ReplaceText, and ConvertJsonToSQL send outgoing flowfiles > in kilobytes only. > > If I specify concurrent tasks for > ExtractText, ReplaceText, and ConvertJsonToSQL, they occupy 100% CPU and disk > usage. > > It is just 30 MB of data, but the processors take 6 hrs to move it into > SQLServer. > > The problems faced are: > 1. Almost 6 hrs taken to move the 3 lakh (300,000) rows into SQL Server. > 2. ExtractText and ReplaceText take a long time to process data (they send > output flowfiles in KB sizes only). > > Can anyone help me solve the below requirement? > > Need to reduce the time taken by the processors to move the lakhs > of rows into SQL Server. > > If I have done anything wrong, please help me do it right.
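For reference, the heap settings Mark points at live in conf/bootstrap.conf. A hedged example of raising them from the 512m defaults (the 2g values below are purely illustrative; size the heap to the memory actually available on the machine):

```
# conf/bootstrap.conf -- example values only, not a recommendation
java.arg.2=-Xms2g
java.arg.3=-Xmx2g
```

A larger heap mainly helps when SplitText fans a file out into many small flowfiles whose attributes must all be held in memory.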
Re: Calculating the theoretical throughput of a Nifi server
Ali I suspect bottlenecks in the software itself and the flow design will become a factor before you hit 800 MB/s. You'd likely hit CPU efficiency issues before this caused by the flow processors themselves and due to garbage collection. Probably the most important factor though will be the transaction rate and whether the flow is configured to tradeoff some latency for higher throughput. So many variables at play but under idealized conditions and a system like you describe it is theoretically feasible to hit that value. Practically speaking I think you'd be looking at a couple hundred MB/s per server like this on real flows on a sustained basis. Thanks Joe On Sun, Oct 16, 2016 at 11:06 PM, Ali Nazemian wrote: > Dear Nifi users/developers, > Hi, > > I was wondering how can I calculate the theoretical throughput of a Nifi > server? let's suppose we can eliminate different bottlenecks such as the > file flow rep and provenance repo bottleneck by using a very high-end SSD. > Moreover, assume that a very high-end network infrastructure is available. > In this case, is it possible to reach 800MB throughput per second per each > server? Suppose each server comes with 24 disk slots. 16 disk slots are used > for creating 8 x RAID1(SAS 10k) mount points and are dedicated to the > content repo. Let's say each content repo can achieve 100 MB throughput. May > I say the total throughput per each server can be 8x100=800MBps? Is it > possible to reach this amount of throughput practically? > Thank you very much. > > Best regards, > Ali
How to increase the processing speed of the ExtractText and ReplaceText Processor?
Hi All, I have tried to perform the below operation.

dat file (input) --> JSON --> SQL --> SQLServer

GetFile --> SplitText --> SplitText --> ExtractText --> ReplaceText --> ConvertJsonToSQL --> PutSQL.

My input file (.dat) has 300,000 (3 lakh) rows.

*Objective:* Move the data from the '.dat' file into SQLServer.

I am able to store the data in SQL Server using the combination of processors above, but it takes almost 4-5 hrs to move the complete data into SQLServer. The combination of SplitTexts reads the data quickly, but ExtractText takes a long time to match the data against the user-defined expression. The input is 107 MB, but ExtractText sends output only in KB-sized chunks, and the ReplaceText processor also processes data only in KB sizes. This slow processing leads to more time taken to move the data into SQLServer. ExtractText, ReplaceText, and ConvertJsonToSQL send outgoing flowfiles in kilobytes only. If I specify concurrent tasks for ExtractText, ReplaceText, and ConvertJsonToSQL, they occupy 100% CPU and disk usage. It is just 30 MB of data, but the processors take 6 hrs to move it into SQLServer.

The problems faced are:
1. Almost 6 hrs taken to move the 3 lakh (300,000) rows into SQL Server.
2. ExtractText and ReplaceText take a long time to process data (they send output flowfiles in KB sizes only).

Can anyone help me solve the below *requirement*? Need to reduce the time taken by the processors to move the lakhs of rows into SQL Server. If I have done anything wrong, please help me do it right.
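One common cause of slow ExtractText throughput, as Lee suggests elsewhere in this thread, is a backtracking-heavy regex built from greedy wildcards. A minimal Python sketch contrasting a greedy pipe-splitting pattern (similar in shape to the `(.+)[|](.+)[|](.+)[|](.+)` expression discussed in the thread) with a negated-character-class rewrite that avoids the backtracking; the sample line is made up for illustration:

```python
import re

line = "col1|col2|col3|col4"

# Greedy pattern, similar in shape to the ExtractText regex from the thread.
# Each (.+) can also match '|', so the engine must backtrack to find a valid
# split -- expensive on long lines.
greedy = re.compile(r"(.+)\|(.+)\|(.+)\|(.+)")

# Negated character class: each group stops at the first '|', so the engine
# never backtracks. Anchors make the intent explicit.
anchored = re.compile(r"^([^|]+)\|([^|]+)\|([^|]+)\|([^|]+)$")

print(greedy.match(line).groups())    # ('col1', 'col2', 'col3', 'col4')
print(anchored.match(line).groups())  # same groups, far less work per line
```

Both patterns extract the same four fields here, but the `[^|]+` form scales linearly with line length, which is the behavior you want in ExtractText when processing hundreds of thousands of rows.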