Re: How to increase the processing speed of the ExtractText and ReplaceText Processor?

2016-10-17 Thread prabhu Mahendran
Lee,

Thanks for your idea.


I have one doubt regarding ExecuteStreamCommand, which requires a Command Path
and an Argument Delimiter.

I have given the regex (.+)[|](.+)[|](.+)[|](.+) in the ExtractText processor.

How can I pass this regex to the ExecuteStreamCommand processor?

or

Is there any other processor with the same functionality as the ExtractText
processor?
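One way to picture the ExecuteStreamCommand approach Lee suggests: the pipe-delimited extraction that the regex performs maps to a single awk call. The command path, sample record, and field handling below are illustrative assumptions, not settings from this thread:

```shell
# Hypothetical ExecuteStreamCommand configuration (assumed, for illustration):
#   Command Path:       /usr/bin/awk
#   Command Arguments:  -F|;{ print $1","$2","$3","$4 }
#   (';' being the processor's argument delimiter)
# The pipeline below shows what that awk program does to one pipe-delimited line:
echo 'E001|prabhu|100|2016' | awk -F'|' '{ print $1","$2","$3","$4 }'
# Output: E001,prabhu,100,2016
```

The result can then be routed onward, or captured into a flowfile attribute via the processor's output-to-attribute option, as Lee notes.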

Thanks

On Tue, Oct 18, 2016 at 11:42 AM, Lee Laim  wrote:

>
> Prabhu,
>
> You might also try to replace ExtractText with a series of
> ExecuteStreamCommand processors that perform system calls (sed/awk/grep or
> the Windows equivalents) on the flowfiles contents.  You can even write the
> result directly to a flowfile attribute.
>
> I suspect there are wildcards in your ExtractText regex that are taking a
> while to buffer and compare.
>
> Lee
>
> On Oct 18, 2016, at 2:31 PM, prabhu Mahendran 
> wrote:
>
> Mark,
>
> Thanks for your response.
>
> Please find the response for your questions.
>
> ==> The first processor that you see that exhibits poor performance is
> ExtractText, correct?
>     Yes, ExtractText exhibits poor performance.
>
> ==> How big is your Java heap?
>     I have set 1 GB for the Java heap.
>
> ==> Do you have back pressure configured on the connection between
> ExtractText and ReplaceText?
>     There is no back pressure configured between ExtractText and
> ReplaceText.
>
> ==> When you say that you specify concurrent tasks, what are you
> configuring the concurrent tasks to be?
>     I have set 2 concurrent tasks for the ExtractText processor due to its
> slower processing rate; this is specified in the Concurrent Tasks box.
>
> ==> Have you changed the maximum number of concurrent tasks available to
> your dataflow?
>     No, I haven't changed it.
>
> ==> How many CPUs are available on this machine?
>     Only a single CPU is available on this machine: a Core i5 @ 2.20 GHz.
>
> ==> Are these the only processors in your flow, or do you have other
> dataflows going on in the same instance of NiFi?
>     Yes, these are the only processors in the workflow, and no other
> dataflows are running.
>
> Thanks
>
> On Mon, Oct 17, 2016 at 6:08 PM, Mark Payne  wrote:
>
>> Prabhu,
>>
>> Certainly, the performance that you are seeing, taking 4-5 hours to move
>> 3M rows into SQLServer is far from
>> ideal, but the good news is that it is also far from typical. You should
>> be able to see far better results.
>>
>> To help us understand what is limiting the performance, and to make sure
>> that we understand what you are seeing,
>> I have a series of questions that would help us to understand what is
>> going on.
>>
>> The first processor that you see that exhibits poor performance is
>> ExtractText, correct?
>> Can you share the configuration that you have for that processor?
>>
>> How big is your Java heap? This is configured in conf/bootstrap.conf; by
>> default it is configured as:
>> java.arg.2=-Xms512m
>> java.arg.3=-Xmx512m
>>
>> Do you have backpressure configured on the connection between ExtractText
>> and ReplaceText?
>>
>> Also, when you say that you specify concurrent tasks, what are you
>> configuring the concurrent tasks
>> to be? Have you changed the maximum number of concurrent tasks available
>> to your dataflow? By default, NiFi will
>> use only 10 threads max. How many CPU's are available on this machine?
>>
>> And finally, are these the only processors in your flow, or do you have
>> other dataflows going on in the
>> same instance as NiFi?
>>
>> Thanks
>> -Mark
>>
>>
>> On Oct 17, 2016, at 3:35 AM, prabhu Mahendran 
>> wrote:
>>
>> Hi All,
>>
>> I have tried to perform the below operation:
>>
>> dat file (input)-->JSON-->SQL-->SQLServer
>>
>> GetFile-->SplitText-->SplitText-->ExtractText-->ReplaceText-->
>> ConvertJsonToSQL-->PutSQL.
>>
>> My input file (.dat) has 3,00,000 (300,000) rows.
>>
>> *Objective:* Move the data from the '.dat' file into SQLServer.
>>
>> I am able to store the data in SQL Server using the combination of
>> processors above, but it takes almost 4-5 hours to move all of the data
>> into SQLServer.
>>
>> The combination of SplitTexts reads the data quickly, but ExtractText
>> takes a long time to match the data against the user-defined expression.
>> Even though the input is 107 MB, it sends output only in KB-sized chunks,
>> and the ReplaceText processor likewise processes data only in KB-sized
>> chunks.
>>
>> This slow processing is what makes loading the data into SQL Server take
>> so long.
>>
>> ExtractText, ReplaceText, and ConvertJsonToSQL all send outgoing
>> flowfiles in kilobyte sizes only.
>>
>> If I specify more concurrent tasks for ExtractText, ReplaceText, and
>> ConvertJsonToSQL, they occupy 100% of the CPU and disk usage.
>>
>> It is just 30 MB of data, but the processors take 6 hours to move it
>> into SQLServer.
>>
>> The problems faced are:
>>
>>    1. It takes almost 6 hours to move the 3 lakh (300,000) rows into
>> SQL Server.
>>    2. ExtractText and ReplaceText take a long time to process the data
>> (they send output flowfiles in KB sizes only).
>>
>> Can anyone help me with the following requirement?
>>
>> I need to reduce the time taken by the processors to move lakhs of rows
>> into SQL Server. If I have done anything wrong, please help me to do it
>> right.
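Regarding Lee's remark about wildcards: each greedy (.+) group forces the regex engine to buffer to the end of the content and backtrack, while negated character classes ([^|]+) match each field in one pass. A small illustrative sketch (the sample line is made up):

```python
import re

line = "a|b|c|d"  # made-up sample record

greedy = re.compile(r"(.+)[|](.+)[|](.+)[|](.+)")               # pattern from this thread
anchored = re.compile(r"^([^|]+)\|([^|]+)\|([^|]+)\|([^|]+)$")  # backtracking-free variant

# Both patterns capture the same four fields on well-formed input.
print(greedy.match(line).groups())    # ('a', 'b', 'c', 'd')
print(anchored.match(line).groups())  # ('a', 'b', 'c', 'd')
```

On large flowfile content the cost of the backtracking grows quickly, which matches the slow ExtractText behavior described above.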

Re: Calculating the theoretical throughput of a Nifi server

2016-10-17 Thread Joe Witt
Ali,

Without knowing the details of the data streams, nature of each event
and the operations that will be performed against them, or how the
processors themselves will work, I cannot give you a solid answer.  Do
I think it is possible?  Absolutely.  Do I think there will be hurdles
to overcome to reach and sustain such a rate?  Absolutely.

Thanks
Joe

On Mon, Oct 17, 2016 at 9:28 PM, Lee Laim  wrote:
> Ali,
> I used the PCIe SSD for all repos and the PutFile destination.
>
>
>
> On Oct 18, 2016, at 8:38 AM, Ali Nazemian  wrote:
>
> Hi Lee,
>
> I was wondering, did you use PCIe for the flowfile repo, the provenance
> repo, or the content repo? Or all of them?
>
> Joe,
>
> The ETL is not very complicated, so do you think it is possible to reach
> 800 MBps in production even if I use PCIe for the flowfile repo? Is it
> worth spending money on PCIe for the flowfile repo?
>
> Best regards
>
> On Tue, Oct 18, 2016 at 2:36 AM, Joe Witt  wrote:
>>
>> Thanks Lee.  Your response was awesome and really made me want to get
>> hands on a set of boxes like this so we could do some testing.
>>
>> Thanks
>> Joe
>>
>> On Mon, Oct 17, 2016 at 11:32 AM, Lee Laim  wrote:
>> > Joe,
>> > Good points regarding throughput on real flows and sustained basis.  My
>> > test
>> > was only pushing one aspect of the system.
>> >
>> > That said, I would be interested discussing/developing a more
>> > comprehensive
>> > test flow to capture more real world use cases. I'll check to see if
>> > that
>> > conversation has started.
>> >
>> > Thanks,
>> > Lee
>> >
>> >
>> >
>> >
>> >
>> > Lee Laim
>> > 610-864-1657
>> >
>> > On Oct 17, 2016, at 9:55 PM, Ali Nazemian  wrote:
>> >
>> > Dear Joe,
>> > Thank you very much.
>> >
>> > Best regards
>> >
>> >
>> > On Mon, Oct 17, 2016 at 10:08 PM, Joe Witt  wrote:
>> >>
>> >> Ali
>> >>
>> >> I suspect bottlenecks in the software itself and the flow design will
>> >> become a factor before you 800 MB/s. You'd likely hit CPU efficiency
>> >> issues before this caused by the flow processors themselves and due to
>> >> garbage collection.  Probably the most important factor though will be
>> >> the transaction rate and whether the flow is configured to tradeoff
>> >> some latency for higher throughput.  So many variables at play but
>> >> under idealized conditions and a system like you describe it is
>> >> theoretically feasible to hit that value.
>> >>
>> >> Practically speaking I think you'd be looking at a couple hundred MB/s
>> >> per server like this on real flows on a sustained basis.
>> >>
>> >> Thanks
>> >> Joe
>> >>
>> >> On Sun, Oct 16, 2016 at 11:06 PM, Ali Nazemian 
>> >> wrote:
>> >> > Dear Nifi users/developers,
>> >> > Hi,
>> >> >
>> >> > I was wondering how can I calculate the theoretical throughput of a
>> >> > Nifi
>> >> > server? let's suppose we can eliminate different bottlenecks such as
>> >> > the
> >> > flowfile repo and provenance repo bottlenecks by using a very high-end
>> >> > SSD.
>> >> > Moreover, assume that a very high-end network infrastructure is
>> >> > available.
>> >> > In this case, is it possible to reach 800MB throughput per second per
>> >> > each
>> >> > server? Suppose each server comes with 24 disk slots. 16 disk slots
>> >> > are
>> >> > used
>> >> > for creating 8 x RAID1(SAS 10k) mount points and are dedicated to the
>> >> > content repo. Let's say each content repo can achieve 100 MB
>> >> > throughput.
>> >> > May
>> >> > I say the total throughput per each server can be 8x100=800MBps?  Is
>> >> > it
>> >> > possible to reach this amount of throughput practically?
>> >> > Thank you very much.
>> >> >
>> >> > Best regards,
>> >> > Ali
>> >
>> >
>> >
>> >
>> > --
>> > A.Nazemian
>
>
>
>
> --
> A.Nazemian


Re: Calculating the theoretical throughput of a Nifi server

2016-10-17 Thread Lee Laim
Hi Ali,

I observed ~1 GB/sec on a test PutFile processor using an enterprise PCIe NVMe
SSD, on a single NiFi instance on desktop-class hardware. I plan to run more
in-depth tests on server-class hardware, but those will likely be on a 1 Gb
network. I should note I'm not sure exactly how much provenance was being
written.

The nifi-0.7.0 instance was a fresh install with no major configuration
changes. I was using the GenerateFlowFile processor to generate 100 MB
flowfiles and writing them as fast as possible with a PutFile processor.

The SSD posted the following on the AS-SSD benchmark (completely unoptimized):
1.8 GB/sec sequential write;
2.3 GB/sec 4K random write (64 threads);
114 MB/sec 4K random write (1 thread)

On the PCIe bus, you should easily surpass 800 MB/sec, especially if your
flowfiles are large and you have few provenance events. The theoretical
bandwidth is 985 MB/sec per lane, up to 16 lanes; I was running x4. The NVMe
standard should also help with smaller flowfiles.
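As a back-of-the-envelope check, the ceilings discussed in this thread can be computed directly; the per-lane and per-disk figures below are the assumed values from the messages, not measurements:

```python
# Rough throughput ceilings from the figures quoted in this thread.
PCIE3_MBPS_PER_LANE = 985  # theoretical PCIe 3.0 bandwidth per lane (MB/s)
LANES = 4                  # the x4 link mentioned above

RAID1_PAIRS = 8            # 8 x RAID1 mount points for the content repo
MBPS_PER_PAIR = 100        # assumed throughput per mount point (MB/s)

pcie_bound = PCIE3_MBPS_PER_LANE * LANES  # 3940 MB/s link ceiling
raid_bound = RAID1_PAIRS * MBPS_PER_PAIR  # 800 MB/s disk ceiling

# The effective upper bound is whichever ceiling is lower.
print(min(pcie_bound, raid_bound))  # 800
```

So on an x4 PCIe link the disks, not the bus, are the binding constraint in the 8 x 100 MB/s scenario.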

Hope this helps,
Lee



> On Oct 17, 2016, at 12:06 PM, Ali Nazemian  wrote:
> 
> Dear Nifi users/developers,
> Hi,
> 
> I was wondering how can I calculate the theoretical throughput of a Nifi 
> server? Let's suppose we can eliminate different bottlenecks such as the
> flowfile repo and provenance repo bottlenecks by using a very high-end SSD.
> Moreover, assume that a very high-end network infrastructure is available. In 
> this case, is it possible to reach 800MB throughput per second per each 
> server? Suppose each server comes with 24 disk slots. 16 disk slots are used 
> for creating 8 x RAID1(SAS 10k) mount points and are dedicated to the content 
> repo. Let's say each content repo can achieve 100 MB throughput. May I say 
> the total throughput per each server can be 8x100=800MBps?  Is it possible to 
> reach this amount of throughput practically?
> Thank you very much.
> 
> Best regards,
> Ali


Re: Calculating the theoretical throughput of a Nifi server

2016-10-17 Thread Ali Nazemian
Dear Joe,
Thank you very much.

Best regards


On Mon, Oct 17, 2016 at 10:08 PM, Joe Witt  wrote:

> Ali
>
> I suspect bottlenecks in the software itself and the flow design will
> become a factor before you reach 800 MB/s. You'd likely hit CPU efficiency
> issues before that, caused by the flow processors themselves and by
> garbage collection.  Probably the most important factor, though, will be
> the transaction rate and whether the flow is configured to trade off
> some latency for higher throughput.  There are many variables at play, but
> under idealized conditions, on a system like you describe, it is
> theoretically feasible to hit that value.
>
> Practically speaking, I think you'd be looking at a couple hundred MB/s
> per server like this on real flows on a sustained basis.
>
> Thanks
> Joe
>
> On Sun, Oct 16, 2016 at 11:06 PM, Ali Nazemian 
> wrote:
> > Dear Nifi users/developers,
> > Hi,
> >
> > I was wondering how I can calculate the theoretical throughput of a NiFi
> > server. Let's suppose we can eliminate bottlenecks such as the flowfile
> > repo and provenance repo by using a very high-end SSD.
> > Moreover, assume that a very high-end network infrastructure is available.
> > In that case, is it possible to reach 800 MB of throughput per second per
> > server? Suppose each server comes with 24 disk slots, 16 of which are
> > used to create 8 x RAID1 (SAS 10k) mount points dedicated to the content
> > repo. Let's say each content repo can achieve 100 MB/s of throughput.
> > May I say the total throughput per server can be 8x100 = 800 MB/s?  Is it
> > possible to reach this amount of throughput in practice?
> > Thank you very much.
> >
> > Best regards,
> > Ali
>



-- 
A.Nazemian


Re: How to increase the processing speed of the ExtractText and ReplaceText Processor?

2016-10-17 Thread Mark Payne
Prabhu,

Certainly, the performance that you are seeing, taking 4-5 hours to move 3M 
rows into SQL Server, is far from ideal, but the good news is that it is also 
far from typical. You should be able to see far better results.

To help us understand what is limiting the performance, and to make sure that 
we understand what you are seeing, I have a series of questions.

The first processor that you see that exhibits poor performance is ExtractText, 
correct?
Can you share the configuration that you have for that processor?

How big is your Java heap? This is configured in conf/bootstrap.conf; by 
default it is configured as:
java.arg.2=-Xms512m
java.arg.3=-Xmx512m
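For example, to give NiFi a larger heap, those two lines in conf/bootstrap.conf could be raised like this (4g is purely an illustrative value; size it to your machine's RAM):

```properties
# conf/bootstrap.conf -- example only; 4g is an illustrative value,
# not a recommendation for every machine
java.arg.2=-Xms4g
java.arg.3=-Xmx4g
```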

Do you have backpressure configured on the connection between ExtractText and 
ReplaceText?

Also, when you say that you specify concurrent tasks, what are you configuring 
the concurrent tasks
to be? Have you changed the maximum number of concurrent tasks available to 
your dataflow? By default, NiFi will
use only 10 threads max. How many CPUs are available on this machine?

And finally, are these the only processors in your flow, or do you have other 
dataflows going on in the
same instance as NiFi?

Thanks
-Mark


> On Oct 17, 2016, at 3:35 AM, prabhu Mahendran  wrote:
> 
> Hi All,
> 
> I have tried to perform the below operation:
> 
> dat file (input) --> JSON --> SQL --> SQL Server
> 
> GetFile-->SplitText-->SplitText-->ExtractText-->ReplaceText-->ConvertJsonToSQL-->PutSQL
> 
> My input file (.dat) has 3,00,000 (300,000) rows.
> 
> Objective: move the data from the '.dat' file into SQL Server.
> 
> I am able to store the data in SQL Server using the combination of processors 
> above, but it takes almost 4-5 hrs to move the complete data into SQL Server.
> 
> The combination of SplitTexts reads the data quickly, but ExtractText takes a 
> long time to match the data against the user-defined expression. Even when 
> the input is 107 MB, it sends output only in KB-sized flowfiles, and the 
> ReplaceText processor also processes data in KB sizes only.
> 
> This slow processing means it takes more time to move the data into SQL 
> Server.
> 
> The ExtractText, ReplaceText, and ConvertJsonToSQL processors send their 
> outgoing flowfiles in kilobytes only.
> 
> If I specify concurrent tasks for ExtractText, ReplaceText, and 
> ConvertJsonToSQL, they occupy 100% of the CPU and disk usage.
> 
> It is just 30 MB of data, but the processors take 6 hrs to move it into 
> SQL Server.
> 
> Problems faced:
> 
>    1. Almost 6 hrs taken to move the 3 lakh (300,000) rows into SQL Server.
>    2. ExtractText and ReplaceText take a long time to process data (they 
>       send output flowfiles in KB sizes only).
> 
> Can anyone help me with the below requirement?
> 
> I need to reduce the time taken by the processors to move the lakhs of rows 
> into SQL Server.
> 
> If I have done anything wrong, please help me to do it right.
> 



Re: Calculating the theoretical throughput of a Nifi server

2016-10-17 Thread Joe Witt
Ali

I suspect bottlenecks in the software itself and the flow design will
become a factor before you reach 800 MB/s. You'd likely hit CPU efficiency
issues before that, caused by the flow processors themselves and by
garbage collection.  Probably the most important factor, though, will be
the transaction rate and whether the flow is configured to trade off
some latency for higher throughput.  There are many variables at play, but
under idealized conditions, on a system like you describe, it is
theoretically feasible to hit that value.

Practically speaking, I think you'd be looking at a couple hundred MB/s
per server like this on real flows on a sustained basis.

Thanks
Joe

On Sun, Oct 16, 2016 at 11:06 PM, Ali Nazemian  wrote:
> Dear Nifi users/developers,
> Hi,
>
> I was wondering how I can calculate the theoretical throughput of a NiFi
> server. Let's suppose we can eliminate bottlenecks such as the flowfile
> repo and provenance repo by using a very high-end SSD.
> Moreover, assume that a very high-end network infrastructure is available.
> In that case, is it possible to reach 800 MB of throughput per second per
> server? Suppose each server comes with 24 disk slots, 16 of which are used
> to create 8 x RAID1 (SAS 10k) mount points dedicated to the content repo.
> Let's say each content repo can achieve 100 MB/s of throughput. May I say
> the total throughput per server can be 8x100 = 800 MB/s?  Is it possible to
> reach this amount of throughput in practice?
> Thank you very much.
>
> Best regards,
> Ali
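Ali's 8x100 estimate and Joe's practical caveat can be combined in a quick sketch. The derating factor below is my own assumption, loosely reflecting the "couple hundred MB/s" sustained figure, not a NiFi-published number:

```python
# Back-of-the-envelope throughput estimate for the server described above.
CONTENT_REPO_MOUNTS = 8   # 8 x RAID1 (SAS 10k) mount points for the content repo
MBPS_PER_MOUNT = 100      # assumed per-mount throughput, per Ali's question

theoretical = CONTENT_REPO_MOUNTS * MBPS_PER_MOUNT
print(theoretical)  # 800 MB/s -- the idealized ceiling

# "A couple hundred MB/s" sustained implies a derating of roughly 3-4x
# (an assumption, covering CPU, GC, and transaction-rate overheads):
practical_low, practical_high = theoretical // 4, theoretical // 3
print(practical_low, practical_high)  # 200 266
```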


How to increase the processing speed of the ExtractText and ReplaceText Processor?

2016-10-17 Thread prabhu Mahendran
Hi All,

I have tried to perform the below operation:

dat file (input) --> JSON --> SQL --> SQL Server

GetFile-->SplitText-->SplitText-->ExtractText-->ReplaceText-->ConvertJsonToSQL-->PutSQL

My input file (.dat) has 3,00,000 (300,000) rows.

*Objective:* move the data from the '.dat' file into SQL Server.

I am able to store the data in SQL Server using the combination of processors
above, but it takes almost 4-5 hrs to move the complete data into SQL Server.

The combination of SplitTexts reads the data quickly, but ExtractText takes a
long time to match the data against the user-defined expression. Even when the
input is 107 MB, it sends output only in KB-sized flowfiles, and the
ReplaceText processor also processes data in KB sizes only.

This slow processing means it takes more time to move the data into SQL Server.

The ExtractText, ReplaceText, and ConvertJsonToSQL processors send their
outgoing flowfiles in kilobytes only.

If I specify concurrent tasks for ExtractText, ReplaceText, and
ConvertJsonToSQL, they occupy 100% of the CPU and disk usage.

It is just 30 MB of data, but the processors take 6 hrs to move it into
SQL Server.

Problems faced:

   1. Almost 6 hrs taken to move the 3 lakh (300,000) rows into SQL Server.
   2. ExtractText and ReplaceText take a long time to process data (they send
      output flowfiles in KB sizes only).

Can anyone help me with the below *requirement*?

I need to reduce the time taken by the processors to move the lakhs of rows
into SQL Server.

If I have done anything wrong, please help me to do it right.
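An editorial aside on why ExtractText can be slow here: patterns built from greedy `(.+)` groups, like the pipe-delimited regex discussed earlier in this thread, force the engine to backtrack until the literal pipes line up, while negated character classes match each field in a single pass. A minimal sketch in Python (the sample line is made up; NiFi uses Java regex, but the backtracking behavior is analogous):

```python
import re

line = "col1|col2|col3|col4"  # hypothetical pipe-delimited row

# Greedy groups: each (.+) can swallow '|' characters, so the engine must
# backtrack until the three literal pipes line up. Cheap on a short line,
# increasingly expensive on long lines and near-miss rows.
greedy = re.compile(r"(.+)\|(.+)\|(.+)\|(.+)")

# Negated class: each field stops at the next pipe, so matching is a
# single left-to-right pass with essentially no backtracking.
anchored = re.compile(r"([^|]+)\|([^|]+)\|([^|]+)\|([^|]+)")

# Both extract the same fields from a well-formed row:
print(greedy.match(line).groups())    # ('col1', 'col2', 'col3', 'col4')
print(anchored.match(line).groups())  # ('col1', 'col2', 'col3', 'col4')
```

On well-formed rows the two patterns return the same groups; the difference shows up in how much work the engine does, especially on rows that almost match.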