RE: How to increase the processing speed of the ExtractText and ReplaceText Processor?

Kevin Verhoeven Wed, 19 Oct 2016 09:44:39 -0700

I’m not clear on how much data you are processing. Does the data (.dat) file 
have 3,00,000 rows?

Kevin

From: prabhu Mahendran [mailto:prabhuu161...@gmail.com]
Sent: Wednesday, October 19, 2016 2:05 AM
To: users@nifi.apache.org
Subject: Re: How to increase the processing speed of the ExtractText and 
ReplaceText Processor?

Mark,

Thanks for the response.

My sample input data (.dat) looks like this:

1|2|3|4
6|7|8|9
11|12|13|14

In ExtractText, I have added only an inputrow property in addition to the 
default properties, as in the screenshot below.

[Inline image 1]
In ReplaceText, I just set the replacement value like this:

{"data1":"${inputrow.1}","data2":"${inputrow.2}","data3":"${inputrow.3}","data4":"${inputrow.4}"}
[Inline image 2]
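
As an illustration only, the row-to-JSON mapping described above can be sketched outside NiFi in a few lines of Python. This is not NiFi code; the function name row_to_json is hypothetical, and the sketch assumes every row has exactly four pipe-delimited fields, as in the sample data.

```python
import json


def row_to_json(row):
    """Turn one pipe-delimited row into the JSON shape used in ReplaceText."""
    fields = row.split("|")
    if len(fields) != 4:
        # Assumption: each row carries exactly four fields, like the sample.
        raise ValueError(f"expected 4 fields, got {len(fields)}: {row!r}")
    # Keys data1..data4 mirror the replacement value shown in the thread.
    record = {f"data{i}": value for i, value in enumerate(fields, start=1)}
    return json.dumps(record, separators=(",", ":"))


print(row_to_json("1|2|3|4"))
# prints {"data1":"1","data2":"2","data3":"3","data4":"4"}
```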

There are no bulletins indicating back pressure on the processors.

What are the prerequisites for moving the 3,00,000 rows into SQL Server within 
10-20 minutes?
How many CPUs are needed?
How much heap size and perm gen size do we need to set to move that data into 
SQL Server?
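
To put the target in perspective, the figures quoted in this thread can be turned into rows per second. This is only arithmetic on the numbers stated here (3,00,000 rows, a 10-20 minute target, 4-5 hours observed); it says nothing by itself about the CPU count or heap size required.

```python
ROWS = 300_000  # "3,00,000" (3 lakh) rows, as stated in the thread


def rows_per_second(rows, minutes):
    """Average throughput needed to move `rows` in `minutes`."""
    return rows / (minutes * 60)


# Target window from the question: 10 to 20 minutes.
target_high = rows_per_second(ROWS, 10)
target_low = rows_per_second(ROWS, 20)

# Observed duration reported in the thread: 4 to 5 hours.
observed_high = rows_per_second(ROWS, 4 * 60)
observed_low = rows_per_second(ROWS, 5 * 60)

print(f"target:   {target_low:.0f} to {target_high:.0f} rows/s")
print(f"observed: {observed_low:.1f} to {observed_high:.1f} rows/s")
```

On these numbers the flow would need to run roughly 12 to 30 times faster than it does now to hit the target window.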

Thanks

On Tue, Oct 18, 2016 at 7:05 PM, Mark Payne 
<marka...@hotmail.com> wrote:
Prabhu,

Thanks for the details. All of this seems fairly normal. Given that you have 
only a single core,
I don't think multiple concurrent tasks will help you. Can you share your 
configuration for ExtractText
and ReplaceText? Depending on the regexes being used, they can be extremely 
expensive to evaluate.
The regex that you mentioned in the other email - "(.+)[|](.+)[|](.+)[|](.+)" 
is in fact extremely expensive.
Any time that you have ".*" or ".+" in your regex, it is going to be extremely 
expensive, especially with
longer FlowFile content.
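
A minimal sketch of one possible alternative, assuming the fields never contain the "|" delimiter: replace each ".+" with a negated character class so a group cannot run past the next delimiter. The example uses Python's re module purely for illustration (NiFi evaluates Java regular expressions, though this particular pattern syntax is the same in both), and the names GREEDY, BOUNDED and split_row are hypothetical. It shows that both patterns capture the same fields on a well-formed row; it does not measure performance.

```python
import re

# Pattern quoted in the thread: each ".+" can also match "|", so the engine
# has to backtrack to work out where the delimiters fall.
GREEDY = re.compile(r"(.+)[|](.+)[|](.+)[|](.+)")

# Hypothetical alternative: "[^|]+" cannot cross a delimiter, so each group
# stops at the next "|" and leaves less room for backtracking.
BOUNDED = re.compile(r"([^|]+)[|]([^|]+)[|]([^|]+)[|]([^|]+)")


def split_row(pattern, row):
    """Return the four captured fields, or None if the row does not match."""
    match = pattern.fullmatch(row)
    return match.groups() if match else None


row = "11|12|13|14"
greedy_fields = split_row(GREEDY, row)
bounded_fields = split_row(BOUNDED, row)
print(greedy_fields == bounded_fields)
```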

Also, do you see any bulletins indicating that the provenance repository is 
applying backpressure? Given
that you are splitting your FlowFiles into individual lines, the provenance 
repository may be under a lot
of pressure.

Another thing to check is how much garbage collection is occurring. This can 
certainly destroy your performance
quickly. You can get this information by going to the "Summary Table" in the 
top-right of the UI and then clicking the
"System Diagnostics" link in the bottom-right corner of that Summary Table.

Thanks
-Mark

On Oct 18, 2016, at 1:31 AM, prabhu Mahendran 
<prabhuu161...@gmail.com> wrote:

Mark,

Thanks for your response.

Please find the response for your questions.

==> The first processor that you see that exhibits poor performance is 
ExtractText, correct?
    Yes, ExtractText exhibits poor performance.

==> How big is your Java heap?
    I have set 1 GB for the Java heap.

==> Do you have back pressure configured on the connection between ExtractText 
and ReplaceText?
    There is no back pressure configured between ExtractText and ReplaceText.

==> When you say that you specify concurrent tasks, what are you configuring the 
concurrent tasks to be?
    I have set concurrent tasks to 2 for the ExtractText processor because of 
    its slow processing rate. This is specified in the Concurrent Tasks text 
    box.

==> Have you changed the maximum number of concurrent tasks available to your 
dataflow?
    No, I haven't changed it.

==> How many CPUs are available on this machine?
    Only a single CPU is available on this machine, a Core i5 processor 
    @ 2.20 GHz.

==> Are these the only processors in your flow, or do you have other dataflows 
going on in the same instance as NiFi?
    Yes, these are the only processors running in the flow, and no other 
    instances are running.

Thanks

On Mon, Oct 17, 2016 at 6:08 PM, Mark Payne 
<marka...@hotmail.com> wrote:
Prabhu,

Certainly, the performance that you are seeing, taking 4-5 hours to move 
300,000 rows into SQL Server, is far from ideal, but the good news is that it 
is also far from typical. You should be able to see far better results.

To help us understand what is limiting the performance, and to make sure that 
we understand what you are seeing, I have a series of questions.

The first processor that you see that exhibits poor performance is ExtractText, 
correct?
Can you share the configuration that you have for that processor?

How big is your Java heap? This is configured in conf/bootstrap.conf; by 
default it is configured as:
java.arg.2=-Xms512m
java.arg.3=-Xmx512m

Do you have backpressure configured on the connection between ExtractText and 
ReplaceText?

Also, when you say that you specify concurrent tasks, what are you configuring 
the concurrent tasks
to be? Have you changed the maximum number of concurrent tasks available to 
your dataflow? By default, NiFi will
use only 10 threads max. How many CPUs are available on this machine?

And finally, are these the only processors in your flow, or do you have other 
dataflows going on in the
same instance as NiFi?

Thanks
-Mark

On Oct 17, 2016, at 3:35 AM, prabhu Mahendran 
<prabhuu161...@gmail.com> wrote:

Hi All,

I am trying to perform the following operation.

dat file(input)-->JSON-->SQL-->SQLServer

GetFile-->SplitText-->SplitText-->ExtractText-->ReplaceText-->ConvertJsonToSQL-->PutSQL.

My input file (.dat) has 3,00,000 rows.

Objective: Move the data from '.dat' file into SQLServer.

I am able to store the data in SQL Server by using a combination of the above 
processors, but it takes almost 4-5 hrs to move the complete data into SQL 
Server.

The combination of SplitText processors reads the data quickly, but ExtractText 
takes a long time to match the data against the user-defined expression. Even 
though the input is 107 MB, it sends output only in KB sizes, and the 
ReplaceText processor also processes data only in KB sizes.

This slow processing increases the time taken to get the data into SQL Server.

The ExtractText, ReplaceText, and ConvertJsonToSQL processors send outgoing 
flow files in kilobytes only.

If I specify concurrent tasks for ExtractText, ReplaceText, and 
ConvertJsonToSQL, they occupy 100% of the CPU and disk usage.

It is just 30 MB of data, but the processors take 6 hrs to move it into SQL 
Server.

The problems I face are:

  1. It takes almost 6 hrs to move the 3 lakh rows into SQL Server.
  2. ExtractText and ReplaceText take a long time to process the data (they 
     send output flow files of KB size only).

Can anyone help me with the requirement below?

I need to reduce the time taken by the processors to move lakhs of rows into 
SQL Server.

If I have done anything wrong, please help me do it right.
