Lee,

I have tried your suggested flow; it can insert the data into SQL Server in
about 50 minutes, but that is still a long time.

*==>* Your note:
*You might be processing the entire dat file (instead of a single row) for
each record.*
  How can I tell whether I am processing the entire .dat file for each
record, and how do I process it a single row at a time instead?


*==>* Your note: *Without any new optimizations you'll need ~25 threads and
sufficient memory to feed the threads.*
  My processors run on only 10 threads even after setting concurrent
tasks. How can I increase this to 25 threads?

Since you ran the quick test, could you please share the regex you used?

Is there any other processor with functionality similar to ExtractText?

Thanks



On Wed, Oct 19, 2016 at 11:29 PM, Lee Laim <lee.l...@gmail.com> wrote:

> Prabu,
>
> In order to move 3M rows in 10 minutes, you'll need to process 5000
> rows/second.
> During your 4 hour run, you were processing ~200 rows/second.
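>
> (That is, 3,000,000 rows / 600 seconds = 5,000 rows/second, whereas
> 3,000,000 rows / 14,400 seconds is roughly 208 rows/second.)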
>
> Without any new optimizations you'll need ~25 threads and sufficient
> memory to feed the threads.  I agree with Mark; you should be able to
> get far more than 200 rows/second.
>
> I ran a quick test using your ExtractText regex on similar data; I was
> able to process over 100,000 rows/minute through the ExtractText processor.
> The input data was a single row of 4 fields delimited by the "|" symbol.
>
> *You might be processing the entire dat file (instead of a single row) for
> each record.*
> *Can you check the flow file attributes and content going into
> ExtractText?  *
>
>
> Here is the flow with some notes:
>
> 1. GetFile (a 30 MB .dat file consisting of 3M rows; each row is about 10
> bytes)
>
> 2. SplitText -> SplitText  (to break the 3M rows down to manageable chunks
> of 10,000 lines per flow file, then split again to 1 line per flow file)
>
> 3. ExtractText to extract the 4 fields
>
> 4. ReplaceText to generate json (You can alternatively use
> AttributesToJson here)
>
> 5. ConvertJSONtoSQL
>
> 6. PutSQL - (this should be the true bottleneck; index the DB well and use
> many threads)
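>
> For illustration, the JSON produced at step 4 (e.g.
> {"data1":"1","data2":"2","data3":"3","data4":"4"} for an input row of
> 1|2|3|4) is converted by step 5 into a parameterized statement along the
> lines of
>
> INSERT INTO mytable (data1, data2, data3, data4) VALUES (?, ?, ?, ?)
>
> where "mytable" is just a placeholder table name; the field values travel
> as sql.args.N.value flow file attributes for PutSQL to bind.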
>
>
> If my assumptions are incorrect, please let me know.
>
> Thanks,
> Lee
>
> On Thu, Oct 20, 2016 at 1:43 AM, Kevin Verhoeven <
> kevin.verhoe...@ds-iq.com> wrote:
>
>> I’m not clear on how much data you are processing; does the data (.dat)
>> file have 3,00,000 (i.e. 300,000) rows?
>>
>>
>>
>> Kevin
>>
>>
>>
>> *From:* prabhu Mahendran [mailto:prabhuu161...@gmail.com]
>> *Sent:* Wednesday, October 19, 2016 2:05 AM
>> *To:* users@nifi.apache.org
>> *Subject:* Re: How to increase the processing speed of the ExtractText
>> and ReplaceText Processor?
>>
>>
>>
>> Mark,
>>
>> Thanks for the response.
>>
>> My sample input data (.dat) looks like this:
>>
>> 1|2|3|4
>> 6|7|8|9
>> 11|12|13|14
>>
>> In ExtractText, I have added only the "inputrow" property on top of the
>> default properties, as in the screenshot below.
>>
>> [image: Inline image 1]
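>>
>> (For readers without the inline image: the value of that property is
>> presumably the regex discussed later in this thread,
>> (.+)[|](.+)[|](.+)[|](.+).)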
>> In ReplaceText, I just set the replacement value to:
>>
>> {"data1":"${inputrow.1}","data2":"${inputrow.2}","data3":"${inputrow.3}","data4":"${inputrow.4}"}
>> [image: Inline image 2]
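>>
>> (So, with the first sample row 1|2|3|4, ExtractText adds the attributes
>> inputrow.1=1 through inputrow.4=4, and ReplaceText then produces
>> {"data1":"1","data2":"2","data3":"3","data4":"4"}.)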
>>
>>
>> There are no bulletins indicating back pressure on any processors.
>>
>> What prerequisites are needed to move the 300,000 rows into SQL Server
>> within 10-20 minutes?
>> How many CPUs are needed?
>> How much heap and permgen size do we need to set to move that data into
>> SQL Server?
>>
>> Thanks
>>
>>
>>
>> On Tue, Oct 18, 2016 at 7:05 PM, Mark Payne <marka...@hotmail.com> wrote:
>>
>> Prabhu,
>>
>>
>>
>> Thanks for the details. All of this seems fairly normal. Given that you
>> have only a single core,
>>
>> I don't think multiple concurrent tasks will help you. Can you share your
>> configuration for ExtractText
>>
>> and ReplaceText? Depending on the regexes being used, they can be
>> extremely expensive to evaluate.
>>
>> The regex that you mentioned in the other email -
>> "(.+)[|](.+)[|](.+)[|](.+)" is in fact extremely expensive.
>>
>> Any time that you have ".*" or ".+" in your regex, it is going to be
>> extremely expensive, especially with
>>
>> longer FlowFile content.
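>>
>> (A cheaper alternative, assuming exactly four pipe-delimited fields per
>> line, is a pattern built from negated character classes, such as
>> ([^|]+)[|]([^|]+)[|]([^|]+)[|]([^|]+); it cannot backtrack across the
>> delimiter, so it is typically far less expensive to evaluate.)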
>>
>>
>>
>> Also, do you see any bulletins indicating that the provenance repository
>> is applying backpressure? Given
>>
>> that you are splitting your FlowFiles into individual lines, the
>> provenance repository may be under a lot
>>
>> of pressure.
>>
>>
>>
>> Another thing to check, is how much garbage collection is occurring. This
>> can certainly destroy your performance
>>
>> quickly. You can get this information by going to the "Summary Table" in
>> the top-right of the UI and then clicking the
>>
>> "System Diagnostics" link in the bottom-right corner of that Summary
>> Table.
>>
>>
>>
>> Thanks
>>
>> -Mark
>>
>>
>>
>>
>>
>> On Oct 18, 2016, at 1:31 AM, prabhu Mahendran <prabhuu161...@gmail.com>
>> wrote:
>>
>>
>>
>> Mark,
>>
>> Thanks for your response.
>>
>> Please find my responses to your questions below.
>>
>> ==>The first processor that you see that exhibits poor performance is
>> ExtractText, correct?
>>                              Yes, ExtractText exhibits poor performance.
>>
>> ==>How big is your Java heap?
>>                             I have set the Java heap to 1 GB.
>>
>> ==>Do you have back pressure configured on the connection between
>> ExtractText and ReplaceText?
>>                            No back pressure is configured between
>> ExtractText and ReplaceText.
>>
>> ==>when you say that you specify concurrent tasks, what are you
>> configuring the concurrent tasks
>>
>> to be?
>>                           I have set Concurrent Tasks to 2 (in the
>> Concurrent Tasks box) for the ExtractText processor because of its slow
>> processing rate.
>>
>> ==>Have you changed the maximum number of concurrent tasks available to
>> your dataflow?
>>                          No, I haven't changed it.
>>
>> ==>How many CPU's are available on this machine?
>>                         Only a single CPU is available on this machine:
>> a Core i5 @ 2.20GHz.
>>
>> ==> Are these the only processors in your flow, or do you have other
>> dataflows going on in the
>>
>> same instance as NiFi?
>>                        Yes, these are the only processors in the flow;
>> no other dataflows are running on the same NiFi instance.
>>
>> Thanks
>>
>>
>>
>> On Mon, Oct 17, 2016 at 6:08 PM, Mark Payne <marka...@hotmail.com> wrote:
>>
>> Prabhu,
>>
>>
>>
>> Certainly, the performance that you are seeing, taking 4-5 hours to move
>> 3M rows into SQLServer is far from
>>
>> ideal, but the good news is that it is also far from typical. You should
>> be able to see far better results.
>>
>>
>>
>> To help us understand what is limiting the performance, and to make sure
>> that we understand what you are seeing,
>>
>> I have a series of questions that would help us to understand what is
>> going on.
>>
>>
>>
>> The first processor that you see that exhibits poor performance is
>> ExtractText, correct?
>>
>> Can you share the configuration that you have for that processor?
>>
>>
>>
>> How big is your Java heap? This is configured in conf/bootstrap.conf; by
>> default it is configured as:
>>
>> java.arg.2=-Xms512m
>>
>> java.arg.3=-Xmx512m
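>>
>> For example, assuming the machine has memory to spare, you could raise
>> both values and restart NiFi, e.g.:
>>
>> java.arg.2=-Xms2g
>>
>> java.arg.3=-Xmx2g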
>>
>>
>>
>> Do you have backpressure configured on the connection between ExtractText
>> and ReplaceText?
>>
>>
>>
>> Also, when you say that you specify concurrent tasks, what are you
>> configuring the concurrent tasks
>>
>> to be? Have you changed the maximum number of concurrent tasks available
>> to your dataflow? By default, NiFi will
>>
>> use only 10 threads max. How many CPU's are available on this machine?
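>>
>> (That 10-thread cap is the "Maximum Timer Driven Thread Count" setting,
>> which can be raised in the NiFi UI under Controller Settings.)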
>>
>>
>>
>> And finally, are these the only processors in your flow, or do you have
>> other dataflows going on in the
>>
>> same instance as NiFi?
>>
>>
>>
>> Thanks
>>
>> -Mark
>>
>>
>>
>>
>>
>> On Oct 17, 2016, at 3:35 AM, prabhu Mahendran <prabhuu161...@gmail.com>
>> wrote:
>>
>>
>>
>> Hi All,
>>
>> I am trying to perform the following operation:
>>
>> .dat file (input) --> JSON --> SQL --> SQL Server
>>
>>
>> GetFile-->SplitText-->SplitText-->ExtractText-->ReplaceText-->ConvertJsonToSQL-->PutSQL
>>
>> My input file (.dat) has 300,000 (3 lakh) rows.
>>
>> *Objective:* Move the data from the .dat file into SQL Server.
>>
>> I am able to store the data in SQL Server using the combination of
>> processors above, but it takes almost 4-5 hours to move the complete data
>> set into SQL Server.
>>
>> The SplitText combination reads the data quickly, but ExtractText takes a
>> long time to match the incoming data against the user-defined expression.
>> Even when 107 MB comes in, it sends its output only a few KB at a time,
>> and the ReplaceText processor likewise processes data in KB sizes only.
>>
>> This slow processing is what makes loading the data into SQL Server take
>> so long.
>>
>>
>> The ExtractText, ReplaceText, and ConvertJsonToSQL processors send their
>> outgoing flow files in kilobytes only.
>>
>> If I specify concurrent tasks for ExtractText, ReplaceText, and
>> ConvertJsonToSQL, they drive CPU and disk usage to 100%.
>>
>> It is just 30 MB of data, but the processors take about 6 hours to move
>> it into SQL Server.
>>
>> The problems I am facing:
>>
>>    1.        It takes almost 6 hours to move the 300,000 rows (3 lakh)
>>    into SQL Server.
>>    2.        ExtractText and ReplaceText take a long time to process the
>>    data (they send output flow files in KB sizes only).
>>
>> Can anyone help me with the following *requirement*?
>>
>> I need to reduce the time the processors take to move these hundreds of
>> thousands of rows into SQL Server.
>>
>>
>>
>> If I am doing anything wrong, please help me to do it right.
>>
>
>
