Excellent information here. Thanks Lee and Peter.

Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Oct 24, 2016, at 6:57 AM, Peter Wicks (pwicks) <pwi...@micron.com> wrote:
> 
> Prabhu,
> 
> Lee mentioned making sure you have good indexes, but I would caution you on 
> this point. If you have a unique constraint, SQL Server will build an index 
> for it automatically, but I would suggest dropping all other indexes that 
> aren't related to data integrity. Each time SQL Server writes a row with an 
> indexed column, it has to update that index as well, which adds a lot of 
> overhead.
> 
> You might be thinking that you need these indexes though for user queries. To 
> work around this I often see the use of a staging table. This table has no 
> indexes beyond the absolute minimum to ensure data integrity, and sometimes 
> even these are removed and data integrity/duplicate removal is handled 
> through the use of SQL or a Stored Procedure.  A periodic job will move all 
> data from this staging table into a final table.  If you execute the copy and 
> a truncate in a single transaction it allows you to do this safely:
> 
> BEGIN TRANSACTION;
> INSERT INTO [Some_Final_Table] SELECT * FROM [Staging_Table_With_Exact_Same_schema];
> TRUNCATE TABLE [Staging_Table_With_Exact_Same_schema];
> COMMIT TRANSACTION;
> 
> If you do it this way you can keep the indexes you need for user access while 
> still allowing maximum data throughput to SQL Server.
> 
> I’ve seen a lot of comments online about a batch size around 500 being 
> optimal, but of course this will vary with the configuration of both your 
> NiFi server and the SQL Server.
> 
> I have had issues getting good performance out of PutSQL even with the above. 
> I don’t think this is the fault of the processor, but more that JDBC batch row 
> processing was not really designed for this kind of volume. In my case I was 
> trying to push about 10M rows over a longer time period, but was still running 
> into trouble. After working on the issue for a while I found that a 
> database-specific loader was needed. I am loading to Teradata, so I wrote up a 
> Teradata FastLoad processor. In your case the MS SQL Server JDBC Driver 
> includes a `SQLServerBulkCopy` loader, 
> https://msdn.microsoft.com/en-us/library/mt221490%28v=sql.110%29.aspx. 
> Unfortunately, this would require writing code, either through a scripted 
> processor or as a whole new processor.
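> 
> Roughly, and purely as an untested sketch (the connection string, file path, 
> destination table, column metadata, and batch size below are placeholders, and 
> it assumes a recent Microsoft JDBC driver that ships SQLServerBulkCopy), the 
> loading code might look something like this:
> 
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.Types;
> import com.microsoft.sqlserver.jdbc.SQLServerBulkCSVFileRecord;
> import com.microsoft.sqlserver.jdbc.SQLServerBulkCopy;
> import com.microsoft.sqlserver.jdbc.SQLServerBulkCopyOptions;
> 
> public class BulkLoadSketch {
>     public static void main(String[] args) throws Exception {
>         // Placeholder connection string, database, and credentials.
>         String url = "jdbc:sqlserver://localhost;databaseName=MyDb;user=nifi;password=changeit";
>         try (Connection conn = DriverManager.getConnection(url)) {
>             // Describe the pipe-delimited source file (placeholder path), no header row.
>             SQLServerBulkCSVFileRecord source =
>                     new SQLServerBulkCSVFileRecord("C:\\data\\rows.csv", "UTF-8", "|", false);
>             source.addColumnMetadata(1, "data1", Types.VARCHAR, 50, 0);
>             source.addColumnMetadata(2, "data2", Types.VARCHAR, 50, 0);
>             source.addColumnMetadata(3, "data3", Types.VARCHAR, 50, 0);
>             source.addColumnMetadata(4, "data4", Types.VARCHAR, 50, 0);
> 
>             SQLServerBulkCopyOptions options = new SQLServerBulkCopyOptions();
>             options.setBatchSize(10000);     // send rows to the server in chunks
>             options.setBulkCopyTimeout(600); // seconds
> 
>             try (SQLServerBulkCopy bulkCopy = new SQLServerBulkCopy(conn)) {
>                 bulkCopy.setDestinationTableName("dbo.Staging_Table_With_Exact_Same_schema");
>                 bulkCopy.setBulkCopyOptions(options);
>                 bulkCopy.writeToServer(source); // bulk-inserts the whole file in one call
>             }
>         }
>     }
> }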
> 
> Since writing a custom processor may be more than you want to jump into right 
> now, you should probably take a look at `bcp`. I didn’t catch whether you are 
> on Windows or a Unix platform, but if you are on Windows I’d check out the 
> command-line Bulk Copy Program for MS SQL: 
> https://msdn.microsoft.com/en-us/library/ms162802.aspx. Using this would allow 
> you to prepare your data into an intermediary format such as CSV first, then 
> send it all at once through `bcp`.
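> 
> For example (untested, and the server, database, table, and file names below 
> are only placeholders), a pipe-delimited file could be pushed in with 
> something along the lines of:
> 
> bcp MyDb.dbo.Staging_Table in C:\data\rows.csv -S myserver -T -c -t "|" -b 10000
> 
> Here -t sets the field terminator and -b controls how many rows are committed 
> per batch.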
> 
> 
> Thanks,
>   Peter Wicks
> 
> From: Lee Laim [mailto:lee.l...@gmail.com]
> Sent: Monday, October 24, 2016 7:17 AM
> To: users@nifi.apache.org
> Subject: Re: How to increase the processing speed of the ExtractText and 
> ReplaceText Processor?
> 
> Hello Prabhu,
> 
> 50 minutes is a good start! Now we have to determine where the next 
> bottleneck is: check to see where the flow files are queueing. You can also 
> check the "average task duration" statistic for each processor. I suspect 
> the bottleneck is at PutSQL and will carry this assumption forward.
> 
> There are several knobs you can adjust at the assumed PutSQL bottleneck:
> 1. Increase the run duration so the PutSQL processor keeps running for 2 
> seconds before releasing the thread.
> 2. Set Support Fragmented Transactions to false. This removes constraints that 
> take time to check.
> 3. Consider changing the batch size systematically and observe how throughput 
> changes. I'd move up in increments of 100.
> 4*. Increase the number of concurrent tasks for the bottleneck processor to 3 
> or higher. Increase systematically to observe whether you get more flow files 
> through. You can increase the max timer driven threads of the NiFi instance 
> in the NiFi Flow Settings (top right of the canvas). You can set the max to 
> 25, but you are truly limited by hardware here. Consider a more powerful 
> system to manage this flow, especially with the time constraint you have; it 
> is often easier to throw more hardware at the problem than to debug.
> Other areas:
> 5. On the output of the last SplitText processor, invoke a back pressure 
> object threshold of 1. This will slow (temporarily stop) the first SplitText 
> processor and reduce the overall number of flow files to manage. It also 
> reduces the NiFi processors' demand for the CPU threads.
> 6. Increase nifi.queue.swap.threshold in nifi.properties to reduce disk access 
> (see the example after this list).
> 7. Check the connection/load on the SQL Server.
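> 
> For example, in nifi.properties (the default threshold is 20,000 flow files 
> per connection; the value below is only illustrative, and keeping more flow 
> files in memory costs more heap):
> 
> nifi.queue.swap.threshold=50000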
> 
> To address your queries, I used the same expression you provided: 
> (.+)[|](.+)[|](.+)[|](.+)
> You can use an ExecuteStreamCommand processor to 'extract text', but in this 
> case, with small flow files, it won't offer much gain.
> 
> *With an i5 processor, you have 4 CPU threads to process flow files, manage 
> NiFi, read/write to disk, and handle all other non-NiFi processes. Moving to 
> an i7 or Xeon with hyper-threading will give NiFi more resources to really get 
> work done. While clustering is great for increasing throughput, I wouldn't 
> recommend clustering on a set of smaller i5 systems, as there is added 
> communication overhead that will need to be handled by the CPU. It can be 
> done, but there are easier ways to increase throughput at this point in 
> development.
> 
> Hope this helps.  Also, if there is anything I stated that is contrary to 
> what others have observed, please chime in.
> 
> Thanks,
> Lee
> 
> 
> On Thu, Oct 20, 2016 at 6:02 PM, prabhu Mahendran <prabhuu161...@gmail.com 
> <mailto:prabhuu161...@gmail.com>> wrote:
> Lee,
> 
> I have tried your suggested flow, which was able to insert the data into SQL 
> Server in 50 minutes, but that still takes a long time.
> 
> ==> Your query: You might be processing the entire dat file (instead of a 
> single row) for each record.
> 
>   How can I process the entire dat file into SQL Server?
> 
> ==> Your query: Without any new optimizations you'll need ~25 threads and 
> sufficient memory to feed the threads.
> 
>   My processors run with only 10 threads when setting concurrent tasks. How 
> can I increase this to 25 threads?
> 
> If you tried a quick test, could you please share the regex that you used?
> 
> Is there any other processor with functionality like ExtractText?
> 
> Thanks
> 
> 
> 
> On Wed, Oct 19, 2016 at 11:29 PM, Lee Laim <lee.l...@gmail.com 
> <mailto:lee.l...@gmail.com>> wrote:
> Prabu,
> 
> In order to move 3M rows in 10 minutes, you'll need to process 5000 
> rows/second.
> During your 4 hour run, you were processing ~200 rows/second.
> 
> Without any new optimizations you'll need ~25 threads and sufficient memory 
> to feed the threads.  I agree with Mark and you should be able to get far 
> more than 200 rows/second.
> 
> I ran a quick test using your ExtractText regex on similar data and was able 
> to process over 100,000 rows/minute through the ExtractText processor. The 
> input data was a single row of 4 fields delimited by the "|" symbol.
> 
> You might be processing the entire dat file (instead of a single row) for 
> each record.
> Can you check the flow file attributes and content going into ExtractText?
> 
> 
> Here is the flow with some notes:
> 
> 1. GetFile (a 30 MB .dat file consisting of 3M rows; each row is about 10 
> bytes)
> 
> 2. SplitText -> SplitText (to break the 3M rows down to manageable chunks of 
> 10,000 lines per flow file, then split again to 1 line per flow file)
> 
> 3. ExtractText to extract the 4 fields
> 
> 4. ReplaceText to generate json (You can alternatively use AttributesToJson 
> here)
> 
> 5. ConvertJSONtoSQL
> 
> 6. PutSQL (This should be the true bottleneck; index the DB well and use many 
> threads)
> 
> If my assumptions are incorrect, please let me know.
> 
> Thanks,
> Lee
> 
> On Thu, Oct 20, 2016 at 1:43 AM, Kevin Verhoeven <kevin.verhoe...@ds-iq.com 
> <mailto:kevin.verhoe...@ds-iq.com>> wrote:
> I’m not clear on how much data you are processing; does the data (.dat) file 
> have 300,000 rows?
> 
> Kevin
> 
> From: prabhu Mahendran [mailto:prabhuu161...@gmail.com 
> <mailto:prabhuu161...@gmail.com>]
> Sent: Wednesday, October 19, 2016 2:05 AM
> To: users@nifi.apache.org <mailto:users@nifi.apache.org>
> Subject: Re: How to increase the processing speed of the ExtractText and 
> ReplaceText Processor?
> 
> Mark,
> 
> Thanks for the response.
> 
> My sample input data (.dat) looks like the below:
> 
> 1|2|3|4
> 6|7|8|9
> 11|12|13|14
> 
> In ExtractText, I have added only the input row property in addition to the 
> default properties, as in the screenshot below.
> 
> <image001.png>
> In ReplaceText, I just replace the value with:
> {"data1":"${inputrow.1}","data2":"${inputrow.2}","data3":"${inputrow.3}","data4":"${inputrow.4}"}
> <image002.png>
> 
> 
> There are no bulletins indicating back pressure on the processors.
> 
> Can I know the prerequisites needed to move the 300,000 rows into SQL Server 
> within 10-20 minutes?
> How many CPUs are needed?
> How much heap size and perm gen size do we need to set to move that data into 
> SQL Server?
> 
> Thanks
> 
> 
> On Tue, Oct 18, 2016 at 7:05 PM, Mark Payne <marka...@hotmail.com 
> <mailto:marka...@hotmail.com>> wrote:
> Prabhu,
> 
> Thanks for the details. All of this seems fairly normal. Given that you have 
> only a single core,
> I don't think multiple concurrent tasks will help you. Can you share your 
> configuration for ExtractText
> and ReplaceText? Depending on the regexes being used, they can be extremely 
> expensive to evaluate.
> The regex that you mentioned in the other email - "(.+)[|](.+)[|](.+)[|](.+)" 
> is in fact extremely expensive.
> Any time that you have ".*" or ".+" in your regex, it is going to be 
> extremely expensive, especially with
> longer FlowFile content.
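> 
> As an illustration only (assuming every row really is exactly four non-empty, 
> pipe-delimited fields), an expression that avoids that backtracking would be 
> something like:
> 
> ([^|]+)[|]([^|]+)[|]([^|]+)[|]([^|]+)
> 
> Each [^|]+ group can never run past a pipe, so the matcher does far less work 
> per line.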
> 
> Also, do you see any bulletins indicating that the provenance repository is 
> applying backpressure? Given
> that you are splitting your FlowFiles into individual lines, the provenance 
> repository may be under a lot
> of pressure.
> 
> Another thing to check, is how much garbage collection is occurring. This can 
> certainly destroy your performance
> quickly. You can get this information by going to the "Summary Table" in the 
> top-right of the UI and then clicking the
> "System Diagnostics" link in the bottom-right corner of that Summary Table.
> 
> Thanks
> -Mark
> 
> 
> On Oct 18, 2016, at 1:31 AM, prabhu Mahendran <prabhuu161...@gmail.com 
> <mailto:prabhuu161...@gmail.com>> wrote:
> 
> Mark,
> 
> Thanks for your response.
> 
> Please find the response for your questions.
> 
> ==> The first processor that you see that exhibits poor performance is 
> ExtractText, correct?
>                              Yes, ExtractText exhibits poor performance.
> 
> ==> How big is your Java heap?
>                             I have set 1 GB for the Java heap.
> 
> ==> Do you have back pressure configured on the connection between ExtractText 
> and ReplaceText?
>                            There is no back pressure configured between 
> ExtractText and ReplaceText.
> 
> ==> When you say that you specify concurrent tasks, what are you configuring 
> the concurrent tasks to be?
>                           I have set concurrent tasks to 2 for the ExtractText 
> processor due to the slow processing rate. This is specified in the Concurrent 
> Tasks text box.
> 
> ==> Have you changed the maximum number of concurrent tasks available to your 
> dataflow?
>                          No, I haven't changed it.
> 
> ==> How many CPUs are available on this machine?
>                         Only a single CPU is available in this machine, a 
> Core i5 processor @ 2.20 GHz.
> 
> ==> Are these the only processors in your flow, or do you have other 
> dataflows going on in the same instance as NiFi?
>                        Yes, this is the only flow that is running, and no 
> other dataflows are running in the same NiFi instance.
> 
> Thanks
> 
> On Mon, Oct 17, 2016 at 6:08 PM, Mark Payne <marka...@hotmail.com 
> <mailto:marka...@hotmail.com>> wrote:
> Prabhu,
> 
> Certainly, the performance that you are seeing, taking 4-5 hours to move 3M 
> rows into SQLServer is far from
> ideal, but the good news is that it is also far from typical. You should be 
> able to see far better results.
> 
> To help us understand what is limiting the performance, and to make sure that 
> we understand what you are seeing,
> I have a series of questions that would help us to understand what is going 
> on.
> 
> The first processor that you see that exhibits poor performance is 
> ExtractText, correct?
> Can you share the configuration that you have for that processor?
> 
> How big is your Java heap? This is configured in conf/bootstrap.conf; by 
> default it is configured as:
> java.arg.2=-Xms512m
> java.arg.3=-Xmx512m
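> 
> If the heap turns out to be too small for all of the split FlowFiles, these 
> can be raised; the values below are only an illustration (assuming the 
> machine has the memory to spare):
> java.arg.2=-Xms1g
> java.arg.3=-Xmx1g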
> 
> Do you have backpressure configured on the connection between ExtractText and 
> ReplaceText?
> 
> Also, when you say that you specify concurrent tasks, what are you 
> configuring the concurrent tasks
> to be? Have you changed the maximum number of concurrent tasks available to 
> your dataflow? By default, NiFi will
> use only 10 threads max. How many CPUs are available on this machine?
> 
> And finally, are these the only processors in your flow, or do you have other 
> dataflows going on in the
> same instance as NiFi?
> 
> Thanks
> -Mark
> 
> 
> On Oct 17, 2016, at 3:35 AM, prabhu Mahendran <prabhuu161...@gmail.com 
> <mailto:prabhuu161...@gmail.com>> wrote:
> 
> Hi All,
> 
> I have tried to perform the below operation.
> 
> dat file(input)-->JSON-->SQL-->SQLServer
> 
> 
> GetFile-->SplitText-->SplitText-->ExtractText-->ReplaceText-->ConvertJsonToSQL-->PutSQL.
> 
> My input file (.dat) --> 300,000 rows.
> 
> Objective: Move the data from '.dat' file into SQLServer.
> 
> I am able to store the data in SQL Server by using the combination of 
> processors above, but it takes almost 4-5 hrs to move the complete data into 
> SQL Server.
> 
> The combination of SplitTexts reads the data quickly, but ExtractText takes a 
> long time to pass on data that matches the user-defined expression. Even when 
> the input is 107 MB, it sends output only in KB sizes, and the ReplaceText 
> processor likewise processes data only in KB sizes.
> 
> This slow processing leads to more time being taken to get the data into SQL 
> Server.
> 
> 
> The ExtractText, ReplaceText, and ConvertJSONtoSQL processors send outgoing 
> flow files in kilobytes only.
> 
> If I specify concurrent tasks for ExtractText, ReplaceText, and 
> ConvertJSONtoSQL, they drive CPU and disk usage to 100%.
> 
> It is just 30 MB of data, but the processors take 6 hrs to move it into SQL 
> Server.
> 
> The problems faced are:
> 
>        Almost 6 hrs are taken to move the 300,000 rows into SQL Server.
>        ExtractText and ReplaceText take a long time to process data (they send 
> output flow files in KB sizes only).
> Can anyone help me to meet the requirement below?
> 
> I need to reduce the time taken by the processors to move these hundreds of 
> thousands of rows into SQL Server.
> 
> 
> 
> If I have done anything wrong, please help me to do it right.
> 
