Excellent information here. Thanks Lee and Peter.

Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
> On Oct 24, 2016, at 6:57 AM, Peter Wicks (pwicks) <pwi...@micron.com> wrote:
>
> Prabhu,
>
> Lee mentioned making sure you have good indexes, but I would caution you on this point. If you have a unique constraint, then SQL Server will build an index on it automatically, but I would suggest dropping all other indexes that aren't related to data integrity. Each time SQL Server updates an indexed column, it also has to update that index, which adds a lot of overhead.
>
> You might be thinking that you need these indexes for user queries, though. To work around this, I often see the use of a staging table. This table has no indexes beyond the absolute minimum needed to ensure data integrity, and sometimes even these are removed and data integrity/duplicate removal is handled through SQL or a stored procedure. A periodic job moves all data from this staging table into a final table. Executing the copy and the truncate in a single transaction lets you do this safely:
>
> INSERT INTO Some_Final_Table SELECT * FROM Staging_Table_With_Exact_Same_Schema;
> TRUNCATE TABLE Staging_Table_With_Exact_Same_Schema;
>
> Done this way, you can keep the indexes you need for user access while still allowing maximum data throughput into SQL Server.
>
> I've seen a lot of comments online about a batch size of around 500 being optimal, but of course this will vary with system configuration, on both your NiFi server and the SQL Server.
>
> I have had issues getting good performance out of PutSQL even with the above. I don't think this is the fault of the processor; it's more that JDBC batch row processing wasn't really designed for this kind of volume. In my case I was trying to push about 10M rows over a longer time period and was still running into trouble. After working on the issue for a while I found that a database-specific loader was needed. I am loading to Teradata, so I wrote a Teradata FastLoad processor. In your case, the MS SQL Server JDBC driver includes a `SQLServerBulkCopy` loader: https://msdn.microsoft.com/en-us/library/mt221490%28v=sql.110%29.aspx. Unfortunately, this would require writing code, either through a scripted processor or as a whole new processor.
>
> Since writing a custom processor may be more than you want to jump into right now, you should probably take a look at `bcp`. I didn't catch whether you are on Windows or a Unix platform, but if you are on Windows I'd check out the command-line Bulk Copy Program for MS SQL: https://msdn.microsoft.com/en-us/library/ms162802.aspx. Using it would let you prepare your data in an intermediate format, such as CSV, and then send it all at once through `bcp`.
>
> Thanks,
> Peter Wicks
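Peter's copy-plus-truncate needs to run atomically to be safe. A minimal JDBC sketch of that step, assuming the table names from his example and a placeholder connection string (TRUNCATE participates in transactions on SQL Server, so a failure rolls back both statements):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class StagingSwap {
        public static void main(String[] args) throws Exception {
            // Placeholder connection string; adjust server, database, and credentials.
            String url = "jdbc:sqlserver://localhost;databaseName=MyDb;"
                    + "user=loader;password=changeme";
            try (Connection conn = DriverManager.getConnection(url)) {
                conn.setAutoCommit(false); // both statements commit or roll back together
                try (Statement stmt = conn.createStatement()) {
                    stmt.executeUpdate(
                            "INSERT INTO Some_Final_Table "
                          + "SELECT * FROM Staging_Table_With_Exact_Same_Schema");
                    // TRUNCATE can be rolled back inside a SQL Server transaction.
                    stmt.executeUpdate("TRUNCATE TABLE Staging_Table_With_Exact_Same_Schema");
                    conn.commit();
                } catch (Exception e) {
                    conn.rollback();
                    throw e;
                }
            }
        }
    }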
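And a rough sketch of what the `SQLServerBulkCopy` route Peter mentions might look like for a pipe-delimited file such as the one later in this thread. The connection string, file name, column types, and batch size are illustrative assumptions, and the delimiter is escaped on the assumption that the driver splits on it as a regular expression:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Types;

    import com.microsoft.sqlserver.jdbc.SQLServerBulkCSVFileRecord;
    import com.microsoft.sqlserver.jdbc.SQLServerBulkCopy;
    import com.microsoft.sqlserver.jdbc.SQLServerBulkCopyOptions;

    public class BulkLoadSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connection string and input path.
            String url = "jdbc:sqlserver://localhost;databaseName=MyDb;"
                    + "user=loader;password=changeme";
            try (Connection conn = DriverManager.getConnection(url)) {
                // Four pipe-delimited columns, no header row.
                SQLServerBulkCSVFileRecord record =
                        new SQLServerBulkCSVFileRecord("input.dat", "UTF-8", "\\|", false);
                for (int i = 1; i <= 4; i++) {
                    record.addColumnMetadata(i, "data" + i, Types.INTEGER, 0, 0);
                }
                SQLServerBulkCopy bulkCopy = new SQLServerBulkCopy(conn);
                SQLServerBulkCopyOptions options = new SQLServerBulkCopyOptions();
                options.setBatchSize(10000); // send rows in chunks rather than one at a time
                bulkCopy.setBulkCopyOptions(options);
                bulkCopy.setDestinationTableName("Staging_Table_With_Exact_Same_Schema");
                bulkCopy.writeToServer(record);
                bulkCopy.close();
            }
        }
    }

Either way, the load bypasses per-row INSERT statements, which is where JDBC batching tends to fall down at this volume.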
> From: Lee Laim [mailto:lee.l...@gmail.com]
> Sent: Monday, October 24, 2016 7:17 AM
> To: users@nifi.apache.org
> Subject: Re: How to increase the processing speed of the ExtractText and ReplaceText Processor?
>
> Hello Prabhu,
>
> 50 minutes is a good start! Now we have to determine where the next bottleneck is: check to see where the flow files are queueing. You can also check the "average task duration" statistic for each processor. I suspect the bottleneck is at PutSQL and will carry this assumption forward.
>
> There are several knobs you can adjust at the assumed PutSQL bottleneck:
>
> 1. Increase the run duration so the PutSQL processor keeps running for 2 seconds before releasing the thread.
> 2. Set Fragmented Transactions to false. This removes constraints that take time to check.
> 3. Consider changing the batch size systematically and observe the throughput changes. I'd move up in increments of 100.
> 4*. Increase the number of concurrent tasks for the bottleneck processor to 3 or higher. Increase it systematically and observe whether you get more flow files through. You can increase the max timer-driven threads of the NiFi instance in the NiFi Flow Settings (top right of the canvas). You can set the max to 25, but you are truly limited by hardware here. Consider a more powerful system to manage this flow, especially with the time constraint you have. It is often easier to throw more hardware at the problem than to debug.
>
> Other areas:
>
> 5. On the output of the last SplitText processor, set a back pressure object threshold of 1. This will slow (temporarily stop) the first SplitText processor and reduce the overall number of flow files to manage. It also reduces NiFi's demand on the CPU threads.
> 6. Increase nifi.queue.swap.threshold in nifi.properties to reduce disk access.
> 7. Check the connection/load on the SQL Server.
>
> To address your queries: I used the same expression you provided, (.+)[|](.+)[|](.+)[|](.+). You can use an ExecuteStreamCommand processor to 'extract text', but in this case, with small flow files, it won't offer much gain.
>
> * With an i5 processor, you have 4 CPU threads to process flow files, manage NiFi, read/write to disk, and handle all other non-NiFi processes. Moving to an i7 or Xeon with hyper-threading will give NiFi more resources to really get work done. While clustering is great for increasing throughput, I wouldn't recommend clustering a set of smaller i5 systems, as there is added communication overhead that also has to be handled by the CPU. It can be done, but there are easier ways to increase throughput at this point in development.
>
> Hope this helps. Also, if anything I have stated is contrary to what others have observed, please chime in.
>
> Thanks,
> Lee
>
> On Thu, Oct 20, 2016 at 6:02 PM, prabhu Mahendran <prabhuu161...@gmail.com> wrote:
>
> Lee,
>
> I have tried your suggested flow, which was able to insert the data into SQL Server in 50 minutes, but that still takes a long time.
>
> ==> Your query: You might be processing the entire .dat file (instead of a single row) for each record.
>
> How can I process the entire .dat file into SQL Server?
>
> ==> Your query: Without any new optimizations you'll need ~25 threads and sufficient memory to feed the threads.
>
> My processors run with only 10 threads when I set concurrent tasks. How do I increase this to 25 threads?
>
> Since you tried a quick test, please share the regex you used.
>
> Is there any other processor with functionality like ExtractText?
>
> Thanks
>
> On Wed, Oct 19, 2016 at 11:29 PM, Lee Laim <lee.l...@gmail.com> wrote:
>
> Prabhu,
>
> In order to move 3M rows in 10 minutes, you'll need to process 5,000 rows/second. During your 4-hour run, you were processing ~200 rows/second.
>
> Without any new optimizations you'll need ~25 threads and sufficient memory to feed the threads.
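A quick back-of-envelope check of Lee's rates, as a sketch (the row count and durations are the ones quoted above):

    public class ThroughputCheck {
        public static void main(String[] args) {
            double rows = 3_000_000;
            double targetRate = rows / (10 * 60);    // 10-minute goal: 5000 rows/s
            double observedRate = rows / (4 * 3600); // 4-hour run: ~208 rows/s
            // Prints: need 5000 rows/s, observed 208 rows/s, gap ~24x
            System.out.printf("need %.0f rows/s, observed %.0f rows/s, gap ~%.0fx%n",
                    targetRate, observedRate, targetRate / observedRate);
        }
    }

The ~24x gap is what Lee rounds to ~25 threads, assuming throughput scales roughly linearly with threads.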
I agree with Mark: you should be able to get far more than 200 rows/second.
>
> I ran a quick test using your ExtractText regex on similar data and was able to process over 100,000 rows/minute through the ExtractText processor. The input data was a single row of 4 fields delimited by the "|" symbol.
>
> You might be processing the entire .dat file (instead of a single row) for each record. Can you check the flow file attributes and content going into ExtractText?
>
> Here is the flow with some notes:
>
> 1. GetFile (a 30 MB .dat file consisting of 3M rows; each row is about 10 bytes)
> 2. SplitText -> SplitText (to break the 3M rows down to manageable chunks of 10,000 lines per flow file, then split again to 1 line per flow file)
> 3. ExtractText to extract the 4 fields
> 4. ReplaceText to generate JSON (you can alternatively use AttributesToJSON here)
> 5. ConvertJSONToSQL
> 6. PutSQL (this should be the true bottleneck; index the DB well and use many threads)
>
> If my assumptions are incorrect, please let me know.
>
> Thanks,
> Lee
>
> On Thu, Oct 20, 2016 at 1:43 AM, Kevin Verhoeven <kevin.verhoe...@ds-iq.com> wrote:
>
> I'm not clear on how much data you are processing. Does the data (.dat) file have 300,000 rows?
>
> Kevin
>
> From: prabhu Mahendran [mailto:prabhuu161...@gmail.com]
> Sent: Wednesday, October 19, 2016 2:05 AM
> To: users@nifi.apache.org
> Subject: Re: How to increase the processing speed of the ExtractText and ReplaceText Processor?
>
> Mark,
>
> Thanks for the response.
>
> My sample input data (.dat) looks like this:
>
> 1|2|3|4
> 6|7|8|9
> 11|12|13|14
>
> In ExtractText, I have added only the input-row property in addition to the default properties, as in the screenshot below.
>
> <image001.png>
>
> In ReplaceText, I just replace the value like this:
>
> {"data1":"${inputrow.1}","data2":"${inputrow.2}","data3":"${inputrow.3}","data4":"${inputrow.4}"}
>
> <image002.png>
>
> There are no bulletins indicating back pressure on the processors.
>
> Can I know the prerequisites needed to move the 300,000 rows into SQL Server within 10-20 minutes? How many CPUs are needed? How much heap and PermGen size do we need to set to move that data into SQL Server?
>
> Thanks
>
> On Tue, Oct 18, 2016 at 7:05 PM, Mark Payne <marka...@hotmail.com> wrote:
>
> Prabhu,
>
> Thanks for the details. All of this seems fairly normal. Given that you have only a single core, I don't think multiple concurrent tasks will help you. Can you share your configuration for ExtractText and ReplaceText? Depending on the regexes being used, they can be extremely expensive to evaluate. The regex that you mentioned in the other email, "(.+)[|](.+)[|](.+)[|](.+)", is in fact extremely expensive. Any time that you have ".*" or ".+" in your regex, it is going to be extremely expensive, especially with longer FlowFile content.
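One common way around the cost Mark describes, sketched here for illustration, is to replace the greedy groups with negated character classes so the engine never backtracks across the "|" delimiters:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexCost {
        public static void main(String[] args) {
            // Greedy ".+" groups must repeatedly backtrack to find the "|" boundaries.
            Pattern greedy = Pattern.compile("(.+)[|](.+)[|](.+)[|](.+)");
            // Negated classes can never cross a delimiter, so matching is a single pass.
            Pattern linear = Pattern.compile("([^|]+)\\|([^|]+)\\|([^|]+)\\|([^|]+)");

            Matcher m = linear.matcher("11|12|13|14");
            if (m.matches()) {
                System.out.println(m.group(1) + " .. " + m.group(4)); // prints: 11 .. 14
            }
        }
    }

On a well-formed 4-field line both patterns capture the same groups; the difference is how much backtracking the engine does to get there.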
> Also, do you see any bulletins indicating that the provenance repository is applying backpressure? Given that you are splitting your FlowFiles into individual lines, the provenance repository may be under a lot of pressure.
>
> Another thing to check is how much garbage collection is occurring. This can certainly destroy your performance quickly. You can get this information by going to the Summary Table in the top-right of the UI and then clicking the "System Diagnostics" link in the bottom-right corner of that Summary Table.
>
> Thanks
> -Mark
>
> On Oct 18, 2016, at 1:31 AM, prabhu Mahendran <prabhuu161...@gmail.com> wrote:
>
> Mark,
>
> Thanks for your response. Please find my responses to your questions below.
>
> ==> The first processor that you see that exhibits poor performance is ExtractText, correct?
> Yes, ExtractText exhibits poor performance.
>
> ==> How big is your Java heap?
> I have set 1 GB for the Java heap.
>
> ==> Do you have back pressure configured on the connection between ExtractText and ReplaceText?
> There is no back pressure between ExtractText and ReplaceText.
>
> ==> When you say that you specify concurrent tasks, what are you configuring the concurrent tasks to be?
> I have set concurrent tasks to 2 for the ExtractText processor, due to its slower processing rate. This is specified in the Concurrent Tasks text box.
>
> ==> Have you changed the maximum number of concurrent tasks available to your dataflow?
> No, I haven't changed it.
>
> ==> How many CPUs are available on this machine?
> Only a single CPU is available on this machine, a Core i5 @ 2.20GHz.
>
> ==> Are these the only processors in your flow, or do you have other dataflows going on in the same instance of NiFi?
> Yes, these are the only processors in the running workflow, and no other instances are running.
>
> Thanks
>
> On Mon, Oct 17, 2016 at 6:08 PM, Mark Payne <marka...@hotmail.com> wrote:
>
> Prabhu,
>
> Certainly, the performance that you are seeing, taking 4-5 hours to move 3M rows into SQL Server, is far from ideal, but the good news is that it is also far from typical. You should be able to see far better results.
>
> To help us understand what is limiting the performance, and to make sure that we understand what you are seeing, I have a series of questions.
>
> The first processor that you see that exhibits poor performance is ExtractText, correct? Can you share the configuration that you have for that processor?
>
> How big is your Java heap? This is configured in conf/bootstrap.conf; by default it is configured as:
>
> java.arg.2=-Xms512m
> java.arg.3=-Xmx512m
>
> Do you have backpressure configured on the connection between ExtractText and ReplaceText?
>
> Also, when you say that you specify concurrent tasks, what are you configuring the concurrent tasks to be? Have you changed the maximum number of concurrent tasks available to your dataflow? By default, NiFi will use only 10 threads max. How many CPUs are available on this machine?
>
> And finally, are these the only processors in your flow, or do you have other dataflows going on in the same instance of NiFi?
>
> Thanks
> -Mark
>
> On Oct 17, 2016, at 3:35 AM, prabhu Mahendran <prabhuu161...@gmail.com> wrote:
>
> Hi All,
>
> I have tried to perform the below operation:
>
> .dat file (input) --> JSON --> SQL --> SQL Server
>
> GetFile --> SplitText --> SplitText --> ExtractText --> ReplaceText --> ConvertJSONToSQL --> PutSQL
>
> My input file (.dat) has 300,000 rows.
>
> Objective: move the data from the .dat file into SQL Server.
>
> I am able to store the data in SQL Server using the combination of processors above, but it takes almost 4-5 hours to move the complete data into SQL Server.
> The combination of SplitTexts reads the data quickly, but ExtractText takes a long time to pass along data that matches the user-defined expression. Even when the input is 107 MB, it sends outputs only in KB sizes, and the ReplaceText processor likewise processes data only in KB sizes.
>
> This slow processing is what makes loading the data into SQL Server take so much time.
>
> The ExtractText, ReplaceText, and ConvertJSONToSQL processors send outgoing flow files in kilobytes only.
>
> If I specify concurrent tasks for ExtractText, ReplaceText, and ConvertJSONToSQL, they occupy 100% of the CPU and disk usage.
>
> It is just 30 MB of data, but the processors take 6 hours to move it into SQL Server.
>
> The problems I face are:
>
> - It takes almost 6 hours to move the 300,000 rows into SQL Server.
> - ExtractText and ReplaceText take a long time to process the data (they send output flow files in KB sizes only).
>
> Can anyone help me with the requirement below?
>
> I need to reduce the time the processors take to move hundreds of thousands of rows into SQL Server.
>
> If I have done anything wrong, please help me get it right.
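For reference, the per-row transformation that the ExtractText/ReplaceText pair performs amounts to the following sketch (not NiFi code; the data1..data4 field names follow the JSON template quoted earlier, and splitting on the delimiter side-steps regex backtracking entirely):

    public class RowToJson {
        // One pipe-delimited row in, one JSON object out.
        static String toJson(String line) {
            String[] f = line.split("\\|", -1); // "11|12|13|14" -> ["11","12","13","14"]
            return String.format(
                    "{\"data1\":\"%s\",\"data2\":\"%s\",\"data3\":\"%s\",\"data4\":\"%s\"}",
                    f[0], f[1], f[2], f[3]);
        }

        public static void main(String[] args) {
            // Prints: {"data1":"11","data2":"12","data3":"13","data4":"14"}
            System.out.println(toJson("11|12|13|14"));
        }
    }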