Hi Mark,

Thanks a lot for the insights. I'm using RouteText because I needed the ${line} attribute. I've separated my disks and added the logging you recommended.

One final question, which I guess is a small optimization. Is it better to:

1) use RouteText with an empty group field, followed by a line-splitting processor, OR
2) use RouteText with a group field of (.*), so that, as my lines are unique, they come out already split?
Thanks!
Stephane

On Thu, Jul 7, 2016 at 1:31 AM Mark Payne <marka...@hotmail.com> wrote:

> Stephane,
>
> So the Processors that you mention there mostly would require that you split your data up into one-line chunks.
>
> When you indicate that the expression you would use is
> "${filename:contains('new'):and(filename:contains('2016'))}"
> that looks like you are routing only on the attributes, not on the content of the text itself. If this is the case, you should use RouteOnAttribute, as it will be much more efficient than RouteText. In general, though, that expression would be much more efficient than using a regex to match against .*new.*2016.*
>
> So I would certainly recommend using RouteOnAttribute and using the Expression Language to route based on attributes. You can also just add two different properties:
>
> containsNew = ${filename:contains('new')}
> is2016 = ${filename:contains('2016')}
>
> And then set the routing strategy to Route to 'match' if all match. This will help make the processor's configuration easier to understand if you look at it again in the future.
>
> Ingesting 1000 packets per second should not be a problem at all on a single node. Some things to consider:
>
> - Ideally, you would have a separate disk for your content repo, your flowfile repo, and your prov repo.
>
> - You may want to change the log level to WARN for processors (by adding <logger name="org.apache.nifi.processors" level="WARN" /> to your conf/logback.xml). This may or may not make a difference, depending on how resource constrained your disks are.
>
> - Making the change above to use RouteOnAttribute will certainly help alleviate pressure on both your CPU and your disk.
>
> - If you don't have enough disks to separate out each of your repositories, I would recommend at least putting the prov repo on its own disk.
>
> - If you do have enough disks, you can stripe the content repo and your prov repo across multiple disks to scale vertically, and you'll see much better performance this way.
>
>
> Thanks
> -Mark
>
>
> On Jul 3, 2016, at 8:27 PM, Stéphane Maarek <stephane.maa...@gmail.com> wrote:
>
> Hi Mark,
>
> 1. I send FlowFiles coming through a ListenUDP, with a batch of 100. So most of the time, the FlowFiles are multiple lines long. Yet, after the RouteText, I get as many FlowFiles as lines, regardless of the grouping parameter. Is that expected?
>
> 2. I have opened a JIRA: https://issues.apache.org/jira/browse/NIFI-2169
>
> I have a few questions:
> Regarding the fact that it's better to operate on text that has many lines, and if I manage to get RouteText to output many lines:
> a) Can ExtractText, ReplaceText, PutMongo, ConvertJSONtoSQL, PutSQL operate on each individual line within a FlowFile? (That's basically all the components in my flow.)
> b) Is the Satisfies Expression ${filename:contains('new'):and(filename:contains('2016'))} going to perform better than the regex .*new.*2016.* ?
> c) I have a lot of data coming in (1000 UDP packets a second), and yes, the provenance database has been cramming because we have 6 processors dealing with this flow before the data exits NiFi. Are there any optimizations I could apply out of the box?
>
> Thanks,
> Stephane
>
> On Fri, Jul 1, 2016 at 10:48 PM Mark Payne <marka...@hotmail.com> wrote:
>
>> Hi Stephane,
>>
>> For #1, when you say that you get as many outputs as lines of text, are you sending in FlowFiles that are only one line of text each? The Processor does not aggregate multiple FlowFiles together, so if you are sending in 1-line FlowFiles, it can only route that FlowFile in 1-line outputs.
>>
>> Re #2: The regular expression is compiled every time.
>> This is done, though, because the Regex allows the Expression Language to be used, so the Regex could actually be different for each FlowFile. That being said, it could certainly be improved by either (a) pre-compiling in the case that no Expression Language is used and/or (b) caching up to, say, 10 Regexes once they are compiled. Do you mind filing a JIRA to improve the efficiency of this processor?
>>
>> Also, when you say that the processor is having trouble keeping up with a batch size of 1, there are a few thoughts that come to mind:
>>
>> * How many concurrent tasks do you have assigned to the processor? Have you tried increasing it?
>> * When processing text in NiFi it is generally going to be much more efficient to process a single FlowFile with many lines, instead of many small FlowFiles, due to the expense of the Data Provenance that has to be generated. There are some things that we can do to improve efficiency of the data provenance as well, but those improvements have generally been made 'high' priority rather than 'extremely high priority' :) so I would expect to see them coming out possibly toward the end of this year, after 1.0 and a few other major features come out.
>> * Rather than using a Regular Expression, the "Satisfies Expression" Matching Strategy is likely to be more efficient in many cases if it is able to provide the routing logic that you need. It also tends to be easier to read than regular expressions, which is nice when you (or someone else) goes back later to modify the flow.
>>
>> Please let me know if anything here doesn't make sense or if you have any more questions.
>>
>> Thanks!
>> -Mark
>>
>>
>> > On Jun 30, 2016, at 9:04 PM, Stéphane Maarek <stephane.maa...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I have a question regarding RouteText.
>> > The processor works just fine for me, but maybe I'm missing a couple of subtleties:
>> >
>> > 1) I have a regex to group data by (a pair of IDs), but what do I use the grouping attribute for? I still get as many outputs as lines.
>> > 2) My data is coming from a ListenUDP. If my batch size is 1, RouteText has a lot of trouble processing all the data. I would guess that it compiles the regex every time it is executed; is that correct? When I increase the batch size to 100, RouteText processes everything well. I was wondering if there could be some sort of optimization in RouteText to keep the regex compiled regardless of the state of the processor?
>> >
>> >
>> > Thanks a lot!
>> > Stephane
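
For reference, the improvement suggested in the thread (caching up to about 10 compiled regexes rather than recompiling for every FlowFile) could look roughly like the sketch below. This is only an illustration in plain Java, not the code used by RouteText: the class name RegexCache and its methods are hypothetical, and the cache size is simply the figure mentioned in the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not part of NiFi): a small least-recently-used
 * cache of compiled regular expressions, keyed by the regex string.
 */
public class RegexCache {
    private final int maxEntries;
    private final Map<String, Pattern> cache;

    public RegexCache(final int maxEntries) {
        this.maxEntries = maxEntries;
        // accessOrder = true keeps entries ordered from least to most
        // recently used, so the "eldest" entry is the LRU one.
        this.cache = new LinkedHashMap<String, Pattern>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(final Map.Entry<String, Pattern> eldest) {
                return size() > RegexCache.this.maxEntries;
            }
        };
    }

    /** Returns a cached Pattern, compiling and storing it on a miss. */
    public synchronized Pattern get(final String regex) {
        Pattern pattern = cache.get(regex);
        if (pattern == null) {
            pattern = Pattern.compile(regex);
            cache.put(regex, pattern);
        }
        return pattern;
    }

    /** Number of compiled patterns currently held. */
    public synchronized int size() {
        return cache.size();
    }

    public static void main(final String[] args) {
        final RegexCache cache = new RegexCache(10);
        // The second lookup reuses the Pattern compiled by the first one.
        final Pattern first = cache.get(".*new.*2016.*");
        final Pattern second = cache.get(".*new.*2016.*");
        System.out.println(first == second);
        System.out.println(second.matcher("new-data-2016.log").matches());
    }
}
```

When the regex contains Expression Language, the idea would be to evaluate the expression for each FlowFile first and then look up the resulting string in the cache, so that only regexes not seen recently are compiled again.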