Hi Mark,

Thanks a lot for the insights. I'm using RouteText because I needed the
${line} attribute. I've separated my disks and added the logging you
recommended.
One final question, which I guess is a small optimization:
Is it better to
1) use RouteText with an empty group field, followed by a processor that
splits the lines (e.g. SplitText), OR
2) use RouteText with a group field of (.*), so that, as my lines are
unique, they come out already split?
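
To make option 2 concrete, here is a minimal sketch in plain Java. It is not NiFi code: the class and method names are hypothetical, and it assumes the grouping regex places lines with equal capture-group values in the same output and that the regex has at least one capture group. Under those assumptions, a group field of (.*) gives one group per distinct line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical simulation of line grouping; this is not the RouteText implementation.
class LineGrouping {
    // Groups lines by the value of capture group 1 of groupingRegex
    // (assumes the regex defines at least one capture group).
    // Lines that do not match fall into a single group keyed by the empty string.
    static Map<String, List<String>> group(List<String> lines, String groupingRegex) {
        Pattern pattern = Pattern.compile(groupingRegex);
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            String key = m.matches() ? m.group(1) : "";
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(line);
        }
        return groups;
    }
}
```

With unique lines, every group holds exactly one line, so the result has the same shape as splitting per line. Which option is cheaper inside NiFi is not something this sketch can show.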

Thanks!
Stephane

On Thu, Jul 7, 2016 at 1:31 AM Mark Payne <marka...@hotmail.com> wrote:

> Stephane,
>
> So the Processors that you mention there mostly would require that you
> split your data up into one-line chunks.
>
> When you indicate that the expression you would use is
> "${filename:contains('new'):and(filename:contains('2016'))}"
> that looks like you are routing only on the attributes, not on the content
> of the text itself. If this is the case, you should
> use RouteOnAttribute, as it will be much more efficient than RouteText. In
> general, though, that expression would be
> much more efficient than using a regex to match against .*new.*2016.*
>
> So I would certainly recommend using RouteOnAttribute and using the
> Expression Language to route based on attributes.
> You can also just add two different properties:
>
> containsNew = ${filename:contains('new')}
> is2016 = ${filename:contains('2016')}
>
> And then set the routing strategy to Route to 'match' if all match. This
> will help make the processor's configuration easier
> to understand if you look at it again in the future.
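
One thing worth checking before swapping the regex for the two contains() checks: they are not strictly equivalent. A minimal sketch in plain Java (hypothetical class and method names, not NiFi code) shows that the regex requires "new" to come before "2016", while the two contains() checks accept either order.

```java
import java.util.regex.Pattern;

// Illustrative comparison only; the class and method names are hypothetical.
class FilenameRouting {
    // Regex from the thread: requires "new" to appear before "2016".
    private static final Pattern NEW_THEN_2016 = Pattern.compile(".*new.*2016.*");

    // Mirrors the two contains() checks from the thread; order does not matter.
    static boolean matchesByContains(String filename) {
        return filename.contains("new") && filename.contains("2016");
    }

    // Full-string match against the precompiled regex.
    static boolean matchesByRegex(String filename) {
        return NEW_THEN_2016.matcher(filename).matches();
    }

    public static void main(String[] args) {
        String[] names = {"new_data_2016.log", "2016_new_data.log", "old_2016.log"};
        for (String name : names) {
            System.out.println(name
                    + " contains=" + matchesByContains(name)
                    + " regex=" + matchesByRegex(name));
        }
    }
}
```

If the filenames always put "new" ahead of the year, the two approaches route identically. Otherwise the contains() version matches a slightly wider set.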
>
> Ingesting 1000 packets per second should not be a problem at all on a
> single node. Some things to consider:
>
> - Ideally, you would have a separate disk for your content repo, your
> flowfile repo, and your prov repo.
>
> - You may want to change the log level to WARN for processors (by adding
> to your conf/logback.xml <logger name="org.apache.nifi.processors"
> level="WARN" />)
>   This may or may not make a difference, depending on how resource
> constrained your disks are.
>
> - Making the change above to use RouteOnAttribute will certainly help
> alleviate pressure on both your CPU and your disk.
>
> - If you don't have enough disks to separate out each of your
> repositories, I would recommend at least putting the prov repo on its
> own disk.
>
> - If you do have enough disks, you can stripe the content repo and your
> prov repo across multiple disks to scale vertically, and you'll
>   see much better performance this way.
>
>
> Thanks
> -Mark
>
>
> On Jul 3, 2016, at 8:27 PM, Stéphane Maarek <stephane.maa...@gmail.com>
> wrote:
>
> Hi Mark,
>
> 1. My FlowFiles come in through a ListenUDP with a batch size of 100, so
> most of the time the FlowFiles are multiple lines long. Yet, after the
> RouteText, I get as many FlowFiles as lines, regardless of the grouping
> parameter. Is that expected?
>
> 2. I have opened a JIRA: https://issues.apache.org/jira/browse/NIFI-2169
>
> I have a few questions:
> Regarding the fact that it's better to operate on text that has many
> lines, and if I manage to get RouteText to output many lines:
>  a) Can ExtractText, ReplaceText, PutMongo, ConvertJSONtoSQL, PutSQL
> operate on each individual line within a flowfile? (that's basically all
> the components in my flow)
> b) Is the Satisfies Expression strategy with
> ${filename:contains('new'):and(filename:contains('2016'))} going to perform
> better than the regex .*new.*2016.* ?
> c) I have a lot of data coming in (1000 UDP packets a second), and yes,
> the provenance database has been cramming because we have 6 processors
> dealing with this flow before the data exits NiFi. Are there any
> optimizations I could apply out of the box?
>
> Thanks,
> Stephane
>
> On Fri, Jul 1, 2016 at 10:48 PM Mark Payne <marka...@hotmail.com> wrote:
>
>> Hi Stephane,
>>
>> For #1, when you say that you get as many output as lines of text, are
>> you sending in FlowFiles that are only
>> one line of text each? The Processor does not aggregate multiple
>> FlowFiles together, so if you are sending in
>> 1-line FlowFiles, it can only route that FlowFile in 1-line outputs.
>>
>> Re #2: The regular expression is compiled every time. This is done,
>> though, because the Regex allows the Expression
>> Language to be used, so the Regex could actually be different for each
>> FlowFile. That being said, it could certainly be
>> improved by either (a) pre-compiling in the case that no Expression
>> Language is used and/or (b) cache up to say 10
>> Regex'es once they are compiled. Do you mind filing a JIRA to improve the
>> efficiency of this processor?
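
Option (b) above could look roughly like the following. This is a minimal sketch in plain Java with a hypothetical PatternCache class; it is not the RouteText implementation and not necessarily how the improvement would be made in NiFi. It keeps a bounded number of compiled patterns and evicts the least recently used one, so a regex that differs per FlowFile still works, while a repeated regex is compiled only once.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not NiFi code: a small LRU cache of compiled patterns.
class PatternCache {
    private final int maxEntries;
    private final Map<String, Pattern> cache;

    PatternCache(int maxEntries) {
        this.maxEntries = maxEntries;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, Pattern>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Pattern> eldest) {
                // Evict the least recently used pattern once the cache is over capacity.
                return size() > PatternCache.this.maxEntries;
            }
        };
    }

    // Returns a cached Pattern, compiling only on a cache miss.
    synchronized Pattern get(String regex) {
        return cache.computeIfAbsent(regex, Pattern::compile);
    }

    // Number of patterns currently held.
    synchronized int size() {
        return cache.size();
    }
}
```

A caller would create one cache per processor instance, for example new PatternCache(10), and call get(regex) with the regex after the Expression Language has been evaluated for the current FlowFile.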
>>
>> Also, when you say that the processor is having trouble keeping up with a
>> batch size of 1, there are a few thoughts that
>> come to mind:
>>
>> * How many concurrent tasks do you have assigned to the processor? Have
>> you tried increasing it?
>> * When processing text in NiFi it is generally going to be much more
>> efficient to process a single FlowFile with many lines,
>> instead of many small FlowFiles, due to the expense of the Data
>> Provenance that has to be generated. There are some things
>> that we can do to improve efficiency of the data provenance as well, but
>> those improvements have generally been made
>> 'high' priority rather than 'extremely high priority' :) so I would
>> expect to see them coming out possibly toward the end of this year,
>> after 1.0 and a few other major features come out.
>> * Rather than using a Regular Expression, the "Satisfies Expression"
>> Matching Strategy is likely to be more efficient in many cases
>> if it is able to provide the routing logic that you need. It also tends
>> to be easier to read than regular expressions, which is nice when
>> you (or someone else) go back later to modify the flow.
>>
>> Please let me know if anything here doesn't make sense or if you have any
>> more questions.
>>
>> Thanks!
>> -Mark
>>
>>
>> > On Jun 30, 2016, at 9:04 PM, Stéphane Maarek <stephane.maa...@gmail.com>
>> wrote:
>> >
>> > Hi,
>> >
>> > I have a question regarding RouteText. The processor works just fine
>> for me but maybe I'm missing a couple subtleties:
>> >
>> > 1) I have a regex to group data by (a pair of IDs), but what do I use
>> the grouping attribute for? I still get as many outputs as lines
>> > 2) My data is coming from a ListenUDP. If my batch size is 1, RouteText
>> is having a lot of trouble processing all the data. I would guess that it
>> compiles the regex every time it is executed, is that correct? When I increase
>> the batch size to 100, RouteText processes everything well. I was wondering
>> if there could be some sort of optimization in RouteText to keep the
>> regex compiled regardless of the state of the processor?
>> >
>> >
>> > Thanks a lot!
>> > Stephane
>>
>>
>