Joe,

actually, this is a good question. My processor needs to run on an individual
line to apply the business rules to the data. The idea is to then output
individual rows/flow files. These flow files carry several attributes
indicating the state of the rule engine (the results). In a next step the
user could decide how to continue, e.g. filter out or drop those flow files
which failed.
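
For illustration, per row the output looks roughly like this (a simplified
sketch in Java, inside onTrigger with session and the original flow file in
scope; the attribute names here are just examples, not the ones my processor
actually sets):

    // one flow file per input row, with the rule engine results as attributes
    FlowFile row = session.create(original);
    row = session.write(row, out -> out.write(rowBytes));   // the row's CSV content
    row = session.putAttribute(row, "ruleengine.passed", String.valueOf(passed));
    row = session.putAttribute(row, "ruleengine.failed.rules", String.valueOf(failedRuleCount));
    session.transfer(row, REL_SUCCESS);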

If I output multiple rows in one flow file, I would keep more rows in memory
for a longer period of time, and the results from the rule engine are
individual per row. So combining them does not seem to work well.

My input file has about two million rows of data. I will see if adding more
memory helps, to find out whether that is the problem.

But what about my other question: is looping over a LineIterator the right
way to go? And what does SplitText do differently so that it can handle the
large file, while my processor using LineIterator cannot? I looked at the
source and saw that SplitText uses TextLineDemarcator to process the rows,
but I don't understand the details of it.
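
For reference, here is roughly what my loop looks like at the moment (a
simplified sketch; the rule engine call is only indicated):

    session.read(flowFile, in -> {
        LineIterator it = IOUtils.lineIterator(in, StandardCharsets.UTF_8);
        try {
            while (it.hasNext()) {
                String row = it.nextLine();
                // run the rule engine against the row and remember the results
            }
        } finally {
            LineIterator.closeQuietly(it);
        }
    });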

Rgds,

Uwe 
 

Sent: Friday, 14 April 2017 at 17:21
From: "Joe Witt" <joe.w...@gmail.com>
To: dev@nifi.apache.org
Subject: Re: InputStream and LineIterator
Uwe

Enabling developers to cleanly handle extremely large objects is
precisely why the ProcessSession interface to interact with FlowFile
content occurs through Input and OutputStreams. With that model you
can load just as much content as you need into memory at any one time
and nothing more. This is really useful for a lot of line-by-line
reading and transformation, encryption, compression, and similar
cases.
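
A minimal sketch of that pattern (assuming a hypothetical transformLine
method that operates on one line at a time):

    flowFile = session.write(flowFile, new StreamCallback() {
        @Override
        public void process(InputStream in, OutputStream out) throws IOException {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8));
            BufferedWriter writer = new BufferedWriter(
                    new OutputStreamWriter(out, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(transformLine(line));  // only one line in memory at a time
                writer.newLine();
            }
            writer.flush();
        }
    });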

Now, there is a gotcha to this model in that we do presently hold the
FlowFile objects/attributes (not the content) that we've got as part
of that ProcessSession in memory. When you have hundreds of
thousands/millions of these in a single session it can eat up a lot of
heap. We do intend to make even that not matter eventually by
swapping out the flowfile objects to disk temporarily but we're not
there yet.

In the meantime, knowing the strength of the content access and the
limitation around having tons of flowfiles in a single session means
you can still build really powerful approaches that scale well and are
nice to your Java heap.

You'll notice that we recommend people do SplitText in a two-stage
process when they're dealing with really large splits. For instance,
have the first split generate 5000-line outputs, then have the second
split produce single-line outputs. This is only needed when they must
get things split down to their individual form, of course.

So, back to your processor: is it reading in a single large set of
rows and putting out many individual rows? Could you have it put out
rows grouped together by some logic, or must you end up with
individual splits?
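
For example (just a sketch; the helper, the batch size, and the
"record.count" attribute name are made up), grouping could look something
like:

    // write one flow file per batch of rows instead of one per row, so far
    // fewer flow file objects live in the session at once
    private void transferBatch(ProcessSession session, FlowFile parent,
                               List<String> rows, Relationship rel) {
        FlowFile batch = session.create(parent);
        batch = session.write(batch, out -> {
            for (String row : rows) {
                out.write(row.getBytes(StandardCharsets.UTF_8));
                out.write('\n');
            }
        });
        batch = session.putAttribute(batch, "record.count", String.valueOf(rows.size()));
        session.transfer(batch, rel);
    }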

Thanks
Joe

On Fri, Apr 14, 2017 at 11:14 AM, Uwe Geercken <uwe.geerc...@web.de> wrote:
> Hi everybody,
>
> in my ExecuteRuleEngine processor I have used a LineIterator to loop over
> incoming rows of data (CSV).
>
> I was wondering if that is the preferred method for (large) files or if there 
> are better ways to do it. Also, is there a way to influence the number of 
> buffered rows or bytes that the LineIterator uses?
>
> I saw that if I use a large file (about 2 million rows) with my processor,
> then the flow stops and aborts at a certain point. Probably a memory issue.
> I do not have a lot of RAM available. On the other hand, I can process the
> same large file if I use a SplitText processor before my processor and split
> the file into chunks of e.g. 200000 rows.
>
> Any hints or recommendations are welcome.
>
> Greetings,
>
> Uwe
