Uwe,

Enabling developers to cleanly handle extremely large objects is precisely why the ProcessSession interface exposes FlowFile content through InputStreams and OutputStreams. With that model you load only as much content into memory at any one time as you need and nothing more. This is really useful for line-by-line reading and transformation, encryption, compression, and similar cases.
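For illustration, something like the following shows that pattern. This is an untested sketch, not an existing processor; the class name and the toUpperCase() stand-in are made up:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.StreamCallback;

public class LineStreamingSketch extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session)
            throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // Content comes and goes as streams, so only the reader/writer
        // buffers live on the heap no matter how large the FlowFile is.
        flowFile = session.write(flowFile, new StreamCallback() {
            @Override
            public void process(InputStream in, OutputStream out) throws IOException {
                final BufferedReader reader = new BufferedReader(
                        new InputStreamReader(in, StandardCharsets.UTF_8));
                final BufferedWriter writer = new BufferedWriter(
                        new OutputStreamWriter(out, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    writer.write(line.toUpperCase()); // stand-in for a real per-line transform
                    writer.newLine();
                }
                writer.flush();
            }
        });
        session.transfer(flowFile, REL_SUCCESS);
    }
}

At any given moment only the buffered reader/writer chunks and the current line are in memory, which is why this scales to arbitrarily large content.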
Now, there is a gotcha to this model: we do presently hold the FlowFile objects/attributes (not the content) that belong to a ProcessSession in memory. When you have hundreds of thousands or millions of these in a single session, that can eat up a lot of heap. We intend to make even that a non-issue eventually by temporarily swapping the FlowFile objects out to disk, but we're not there yet. In the meantime, knowing the strength of the streaming content access and the limitation around holding tons of FlowFiles in a single session means you can still build really powerful approaches that scale well and are kind to your Java heap.

You'll notice we recommend people run SplitText as a two-stage process when dealing with really large splits: for instance, have the first split produce 5000-line outputs and the second split produce single-line outputs. This only applies when things truly must be split down to their individual form, of course.

So, back to your processor: is it reading in a single large set of rows and putting out many individual rows? Could it put out rows grouped together by some logic, or must you end up with individual splits? A rough sketch of the grouped approach follows the quoted message below.

Thanks
Joe

On Fri, Apr 14, 2017 at 11:14 AM, Uwe Geercken <uwe.geerc...@web.de> wrote:
> Hi everybody,
>
> in my ExecuteRuleEngine processor I have used a LineIterator to loop over
> incoming rows of data (CSV).
>
> I was wondering if that is the preferred method for (large) files or if
> there are better ways to do it. Also, is there a way to influence the
> number of buffered rows or bytes that the LineIterator uses?
>
> I saw that if I use a large file (about 2 million rows) with my processor,
> the flow stops and aborts at a certain point. Probably a memory issue; I
> do not have a lot of RAM available. On the other hand, I can process the
> same large file if I use a SplitText processor before my processor and
> split the file into chunks of e.g. 200000 rows.
>
> Any hints or recommendations are welcome.
>
> Greetings,
>
> Uwe
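As promised above, here is an untested sketch of the grouped-output idea. GroupedSplitSketch, REL_SPLITS, and the 5000-line group size are all illustrative assumptions, not an existing processor:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.InputStreamCallback;

public class GroupedSplitSketch extends AbstractProcessor {

    static final Relationship REL_SPLITS = new Relationship.Builder()
            .name("splits").build();

    static final int LINES_PER_SPLIT = 5000; // tune to your heap budget

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SPLITS);
    }

    @Override
    public void onTrigger(ProcessContext context, final ProcessSession session)
            throws ProcessException {
        final FlowFile original = session.get();
        if (original == null) {
            return;
        }
        session.read(original, new InputStreamCallback() {
            @Override
            public void process(InputStream in) throws IOException {
                final BufferedReader reader = new BufferedReader(
                        new InputStreamReader(in, StandardCharsets.UTF_8));
                String line = reader.readLine();
                while (line != null) {
                    // Buffer at most LINES_PER_SPLIT lines, then flush
                    // them out as a single split FlowFile.
                    final StringBuilder group = new StringBuilder();
                    for (int i = 0; i < LINES_PER_SPLIT && line != null; i++) {
                        group.append(line).append('\n');
                        line = reader.readLine();
                    }
                    FlowFile split = session.create(original);
                    split = session.write(split, out -> out.write(
                            group.toString().getBytes(StandardCharsets.UTF_8)));
                    session.transfer(split, REL_SPLITS);
                }
            }
        });
        session.remove(original);
    }
}

The heap cost is bounded by one group of lines plus one FlowFile object per split, rather than one FlowFile object per row, which is the same reason the two-stage SplitText recommendation works.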