It seems the need for multiple SplitTexts arises only because we keep all 
the splits in memory until we can get a count to put in an attribute; maybe 
we should make that configurable for cases where the user isn't going to 
use the fragment.* attributes. If they are going to use them, then multiple 
splits can be an issue anyway, because the fragment.* attributes keep 
getting overwritten. If you need a total index (for the upcoming 
EnforceOrder processor, for example), you need an UpdateAttribute between 
each split to save off the parent attributes and calculate the total index. 
I have an example template for this; if anyone is interested I can put it 
up as a Gist or something.
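
A rough sketch of that pattern (the attribute names here are made up for 
illustration, and it assumes the first split produces fixed 5000-line 
chunks and that fragment.index is 1-based; adjust if your version differs). 
The UpdateAttribute between the two splits saves off the parent's value:

   parent.fragment.index  =>  ${fragment.index}

and after the second split another UpdateAttribute computes the total 
index with the Expression Language:

   total.index  =>  ${parent.fragment.index:toNumber():minus(1):multiply(5000):plus(${fragment.index:toNumber()})}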

Regards,
Matt


> On Apr 14, 2017, at 11:21 AM, Joe Witt <joe.w...@gmail.com> wrote:
> 
> Uwe
> 
> Enabling developers to cleanly handle extremely large objects is
> precisely why the ProcessSession interface for interacting with FlowFile
> content works through InputStreams and OutputStreams.  With that model
> you can load just as much content as you need into memory at any one
> time and nothing more.  This is really useful for line-by-line reading
> and transformation, encryption, compression, and similar kinds of cases.
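> 
> As a minimal illustration of that model (a sketch, not code from this
> thread; it assumes the usual org.apache.nifi.processor imports and a
> flowFile and session in scope inside onTrigger):
> 
>    // Stream the FlowFile content; only the current line is ever
>    // held in memory, no matter how large the content is.
>    session.read(flowFile, new InputStreamCallback() {
>        @Override
>        public void process(final InputStream in) throws IOException {
>            final BufferedReader reader = new BufferedReader(
>                    new InputStreamReader(in, StandardCharsets.UTF_8));
>            String line;
>            while ((line = reader.readLine()) != null) {
>                // transform, encrypt, compress, etc., line by line
>            }
>        }
>    });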
> 
> Now, there is a gotcha to this model: we do presently hold the
> FlowFile objects/attributes (not the content) that are part of the
> ProcessSession in memory.  When you have hundreds of thousands or
> millions of these in a single session, that can eat up a lot of heap.
> We do intend to make even that not matter eventually, by swapping the
> FlowFile objects out to disk temporarily, but we're not there yet.
> 
> In the meantime, knowing the strength of the streaming content access
> and the limitation around holding tons of FlowFiles in a single session,
> you can still build really powerful approaches that scale well and are
> nice to your Java heap.
> 
> You'll notice that we recommend people do SplitText in a two-stage
> process when they're dealing with really large splits: for instance,
> have the first split generate 5000-line outputs and the second split
> produce single-line outputs.  This is only necessary when things must
> be split down to their individual form, of course.
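> 
> Concretely, that's just two SplitText processors in series, e.g.:
> 
>    SplitText #1:  Line Split Count = 5000
>    SplitText #2:  Line Split Count = 1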
> 
> So, back to your processor: is it reading in a single large set of
> rows and putting out many individual rows?  Could you have it put out
> rows grouped together by some logic, or must you end up with individual
> splits?
> 
> Thanks
> Joe
> 
>> On Fri, Apr 14, 2017 at 11:14 AM, Uwe Geercken <uwe.geerc...@web.de> wrote:
>> Hi everybody,
>> 
>> In my ExecuteRuleEngine processor I have used a LineIterator to loop over 
>> incoming rows of data (CSV).
>> 
>> I was wondering if that is the preferred method for (large) files or if 
>> there are better ways to do it. Also, is there a way to influence the number 
>> of buffered rows or bytes that the LineIterator uses?
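>> 
>> For illustration, a simplified sketch of the pattern (assuming Commons 
>> IO's LineIterator, with `in` being the InputStream from the session; 
>> since LineIterator reads through the Reader you pass it, a BufferedReader 
>> with an explicit size seems to be one way to influence the buffering):
>> 
>>    LineIterator it = new LineIterator(new BufferedReader(
>>            new InputStreamReader(in, StandardCharsets.UTF_8), 64 * 1024));
>>    try {
>>        while (it.hasNext()) {
>>            String row = it.nextLine();
>>            // apply the rule engine to this row
>>        }
>>    } finally {
>>        LineIterator.closeQuietly(it);
>>    }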
>> 
>> I saw that if I use a large file (about 2 million rows) with my processor, 
>> the flow stops and aborts at a certain point, probably a memory issue; I 
>> do not have a lot of RAM available. On the other hand, I can process the 
>> same large file if I use a SplitText processor before my processor and 
>> split the file into chunks of e.g. 200,000 rows.
>> 
>> Any hints or recommendations are welcome.
>> 
>> Greetings,
>> 
>> Uwe
