Uwe

Looping over a LineIterator, assuming we're talking about the Apache
Commons IO LineIterator, should have good memory behavior.  I've not
looked at the details of it, but I'd expect it to hold in memory only
what it takes to create the String that represents the 'nextLine',
plus whatever buffering it needs to answer 'hasNext'.  If that is the
case then the thing you still have to consider is how long any single
line could be.
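
For illustration only, a minimal sketch of that pattern (this is not your
processor's code; it assumes the Commons IO LineIterator and the standard
ProcessSession read callback):

import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;
import org.apache.commons.io.LineIterator;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.ProcessSession;

// Sketch only: line-by-line read of FlowFile content inside onTrigger().
// Only the current line plus the iterator's internal buffer sits on the heap.
void processLines(final ProcessSession session, final FlowFile flowFile) {
    session.read(flowFile, (final InputStream in) -> {
        final LineIterator lines = IOUtils.lineIterator(in, StandardCharsets.UTF_8);
        try {
            while (lines.hasNext()) {
                final String line = lines.nextLine(); // one line at a time
                // run the rule engine (or any per-line logic) against 'line'
            }
        } finally {
            lines.close();
        }
    });
}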

In the case of TextLineDemarcator we're taking advantage of NiFi's
ability to segment content without having to actually manipulate the
original content.  For example, SplitText can create 'splits' which
are really just pointers into the original content with new offsets
and lengths.  Combined with disk caching, that gives extremely high
throughput for the splits and subsequent reads, because we're only
creating pointers to the original rather than writing new data.
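
That pointer-style split is what ProcessSession.clone(original, offset, size)
gives you.  A rough sketch (the offset and length here are placeholders;
SplitText computes the real line boundaries with TextLineDemarcator):

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.ProcessSession;

// Sketch only: emit a 'split' that merely references a region of the parent's
// content claim -- no content bytes are copied or rewritten.
FlowFile emitSplit(final ProcessSession session, final FlowFile original,
                   final long offset, final long length) {
    return session.clone(original, offset, length);
}

Each split still costs a FlowFile object (attributes, lineage) on the heap
within the session, which is the limitation around very large numbers of
FlowFiles per session mentioned below.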

Thanks
Joe

On Fri, Apr 14, 2017 at 11:57 AM, Uwe Geercken <uwe.geerc...@web.de> wrote:
> Joe,
>
> actually, this is a good question. My processor needs to work on an individual 
> line to run the business rules against the data. The idea is to then output 
> individual rows/flow files. These flow files contain several attributes 
> indicating the state of the rule engine (the results). In a next step the user 
> could decide how to continue, e.g. to filter out or dump those flow files which 
> failed.
>
> If I output multiple rows in one flow file, then I keep more rows in memory 
> over a longer period of time, and the results from the rule engine are 
> individual per row. So combining them does not seem to work well.
>
> My input file has about two million rows of data. I will see if I can add 
> more memory to find out whether that is the problem.
>
> But what about my other question: is looping over LineIterator the right way 
> to go? And what does SplitText do differently so that it can handle the large 
> file, but my processor using LineIterator cannot? I looked at the source and 
> saw that SplitText is using TextLineDemarcator for processing the rows, but I 
> don't understand the details of it.
>
> Rgds,
>
> Uwe
>
>
> Sent: Friday, 14 April 2017 at 17:21
> From: "Joe Witt" <joe.w...@gmail.com>
> To: dev@nifi.apache.org
> Subject: Re: InputStream and LineIterator
> Uwe
>
> Enabling developers to cleanly handle extremely large objects is
> precisely why the ProcessSession interface exposes FlowFile content
> through Input and OutputStreams. With that model you can load only as
> much content as you need into memory at any one time and nothing more.
> This is really useful for a lot of line-by-line reading and
> transformation, encryption, compression, etc. kinds of cases.
>
> Now, there is a gotcha to this model: we do presently hold the
> FlowFile objects/attributes (not the content) that are part of that
> ProcessSession in memory. When you have hundreds of thousands or
> millions of these in a single session it can eat up a lot of heap. We
> do intend to make even that not matter eventually by temporarily
> swapping the FlowFile objects out to disk, but we're not there yet.
>
> In the meantime, knowing the strength of the content access model and
> the limitation around having tons of FlowFiles in a single session, you
> can still build really powerful approaches that scale well and are nice
> to your Java heap.
>
> You'll notice that we recommend people do SplitText in a two-stage
> process when they're dealing with really large splits. For instance,
> have the first split generate 5000-line outputs, then have the second
> split produce single-line outputs. This only applies when they must get
> things split down to their individual form, of course.
>
> So, back to your processor, is it reading in a single large set of
> rows and putting out many individual rows? Could you have it put out
> rows grouped together by some logic or must you end up with individual
> splits?
>
> Thanks
> Joe
>
> On Fri, Apr 14, 2017 at 11:14 AM, Uwe Geercken <uwe.geerc...@web.de> wrote:
>> Hi everybody,
>>
>> in my ExecuteRuleEngine processor I have used a LineIterator to loop over 
>> incoming rows of data (CSV).
>>
>> I was wondering if that is the preferred method for (large) files or if 
>> there are better ways to do it. Also, is there a way to influence the number 
>> of buffered rows or bytes that the LineIterator uses?
>>
>> I saw that if I use a large file (about 2 million rows) with my processor, then 
>> the flow stops and aborts at a certain point. Probably a memory issue. I do 
>> not have a lot of RAM available. On the other hand, I can process the same 
>> large file if I use a SplitText processor before my processor and split the 
>> file into chunks of e.g. 200000 rows.
>>
>> Any hints or recommendations are welcome.
>>
>> Greetings,
>>
>> Uwe
