Joe, actually, this is a good question. My processor needs to run on each individual line to apply the business rules to the data. The idea is then to output individual rows/flow files. These flow files carry several attributes indicating the state of the rule engine (the results). In a next step the user can decide how to continue, e.g. filter or dump those flow files which failed.
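To make the idea concrete, here is a minimal sketch of the per-row evaluation described above: one result per CSV row, returned as the attribute map that would go onto the outgoing flow file. The attribute names (`ruleengine.rules.total`, `ruleengine.rules.failed`) and the stand-in rule (non-empty second field) are hypothetical, not taken from the actual ExecuteRuleEngine processor.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: evaluate one CSV row and produce the attributes
// that would be set on the single flow file emitted for that row.
public class RuleResultSketch {

    public static Map<String, String> evaluateRow(String csvRow) {
        String[] fields = csvRow.split(",", -1);
        // Stand-in business rule: the second field must be non-empty.
        boolean passed = fields.length > 1 && !fields[1].isEmpty();
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("ruleengine.rules.total", "1");
        attributes.put("ruleengine.rules.failed", passed ? "0" : "1");
        return attributes;
    }

    public static void main(String[] args) {
        System.out.println(evaluateRow("1001,Smith,DE")); // rule passes
        System.out.println(evaluateRow("1002,,DE"));      // rule fails
    }
}
```

Downstream, a RouteOnAttribute processor could then route on `ruleengine.rules.failed` to filter or dump the failed rows.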
If I output multiple rows in one flow file, I will keep more rows in memory over a longer period of time, and the rule engine results are individual per row, so combining them does not seem to work well. My input file has about two million rows of data. I will see if I can put more memory in to check whether that is the problem.

But what about my other question: is looping over a LineIterator the right way to go? And what does SplitText do differently so that it can handle the large file, while my processor using LineIterator cannot? I looked at the source and saw that SplitText uses TextLineDemarcator for processing the rows, but I don't understand the details of it.

Rgds,

Uwe

Sent: Friday, April 14, 2017 at 17:21
From: "Joe Witt" <joe.w...@gmail.com>
To: dev@nifi.apache.org
Subject: Re: InputStream and LineIterator

Uwe

Enabling developers to cleanly handle extremely large objects is precisely why the ProcessSession interface to FlowFile content works through Input and Output Streams. With that model you can load only as much content as you need into memory at any one time and nothing more. This is really useful for a lot of line-by-line reading and transformation, encryption, compression, etc. kinds of cases.

Now, there is a gotcha to this model: we do presently hold the FlowFile objects/attributes (not the content) that are part of the ProcessSession in memory. When you have hundreds of thousands or millions of these in a single session, it can eat up a lot of heap. We do intend to make even that not matter eventually by temporarily swapping the flow file objects out to disk, but we're not there yet. In the meantime, knowing the strength of the content access and the limitation around having tons of flow files in a single session means you can still build really powerful approaches that scale well and are kind to your Java heap.
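The streaming model Joe describes can be sketched with plain JDK classes: read the content line by line through a buffered Reader so that only one line is held on the heap at a time. In a real processor the Reader would wrap the InputStream handed to you inside `ProcessSession.read(...)`; here a StringReader stands in for it so the sketch is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Sketch of bounded-memory, line-by-line processing: regardless of how
// many rows the content has, only the current line is held in memory.
public class StreamingLineCount {

    public static long countLines(Reader content) {
        long lines = 0;
        try (BufferedReader reader = new BufferedReader(content)) {
            while (reader.readLine() != null) {
                lines++; // process one row here, then let it be garbage collected
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Simulate a large input; memory use stays flat as this grows.
        StringBuilder data = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            data.append("row-").append(i).append('\n');
        }
        System.out.println(countLines(new StringReader(data.toString())));
    }
}
```

Note that this only bounds the *content* in memory; as Joe points out, the FlowFile objects created per row in a single session still accumulate on the heap, which is the separate limitation.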
You'll notice that we recommend people do SplitText in a two-stage process when they're dealing with really large splits. For instance, have the first split generate 5000-line outputs, then have the second split do single-line outputs. This is only for when they must get things split down to their individual form, of course.

So, back to your processor: is it reading in a single large set of rows and putting out many individual rows? Could you have it put out rows grouped together by some logic, or must you end up with individual splits?

Thanks
Joe

On Fri, Apr 14, 2017 at 11:14 AM, Uwe Geercken <uwe.geerc...@web.de> wrote:
> Hi everybody,
>
> in my ExecuteRuleEngine processor I have used a LineIterator to loop over
> incoming rows of data (CSV).
>
> I was wondering if that is the preferred method for (large) files or if
> there are better ways to do it. Also, is there a way to influence the
> number of buffered rows or bytes that the LineIterator uses?
>
> I saw that if I use a large file (about 2 million rows) with my processor,
> the flow stops and aborts at a certain point. Probably a memory issue. I
> do not have a lot of RAM available. On the other hand, I can process the
> same large file if I use a SplitText processor before my processor and
> split the file into chunks of e.g. 200000 rows.
>
> Any hints or recommendations are welcome.
>
> Greetings,
>
> Uwe
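The grouping Joe suggests can be sketched as a simple chunking step: instead of one flow file per row, emit rows in fixed-size groups, so a two-million-row file becomes a few hundred flow files rather than millions of FlowFile objects in one session. The chunk size of 5000 below echoes the first stage of the two-stage SplitText recommendation; this is a generic illustration, not the actual SplitText or TextLineDemarcator implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of SplitText-style grouping: partition rows into chunks of a
// configurable size, each chunk becoming one output flow file.
public class ChunkedSplitSketch {

    public static List<List<String>> chunk(List<String> rows, int chunkSize) {
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, rows.size());
            chunks.add(new ArrayList<>(rows.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r1", "r2", "r3", "r4", "r5", "r6", "r7");
        // With chunkSize 5000 a 2,000,000-row file yields 400 chunks;
        // the tiny example here just shows the partitioning behavior.
        System.out.println(chunk(rows, 3));
    }
}
```

The trade-off is exactly the one in Uwe's case: per-row rule-engine results favor individual splits, while heap pressure favors grouped output, so a grouped first stage followed by a per-row second stage splits the difference.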