Uwe,

Enabling developers to cleanly handle extremely large objects is precisely why the ProcessSession interface exposes FlowFile content through InputStreams and OutputStreams. With that model you load only as much content into memory at any one time as you need and nothing more. This is really useful for line-by-line reading and transformation, encryption, compression, and similar cases.
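For illustration, something like the following shows that pattern. This is an untested sketch, not an existing processor; the class name and the toUpperCase() stand-in are made up:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.StreamCallback;

public class LineStreamingSketch extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session)
            throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // Content comes and goes as streams, so only the reader/writer
        // buffers live on the heap no matter how large the FlowFile is.
        flowFile = session.write(flowFile, new StreamCallback() {
            @Override
            public void process(InputStream in, OutputStream out) throws IOException {
                final BufferedReader reader = new BufferedReader(
                        new InputStreamReader(in, StandardCharsets.UTF_8));
                final BufferedWriter writer = new BufferedWriter(
                        new OutputStreamWriter(out, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    writer.write(line.toUpperCase()); // stand-in for a real per-line transform
                    writer.newLine();
                }
                writer.flush();
            }
        });
        session.transfer(flowFile, REL_SUCCESS);
    }
}

At any given moment only the buffered reader/writer chunks and the current line are in memory, which is why this scales to arbitrarily large content.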
Now, there is a gotcha to this model: we do presently hold the FlowFile objects/attributes (not the content) that belong to a ProcessSession in memory. When you have hundreds of thousands or millions of these in a single session, that can eat up a lot of heap. We intend to make even that a non-issue eventually by temporarily swapping the FlowFile objects out to disk, but we're not there yet. In the meantime, knowing the strength of the streaming content access and the limitation around holding tons of FlowFiles in a single session means you can still build really powerful approaches that scale well and are kind to your Java heap.

You'll notice we recommend people run SplitText as a two-stage process when dealing with really large splits: for instance, have the first split produce 5000-line outputs and the second split produce single-line outputs. This only applies when things truly must be split down to their individual form, of course.

So, back to your processor: is it reading in a single large set of rows and putting out many individual rows? Could it put out rows grouped together by some logic, or must you end up with individual splits? A rough sketch of the grouped approach follows the quoted message below.

Thanks
Joe

On Fri, Apr 14, 2017 at 11:14 AM, Uwe Geercken <uwe.geerc...@web.de> wrote:
> Hi everybody,
>
> in my ExecuteRuleEngine processor I have used a LineIterator to loop over
> incoming rows of data (CSV).
>
> I was wondering if that is the preferred method for (large) files or if
> there are better ways to do it. Also, is there a way to influence the
> number of buffered rows or bytes that the LineIterator uses?
>
> I saw that if I use a large file (about 2 million rows) with my processor,
> the flow stops and aborts at a certain point. Probably a memory issue; I
> do not have a lot of RAM available. On the other hand, I can process the
> same large file if I use a SplitText processor before my processor and
> split the file into chunks of e.g. 200000 rows.
>
> Any hints or recommendations are welcome.
>
> Greetings,
>
> Uwe
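As promised above, here is an untested sketch of the grouped-output idea. GroupedSplitSketch, REL_SPLITS, and the 5000-line group size are all illustrative assumptions, not an existing processor:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.InputStreamCallback;

public class GroupedSplitSketch extends AbstractProcessor {

    static final Relationship REL_SPLITS = new Relationship.Builder()
            .name("splits").build();

    static final int LINES_PER_SPLIT = 5000; // tune to your heap budget

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SPLITS);
    }

    @Override
    public void onTrigger(ProcessContext context, final ProcessSession session)
            throws ProcessException {
        final FlowFile original = session.get();
        if (original == null) {
            return;
        }
        session.read(original, new InputStreamCallback() {
            @Override
            public void process(InputStream in) throws IOException {
                final BufferedReader reader = new BufferedReader(
                        new InputStreamReader(in, StandardCharsets.UTF_8));
                String line = reader.readLine();
                while (line != null) {
                    // Buffer at most LINES_PER_SPLIT lines, then flush
                    // them out as a single split FlowFile.
                    final StringBuilder group = new StringBuilder();
                    for (int i = 0; i < LINES_PER_SPLIT && line != null; i++) {
                        group.append(line).append('\n');
                        line = reader.readLine();
                    }
                    FlowFile split = session.create(original);
                    split = session.write(split, out -> out.write(
                            group.toString().getBytes(StandardCharsets.UTF_8)));
                    session.transfer(split, REL_SPLITS);
                }
            }
        });
        session.remove(original);
    }
}

The heap cost is bounded by one group of lines plus one FlowFile object per split, rather than one FlowFile object per row, which is the same reason the two-stage SplitText recommendation works.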