It seems the need for multiple SplitTexts arises only because we keep all the splits in memory until we have a count to put in an attribute. Maybe we should make that configurable, in case the user isn't going to use the fragment.* attributes. If they are going to use them, then multiple splits can be an issue anyway, because the fragment.* attributes keep getting overwritten by each later split. If you need a total index (for the upcoming EnforceOrder processor, for example), you need an UpdateAttribute between successive splits to save off the parent attributes and calculate the total index. I have an example template for this; if anyone is interested, I can put it up as a Gist or something.
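To make the total-index arithmetic concrete, here is a rough sketch in plain Java. It is not the template mentioned above and not NiFi code; the class `SplitIndex` and the method `totalIndex` are hypothetical names. It assumes a two-stage split where the second stage emits single lines, that both fragment.index values are 1-based, that no header lines are involved, and that every first-stage chunk except possibly the last holds exactly the configured number of lines.

```java
// Hypothetical helper for illustration only; not part of NiFi.
final class SplitIndex {
    private SplitIndex() {}

    /**
     * Combines the indexes of a two-stage split into one overall line index.
     *
     * @param parentIndex    fragment.index saved off after the first split (assumed 1-based)
     * @param childIndex     fragment.index written by the second split (assumed 1-based)
     * @param linesPerParent line count configured on the first split, e.g. 5000
     * @return the 1-based position of the line in the original file
     */
    static long totalIndex(long parentIndex, long childIndex, long linesPerParent) {
        if (parentIndex < 1 || childIndex < 1 || linesPerParent < 1) {
            throw new IllegalArgumentException("indexes and line count must be >= 1");
        }
        if (childIndex > linesPerParent) {
            throw new IllegalArgumentException("childIndex cannot exceed linesPerParent");
        }
        // Lines in all earlier first-stage chunks, plus the offset inside this chunk.
        return (parentIndex - 1) * linesPerParent + childIndex;
    }
}
```

For example, with 5000-line first-stage chunks, line 42 of the third chunk maps to `SplitIndex.totalIndex(3, 42, 5000)`, which is 10042. If the indexes turn out to be 0-based in your setup, the `- 1` adjustment would need to change accordingly.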
Regards,
Matt

> On Apr 14, 2017, at 11:21 AM, Joe Witt <joe.w...@gmail.com> wrote:
>
> Uwe
>
> Enabling developers to cleanly handle extremely large objects is
> precisely why the ProcessSession interface to interact with FlowFile
> content occurs through Input and OutputStreams. With that model you
> can load just as much content as you need into memory at any one time
> and nothing more. This is really useful for a lot of line-by-line
> reading and transformations, encryption, compression, etc. kinds of
> cases.
>
> Now, there is a gotcha to this model in that we do presently hold the
> FlowFile objects/attributes (not the content) that we've got as part
> of that ProcessSession in memory. When you have hundreds of
> thousands/millions of these in a single session it can eat up a lot of
> heap. We do intend to make even that not matter eventually by
> swapping out the FlowFile objects to disk temporarily but we're not
> there yet.
>
> In the meantime, knowing the strength of the content access and the
> limitation around having tons of FlowFiles in a single session means
> you can still build really powerful approaches that scale well and are
> nice to your Java heap.
>
> You'll notice that we recommend people do SplitText in a two-stage
> process when they're dealing with really large splits. We'll say for
> instance have the first split generate 5000-line outputs, then the
> second split do single-line outputs. This is only when they must get
> things split to their individual form of course.
>
> So, back to your processor, is it reading in a single large set of
> rows and putting out many individual rows? Could you have it put out
> rows grouped together by some logic or must you end up with individual
> splits?
>
> Thanks
> Joe
>
>> On Fri, Apr 14, 2017 at 11:14 AM, Uwe Geercken <uwe.geerc...@web.de> wrote:
>> Hi everybody,
>>
>> in my ExecuteRuleEngine processor I have used a LineIterator to loop over
>> incoming rows of data (CSV).
>>
>> I was wondering if that is the preferred method for (large) files or if
>> there are better ways to do it. Also, is there a way to influence the number
>> of buffered rows or bytes that the LineIterator uses?
>>
>> I saw that if I use a large file (about 2 million rows) with my processor, then
>> the flow stops and aborts at a certain point. Probably a memory issue. I do
>> not have a lot of RAM available. On the other hand, I can process the same
>> large file if I use a SplitText processor before my processor and split the
>> file into chunks of e.g. 200000 rows.
>>
>> Any hints or recommendations are welcome.
>>
>> Greetings,
>>
>> Uwe
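
The stream-based content access described in the quoted reply can be sketched in plain Java as below. This is only an illustration of reading one line at a time from an InputStream and writing to an OutputStream, so that heap use stays bounded by the longest line rather than the file size. `LineStreamer` and `transformLines` are hypothetical names, UTF-8 and `\n` line endings are assumptions, and inside a real processor the two streams would presumably be the ones NiFi hands to a stream callback rather than streams you open yourself.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;

// Hypothetical helper for illustration only; not part of NiFi.
final class LineStreamer {
    private LineStreamer() {}

    /**
     * Reads the input one line at a time, applies the transform, and writes
     * each result followed by '\n'. Only the current line is held in memory.
     * The streams are flushed but not closed; the caller owns them.
     *
     * @return the number of lines processed
     */
    static long transformLines(InputStream in, OutputStream out, UnaryOperator<String> transform) {
        try {
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            BufferedWriter writer =
                    new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
            long count = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(transform.apply(line));
                writer.write('\n');
                count++;
            }
            writer.flush();
            return count;
        } catch (IOException e) {
            // Rethrown unchecked so the helper can be called from lambdas.
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as `LineStreamer.transformLines(in, out, String::toUpperCase)` would stream a CSV of any size through the transform. Note that this addresses content size only; it does not change the separate limitation about holding many FlowFile objects in a single session.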