Thanks for making this clear! I was confused because I do see a `java.lang.OutOfMemoryError` on the GetFile processor itself (and a matching `bytes read` spike corresponding to the file size).
On Mon, Nov 14, 2016 at 2:23 PM, Joe Witt <[email protected]> wrote:

> The pattern you want for this is
>
> 1) GetFile or (ListFile + FetchFile)
> 2) RouteText
> 3) PublishKafka
>
> As Andrew points out, GetFile and FetchFile do *not* read the file
> contents into memory. The whole point of NiFi's design in general is
> to take advantage of the content repository rather than forcing
> components to hold things in memory. While they can elect to hold
> things in memory, they don't have to, and the repository allows
> reading from and writing to streams, all within a unit-of-work
> transactional model. There is a lot more to say on that topic, but
> you can see a good bit about it in the docs.
>
> RouteText is the way to avoid the SplitText memory scenario, where
> there are so many lines that even holding pointers/metadata about
> those lines becomes problematic. You can also, as Andrew points out,
> split in chunks, which works well too. RouteText will likely yield
> higher performance overall, though, if it works for your case.
>
> Thanks
> Joe
>
> On Mon, Nov 14, 2016 at 8:11 AM, Andrew Grande <[email protected]> wrote:
> > Neither GetFile nor FetchFile reads the file into memory; they only
> > deal with the file handle and pass the contents via a handle to the
> > content repository (NiFi streams data in and reads it back as a
> > stream).
> >
> > What you will face, however, is an issue with SplitText when you
> > try to split in one transaction. This might fail depending on the
> > allocated JVM heap and the file size. A recommended best practice
> > in this case is to introduce a series of two SplitText processors:
> > the first pass splits into e.g. 10,000-row chunks, the second into
> > individual lines. Adjust for your expected file sizes and available
> > memory.
> >
> > HTH,
> > Andrew
> >
> > On Mon, Nov 14, 2016 at 7:23 AM Raf Huys <[email protected]> wrote:
> >>
> >> I would like to read in a large logfile (several gigs) and route
> >> every line to a (potentially different) Kafka topic.
> >>
> >> - I don't want this file to be in memory
> >> - I want it to be read once, not more
> >>
> >> Using `GetFile` takes the whole file into memory. Same with
> >> `FetchFile`, as far as I can see.
> >>
> >> I also used an `ExecuteProcess` processor in which the file is
> >> `cat`ed and which splits off a flowfile every millisecond. This
> >> looked like a somewhat streaming approach to the problem, but this
> >> processor runs continuously (or cron-based), and consequently the
> >> logfile is re-injected all the time.
> >>
> >> What's the typical NiFi approach for this? Tx
> >>
> >> Raf Huys

--
Mvg,
Raf Huys
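For readers following along: the RouteText idea Joe describes is essentially "stream the file one line at a time and make a routing decision per line, never materializing the whole file." A minimal, NiFi-independent Java sketch of that pattern is below — the `ERROR`-prefix routing rule and the `errors`/`other` topic names are made-up examples, not anything from NiFi or the thread, and real use would hand each line to a Kafka producer rather than an in-memory list.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
 * Constant-memory line routing: read the input as a stream, decide a
 * destination per line, and hand each line off immediately instead of
 * accumulating the file. Only one line is held in memory at a time
 * (the sink maps here accumulate lines purely for demonstration).
 */
public class LineRouter {

    /** Routes each line of {@code in} to the sink chosen by {@code topicFor}. */
    static void route(Reader in,
                      Function<String, String> topicFor,
                      Map<String, List<String>> sinks) throws IOException {
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {  // one line at a time
                String topic = topicFor.apply(line);      // per-line routing decision
                sinks.computeIfAbsent(topic, t -> new ArrayList<>()).add(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical log format: lines starting with "ERROR" go to an
        // "errors" topic, everything else to "other".
        String log = "ERROR disk full\nINFO started\nERROR timeout\n";
        Map<String, List<String>> sinks = new HashMap<>();
        route(new StringReader(log),
              line -> line.startsWith("ERROR") ? "errors" : "other",
              sinks);
        System.out.println(sinks.get("errors").size() + " error lines"); // prints "2 error lines"
    }
}
```

Swapping `StringReader` for a `FileReader` over a multi-gigabyte log keeps memory flat, which is the property RouteText gives you inside NiFi.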
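Andrew's two-SplitText advice — first split into coarse chunks (e.g. 10,000 rows), then split each small chunk into single lines — can be sketched in plain Java as below. This is an illustration of the chunking idea only, not NiFi's SplitText implementation; the chunk size and sample data are arbitrary.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/**
 * Two-pass split: first group a (potentially huge) line stream into
 * fixed-size chunks, then split each small chunk into individual lines.
 * Per-chunk metadata stays bounded at chunkSize, which is the point of
 * the two-stage approach.
 */
public class TwoStageSplit {

    /** First pass: lazily group lines into chunks of at most chunkSize. */
    static Iterator<List<String>> chunks(Iterator<String> lines, int chunkSize) {
        return new Iterator<>() {
            public boolean hasNext() { return lines.hasNext(); }
            public List<String> next() {
                List<String> chunk = new ArrayList<>(chunkSize);
                while (lines.hasNext() && chunk.size() < chunkSize) {
                    chunk.add(lines.next());
                }
                return chunk;  // at most chunkSize lines held at once
            }
        };
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "c", "d", "e");
        Iterator<List<String>> it = chunks(input.iterator(), 2);
        while (it.hasNext()) {
            // Second pass: each chunk is small enough to split into
            // single lines without heavy metadata overhead.
            for (String line : it.next()) {
                System.out.println(line);
            }
        }
    }
}
```

In the NiFi flow this corresponds to SplitText (Line Split Count = 10000) followed by SplitText (Line Split Count = 1), so neither processor ever tracks millions of splits from a single transaction.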
