Thanks for making this clear! I was confused because I do see a `java.lang.OutOfMemoryError` on the GetFile processor itself (and a matching `bytes read` spike corresponding to the file size).
On Mon, Nov 14, 2016 at 2:23 PM, Joe Witt <[email protected]> wrote:

> The pattern you want for this is
>
> 1) GetFile or (ListFile + FetchFile)
> 2) RouteText
> 3) PublishKafka
>
> As Andrew points out, GetFile and FetchFile do *not* read the file
> contents into memory. The whole point of NiFi's design in general is
> to take advantage of the content repository rather than forcing
> components to hold things in memory. While they can elect to hold
> things in memory, they don't have to, and the repository allows
> reading from and writing to streams, all within a unit-of-work
> transactional model. There is a lot more to say on that topic, but
> you can see a good bit about it in the docs.
>
> RouteText is the way to avoid the SplitText memory scenario, where
> there are so many lines that even holding pointers/metadata about
> those lines becomes problematic. You can also, as Andrew points out,
> split in chunks, which works well too. RouteText will likely yield
> higher performance overall, though, if it works for your case.
>
> Thanks
> Joe
>
> On Mon, Nov 14, 2016 at 8:11 AM, Andrew Grande <[email protected]> wrote:
> > Neither GetFile nor FetchFile reads the file into memory; they only
> > deal with the file handle and pass the contents via a handle to the
> > content repository (NiFi streams data in and reads it back as a
> > stream).
> >
> > What you will face, however, is an issue with SplitText when you
> > try to split in one transaction. This might fail depending on the
> > allocated JVM heap and the file size. A recommended best practice
> > in this case is to introduce a series of two SplitText processors:
> > the first pass splits into e.g. 10,000-row chunks, the second into
> > individual lines. Adjust for your expected file sizes and available
> > memory.
> >
> > HTH,
> > Andrew
> >
> > On Mon, Nov 14, 2016 at 7:23 AM Raf Huys <[email protected]> wrote:
> >>
> >> I would like to read in a large logfile (several gigs) and route
> >> every line to a (potentially different) Kafka topic.
> >>
> >> - I don't want this file to be in memory
> >> - I want it to be read once, not more
> >>
> >> Using `GetFile` takes the whole file into memory. Same with
> >> `FetchFile`, as far as I can see.
> >>
> >> I also used an `ExecuteProcess` processor in which the file is
> >> `cat`ed and which splits off a flowfile every millisecond. This
> >> looked like a somewhat streaming approach to the problem, but this
> >> processor runs continuously (or cron-based), and consequently the
> >> logfile is re-injected all the time.
> >>
> >> What's the typical NiFi approach for this? Tx
> >>
> >> Raf Huys

--
Mvg,
Raf Huys
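For readers following along: the RouteText idea Joe describes is essentially "stream the file one line at a time and make a routing decision per line, never materializing the whole file." A minimal, NiFi-independent Java sketch of that pattern is below — the `ERROR`-prefix routing rule and the `errors`/`other` topic names are made-up examples, not anything from NiFi or the thread, and real use would hand each line to a Kafka producer rather than an in-memory list.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
 * Constant-memory line routing: read the input as a stream, decide a
 * destination per line, and hand each line off immediately instead of
 * accumulating the file. Only one line is held in memory at a time
 * (the sink maps here accumulate lines purely for demonstration).
 */
public class LineRouter {

    /** Routes each line of {@code in} to the sink chosen by {@code topicFor}. */
    static void route(Reader in,
                      Function<String, String> topicFor,
                      Map<String, List<String>> sinks) throws IOException {
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {  // one line at a time
                String topic = topicFor.apply(line);      // per-line routing decision
                sinks.computeIfAbsent(topic, t -> new ArrayList<>()).add(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical log format: lines starting with "ERROR" go to an
        // "errors" topic, everything else to "other".
        String log = "ERROR disk full\nINFO started\nERROR timeout\n";
        Map<String, List<String>> sinks = new HashMap<>();
        route(new StringReader(log),
              line -> line.startsWith("ERROR") ? "errors" : "other",
              sinks);
        System.out.println(sinks.get("errors").size() + " error lines"); // prints "2 error lines"
    }
}
```

Swapping `StringReader` for a `FileReader` over a multi-gigabyte log keeps memory flat, which is the property RouteText gives you inside NiFi.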
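Andrew's two-SplitText advice — first split into coarse chunks (e.g. 10,000 rows), then split each small chunk into single lines — can be sketched in plain Java as below. This is an illustration of the chunking idea only, not NiFi's SplitText implementation; the chunk size and sample data are arbitrary.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/**
 * Two-pass split: first group a (potentially huge) line stream into
 * fixed-size chunks, then split each small chunk into individual lines.
 * Per-chunk metadata stays bounded at chunkSize, which is the point of
 * the two-stage approach.
 */
public class TwoStageSplit {

    /** First pass: lazily group lines into chunks of at most chunkSize. */
    static Iterator<List<String>> chunks(Iterator<String> lines, int chunkSize) {
        return new Iterator<>() {
            public boolean hasNext() { return lines.hasNext(); }
            public List<String> next() {
                List<String> chunk = new ArrayList<>(chunkSize);
                while (lines.hasNext() && chunk.size() < chunkSize) {
                    chunk.add(lines.next());
                }
                return chunk;  // at most chunkSize lines held at once
            }
        };
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "c", "d", "e");
        Iterator<List<String>> it = chunks(input.iterator(), 2);
        while (it.hasNext()) {
            // Second pass: each chunk is small enough to split into
            // single lines without heavy metadata overhead.
            for (String line : it.next()) {
                System.out.println(line);
            }
        }
    }
}
```

In the NiFi flow this corresponds to SplitText (Line Split Count = 10000) followed by SplitText (Line Split Count = 1), so neither processor ever tracks millions of splits from a single transaction.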
