Your general assessment of what you'd need is correct. It's a fairly
easy component to build, and I'll file a Jira ticket for it. It would
definitely be doable for NiFi 1.8.

Expect the Mongo processors to go through some real cleanup like this in
1.8. One of the other big changes is that I'll be moving the processors to
a controller service as an optional way to configure the Mongo client, with
the plan that by around 1.9 all of the Mongo processors will drop their own
client configurations and share the same pool (currently every processor
instance maintains its own client).
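To illustrate the shared-pool idea, here's a minimal plain-Java sketch. This is not NiFi's actual controller service API (a real one would extend AbstractControllerService and hold a MongoClient); the `SharedClientService` name and the use of a plain `Object` in place of the Mongo client are stand-ins so the sketch runs without any NiFi or driver jars:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a controller service that owns a single Mongo client.
final class SharedClientService {
    // One client for the whole JVM, created lazily on first use.
    private static final AtomicReference<Object> CLIENT = new AtomicReference<>();

    static Object getClient() {
        // Only the first caller's CAS succeeds; everyone else reuses that instance.
        CLIENT.compareAndSet(null, new Object());
        return CLIENT.get();
    }
}

public class PooledClientDemo {
    public static void main(String[] args) {
        // Two "processors" asking for a client get the same shared instance,
        // instead of each building and maintaining its own connection pool.
        Object first = SharedClientService.getClient();
        Object second = SharedClientService.getClient();
        System.out.println(first == second); // prints "true"
    }
}
```

The point is just the lifecycle: the client lives in one shared place, and each processor looks it up rather than constructing its own.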

On Thu, Jun 21, 2018 at 3:13 AM Kelsey RIDER <[email protected]>
wrote:

> Hello,
>
>
>
> I’ve been experimenting with NiFi and MongoDB. I have a test collection
> with 1 million documents in it. Each document has the same flat JSON
> structure with 11 fields.
>
> My NiFi flow exposes a webservice, which allows the user to fetch all the
> data in CSV format.
>
>
>
> However, 1M documents brings NiFi to its knees. Even after increasing the
> JVM’s Xms and Xmx to 2G, I still get an OutOfMemoryError:
>
>
>
> 2018-06-20 11:27:43,428 WARN [Timer-Driven Process Thread-7]
> o.a.n.controller.tasks.ConnectableTask Administratively Yielding GetMongo
> due to uncaught Exception: java.lang.OutOfMemoryError: Java heap space
>
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:3332)
>         at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
>         at java.lang.StringBuilder.append(StringBuilder.java:136)
>         at org.apache.nifi.processors.mongodb.GetMongo.buildBatch(GetMongo.java:222)
>         at org.apache.nifi.processors.mongodb.GetMongo.onTrigger(GetMongo.java:341)
>         at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
>         at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1147)
>         at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:175)
>         at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenScheduling
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThr
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPool
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
>
>
> I dug into the code, and discovered that the GetMongo processor takes all
> the Documents returned from Mongo, converts them to Strings, and
> concatenates them in a StringBuilder.
>
>
>
> My question is thus: is there a better way that I should be doing this?
>
> The only idea I’ve had is to use a smaller batch size, but that would mean
> that I’d just need a later processor to concatenate the batches in order to
> get one big CSV.
>
> Is there some sort of “GetMongoRecord” processor that reads each Mongo
> document as a record, the way ExecuteSQL does? (I’ve done the same test
> with an SQL database, and it handles 1M records just fine.)
>
>
>
> Thanks for your help,
>
>
>
> Kelsey
> Following changes in labour regulations, if you receive this email before
> 7:00 a.m., in the evening, during the weekend, or during your holidays,
> please do not process it or reply immediately, except in cases of
> exceptional urgency.
>
