Joe,
  Fantastic! Thank you for the detailed and informative response.

We do use Spark and are looking at ways to integrate it with NiFi to control the ingest and initial data conditioning.
I'm finding various ways this could easily be done.

D

On 11/10/2015 09:28 PM, Joe Witt wrote:
Darren,

In short, yes I think NiFi can handle such a case in a generic sense quite well.

Read on for the longer response...

NiFi can process extremely large data, extremely large datasets,
extremely small data at high rates, variable-sized data, etc. It
does this efficiently by design: the content repository supports
pass-by-reference and copy-on-write behavior, and NiFi operates in a
manner that lets disk caching benefits really shine through.
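
To make the pass-by-reference and copy-on-write idea concrete, here is a
minimal Python sketch. It is only an illustration of the general technique,
not NiFi's actual implementation or API: `ContentStore`, `FlowItem`,
`route`, and `transform` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


class ContentStore:
    """Toy content store (hypothetical): stored content is never modified in place."""

    def __init__(self):
        self._blobs = {}
        self._next_claim = 0

    def write(self, data: bytes) -> int:
        # Every write creates a new claim; existing claims stay untouched.
        claim = self._next_claim
        self._next_claim += 1
        self._blobs[claim] = bytes(data)
        return claim

    def read(self, claim: int) -> bytes:
        return self._blobs[claim]

    def claim_count(self) -> int:
        return len(self._blobs)


@dataclass(frozen=True)
class FlowItem:
    """A lightweight record that points at content instead of carrying it."""
    claim: int
    attributes: dict = field(default_factory=dict)


def route(item: FlowItem, **extra) -> FlowItem:
    # Pass-by-reference: only metadata changes, the content claim is shared.
    return FlowItem(item.claim, {**item.attributes, **extra})


def transform(store: ContentStore, item: FlowItem, fn) -> FlowItem:
    # Copy-on-write: changed content goes to a new claim, the original is kept.
    new_claim = store.write(fn(store.read(item.claim)))
    return FlowItem(new_claim, dict(item.attributes))


if __name__ == "__main__":
    store = ContentStore()
    original = FlowItem(store.write(b"hello"), {"name": "doc1"})
    routed = route(original, destination="spark")
    upper = transform(store, original, bytes.upper)
    print(routed.claim == original.claim, store.read(upper.claim))
```

In this sketch, routing an item costs only a small metadata copy no matter
how large the content is, and a new copy is written only when a step
actually changes the content.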

That said, if all that is of interest is pure 'processing' and
having a general-purpose processing framework, then Storm, Spark, and
others are focused solely on that space.  NiFi is focused on the
management of dataflows from wherever in your enterprise data is
created or produced, to and through processing systems, and ultimately
into storage systems like HDFS, NoSQL stores, and relational databases.

So depending on what you're trying to do to these documents, be it
feature extraction, transformation, etc., NiFi may be a great choice,
or NiFi may simply be the tool you use to feed this data into systems
like Storm or Spark or others.  You can absolutely parallelize the
flow of data across a NiFi cluster.  For producers we offer a library
to interact with our site-to-site protocol, which will handle things
like load balancing and failover and make it really easy to stream
data to NiFi.  Or NiFi itself could pull from your system if perhaps
these documents are sitting as files or are available via some other
supported interface.

NiFi can be configured to control the rate of processing, queue data,
apply back-pressure, and handle errors, and it offers a number of other
features that are beneficial to the dataflow management problem.
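
As a rough illustration of queuing with back-pressure, the Python sketch
below uses a bounded queue between a fast producer and a slower consumer.
It shows the general technique only and is not how NiFi implements it;
`run_flow`, `n_items`, and `threshold` are hypothetical names chosen for
this example.

```python
import queue
import threading
import time


def run_flow(n_items=20, threshold=5):
    """Push n_items through a bounded queue; return (consumed, max_depth)."""
    conn = queue.Queue(maxsize=threshold)  # bounded connection between two steps
    consumed = []
    depth = {"max": 0}

    def producer():
        for i in range(n_items):
            # put() blocks while the queue is full, so the producer is
            # slowed to the consumer's pace: this is the back-pressure.
            conn.put(i)
            depth["max"] = max(depth["max"], conn.qsize())
        conn.put(None)  # sentinel: no more data

    def consumer():
        while True:
            item = conn.get()
            if item is None:
                break
            time.sleep(0.001)  # simulate a slower downstream step
            consumed.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed, depth["max"]


if __name__ == "__main__":
    items, max_depth = run_flow()
    print(len(items), "items delivered; queue depth never exceeded", max_depth)
```

The queue never grows past the threshold, so a slow downstream step cannot
cause unbounded buffering; the upstream step simply waits.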

NiFi supports making tradeoffs at key points in the flow for batch
(time tolerant) or low latency (time sensitive)
processing/distribution.  Whether data arrives in a streaming or batch
fashion and whether it must be delivered to systems in batch or
streaming fashion is a concern that NiFi handles well so the various
systems can be less coupled.

Regarding elasticity, I will state that NiFi is not elastic in the
sense that it will (at this time) automatically provision additional
nodes to take on the workload and then deprovision them as the load
decreases.  We will get there.  But what we do support are key
capabilities like event-driven processing with upper bounds on
threads, back-pressure which can propagate to the source causing data
to go to less loaded nodes, and so on.  These are elements of
elastic behavior, but it is not elastic provisioning (as folks often
mean).

I hope this response is helpful.  If any of this was unclear or you
want to dive deeper just let us know.

Thanks
Joe

On Tue, Nov 10, 2015 at 6:30 PM, Darren Govoni <dar...@ontrenet.com> wrote:
Hi,
   I studied the NiFi website a bit, and if I missed a key part, forgive me
for asking this question.
But I am wondering if or how NiFi can accommodate processing large data sets
with possibly compute-intensive operations.
For example, if we have, say, 2 million documents, how does NiFi make
processing these documents efficient?
I understand the visual workflow and it's nice. How is that parallelized
across a data set?

Do we submit all the documents to a cluster of flows (how many?) that
execute some number of documents simultaneously?
Does NiFi support batch processing? Is it elastic?

Thanks.
