Re: Spark or custom processor?

Conrad Crampton Thu, 02 Jun 2016 06:52:07 -0700

Mark,
A very helpful explanation and distinction on appropriate use for NiFi. I think 
my particular use case currently (probably) falls into the Simple Event 
Processing. I say ‘probably’ because I am bringing in some other data to 
compare the data against (bad domains and maybe others), but certainly isn’t 
doing anything clever at the moment in terms of windowing/ aggregation with 
previously seen data etc.

Thanks for the advice, very helpful.
Conrad

From: Mark Payne <marka...@hotmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Thursday, 2 June 2016 at 14:42
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: Spark or custom processor?

Conrad,

Typically, the way that we like to think about using NiFi vs. something like 
Spark or Storm is whether
the processing is Simple Event Processing or Complex Event Processing. Simple 
Event Processing
encapsulates those tasks where you are able to operate on a single piece of 
data by itself (or in correlation
with an Enrichment Dataset). So tasks like enrichment, splitting, and 
transformation are squarely within
the wheelhouse of NiFi.

When we talk about doing Complex Event Processing, we are generally talking 
about either processing data
from multiple streams together (think JOIN operations) or analyzing data across 
time windows (think calculating
norms, standard deviation, etc. over the last 30 minutes). The idea here is to 
derive a single new "insight" from
windows of data or joined streams of data - not to transform or enrich 
individual pieces of data. For this, we would
recommend something like Spark, Storm, Flink, etc.

In terms of scalability, NiFi certainly was not designed to scale outward in 
the way that Spark was. With Spark you
may be scaling to thousands of nodes, but with NiFi you would get a pretty poor 
user experience because each change
in the UI must be replicated to all of those nodes. That being said, NiFi does 
scale up very well to take full advantage
of however much CPU and disks you have available. We typically see processing 
of several terabytes of data per day
on a single node, so we have generally not needed to scale out to hundreds or 
thousands of nodes.

I hope this helps to clarify when/where to use each one. If there are things 
that are still unclear or if you have more
questions, as always, don't hesitate to shoot another email!

Thanks
-Mark

On Jun 2, 2016, at 9:28 AM, Conrad Crampton 
<conrad.cramp...@secdata.com<mailto:conrad.cramp...@secdata.com>> wrote:

Hi,
ListenSyslog (using the approach that is being discussed currently in another 
thread – ListenSyslog running on primary node as RGP, all other nodes 
connecting to the port that the RPG exposes).
Various enrichment, routing on attributes etc. and finally into HDFS as Avro.
I want to branch off at an appropriate point in the flow and do some further 
realtime analysis – got the output to port feeding to Spark process working 
fine (notwithstanding the issue that you have been so kind to help with 
previously with the SSLContext), just thinking about if this is most 
appropriate solution.

I have dabbled with a custom processor (for enriching url splitting/ enriching 
etc. – probably could have done with ExecuteScript processor in hindsight) so 
am comfortable with going this route if that is deemed more appropriate.

Thanks
Conrad

From: Bryan Bende <bbe...@gmail.com<mailto:bbe...@gmail.com>>
Reply-To: "users@nifi.apache.org<mailto:users@nifi.apache.org>" 
<users@nifi.apache.org<mailto:users@nifi.apache.org>>
Date: Thursday, 2 June 2016 at 13:12
To: "users@nifi.apache.org<mailto:users@nifi.apache.org>" 
<users@nifi.apache.org<mailto:users@nifi.apache.org>>
Subject: Re: Spark or custom processor?

Conrad,

I would think that you could do this all in NiFi.

How do the log files come into NiFi? TailFile, ListenUDP/ListenTCP, 
List+FetchFile?

-Bryan

On Thu, Jun 2, 2016 at 6:41 AM, Conrad Crampton 
<conrad.cramp...@secdata.com<mailto:conrad.cramp...@secdata.com>> wrote:
Hi,
Any advice on ‘best’ architectural approach whereby some processing function 
has to be applied to every flow file in a dataflow with some (possible) output 
based on flowfile content.
e.g. inspect log files for specific ip then send message to syslog

approach 1
Spark
Output port from NiFi -> Spark listens to that stream -> processes and outputs 
accordingly
Advantages – scale spark job on Yarn, decoupled (reusable) from NiFi
Disadvantages – adds complexity, decoupled from NiFi.

Approach 2
NiFi
Custom processor -> PutSyslog
Advantages – reuse existing NiFi processors/ capability, obvious flow (design 
intent)
Disadvantages – scale??

Any comments/ advice/ experience of either approaches?

Thanks
Conrad

SecureData, combating cyber threats

________________________________
The information contained in this message or any of its attachments may be 
privileged and confidential and intended for the exclusive use of the intended 
recipient. If you are not the intended recipient any disclosure, reproduction, 
distribution or other dissemination or use of this communications is strictly 
prohibited. The views expressed in this email are those of the individual and 
not necessarily of SecureData Europe Ltd. Any prices quoted are only valid if 
followed up by a formal written quote.
SecureData Europe Limited. Registered in England & Wales 04365896. Registered 
Address: SecureData House, Hermitage Court, Hermitage Lane, Maidstone, Kent, 
ME16 9NT

***This email originated outside SecureData***
Click 
here<https://www.mailcontrol.com/sr/0Yez0Z9rJiDGX2PQPOmvUr11KAWLA5a39FXrkhyyO4eQg2DXa9Xl!rwzg+4hlLPKdufvfzcRzpTaNxM9hG2QrA==>
 to report this email as spam.

Re: Spark or custom processor?

Reply via email to