Re: MiNiFi C++ Data Provenance and Related Issues

2016-11-30 Thread Daniel Cave
I will not be continuing this discussion.  I will leave it to others to pick
it up if they feel it's needed.



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/MiNiFi-C-Data-Provenance-and-Related-Issues-tp14024p14058.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.


Re: MiNiFi C++ Data Provenance and Related Issues

2016-11-29 Thread Daniel Cave
"Yes but there can be other hubs too and in parallel."
[Daniel] For MiNiFi C++ -> SystemA -> SystemB -> ... -> NiFi, if you don't
want provenance to travel then I don't see it as an issue since the outgoing
message would be identical to what you have now.  If you feel it's going to
be extremely confusing then I could make it a new clone of the S2S MiNiFi
C++ processor, but I don't see a point to just hide a toggle.  On the NiFi
side for this case you would use the normal S2S intake methods you use now. 
No change.  Also, if you're going from MiNiFi C++ -> SystemA there is no
change.
For MiNiFi C++ -> MiNiFi C++ ->-> NiFi, if you want provenance to travel
then yes you are locked into using n*(MiNiFi C++) -> NiFi with the
provenance toggled on and using the new S2S receiving processors in MiNiFi
C++/NiFi (it has to be a new one to avoid backwards compatibility issues)
that can handle provenance.  Again, I don't see this as an issue either
since you are clearly wanting this functionality if you're doing this.
Am I missing something in my logic flow that you are seeing that I need to
account for?

"You've mentioned this a couple times now. "
[Daniel] Agreed and this is how this discussion is meant to be taken.

"I'm not quite sure I understand so please elaborate if my
comments don't apply."
[Daniel] It has to do with when and how it's consumed.  On the current path Atlas
won't answer the issues, but as you said there are others and I have my own
in progress as well.  I fundamentally disagree with the current
sink-retrieve-sink ETL paradigm (as you've seen from my public papers, there
are others not public yet as well) as it is a complete waste of time and
resources at this point.  In all my work, data is handled as available (near
real-time) rather than waiting for some ETL processes to run at some
arbitrary point in the future.  By doing this you avoid unnecessary traffic,
storage, processing, maintenance, and design all while improving data
availability.  More specifically to this discussion, the issue comes down to
access from the point of origin.  In an embedded or background instance of
MiNiFi C++, bidirectional followup calls for provenance only are not always
going to be available.  Additionally, where they are available they are not
going to be current and hence are fairly useless for security applications. 
Think of trying this on your laptop, IoT devices, or on financial
transactions.  If I find out 12-36hrs later when you reconnect or I can send
someone to the field to retrieve it or the ETL processes run that there was
an issue, it doesn't do me any good.  As Randy mentioned, you can recombine
all this later, however it is a very resource consuming process.  There is
no reason not to have it available when the data is available since it's
just a matter of allowing for its transfer in line with the data.  NiFi is
not assuming responsibility for anything it doesn't already; this just
extends its reach to the full NiFi/MiNiFi instance so there should not be
an ownership concern.  This requires an extremely minor update in NiFi, but
is for a fundamental need in MiNiFi C++.

"Ok so I think what you're saying is"
[Daniel] Right, and since you can just disable it if you don't need it there
is no performance or bandwidth hit unless you enable it.

"It is really important to propose and advocate"
[Daniel] I don't see this as a model change.  As per my previous questions,
MiNiFi C++ seems to not yet have a solid model, as the time and effort is
mainly being put into MiNiFi Java.  Since I have very specific ideas
around MiNiFi C++ (and have discussed them with you last year and others at
HW when MiNiFi was only going to be in C) I have not seen this as a radical
departure but an elaboration on what we had already discussed.  If you or
the community wants to go a different path, I have no issue branching and
going a separate way with these and the LevelDB changes rather than
introducing these changes into the current path.  Being open source, there is
no right answer, so I'm certainly open to any suggestions, but I think
you'll find what I'm proposing here is going to be important when you get to
actual implementations of it and it's easier to change now than when you're
locked in later, especially given my issues getting our contributions into
NiFi.  As stated above, I don't see how this affects any other
implementations or use cases of MiNiFi C++/NiFi as proposed.




--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/MiNiFi-C-Data-Provenance-and-Related-Issues-tp14024p14048.html


Re: MiNiFi C++ Data Provenance and Related Issues

2016-11-29 Thread Daniel Cave
As to Joe and Aldrin's concerns, I feel a bit more detail of what I had in
mind might clear up some of the concerns and vagaries (all valid) that you
mentioned.

As Aldrin mentioned, to me provenance is not about metadata needed for
routing.  I don't doubt there are use cases for that, as Randy mentioned,
however it was not the concern I had in mind that I am looking to address
with this discussion.  If the community also wants to add more
functionality from a metadata standpoint, we can certainly add that.

As for Joe's examples and concerns for in-band, I look at MiNiFi C++ as a
direct spoke of a NiFi hub and as such it really can be treated as one
"NiFi" instance.  Additionally, since MiNiFi C++ is a complete rewrite, as
has been previously discussed, making requirement variations from NiFi or
MiNiFi Java is acceptable, in my opinion.  As such, there is no value in
having separate provenance for MiNiFi C++ and NiFi since it is one cradle to
grave path (that happens to use both).  As for bandwidth concerns, this is
actually exactly one of the issues that concerns me as later calling to the
MiNiFi C++ enabled device merely to sort and retrieve provenance (which
would be a heavy operation as currently constructed) is not realistic.  One
of the biggest selling points of NiFi is its full data provenance ability,
and my goal is merely to extend it through the full "flow".  I personally
don't see this as an attribute as currently represented in the flowfiles
since that would not be an efficient structure to handle or maintain through
MiNiFi C++ pathing.  This requires the provenance tree related to that
flowfile to be sent (which should be small-ish in a MiNiFi C++ instance). 
My design for it was that it would be a separate data point on the flowfile
package using a simple, extremely lightweight, and easy to manipulate
structure.  Truthfully, it doesn't even have to be resident all through the
MiNiFi C++ flow if a viable repo replaces LevelDB and my preference is to
add it in at the S2S processor.  The important thing is that it can be sent
with the flowfile through S2S and then added to the main NiFi provenance
repo so as to provide a continuous chain.  This would be easy to toggle
through a single checkbox added to a MiNiFi C++ S2S variant, so that you
could choose not to integrate if provenance isn't important to you.
Since in this model MiNiFi C++ plus provenance only integrates with NiFi
hubs, there is no reason to be concerned with outside compatibility for this
specific S2S processor mechanism.
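
To make the toggle idea concrete, here is a minimal sketch of a sender that
only attaches a provenance section when the option is enabled.  The framing
("<tag>:<byte length>" followed by the bytes) and the function names are
assumptions made up for illustration; this is not the real Site-to-Site wire
format or any MiNiFi C++ API.

```cpp
#include <string>
#include <vector>

// Hypothetical framing, NOT the real Site-to-Site protocol.
// Each section is written as "<tag>:<byte length>\n<bytes>".
void append_section(std::string& out, const std::string& tag,
                    const std::string& body) {
    out += tag + ":" + std::to_string(body.size()) + "\n" + body;
}

// Encode a flowfile's content and, only if the toggle is on, its provenance
// events (one event per line) as a separate section of the same package.
std::string encode_package(const std::string& content,
                           const std::vector<std::string>& provenance_events,
                           bool include_provenance) {
    std::string out;
    append_section(out, "content", content);
    if (include_provenance) {
        std::string joined;
        for (const auto& event : provenance_events) {
            joined += event;
            joined += '\n';
        }
        append_section(out, "provenance", joined);
    }
    return out;
}
```

With the toggle off, the encoded package is byte-for-byte the same as a
package that never knew about provenance, which is the "no hit unless you
enable it" property discussed in this thread.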

I see the ability to allow for "in-band" communication at the S2S-S2S point
as a requirement for some use cases.



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/MiNiFi-C-Data-Provenance-and-Related-Issues-tp14024p14045.html


Re: MiNiFi C++ Data Provenance and Related Issues

2016-11-29 Thread Daniel Cave
Since MiNiFi C++ requires completely new code (unlike the Java version), I
don't see any reason we can't deviate where it makes sense for the
requirements.  If we move the provenance onto the flowfile, then your build
issues and my stability issues can be simplified because the local provenance
repo becomes log only, and it could be handled by a standard logging mechanism
instead.  As you stated, installing additional open source libraries in
production environments is a near non-starter.

If no one disagrees with the approach or really desperately wants to take it
on, I'm ok with taking the action item to start working on a good transport
structure and looking at making the changes needed for it to work through
S2S. This also requires making changes to NiFi to allow for the provenance
to be added to the main NiFi repo; this is something I was planning on doing
anyway as part of a new enterprise dp/dg engine based on NiFi I'm working
on.

We need someone to test a reliable replacement for LevelDB (be that LMDB,
which I believe comes standard in RHEL distributions, or whatever) and
integrate it or convert the local repo to log only.  I'll get to it
eventually after I make the other changes if no one else does.



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/MiNiFi-C-Data-Provenance-and-Related-Issues-tp14024p14040.html


MiNiFi C++ Data Provenance and Related Issues

2016-11-28 Thread Daniel Cave
This is a break off from the discussion on the MiNiFi C++ 0.1.0 Release
thread.  I assume a hub and spoke NiFi/MiNiFi C++ architecture.

As discussed on that thread, I am concerned about the existing choice for
data provenance tracking and the implications it leads to as well as the
current data provenance requirements for MiNiFi C++.  MiNiFi C++ must be
highly efficient and carry a minimal footprint in order to be able to
function at background and embedded levels.  As such, performance and space
are priorities, as is the ability to communicate to the NiFi hub the needed
information (i.e. there isn't space for a large unindexed data provenance
archive locally nor the processing ability to handle it).

The data provenance registry must be:  1) Fault tolerant, 2) able to be
easily purged, 3) fast to write, 4) easily accessed in session, 5) easily
accessed post session.  The current choice (LevelDB) meets #3, but not the
other 4 requirements.  LevelDB is prone to corruption in cases of
application failure during a write (fails #1).  LevelDB has no indexing, and
if keys are by UUID then there is no way to efficiently sort by date or by
parent/child (fails #2, #4, #5).  The choice for a provenance store should
answer as many of these as possible.  For permanent stores, the choices
would be super lightweight databases or something fault resistant like LMDB. 
I don't have any preference, just that it functionally addresses as many
criteria as possible and absolutely satisfies #1.
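
To illustrate the sorting and purging point: in any store that keeps keys in
byte order, putting a fixed-width timestamp in front of the UUID makes a date
range scan or a purge a single contiguous range operation, whereas UUID-only
keys scatter events of the same age across the keyspace.  The sketch below
uses std::map as a stand-in for such an ordered store; the key layout and the
function names are assumptions for illustration, not LevelDB or LMDB API code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <iterator>
#include <map>
#include <string>

// Hypothetical stand-in for an ordered key-value store (keys sorted bytewise).
using ProvenanceStore = std::map<std::string, std::string>;

// Build a key whose lexicographic order matches chronological order:
// a zero-padded 16-digit hex timestamp, then '/', then the event UUID.
std::string make_key(std::uint64_t epoch_millis, const std::string& uuid) {
    char prefix[17];
    std::snprintf(prefix, sizeof(prefix), "%016llx",
                  static_cast<unsigned long long>(epoch_millis));
    return std::string(prefix) + "/" + uuid;
}

// Purge every event strictly older than cutoff_millis; returns count removed.
std::size_t purge_older_than(ProvenanceStore& store,
                             std::uint64_t cutoff_millis) {
    // Smallest possible key at the cutoff: the timestamp prefix, empty UUID.
    auto end = store.lower_bound(make_key(cutoff_millis, ""));
    std::size_t removed =
        static_cast<std::size_t>(std::distance(store.begin(), end));
    store.erase(store.begin(), end);
    return removed;
}
```

A parent/child lookup would need a second key family (for example one keyed
by parent UUID), which is the kind of secondary index the current layout
lacks.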

A solution to #4 and #5 could be that the entire provenance tree inside
MiNiFi C++ rides with the flowfile and transfers to NiFi (including through
descendants).  This I see as something of a requirement as well, as it is
the only efficient way to provide cradle to grave provenance through the
entire MiNiFi/NiFi system without the need for heavy post processing to
reconstruct the tree.  While this adds slightly to the package being sent
between MiNiFi and NiFi, the cost is negligible compared to querying for the
provenance afterward, especially where MiNiFi is embedded or on an IoT device.
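
A minimal sketch of provenance riding with the flowfile, including through
descendants: each flowfile carries its own event list, and a child starts
with a copy of its parent's list.  The types and function names below are
hypothetical and are not MiNiFi C++ classes; the event type strings are only
illustrative labels.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, illustrative types; not MiNiFi C++ classes.
struct ProvenanceEvent {
    std::uint64_t epoch_millis;
    std::string type;       // illustrative label, e.g. "CREATE" or "FORK"
    std::string component;  // processor that emitted the event
};

struct FlowFile {
    std::string uuid;
    std::string parent_uuid;               // empty for a root flowfile
    std::vector<ProvenanceEvent> lineage;  // travels with the flowfile
};

// A new root flowfile starts its lineage with a single CREATE event.
FlowFile create_flowfile(const std::string& uuid, std::uint64_t now,
                         const std::string& component) {
    FlowFile ff{uuid, "", {}};
    ff.lineage.push_back({now, "CREATE", component});
    return ff;
}

// A child copies the parent's lineage, then records how it was derived,
// so the receiver never has to query the device to rebuild the tree.
FlowFile fork_flowfile(const FlowFile& parent, const std::string& child_uuid,
                       std::uint64_t now, const std::string& component) {
    FlowFile child{child_uuid, parent.uuid, parent.lineage};
    child.lineage.push_back({now, "FORK", component});
    return child;
}
```

Copying the lineage into each child trades a little extra payload for not
needing any lookup at the point of origin, which is the trade-off argued for
above.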

Any thoughts?



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/MiNiFi-C-Data-Provenance-and-Related-Issues-tp14024.html


Re: [DISCUSS] MiNiFi C++ 0.1.0 Release

2016-11-22 Thread Daniel Cave
For me personally, I don't see a value add of MiNiFi Java.  The needs that
NiFi can't address MiNiFi Java can't either, so my focus is going to be
MiNiFi C++, as that is the hole that needs fixing, again in my opinion.

As I go through things I am sure I will have more questions about choices
that have been made so far regarding MiNiFi C++ (as with all things, we all
have different views on how to do things and there isn't necessarily a
right/wrong answer).  If there is a better forum to address these more
specific to MiNiFi C++, please let me know.  My most pressing question is
the choice to use LevelDB for the provenance repository rather than LMDB.  A
core tenet of NiFi is fault tolerance in near all cases (as well as full
data provenance).  As LevelDB is vulnerable to corruption during write
operations due to unexpected application interruptions, would not something
more fault tolerant such as LMDB (covered under OpenLDAP Public License) be
preferable?  The question of fault tolerance applies to the flowfile
repository as well.



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/DISCUSS-MiNiFi-C-0-1-0-Release-tp13956p13959.html


Re: [DISCUSS] MiNiFi C++ 0.1.0 Release

2016-11-22 Thread Daniel Cave
Having been out of touch since MiNiFi C++ got added and just getting into it,
is there a reason the C++ version is trying to follow closely the MiNiFi
Java version rather than just ensuring connectivity with NiFi?  I have not
been able to find a lot of details regarding the roadmap for MiNiFi C++.

It seems to me that this tight coupling is coming at the cost of the
efficiency that should be gained through a C++ version.  MiNiFi C++ should
lend itself to a hub and spoke model with MiNiFi C++ acting as the spoke
clients and NiFi as the hub.  This only works, however, if maximum
efficiency is maintained as spoke needs may range from servers to embedded. 
In addition to the embedded advantages, MiNiFi C++ can also run
natively as a Windows service with direct interaction with the Windows API,
which is difficult at best with the Java version.

Can you please provide some clarity on where things are headed?  For
reference, I have been through the wiki, JIRA, Confluence, Git, etc.



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/DISCUSS-MiNiFi-C-0-1-0-Release-tp13956p13957.html


Tool specific Processors

2015-12-03 Thread Daniel Cave
For processors that are designed for a specific tool (that may or may not be 
open sourced itself), how should they be contributed?

Specifically, I am working on a few Oracle BRM processors for NiFi that avoid 
the need to go directly to the database through the problematic SQL processors 
and instead use the established APIs.  However, as these are really specialty 
processors for a specific tool (in this case Oracle BRM), I don't know whether 
they belong in a standard bundle or in a higher level nar bundle.

Is there an in-progress nifi-custom-bundle (or some such) that I could add 
these to so they are available to the community, or what is the best practice to 
contribute them?

Thanks,
Daniel

Daniel S. Cave
Consultant / Software Developer
Office:  214.333.2000  | Mobile:  214.517.1222 | Fax:  214.343.1107