Re: MiNiFi C++ Data Provenance and Related Issues

Joe Witt Tue, 29 Nov 2016 05:56:58 -0800

Might make sense to split these discussions out.  Regarding provenance...

Data provenance is about tracking the origin and attribution of data,
and the model that we've got allows that to occur despite the fact
that we're often handling directed graphs of flows involving
numerous systems.


Transport models:
- "In-band" embedded with the data the provenance is about.

This is the original model we considered several years ago and is how
things are commonly done in other systems with some aspects of
provenance.  The problem with this approach is how do you resolve the
provenance chain when you deliver from MiNiFi-A to System1 and System2
in parallel?  This is a simple case.  But what does that provenance
chain look like for the object sent to System1 and what does it look
like for the System2 provenance?  Of course one object won't know
about the other object.  So what does that provenance tell us?

- "Out of band" A separate feed of event data which is 'about the
data' but is not the data itself.  This is how NiFi works today.
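
To make the difference between the two models concrete, here is a
minimal sketch of the fan-out case described above.  All class and
method names are hypothetical and chosen only for illustration; this is
not MiNiFi or NiFi code, and the event strings are made up.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical types for illustration only; not the MiNiFi or NiFi API.


@dataclass
class InBandObject:
    """Data that carries its own provenance chain along with it."""
    payload: bytes
    chain: List[str] = field(default_factory=list)

    def record(self, event: str) -> None:
        self.chain.append(event)

    def clone(self) -> "InBandObject":
        # Each copy gets its own independent chain from this point on.
        return InBandObject(self.payload, list(self.chain))


@dataclass
class OutOfBandLog:
    """Separate feed of events 'about the data', keyed by object id."""
    events: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, object_id: str, event: str) -> None:
        self.events.append((object_id, event))

    def events_for(self, object_id: str) -> List[str]:
        return [event for oid, event in self.events if oid == object_id]


# In-band: MiNiFi-A delivers the same object to System1 and System2.
original = InBandObject(b"reading")
original.record("CREATE at MiNiFi-A")
to_system1 = original.clone()
to_system2 = original.clone()
to_system1.record("SEND to System1")
to_system2.record("SEND to System2")

# Out of band: both sends land in one event feed for the same object.
log = OutOfBandLog()
log.record("obj-1", "CREATE at MiNiFi-A")
log.record("obj-1", "SEND to System1")
log.record("obj-1", "SEND to System2")
```

In this sketch the copy delivered to System1 carries only its own linear
chain and has no record of the send to System2, while
log.events_for("obj-1") returns all three events.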

Now, as Randy points out there are cases where having the provenance
data in-band would help with routing cases.  I'd make the case that
this is not about data provenance as we're generally talking about it.
That is just contextual metadata and is why both MiNiFi and NiFi
support and advocate the flowfile construct which has metadata and
content - just like HTTP does.  If you need information about where
data came from then it should be embedded in the flowfile metadata.
If there are common details that are valuable and can best be relayed
by the last component that touched an object let's discuss that.

-- Now, we could arguably support both models whereby we allow you to
optionally send in-band provenance but we must make it clear that the
provenance chain of an in-band message only reflects the linear chain
of provenance that is known by that object and does NOT reflect the
full graph.  However, I'd also have significant concerns about how to
efficiently store this data.  FlowFile attributes are today held in
memory.  So, alternatively we could make a new FlowFile construct for
this such that the flowfile chain (which could be quite large) is in
some form of non-memory-loaded content.  But this would also be a
pretty huge change.

It isn't clear to me that introducing an in-band model is a good path for us.

Thanks
Joe

On Tue, Nov 29, 2016 at 7:38 AM, Aldrin Piri <aldrinp...@gmail.com> wrote:
> Hey folks,
>
> Good commentary and I would encourage you to create associated tickets
> where applicable such that we can track such ideas and their efforts from a
> community project level.
>
> Concerning building, Randy, if you could provide more details on your OS X
> build problems, this would be greatly helpful.  I know a number of
> contributors have OS X machines and seem to have reasonable success so any
> details on your environment would be helpful in trying to track down the
> problem.  Certainly understand the concerns over wanting things to work on
> a wide variety of systems as stock.  This was voiced in part by
> https://issues.apache.org/jira/browse/MINIFI-118.  We certainly have
> options here depending on what the target environment will support, such as
> more static linking which may be acceptable for larger systems running more
> enterprise level OSes.
>
> LMDB certainly seems like it could be an interesting candidate from some
> initial glances over it, and its licensing (OpenLDAP Public License) seems
> like a variant of a 3-Clause BSD, so it should be okay to utilize from an
> ALv2 standpoint.  Definitely worth pursuing, and as mentioned in the prior
> thread, there are no hard and fast commitments to a particular technology
> but rather, especially in its early stages, to establish the interfaces and
> framework and provide a working implementation such that there is a place
> to start.
>
> Concerning the idea of integrating provenance with FlowFiles, I can
> certainly see the value in bundling it with the FlowFiles from the
> standpoint of minimizing footprint and resource utilization on
> device/source.  One important item to also be mindful of that has come up
> with a number of folks looking to tackle management of dataflow is also
> that of limited communications and/or prohibitive cost when looking at
> large deployments of such agents.  A separate provenance repository allows
> the sending of provenance events out of band when convenient or explicitly
> requested/needed.  In another aspect on that idea, including provenance in
> each FlowFile could exhaust disk more quickly in the event that a means of
> transmission is not available.  In this case, the discrete storage
> mechanisms could allow the purging and removal of provenance without the
> cost of losing data that might otherwise be able to continue being
> buffered.  That's not to say this use case is any more valid or important,
> but another point of consideration in the design choices made for
> data/provenance storage and transmission.
>
> I think the key point for this effort is that there are many widely
> varying use cases and situations for how this particular implementation
> needs to be built, deployed, and utilized.  That makes for some
> interesting discussions and design processes, and should make for a
> rewarding challenge.
>
> Thanks for the input!
>
> On Tue, Nov 29, 2016 at 4:56 AM, Daniel Cave <dc...@ssglimited.com> wrote:
>
>> Since MiNiFi C++ requires completely new code (unlike the Java version), I
>> don't see any reason we can't deviate where the requirements justify it.  If
>> we move the provenance onto the flowfile, then your build issues and my
>> stability issues can be simplified because the local provenance repo
>> becomes log-only and could be handled by a standard logging
>> mechanism instead.  As you stated, installing additional open source
>> libraries in production environments is a near non-starter.
>>
>> If no one disagrees with the approach or really desperately wants to take
>> it on, I'm ok with taking the action item to start working on a good
>> transport structure and looking at making the changes needed for it to work
>> through S2S. This also requires making changes to NiFi to allow for the
>> provenance to be added to the main NiFi repo; this is something I was
>> planning on doing anyway as part of a new enterprise dp/dg engine based on
>> NiFi I'm working on.
>>
>> We need someone to test a reliable replacement for LevelDB (be that LMDB,
>> which I believe comes standard in RHEL distributions, or whatever) and
>> integrate it or convert the local repo to log only.  I'll get to it
>> eventually after I make the other changes if no one else does.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-nifi-developer-list.39713.n7.nabble.com/MiNiFi-C-Data-Provenance-and-Related-Issues-tp14024p14040.html
>> Sent from the Apache NiFi Developer List mailing list archive at
>> Nabble.com.
>>
