I have no comments as to LevelDB's fit for the requirements Daniel described. However, the current build process & binary artifact produced are problematic in two ways:
1. OSX has proven particularly stubborn about putting the leveldb headers in a place cmake can find them.
2. Running the minifi-cpp binary requires RHEL environments to pre-install leveldb-devel (and its prerequisite, epel-release), both stumbling blocks for enterprise deployment.

+1 for shipping provenance inside FlowFiles. As a consumer of FlowFiles generated by nifi-minifi-cpp agents, I need the context of every FlowFile's provenance to route it properly. The existing solutions, a provenance query per FlowFile or exporting provenance separately and joining via UUID downstream, are painful.

This suggests that minifi-cpp might not need to mimic its Java cousins' separate repository stores, particularly if S2S could be optimized to avoid re-transmission of static or slowly changing flow metadata.

On 2016-11-28 09:25 (-0500), Daniel Cave <d...@ssglimited.com> wrote:
> This is a break off from the discussion on the MiNiFi C++ 0.1.0 Release
> thread. I assume a hub and spoke NiFi/MiNiFi C++ architecture.
>
> As discussed on that thread, I am concerned about the existing choice for
> data provenance tracking and the implications it leads to as well as the
> current data provenance requirements for MiNiFi C++. MiNiFi C++ must be
> highly efficient and carry a minimal footprint in order to be able to
> function at background and embedded levels. As such, performance and space
> are priorities, as is the ability to communicate to the NiFi hub the needed
> information (i.e. there isn't space for a large unindexed data provenance
> archive locally nor the processing ability to handle it).
>
> The data provenance registry must be: 1) Fault tolerant, 2) able to be
> easily purged, 3) fast to write, 4) easily accessed in session, 5) easily
> accessed post session. The current choice (LevelDB) meets #3, but not the
> other 4 requirements. LevelDB is prone to corruption in cases of
> application failure during a write (fails #1).
> LevelDB has no indexing, and
> if keys are by UUID then there is no way to efficiently sort by date or by
> parent/child (fails #2, #4, #5). The choice for a provenance store should
> answer as many of these as possible. For permanent stores, the choices
> would be super lightweight databases or something fault resistant like LMDB.
> I don't have any preference, just that it functionally addresses as many
> criteria as possible and absolutely satisfies #1.
>
> A solution to #4 and #5 could be that the entire provenance tree inside
> MiNiFi C++ rides with the flowfile and transfers to NiFi (including through
> descendants). This I see as something of a requirement as well, as it is
> the only efficient way to provide cradle to grave provenance through the
> entire MiNiFi/NiFi system without the need for heavy post processing to
> reconstruct the tree. While this adds slightly to the package being sent
> between MiNiFi and NiFi, it's negligible compared to querying for this
> afterward, especially where MiNiFi is embedded or on an IoT device.
>
> Any thoughts?
>
> --
> View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/MiNiFi-C-Data-Provenance-and-Related-Issues-tp14024.html
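
To make the "provenance rides with the flowfile" idea a bit more concrete, here is a minimal C++ sketch. Everything in it is an assumption for illustration: the types (FlowFile, ProvenanceEvent, EventType), the functions (create_flowfile, fork_flowfile, serialize_lineage) and the line-based wire format are hypothetical and are not taken from the MiNiFi C++ codebase or the S2S protocol. It only shows that a child can inherit its parent's lineage, so the receiver gets the whole tree without a separate provenance query or a UUID join.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event kinds; the real provenance model has more.
enum class EventType { Create, Fork, Send };

// One provenance record. Field names are assumptions for this sketch.
struct ProvenanceEvent {
  EventType type;
  std::string flowfile_uuid;   // FlowFile the event applies to
  std::uint64_t timestamp_ms;  // event time, supplied by the caller
};

// A FlowFile that carries its own lineage, oldest event first.
struct FlowFile {
  std::string uuid;
  std::string content;
  std::vector<ProvenanceEvent> lineage;
};

// Create a new FlowFile whose lineage starts with a single Create event.
FlowFile create_flowfile(std::string uuid, std::string content,
                         std::uint64_t now_ms) {
  assert(!uuid.empty());
  FlowFile ff{std::move(uuid), std::move(content), {}};
  ff.lineage.push_back({EventType::Create, ff.uuid, now_ms});
  return ff;
}

// Derive a child that copies the parent's full lineage, then records the
// fork, so descendants carry the history of their ancestors.
FlowFile fork_flowfile(const FlowFile& parent, std::string child_uuid,
                       std::uint64_t now_ms) {
  assert(!child_uuid.empty());
  FlowFile child{std::move(child_uuid), parent.content, parent.lineage};
  child.lineage.push_back({EventType::Fork, child.uuid, now_ms});
  return child;
}

// Map an event kind to a short label for the wire format below.
const char* event_name(EventType t) {
  switch (t) {
    case EventType::Create: return "CREATE";
    case EventType::Fork:   return "FORK";
    case EventType::Send:   return "SEND";
  }
  return "UNKNOWN";
}

// Flatten the lineage to one "TYPE,uuid,timestamp" line per event.
// This made-up text format stands in for whatever would actually be
// attached to the FlowFile on transfer.
std::string serialize_lineage(const FlowFile& ff) {
  std::string out;
  for (const auto& e : ff.lineage) {
    out += event_name(e.type);
    out += ',';
    out += e.flowfile_uuid;
    out += ',';
    out += std::to_string(e.timestamp_ms);
    out += '\n';
  }
  return out;
}
```

With this shape, forking a FlowFile created at time 1000 into a child at time 2000 yields a child lineage of two events (the parent's Create, then the child's Fork), and the parent's own lineage is left unchanged. The cost is that lineage is duplicated into every descendant, which is the "adds slightly to the package" trade-off mentioned in the quoted message.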