Re: MiNiFi C++ Data Provenance and Related Issues
I will not be continuing this discussion. I will leave it to others to pick it up if they feel it's needed.
-- View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/MiNiFi-C-Data-Provenance-and-Related-Issues-tp14024p14058.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.
Re: MiNiFi C++ Data Provenance and Related Issues
"Yes but there can be other hubs too and in parallel." [Daniel] For MiNiFi C++ -> SystemA -> SystemB -> ... -> NiFi, if you don't want provenance to travel, then I don't see it as an issue, since the outgoing message would be identical to what you have now. If you feel it's going to be extremely confusing, I could make it a new clone of the S2S MiNiFi C++ processor, but I don't see the point of just hiding a toggle. On the NiFi side, for this case, you would use the normal S2S intake methods you use now. No change. Also, if you're going from MiNiFi C++ -> SystemA, there is no change.

For MiNiFi C++ -> MiNiFi C++ -> ... -> NiFi, if you want provenance to travel, then yes, you are locked into using n*(MiNiFi C++) -> NiFi with provenance toggled on and using the new S2S receiving processors in MiNiFi C++/NiFi (it has to be a new one to avoid backwards-compatibility issues) that can handle provenance. Again, I don't see this as an issue either, since you clearly want this functionality if you're doing this. Am I missing something in my logic flow that you are seeing that I need to account for?

"You've mentioned this a couple times now." [Daniel] Agreed, and this is how this discussion is meant to be taken.

"I'm not quite sure I understand so please elaborate if my comments don't apply." [Daniel] It has to do with when and how it's consumed. On the current path, Atlas won't answer the issues, but as you said there are others, and I have my own in progress as well. I fundamentally disagree with the current sink-retrieve-sink ETL paradigm (as you've seen from my public papers; there are others not public yet as well), as it is a complete waste of time and resources at this point. In all my work, data is handled as it becomes available (near real time) rather than waiting for some ETL process to run at some arbitrary point in the future. By doing this you avoid unnecessary traffic, storage, processing, maintenance, and design, all while improving data availability.
More specifically to this discussion, the issue comes down to access from the point of origin. In an embedded or background instance of MiNiFi C++, bidirectional follow-up calls for provenance alone are not always going to be available. Additionally, where they are available, they are not going to be current, and hence are fairly useless for security applications. Think of trying this on your laptop, on IoT devices, or on financial transactions. If I find out 12-36 hours later, when you reconnect, or when I can send someone to the field to retrieve it, or when the ETL processes run, that there was an issue, it doesn't do me any good. As Randy mentioned, you can recombine all of this later; however, it is a very resource-consuming process. There is no reason not to have it available when the data is available, since it's just a matter of allowing for its transfer in line with the data. NiFi is not assuming responsibility for anything it doesn't already; this just extends its reach to the full NiFi/MiNiFi instance, so there should not be an ownership concern. This requires an extremely minor update in NiFi, but it serves a fundamental need in MiNiFi C++.

"Ok so I think what you're saying is" [Daniel] Right, and since you can just disable it if you don't need it, there is no performance or bandwidth hit unless you enable it.

"It is really important to propose and advocate" [Daniel] I don't see this as a model change; as per my previous questions, MiNiFi C++ seems not to have a solid model yet, as the time and effort is mainly being put into MiNiFi Java. Since I have very specific ideas around MiNiFi C++ (and discussed them with you last year, and with others at HW, back when MiNiFi was only going to be in C), I have not seen this as a radical departure but as an elaboration on what we had already discussed. If you or the community wants to go a different path, I have no issue branching and going a separate way with these and the LevelDB changes rather than introducing these changes into the current path.
Being open source, there is no single right answer, so I'm certainly open to any suggestions, but I think you'll find that what I'm proposing here is going to be important when you get to actual implementations, and it's easier to change now than when you're locked in later, especially given my issues getting our contributions into NiFi. As stated above, I don't see how this affects any other implementations or use cases of MiNiFi C++/NiFi as proposed.
-- View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/MiNiFi-C-Data-Provenance-and-Related-Issues-tp14024p14048.html
Re: MiNiFi C++ Data Provenance and Related Issues
As to Joe and Aldrin's concerns, I feel a bit more detail on what I had in mind might clear up some of the concerns and vagaries (all valid) that you mentioned. As Aldrin mentioned, to me provenance is not about metadata needed for routing. I don't doubt there are use cases for that, as Randy mentioned; however, that was not the concern I had in mind and am looking to address with this discussion. If the community wants to add more metadata functionality as well, we can certainly add that.

As for Joe's examples and concerns about in-band transfer, I look at MiNiFi C++ as a direct spoke of a NiFi hub, and as such it really can be treated as one "NiFi" instance. Additionally, since MiNiFi C++ is a complete rewrite, as has been previously discussed, making requirement variations from NiFi or MiNiFi Java is acceptable, in my opinion. As such, there is no value in having separate provenance for MiNiFi C++ and NiFi, since it is one cradle-to-grave path (that happens to use both). As for bandwidth concerns, this is actually exactly one of the issues that concerns me, as later calling the MiNiFi C++ enabled device merely to sort and retrieve provenance (which would be a heavy operation as currently constructed) is not realistic. One of the biggest selling points of NiFi is its full data provenance ability, and my goal is merely to extend it through the full "flow."

I personally don't see this living in the attributes as currently represented in the flowfiles, since that would not be an efficient structure to handle or maintain through MiNiFi C++ pathing. This requires the provenance tree related to that flowfile to be sent (which should be small-ish in a MiNiFi C++ instance). My design for it was that it would be a separate data point on the flowfile package, using a simple, extremely lightweight, and easy-to-manipulate structure.
Truthfully, it doesn't even have to be resident all through the MiNiFi C++ flow if a viable repo replaces LevelDB, and my preference is to add it in at the S2S processor. The important thing is that it can be sent with the flowfile through S2S and then added to the main NiFi provenance repo so as to provide a continuous chain. This would be easy to toggle through a single checkbox added to a MiNiFi C++ S2S variant, so that if provenance isn't important to you, you could simply choose not to integrate it. Since in this model MiNiFi C++ plus provenance only integrates with NiFi hubs, there is no reason to be concerned with outside compatibility for this specific S2S processor mechanism. I see the ability to allow for "in-band" communication at the S2S-to-S2S point as a requirement for some use cases.
-- View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/MiNiFi-C-Data-Provenance-and-Related-Issues-tp14024p14045.html
Re: MiNiFi C++ Data Provenance and Related Issues
Since MiNiFi C++ requires completely new code (unlike the Java version), I don't see any reason we can't deviate where it makes sense for the requirements. If we move the provenance onto the flowfile, then your build issues and my stability issues can be simplified, because the local provenance repo becomes log only, and the local repo could then be handled by a standard logging mechanism instead. As you stated, installing additional open source libraries in production environments is a near non-starter.

If no one disagrees with the approach or really desperately wants to take it on, I'm ok with taking the action item to start working on a good transport structure and looking at making the changes needed for it to work through S2S. This also requires making changes to NiFi to allow the provenance to be added to the main NiFi repo; this is something I was planning on doing anyway as part of a new enterprise dp/dg engine based on NiFi that I'm working on. We need someone to test a reliable replacement for LevelDB (be that LMDB, which I believe comes standard in RHEL distributions, or whatever) and integrate it, or to convert the local repo to log only. I'll get to it eventually after I make the other changes if no one else does.
-- View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/MiNiFi-C-Data-Provenance-and-Related-Issues-tp14024p14040.html
MiNiFi C++ Data Provenance and Related Issues
This is a break-off from the discussion on the MiNiFi C++ 0.1.0 Release thread. I assume a hub-and-spoke NiFi/MiNiFi C++ architecture. As discussed on that thread, I am concerned about the existing choice for data provenance tracking and the implications it leads to, as well as the current data provenance requirements for MiNiFi C++. MiNiFi C++ must be highly efficient and carry a minimal footprint in order to be able to function at background and embedded levels. As such, performance and space are priorities, as is the ability to communicate the needed information to the NiFi hub (i.e. there isn't space for a large unindexed data provenance archive locally, nor the processing ability to handle it).

The data provenance registry must be: 1) fault tolerant, 2) able to be easily purged, 3) fast to write, 4) easily accessed in session, and 5) easily accessed post session. The current choice (LevelDB) meets #3, but not the other four requirements. LevelDB is prone to corruption in cases of application failure during a write (fails #1). LevelDB has no indexing, and if keys are by UUID then there is no way to efficiently sort by date or by parent/child (fails #2, #4, #5). The choice for a provenance store should answer as many of these as possible. For permanent stores, the choices would be super-lightweight databases or something fault resistant like LMDB. I don't have any preference, just that it functionally addresses as many criteria as possible and absolutely satisfies #1.

A solution to #4 and #5 could be that the entire provenance tree inside MiNiFi C++ rides with the flowfile and transfers to NiFi (including through descendants). This I see as something of a requirement as well, as it is the only efficient way to provide cradle-to-grave provenance through the entire MiNiFi/NiFi system without the need for heavy post-processing to reconstruct the tree.
While this adds slightly to the package being sent between MiNiFi and NiFi, it's negligible compared to querying for it after the fact, especially where MiNiFi is embedded or on an IoT device. Any thoughts?
-- View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/MiNiFi-C-Data-Provenance-and-Related-Issues-tp14024.html
Re: [DISCUSS] MiNiFi C++ 0.1.0 Release
For me personally, I don't see a value add in MiNiFi Java. The needs that NiFi can't address, MiNiFi Java can't either, so my focus is MiNiFi C++, as that is the hole that needs fixing (again, in my opinion), so that is where my MiNiFi focus is going to be. As I go through things, I am sure I will have more questions about choices that have been made so far regarding MiNiFi C++ (as with all things, we all have different views on how to do things, and there isn't necessarily a right/wrong answer). If there is a better forum to address these questions specific to MiNiFi C++, please let me know.

My most pressing question is the choice to use LevelDB for the provenance repository rather than LMDB. A core tenet of NiFi is fault tolerance in nearly all cases (as well as full data provenance). As LevelDB is vulnerable to corruption during write operations due to unexpected application interruptions, would not something more fault tolerant, such as LMDB (covered under the OpenLDAP Public License), be preferable? The question of fault tolerance applies to the flowfile repository as well.
-- View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/DISCUSS-MiNiFi-C-0-1-0-Release-tp13956p13959.html
Re: [DISCUSS] MiNiFi C++ 0.1.0 Release
Having been out of touch since MiNiFi C++ got added, and just now getting into it, is there a reason the C++ version is trying to follow the MiNiFi Java version closely rather than just ensuring connectivity with NiFi? I have not been able to find a lot of detail regarding the roadmap for MiNiFi C++. It seems to me that this tight coupling is coming at the cost of the efficiency that should be gained through a C++ version. MiNiFi C++ should lend itself to a hub-and-spoke model, with MiNiFi C++ acting as the spoke clients and NiFi as the hub. This only works, however, if maximum efficiency is maintained, as spoke needs may range from servers to embedded devices. In addition to its embedded advantages, MiNiFi C++ also has the ability to run natively as a Windows service with direct interaction with the Windows API, which is difficult at best with the Java version. Can you please provide some clarity on where things are headed? For reference, I have been through the wiki, JIRA, Confluence, Git, etc.
-- View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/DISCUSS-MiNiFi-C-0-1-0-Release-tp13956p13957.html
Tool specific Processors
For processors that are designed for a specific tool (which may or may not be open sourced itself), how should they be contributed? Specifically, I am working on a few Oracle BRM processors for NiFi that avoid the need to go directly to the database through the problematic SQL processors and instead use the established APIs. However, as these are really specialty processors for a specific tool (in this case Oracle BRM), I don't know that they belong in a standard bundle or as a higher-level NAR bundle. Is there an in-progress nifi-custom-bundle (or some such) that I could add these to so they are community available, or what is the best practice for contributing them?

Thanks,
Daniel

Daniel S. Cave
Consultant / Software Developer
Office: 214.333.2000 | Mobile: 214.517.1222 | Fax: 214.343.1107