Thanks, all, for the great feedback so far. I left out any mention of how I would envision handling the authorizations.xml and users.xml in the new approach, but yes, as has been suggested, I do believe that it will be important to also inherit those from the cluster.
Bryan also brought up a good point regarding handling of bundle information. My thought, at present, would be to leave that logic as-is for now. Unless you have any suggested improvements that could be made?

> On Jun 7, 2018, at 10:48 AM, Joe Witt <joe.w...@gmail.com> wrote:
>
> Mark
>
> I definitely think it is time to move on from the fingerprinting
> model. I recall the conversations long ago that led us down this path
> and ultimately the thing that mattered most was ensuring data loss
> cases were prevented. This still addresses that, reduces a ton of
> code, and simplifies the experience. Ultimately, we want to ensure a
> couple things:
> 1) When a node is in a weird state it should be able to be
> disconnected from the cluster and debugged if needed.
> 2) If a node is meant to be part of a cluster as per its configuration
> then all things should help us get it back to that state and as
> automated as possible. The only prevention should be avoiding
> unintended data loss.
>
> As Bryan and others have noted there are other files such as
> authorizations. They should be inherited by the cluster consensus if
> possible. Following logic of #2 above the user makes it clear during
> initial node setup that the node ultimately belongs in a cluster.
>
> All of these improvements are making a huge difference so thanks for
> all the efforts. I've got a flow now with 100,000+ processors in it
> and the responsiveness is spot on!
>
> Thanks
> Joe
>
> On Thu, Jun 7, 2018 at 10:20 AM, Bryan Bende <bbe...@gmail.com> wrote:
>> Using the versioned flow logic seems like a good idea.
>>
>> Would the authorizer fingerprints still be checked as part of joining
>> the cluster?
>>
>> Currently that is appended to the overall fingerprint to ensure each
>> node has the same users/policies, or at least same config (i.e. LDAP).
>> Would be nice if a node could merge the cluster's users/groups/policies
>> into itself, as long as it had a subset of what the cluster had
>> (assuming a configurable authorizer).
>>
>> Also, I believe the list of bundles on the node is compared to the
>> bundles in the cluster to ensure consistency. I don't think this is
>> technically part of the fingerprint, but just curious if you envision
>> that improving/changing at all.
>>
>> On Thu, Jun 7, 2018 at 10:14 AM, Otto Fowler <ottobackwa...@gmail.com> wrote:
>>> Great write up.
>>>
>>> While I am not an expert on clustering, it would seem that having one
>>> method of comparing flows, perhaps with different strategies within
>>> that, would be more maintainable as well.
>>>
>>> Are you proposing that there is a unified flow comparison
>>> capability/implementation/service that is shared between clustering
>>> and versioned use cases?
>>>
>>> On June 7, 2018 at 09:46:27, Mark Payne (marka...@hotmail.com) wrote:
>>>
>>> Hi all,
>>>
>>> Over the past couple of months, I have been doing a lot of testing
>>> with large scale flows and talking to others who are using large
>>> scale flows in production. ("Large scale" flows in this case means
>>> several thousand to tens of thousands of Processors). While NiFi does
>>> a really good job of handling the data flow, one area that needs some
>>> improvement is around NiFi's clustering. So for the 1.7.0 version of
>>> NiFi, we have spent quite a bit of time focusing on improving the
>>> clustering mechanism to hold up to more demanding flows. The focus
>>> really can be broken down into three focus areas: UI sluggishness
>>> [1] [2] [3], Cluster Stability [4] [2] [5] [3], and User Experience
>>> [5] [6] [7] (note that many of these JIRAs are listed under more than
>>> 1 category.)
>>>
>>> With the above-mentioned JIRAs, I think we have significantly improved
>>> the stability and user experience around clustering.
>>> Local testing shows that in some cases, requests that previously took
>>> 15+ seconds (such as instantiating a template with several thousand
>>> processors) now take around 1 second. This provides a better user
>>> experience and also improves our cluster stability because it
>>> prevents nodes from dropping out of the cluster due to timeouts.
>>>
>>> There is, however, another important area that I believe is ripe for
>>> improvement in our current model. That is the mechanism used when a
>>> node joins a cluster, in order to determine if the cluster's flow can
>>> be inherited by the node. While the above work will improve stability
>>> considerably, we need to be very mindful that failures will still
>>> occur. And we need to be good at recovering from those.
>>>
>>> The way that we do this currently is that we download the flow from
>>> the cluster, and then we "fingerprint" the flow. We then "fingerprint"
>>> our own flow and see if they match. What we mean by fingerprinting is
>>> that we go through the flow and pick out which elements should make a
>>> flow uninheritable and concatenate all of those together into one long
>>> String. The original purpose of this was to ensure that we don't lose
>>> any data when we join a node back to a cluster. When this was
>>> developed, though, we took a very strict approach of enforcing that
>>> the node's flow must match the cluster's flow - with only a few
>>> exceptions. For example, the position of a processor on the graph
>>> could be different; we simply inherit the cluster's value. The run
>>> status of a processor can be different; we simply inherit the
>>> cluster's value.
>>>
>>> This fingerprinting approach has its benefit - it forces the user to
>>> be mindful of any differences between the node and cluster. However,
>>> it has several downsides as well. If a node fails to perform some
>>> update, it cannot join back to the cluster until the discrepancy is
>>> addressed.
>>> Additionally, it is difficult to understand just what the discrepancy
>>> is because the best info that we can provide is a segment of the
>>> fingerprint where the flows differ, and this is not very clear. It's
>>> also difficult to understand exactly which flow differences are
>>> relevant and which are not.
>>>
>>> The class that performs the fingerprinting is rather complex, and
>>> updates are rather error-prone because it is easy to forget to update
>>> the fingerprint when a new "feature" is added to a component. Worse
>>> still is that if a component gains a "collection" of objects, it is
>>> easy to forget to sort that collection, which results in incorrect
>>> fingerprinting that prevents a node from joining a cluster when it
>>> should be able to.
>>>
>>> Most importantly, though, the current approach requires manual user
>>> intervention when the flow differs, and almost always the solution
>>> that is suggested/used is to shut down the node, remove the
>>> flow.xml.gz, the users.xml, and authorizations.xml, and then restart.
>>> This will cause the node to inherit the cluster's flow.
>>>
>>> Clearly, this isn't ideal. I'd like to propose a far simpler approach
>>> to determining flow inheritability. Because the main goal of checking
>>> inheritability was to ensure that there is no data loss, I would
>>> propose that we use the same mechanism for inheriting a cluster flow
>>> as we do for updating to a new version of a Versioned Flow. We would
>>> first determine which connections would be removed from the flow if we
>>> inherit the cluster's flow. If there are no connections removed, then
>>> the flow is inheritable. If there are any connections removed, we will
>>> stop each removed connection's source and destination. We will then
>>> check if any connection has any queued data. If so, then we will
>>> restart all components that we stopped and fail.
>>> This is critical because the only way we can lose data when inheriting
>>> a flow is if we remove a connection with data queued. Otherwise, we
>>> determine that inheriting the flow will not cause data loss and
>>> therefore the flow is inheritable.
>>>
>>> This approach will still ensure that we have no data loss. It also
>>> results in a more resilient recovery that requires no human
>>> intervention (unless inheriting the flow would cause data loss - in
>>> that case, I believe human intervention is still warranted. But we
>>> will be able to inform the user of which connection(s) have data and
>>> would be removed so that they can address the concern.) Another added
>>> benefit of this approach is that it would allow automation tools to
>>> provision a NiFi node with a "seed flow" and if it joins a cluster
>>> with a flow, it will simply inherit the cluster's flow instead of
>>> using the seed flow. Currently, in order to do this, the automation
>>> tools would have to determine if a cluster already exists and if so
>>> not provide the seeded flow. I think this may be more important as
>>> users start running more and more on Kubernetes.
>>>
>>> While I believe 1.7.0 will provide some great benefits to our
>>> clustering model, I do think that we can do better with respect to
>>> determining flow inheritance. The proposed inheritance model provides
>>> a mechanism that results in a user experience that more closely aligns
>>> with user expectations in my opinion. It would result in NiFi being
>>> more stable and reliable. However, it is a large enough departure from
>>> how we have been doing things to date that I thought it appropriate to
>>> start a DISCUSS thread to ensure that everyone is on the same page
>>> first.
>>>
>>> Any thoughts?
>>>
>>> Thanks
>>> -Mark
>>>
>>> [1] NIFI-5241
>>> [2] NIFI-950
>>> [3] NIFI-5112
>>> [4] NIFI-5204
>>> [5] NIFI-5208
>>> [6] NIFI-5186
>>> [7] NIFI-5153
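
For illustration only, the inheritability check described in the quoted proposal could be sketched roughly as below. This is Python rather than NiFi's Java, and none of it is NiFi's actual API: Connection, InheritanceResult, check_inheritable, and the stop/start callbacks are all hypothetical names made up for this sketch. It also assumes that stop() reports whether the component was actually running, and that components stay stopped when the check passes, since the thread does not spell out either detail.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set


@dataclass
class Connection:
    """Hypothetical stand-in for a connection in the node's local flow."""
    id: str
    source: str
    destination: str
    queued_count: int = 0


@dataclass
class InheritanceResult:
    inheritable: bool
    # Ids of connections that would be removed while still holding data.
    blocking_connections: List[str] = field(default_factory=list)


def check_inheritable(
    local_connections: Iterable[Connection],
    cluster_connection_ids: Set[str],
    stop: Callable[[str], bool],
    start: Callable[[str], None],
) -> InheritanceResult:
    """Return whether the cluster's flow can be inherited without data loss."""
    # Connections that exist locally but not in the cluster's flow would be
    # removed if the node inherited the cluster's flow.
    removed = [c for c in local_connections if c.id not in cluster_connection_ids]
    if not removed:
        return InheritanceResult(True)

    # Stop each removed connection's source and destination once, remembering
    # which ones were actually running (assumed: stop() returns that flag).
    stopped: List[str] = []
    seen: Set[str] = set()
    for conn in removed:
        for component in (conn.source, conn.destination):
            if component in seen:
                continue
            seen.add(component)
            if stop(component):
                stopped.append(component)

    # Removing a connection with queued data is the data-loss case.
    blocking = [c.id for c in removed if c.queued_count > 0]
    if blocking:
        # Put the node back the way it was and report the offenders.
        for component in stopped:
            start(component)
        return InheritanceResult(False, blocking)
    return InheritanceResult(True)


if __name__ == "__main__":
    running = {"GenerateFlowFile", "LogAttribute"}

    def stop(name: str) -> bool:
        was_running = name in running
        running.discard(name)
        return was_running

    local = [Connection("c1", "GenerateFlowFile", "LogAttribute", queued_count=5)]
    print(check_inheritable(local, set(), stop, running.add))
```

The blocking_connections list is what would let the node tell the user which connection(s) have data and would be removed, as the proposal suggests.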