Hi Mark,

Thanks for the detailed explanations! The suggested approach makes total sense to me, and it'll greatly improve the user experience, as I often see a node unable to join the cluster because of a discrepancy in some component configuration. Right now, the only solution is to do as you said: remove the files or copy the "good" flow.xml.gz from one cluster node to the disconnected node.
Pierre

2018-06-07 15:46 GMT+02:00 Mark Payne <marka...@hotmail.com>:

> Hi all,
>
> Over the past couple of months, I have been doing a lot of testing with
> large scale flows and talking to others who are using large scale flows in
> production. ("Large scale" flows in this case means several thousand to
> tens of thousands of Processors). While NiFi does a really good job of
> handling the data flow, one area that needs some improvement is around
> NiFi's clustering. So for the 1.7.0 version of NiFi, we have spent quite a
> bit of time focusing on improving the clustering mechanism to hold up to
> more demanding flows. The focus really can be broken down into three focus
> areas: UI sluggishness [1] [2] [3], Cluster Stability [4] [2] [5] [3], and
> User Experience [5] [6] [7] (note that many of these JIRAs are listed
> under more than 1 category.)
>
> With the above-mentioned JIRAs, I think we have significantly improved the
> stability and user experience around clustering. Local testing shows that
> in some cases, requests that previously took 15+ seconds (such as
> instantiating a template with several thousand processors) now take around
> 1 second. This provides a better user experience and also improves our
> cluster stability because it prevents nodes from dropping out of the
> cluster due to timeouts.
>
> There is, however, another important area that I believe is ripe for
> improvement in our current model. That is the mechanism used when a node
> joins a cluster, in order to determine if the cluster's flow can be
> inherited by the node. While the above work will improve stability
> considerably, we need to be very mindful that failures will still occur.
> And we need to be good at recovering from those.
>
> The way that we do this currently is that we download the flow from the
> cluster, and then we "fingerprint" the flow. We then "fingerprint" our own
> flow and see if they match.
> What we mean by fingerprinting is that we go through the flow and pick out
> which elements should make a flow uninheritable and concatenate all of
> those together into one long String. The original purpose of this was to
> ensure that we don't lose any data when we join a node back to a cluster.
> When this was developed, though, we took a very strict approach of
> enforcing that the node's flow must match the cluster's flow - with only a
> few exceptions. For example, the position of a processor on the graph
> could be different; we simply inherit the cluster's value. The run status
> of a processor can be different; we simply inherit the cluster's value.
>
> This fingerprinting approach has its benefit - it forces the user to be
> mindful of any differences between the node and cluster. However, it has
> several downsides as well. If a node fails to perform some update, it
> cannot join back to the cluster until the discrepancy is addressed.
> Additionally, it is difficult to understand just what the discrepancy is
> because the best info that we can provide is a segment of the fingerprint
> where the flows differ, and this is not very clear. It's also difficult to
> understand exactly which flow differences are relevant and which are not.
>
> The class that performs the fingerprinting is rather complex, and updates
> are rather error-prone because it is easy to forget to update the
> fingerprint when a new "feature" is added to a component. Worse still is
> that if a component gains a "collection" of objects, it is easy to forget
> to sort that collection, which results in incorrect fingerprinting that
> prevents a node from joining a cluster when it should be able to.
>
> Most importantly, though, the current approach requires manual user
> intervention when the flow differs, and almost always the solution that is
> suggested/used is to shut down the node, remove the flow.xml.gz, the
> users.xml, and authorizations.xml, and then restart. This will cause the
> node to inherit the cluster's flow.
>
> Clearly, this isn't ideal. I'd like to propose a far simpler approach to
> determining flow inheritability. Because the main goal of checking
> inheritability was to ensure that there is no data loss, I would propose
> that we use the same mechanism for inheriting a cluster flow as we do for
> updating to a new version of a Versioned Flow. We would first determine
> which connections would be removed from the flow if we inherit the
> cluster's flow. If there are no connections removed, then the flow is
> inheritable. If there are any connections removed, we will stop each
> removed connection's source and destination. We will then check if any
> connection has any queued data. If so, then we will restart all components
> that we started and fail. This is critical because the only way we can
> lose data when inheriting a flow is if we remove a connection with data
> queued. Otherwise, we determine that inheriting the flow will not cause
> data loss and therefore the flow is inheritable.
>
> This approach will still ensure that we have no data loss. It also results
> in a more resilient recovery that requires no human intervention (unless
> inheriting the flow would cause data loss - in that case, I believe human
> intervention is still warranted. But we will be able to inform the user of
> which connection(s) have data and would be removed so that they can
> address the concern.)
> Another added benefit of this approach is that it would allow automation
> tools to provision a NiFi node with a "seed flow" and if it joins a
> cluster with a flow, it will simply inherit the cluster's flow instead of
> using the seed flow. Currently, in order to do this, the automation tools
> would have to determine if a cluster already exists and if so not provide
> the seeded flow. I think this may be more important as users start running
> more and more on Kubernetes.
>
> While I believe 1.7.0 will provide some great benefits to our clustering
> model, I do think that we can do better with respect to determining flow
> inheritance. The proposed inheritance model provides a mechanism that
> results in a user experience that more closely aligns with user
> expectations in my opinion. It would result in NiFi being more stable and
> reliable. However, it is a large enough departure from how we have been
> doing things to date that I thought it appropriate to start a DISCUSS
> thread to ensure that everyone is on the same page first.
>
> Any thoughts?
>
> Thanks
> -Mark
>
> [1] NIFI-5241
> [2] NIFI-950
> [3] NIFI-5112
> [4] NIFI-5204
> [5] NIFI-5208
> [6] NIFI-5186
> [7] NIFI-5153
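
For illustration, the inheritability check proposed in the quoted message can be sketched roughly as below. This is only a minimal sketch of the decision logic, not NiFi code: `Connection` and `check_inheritable` are hypothetical names, the queue size is modelled as a plain integer, and the stop/restart of each removed connection's source and destination is only noted in a comment. It also assumes that "any connection has any queued data" refers to the connections that would be removed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """Hypothetical stand-in for a connection in the node's local flow."""
    id: str
    source: str
    destination: str
    queued: int = 0  # number of FlowFiles queued locally (assumed model)


def check_inheritable(local_connections, cluster_connection_ids):
    """Return (inheritable, blocking) for a node about to join a cluster.

    local_connections: connections in the node's current flow.
    cluster_connection_ids: ids of the connections in the cluster's flow.
    blocking: removed connections that still hold queued data.
    """
    # Connections that would disappear if the cluster's flow were inherited.
    removed = [c for c in local_connections
               if c.id not in cluster_connection_ids]
    if not removed:
        return True, []

    # In the proposal, the source and destination of each removed connection
    # are stopped before the queues are inspected, and restarted if the check
    # fails. That lifecycle handling is omitted from this sketch.
    blocking = [c for c in removed if c.queued > 0]
    return len(blocking) == 0, blocking


if __name__ == "__main__":
    local = [
        Connection("a", "GenerateData", "UpdateData", queued=0),
        Connection("b", "UpdateData", "PutData", queued=5),
    ]
    ok, blocking = check_inheritable(local, {"a"})
    print(ok, [c.id for c in blocking])
```

Returning the blocking connections alongside the verdict mirrors the point made in the proposal that the user can be told which connection(s) hold data and would be removed.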