Hi Mark,

Thanks for the detailed explanations!
The suggested approach makes total sense to me and it'll greatly improve
the user experience as I often see a node unable to join the cluster
because of a discrepancy in some component configuration... Right now, the
only solution is to do as you said: remove the files or copy the "good"
flow.xml.gz from one cluster node to the disconnected node.

Pierre

2018-06-07 15:46 GMT+02:00 Mark Payne <marka...@hotmail.com>:

> Hi all,
>
> Over the past couple of months, I have been doing a lot of testing with
> large scale flows and
> talking to others who are using large scale flows in production. ("Large
> scale" flows in this case
> means several thousand to tens of thousands of Processors). While NiFi
> does a really good job of
> handling the data flow, one area that needs some improvement is around
> NiFi's clustering.
> So for the 1.7.0 version of NiFi, we have spent quite a bit of time
> focusing improving the
> clustering mechanism to hold up to more demanding flows. The focus really
> can be broken down into
> three focus areas: UI sluggishness [1] [2] [3], Cluster Stability [4] [2]
> [5] [3],
> and User Experience [5] [6] [7] (note that many of these JIRA's are listed
> under more than 1
> category.)
>
> With the above-mentioned JIRA's, I think we have significantly improved
> the stability and
> user experience around clustering. Local testing shows that in some cases,
> requests that previously
> took 15+ seconds (such as instantiating a template with several thousand
> processors) now take
> around 1 second. This provides a better user experience and also improves
> our cluster stability
> because it prevents nodes from dropping out the cluster due to timeouts.
>
> There is, however, another important area that I believe is ripe for
> improvement in our current
> model. That is the mechanism used when a node joins a cluster, in order to
> determine if the
> cluster's flow can be inherited by the node. While the above work will
> improve stability
> considerably, we need to be very mindful that failures will still occur.
> And we need to be good at
> recovering from those.
>
> The way that we do this currently is that we download the flow from the
> cluster, and then we
> "fingerprint" the flow. We then "fingerprint" our own flow and see if they
> match. What we mean by
> fingerprinting is that we go through the flow and pick out which elements
> should make a flow
> uninheritable and concatenate all of those together into one long String.
> The original purpose of
> this was to ensure that we don't lose any data when we join a node back to
> a cluster. When this was
> developed, though, we took a very strict approach of enforcing that the
> node's flow must match
> the cluster's flow - with only a few exceptions. For example, the position
> of a processor on the
> graph could be different; we simply inherit the cluster's value. The run
> status of a processor can
> be different; we simply inherit the cluster's value.
>
> This fingerprinting approach has its benefit - it forces the user to be
> mindful of any differences
> between the node and cluster. However, it has several downsides as well.
> If a node fails to perform
> some update, it cannot join back to the cluster until the discrepancy is
> addressed. Additionally, it
> is difficult to understand just what the discrepancy is because the best
> info that we can provide is
> a segment of the fingerprint where the flows differ, and this is not very
> clear. It's also difficult
> to understand exactly which flow differences are relevant and which are
> not.
>
> The class that performs the fingerprinting is rather complex, and updates
> are rather error-prone
> because it is easy to forget to update the fingerprint when a new
> "feature" is added to a component.
> Worse still is that if a component gains a "collection" of objects, it is
> easy to forget to sort
> that collection, which results in incorrect fingerprinting that prevents a
> node from joining a
> cluster when it should be able to.
>
> Most importantly, though, the current approach requires manual user
> intervention when the flow
> differs, and almost always the solution that is suggested/used is to shut
> down the node, remove the
> flow.xml.gz, the users.xml, and authorizations.xml, and then restart. This
> will cause the
> node to inherit the cluster's flow.
>
> Clearly, this isn't ideal. I'd like to propose a far simpler approach to
> determining flow
> inheritability. Because the main goal of checking inheritability was to
> ensure that there is no
> data loss, I would propose that we use the same mechanism for inheriting a
> cluster flow as we do for
> updating to a new version of a Versioned Flow. We would first determine
> which connections would be
> removed from the flow if we inherit the cluster's flow. If there are no
> connections removed, then
> the flow is inheritable. If there are any connections removed, we will
> stop each removed
> connection's source and destination. We will then check if any connection
> has any queued data.
> If so, then we will restart all components that we started and fail. This
> is critical because the
> only way we can lose data when inheriting a flow is if we remove a
> connection with data queued.
> Otherwise, we determine that inheriting the flow will not cause data loss
> and therefore the flow is
> inheritable.
>
> This approach will still ensure that we have no data loss. It also results
> in a more resilient
> recovery that requires no human intervention (unless inheriting the flow
> would cause data loss -
> in that case, I believe human intervention is still warranted. But we will
> be able to inform the
> user of which connection(s) have data and would be removed so that they
> can address the concern.)
> Another added benefit of this approach is that it would allow automation
> tools to provision a node
> NiFi node with a "seed flow" and if it joins a cluster with a flow, it
> will simply inherit the
> cluster's flow instead of using the seed flow. Currently, in order to do
> this, the automation tools
> would have to determine if a cluster already exists and if so not provide
> the seeded flow. I think
> this may be more important as users start running more and more on
> Kubernetes.
>
> While I believe 1.7.0 will provide some great benefits to our clustering
> model, I do think that
> we can do better with respect to determining flow inheritance. The
> proposed inheritance model
> provides a mechanism that results in a user experience that more closely
> aligns with user
> expectations in my opinion. It would result in NiFi being more stable and
> reliable. However, it is
> a large enough departure from how we have been doing things to-date that I
> thought it appropriate
> to start a DISCUSS thread to ensure that everyone is on the same page
> first.
>
> Any thoughts?
>
> Thanks
> -Mark
>
>
> [1] NIFI-5241
> [2] NIFI-950
> [3] NIFI-5112
> [4] NIFI-5204
> [5] NIFI-5208
> [6] NIFI-5186
> [7] NIFI-5153
>
>

Reply via email to