Great write up.

While I am not an expert on clustering, it would seem that having one
method of comparing flows, perhaps
with different strategies within that would be more maintainable as well.

Are you proposing that there is a unified flow comparison
capability/implementation/service that is shared between
clustering and versioned use cases?


On June 7, 2018 at 09:46:27, Mark Payne (marka...@hotmail.com) wrote:

Hi all,

Over the past couple of months, I have been doing a lot of testing with
large scale flows and
talking to others who are using large scale flows in production. ("Large
scale" flows in this case
means several thousand to tens of thousands of Processors). While NiFi does
a really good job of
handling the data flow, one area that needs some improvement is around
NiFi's clustering.
So for the 1.7.0 version of NiFi, we have spent quite a bit of time
focusing improving the
clustering mechanism to hold up to more demanding flows. The focus really
can be broken down into
three focus areas: UI sluggishness [1] [2] [3], Cluster Stability [4] [2]
[5] [3],
and User Experience [5] [6] [7] (note that many of these JIRA's are listed
under more than 1
category.)

With the above-mentioned JIRA's, I think we have significantly improved the
stability and
user experience around clustering. Local testing shows that in some cases,
requests that previously
took 15+ seconds (such as instantiating a template with several thousand
processors) now take
around 1 second. This provides a better user experience and also improves
our cluster stability
because it prevents nodes from dropping out the cluster due to timeouts.

There is, however, another important area that I believe is ripe for
improvement in our current
model. That is the mechanism used when a node joins a cluster, in order to
determine if the
cluster's flow can be inherited by the node. While the above work will
improve stability
considerably, we need to be very mindful that failures will still occur.
And we need to be good at
recovering from those.

The way that we do this currently is that we download the flow from the
cluster, and then we
"fingerprint" the flow. We then "fingerprint" our own flow and see if they
match. What we mean by
fingerprinting is that we go through the flow and pick out which elements
should make a flow
uninheritable and concatenate all of those together into one long String.
The original purpose of
this was to ensure that we don't lose any data when we join a node back to
a cluster. When this was
developed, though, we took a very strict approach of enforcing that the
node's flow must match
the cluster's flow - with only a few exceptions. For example, the position
of a processor on the
graph could be different; we simply inherit the cluster's value. The run
status of a processor can
be different; we simply inherit the cluster's value.

This fingerprinting approach has its benefit - it forces the user to be
mindful of any differences
between the node and cluster. However, it has several downsides as well. If
a node fails to perform
some update, it cannot join back to the cluster until the discrepancy is
addressed. Additionally, it
is difficult to understand just what the discrepancy is because the best
info that we can provide is
a segment of the fingerprint where the flows differ, and this is not very
clear. It's also difficult
to understand exactly which flow differences are relevant and which are
not.

The class that performs the fingerprinting is rather complex, and updates
are rather error-prone
because it is easy to forget to update the fingerprint when a new "feature"
is added to a component.
Worse still is that if a component gains a "collection" of objects, it is
easy to forget to sort
that collection, which results in incorrect fingerprinting that prevents a
node from joining a
cluster when it should be able to.

Most importantly, though, the current approach requires manual user
intervention when the flow
differs, and almost always the solution that is suggested/used is to shut
down the node, remove the
flow.xml.gz, the users.xml, and authorizations.xml, and then restart. This
will cause the
node to inherit the cluster's flow.

Clearly, this isn't ideal. I'd like to propose a far simpler approach to
determining flow
inheritability. Because the main goal of checking inheritability was to
ensure that there is no
data loss, I would propose that we use the same mechanism for inheriting a
cluster flow as we do for
updating to a new version of a Versioned Flow. We would first determine
which connections would be
removed from the flow if we inherit the cluster's flow. If there are no
connections removed, then
the flow is inheritable. If there are any connections removed, we will stop
each removed
connection's source and destination. We will then check if any connection
has any queued data.
If so, then we will restart all components that we started and fail. This
is critical because the
only way we can lose data when inheriting a flow is if we remove a
connection with data queued.
Otherwise, we determine that inheriting the flow will not cause data loss
and therefore the flow is
inheritable.

This approach will still ensure that we have no data loss. It also results
in a more resilient
recovery that requires no human intervention (unless inheriting the flow
would cause data loss -
in that case, I believe human intervention is still warranted. But we will
be able to inform the
user of which connection(s) have data and would be removed so that they can
address the concern.)
Another added benefit of this approach is that it would allow automation
tools to provision a node
NiFi node with a "seed flow" and if it joins a cluster with a flow, it will
simply inherit the
cluster's flow instead of using the seed flow. Currently, in order to do
this, the automation tools
would have to determine if a cluster already exists and if so not provide
the seeded flow. I think
this may be more important as users start running more and more on
Kubernetes.

While I believe 1.7.0 will provide some great benefits to our clustering
model, I do think that
we can do better with respect to determining flow inheritance. The proposed
inheritance model
provides a mechanism that results in a user experience that more closely
aligns with user
expectations in my opinion. It would result in NiFi being more stable and
reliable. However, it is
a large enough departure from how we have been doing things to-date that I
thought it appropriate
to start a DISCUSS thread to ensure that everyone is on the same page
first.

Any thoughts?

Thanks
-Mark


[1] NIFI-5241
[2] NIFI-950
[3] NIFI-5112
[4] NIFI-5204
[5] NIFI-5208
[6] NIFI-5186
[7] NIFI-5153

Reply via email to