Great ideas, Mark. Another addition for the ease/convenience of a node
joining a cluster relates to authorizations.xml and users.xml. In the case
of flow.xml.gz, if the file is missing, the node will obtain a copy from
the cluster. The same should be true of the authorizations.xml and
users.xml files, IMO. Currently, the authorizers.xml file is used to
generate these files. If the cluster has since made additional changes to
users or policies (likely, in many cases), then the node is not able to
synchronize and join. Again, synchronizing these files between the
prospective node and the cluster becomes a manual process. It would be nice
to avoid this.

I also want to re-emphasize the work Mark mentioned on performance
enhancements. That topic seems somewhat overtaken by the interesting and
valuable discussion of clustering. However, the performance and
responsiveness of very large flows is an area to continually keep in mind.
I'm very happy to hear there has been some specific work and resulting
improvements in this area!

-Mark

On Thu, Jun 7, 2018 at 10:20 AM, Bryan Bende <bbe...@gmail.com> wrote:

> Using the versioned flow logic seems like a good idea.
>
> Would the authorizer fingerprints still be checked as part of joining
> the cluster?
>
> Currently that is appended to the overall fingerprint to ensure each
> node has the same users/policies, or at least the same config (i.e.
> LDAP). It would be nice if a node could merge the cluster's
> users/groups/policies into itself, as long as it had a subset of what
> the cluster had (assuming a configurable authorizer).
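
To make the subset idea above concrete, here is a minimal sketch. It is only an illustration of the suggested rule, not existing NiFi behavior; the function name `merge_if_subset` and the plain string entries are hypothetical.

```python
def merge_if_subset(node_entries, cluster_entries):
    """Return the cluster's entries if the node holds a subset of them, else None.

    Hypothetical helper: entries stand in for users, groups, or policies.
    """
    node_set = set(node_entries)
    cluster_set = set(cluster_entries)
    if not node_set <= cluster_set:
        # The node has something the cluster lacks, so a silent merge is unsafe.
        return None
    return cluster_set


cluster_users = {"alice", "bob", "carol"}
merged = merge_if_subset({"alice", "bob"}, cluster_users)
rejected = merge_if_subset({"alice", "dave"}, cluster_users)
```

With this rule a freshly provisioned node (few or no users) always merges, while a node carrying entries unknown to the cluster is still flagged for a human to look at.
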
>
> Also, I believe the list of bundles on the node is compared to the
> bundles in the cluster to ensure consistency. I don't think this is
> technically part of the fingerprint, but just curious if you envision
> that improving/changing at all.
>
>
> On Thu, Jun 7, 2018 at 10:14 AM, Otto Fowler <ottobackwa...@gmail.com>
> wrote:
> > Great write up.
> >
> > While I am not an expert on clustering, it would seem that having one
> > method of comparing flows, perhaps with different strategies within
> > it, would be more maintainable as well.
> >
> > Are you proposing that there is a unified flow comparison
> > capability/implementation/service that is shared between
> > clustering and versioned use cases?
> >
> >
> > On June 7, 2018 at 09:46:27, Mark Payne (marka...@hotmail.com) wrote:
> >
> > Hi all,
> >
> > Over the past couple of months, I have been doing a lot of testing
> > with large-scale flows and talking to others who are using
> > large-scale flows in production. ("Large-scale" flows in this case
> > means several thousand to tens of thousands of Processors.) While
> > NiFi does a really good job of handling the data flow, one area that
> > needs some improvement is NiFi's clustering. So for the 1.7.0
> > version of NiFi, we have spent quite a bit of time focusing on
> > improving the clustering mechanism to hold up to more demanding
> > flows. The work can be broken down into three focus areas: UI
> > sluggishness [1] [2] [3], Cluster Stability [4] [2] [5] [3], and
> > User Experience [5] [6] [7]. (Note that many of these JIRAs are
> > listed under more than one category.)
> >
> > With the above-mentioned JIRAs, I think we have significantly
> > improved the stability and user experience around clustering. Local
> > testing shows that in some cases, requests that previously took 15+
> > seconds (such as instantiating a template with several thousand
> > processors) now take around 1 second. This provides a better user
> > experience and also improves our cluster stability because it
> > prevents nodes from dropping out of the cluster due to timeouts.
> >
> > There is, however, another important area that I believe is ripe for
> > improvement in our current model. That is the mechanism used when a
> > node joins a cluster, in order to determine if the cluster's flow can
> > be inherited by the node. While the above work will improve stability
> > considerably, we need to be very mindful that failures will still
> > occur. And we need to be good at recovering from those.
> >
> > The way that we do this currently is that we download the flow from
> > the cluster, and then we "fingerprint" the flow. We then
> > "fingerprint" our own flow and see if they match. What we mean by
> > fingerprinting is that we go through the flow and pick out which
> > elements should make a flow uninheritable and concatenate all of
> > those together into one long String. The original purpose of this
> > was to ensure that we don't lose any data when we join a node back to
> > a cluster. When this was developed, though, we took a very strict
> > approach of enforcing that the node's flow must match the cluster's
> > flow - with only a few exceptions. For example, the position of a
> > processor on the graph could be different; we simply inherit the
> > cluster's value. The run status of a processor can be different; we
> > simply inherit the cluster's value.
> >
> > This fingerprinting approach has its benefit - it forces the user to
> > be mindful of any differences between the node and cluster. However,
> > it has several downsides as well. If a node fails to perform some
> > update, it cannot join back to the cluster until the discrepancy is
> > addressed. Additionally, it is difficult to understand just what the
> > discrepancy is because the best info that we can provide is a segment
> > of the fingerprint where the flows differ, and this is not very
> > clear. It's also difficult to understand exactly which flow
> > differences are relevant and which are not.
> >
> > The class that performs the fingerprinting is rather complex, and
> > updates are rather error-prone because it is easy to forget to update
> > the fingerprint when a new "feature" is added to a component. Worse
> > still is that if a component gains a "collection" of objects, it is
> > easy to forget to sort that collection, which results in incorrect
> > fingerprinting that prevents a node from joining a cluster when it
> > should be able to.
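
The sorting pitfall described above can be shown with a small sketch. This is not NiFi's fingerprinting class; the dictionary layout, the field names (`id`, `type`, `properties`, `relationships`), and the `sort_collections` switch are all assumptions made for illustration.

```python
def fingerprint(component, sort_collections=True):
    """Concatenate the inheritance-relevant fields of a component into one string."""
    props = component["properties"].items()
    rels = component["relationships"]
    if sort_collections:
        # Sorting makes the result independent of iteration order.
        props = sorted(props)
        rels = sorted(rels)
    parts = [component["id"], component["type"]]
    parts += [f"{key}={value}" for key, value in props]
    parts += list(rels)
    return "|".join(parts)


# Two logically identical components whose collections are ordered differently.
node = {
    "id": "p1",
    "type": "ExampleProcessor",
    "properties": {"a": "1", "b": "2"},
    "relationships": ["success", "failure"],
}
cluster = {
    "id": "p1",
    "type": "ExampleProcessor",
    "properties": {"b": "2", "a": "1"},
    "relationships": ["failure", "success"],
}

same_when_sorted = fingerprint(node) == fingerprint(cluster)
same_when_unsorted = fingerprint(node, False) == fingerprint(cluster, False)
```

With sorting the two fingerprints match; without it they differ even though the flows are equivalent, which is the kind of false mismatch that would keep a node out of the cluster.
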
> >
> > Most importantly, though, the current approach requires manual user
> > intervention when the flow differs, and almost always the solution
> > that is suggested/used is to shut down the node, remove the
> > flow.xml.gz, the users.xml, and authorizations.xml, and then restart.
> > This will cause the node to inherit the cluster's flow.
> >
> > Clearly, this isn't ideal. I'd like to propose a far simpler approach
> > to determining flow inheritability. Because the main goal of checking
> > inheritability was to ensure that there is no data loss, I would
> > propose that we use the same mechanism for inheriting a cluster flow
> > as we do for updating to a new version of a Versioned Flow. We would
> > first determine which connections would be removed from the flow if
> > we inherit the cluster's flow. If there are no connections removed,
> > then the flow is inheritable. If there are any connections removed,
> > we will stop each removed connection's source and destination. We
> > will then check if any removed connection has any queued data. If so,
> > then we will restart all components that we stopped and fail. This is
> > critical because the only way we can lose data when inheriting a flow
> > is if we remove a connection with data queued. Otherwise, we
> > determine that inheriting the flow will not cause data loss and
> > therefore the flow is inheritable.
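
The steps in the proposal above can be sketched roughly as follows. This is only a model of the decision logic under stated assumptions, not NiFi code: `Connection`, `InheritCheck`, and `check_inheritable` are hypothetical names, and stopping or restarting components is represented by returning their identifiers rather than by touching a running flow.

```python
from typing import List, NamedTuple, Set


class Connection(NamedTuple):
    id: str
    source: str
    destination: str
    queued: int  # number of FlowFiles currently queued


class InheritCheck(NamedTuple):
    inheritable: bool
    blocking: List[str]   # removed connections that still hold data
    restarted: List[str]  # components stopped for the check, then restarted


def check_inheritable(local: List[Connection], cluster_ids: Set[str]) -> InheritCheck:
    # Connections that exist locally but not in the cluster's flow would be removed.
    removed = [c for c in local if c.id not in cluster_ids]
    if not removed:
        return InheritCheck(True, [], [])
    # Stop the source and destination of each removed connection so queues stay stable.
    stopped = sorted({c.source for c in removed} | {c.destination for c in removed})
    blocking = [c.id for c in removed if c.queued > 0]
    if blocking:
        # Data would be lost: restart what was stopped and fail, naming the connections.
        return InheritCheck(False, blocking, stopped)
    return InheritCheck(True, [], [])


local = [
    Connection("c1", "A", "B", queued=0),
    Connection("c2", "B", "C", queued=5),
]
unchanged = check_inheritable(local, {"c1", "c2"})
drops_empty = check_inheritable(local, {"c2"})
drops_queued = check_inheritable(local, {"c1"})
```

Returning the blocking connection ids keeps the failure case actionable: the user can be told exactly which queues need to be drained before the node can join.
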
> >
> > This approach will still ensure that we have no data loss. It also
> > results in a more resilient recovery that requires no human
> > intervention (unless inheriting the flow would cause data loss - in
> > that case, I believe human intervention is still warranted. But we
> > will be able to inform the user of which connection(s) have data and
> > would be removed so that they can address the concern.) Another added
> > benefit of this approach is that it would allow automation tools to
> > provision a NiFi node with a "seed flow" and if it joins a cluster
> > with a flow, it will simply inherit the cluster's flow instead of
> > using the seed flow. Currently, in order to do this, the automation
> > tools would have to determine if a cluster already exists and if so
> > not provide the seeded flow. I think this may be more important as
> > users start running more and more on Kubernetes.
> >
> > While I believe 1.7.0 will provide some great benefits to our
> > clustering model, I do think that we can do better with respect to
> > determining flow inheritance. The proposed inheritance model provides
> > a mechanism that results in a user experience that more closely
> > aligns with user expectations in my opinion. It would result in NiFi
> > being more stable and reliable. However, it is a large enough
> > departure from how we have been doing things to date that I thought
> > it appropriate to start a DISCUSS thread to ensure that everyone is
> > on the same page first.
> >
> > Any thoughts?
> >
> > Thanks
> > -Mark
> >
> >
> > [1] NIFI-5241
> > [2] NIFI-950
> > [3] NIFI-5112
> > [4] NIFI-5204
> > [5] NIFI-5208
> > [6] NIFI-5186
> > [7] NIFI-5153
>
