Re: [DISCUSS] Change of Cluster Flow Inheritance

Mark Payne Thu, 07 Jun 2018 08:13:11 -0700

Thanks, all, for the great feedback so far. I did leave out any mention of
how I would envision handling the authorizations.xml and users.xml in the
new approach, but yes, as has been suggested, I do believe that it will be
important to also inherit those from the cluster.


Bryan also brought up a good point regarding handling of bundle
information. My thought, at present, would be to leave that logic as-is for
now. Unless you have any suggested improvements that could be made?



> On Jun 7, 2018, at 10:48 AM, Joe Witt <joe.w...@gmail.com> wrote:
> 
> Mark
> 
> I definitely think it is time to move on from the fingerprinting
> model.  I recall the conversations long ago that led us down this path
> and ultimately the thing that mattered most was ensuring data loss
> cases were prevented.  This still addresses that, reduces a ton of
> code, and simplifies the experience.  Ultimately, we want to ensure a
> couple things:
> 1) When a node is in a weird state it should be able to be
> disconnected from the cluster and debugged if needed.
> 2) If a node is meant to be part of a cluster as per its configuration
> then all things should help us get it back to that state and as
> automated as possible.  The only prevention should be avoiding
> unintended data loss.
> 
> As Bryan and others have noted, there are other files, such as
> authorizations.  They should be inherited from the cluster consensus if
> possible.  Following the logic of #2 above, the user makes it clear during
> initial node setup that the node ultimately belongs in a cluster.
> 
> All of these improvements are making a huge difference so thanks for
> all the efforts.  I've got a flow now with 100,000+ processors in it
> and the responsiveness is spot on!
> 
> Thanks
> Joe
> 
> On Thu, Jun 7, 2018 at 10:20 AM, Bryan Bende <bbe...@gmail.com> wrote:
>> Using the versioned flow logic seems like a good idea.
>> 
>> Would the authorizer fingerprints still be checked as part of joining
>> the cluster?
>> 
>> Currently that is appended to the overall fingerprint to ensure each
>> node has the same users/policies, or at least same config (i.e. LDAP).
>> Would be nice if a node could merge the cluster's users/groups/policies
>> into itself, as long as it had a subset of what the cluster had
>> (assuming a configurable authorizer).
>> 
>> Also, I believe the list of bundles on the node is compared to the
>> bundles in the cluster to ensure consistency. I don't think this is
>> technically part of the fingerprint, but just curious if you envision
>> that improving/changing at all.
>> 
>> 
>> On Thu, Jun 7, 2018 at 10:14 AM, Otto Fowler <ottobackwa...@gmail.com> wrote:
>>> Great write up.
>>> 
>>> While I am not an expert on clustering, it would seem that having one
>>> method of comparing flows, perhaps with different strategies within
>>> that, would be more maintainable as well.
>>> 
>>> Are you proposing that there is a unified flow comparison
>>> capability/implementation/service that is shared between clustering and
>>> versioned use cases?
>>> 
>>> 
>>> On June 7, 2018 at 09:46:27, Mark Payne (marka...@hotmail.com) wrote:
>>> 
>>> Hi all,
>>> 
>>> Over the past couple of months, I have been doing a lot of testing with
>>> large-scale flows and talking to others who are using large-scale flows
>>> in production. ("Large-scale" flows in this case means several thousand
>>> to tens of thousands of Processors.) While NiFi does a really good job
>>> of handling the data flow, one area that needs some improvement is
>>> NiFi's clustering. So for the 1.7.0 version of NiFi, we have spent quite
>>> a bit of time focusing on improving the clustering mechanism to hold up
>>> to more demanding flows. The work really can be broken down into three
>>> areas: UI sluggishness [1] [2] [3], Cluster Stability [4] [2] [5] [3],
>>> and User Experience [5] [6] [7] (note that many of these JIRAs are
>>> listed under more than one category).
>>> 
>>> With the above-mentioned JIRAs, I think we have significantly improved
>>> the stability and user experience around clustering. Local testing
>>> shows that in some cases, requests that previously took 15+ seconds
>>> (such as instantiating a template with several thousand processors) now
>>> take around 1 second. This provides a better user experience and also
>>> improves our cluster stability because it prevents nodes from dropping
>>> out of the cluster due to timeouts.
>>> 
>>> There is, however, another important area that I believe is ripe for
>>> improvement in our current model. That is the mechanism used when a
>>> node joins a cluster, in order to determine if the cluster's flow can be
>>> inherited by the node. While the above work will improve stability
>>> considerably, we need to be very mindful that failures will still occur.
>>> And we need to be good at recovering from those.
>>> 
>>> The way that we do this currently is that we download the flow from the
>>> cluster, and then we "fingerprint" the flow. We then "fingerprint" our
>>> own flow and see if they match. What we mean by fingerprinting is that
>>> we go through the flow and pick out which elements should make a flow
>>> uninheritable and concatenate all of those together into one long
>>> String. The original purpose of this was to ensure that we don't lose
>>> any data when we join a node back to a cluster. When this was developed,
>>> though, we took a very strict approach of enforcing that the node's flow
>>> must match the cluster's flow - with only a few exceptions. For example,
>>> the position of a processor on the graph could be different; we simply
>>> inherit the cluster's value. The run status of a processor can be
>>> different; we simply inherit the cluster's value.
>>> 
>>> This fingerprinting approach has its benefit - it forces the user to be
>>> mindful of any differences between the node and cluster. However, it
>>> has several downsides as well. If a node fails to perform some update,
>>> it cannot join back to the cluster until the discrepancy is addressed.
>>> Additionally, it is difficult to understand just what the discrepancy
>>> is because the best info that we can provide is a segment of the
>>> fingerprint where the flows differ, and this is not very clear. It's
>>> also difficult to understand exactly which flow differences are relevant
>>> and which are not.
>>> 
>>> The class that performs the fingerprinting is rather complex, and
>>> updates are rather error-prone because it is easy to forget to update
>>> the fingerprint when a new "feature" is added to a component. Worse
>>> still is that if a component gains a "collection" of objects, it is easy
>>> to forget to sort that collection, which results in incorrect
>>> fingerprinting that prevents a node from joining a cluster when it
>>> should be able to.
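
[Editorial sketch, not part of the original message.] The sorting pitfall
described above can be illustrated with a toy fingerprint function. This is
a minimal sketch, not NiFi's actual fingerprinting class: the dictionary
layout and the field names ("id", "type", "auto_terminated", "position") are
assumptions made up for the illustration.

```python
def fingerprint(components, sort_collections=True):
    """Concatenate the inheritance-relevant fields into one long string.

    Hypothetical model: "position" is deliberately left out, because the
    node simply inherits the cluster's value for it.
    """
    parts = []
    for comp in sorted(components, key=lambda c: c["id"]):
        parts.append(comp["id"])
        parts.append(comp["type"])
        relationships = comp["auto_terminated"]
        if sort_collections:
            # Forgetting this sort makes the fingerprint order-sensitive.
            relationships = sorted(relationships)
        parts.extend(relationships)
    return "".join(parts)


# Same logical flow; only collection order and position differ.
node_flow = [{"id": "p1", "type": "GenerateFlowFile",
              "auto_terminated": ["failure", "success"], "position": (0, 0)}]
cluster_flow = [{"id": "p1", "type": "GenerateFlowFile",
                 "auto_terminated": ["success", "failure"], "position": (100, 50)}]

same_when_sorted = fingerprint(node_flow) == fingerprint(cluster_flow)
same_when_unsorted = (fingerprint(node_flow, sort_collections=False)
                      == fingerprint(cluster_flow, sort_collections=False))
```

With sorting, the two flows produce the same fingerprint; without it, they
differ even though nothing relevant to data loss has changed, which is the
kind of false mismatch that would keep a node out of the cluster.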
>>> 
>>> Most importantly, though, the current approach requires manual user
>>> intervention when the flow differs, and almost always the solution that
>>> is suggested/used is to shut down the node, remove the flow.xml.gz, the
>>> users.xml, and authorizations.xml, and then restart. This will cause the
>>> node to inherit the cluster's flow.
>>> 
>>> Clearly, this isn't ideal. I'd like to propose a far simpler approach to
>>> determining flow inheritability. Because the main goal of checking
>>> inheritability was to ensure that there is no data loss, I would propose
>>> that we use the same mechanism for inheriting a cluster flow as we do
>>> for updating to a new version of a Versioned Flow. We would first
>>> determine which connections would be removed from the flow if we inherit
>>> the cluster's flow. If there are no connections removed, then the flow
>>> is inheritable. If there are any connections removed, we will stop each
>>> removed connection's source and destination. We will then check if any
>>> of those connections has any queued data. If so, then we will restart
>>> all components that we stopped and fail. This is critical because the
>>> only way we can lose data when inheriting a flow is if we remove a
>>> connection with data queued. Otherwise, we determine that inheriting the
>>> flow will not cause data loss and therefore the flow is inheritable.
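
[Editorial sketch, not part of the original message.] The proposed check
can be sketched roughly as follows. This is a simplified model under stated
assumptions, not NiFi code: the Connection class, the check_inheritable
function, and the "running" set of component ids are all hypothetical names,
and the queue check is assumed to apply only to the connections that would
be removed.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    id: str
    source: str
    destination: str
    queued: int = 0  # number of queued items (hypothetical field)


def check_inheritable(local_connections, cluster_connection_ids, running):
    """Return (inheritable, ids of connections that block inheritance).

    `running` is a set of running component ids; it is mutated in place
    to model stopping and restarting components.
    """
    removed = [c for c in local_connections
               if c.id not in cluster_connection_ids]
    if not removed:
        return True, []  # nothing would be removed, so no data can be lost

    # Stop the source and destination of every connection to be removed.
    stopped = set()
    for conn in removed:
        for comp in (conn.source, conn.destination):
            if comp in running:
                running.discard(comp)
                stopped.add(comp)

    blocking = [c.id for c in removed if c.queued > 0]
    if blocking:
        running.update(stopped)  # restart what was stopped, then fail
        return False, blocking
    return True, []


local = [Connection("c1", "A", "B"), Connection("c2", "B", "C", queued=5)]
running = {"A", "B", "C"}
ok, blocking = check_inheritable(local, {"c1"}, running)
```

In this example the cluster's flow no longer contains "c2", which still
holds queued data, so the check fails, reports "c2" as the blocking
connection, and leaves every component running as it was before.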
>>> 
>>> This approach will still ensure that we have no data loss. It also
>>> results in a more resilient recovery that requires no human intervention
>>> (unless inheriting the flow would cause data loss - in that case, I
>>> believe human intervention is still warranted. But we will be able to
>>> inform the user of which connection(s) have data and would be removed so
>>> that they can address the concern.) Another added benefit of this
>>> approach is that it would allow automation tools to provision a NiFi
>>> node with a "seed flow" and if it joins a cluster with a flow, it will
>>> simply inherit the cluster's flow instead of using the seed flow.
>>> Currently, in order to do this, the automation tools would have to
>>> determine if a cluster already exists and if so not provide the seeded
>>> flow. I think this may be more important as users start running more and
>>> more on Kubernetes.
>>> 
>>> While I believe 1.7.0 will provide some great benefits to our clustering
>>> model, I do think that we can do better with respect to determining flow
>>> inheritance. The proposed inheritance model provides a mechanism that
>>> results in a user experience that, in my opinion, more closely aligns
>>> with user expectations. It would result in NiFi being more stable and
>>> reliable. However, it is a large enough departure from how we have been
>>> doing things to date that I thought it appropriate to start a DISCUSS
>>> thread to ensure that everyone is on the same page first.
>>> 
>>> Any thoughts?
>>> 
>>> Thanks
>>> -Mark
>>> 
>>> 
>>> [1] NIFI-5241
>>> [2] NIFI-950
>>> [3] NIFI-5112
>>> [4] NIFI-5204
>>> [5] NIFI-5208
>>> [6] NIFI-5186
>>> [7] NIFI-5153
