Hi Mark,

Thanks for the great write-up! I support your proposal. It seems a logical 
improvement based on your description of how fingerprinting works today, the 
original problem it was trying to solve, and the proposed alternative.

I am not an expert on how NiFi clustering is implemented, so I'm interested in 
what others have to say on this topic, but I have been bitten by the "node 
unable to join cluster" problem and anything we can do to (1) reduce the chance 
of that and (2) make it easier for a user to intervene, all while maintaining 
the avoidance of data-loss, sounds like a good idea to me.

Thanks,
Kevin

On 6/7/18, 09:46, "Mark Payne" <marka...@hotmail.com> wrote:

    Hi all,
    
    Over the past couple of months, I have been doing a lot of testing with 
large scale flows and
    talking to others who are using large scale flows in production. ("Large 
scale" flows in this case
    means several thousand to tens of thousands of Processors). While NiFi does 
a really good job of
    handling the data flow, one area that needs some improvement is around 
NiFi's clustering.
    So for the 1.7.0 version of NiFi, we have spent quite a bit of time 
focusing improving the
    clustering mechanism to hold up to more demanding flows. The focus really 
can be broken down into
    three focus areas: UI sluggishness [1] [2] [3], Cluster Stability [4] [2] 
[5] [3],
    and User Experience [5] [6] [7] (note that many of these JIRA's are listed 
under more than 1
    category.)
    
    With the above-mentioned JIRA's, I think we have significantly improved the 
stability and
    user experience around clustering. Local testing shows that in some cases, 
requests that previously
    took 15+ seconds (such as instantiating a template with several thousand 
processors) now take
    around 1 second. This provides a better user experience and also improves 
our cluster stability
    because it prevents nodes from dropping out the cluster due to timeouts.
    
    There is, however, another important area that I believe is ripe for 
improvement in our current
    model. That is the mechanism used when a node joins a cluster, in order to 
determine if the
    cluster's flow can be inherited by the node. While the above work will 
improve stability
    considerably, we need to be very mindful that failures will still occur. 
And we need to be good at
    recovering from those.
    
    The way that we do this currently is that we download the flow from the 
cluster, and then we
    "fingerprint" the flow. We then "fingerprint" our own flow and see if they 
match. What we mean by
    fingerprinting is that we go through the flow and pick out which elements 
should make a flow
    uninheritable and concatenate all of those together into one long String. 
The original purpose of
    this was to ensure that we don't lose any data when we join a node back to 
a cluster. When this was
    developed, though, we took a very strict approach of enforcing that the 
node's flow must match
    the cluster's flow - with only a few exceptions. For example, the position 
of a processor on the
    graph could be different; we simply inherit the cluster's value. The run 
status of a processor can
    be different; we simply inherit the cluster's value. 
    
    This fingerprinting approach has its benefit - it forces the user to be 
mindful of any differences
    between the node and cluster. However, it has several downsides as well. If 
a node fails to perform
    some update, it cannot join back to the cluster until the discrepancy is 
addressed. Additionally, it
    is difficult to understand just what the discrepancy is because the best 
info that we can provide is
    a segment of the fingerprint where the flows differ, and this is not very 
clear. It's also difficult
    to understand exactly which flow differences are relevant and which are not.
    
    The class that performs the fingerprinting is rather complex, and updates 
are rather error-prone
    because it is easy to forget to update the fingerprint when a new "feature" 
is added to a component.
    Worse still is that if a component gains a "collection" of objects, it is 
easy to forget to sort
    that collection, which results in incorrect fingerprinting that prevents a 
node from joining a
    cluster when it should be able to.
    
    Most importantly, though, the current approach requires manual user 
intervention when the flow
    differs, and almost always the solution that is suggested/used is to shut 
down the node, remove the
    flow.xml.gz, the users.xml, and authorizations.xml, and then restart. This 
will cause the
    node to inherit the cluster's flow.
    
    Clearly, this isn't ideal. I'd like to propose a far simpler approach to 
determining flow
    inheritability. Because the main goal of checking inheritability was to 
ensure that there is no
    data loss, I would propose that we use the same mechanism for inheriting a 
cluster flow as we do for
    updating to a new version of a Versioned Flow. We would first determine 
which connections would be
    removed from the flow if we inherit the cluster's flow. If there are no 
connections removed, then
    the flow is inheritable. If there are any connections removed, we will stop 
each removed
    connection's source and destination. We will then check if any connection 
has any queued data.
    If so, then we will restart all components that we started and fail. This 
is critical because the
    only way we can lose data when inheriting a flow is if we remove a 
connection with data queued.
    Otherwise, we determine that inheriting the flow will not cause data loss 
and therefore the flow is
    inheritable.
    
    This approach will still ensure that we have no data loss. It also results 
in a more resilient
    recovery that requires no human intervention (unless inheriting the flow 
would cause data loss -
    in that case, I believe human intervention is still warranted. But we will 
be able to inform the
    user of which connection(s) have data and would be removed so that they can 
address the concern.)
    Another added benefit of this approach is that it would allow automation 
tools to provision a node
    NiFi node with a "seed flow" and if it joins a cluster with a flow, it will 
simply inherit the
    cluster's flow instead of using the seed flow. Currently, in order to do 
this, the automation tools
    would have to determine if a cluster already exists and if so not provide 
the seeded flow. I think
    this may be more important as users start running more and more on 
Kubernetes.
    
    While I believe 1.7.0 will provide some great benefits to our clustering 
model, I do think that
    we can do better with respect to determining flow inheritance. The proposed 
inheritance model
    provides a mechanism that results in a user experience that more closely 
aligns with user 
    expectations in my opinion. It would result in NiFi being more stable and 
reliable. However, it is
    a large enough departure from how we have been doing things to-date that I 
thought it appropriate
    to start a DISCUSS thread to ensure that everyone is on the same page first.
    
    Any thoughts?
    
    Thanks
    -Mark
    
    
    [1] NIFI-5241
    [2] NIFI-950
    [3] NIFI-5112
    [4] NIFI-5204
    [5] NIFI-5208
    [6] NIFI-5186
    [7] NIFI-5153
    
    


Reply via email to