Re: [DISCUSS] State Management improvements for DR scenarios

David Handermann Sat, 09 Dec 2023 14:49:34 -0800

Pierre,

Thanks for taking the time to put together the feature proposal with
additional background on the related Jira issues. The topic of
disaster recovery is an important and challenging one, so it is
definitely worth careful consideration.

At a high level, I think it is worth walking through some additional
use case flows, as that would highlight further considerations.

Taking the ListS3 to PutDatabaseRecord example, it raises some
concerns in itself that might be addressed in better ways. For
example, instead of relying on localized cluster state, a more
resilient approach could make use of an event queueing system like
Amazon SQS or SNS based on S3 Event Notifications. That avoids ListS3
entirely and provides a more fault-tolerant architecture for tracking
S3 items to be processed.
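
As a rough illustration of the event-driven alternative, the sketch
below turns the body of one queued S3 Event Notification message into
a list of objects to process. It assumes the standard notification
JSON layout (a "Records" array with "eventName" and "s3" fields) and
optionally an SNS envelope when S3 publishes to SNS with SQS
subscribed. The function name extract_s3_objects is hypothetical, and
polling the queue itself is left out.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(message_body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for objects created per one queue message.

    Hypothetical helper for illustration; not part of NiFi or any AWS SDK.
    """
    payload = json.loads(message_body)

    # When S3 publishes to SNS and SNS delivers to SQS (without raw message
    # delivery), the S3 event is a JSON string nested under "Message".
    if payload.get("Type") == "Notification" and "Message" in payload:
        payload = json.loads(payload["Message"])

    objects = []
    # Test events sent when a notification is configured carry no "Records".
    for record in payload.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded, with "+" standing in for spaces.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects
```

Because the queue, rather than the lister, holds the position in the
stream of work, a consumer in either region can pick up pending
messages without any shared listing state.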

The other half of the equation is the destination database. Although
it is also external to NiFi, the example provided implies that the
destination supports global redundancy such that communication from
different regions remains possible in the event of a single region
failure. That is certainly possible with various storage solutions,
but it highlights the fact that a true disaster recovery configuration
requires end-to-end design.

In the initial proposal, the diagrams show regional State Management
solutions. The concept of a composite state management solution is
interesting, but it seems to be attempting to make up for the lack of
a true distributed, resilient, and cross-region state management
solution. Granted, ZooKeeper and Kubernetes ConfigMap storage may not
be a good fit for a cross-region solution. However, it seems like it
would be better to evaluate an optimal cross-region state management
implementation, as opposed to implementing some type of replication or
leader-follower design in NiFi itself.

To be clear, this is certainly a topic worth considering, but I am not
confident that the implementation steps outlined in the initial two
Jira issues will provide a robust or maintainable solution. Supporting
component-level configuration of a custom state identifier seems prone
to error, and also requires a lot of manual configuration at the
individual Processor level. Supporting a composite state management
solution could have other benefits, but it also adds a layer of
complexity that
may not even achieve the desired outcome, depending on the
capabilities of the underlying storage implementations.
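
To make the concern about manually assigned identifiers concrete, here
is a toy sketch, not NiFi code. It assumes a state store keyed only by
a user-supplied identifier; the class InMemoryStateStore, the
identifier "list-s3-state", and the state key "last.listed.key" are
all hypothetical names chosen for illustration.

```python
class InMemoryStateStore:
    """Toy key-value state store keyed by a user-supplied identifier."""

    def __init__(self):
        self._entries = {}

    def set_state(self, state_id, state):
        # Last writer wins: nothing ties the identifier to a single owner.
        self._entries[state_id] = dict(state)

    def get_state(self, state_id):
        return dict(self._entries.get(state_id, {}))


store = InMemoryStateStore()

# Two different listing processors end up with the same custom identifier,
# for example after one configuration is copied to create the other.
shared_id = "list-s3-state"
store.set_state(shared_id, {"last.listed.key": "bucket-a/2023-12-08.csv"})
store.set_state(shared_id, {"last.listed.key": "bucket-b/2023-11-01.csv"})

# The first processor now reads back the second processor's position.
state_seen_by_first = store.get_state(shared_id)
```

In this sketch the first processor silently resumes from the wrong
position, with nothing in the store to flag the conflict.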

With that background, I think it would be worth evaluating alternative
approaches before moving to any kind of implementation. I'm sure there
are aspects I have not considered, so I welcome additional perspective
on the positives and negatives of the proposed solution.

Regards,
David Handermann

On Fri, Dec 8, 2023 at 8:32 AM Pierre Villard
<[email protected]> wrote:
>
> Team,
>
> I just published a feature proposal here:
> https://cwiki.apache.org/confluence/display/NIFI/State+Management+improvements+for+Disaster+Recovery+scenarios
>
> This feature proposal is to provide a more detailed explanation around the
> two below JIRAs:
> https://issues.apache.org/jira/browse/NIFI-11776
> https://issues.apache.org/jira/browse/NIFI-11777
>
> I'd love to hear your thoughts before we get started with the actual
> implementation.
>
> Thanks,
> Pierre
