Pierre,

Thanks for taking the time to put together the feature proposal with additional background on the related Jira issues. The topic of disaster recovery is an important and challenging one, so it is definitely worth careful consideration.

At a high level, I think it is worth walking through some additional use case flows, as they would surface further considerations. The ListS3 to PutDatabaseRecord example raises concerns of its own that might be addressed in better ways. For example, instead of relying on localized cluster state, a more resilient approach could make use of an event queueing system like Amazon SQS or SNS based on S3 Event Notifications. That avoids ListS3 entirely and provides a more fault-tolerant architecture for tracking S3 items to be processed.

The other half of the equation is the destination database. Although it is also external to NiFi, the example provided implies that the destination supports global redundancy, such that communication from different regions remains possible in the event of a single region failure. That is certainly possible with various storage solutions; it just highlights the fact that a true disaster recovery configuration requires end-to-end design.

In the initial proposal, the diagrams show regional State Management solutions. The concept of a composite state management solution is interesting, but it seems to be attempting to make up for the lack of a true distributed, resilient, and cross-region state management solution. Granted, ZooKeeper and Kubernetes ConfigMap storage may not be a good fit for a cross-region solution. However, it seems like it would be better to evaluate an optimal cross-region state management implementation, as opposed to implementing some type of replication or leader-follower design in NiFi itself.

To be clear, this is certainly a topic worth considering, but I am not confident that the implementation steps outlined in the initial two Jira issues will provide a robust or maintainable solution. Supporting component-level configuration of a custom state identifier seems prone to error, and also requires a lot of manual configuration at the individual Processor level.
Supporting a composite state management solution could have other benefits, but it also adds a layer of complexity that may not even achieve the desired outcome, depending on the capabilities of the underlying storage implementations.

With that background, I think it would be worth evaluating alternative approaches before moving to any kind of implementation. I'm sure there are aspects I have not considered, so I welcome additional perspective on the positives and negatives of the proposed solution.

Regards,
David Handermann

On Fri, Dec 8, 2023 at 8:32 AM Pierre Villard <[email protected]> wrote:
>
> Team,
>
> I just published a feature proposal here:
> https://cwiki.apache.org/confluence/display/NIFI/State+Management+improvements+for+Disaster+Recovery+scenarios
>
> This feature proposal is to provide a more detailed explanation around the
> two below JIRAs:
> https://issues.apache.org/jira/browse/NIFI-11776
> https://issues.apache.org/jira/browse/NIFI-11777
>
> I'd love to hear your thoughts before we get started with the actual
> implementation.
>
> Thanks,
> Pierre
