Hello Team! I've been working with NiFi for a bit now and am seeing a usage pattern within my team that I think could be improved. We have thrown around the idea of creating an additional provenance repository implementation that would allow the storage and retrieval of `ProvenanceEventRecord` instances in an external database / service to support more cloud-centric deployments.
Expanding where NiFi can store provenance would allow the instance/cluster itself to offload the storage and management of provenance events to an external tool such as Elasticsearch / OpenSearch or Solr. When targeting cloud-based deployments of NiFi, resource constraints are an important consideration. Externalizing some database-like features would allow more resources to be allocated to data processing tasks. Also, in the event that a container or VM needs to be replaced or scaled down, having provenance stored in an external service would still allow other nodes in the cluster to access those events.

My goal is to refactor some of the existing implementations within the nifi-data-provenance-utils module to decouple them from being disk-centric. To go along with that, I'd like to create some new interfaces that external services could be built against. In my research and prototyping for this, I've run into several situations where following the existing patterns and subtyping the existing types doesn't make sense for an external provider.

I don't yet have any complete implementations due to the amount of work I think would be involved. So far my research has primarily been with using Elasticsearch as a backing store.

I believe this would rise to the level of requiring a NIP and would like to see how the larger dev team feels about this.

Thank you!

--
-David Y.
