The hash-based deduplication strategy used the built-in "md5" field
on GridFS file documents to offload the work to the database. That
functionality was deprecated and, AFAICT, is gone as of Mongo 5:

https://www.mongodb.com/docs/manual/core/gridfs/#files.md5

I am proposing two changes:

* Remove deduplication
* Create a MongoDB DistributedMapCache client that can query on the
file metadata. GridFS stores metadata separately from chunks, which
makes lookups that way cheap and flexible.

I could easily add that to this PR, which already covers Testcontainers
integration and makes it super easy to test the changed behavior:

https://github.com/apache/nifi/pull/6460

Thoughts?
