Mike,

Thanks for raising this issue for additional discussion.

According to the MongoDB document referenced, the md5 option is deprecated, but not yet removed:
> The MD5 algorithm is prohibited by FIPS 140-2. MongoDB drivers
> deprecate MD5 support and will remove MD5 generation in future
> releases. Applications that require a file digest should implement it
> outside of GridFS and store in files.metadata
> <https://www.mongodb.com/docs/manual/core/gridfs/#mongodb-data-files.metadata>

There is a configuration option called disableMD5, but it still appears to be part of the GridFS specification. Were you able to confirm that it breaks in MongoDB 5 or 6?

I agree that we should be able to address this behavior in the current version of NiFi, and it seems like having a transitional way forward would be helpful. If the Testcontainers change can verify the current MD5 functionality, that should provide a good baseline for a subsequent PR to implement a new hashing strategy.

Regards,
David Handermann

On Tue, Oct 25, 2022 at 1:36 PM Mike Thomsen <mikerthom...@gmail.com> wrote:

> As-is, the deduplication-by-hash functionality appears to now be
> broken w/ Mongo 5 and higher. We can address that by doing some
> updates to the code base and recommending users add a HashContent
> processor before PutGridFS, but flows are going to break either way
> thanks to changes in Mongo itself. That's why I'm not sure we should
> be dogmatic about waiting.
>
> On Tue, Oct 25, 2022 at 2:15 PM Pierre Villard
> <pierre.villard...@gmail.com> wrote:
> >
> > IMO we should start working on NiFi 2.0 going forward and it sounds
> > like a good opportunity to make such changes in our components.
> >
> >
> > On Tue, Oct 25, 2022 at 19:33, Mike Thomsen <mikerthom...@gmail.com>
> > wrote:
> >
> > > The hash-based deduplication strategy used the built-in "md5"
> > > attribute to offload the work to the database. That functionality was
> > > deprecated and AFAICT gone as of Mongo 5:
> > >
> > > https://www.mongodb.com/docs/manual/core/gridfs/#files.md5
> > >
> > > I am proposing two changes:
> > >
> > > * Remove deduplication
> > > * Create a MongoDB DistributedMapCache client that can query on the
> > > file metadata since GridFS stores metadata separately from chunks
> > > making lookups that way cheap and flexible.
> > >
> > > I could easily add that to this PR which already covers Testcontainers
> > > integration, making it super easy to test the changed behavior:
> > >
> > > https://github.com/apache/nifi/pull/6460
> > >
> > > Thoughts?