Mike,

Thanks for raising this issue for additional discussion. According to the
MongoDB document referenced, the md5 option is deprecated, but not yet
removed:

> The MD5 algorithm is prohibited by FIPS 140-2. MongoDB drivers deprecate
> MD5 support and will remove MD5 generation in future releases. Applications
> that require a file digest should implement it outside of GridFS and store
> in files.metadata
> <https://www.mongodb.com/docs/manual/core/gridfs/#mongodb-data-files.metadata>

There is a configuration option called disableMD5, but MD5 support still
appears to be part of the GridFS specification. Were you able to confirm
that it breaks in MongoDB 5 or 6?

I agree that we should be able to address this behavior in the current
version of NiFi, and it seems like having a transitional way forward would
be helpful. If the Testcontainers change can verify the current MD5
functionality, that should provide a good baseline for a subsequent PR to
implement a new hashing strategy.
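
As a rough illustration of what computing the digest outside of GridFS
could look like, here is a minimal sketch. It is not taken from the NiFi
code base or from the PR. The class name, the method names, the choice of
SHA-256, and the metadata keys "hashAlgorithm" and "hashValue" are all
assumptions made for the example. It uses only the JDK, so the step that
actually writes the metadata to GridFS is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: computes a content digest on the client side so it
// can be stored in the GridFS files.metadata document instead of relying
// on the deprecated server-side md5 field.
public class FileDigestSketch {

    // Streams the content through SHA-256 and returns a lowercase hex string.
    static String sha256Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Convenience overload for content that is already in memory.
    static String sha256Hex(byte[] content) throws IOException, NoSuchAlgorithmException {
        return sha256Hex(new ByteArrayInputStream(content));
    }

    // Builds the metadata entries to attach to the uploaded file.
    // The key names are assumptions, not an existing NiFi or MongoDB convention.
    static Map<String, Object> metadataWithDigest(String hexDigest) {
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("hashAlgorithm", "SHA-256");
        metadata.put("hashValue", hexDigest);
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        byte[] content = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(metadataWithDigest(sha256Hex(content)));
    }
}
```

A deduplication check would then query the files collection on the stored
hash value before uploading, rather than on the built-in md5 field.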

Regards,
David Handermann

On Tue, Oct 25, 2022 at 1:36 PM Mike Thomsen <mikerthom...@gmail.com> wrote:

> As-is, the deduplication-by-hash functionality appears to now be
> broken w/ Mongo 5 and higher. We can address that by doing some
> updates to the code base and recommending users add a HashContent
> processor before PutGridFS, but flows are going to break either way
> thanks to changes in Mongo itself. That's why I'm not sure we should
> be dogmatic about waiting.
>
> On Tue, Oct 25, 2022 at 2:15 PM Pierre Villard
> <pierre.villard...@gmail.com> wrote:
> >
> > IMO we should start working on NiFi 2.0 going forward and it sounds like a
> > good opportunity to make such changes in our components.
> >
> >
> > On Tue, Oct 25, 2022 at 19:33, Mike Thomsen <mikerthom...@gmail.com>
> > wrote:
> >
> > > The hash-based deduplication strategy used the built-in "md5"
> > > attribute to offload the work to the database. That functionality was
> > > deprecated and AFAICT gone as of Mongo 5:
> > >
> > > https://www.mongodb.com/docs/manual/core/gridfs/#files.md5
> > >
> > > I am proposing two changes:
> > >
> > > * Remove deduplication
> > > * Create a MongoDB DistributedMapCache client that can query on the
> > > file metadata. GridFS stores metadata separately from chunks, which
> > > makes lookups that way cheap and flexible.
> > >
> > > I could easily add that to this PR which already covers Testcontainers
> > > integration, making it super easy to test the changed behavior:
> > >
> > > https://github.com/apache/nifi/pull/6460
> > >
> > > Thoughts?
> > >
>
