As-is, the deduplication-by-hash functionality appears to be broken
with Mongo 5 and higher. We can address that by updating the code base
and recommending that users add a HashContent processor before
PutGridFS, but flows are going to break either way because of changes
in Mongo itself. That's why I'm not sure we should be dogmatic about
waiting.
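
A rough sketch of what client-side deduplication could look like once
the database no longer computes the hash. This is only an illustration
in Python, not NiFi code: InMemoryFilesCollection is a hypothetical
stand-in for the GridFS files metadata collection, "content_hash" is an
assumed metadata field name, and SHA-256 is an arbitrary choice of
digest.

```python
import hashlib


class InMemoryFilesCollection:
    """Hypothetical stand-in for the GridFS files metadata collection."""

    def __init__(self):
        self._docs = []

    def find_one(self, query):
        # Return the first document whose fields match every key in the query.
        for doc in self._docs:
            if all(doc.get(key) == value for key, value in query.items()):
                return doc
        return None

    def insert_one(self, doc):
        self._docs.append(doc)


def content_hash(data: bytes) -> str:
    # The hash is computed by the client, not by the database.
    return hashlib.sha256(data).hexdigest()


def put_if_absent(files, filename: str, data: bytes) -> bool:
    """Store metadata for the file unless the same content hash already exists.

    Returns True if the file was stored, False if it was a duplicate.
    """
    digest = content_hash(data)
    # "content_hash" is an assumed field name, not part of GridFS itself.
    if files.find_one({"content_hash": digest}) is not None:
        return False
    files.insert_one(
        {"filename": filename, "length": len(data), "content_hash": digest}
    )
    return True
```

The lookup only touches the metadata document, never the chunks, which
is the property that would make this kind of check cheap. A real
implementation would also need to handle two writers racing on the same
hash, for example with a unique index on the hash field.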

On Tue, Oct 25, 2022 at 2:15 PM Pierre Villard
<pierre.villard...@gmail.com> wrote:
>
> IMO we should start working on NiFi 2.0 going forward and it sounds like a
> good opportunity to make such changes in our components.
>
>
> On Tue, Oct 25, 2022 at 19:33, Mike Thomsen <mikerthom...@gmail.com>
> wrote:
>
> > The hash-based deduplication strategy used the built-in "md5"
> > attribute to offload the work to the database. That functionality was
> > deprecated and AFAICT gone as of Mongo 5:
> >
> > https://www.mongodb.com/docs/manual/core/gridfs/#files.md5
> >
> > I am proposing two changes:
> >
> > * Remove deduplication
> > * Create a MongoDB DistributedMapCache client that can query on the
> > file metadata. GridFS stores metadata separately from chunks, which
> > makes lookups that way cheap and flexible.
> >
> > I could easily add that to this PR which already covers Testcontainers
> > integration, making it super easy to test the changed behavior:
> >
> > https://github.com/apache/nifi/pull/6460
> >
> > Thoughts?
> >
