[
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639304#action_12639304
]
Hoss Man commented on SOLR-799:
-------------------------------
If we assume for a minute that users who want to prevent or overwrite
duplicates using a signature should always use the signature field as their
uniqueKey, then doesn't use case #1 simplify to just running a
SignatureUpdateProcessor and then another processor that forces
"allowDups=false,overwritePending=false,overwriteCommitted=false" ?
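To make that two-step idea concrete, here is a minimal, self-contained sketch. It is not Solr code: the class SignatureDedupSketch, the in-memory map standing in for the index, the "id" field name, and the choice of MD5 are all assumptions made for illustration. Step one computes a signature over the chosen fields and uses it as the unique key; step two adds the document only if no document with that key already exists.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an index keyed by uniqueKey; not Solr's API.
class SignatureDedupSketch {
    private final Map<String, Map<String, String>> index = new LinkedHashMap<>();

    // Step 1: hash the selected fields into a hex signature.
    // MD5 is an assumption here; any stable hash would do for the sketch.
    static String signature(Map<String, String> doc, String... fields) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String field : fields) {
                String value = doc.getOrDefault(field, "");
                // Separator bytes keep ("ab","c") distinct from ("a","bc").
                md.update(field.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0);
                md.update(value.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Step 2: add without overwriting. Returns false (and leaves the
    // existing document untouched) when the signature is already present.
    boolean addNoOverwrite(Map<String, String> doc, String... fields) {
        String sig = signature(doc, fields);
        if (index.containsKey(sig)) {
            return false;
        }
        Map<String, String> stored = new LinkedHashMap<>(doc);
        stored.put("id", sig); // the signature doubles as the unique key
        index.put(sig, stored);
        return true;
    }

    int size() {
        return index.size();
    }
}
```

Calling addNoOverwrite twice with documents whose selected fields match leaves a single stored document; the second call simply reports that nothing was added.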
Conceptually that seems right ... but at the moment DIH2 doesn't seem to care
about allowDups at all (it only looks at overwriteCommitted and
overwritePending to decide if it's allowing duplicates) and i'm not sure how to
make it work off the top of my head, but assuming we need to muck with DIH2
internals in some way to make signatures (and aborting if the signature already
exists) work, implementing the changes for that combination of existing
options seems like the cleanest approach: the functional changes to
DIH2 become generally useful to anyone who doesn't want to overwrite existing
docs with the same id, regardless of whether they are computing a signature.
The only hangup is whether we're okay with the initial assumption: that users
who want duplicate detection by signature are willing to use the signature as
the uniqueKey. If not then perhaps the cleanest way to support that would be
to add more generalized "unique field" support ... a list of field names in the
schema.xml and a (hopefully) simple writer.deleteDocuments(Term[]) call in
DIH2 should do the trick right? ... this could also be potentially useful to
people for other purposes besides signatures, but i haven't thought through all
the permutations so i'm sure there would be funky corner cases.
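A rough sketch of what the generalized "unique field" behavior might look like, using a plain in-memory list instead of a real index. Everything here (the class UniqueFieldsSketch, its methods, the field names in the usage note) is hypothetical; the deleteMatching step only mimics the effect of deleting by a set of terms, where a document is removed if it matches any one (field, value) pair.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of multi-field uniqueness; not Solr or Lucene code.
class UniqueFieldsSketch {
    private final List<String> uniqueFields;
    private final List<Map<String, String>> docs = new ArrayList<>();

    // uniqueFields plays the role of the list of field names in the schema.
    UniqueFieldsSketch(List<String> uniqueFields) {
        this.uniqueFields = new ArrayList<>(uniqueFields);
    }

    // Remove every stored doc that shares a value with the incoming doc
    // on ANY unique field. Returns how many docs were removed.
    int deleteMatching(Map<String, String> incoming) {
        int before = docs.size();
        docs.removeIf(existing -> {
            for (String field : uniqueFields) {
                String value = incoming.get(field);
                if (value != null && value.equals(existing.get(field))) {
                    return true;
                }
            }
            return false;
        });
        return before - docs.size();
    }

    // Overwrite semantics: delete the matches first, then add the new doc.
    void add(Map<String, String> doc) {
        deleteMatching(doc);
        docs.add(new LinkedHashMap<>(doc));
    }

    int size() {
        return docs.size();
    }
}
```

With unique fields "id" and "signature", adding two documents that have different ids but the same signature leaves only the second one. One of the funky corner cases shows up right away: a single incoming document can match, and therefore delete, several unrelated existing documents if it collides with each of them on a different field.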
> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
> Key: SOLR-799
> URL: https://issues.apache.org/jira/browse/SOLR-799
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Mark Miller
> Priority: Minor
> Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking
> as well as field collapsing. Let's put it into Solr.
> http://wiki.apache.org/solr/Deduplication