[
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638009#action_12638009
]
Yonik Seeley commented on SOLR-799:
-----------------------------------
Some thoughts...
- How should different "types" be handled (for example, when we support binary
fields)? Different base64 encoders might use different line lengths or
different line endings (CR/LF). Perhaps it's good enough to say that the
string form must be identical, and leave it at that for now? The
alternative would be signatures based on the Lucene Document about to be
indexed.
- It would be nice to be able to calculate a signature for a document without
having to concatenate all the fields together.
Perhaps change calculate(String content) to something like
calculate(Iterable<CharSequence> content)?
An alternative option would be incremental hashing:
{code}
Signature sig = ourSignatureCreator.create();
sig.add(f1);
sig.add(f2);
sig.add(f3);
String s = sig.getSignature();
{code}
Looking at how TextProfileSignature works, I'd lean toward incremental hashing
to avoid building yet another big string. Having a hashing object also opens up
the possibility of easily adding other method signatures for more efficient
hashing.
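To make the incremental idea concrete, here is a minimal sketch of what such a hashing object could look like. The class name, the MD5 choice, and the length-prefix framing are all assumptions for illustration, not the proposed API:

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: field values are fed one at a time into a running
// MessageDigest, so no concatenated copy of the document is ever built.
class IncrementalSignature {
    private final MessageDigest digest;

    IncrementalSignature() {
        try {
            this.digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    // Length-prefix each value so ("ab","c") and ("a","bc") hash differently.
    void add(CharSequence field) {
        digest.update((field.length() + ":" + field).getBytes());
    }

    // Finish the hash and return it as a hex string.
    String getSignature() {
        return new BigInteger(1, digest.digest()).toString(16);
    }
}
```

The same object could later grow overloads such as add(byte[]) for more efficient hashing of binary fields, which is the flexibility hinted at above.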
- It appears that if the fields are supplied in a different order, the
signature will change.
- It appears that documents with different field names but the same content
will have the same signature.
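Both issues could be addressed by hashing name=value pairs in sorted field-name order. A minimal sketch (the class name and framing are hypothetical, not the patch's code):

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: sorting by field name makes the signature independent
// of the order fields appear in the document, and including the name means
// identical content under different field names hashes differently.
class OrderInsensitiveSignature {
    static String signature(Map<String, String> fields) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // TreeMap iterates in sorted key order, so input order is irrelevant.
            for (Map.Entry<String, String> e : new TreeMap<>(fields).entrySet()) {
                md5.update((e.getKey() + "=" + e.getValue() + ";").getBytes());
            }
            return new BigInteger(1, md5.digest()).toString(16);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
}
```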
- I don't understand the dedup logic in DUH2... it seems like we want to delete
both by id and by sig. Unfortunately there is no
IndexWriter.updateDocument(Term[] terms, Document doc), so we'll have to do a
separate, non-atomic delete on the sig for now, right?
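The two-step workaround could look roughly like this (a sketch against the Lucene IndexWriter API; the method name and the "id"/"sig" field names are assumptions for illustration):

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class DedupUpdateSketch {
    // Hypothetical sketch of the non-atomic update described above.
    static void updateWithDedup(IndexWriter writer, Document doc,
                                String id, String sig) throws IOException {
        // Step 1: separate, non-atomic delete of any doc with the same signature.
        writer.deleteDocuments(new Term("sig", sig));
        // Step 2: updateDocument atomically deletes by id and adds the new doc.
        writer.updateDocument(new Term("id", id), doc);
    }
}
```

A crash between the two steps would leave the sig-delete applied without the add, which is exactly the non-atomicity flagged above.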
- There's probably no need for a separate test solrconfig-deduplicate.xml if
all it adds is an update processor. Tests could just explicitly specify the
update handler on updates.
> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
> Key: SOLR-799
> URL: https://issues.apache.org/jira/browse/SOLR-799
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Mark Miller
> Priority: Minor
> Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking
> as well as field collapsing. Let's put it into Solr.
> http://wiki.apache.org/solr/Deduplication
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.