[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849091#action_12849091 ] Thomas Heigl commented on SOLR-799: --- Hello, for my current project I need to implement an index-time mechanism to detect (near-)duplicate documents. The TextProfileSignature available out of the box (http://wiki.apache.org/solr/Deduplication) seems alright, but it does not use global collection statistics in deciding which terms will be used for calculating the signature. Most state-of-the-art hash-based duplicate detection algorithms make use of this information to improve precision and recall (e.g. http://portal.acm.org/citation.cfm?id=506311&dl=GUIDE&coll=GUIDE&CFID=83187370&CFTOKEN=47052122). Is it possible to access collection statistics - especially IDF values for all non-discarded terms in the current document - from within an implementation of the Signature class? Kind regards, Thomas
Add support for hash based exact/near duplicate document handling - Key: SOLR-799 URL: https://issues.apache.org/jira/browse/SOLR-799 Project: Solr Issue Type: New Feature Components: update Reporter: Mark Miller Assignee: Yonik Seeley Priority: Minor Fix For: 1.4 Attachments: SOLR-799.patch, SOLR-799.patch, SOLR-799.patch, SOLR-799.patch
Hash based duplicate document detection is efficient and allows for blocking as well as field collapsing. Let's put it into solr. http://wiki.apache.org/solr/Deduplication -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
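The kind of IDF-aware term selection Thomas asks about could, in principle, look like the sketch below. This is plain Java, not Solr's actual Signature API: the `docFreqs` map and `numDocs` parameter are stand-ins for whatever collection statistics Solr would need to expose to a Signature implementation.

```java
import java.util.*;

// Illustrative sketch (not Solr's Signature API): pick the terms with the
// highest IDF from a document to feed into a signature, given global stats.
public class IdfTermSelector {
    // idf = log(N / df), the standard inverse document frequency
    static double idf(long numDocs, long docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    // Return the k terms with the highest IDF (i.e. the rarest terms).
    static List<String> topIdfTerms(Collection<String> docTerms,
                                    Map<String, Long> docFreqs,
                                    long numDocs, int k) {
        List<String> terms = new ArrayList<>(docTerms);
        // Sort descending by IDF (ascending by negated IDF)
        terms.sort(Comparator.comparingDouble(
                (String t) -> -idf(numDocs, docFreqs.getOrDefault(t, 1L))));
        return terms.subList(0, Math.min(k, terms.size()));
    }

    public static void main(String[] args) {
        Map<String, Long> df = Map.of("the", 1000L, "solr", 40L, "dedupe", 5L);
        List<String> picked =
                topIdfTerms(List.of("the", "solr", "dedupe"), df, 1000L, 2);
        System.out.println(picked); // prints [dedupe, solr]
    }
}
```

The common term ("the", with IDF 0) is discarded and the two rarest terms survive, which is the behavior the cited literature relies on to improve precision.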
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849225#action_12849225 ] Andrzej Bialecki commented on SOLR-799: This issue is closed - please use the mailing lists for discussions.
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676041#action_12676041 ] Hoss Man commented on SOLR-799: --- The separation of concerns between schema.xml and solrconfig.xml has always been...
* schema.xml: what is the data, what is its nature, what are its intrinsic properties?
* solrconfig.xml: what can people do with your data, how can they use it?
fields, fieldTypes, analyzers, and copyFields go in schema.xml because they are (in theory) intrinsic to the nature of your data regardless of where a given document comes from:
* documents should only have one author
* categoryName should always be tokenized in a particular way
* prices need to sort numerically, not lexicographically
* any text indexed in the shortSummary field should also be indexed in the searchableAbstract field
* etc...
Request handlers that dictate how people can use the data are specified in solrconfig.xml -- when searching data, request handlers (which may leverage search components) dictate what a user is allowed to get/see; when modifying an index, request handlers (which may leverage update processors) dictate what data is allowed to come from various sources and in what formats. In short: as far as document indexing goes, the options configured in solrconfig.xml specify how to build up a Document object from user input, while the options in schema.xml specify how to tear it down into its individual terms and values for indexing. With the near-duplicate detection code, it is the schema's job to say which fields can exist in the input documents, including a signature field -- but it is the solrconfig's job to decide how to compute that signature field ... after all: the computation might be different depending on the source of the data (i.e. different processor chains could be configured for different request handlers)
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675748#action_12675748 ] Lance Norskog commented on SOLR-799: I came into Solr with no search experience and it was quite a learning curve. The modular design of the configuration really helped, and we should maintain that modularity. There are two different designs: the design of the configuration and the design of the implementation. This comment only addresses the design of the configuration files. The patch as committed moves the specification of one field out of the schema.xml file to another file. This breaks the modularity of the configurations. I suggest that the files should look like this:
schema.xml:
{code}
<field name="signatureField" type="signatureField" indexed="true" stored="false"
       signature="solr.TextProfileSignature" fields="product_name, model_t, *_s" />
{code}
solrconfig.xml:
{code}
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <string name="signatureField">signatureField</string>
    <bool name="enabled">false</bool>
    <bool name="overwriteDupes">true</bool>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
{code}
That is, the design of the signature field should go in schema.xml, and each updateRequest section should only describe how it is used with that section's declared name. Also, there should be no default field, since every field in the schema should be described in schema.xml.
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675770#action_12675770 ] Shalin Shekhar Mangar commented on SOLR-799:
bq. <field name="signatureField" type="signatureField" indexed="true" stored="false" signature="solr.TextProfileSignature" fields="product_name, model_t, *_s" />
I don't think signatureField is a separate type. It is just a string, right?
bq. The patch as committed moves the specification of one field out of schema.xml file to another file.
bq. That is, the design of the signature field should go in schema.xml, and each updateRequest section should only describe how it is used with that section's declared name. Also, there should be no default field, since every field in the schema should be described in schema.xml.
The design of the signature field goes into schema.xml right now too. The wiki clearly states the following about signatureField:
{code}
The name of the field used to hold the fingerprint/signature. Be sure the field is defined in schema.xml.
{code}
bq. <field name="signatureField" type="signatureField" indexed="true" stored="false" signature="solr.TextProfileSignature" fields="product_name, model_t, *_s" />
I don't agree with the above. The method of computing the contents of the field should not be part of schema.xml. I do not understand your concern, maybe because I'm not very familiar with this feature.
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653484#action_12653484 ] Yonik Seeley commented on SOLR-799: --- Why not plug in an entirely new chain? That is one of the ways it would be done by users of this component, right?
{code}
<updateRequestProcessorChain name="hash">
  [...]
</updateRequestProcessorChain>
{code}
And then in the test send in update.processor=hash as a parameter.
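A complete "hash" chain along those lines might look like the following sketch in solrconfig.xml. The factory class and parameter names are taken from the config fragments quoted elsewhere in this thread and the Deduplication wiki page; treat the exact element names as illustrative rather than authoritative:
{code}
<updateRequestProcessorChain name="hash">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signatureField</str>
    <bool name="overwriteDupes">true</bool>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
{code}
A test (or any client) would then select this chain per-request with the update.processor=hash parameter, leaving the default chain untouched.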
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652815#action_12652815 ] Mark Miller commented on SOLR-799: -- I'm going to put up another patch for this soon. I'd like to have some getters on the factory for manual exploration of the settings.
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652836#action_12652836 ] Yonik Seeley commented on SOLR-799: ---
bq. Now that I look to fix this, I am not understanding - I don't need to change the update handler, I need to change the update chain...I am not seeing how that can be done dynamically...is it possible?
Yes, you can dynamically change an update processor: update.processor=hash
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652837#action_12652837 ] Mark Miller commented on SOLR-799: -- Okay, I see. I was too intent on changing the current chain - the problem indeed goes away by just plugging in an entirely new chain.
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652923#action_12652923 ] Ryan McKinley commented on SOLR-799: I'm not sure how you have the test set up, so I could be way off base. You could subclass SearchHandler and set the protected List<SearchComponent> components directly...
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649826#action_12649826 ] Mark Miller commented on SOLR-799: --
bq. There's probably no need for a separate test solrconfig-deduplicate.xml if all it adds is an update processor. Tests could just explicitly specify the update handler on updates.
Now that I look to fix this, I am not understanding - I don't need to change the update handler, I need to change the update chain...I am not seeing how that can be done dynamically...is it possible? If not I think I need the config xml.
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645061#action_12645061 ] Yonik Seeley commented on SOLR-799: ---
bq. Maybe we just do overwrite dupe for now?
+1, as long as we don't do anything to preclude the other stuff - we just need to leave room in the config XML and the update API such that we don't have to break the back compatibility of this patch if/when future features are implemented.
bq. Another point that was brought up is whether or not to delete any docs that match the update doc's uniqueField id term, but not its similarity/update term. You are choosing to use the updateTerm to do updates rather than the unique term.
It seems like uniqueField should normally enforce uniqueness, regardless of what this component does. If one wants duplicate ids, then it seems like a different field should be used for that (other than the uniqueKey field). If one wants to delete *only* on the hash field, then they can make the hash field the id field.
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645073#action_12645073 ] Mark Miller commented on SOLR-799: -- Ok. I can't muster up much of a defense for leaving it out, I suppose. I'll polish off a final patch. - Mark
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642245#action_12642245 ] Mark Miller commented on SOLR-799: -- I find the pluggable delete policy idea appealing, but I have not yet found a great way to plug it into the UpdateHandler. Any approach other than sub-classing DirectUpdateHandler2 appears to lead to tying an IndexWriter to UpdateHandler. There is a connection now, UpdateHandler has a method to create a main IndexWriter, but further tying seems wrong without a stronger reason. That point is arguable, but in the end, sub-classing results in simpler code in any case. The trade-off is that now you have a PreventDupesDirectUpdateHandler that extends DirectUpdateHandler2. This would have to be used in combination with the SignatureUpdateProcessor if you want to prevent dupes from entering the index. Other use cases (other than overwriting) would require another UpdateHandler. Less than ideal in both cases (subclassing, pluggable interface/class).
Both approaches lead to less than ideal solutions beyond that as well. Because many docs that have been added to Solr might not yet be visible to an IndexReader, you have to keep a pending-commit set of docs to check against. This set should be resilient against AddDoc, DeleteByQuery, AddDoc, Commit. You'd essentially have to keep a mini index around to search against to accomplish this, due to delete by query. The other options are to either auto-commit sans a user commit before a delete, or just say we don't support that use case when using that UpdateHandler. None of it is very pretty.
Another option is to do things with an UpdateProcessor. This is the most elegant solution really, but it requires putting big, coarse syncs around the more precise syncs in DirectUpdateHandler2. That may not be a huge deal, I am not sure. The previous two options allow you to maintain similar syncs to what is already there. Beyond that, the UpdateProcessor approach still has the delete-by-query issues.
Maybe we just do overwrite dupe for now? It has none of these issues. I am open to whatever path you guys want. The other use cases do have their place - we will just have to compromise some to get there. Or maybe there are other suggestions?
Another point that was brought up is whether or not to delete any docs that match the update doc's uniqueField id term, but not its similarity/update term. At the current moment, IMO, we shouldn't. You are choosing to use the updateTerm to do updates rather than the unique term. This allows you to have duplicate signatures but still unique uniqueField ids for other operations (say delete). Also, if you already have a unique field that you're using, it may be desirable to do dupe detection with a different field. There is always the option of setting the signature field to the uniqueField term. In the end, your call, I'll add it if we want it.
As far as search-time dupe collapsing, I think I could see a search component that takes the page numbers to collapse (start, end) and does dupe elimination on that range at query time. It wouldn't be very fast, and I'm not sure how useful page-at-a-time collapsing is, but it would be fairly easy to do. Not sure that it fits into this issue, but it certainly could share some of its classes.
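The "overwrite dupe" option being discussed can be pictured as an index keyed by signature, where adding a document whose signature already exists replaces the older document. This is a toy in-memory model, not Solr or Lucene code:

```java
import java.util.*;

// Toy model of "overwriteDupes": an index keyed by signature where adding a
// document with an existing signature overwrites the older one.
public class OverwriteDupesDemo {
    private final Map<String, String> docsBySignature = new LinkedHashMap<>();

    void add(String signature, String docId) {
        docsBySignature.put(signature, docId); // newer doc wins
    }

    Collection<String> docs() {
        return docsBySignature.values();
    }

    public static void main(String[] args) {
        OverwriteDupesDemo index = new OverwriteDupesDemo();
        index.add("sig-A", "doc1");
        index.add("sig-B", "doc2");
        index.add("sig-A", "doc3"); // duplicate signature: replaces doc1
        System.out.println(index.docs()); // prints [doc3, doc2]
    }
}
```

Unlike the "prevent insert" and "pending commit" cases above, this needs no extra visibility tracking: the overwrite is just a keyed replace, which is why it sidesteps the delete-by-query and locking issues.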
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639470#action_12639470 ] Yonik Seeley commented on SOLR-799: --- overwriting is implemented and supported in Lucene now (and we gain a number of benefits from using that). Conditionally adding a document, or testing if a document already exists, is not supported. Since we can't currently determine if something is a duplicate, it seems like this issue should go ahead with just a single option: whether to remove older documents with the same signature or not.
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639479#action_12639479 ] Otis Gospodnetic commented on SOLR-799: --- Thanks Yonik. Good thing I asked for the clarification, since Mark's issue description does mention search-time stuff (field collapsing). Mark: Do you still plan on tackling search-time duplicate/near-duplicate/similar doc detection? In a separate issue? Thanks.
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639456#action_12639456 ] Otis Gospodnetic commented on SOLR-799: --- Haven't looked at the patch yet. Have looked at the Deduplication wiki page (and realize the stuff I'll write below is briefly mentioned there). Have skimmed the above comments. I want to bring up the use case that seems to have been mentioned already, but only in passing. The focus of the previous comments seems to be on index-time duplication detection. Another huge use case is search-time near-duplicate detection. Sometimes it's about straightforward field collapsing (collapsing adjacent docs with identical values in some field), but sometimes it's more complicated. For example, sometimes multiple fields need to be compared. Sometimes they have to be identical for collapsing to happen. Sometimes they only need to be similar. How similarity is calculated is very application-dependent. I believe this similarity computation has to be completely open/extensible/overridable, allowing one to write a custom search component, examine returned hits, and compare them using app-specific similarity. Ideally one would have the option not to save the document/field at index-time (for examination at search-time), since that prevents one from experimenting and dynamically changing the similarity computation.
Here is one example. Imagine a field called IDs that can have 1 or more tokens in it, and imagine docs with the following IDs get returned:
1) id:aaa
2) id:bbb
3) id:ccc ddd
4) id:aaa bbb
5) id:eee ddd
6) id:aaa
A custom similarity may look at all of the above (e.g. a page's worth of hits) and decide that:
1) and 4) are similar
2) and 4) are also similar
3) and 5) are similar
1) and 4) and 6) are similar
Another custom similarity may say that only 1) and 6) are similar, because they are identical. My point is really that we have to leave it up to the application to provide the similarity implementation, just like we make it possible for the app to provide a custom Lucene Similarity. Is the goal of this issue to make this possible?
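The first custom similarity in Otis's example (docs are similar if their IDs fields share a token) could be sketched as a tiny app-side predicate. This is illustrative only - not a proposed Solr API:

```java
import java.util.*;

// Sketch of an app-defined similarity for Otis's example: two docs are
// "similar" if their whitespace-tokenized IDs fields share at least one token.
public class TokenOverlapSimilarity {
    static boolean similar(String idsA, String idsB) {
        Set<String> shared = new HashSet<>(Arrays.asList(idsA.split("\\s+")));
        shared.retainAll(Arrays.asList(idsB.split("\\s+")));
        return !shared.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(similar("aaa", "aaa bbb"));     // docs 1 and 4: true
        System.out.println(similar("ccc ddd", "eee ddd")); // docs 3 and 5: true
        System.out.println(similar("aaa", "bbb"));         // docs 1 and 2: false
    }
}
```

The stricter similarity (only identical IDs collapse) would just replace the overlap test with `idsA.equals(idsB)` - which is exactly why this computation wants to be pluggable rather than baked in.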
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639304#action_12639304 ] Hoss Man commented on SOLR-799: --- If we assume for a minute that users who want to prevent or overwrite duplicates using a signature should always use the signature field as their uniqueKey, then doesn't use case #1 simplify to just running a SignatureUpdateProcessor and then another processor that forces allowDups=false, overwritePending=false, overwriteCommitted=false? Conceptually that seems right ... but at the moment DUH2 (DirectUpdateHandler2) doesn't seem to care about allowDups at all (it only looks at overwriteCommitted and overwritePending to decide if it's allowing duplicates) and I'm not sure how to make it work off the top of my head, but assuming we need to muck with DUH2 internals in some way to make signatures (and aborting if the signature already exists) work, implementing the changes to happen for those combinations of existing options seems like the cleanest approach: the functional changes to DUH2 become generally useful to anyone who doesn't want to overwrite existing docs with the same id, regardless of whether they are computing a signature. The only hangup is whether we're okay with the initial assumption: that users who want duplicate detection by signature are willing to use the signature as the uniqueKey. If not, then perhaps the cleanest way to support that would be to add more generalized unique field support ... a list of field names in the schema.xml and a (hopefully) simple writer.deleteDocuments(Term[]) call in DUH2 should do the trick, right? ... this could also be potentially useful to people for other purposes besides signatures, but I haven't thought through all the permutations, so I'm sure there would be funky corner cases.
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638850#action_12638850 ] Mark Miller commented on SOLR-799: --
bq. 1. Prevent new insert - SignatureUpdateProcessor generates a signature and adds it as a field; AbortIfExistingUpdateProcessor aborts the update if a doc exists with a specific field in common with the doc to be added.
I like the idea of using UpdateProcessors for all of this. It's very clean compared to hacking around the DirectUpdateHandler. Unfortunately, I think AbortIfExistingUpdateProcessor would require locks that are too coarse. Ideally, you want to be able to inject code into the DirectUpdateHandler's 3 levels of locking (iw, sync(this), none). That's what's needed for efficiency, but the cleanness gets whacked - it's ugly to get that done, and doesn't really mesh with the UpdateHandler API that's been defined. The linking of DirectUpdateHandler2's addDoc implementation to the whole idea...there would have to be changes that just don't seem worth the added functionality. Which leaves just hardcoding the support into DirectUpdateHandler, kind of like was done before for deletes/id dupes, and then just giving options on the add doc cmd. Again, I don't like it. But anything else quickly breaks down for me. Any suggestions, insights?
[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638217#action_12638217 ] Andrzej Bialecki commented on SOLR-799: +1 on the incremental sig calculation. Re: different types of signatures. Our experience in Nutch is that the signature type is rarely changed, and we assume that this setting is selected once per lifetime of an index, i.e. there are never any mixed cases of documents with incompatible signatures. If we want to be sure that they are comparable, we could prepend a byte or two of a unique signature type id - this way, even if a signature value matches but was calculated using another implementation, the documents won't be considered duplicates, which is the way it should work, because different signature algorithms are incomparable. Re: signature as byte[] - I think it's better if we return byte[] from Signature, and until we support binary fields we just turn this into a hex string. Re: field ordering in DeduplicateUpdateProcessorFactory: I think that both sigFields (if defined) and any other document fields (if sigFields is undefined) should first be ordered in a predictable way (lexicographic?). The current patch uses a HashSet, which doesn't guarantee any particular ordering - in fact the ordering may be different if you run the same code under different JVMs, which may introduce a random factor into the sig calculation.
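Andrzej's type-id prefix idea can be sketched as follows. This is a minimal stand-alone illustration, not patch code: the class name, method, and the one-byte type constants are invented for the example, and both "implementations" use MD5 here purely to show that identical hash bytes still yield distinct signatures once the prefix differs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Sketch: prepend a one-byte signature-type id to the raw hash bytes so that
// signatures from different algorithms can never be considered equal, even if
// the underlying hash values happen to collide.
public class TypedSignature {
    static final byte MD5_TYPE = 0x01;          // assumed id for an MD5-based signature
    static final byte TEXT_PROFILE_TYPE = 0x02; // assumed id for a TextProfileSignature

    static byte[] sign(byte typeId, String content) {
        try {
            byte[] hash = MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            byte[] typed = new byte[hash.length + 1];
            typed[0] = typeId;                   // type id comes first
            System.arraycopy(hash, 0, typed, 1, hash.length);
            return typed;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);  // MD5 is mandated on all JREs
        }
    }

    public static void main(String[] args) {
        byte[] a = sign(MD5_TYPE, "same content");
        byte[] b = sign(TEXT_PROFILE_TYPE, "same content");
        // Identical hash bytes, but the type prefix keeps the signatures distinct.
        System.out.println(Arrays.equals(a, b)); // false
    }
}
```

Until binary fields land, the `byte[]` would be hex-encoded as Andrzej suggests, and the prefix byte survives that encoding unchanged.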
[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638426#action_12638426 ] Hoss Man commented on SOLR-799: --- (disclaimer: haven't looked at the patch) bq. Though in some implementations (like #2, which may be the default), detecting that duplicate and handling it are truly coupled... forcing a decoupling would not be a good thing in that case. I don't follow your reasoning. All the use cases I've seen mentioned seem like they could/would decouple very nicely... 1. Prevent new insert -- SignatureUpdateProcessor generates a signature and adds it as a field; AbortIfExistingUpdateProcessor aborts the update if a doc exists with a specific field in common with the doc to be added. 2. Remove old (i.e. same as an update works now) -- SignatureUpdateProcessor as mentioned before, and the signature field is used as the uniqueKey field. 3. Note the duplicate on the existing document in a duplicates field -- SignatureUpdateProcessor as mentioned before; AnnotateDuplicatesProcessor checks for existing docs with a specific field in common with the doc to be added and executes additional operations to update those docs, as well as the doc to be added.
[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638427#action_12638427 ] Hoss Man commented on SOLR-799: --- some misc comments from a user perspective based on the current state of the wiki... 1) rather than a comma-separated str of fields, we should just use an arr 2) we should consider if/how we want to support using dynamicFields (ie: field name globs) in listing fields that are included in the signature 3) By default, all non null fields on the document will be used. ... there's no such thing as a null field -- there are fields that have no value, and there are fields whose value is an empty string, but no null value. 4) yonik already asked other questions i had based on the wiki: how the order of fields in the update command affects the signature that gets computed -- both in terms of fields with different names, and fields with the same name. The fields should probably be stable sorted by field name, so that the order of fields with the same name affects the signature, but the relative order of fields with different names doesn't (since the order of fields with the same name actually affects the way the document is indexed, but the order of different field names does not)
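Hoss's stable-sort suggestion in point 4 can be sketched as below. This is a hypothetical helper, not patch code; it relies on `java.util.List.sort` being guaranteed stable, which is exactly the property needed: same-name values keep their input order while differently named fields normalize to one order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: canonicalize (name, value) pairs with a stable sort on field name
// before they are fed to the signature calculation.
public class StableFieldOrder {
    static String canonicalize(List<Map.Entry<String, String>> fields) {
        List<Map.Entry<String, String>> sorted = new ArrayList<>(fields);
        // List.sort is stable, so duplicate field names keep their input order.
        sorted.sort(Map.Entry.comparingByKey());
        return sorted.stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("|"));
    }

    public static void main(String[] args) {
        // Same document, differently named fields listed in different orders...
        List<Map.Entry<String, String>> doc1 = List.of(
                Map.entry("title", "dedup"), Map.entry("cat", "a"), Map.entry("cat", "b"));
        List<Map.Entry<String, String>> doc2 = List.of(
                Map.entry("cat", "a"), Map.entry("cat", "b"), Map.entry("title", "dedup"));
        // ...canonicalize identically:
        System.out.println(canonicalize(doc1).equals(canonicalize(doc2))); // true

        // But swapping two values of the SAME field changes the result,
        // just as it changes how the document is indexed:
        List<Map.Entry<String, String>> doc3 = List.of(
                Map.entry("cat", "b"), Map.entry("cat", "a"), Map.entry("title", "dedup"));
        System.out.println(canonicalize(doc1).equals(canonicalize(doc3))); // false
    }
}
```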
[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12637965#action_12637965 ] Grant Ingersoll commented on SOLR-799: -- Haven't looked at the patch, but I agree that it is wise to separate the detection of duplication from the handling of found duplicates. The default can be to remove all as in the patch, but it should be easy to override. Scenarios I can see being useful: 1. Prevent new insert 2. Remove old (i.e. same as an update works now) 3. Note the duplicate on the existing document in a duplicates field. This obviously requires either deleting and re-adding the doc, or Lucene to better support appending/updating fields, maybe via the column-stride payloads (if that ever happens). No need for this anytime soon.
[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12637976#action_12637976 ] Yonik Seeley commented on SOLR-799: --- bq. I agree that it is wise to separate the detection of duplication from the handling of found duplicates Though in some implementations (like #2, which may be the default), detecting that duplicate and handling it are truly coupled... forcing a decoupling would not be a good thing in that case.
[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638009#action_12638009 ] Yonik Seeley commented on SOLR-799: --- Some thoughts... - How should different types be handled (for example when we support binary fields)? For example, different base64 encoders might use different line lengths or different line endings (CR/LF). Perhaps it's good enough to say that the string form must be identical, and leave it at that for now? The alternative would be signatures based on the Lucene Document about to be indexed. - It would be nice to be able to calculate a signature for a document w/o having to catenate all the fields together. Perhaps change calculate(String content) to something like calculate(Iterable<CharSequence> content)? An alternative option would be incremental hashing...
{code}
Signature sig = ourSignatureCreator.create();
sig.add(f1);
sig.add(f2);
sig.add(f3);
String s = sig.getSignature();
{code}
Looking at how TextProfileSignature works, I'd lean toward incremental hashing to avoid building yet another big string. Having a hashing object also opens up the possibility to easily add other method signatures for more efficient hashing. - It appears that if you put fields in a different order, the signature will change. - It appears that documents with different field names but the same content will have the same signature. - I don't understand the dedup logic in DUH2... it seems like we want to delete by id and by sig... unfortunately there is no IndexWriter.updateDocument(Term[] terms, Document doc) so we'll have to do a separate non-atomic delete on the sig for now, right? - There's probably no need for a separate test solrconfig-deduplicate.xml if all it adds is an update processor. Tests could just explicitly specify the update handler on updates.
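The incremental-hashing option Yonik sketches could look roughly like this, backed by `java.security.MessageDigest` (which already hashes incrementally via `update()`). The `add`/`getSignature` names follow his pseudocode; everything else here is an assumption, not the patch's actual `Signature` class.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of an incremental signature: fields are fed one at a time instead of
// being concatenated into one large string before hashing.
public class IncrementalSignature {
    private final MessageDigest digest;

    public IncrementalSignature() {
        try {
            digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is mandated on all JREs
        }
    }

    // Each field's bytes are streamed into the digest; no big buffer is built.
    public void add(CharSequence field) {
        digest.update(field.toString().getBytes(StandardCharsets.UTF_8));
    }

    // Finalize and hex-encode, matching the "turn byte[] into a hex string"
    // approach discussed above for pre-binary-field Solr.
    public String getSignature() {
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        IncrementalSignature sig = new IncrementalSignature();
        sig.add("field one");
        sig.add("field two");
        System.out.println(sig.getSignature()); // 32-character hex string
    }
}
```

A hashing object like this also leaves room for extra `add` overloads (e.g. taking `char[]` or byte ranges) without touching callers, which is the extensibility point Yonik mentions.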
[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638048#action_12638048 ] Mark Miller commented on SOLR-799: -- bq. I agree that it is wise to separate the detection of duplication from the handling of found duplicates bq. Though in some implementations (like #2, which may be the default), detecting that duplicate and handling it are truly coupled... forcing a decoupling would not be a good thing in that case. Still looking at this. Was hoping to avoid any of the old 'if Solr crashes you can have 2 docs with the same id in the index' type stuff. Guess I won't easily get away with that *g* Hopefully we can make it so the default implementation can still be as efficient and atomic. bq. How should different types be handled (for example when we support binary fields). For example, different base64 encoders might use different line lengths or different line endings (CR/LF). Perhaps it's good enough to say that the string form must be identical, and leave it at that for now? The alternative would be signatures based on the Lucene Document about to be indexed. Yeah, may be best to worry about it when we support binary fields... would be nice to look forward though. I think returning a byte[] rather than a String will future-proof the sig implementations a bit along those lines (though it doesn't address your point)... still mulling - this shouldn't trip up fuzzy hashing implementations too much, and so how exact should MD5Signature be... bq. * It appears that if you put fields in a different order that the signature will change bq. * It appears that documents with different field names but the same content will have the same signature. Two good points I have addressed. bq. It would be nice to be able to calculate a signature for a document w/o having to catenate all the fields together. Perhaps change calculate(String content) to something like calculate(Iterable<CharSequence> content)?
I like the idea of incremental as well. bq. I don't understand the dedup logic in DUH2... it seems like we want to delete by id and by sig... unfortunately there is no IndexWriter.updateDocument(Term[] terms, Document doc) so we'll have to do a separate non-atomic delete on the sig for now, right? Another one I was hoping to get away with. My current strategy was to say that setting an update term means that updating by id is overridden and *only* the update Term is used - effectively, the update Term (signature) becomes the update id - and you can control whether the id factors into that update signature or not. Didn't get past the goalie, I suppose *g* I guess I give up on a clean atomic impl and will perhaps investigate update(terms[], doc) for the future. I wanted to deal with both signature and id, but figured it's best to start with the most efficient bare bones and work out. bq. There's probably no need for a separate test solrconfig-deduplicate.xml if all it adds is an update processor. Tests could just explicitly specify the update handler on updates. It's mainly for me at the moment (testing config settings loading and what not), I'll be sure to pull it once the patch is done. Thanks for all of the feedback.
[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12637649#action_12637649 ] Andrzej Bialecki commented on SOLR-799: Interesting development in light of NUTCH-442 :) Some comments: * in MD5Signature I suggest using the code from org.apache.hadoop.io.MD5Hash.toString() instead of BigInteger. * TextProfileSignature should contain a remark that it's copied from Nutch, since AFAIK the algorithm that it implements is currently used only in Nutch. * in Nutch the concept of a page Signature is only a part of the deduplication process. The other part is the algorithm to decide which copy to keep and which one to discard. In your patch the latest update always removes all other documents with the same signature. IMHO this decision should be isolated into a DuplicateDeletePolicy class that gets all duplicates and can decide (based on arbitrary criteria) which one to keep, with the default implementation that simply keeps the latest document.
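The DuplicateDeletePolicy idea in Andrzej's third point could be sketched as below. The class name comes from his comment, but the method shape, the `Candidate` record, and the keep-latest default are all illustrative assumptions, not code from the patch.

```java
import java.util.Comparator;
import java.util.List;

// Sketch: given every document sharing a signature, the policy selects the one
// to keep; the caller deletes the rest.  Keep-latest is the proposed default,
// but any criterion (boost, source priority, ...) could be plugged in instead.
public class KeepLatestPolicy {
    // Minimal stand-in for a duplicate candidate: its id and update timestamp.
    record Candidate(String id, long timestamp) {}

    // Default behavior from the comment: keep the most recently updated document.
    static Candidate select(List<Candidate> duplicates) {
        return duplicates.stream()
                .max(Comparator.comparingLong(Candidate::timestamp))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Candidate> dups = List.of(
                new Candidate("doc1", 100L),
                new Candidate("doc2", 300L),
                new Candidate("doc3", 200L));
        System.out.println(select(dups).id()); // doc2 survives; the rest are deleted
    }
}
```

Isolating the decision this way keeps the signature calculation (which docs *are* duplicates) separate from the policy (which duplicate wins), matching the detection/handling split discussed earlier in the thread.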
[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12637719#action_12637719 ] Mark Miller commented on SOLR-799: -- Thanks for the review Andrzej. I've made the first two changes (I put at the top of TextProfileSignature that it's 'borrowed' from Nutch, and grabbed Hadoop's MD5Hash class and stripped its Hadoop dependencies) and I'm investigating change 3. I'll put up another patch in a couple days. - Mark