[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2010-03-24 Thread Thomas Heigl (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849091#action_12849091
 ] 

Thomas Heigl commented on SOLR-799:
---

Hello,

For my current project I need to implement an index-time mechanism to detect 
(near-)duplicate documents. The TextProfileSignature available out of the box 
(http://wiki.apache.org/solr/Deduplication) seems alright, but it does not use 
global collection statistics when deciding which terms to use for calculating 
the signature. 
Most state-of-the-art hash-based duplicate detection algorithms use this 
information to improve precision and recall (e.g. 
http://portal.acm.org/citation.cfm?id=506311&dl=GUIDE&coll=GUIDE&CFID=83187370&CFTOKEN=47052122)

Is it possible to access collection statistics - especially IDF values for all 
non-discarded terms in the current document - from within an implementation of 
the Signature class?

Kind regards,

Thomas
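Thomas's idea can be sketched outside of Solr entirely. The following is a toy
illustration, not Solr code: it assumes a global document-frequency table is
available, keeps only the highest-IDF terms of a document, and hashes those, so
near-duplicates that differ only in common words collapse to the same
signature. All class and method names here are invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

// Toy sketch of an IDF-weighted signature (not the actual Solr Signature API).
public class IdfSignature {

    // idf(t) = log(N / df(t)); terms missing from the table count as rare.
    static double idf(String term, Map<String, Integer> df, int numDocs) {
        return Math.log((double) numDocs / df.getOrDefault(term, 1));
    }

    // Keep the k rarest (highest-IDF) terms, returned in sorted order so the
    // hash is stable regardless of term order in the document.
    static List<String> topTerms(List<String> docTerms, Map<String, Integer> df,
                                 int numDocs, int k) {
        List<String> unique = new ArrayList<>(new TreeSet<>(docTerms));
        unique.sort(Comparator.comparingDouble((String t) -> -idf(t, df, numDocs)));
        List<String> top = new ArrayList<>(unique.subList(0, Math.min(k, unique.size())));
        Collections.sort(top);
        return top;
    }

    // Signature = MD5 over the selected terms.
    static String signature(List<String> docTerms, Map<String, Integer> df,
                            int numDocs, int k) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String t : topTerms(docTerms, df, numDocs, k)) {
                md5.update(t.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator so ["ab","c"] != ["a","bc"]
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always present in the JDK
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new HashMap<>();
        df.put("the", 1000); df.put("a", 900); df.put("solr", 3); df.put("dedup", 2);
        List<String> d1 = Arrays.asList("the", "solr", "dedup", "a");
        List<String> d2 = Arrays.asList("a", "dedup", "the", "solr", "the");
        // Same rare terms -> same signature, despite different common words.
        System.out.println(signature(d1, df, 1000, 2).equals(signature(d2, df, 1000, 2)));
    }
}
```

Whether the real Signature subclass can see such statistics is exactly the open
question above; as far as I can tell, in Solr 1.4 a Signature only receives its
SolrParams and the raw field text, so a custom update processor holding a
reference to the core would be one possible place to obtain a searcher.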


 Add support for hash based exact/near duplicate document handling
 -

 Key: SOLR-799
 URL: https://issues.apache.org/jira/browse/SOLR-799
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Mark Miller
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: SOLR-799.patch, SOLR-799.patch, SOLR-799.patch, 
 SOLR-799.patch


 Hash based duplicate document detection is efficient and allows for blocking 
 as well as field collapsing. Let's put it into Solr. 
 http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2010-03-24 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849225#action_12849225
 ] 

Andrzej Bialecki  commented on SOLR-799:


This issue is closed - please use the mailing lists for discussions.




[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2009-02-23 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676041#action_12676041
 ] 

Hoss Man commented on SOLR-799:
---

The separation of concerns between schema.xml and solrconfig.xml has always 
been...

 * schema.xml: what is the data, what is its nature, what are its intrinsic 
properties?
 * solrconfig.xml: what can people do with your data, how can they use it?

fields, fieldTypes, analyzers, copyFields go in the schema.xml because they are 
(in theory) intrinsic to the nature of your data regardless of where a given 
document comes from: 
 * documents should only have one author
 * categoryName should always be tokenized in a particular way
 * prices need to sort numerically, not lexicographically
 * any text indexed in the shortSummary field should also be indexed in the 
searchableAbstract field
 * etc...

request handlers that dictate how people can use the data are specified in 
solrconfig.xml -- when searching data, request handlers (which may leverage 
search components) dictate what a user is allowed to get/see; when modifying an 
index, request handlers (which may leverage update processors) dictate what data 
is allowed to come from various sources and in what formats.

In short: as far as document indexing goes, the options configured in 
solrconfig.xml specify how to build up a Document object from user input, 
while the options in schema.xml specify how to tear it down into its 
individual terms and values for indexing.

With the near duplicate detection code, it is the schema's job to say which 
fields can exist in the input documents, including a signature field --  but it 
is the solrconfig's job to decide how to compute that signature field ... after 
all: the computation might be different depending on the source of the data 
(ie: different processor chains could be configured for different request 
handlers)
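Hoss's split can be made concrete with a sketch. The element names below follow
the pattern on the Deduplication wiki page, but the chain names, fields, and
handler wiring are invented for illustration:

```xml
<!-- schema.xml: the signature field merely exists; nothing here says how it
     is computed -->
<field name="signature" type="string" indexed="true" stored="true" />

<!-- solrconfig.xml: two chains compute the signature differently depending on
     the source of the data; each request handler can default to its own chain -->
<updateRequestProcessorChain name="dedupe-web">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">title,body</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<updateRequestProcessorChain name="dedupe-feed">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">guid</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```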




[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2009-02-22 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675748#action_12675748
 ] 

Lance Norskog commented on SOLR-799:


I came into Solr with no search experience and it was quite a learning curve. 
The modular design of the configuration really helped, and we should maintain 
that modularity. There are two different designs: the design of the 
configuration and the design of the implementation. This comment only addresses 
the design of the configuration files.  

The patch as committed moves the specification of one field out of the 
schema.xml file into another file. This breaks the modularity of the 
configurations. I suggest that the files should look like this:

schema.xml:
<field name="signatureField" type="signatureField" indexed="true" 
stored="false" signature="solr.TextProfileSignature" fields="product_name, 
model_t, *_s" />

solrconfig.xml:
<updateRequestProcessorChain name="dedupe">
  <processor 
class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <string name="signatureField">signatureField</string>
    <bool name="enabled">false</bool>
    <bool name="overwriteDupes">true</bool>
  </processor>

  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

That is, the design of the signature field should go in schema.xml, and each 
updateRequest section should only describe how it is used with that section's 
declared name. Also, there should be no default field, since every field in the 
schema should be described in schema.xml. 






[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2009-02-22 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675770#action_12675770
 ] 

Shalin Shekhar Mangar commented on SOLR-799:


bq. <field name="signatureField" type="signatureField" indexed="true" 
stored="false" signature="solr.TextProfileSignature" fields="product_name, 
model_t, *_s" />

I don't think signatureField is a separate type. It is just a string, right?

bq. The patch as committed moves the specification of one field out of 
schema.xml file to another file.

bq. That is, the design of the signature field should go in schema.xml, and 
each updateRequest section should only describe how it is used with that 
section's declared name. Also, there should be no default field, since every 
field in the schema should be described in schema.xml.

The design of the signature field goes into schema.xml right now too. The wiki 
clearly states the following about signatureField:
{code}
The name of the field used to hold the fingerprint/signature. Be sure the field 
is defined in schema.xml. 
{code}

bq. <field name="signatureField" type="signatureField" indexed="true" 
stored="false" signature="solr.TextProfileSignature" fields="product_name, 
model_t, *_s" />

I don't agree with the above. The method of computing the contents of the field 
should not be part of schema.xml. I do not understand your concern, maybe 
because I'm not very familiar with this feature.




[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-12-04 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653484#action_12653484
 ] 

Yonik Seeley commented on SOLR-799:
---

Why not plug in an entirely new chain?  That is one of the ways it would be done 
for users of this component, right?

  <updateRequestProcessorChain name="hash"> [...]

And then in the test send in update.processor=hash as a parameter.
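Filled out, that suggestion might look like the following in the test
solrconfig; the chain body here is a guess assembled from the factory options
mentioned elsewhere in this thread:

```xml
<updateRequestProcessorChain name="hash">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The test then selects the chain per request via the update.processor=hash
parameter rather than needing a separate solrconfig-deduplicate.xml.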




[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-12-03 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652815#action_12652815
 ] 

Mark Miller commented on SOLR-799:
--

I'm going to put up another patch for this soon. I'd like to have some getters 
on the factory for manual exploration of the settings.




[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-12-03 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652836#action_12652836
 ] 

Yonik Seeley commented on SOLR-799:
---

bq. Now that I look to fix this, I am not understanding - I don't need to 
change the update handler, I need to change the update chain...I am not seeing 
how that can be done dynamically...is it possible?

Yes, you can dynamically change an update processor: update.processor=hash




[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-12-03 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652837#action_12652837
 ] 

Mark Miller commented on SOLR-799:
--

Okay, I see. I was too intent on changing the current chain - problem indeed 
goes away by just plugging in an entirely new chain.




[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-12-03 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652923#action_12652923
 ] 

Ryan McKinley commented on SOLR-799:


I'm not sure how you have the test set up, so I could be way off base.

You could subclass SearchHandler and set the protected List<SearchComponent> 
components directly...






[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-11-21 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649826#action_12649826
 ] 

Mark Miller commented on SOLR-799:
--

bq. There's probably no need for a separate test solrconfig-deduplicate.xml if 
all it adds is an update processor. Tests could just explicitly specify the 
update handler on updates.

Now that I look to fix this, I am not understanding - I don't need to change 
the update handler, I need to change the update chain...I am not seeing how 
that can be done dynamically...is it possible? If not I think I need the config 
xml.




[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-11-04 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645061#action_12645061
 ] 

Yonik Seeley commented on SOLR-799:
---

bq. Maybe we just do overwrite dupe for now?

+1, as long as we don't do anything to preclude the other stuff - we just need 
to leave room in the config XML and the update API such that we don't have to 
break the back compatibility of this patch if/when future features are 
implemented.

bq. Another point that was brought up is whether or not to delete any docs that 
match the update doc's uniqueField id term, but not its similarity/update term.  
You are choosing to use the updateTerm to do updates rather than the unique 
term.

It seems like uniqueField should normally enforce uniqueness, regardless of 
what this component does.  If one wants duplicate ids, then it seems like a 
different field should be used for that (other than the uniqueKey field).  If 
one wants to delete *only* on the hash field, then they can make the hash field 
the id field.





[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-11-04 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645073#action_12645073
 ] 

Mark Miller commented on SOLR-799:
--

Ok. I can't muster up much of a defense for leaving it out, I suppose.

I'll polish off a final patch.

- Mark








[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642245#action_12642245
 ] 

Mark Miller commented on SOLR-799:
--


I find the pluggable delete policy idea appealing, but I have not yet found a 
great way to plug it into the UpdateHandler. Any approach other than 
sub-classing DirectUpdateHandler2 appears to lead to tying an IndexWriter to 
UpdateHandler. There is a connection now (UpdateHandler has a method to create 
a main IndexWriter), but further tying seems wrong without a stronger reason. 
That point is arguable, but in the end, sub-classing results in simpler code in 
any case. The trade-off is that now you have a PreventDupesDirectUpdateHandler 
that extends DirectUpdateHandler2. This would have to be used in combination 
with the SignatureUpdateProcessor if you want to prevent dupes from entering 
the index. Other use cases (other than overwriting) would require yet another 
UpdateHandler. Less than ideal in both cases (subclassing, pluggable 
interface/class).

Both approaches lead to less than ideal solutions beyond that as well. Because 
many docs that have been added to Solr might not yet be visible to an 
IndexReader, you have to keep a pending-commit set of docs to check against. 
This set should be resilient against AddDoc, DelDocByQuery, AddDoc, Commit. 
You'd essentially have to keep a mini index around to search against to 
accomplish this, due to delete by query. The other options are to either 
auto-commit sans a user commit before a delete, or just say we don't support 
that use case when using that UpdateHandler. None of it is very pretty.

Another option is to do things with an UpdateProcessor. This is the most 
elegant solution really, but it requires putting big, coarse syncs around the 
more precise syncs in DirectUpdateHandler2. That may not be a huge deal, I am 
not sure. The previous two options allow you to maintain syncs similar to what 
is already there. Beyond that, the UpdateProcessor approach still has the 
delete-by-query issues.

Maybe we just do overwrite dupe for now? It has none of these issues. I am open 
to whatever path you guys want. The other use cases do have their place - we 
will just have to compromise some to get there. Or maybe there are other 
suggestions?

Another point that was brought up is whether or not to delete any docs that 
match the update doc's uniqueField id term, but not its similarity/update term. 
At the current moment, IMO, we shouldn't. You are choosing to use the 
updateTerm to do updates rather than the unique term. This allows you to have 
duplicate signatures but still unique uniqueField ids for other operations (say 
delete). Also, if you already have a unique field that you're using, it may be 
desirable to do dupe detection with a different field. There is always the 
option of setting the signature field to the uniqueField term. In the end, it's 
your call; I'll add it if we want it.

As far as search-time dupe collapsing goes, I think I could see a search 
component that takes the page numbers to collapse (start, end) and does dupe 
elimination on that range at query time. It wouldn't be very fast, and I'm not 
sure how useful page-at-a-time collapsing is, but it would be fairly easy to 
do. Not sure that it fits into this issue, but it could certainly share some of 
its classes.






[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-14 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639470#action_12639470
 ] 

Yonik Seeley commented on SOLR-799:
---

Overwriting is implemented and supported in Lucene now (and we gain a number 
of benefits from using that). Conditionally adding a document, or testing if a 
document already exists, is not supported.

Since we can't currently determine if something is a duplicate, it seems like 
this issue should go ahead with just a single option: whether to remove older 
documents with the same signature or not.
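The single option described above reduces to delete-then-add keyed on the
signature term (Lucene's updateDocument(Term, Document) semantics). A toy model
of that behavior, in plain Java rather than Lucene, with invented names:

```java
import java.util.*;

// Toy model of "remove older documents with the same signature": the last
// document written for a given signature term replaces any earlier one.
public class OverwriteDupesIndex {
    private final Map<String, String> bySignature = new LinkedHashMap<>();

    // Mirrors IndexWriter.updateDocument(signatureTerm, doc): delete-then-add.
    public void updateDocument(String signature, String doc) {
        bySignature.remove(signature);   // drop the older duplicate, if any
        bySignature.put(signature, doc); // append the new document
    }

    public int numDocs() { return bySignature.size(); }
    public String get(String signature) { return bySignature.get(signature); }

    public static void main(String[] args) {
        OverwriteDupesIndex idx = new OverwriteDupesIndex();
        idx.updateDocument("sig-a", "doc1");
        idx.updateDocument("sig-b", "doc2");
        idx.updateDocument("sig-a", "doc3"); // near-duplicate of doc1
        System.out.println(idx.numDocs() + " docs; sig-a -> " + idx.get("sig-a"));
    }
}
```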






[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-14 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639479#action_12639479
 ] 

Otis Gospodnetic commented on SOLR-799:
---

Thanks Yonik.  Good thing I asked for the clarification, since Mark's issue 
description does mention search-time stuff (field collapsing).

Mark: Do you still plan on tackling search-time 
duplicate/near-duplicate/similar doc detection?  In a separate issue?  Thanks.





[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-14 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639456#action_12639456
 ] 

Otis Gospodnetic commented on SOLR-799:
---

Haven't looked at the patch yet.
Have looked at the Deduplication wiki page (and realize the stuff I'll write 
below is briefly mentioned there).
Have skimmed the above comments.

I want to bring up the use case that seems to have been mentioned already, but 
only in passing.  The focus of the previous comments seems to be on index-time 
duplicate detection.  Another huge use case is search-time near-duplicate 
detection.  Sometimes it's about straightforward field collapsing (collapsing 
adjacent docs with identical values in some field), but sometimes it's more 
complicated.

For example, sometimes multiple fields need to be compared.  Sometimes they 
have to be identical for collapsing to happen.  Sometimes they only need to be 
similar.  How similarity is calculated is very application-dependent.  I 
believe this similarity computation has to be completely 
open/extensible/overridable, allowing one to write a custom search component, 
examine returned hits, and compare them using app-specific similarity.

Ideally one would have the option not to save the document/field at index-time 
(for examination at search-time), since that prevents one from experimenting 
and dynamically changing the similarity computation.

Here is one example.  Imagine a field called id that can have 1 or more 
tokens in it, and imagine docs with the following ids get returned:

1) id:aaa
2) id:bbb
3) id:ccc ddd
4) id:aaa bbb
5) id:eee ddd
6) id:aaa

A custom similarity may look at all of the above (e.g. a page's worth of hits) 
and decide that:
1) and 4) are similar
2) and 4) are also similar
3) and 5) are similar
1) and 4) and 6) are similar

Another custom similarity may say that only 1) and 6) are similar because they 
are identical.

My point is really that we have to leave it up to the application to provide 
the similarity implementation, just like we make it possible for the app to 
provide a custom Lucene Similarity.

Is the goal of this issue to make this possible?
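One way to read the request above is "collapse a page of hits under a
caller-supplied predicate." Here is a toy sketch of that, in plain Java rather
than a real Solr SearchComponent, with invented names: swapping the predicate
changes the grouping exactly as in the id examples above.

```java
import java.util.*;
import java.util.function.BiPredicate;

// Toy page-at-a-time collapser: group hits transitively under a pluggable
// similarity predicate over their "id" token sets.
public class PageCollapser {

    public static List<List<Integer>> group(List<Set<String>> ids,
                                            BiPredicate<Set<String>, Set<String>> similar) {
        int n = ids.size();
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        // union-find over pairwise-similar hits (n is only a page of results)
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (similar.test(ids.get(i), ids.get(j)))
                    parent[find(parent, i)] = find(parent, j);
        Map<Integer, List<Integer>> groups = new TreeMap<>();
        for (int i = 0; i < n; i++)
            groups.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(i);
        return new ArrayList<>(groups.values());
    }

    static int find(int[] p, int i) { while (p[i] != i) i = p[i] = p[p[i]]; return i; }

    // One app's similarity: any shared token makes two hits duplicates.
    public static boolean overlap(Set<String> a, Set<String> b) {
        for (String t : a) if (b.contains(t)) return true;
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> ids = new ArrayList<>();
        for (String s : new String[]{"aaa", "bbb", "ccc ddd", "aaa bbb", "eee ddd", "aaa"})
            ids.add(new HashSet<>(Arrays.asList(s.split(" "))));
        // Loose similarity (token overlap) vs. strict similarity (identical sets)
        System.out.println(group(ids, PageCollapser::overlap));
        System.out.println(group(ids, Set::equals));
    }
}
```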





[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-13 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639304#action_12639304
 ] 

Hoss Man commented on SOLR-799:
---

If we assume for a minute that users who want to prevent or overwrite 
duplicates using a signature should always use the signature field as their 
uniqueKey, then doesn't use case #1 simplify to just running a 
SignatureUpdateProcessor and then another processor that forces 
allowDups=false, overwritePending=false, overwriteCommitted=false?

Conceptually that seems right ... but at the moment DIH2 doesn't seem to care 
about allowDups at all (it only looks at overwriteCommitted and 
overwritePending to decide if it's allowing duplicates), and I'm not sure how 
to make it work off the top of my head. But assuming we need to muck with DIH2 
internals in some way to make signatures (and aborting if the signature already 
exists) work, implementing the changes for those combinations of existing 
options seems like the cleanest approach: the functional changes to DIH2 become 
generally useful to anyone who doesn't want to overwrite existing docs with the 
same id, regardless of whether they are computing a signature.

The only hangup is whether we're okay with the initial assumption: that users 
who want duplicate detection by signature are willing to use the signature as 
the uniqueKey.  If not, then perhaps the cleanest way to support that would be 
to add more generalized unique field support ... a list of field names in the 
schema.xml and a (hopefully) simple writer.deleteDocuments(Term[]) call in 
DIH2 should do the trick, right?  ... This could also be potentially useful to 
people for other purposes besides signatures, but I haven't thought through all 
the permutations, so I'm sure there would be funky corner cases.
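The "generalized unique field" idea above can be modeled in miniature. This is
plain Java with invented names, an assumption rather than actual update-handler
code: before each add, delete every existing doc matching any of the new doc's
values in the configured unique fields, i.e. the deleteDocuments(Term[])
semantics.

```java
import java.util.*;

// Toy model of multi-field uniqueness: an add first deletes any existing doc
// that shares a value with the new doc in ANY configured unique field.
public class MultiUniqueIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();
    private final List<String> uniqueFields;

    public MultiUniqueIndex(List<String> uniqueFields) {
        this.uniqueFields = uniqueFields;
    }

    public void add(Map<String, String> doc) {
        // deleteDocuments(Term[]): one term per unique field of the new doc
        docs.removeIf(existing -> {
            for (String f : uniqueFields) {
                String v = doc.get(f);
                if (v != null && v.equals(existing.get(f))) return true;
            }
            return false;
        });
        docs.add(doc);
    }

    public int numDocs() { return docs.size(); }

    public static void main(String[] args) {
        MultiUniqueIndex idx = new MultiUniqueIndex(Arrays.asList("id", "signature"));
        idx.add(new HashMap<>(Map.of("id", "1", "signature", "s1")));
        idx.add(new HashMap<>(Map.of("id", "2", "signature", "s2")));
        idx.add(new HashMap<>(Map.of("id", "3", "signature", "s1"))); // dupe signature
        System.out.println(idx.numDocs()); // the doc sharing a signature was replaced
    }
}
```

The funky corner cases show up quickly: a doc can now displace several existing
docs at once (one per unique field), which is exactly the kind of permutation
worth thinking through.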






[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638850#action_12638850
 ] 

Mark Miller commented on SOLR-799:
--

bq. 1. Prevent new insert - SignatureUpdateProcessor generates a signature and 
adds it as a field; AbortIfExistingUpdateProcessor aborts the update if a doc 
exists with a specific field in common with the doc to be added.

I like the idea of using UpdateProcessors for all of this. It's very clean 
compared to hacking around the DirectUpdateHandler. Unfortunately, I think 
AbortIfExistingUpdateProcessor would require locks that are too coarse. 
Ideally, you want to be able to inject code into DirectUpdateHandler's 3 
levels of locking (iw, sync(this), none). That's what's needed for efficiency, but 
the cleanness gets whacked - it's ugly to get that done, and doesn't really mesh 
with the UpdateHandler API that's been defined. The linking of 
DirectUpdateHandler2's addDoc implementation to the whole idea... there would 
have to be changes that just don't seem worth the added functionality.

Which leaves just hardcoding the support into DirectUpdateHandler, kind of like 
was done before for deletes/id dupes, and then just giving options on the add doc 
cmd. Again, I don't like it. But anything else quickly breaks down for me. 
Any suggestions, insights?




[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-09 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638217#action_12638217
 ] 

Andrzej Bialecki  commented on SOLR-799:


+1 on the incremental sig calculation.

Re: different types of signatures. Our experience in Nutch is that signature 
type is rarely changed, and we assume that this setting is selected once per 
lifetime of an index, i.e. there are never any mixed cases of documents with 
incompatible signatures. If we want to be sure that they are comparable, we 
could prepend a byte or two of unique signature type id - this way, even if a 
signature value matches but was calculated using another impl., the documents 
won't be considered duplicates, which is the way it should work, because 
different signature algorithms are incomparable.

Re: signature as byte[] - I think it's better if we return byte[] from 
Signature, and until we support binary fields we just turn this into a hex 
string.

Re: field ordering in DeduplicateUpdateProcessorFactory: I think that both 
sigFields (if defined) and any other document fields (if sigFields is 
undefined) should be first ordered in a predictable way (lexicographic?). 
Current patch uses a HashSet which doesn't guarantee any particular ordering - 
in fact the ordering may be different if you run the same code under different 
JVMs, which may introduce a random factor to the sig. calculation.
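Both suggestions can be sketched together: a predictable (lexicographic) field ordering instead of HashSet iteration, plus the type-id prefix and byte[]-to-hex conversion from the comments above. This is a hypothetical illustration, not the patch's actual Signature API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

// Sketch only (not the patch's API): order fields predictably before
// hashing, prepend a signature-type id byte so values produced by different
// algorithms can never collide, and hex-encode the byte[] until binary
// fields are supported.
class DeterministicSignature {
    static final byte MD5_TYPE_ID = 0x01; // hypothetical id for MD5Signature

    static String compute(Map<String, String> fields) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        // TreeMap iterates in lexicographic key order, regardless of the
        // (possibly JVM-dependent) iteration order of the input map
        for (Map.Entry<String, String> e : new TreeMap<>(fields).entrySet()) {
            md5.update(e.getKey().getBytes(StandardCharsets.UTF_8));
            md5.update(e.getValue().getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder hex = new StringBuilder(String.format("%02x", MD5_TYPE_ID));
        for (byte b : md5.digest()) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> doc = new TreeMap<>();
        doc.put("title", "hello");
        doc.put("body", "world");
        // same fields always yield the same hex string, whatever their order
        System.out.println(compute(doc));
    }
}
```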




[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-09 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638426#action_12638426
 ] 

Hoss Man commented on SOLR-799:
---

(disclaimer: haven't looked at the patch)

bq. Though in some implementations (like #2, which may be the default), 
detecting that duplicate and handling it are truly coupled... forcing a 
decoupling would not be a good thing in that case.

I don't follow your reasoning.  All the use cases I've seen mentioned seem like 
they could/would decouple very nicely...

1. Prevent new insert -- SignatureUpdateProcessor generates a signature and 
adds it as a field; AbortIfExistingUpdateProcessor aborts the update if a doc 
exists with a specific field in common with the doc to be added.
2. Remove old (i.e. same as an update works now) -- SignatureUpdateProcessor as 
mentioned before, and signature field is used as the uniqueKey field.
3. Note the duplicate on the existing document in a duplicates field -- 
SignatureUpdateProcessor as mentioned before; AnnotateDuplicatesProcessor 
checks for existing docs with a specific field in common with the doc to be 
added and executes additional operations to update those docs, as well as 
the doc to be added.





[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-09 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638427#action_12638427
 ] 

Hoss Man commented on SOLR-799:
---

some misc comments from a user perspective based on the current state of the 
wiki...

1) rather than a comma separated str of fields, we should just use an arr

2) we should consider if/how we want to support using dynamicFields (ie: field 
name globs) in listing fields that are included in the signature

3) "By default, all non null fields on the document will be used." ... there's 
no such thing as a null field -- there are fields that have no value, and there 
are fields whose value is an empty string, but no null value.

4) yonik already asked other questions I had based on the wiki: how the order 
of fields in the update command affects the signature that gets computed -- 
both in terms of fields with different names, and fields with the same name.  
The fields should probably be stable sorted by field name, so that the order of 
fields with the same name affects the signature, but the relative order of 
fields with different names doesn't (since the order of fields with the same 
name actually affects the way the document is indexed, but the order of 
different field names does not)
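Point 4 can be sketched as follows (a standalone illustration with hypothetical names, not the patch's code). Java's List.sort is stable (TimSort), so values sharing a field name keep their document order, while the relative order of differently named fields stops mattering:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: stable-sort the (name, value) pairs by field name
// before hashing, so only the order of values within one multi-valued
// field influences the signature.
class StableFieldSignature {
    static String compute(List<String[]> fields) throws Exception {
        List<String[]> sorted = new ArrayList<>(fields);
        // stable sort: same-name pairs keep their original relative order
        sorted.sort(Comparator.comparing((String[] f) -> f[0]));
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (String[] f : sorted) {
            md5.update(f[0].getBytes(StandardCharsets.UTF_8));
            md5.update(f[1].getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        List<String[]> doc = new ArrayList<>();
        doc.add(new String[]{"cat", "a"});
        doc.add(new String[]{"author", "x"});
        doc.add(new String[]{"cat", "b"}); // second value of multi-valued "cat"
        System.out.println(compute(doc));
    }
}
```

Reordering "author" relative to "cat" leaves the signature unchanged; swapping the two "cat" values changes it.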




[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-08 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12637965#action_12637965
 ] 

Grant Ingersoll commented on SOLR-799:
--

Haven't looked at the patch, but I agree that it is wise to separate the 
detection of duplication from the handling of found duplicates.  The default 
can be to remove all as in the patch, but it should be easy to override.  
Scenarios I can see being useful:
1. Prevent new insert
2. Remove old (i.e. same as an update works now)
3.  Note the duplicate on the existing document in a duplicates field.  This 
obviously requires either deleting and re-adding the doc, or Lucene to better 
support appending/updating fields, maybe via the column-stride payloads (if 
that ever happens).  No need for this anytime soon.





[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12637976#action_12637976
 ] 

Yonik Seeley commented on SOLR-799:
---

bq. I agree that it is wise to separate the detection of duplication from the 
handling of found duplicates

Though in some implementations (like #2, which may be the default), detecting 
that duplicate and handling it are truly coupled... forcing a decoupling would 
not be a good thing in that case.





[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638009#action_12638009
 ] 

Yonik Seeley commented on SOLR-799:
---

Some thoughts...

- How should different types be handled (for example when we support binary 
fields).  For example, different base64 encoders might use different line 
lengths or different line endings (CR/LF).  Perhaps it's good enough to say 
that the string form must be identical, and leave it at that for now?  The 
alternative would be signatures based on the Lucene Document about to be 
indexed.

- It would be nice to be able to calculate a signature for a document w/o 
having to catenate all the fields together.
Perhaps change calculate(String content) to something like 
calculate(Iterable<CharSequence> content)?

An alternative option would be incremental hashing...
{code}
Signature sig = ourSignatureCreator.create();
sig.add(f1);
sig.add(f2);
sig.add(f3);
String s = sig.getSignature();
{code}

Looking at how TextProfileSignature works, I'd lean toward incremental hashing 
to avoid building yet another big string. Having a hashing object also opens up 
the possibility to easily add other method signatures for more efficient 
hashing.

- It appears that if you put fields in a different order, the signature 
will change

- It appears that documents with different field names but the same content 
will have the same signature.

- I don't understand the dedup logic in DUH2... it seems like we want to delete 
by id and by sig... unfortunately there is no 
  IndexWriter.updateDocument(Term[] terms, Document doc) so we'll have to do a 
separate non-atomic delete on the sig for now, right?

- There's probably no need for a separate test solrconfig-deduplicate.xml if 
all it adds is an update processor.  Tests could just explicitly specify the 
update handler on updates.
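The incremental-hashing shape proposed above can be sketched with the JDK's MessageDigest (a hypothetical class, not the patch's actual Signature API):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Minimal sketch of the incremental API: fields are fed one at a time and
// hashed as they arrive, so no big concatenated string is ever built.
class IncrementalSignature {
    private final MessageDigest digest;

    static IncrementalSignature create() throws Exception {
        return new IncrementalSignature(MessageDigest.getInstance("MD5"));
    }

    private IncrementalSignature(MessageDigest d) {
        this.digest = d;
    }

    void add(CharSequence field) {
        digest.update(field.toString().getBytes(StandardCharsets.UTF_8));
    }

    String getSignature() {
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        IncrementalSignature sig = create();
        sig.add("field one");
        sig.add("field two");
        System.out.println(sig.getSignature());
    }
}
```

Note that plain incremental updates are equivalent to hashing the concatenation, so a real implementation would still want a field separator or length prefix if boundaries should matter.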





[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638048#action_12638048
 ] 

Mark Miller commented on SOLR-799:
--

bq.I agree that it is wise to separate the detection of duplication from 
the handling of found duplicates

bq. Though in some implementations (like #2, which may be the default), 
detecting that duplicate and handling it are truly coupled... forcing a 
decoupling would not be a good thing in that case.

Still looking at this. Was hoping to avoid any of the old 'if solr crashes you 
can have 2 docs with same id in the index' type stuff. Guess I won't easily get 
away with that <g>. Hopefully we can make it so the default implementation can 
still be as efficient and atomic.

bq. How should different types be handled (for example when we support binary 
fields). For example, different base64 encoders might use different line 
lengths or different line endings (CR/LF). Perhaps it's good enough to say that 
the string form must be identical, and leave it at that for now? The 
alternative would be signatures based on the Lucene Document about to be 
indexed.

Yeah, may be best to worry about it when we support binary fields... would be 
nice to look forward though. I think returning a byte[] rather than a String 
will future proof the sig implementations a bit along those lines (though it 
doesn't address your point)... still mulling - this shouldn't trip up fuzzy 
hashing implementations too much, and so how exact should MD5Signature be...

bq. *  It appears that if you put fields in a different order that the 
signature will change
bq. * It appears that documents with different field names but the same 
content will have the same signature.

Two good points I have addressed.

bq. It would be nice to be able to calculate a signature for a document w/o 
having to catenate all the fields together.
Perhaps change calculate(String content) to something like 
calculate(Iterable<CharSequence> content)?

I like the idea of incremental as well.

bq. I don't understand the dedup logic in DUH2... it seems like we want to 
delete by id and by sig... unfortunately there is no
IndexWriter.updateDocument(Term[] terms, Document doc) so we'll have to do a 
separate non-atomic delete on the sig for now, right?

Another one I was hoping to get away with. My current strategy was to say that 
setting an update term means that updating by id is overridden and *only* the 
update Term is used - effectively, the update Term (signature) becomes the 
update id - and you can control whether the id factors into that update 
signature or not.  Didn't get past the goalie I suppose <g>. I guess I give up 
on a clean atomic impl and perhaps investigate update(terms[], doc) for the 
future. I wanted to deal with both signature and id, but figured it's best to 
start with the most efficient bare bones and work out.

bq. There's probably no need for a separate test solrconfig-deduplicate.xml if 
all it adds is an update processor. Tests could just explicitly specify the 
update handler on updates.

It's mainly for me at the moment (testing config settings loading and what not); 
I'll be sure to pull it once the patch is done.

Thanks for all of the feedback.





[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-07 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12637649#action_12637649
 ] 

Andrzej Bialecki  commented on SOLR-799:


Interesting development in light of NUTCH-442 :) Some comments:

* in MD5Signature I suggest using the code from 
org.apache.hadoop.io.MD5Hash.toString() instead of BigInteger.

* TextProfileSignature should contain a remark that it's copied from Nutch, 
since AFAIK the algorithm that it implements is currently used only in Nutch.

* in Nutch the concept of a page Signature is only a part of the deduplication 
process. The other part is the algorithm to decide which copy to keep and which 
one to discard. In your patch the latest update always removes all other 
documents with the same signature. IMHO this decision should be isolated into a 
DuplicateDeletePolicy class that gets all duplicates and can decide (based on 
arbitrary criteria) which one to keep, with the default implementation that 
simply keeps the latest document.
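The proposed isolation might look like the following (all names are hypothetical; no such class exists in the patch). The policy sees every duplicate and picks the single copy to keep, with keep-latest as the default:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the DuplicateDeletePolicy idea: given all docs
// sharing a signature, decide which one survives.
interface DuplicateDeletePolicy {
    Doc choose(List<Doc> duplicates);
}

class Doc {
    final String id;
    final long updatedAt; // e.g. an update timestamp or index generation

    Doc(String id, long updatedAt) {
        this.id = id;
        this.updatedAt = updatedAt;
    }
}

// Default policy: keep the most recently updated document, matching the
// current patch's behavior of letting the latest update win.
class KeepLatestPolicy implements DuplicateDeletePolicy {
    public Doc choose(List<Doc> duplicates) {
        return duplicates.stream()
                .max(Comparator.comparingLong(d -> d.updatedAt))
                .orElseThrow(IllegalArgumentException::new);
    }

    public static void main(String[] args) {
        List<Doc> dups = new ArrayList<>();
        dups.add(new Doc("old", 1L));
        dups.add(new Doc("new", 2L));
        System.out.println(new KeepLatestPolicy().choose(dups).id); // prints "new"
    }
}
```

Other policies (keep shortest URL, keep highest boost, etc.) would then just be alternative implementations of the same interface.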




[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-07 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12637719#action_12637719
 ] 

Mark Miller commented on SOLR-799:
--

Thanks for the review Andrzej. I've made the first two changes (I put a note at 
the top of TextProfileSignature that it's 'borrowed' from Nutch, and grabbed 
Hadoop's MD5Hash class and stripped its Hadoop dependencies) and I'm 
investigating change 3. I'll put up another patch in a couple days.

- Mark
