I'd probably treat this as a deduplication problem and use a fuzzy matching
approach, such as the TextProfileSignature in Solr/Nutch
(http://wiki.apache.org/solr/Deduplication), which I believe has a tunable
acceptance threshold.
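
To make the fuzzy signature idea concrete, here is a rough Python sketch in the
spirit of a text-profile signature. It is not the Solr/Nutch code: the function
name, the quant_rate and min_token_len parameters, and the exact quantization
rule are my assumptions for illustration. The idea is to count terms, round the
counts down to a coarse step, drop the rare terms, and hash what is left, so
that small edits leave the signature unchanged.

```python
import hashlib
import re
from collections import Counter


def profile_signature(text, quant_rate=0.01, min_token_len=2):
    """Hash a coarse term-frequency profile of the text (illustrative only)."""
    # Lowercase, split on non-alphanumerics, and drop very short tokens.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step (assumed rule): a fraction of the highest frequency,
    # but at least 2 whenever any term repeats, so one-off terms fall away.
    max_freq = max(counts.values())
    quant = max(2, round(max_freq * quant_rate)) if max_freq > 1 else 1

    # Round each frequency down to a multiple of the step; drop zeros.
    profile = []
    for token, freq in counts.items():
        quantized = (freq // quant) * quant
        if quantized > 0:
            profile.append((token, quantized))

    # Sort deterministically: most frequent first, ties broken by token.
    profile.sort(key=lambda item: (-item[1], item[0]))
    serialized = "\n".join(f"{token} {freq}" for token, freq in profile)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()
```

In this sketch, two documents that differ only in a word that occurs once get
the same signature, while a raising quant_rate makes the match looser. Treat
the threshold behavior here as a property of the sketch, not of Solr.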

I'd also push back on the notion of 50% and ask for a bit more clarification.
Does it mean 50% of all words (pre- or post-analysis? stemmed or not?) or 50%
of "important words" (which is more or less what More Like This will do)? You
might also look into the academic literature here, as a fair amount of work
has gone into this area, for example on detecting plagiarism. Finally, one
might instead treat this as a classification problem and train a model to
decide whether a given pair of documents is a dupe or not.
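
As a rough illustration of why the definition of "50%" matters, here is a
minimal Python sketch that measures term overlap between two documents. The
function names, the tiny stopword list, and the choice of containment and
Jaccard as the measures are all my assumptions; nothing here is a Lucene API.

```python
import re

# Tiny illustrative stopword list; a real analyzer would use a fuller one.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "with"})


def terms(text, drop_stopwords=False):
    """Return the set of distinct lowercase terms in the text."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return tokens - STOPWORDS if drop_stopwords else tokens


def containment(new_doc, existing_doc, drop_stopwords=False):
    """Fraction of the new document's distinct terms found in the existing one."""
    a = terms(new_doc, drop_stopwords)
    b = terms(existing_doc, drop_stopwords)
    return len(a & b) / len(a) if a else 0.0


def jaccard(new_doc, existing_doc, drop_stopwords=False):
    """Shared distinct terms divided by all distinct terms in either document."""
    a = terms(new_doc, drop_stopwords)
    b = terms(existing_doc, drop_stopwords)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

For "the cat sat in the hat" against "the dog sat in the car", containment is
0.6 when every word counts but drops to about 0.33 once the stopwords are
removed, so the same pair lands on either side of a 50% cutoff depending on
which definition the business actually means.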


On Aug 30, 2011, at 12:55 PM, Saurabh Gokhale wrote:

> Hi All,
> 
> I need your help to understand how I can have Lucene applied to the
> following business scenario. Question is in RED
> 
> *Business Scenario:*
> Analyze newly created document "A" with existing documents in the system and
> if document A matches more than (similar to) 50% with any of the existing
> documents, perform specific action.
> 
> *Possible Lucene Implementation:*
> Requirement: Analyze newly created document A
> Action: Read name and the contents of the document A
> 
> Requirement: Analyze new document with existing documents in the system
> Action: 1. Pre-index all the existing documents and create a Lucene index. 2.
> Use a class like MoreLikeThis to find similar documents for the newly created
> document.
> 
> Requirement: If match is above 50%, perform specific action
> Action: Since resulting lucene score for the match can not be directly
> converted into a percentage match (as the score value changes based on many
> factors) how can this requirement be satisfied?
> 
> Thanks
> 
> Saurabh

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
