I'd probably treat this as a deduplication problem and use a fuzzy matching approach, such as the TextProfileSignature in Solr/Nutch: http://wiki.apache.org/solr/Deduplication, which I believe has a tunable acceptance threshold.
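To make the fuzzy-matching idea concrete, here is a minimal Python sketch. It is not what TextProfileSignature does internally; it swaps in a simpler, well-known technique (Jaccard similarity over word shingles) purely to show how a tunable threshold might work. The function names, the shingle size k=3, and the reading of "50%" as a Jaccard score of 0.5 are all assumptions made for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_near_duplicates(new_doc, existing_docs, threshold=0.5, k=3):
    """Return (doc_id, score) pairs scoring at or above threshold, best first.

    existing_docs is a hypothetical dict mapping document id to text.
    threshold=0.5 is only one possible interpretation of "matches 50%".
    """
    new_sh = shingles(new_doc, k)
    hits = []
    for doc_id, text in existing_docs.items():
        score = jaccard(new_sh, shingles(text, k))
        if score >= threshold:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Unlike a Lucene score, this value always falls between 0 and 1, so a cutoff is at least well defined. It compares the new document against every existing one, though, so for a large collection you would still want an index (or a MoreLikeThis query) to narrow the candidates first.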
I'd also push back on the notion of 50% and ask for a bit more clarification. Does it mean 50% of all words (pre- or post-analysis? Stemming or not?) or 50% of "important words" (which is more or less what More Like This will do)?

You might also look into the academic literature here, as a fair amount of work has gone into this area, for example on detecting plagiarism.

Finally, one might instead treat this as a classification problem and train a model to decide whether a document is a dupe or not.

On Aug 30, 2011, at 12:55 PM, Saurabh Gokhale wrote:

> Hi All,
>
> I need your help to understand how I can have Lucene applied to the
> following business scenario. Question is in RED
>
> *Business Scenario:*
> Analyze newly created document "A" with existing documents in the system and
> if document A matches more than (similar to) 50% with any of the existing
> documents, perform specific action.
>
> *Possible Lucene Implementation:*
> Requirement: Analyze newly created document A
> Action: Read name and the contents of the document A
>
> Requirement: Analyze new document with existing documents in the system
> Action: 1. Pre Index all the existing document and create lucene index. 2.
> Use class like MoreLikeThis to find similar documents for newly created
> document.
>
> Requirement: If match is above 50%, perform specific action
> Action: Since resulting lucene score for the match can not be directly
> converted into a percentage match (as the score value changes based on many
> factors) how can this requirement be satisfied?
>
> Thanks
>
> Saurabh

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com