Re: Identifying common text in documents

Lance Norskog Sat, 24 Dec 2011 14:21:21 -0800

Great topic!

1) SignatureUpdateProcessor creates a hash of the exact byte stream of
the document. Often your crawling software can't do an incremental
update of your data, but can only re-index the entire corpus. The SUP
makes the hash, searches for it, and if it is there the document
indexer says "all done, give me the next document" without doing
anything.

2) TextProfileSignature does roughly the same, but operates on an
analyzed version of the document. I'm not sure what inspired it, but
here is a wild guess: if you change some formatting in an HTML page
and re-index it, TextProfileSignature only sees the text, so it will
ignore the formatting change and the hashes will still match. (Maybe.)

3) The Mahout project includes a batch process: it takes all of your
documents, cuts them up into pieces in the same way that
TextProfileSignature does, and then compares all of them to each
other. It uses Bayes' theorem to score the distances
probabilistically. This can be run on many machines simultaneously via
Hadoop. I don't know if it has been run on Wikipedia, but it should
work.
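
To make the difference between 1) and 2) concrete, here is a minimal
Python sketch. It is not Solr's code: the function names, the regex
tag stripping, the token-count "profile" and the choice of MD5 are all
assumptions made for illustration, and the real classes differ in
detail.

```python
import hashlib
import re
from collections import Counter


def exact_signature(raw: bytes) -> str:
    """Hash of the exact byte stream: any change at all gives a new hash."""
    return hashlib.md5(raw).hexdigest()


def text_signature(html: str, min_len: int = 2) -> str:
    """Hash of a crude text profile (hypothetical, simplified).

    Markup is stripped, text is lowercased, and only the sorted token
    counts are hashed, so formatting changes do not affect the result.
    """
    text = re.sub(r"<[^>]+>", " ", html)  # drop anything that looks like a tag
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_len]
    profile = sorted(Counter(tokens).items())
    return hashlib.md5(repr(profile).encode("utf-8")).hexdigest()


# Signatures of documents that have already been indexed.
seen = set()


def should_index(raw: bytes) -> bool:
    """Return False for a document whose exact signature was seen before."""
    sig = exact_signature(raw)
    if sig in seen:
        return False
    seen.add(sig)
    return True
```

With this, two copies of a page that differ only in markup get
different exact signatures but the same text signature, and
should_index() skips a byte-identical document the second time it
shows up.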

Something like #3 could be done in Solr.
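
As a rough idea of what "cut into pieces and compare everything to
everything" looks like, here is a small Python sketch. It swaps in
plain word shingles and Jaccard overlap instead of the Bayes scoring
described above, and it is neither Mahout nor Solr code; shingles(),
shared_shingles() and the shingle size k are hypothetical names and
parameters chosen for the example.

```python
import re
from itertools import combinations


def shingles(text: str, k: int = 5) -> set:
    """Return the set of k-word shingles (as tuples) in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def shared_shingles(docs: dict, k: int = 5) -> dict:
    """Compare every pair of documents by their shingle sets.

    Returns {(name_a, name_b): (shared_count, jaccard)} for each pair
    that shares at least one shingle.
    """
    sets = {name: shingles(body, k) for name, body in docs.items()}
    result = {}
    for a, b in combinations(sorted(sets), 2):
        common = sets[a] & sets[b]
        if common:
            union = sets[a] | sets[b]
            result[(a, b)] = (len(common), len(common) / len(union))
    return result
```

The all-pairs loop is quadratic in the number of documents, which is
the reason a batch job spread over many machines is attractive for a
large corpus. Exact shingle matching also breaks at every changed
word, so a block with minor edits shows up as several shorter shared
runs rather than one long one.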

On Sat, Dec 24, 2011 at 12:41 PM, Mike O'Leary <tmole...@uw.edu> wrote:
> I am looking for a way to identify blocks of text that occur in several 
> documents in a corpus for a research project with electronic medical records. 
> They can be copied and pasted sections inserted into another document, text 
> from a previous email in the corpus that is repeated in a follow-up email, 
> text templates that get inserted into groups of documents, and occurrences of 
> the same template more than once in the same document. Any of these 
> duplicated text blocks may contain minor differences from one instance to 
> another.
>
> I read in a document called "What's new in Solr 1.4" that there has been 
> support since 1.4 came out for duplicate text detection using the 
> SignatureUpdateProcessor and TextProfileSignature classes. Can these be used 
> to detect portions of documents that are alike or nearly alike, or are they 
> intended to detect entire documents that are alike or nearly alike? Has 
> additional support for duplicate detection been added to Solr since 1.4? It 
> seems like some of the features of Solr and Lucene such as term positions and 
> shingling could help in finding sections of matching or nearly matching text 
> in documents. Does anyone have any experience in this area that they would be 
> willing to share?
> Thanks,
> Mike

-- 
Lance Norskog
goks...@gmail.com
