@all — here is what I mean. Duplicate document detection (taken from IBM Research):
Duplicate document detection is a technique used to prevent search results from containing multiple documents with the same or nearly the same content. Search quality can be degraded if multiple copies of the same (or nearly the same) document are listed in the search results. Duplicate document analysis occurs only when both of the following conditions are true:
* The collection uses the link-based ranking model. This model applies to crawlers that crawl Web sites, such as the Web crawler or WebSphere Portal crawler.
* Collection security is disabled.
During global analysis, the indexing processes detect duplicates by scanning the content of each document. If two documents have the same content, they are treated as duplicates. If you want document metadata to also be considered during duplicate detection, you must select the Document content check box when you configure crawlers for the collection and specify options for crawling metadata. In this case, the crawler crawls the metadata fields as document content and includes the metadata when analyzing the content for duplicate documents. Similar analysis occurs when you configure options for parsing HTML and XML documents and select the Document content check box. When you specify that a field or metadata field constitutes document content, the content of those fields is added to the dynamic summary of the document in the search results, which can affect whether the document is displayed in the results.
If near-duplicate detection is enabled in the search application (the NearDuplicateDetection property in the setProperty method is set to Yes), documents with similar titles and summaries are suppressed when a user views search results. Users can click a link to view the suppressed near-duplicate documents. In a group of duplicate documents, one document is the master and the others are duplicates.
All documents in the group of duplicates have the same canonical representation of the content. During indexing, the content (tokens) of the master document is indexed; for the duplicate documents, only the metadata tokens are indexed. When the master document is deleted from the index, the next duplicate becomes the master. When users search the collection, only the master document is returned.

That's the background for what I am talking about. Don't panic — there is no need to cover all of it; ultimately we just have to come up with an excellent solution.

@rahul: a cryptographic hash function seems like a good idea, but we have to reach an efficient and optimized solution. A hash table can degrade to O(n^2) total work in the worst case (when many keys collide into the same bucket), which can't be neglected when we have huge data. I said one billion pages, but Google and Yahoo have hundreds of billions of pages stored on their servers, and all of it is distributed as well.

So roughly: this algorithm gives us a list of unique URLs. But wait — can this fit on one computer? How much space does each page take up in the hash table? Each page hashes to a value of at least 16 bytes (MD5 produces 16 bytes; SHA-1 produces 20). Each URL averages 30 characters, so that's another 30 bytes at least. Each entry therefore takes roughly 46 bytes, and 46 bytes * 1 billion is about 46 GB. We're going to have trouble holding all of that in memory!

Since memory is very limited, we can distribute the hash table as well. We split it across machines and accept the network latency: assume we have m machines. First, we hash the document to get a hash value h. Then h % m tells us which machine holds this document's hash-table shard, and h / m is the key within the hash table on that machine. This cuts the memory needed per machine, but more thinking is still needed. This question has many important applications in the field of computer science.
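The h % m / h / m sharding scheme above can be sketched in a few lines. This is a minimal single-process simulation under my own assumptions: NUM_MACHINES stands in for m, SHA-1 is one arbitrary choice of hash, and each "machine" is just a Python set rather than a real remote node.

```python
import hashlib

NUM_MACHINES = 4  # hypothetical cluster size "m"

# One dedupe set per simulated machine; in a real cluster each
# machine would hold only its own shard of the hash table.
shards = [set() for _ in range(NUM_MACHINES)]

def is_duplicate(content: str) -> bool:
    """Return True if this content was already seen, else record it.

    h % m picks the machine; h // m is the key stored on that
    machine, as described in the post.
    """
    h = int.from_bytes(hashlib.sha1(content.encode()).digest(), "big")
    machine = h % NUM_MACHINES
    key = h // NUM_MACHINES
    if key in shards[machine]:
        return True
    shards[machine].add(key)
    return False

print(is_duplicate("hello world"))   # False: first time seen
print(is_duplicate("hello world"))   # True: exact duplicate
print(is_duplicate("hello, world"))  # False: different content
```

Note that each shard stores only the 20-byte hash key, not the URL or content, so the per-machine memory is roughly (46 GB / m) in the estimate above; exact-hash matching also misses near-duplicates, which would need something like shingling on top of this.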
Thanks & Regards,
Shashank
>> "The best way to escape from a problem is to solve it."
CSE, BIT Mesra
--
You received this message because you are subscribed to the Google Groups "Algorithm Geeks" group. To post to this group, send email to algogeeks@googlegroups.com. To unsubscribe from this group, send email to algogeeks+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/algogeeks?hl=en.