@all — here is what I mean. Duplicate document detection (taken from IBM Research):
Duplicate document detection is a technique used to prevent search results from containing multiple documents with the same or nearly the same content. Search quality can be degraded if multiple copies of the same (or nearly the same) document are listed in the search results. Duplicate document analysis occurs only when both of the following conditions are true:
* The collection uses the link-based ranking model. This model applies to crawlers that crawl Web sites, such as the Web crawler or WebSphere Portal crawler.
* Collection security is disabled.
During global analysis, the indexing processes detect duplicates by scanning the content of each document. If two documents have the same content, they are treated as duplicates. If you want document metadata to also be considered during duplicate detection, you must select the Document content check box when you configure crawlers for the collection and specify options for crawling metadata. In this case, the crawler crawls the metadata fields as document content and includes the metadata when analyzing the content for duplicate documents. Similar analysis occurs when you configure options for parsing HTML and XML documents and select the Document content check box. When you specify that a field or metadata field constitutes document content, the content of those fields is added to the dynamic summary of the document in the search results, which can affect whether the document is displayed in the results.
If near-duplicate detection is enabled in the search application (the NearDuplicateDetection property in the setProperty method is set to Yes), documents with similar titles and summaries are suppressed when a user views search results. Users can click a link to view the suppressed near-duplicate documents. In a group of duplicate documents, one document is the master and the others are duplicates.
All documents in the group of duplicates have the same canonical representation of the content. During indexing, the content (tokens) of the master document is indexed; for the duplicate documents, only the metadata tokens are indexed. When the master document is deleted from the index, the next duplicate becomes the master. When users search the collection, only the master document is returned.

That's the background for what I am talking about. Don't panic — there is no need to cover all of it; ultimately we just have to come up with an excellent solution.

@rahul: a cryptographic hash function seems like a good idea, but we have to reach an efficient and optimized solution. A hash table can degrade to O(n^2) total work in the worst case (when many keys collide into the same bucket), which can't be neglected when we have huge data. I said one billion pages, but Google and Yahoo have hundreds of billions of pages stored on their servers, and all of it is distributed as well.

So roughly: this algorithm gives us a list of unique URLs. But wait — can this fit on one computer? How much space does each page take up in the hash table? Each page hashes to a value of at least 16 bytes (MD5 produces 16 bytes; SHA-1 produces 20). Each URL averages 30 characters, so that's another 30 bytes at least. Each entry therefore takes roughly 46 bytes, and 46 bytes * 1 billion is about 46 GB. We're going to have trouble holding all of that in memory!

Since memory is very limited, we can distribute the hash table as well. We split it across machines and accept the network latency: assume we have m machines. First, we hash the document to get a hash value h. Then h % m tells us which machine holds this document's hash-table shard, and h / m is the key within the hash table on that machine. This cuts the memory needed per machine, but more thinking is still needed. This question has many important applications in the field of computer science.
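The h % m / h / m sharding scheme above can be sketched in a few lines. This is a minimal single-process simulation under my own assumptions: NUM_MACHINES stands in for m, SHA-1 is one arbitrary choice of hash, and each "machine" is just a Python set rather than a real remote node.

```python
import hashlib

NUM_MACHINES = 4  # hypothetical cluster size "m"

# One dedupe set per simulated machine; in a real cluster each
# machine would hold only its own shard of the hash table.
shards = [set() for _ in range(NUM_MACHINES)]

def is_duplicate(content: str) -> bool:
    """Return True if this content was already seen, else record it.

    h % m picks the machine; h // m is the key stored on that
    machine, as described in the post.
    """
    h = int.from_bytes(hashlib.sha1(content.encode()).digest(), "big")
    machine = h % NUM_MACHINES
    key = h // NUM_MACHINES
    if key in shards[machine]:
        return True
    shards[machine].add(key)
    return False

print(is_duplicate("hello world"))   # False: first time seen
print(is_duplicate("hello world"))   # True: exact duplicate
print(is_duplicate("hello, world"))  # False: different content
```

Note that each shard stores only the 20-byte hash key, not the URL or content, so the per-machine memory is roughly (46 GB / m) in the estimate above; exact-hash matching also misses near-duplicates, which would need something like shingling on top of this.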
Thanks & Regards,
Shashank
>> "The best way to escape from a problem is to solve it."
CSE, BIT Mesra
--
You received this message because you are subscribed to the Google Groups "Algorithm Geeks" group. To post to this group, send email to algogeeks@googlegroups.com. To unsubscribe from this group, send email to algogeeks+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/algogeeks?hl=en.