sdeck wrote:
> That sort of gets me there in understanding what is going on. Still not
> all the way, though.
> So, let's look at the trunk of DeleteDuplicates:
> http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
>
> Nowhere in there do I see where url == url, and if so, delete that doc
> from the index. So, I am not sure where I would put my code.
>
> I could possibly modify the hash content reducer. Basically, here is the
> algorithm approach:
>
> Start at 1.
> Loop through 2-N, taking the text of 1 and comparing it to the text of
> 2, 3, 4, ... N.
> If the similarity score is > ## then delete that document.
>
> The way I understand the hash reducer, that is what it is doing, but I
> don't really understand where the score is coming from and where the
> comparison is really taking place. I see this:
>
>     public int compareTo(Object o) {
>       IndexDoc that = (IndexDoc)o;
>       if (this.keep != that.keep) {
>         return this.keep ? 1 : -1;
>       } else if (!this.hash.equals(that.hash)) {   // order first by hash
>         return this.hash.compareTo(that.hash);
>       ...
>
> So, is that where I would place my similarity score and return that
> value there?
AFAIK DeleteDuplicates works like this: IndexDoc is a representation of
the actual document in your index (IndexDoc keeps, among other things,
the document's url, boost, and digest). It is also Writable and
Comparable, which means that it can be used both as a key and a value in
MapReduce.

In the first phase of dedup, the job reads the indexes and outputs
<IndexDoc.url, IndexDoc> pairs. The job's map is the identity, so in
reduce, IndexDocs with the same url are grouped under the same reduce
call. Reduce outputs these, marking older versions of the same url to be
deleted. (So if you fetched the same url more than once, only the newest
is kept.)

In phase 2, the job reads this output and outputs <IndexDoc.hash,
IndexDoc> pairs. Again the map is the identity, and reduce marks the
relevant ones to be deleted. (So if you fetched the same document under
different urls, only the one with the highest boost, or failing that the
shortest url, is kept.)

Phase 3 reads this output and deletes all marked documents.

I think that your version will be somewhat difficult to implement,
because MapReduce works best on input records that can be processed
independently of each other.

Hope that clears things up a bit.

--
Dogacan Guney
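To make phase 1 concrete, here is a minimal, self-contained sketch of
what the url-grouping reduce boils down to. This is plain Java, not the
actual Nutch/Hadoop code; the IndexDoc fields and the reduce signature
here are illustrative only:

    import java.util.List;

    // Hypothetical stand-in for Nutch's DeleteDuplicates.IndexDoc.
    class IndexDoc {
        String url;      // the reduce key in phase 1
        long fetchTime;  // when this version was fetched
        boolean keep;    // false means "delete this doc in phase 3"
    }

    class UrlDedupSketch {
        // All docs passed in share the same url, because the framework
        // grouped them on IndexDoc.url. Keep only the newest fetch.
        static void reduce(String url, List<IndexDoc> docs) {
            IndexDoc newest = null;
            for (IndexDoc doc : docs) {
                if (newest == null || doc.fetchTime > newest.fetchTime) {
                    newest = doc;
                }
            }
            for (IndexDoc doc : docs) {
                doc.keep = (doc == newest);  // older versions are marked
            }
        }
    }

Phase 2 has the same shape, only the docs are grouped on the content
hash instead of the url, and the survivor is picked by boost and url
length rather than fetch time.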

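As for where the comparison "really takes place": compareTo does not
delete anything itself; it only defines the sort order of IndexDocs
within one hash group, and the hash reduce then keeps the best-sorted
doc and marks the rest. A hedged reconstruction of the full ordering
(the boost and urlLen tie-breakers are inferred from Dogacan's
explanation above, not copied from trunk, so verify against the real
source):

    public int compareTo(Object o) {
        IndexDoc that = (IndexDoc) o;
        if (this.keep != that.keep) {
            return this.keep ? 1 : -1;
        } else if (!this.hash.equals(that.hash)) {   // order first by hash
            return this.hash.compareTo(that.hash);
        } else if (this.boost != that.boost) {       // then by decreasing boost
            return this.boost > that.boost ? -1 : 1;
        } else {                                     // then by increasing url length
            return this.urlLen - that.urlLen;
        }
    }

Note that plugging a fuzzy similarity score into compareTo would not
give near-duplicate detection: reduce only groups docs whose key (the
exact hash) is identical, so two near-duplicate pages with different
hashes never meet in the same reduce call. That is the independence
constraint mentioned above.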