sdeck wrote:
> That sort of gets me there in understanding what is going on.
> Still not all the way though.
> So, let's look at the trunk of deleteduplicates:
> http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
>
> Nowhere in there do I see where it checks url == url and, if so, deletes
> that doc from the index.
> So, I am not sure where I would put my code.
>
> I could possibly modify the hash content reducer.  Basically, here is the
> algorithm approach:
>
> Start at document 1.
> Loop through 2-N, taking the text of 1 and comparing it to the text of
> 2, 3, 4, ... N.
> If the similarity score is > ## then delete that document.
>
> The way I understand the hash reducer, that is what it is doing, but I don't
> really understand where the score comes from or where the comparison
> actually takes place.
> I see this:
> public int compareTo(Object o) {
>       IndexDoc that = (IndexDoc)o;
>       if (this.keep != that.keep) {
>         return this.keep ? 1 : -1;
>       } else if (!this.hash.equals(that.hash)) {       // order first by hash
>         return this.hash.compareTo(that.hash);
> ...
>
>
> So, is that where I would place my similarity score and return that value
> there?

AFAIK DeleteDuplicates works like this:
IndexDoc is a representation of an actual document in your index (it
keeps, among other things, the document's url, boost and digest). It is
also Writable and Comparable, which means that it can be used both as a
key and as a value in MapReduce.
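
To make that shape concrete, here is a stripped-down sketch of such a
class. This is not the actual Nutch IndexDoc (field names and the
compareTo are simplified), just the Writable/Comparable plumbing that
lets it serve as a MapReduce key or value:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.MD5Hash;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Simplified, illustrative stand-in for Nutch's IndexDoc -- the real class
// also tracks which index and Lucene doc number it came from.
public class IndexDocSketch implements Writable, Comparable<IndexDocSketch> {
  Text url = new Text();        // document url
  float boost;                  // document boost (score)
  long time;                    // fetch time, used to pick the newest version
  MD5Hash hash = new MD5Hash(); // content digest
  boolean keep = true;          // false once a phase marks it for deletion

  public void write(DataOutput out) throws IOException {
    url.write(out);
    out.writeFloat(boost);
    out.writeLong(time);
    hash.write(out);
    out.writeBoolean(keep);
  }

  public void readFields(DataInput in) throws IOException {
    url.readFields(in);
    boost = in.readFloat();
    time = in.readLong();
    hash.readFields(in);
    keep = in.readBoolean();
  }

  public int compareTo(IndexDocSketch that) {
    // Same idea as the snippet you quoted: order by the keep flag first,
    // then by hash (the real compareTo keeps going from there).
    if (this.keep != that.keep) {
      return this.keep ? 1 : -1;
    }
    return this.hash.compareTo(that.hash);
  }
}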

In the first phase of dedup, the job reads the indexes and outputs
<IndexDoc.url, IndexDoc> pairs. The job's map is the identity, so in the
reduce, IndexDocs with the same url are grouped into the same reduce
call. The reduce outputs these, marking older versions of the same url
to be deleted. (So if you fetched the same url more than once, only the
newest is kept.)
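
In spirit, the per-url selection looks like the helper below, written
against the sketch class above; the field names are assumptions, not the
exact Nutch code:

import java.util.List;

// Illustrative only: keep the most recently fetched doc for one url,
// mark every older fetch of that url for deletion.
static void markUrlDuplicates(List<IndexDocSketch> docsWithSameUrl) {
  IndexDocSketch newest = null;
  for (IndexDocSketch doc : docsWithSameUrl) {
    if (newest == null || doc.time > newest.time) {
      newest = doc;
    }
  }
  for (IndexDocSketch doc : docsWithSameUrl) {
    doc.keep = (doc == newest);
  }
}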
 
In phase 2, the job reads this output and outputs <IndexDoc.hash, IndexDoc>
pairs. Again the map is the identity and the reduce marks the relevant
ones to be deleted. (So if you fetched the same document under different
urls, only the one with the highest boost or the shortest url is kept.)
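
The per-hash selection works the same way. As a sketch (the exact
precedence between boost and url length below is illustrative, not
necessarily what Nutch does):

// Illustrative only: among docs sharing a content hash, keep the one with
// the highest boost, preferring the shorter url on ties.
static void markContentDuplicates(List<IndexDocSketch> docsWithSameHash) {
  IndexDocSketch best = null;
  for (IndexDocSketch doc : docsWithSameHash) {
    if (best == null
        || doc.boost > best.boost
        || (doc.boost == best.boost
            && doc.url.getLength() < best.url.getLength())) {
      best = doc;
    }
  }
  for (IndexDocSketch doc : docsWithSameHash) {
    doc.keep = (doc == best);
  }
}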

Phase 3 reads this output and then deletes all the marked documents from
the indexes.
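
At the Lucene level the final step boils down to roughly this (the path
argument and the way the marked doc numbers are collected are
assumptions, not the exact Nutch bookkeeping):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

// Illustrative only: open an index and delete the documents that the
// earlier phases marked.
static void deleteMarked(String indexPath, int[] markedDocNums) throws IOException {
  IndexReader reader = IndexReader.open(indexPath);
  for (int docNum : markedDocNums) {
    reader.deleteDocument(docNum);
  }
  reader.close();
}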

I think that your version will be somewhat difficult to implement,
because MapReduce works best on input records that can be processed
independently of each other, whereas a similarity comparison needs to
look at every pair of documents in a group.

Hope that clears things up a bit.

--
Dogacan Guney
