That sort of gets me closer to understanding what is going on, but still not
all the way.
So, let's look at the trunk of DeleteDuplicates:
http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
Nowhere in there do I see a check for url == url that, if true, deletes that
doc from the index.
So, I am not sure where I would put my code.
I could possibly modify the hash content reducer. Basically, here is the
algorithm I have in mind (rough sketch below):
1. Start at document 1.
2. Loop through documents 2-N, comparing the text of 1 to the text of 2, 3,
   4, ... N.
3. If the similarity score is > ## then delete that document.
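Here is a rough sketch of that loop in plain Java (no Nutch/Hadoop types yet;
the similarity function is just a stand-in token-overlap measure, and the
threshold would be whatever the ## above ends up being):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimilarityDedupSketch {

  // Returns the indices of docs that should be deleted because they are
  // near-duplicates of an earlier (kept) doc.
  public static List<Integer> findDuplicates(List<String> texts, double threshold) {
    List<Integer> toDelete = new ArrayList<Integer>();
    for (int i = 0; i < texts.size(); i++) {
      if (toDelete.contains(i)) continue;          // already marked as a dupe
      for (int j = i + 1; j < texts.size(); j++) {
        if (toDelete.contains(j)) continue;
        if (similarity(texts.get(i), texts.get(j)) > threshold) {
          toDelete.add(j);                         // keep doc i, delete doc j
        }
      }
    }
    return toDelete;
  }

  // Stand-in similarity: Jaccard overlap of whitespace tokens, in [0, 1].
  static double similarity(String a, String b) {
    Set<String> ta = new HashSet<String>(Arrays.asList(a.toLowerCase().split("\\s+")));
    Set<String> tb = new HashSet<String>(Arrays.asList(b.toLowerCase().split("\\s+")));
    Set<String> inter = new HashSet<String>(ta);
    inter.retainAll(tb);
    Set<String> union = new HashSet<String>(ta);
    union.addAll(tb);
    return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
  }
}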
The way I understand the hash reducer, that is roughly what it is doing, but I
don't really understand where the score comes from or where the comparison
actually takes place.
I see this:
public int compareTo(Object o) {
  IndexDoc that = (IndexDoc)o;
  if (this.keep != that.keep) {
    return this.keep ? 1 : -1;
  } else if (!this.hash.equals(that.hash)) {   // order first by hash
    return this.hash.compareTo(that.hash);
  ...
So, is that where I would place my similarity score and return that value?
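If that is the right place, I imagine the change would look something like
this (purely a guess on my part; simScore is a field I would have to add to
IndexDoc and compute somewhere upstream):

public int compareTo(Object o) {
  IndexDoc that = (IndexDoc)o;
  if (this.keep != that.keep) {
    return this.keep ? 1 : -1;
  } else if (this.simScore != that.simScore) {   // simScore: a field I would add
    return this.simScore > that.simScore ? 1 : -1;
  } else {
    return this.hash.compareTo(that.hash);       // fall back to the existing hash order
  }
}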
Dennis Kubes wrote:
>
> If I am understanding what you are asking, in the getRecordReader method
> of the InputFormat inner class in DeleteDuplicates, it gets the hash
> score from the document. You could put your algorithm there and return
> some type of numeric value based on analysis of the document fields.
> You would need to write a different class for HashScore and return it
> from the record reader. You would probably want to keep the IndexDoc
> being written out as the value in dedup phase 1 (in the job config) but
> change the key to your HashScore replacement class. You would need to
> change HashPartitioner to partition according to your new key numeric.
> The HashReducer would also need to be changed to collect only the ones
> you want based on your new key numeric.
>
> The dedup phase 2 deletes by url, so if you want to remove exact urls
> then you would leave it in; otherwise you might want to take the job
> config section for phase 2 out.
>
> Hope this helps.
>
> Dennis
>
> sdeck wrote:
>> Hello,
>> I am running Nutch 0.8 against Hadoop 0.4, just for reference.
>> I want to add a delete-duplicates pass based on a similarity algorithm, as
>> opposed to the hash method that is currently in there.
>> I would have to say I am pretty lost as to how the DeleteDuplicates class
>> is working.
>> I would guess that I need to implement a compareTo method, but I am not
>> really sure what to return. Also, when I do return something, where do I
>> implement the functionality to say "yes, these are dupes, so remove the
>> first one"?
>>
>> Can anyone help out?
>> Thanks,
>> S
>>
>
>
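For my own reference, here is roughly how I read the suggestion above about
replacing the hash key with my own class: a key that carries a numeric value
derived from the document fields, so the partitioner and reducer can group and
order by it. This is only a sketch against the plain WritableComparable
interface; the SimilarityKey name and its single field are mine, not anything
in Nutch.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class SimilarityKey implements WritableComparable {

  private float value;                 // whatever number my analysis produces

  public SimilarityKey() {}

  public SimilarityKey(float value) { this.value = value; }

  public float get() { return value; }

  public void write(DataOutput out) throws IOException {
    out.writeFloat(value);
  }

  public void readFields(DataInput in) throws IOException {
    value = in.readFloat();
  }

  public int compareTo(Object o) {
    float thatValue = ((SimilarityKey) o).value;
    return value < thatValue ? -1 : (value > thatValue ? 1 : 0);
  }

  // hashCode/equals matter if a partitioner partitions on the key's hash.
  public int hashCode() {
    return Float.floatToIntBits(value);
  }

  public boolean equals(Object o) {
    return (o instanceof SimilarityKey) && ((SimilarityKey) o).value == value;
  }
}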