That sort of gets me closer to understanding what is going on, but still not
all the way.
So, let's look at the trunk of DeleteDuplicates:
http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
Nowhere in there do I see a check for url == url that, if true, deletes that
doc from the index.
So, I am not sure where I would put my code.
I could possibly modify the hash content reducer. Basically, here is the
algorithm I have in mind (rough sketch below):
1. Start at document 1.
2. Loop through documents 2-N, comparing the text of 1 to the text of 2, 3,
   4, ... N.
3. If the similarity score is > ## then delete that document.
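Here is a rough sketch of that loop in plain Java (no Nutch/Hadoop types yet;
the similarity function is just a stand-in token-overlap measure, and the
threshold would be whatever the ## above ends up being):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimilarityDedupSketch {

  // Returns the indices of docs that should be deleted because they are
  // near-duplicates of an earlier (kept) doc.
  public static List<Integer> findDuplicates(List<String> texts, double threshold) {
    List<Integer> toDelete = new ArrayList<Integer>();
    for (int i = 0; i < texts.size(); i++) {
      if (toDelete.contains(i)) continue;          // already marked as a dupe
      for (int j = i + 1; j < texts.size(); j++) {
        if (toDelete.contains(j)) continue;
        if (similarity(texts.get(i), texts.get(j)) > threshold) {
          toDelete.add(j);                         // keep doc i, delete doc j
        }
      }
    }
    return toDelete;
  }

  // Stand-in similarity: Jaccard overlap of whitespace tokens, in [0, 1].
  static double similarity(String a, String b) {
    Set<String> ta = new HashSet<String>(Arrays.asList(a.toLowerCase().split("\\s+")));
    Set<String> tb = new HashSet<String>(Arrays.asList(b.toLowerCase().split("\\s+")));
    Set<String> inter = new HashSet<String>(ta);
    inter.retainAll(tb);
    Set<String> union = new HashSet<String>(ta);
    union.addAll(tb);
    return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
  }
}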
The way I understand the hash reducer, that is roughly what it is doing, but I
don't really understand where the score comes from or where the comparison
actually takes place.
I see this:
public int compareTo(Object o) {
  IndexDoc that = (IndexDoc)o;
  if (this.keep != that.keep) {
    return this.keep ? 1 : -1;
  } else if (!this.hash.equals(that.hash)) {   // order first by hash
    return this.hash.compareTo(that.hash);
  ...
So, is that where I would place my similarity score and return that value?
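If that is the right place, I imagine the change would look something like
this (purely a guess on my part; simScore is a field I would have to add to
IndexDoc and compute somewhere upstream):

public int compareTo(Object o) {
  IndexDoc that = (IndexDoc)o;
  if (this.keep != that.keep) {
    return this.keep ? 1 : -1;
  } else if (this.simScore != that.simScore) {   // simScore: a field I would add
    return this.simScore > that.simScore ? 1 : -1;
  } else {
    return this.hash.compareTo(that.hash);       // fall back to the existing hash order
  }
}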
Dennis Kubes wrote:
>
> If I am understanding what you are asking, in the getRecordReader method
> of the InputFormat inner class in DeleteDuplicates, it gets the hash
> score from the document. You could put your algorithm there and return
> some type of numeric value based on analysis of the document fields.
> You would need to write a different class for HashScore and return it
> from the record reader. You would probably want to keep the IndexDoc
> being written out as the value in dedup phase 1 (in the job config) but
> change the key to your HashScore replacement class. You would need to
> change HashPartitioner to partition according to your new key numeric.
> The HashReducer would also need to be changed to collect only the ones
> you want based on your new key numeric.
>
> The dedup phase 2 deletes by url, so if you want to remove exact urls
> then you would leave it in; otherwise you might want to take the job
> config section for phase 2 out.
>
> Hope this helps.
>
> Dennis
>
> sdeck wrote:
>> Hello,
>> I am running Nutch 0.8 against Hadoop 0.4, just for reference.
>> I want to add a delete-duplicates pass based on a similarity algorithm, as
>> opposed to the hash method that is currently in there.
>> I would have to say I am pretty lost as to how the DeleteDuplicates class
>> is working.
>> I would guess that I need to implement a compareTo method, but I am not
>> really sure what to return. Also, when I do return something, where do I
>> implement the functionality to say "yes, these are dupes, so remove the
>> first one"?
>>
>> Can anyone help out?
>> Thanks,
>> S
>>
>
>
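For my own reference, here is roughly how I read the suggestion above about
replacing the hash key with my own class: a key that carries a numeric value
derived from the document fields, so the partitioner and reducer can group and
order by it. This is only a sketch against the plain WritableComparable
interface; the SimilarityKey name and its single field are mine, not anything
in Nutch.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class SimilarityKey implements WritableComparable {

  private float value;                 // whatever number my analysis produces

  public SimilarityKey() {}

  public SimilarityKey(float value) { this.value = value; }

  public float get() { return value; }

  public void write(DataOutput out) throws IOException {
    out.writeFloat(value);
  }

  public void readFields(DataInput in) throws IOException {
    value = in.readFloat();
  }

  public int compareTo(Object o) {
    float thatValue = ((SimilarityKey) o).value;
    return value < thatValue ? -1 : (value > thatValue ? 1 : 0);
  }

  // hashCode/equals matter if a partitioner partitions on the key's hash.
  public int hashCode() {
    return Float.floatToIntBits(value);
  }

  public boolean equals(Object o) {
    return (o instanceof SimilarityKey) && ((SimilarityKey) o).value == value;
  }
}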