[ http://issues.apache.org/jira/browse/NUTCH-235?page=all ]
Andrzej Bialecki closed NUTCH-235:
-----------------------------------
Fix Version: 0.8-dev
Resolution: Fixed
HashSet-based version of the patch applied.
> Duplicate Inlink values
> -----------------------
>
> Key: NUTCH-235
> URL: http://issues.apache.org/jira/browse/NUTCH-235
> Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
> Fix For: 0.8-dev
> Attachments: patch.txt, set-patch.txt
>
> Reading the code for LinkDb.reduce(): if the input segments contain duplicate
> pages, or if the same segment is supplied twice, we create the same Inlink
> values (equal according to Inlink.equals()) multiple times. Since Inlinks is
> a facade for a List, not a Set, we get duplicate Inlink-s in Inlinks (if you
> know what I mean ;).
> The problem is easy to reproduce: create a new linkdb from 2 identical
> segments. It also makes it harder to properly implement a LinkDB updating
> mechanism (i.e. incremental invertlinks).
> I propose changing Inlinks to use Set semantics, either explicitly by using
> a HashSet or implicitly by checking whether a value to be added already
> exists. If there are no objections I'll commit this change shortly.
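
For illustration only, here is a minimal sketch of the Set-based idea described
above. It is not the committed patch: the Inlink and Inlinks classes below are
simplified, hypothetical stand-ins (the fromUrl and anchor fields are assumed),
and the real Nutch classes also carry serialization code that is omitted here.
The point it tries to show is that a HashSet drops a value that is equal to one
already present, provided hashCode() is consistent with equals().

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Objects;
import java.util.Set;

public class InlinksSetSketch {

    /** Hypothetical stand-in for Inlink: a source URL plus anchor text (assumed fields). */
    static final class Inlink {
        private final String fromUrl;
        private final String anchor;

        Inlink(String fromUrl, String anchor) {
            this.fromUrl = fromUrl;
            this.anchor = anchor;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Inlink)) return false;
            Inlink other = (Inlink) o;
            return fromUrl.equals(other.fromUrl) && anchor.equals(other.anchor);
        }

        // A HashSet only deduplicates correctly if hashCode() agrees with equals().
        @Override
        public int hashCode() {
            return Objects.hash(fromUrl, anchor);
        }
    }

    /** Hypothetical stand-in for Inlinks, backed by a Set instead of a List. */
    static final class Inlinks {
        private final Set<Inlink> inlinks = new HashSet<>();

        /** Returns false when an equal Inlink is already present. */
        boolean add(Inlink inlink) {
            return inlinks.add(inlink);
        }

        int size() {
            return inlinks.size();
        }

        Iterator<Inlink> iterator() {
            return inlinks.iterator();
        }
    }

    public static void main(String[] args) {
        Inlinks inlinks = new Inlinks();
        // Simulate reducing two identical segments: every link arrives twice.
        for (int segment = 0; segment < 2; segment++) {
            inlinks.add(new Inlink("http://example.com/a", "anchor A"));
            inlinks.add(new Inlink("http://example.com/b", "anchor B"));
        }
        System.out.println(inlinks.size()); // prints 2
    }
}
```

The "implicit" alternative mentioned in the description would keep the List and
call contains() before each add(). That preserves insertion order, but each
check is linear in the number of inlinks, whereas the HashSet lookup is
constant time on average.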