[ https://issues.apache.org/jira/browse/NUTCH-70?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578965#action_12578965 ]
Andrzej Bialecki commented on NUTCH-70:
----------------------------------------
These issues are partially addressed in 0.9 / 1.0. Also, the 0.7 branch has
reached End Of Life status.
> duplicate pages - virtual hosts in db.
> --------------------------------------
>
> Key: NUTCH-70
> URL: https://issues.apache.org/jira/browse/NUTCH-70
> Project: Nutch
> Issue Type: Bug
> Environment: 0.7 dev
> Reporter: YourSoft
>
> Dear Developers,
> I have a problem with Nutch:
> - There are many duplicate sites in the webdb and in the segments.
> The source of this problem is:
> - If a site uses virtual hosts (as with Apache), e.g. www.origo.hu,
> origo.hu, origo.matav.hu, origo.matavnet.hu etc., the resulting pages are the
> same; only the inlinks differ.
> - The IP address is the same.
> - When searching, all of the virtual hosts appear in the results.
> Google shows only one of these virtual hosts; Nutch shows all of them. As a
> result, the Nutch db is larger, and in this case slower, than Google's.
> Do you have any idea how to remove these duplicates?
> Regards,
> Ferenc
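For illustration only, here is a minimal sketch of one generic way to collapse the virtual-host duplicates described in the report: group fetched pages by a digest of their content and keep a single URL per group. This is not Nutch code and is not claimed to be how 0.9 / 1.0 handle it; the function names `content_digest` and `dedupe_by_content`, the use of MD5, and the "shortest URL wins" rule are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def content_digest(body: bytes) -> str:
    """Return a hex digest of the raw page body (MD5 is an assumption here)."""
    return hashlib.md5(body).hexdigest()


def dedupe_by_content(pages):
    """Keep one URL per distinct page body.

    `pages` is an iterable of (url, body_bytes) pairs. Pages whose bodies are
    byte-for-byte identical are treated as duplicates. The kept URL is the
    shortest one, with ties broken alphabetically (a hypothetical policy).
    Returns a dict mapping digest -> kept URL.
    """
    groups = defaultdict(list)
    for url, body in pages:
        groups[content_digest(body)].append(url)
    return {
        digest: min(urls, key=lambda u: (len(u), u))
        for digest, urls in groups.items()
    }
```

With the hosts from the report, pages fetched from www.origo.hu, origo.hu and origo.matav.hu that return the same bytes would collapse to the single entry http://origo.hu/. Exact-match digests only catch identical bodies; pages that differ by a timestamp or an ad block would need a fuzzier signature.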
--
This message is automatically generated by JIRA.