[ https://issues.apache.org/jira/browse/NUTCH-70?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki closed NUTCH-70.
----------------------------------

    Resolution: Won't Fix

> duplicate pages - virtual hosts in db.
> --------------------------------------
>
>                 Key: NUTCH-70
>                 URL: https://issues.apache.org/jira/browse/NUTCH-70
>             Project: Nutch
>          Issue Type: Bug
>         Environment: 0,7 dev
>            Reporter: YourSoft
>
> Dear Developers,
> I have a problem with Nutch:
> - There are many duplicate sites in the webdb and in the segments.
> The source of this problem is:
> - If a site uses 'virtual hosts' (as with Apache), e.g. www.origo.hu,
> origo.hu, origo.matav.hu, origo.matavnet.hu etc., the resulting pages are
> the same; only the inlinks differ.
> - The IP address is the same.
> - When searching, all of the virtual hosts appear in the results.
> Google shows only one of these virtual hosts; Nutch shows all of them. As a
> result, the Nutch db is larger, and in this case slower, than Google.
> Do you have any idea how to remove these duplicates?
> Regards,
> Ferenc

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.