[ http://issues.apache.org/jira/browse/NUTCH-70?page=comments#action_12413169 ]
Stefan Neufeind commented on NUTCH-70:
--------------------------------------

Is the content exactly the same? If so, could each page be checked against the ones already stored by comparing an MD5 hash of the content? I'm not sure there is a clean way to work around the problem, though: what if all pages are the same except one, on the other vhost? You would have to crawl all of them anyway, wouldn't you?

> duplicate pages - virtual hosts in db.
> --------------------------------------
>
>          Key: NUTCH-70
>          URL: http://issues.apache.org/jira/browse/NUTCH-70
>      Project: Nutch
>         Type: Bug
>  Environment: 0.7 dev
>     Reporter: YourSoft
>
> Dear Developers,
> I have a problem with Nutch:
> - There are many duplicate sites in the webdb and in the segments.
> The source of this problem is:
> - If a site uses 'virtual hosts' (as with Apache), e.g. www.origo.hu,
> origo.hu, origo.matav.hu, origo.matavnet.hu etc., the resulting pages are the
> same; only the inlinks differ.
> - The IP address is the same.
> - When searching, all virtual hosts appear in the results.
> Google shows only one of these virtual hosts; Nutch shows all of them. The
> resulting Nutch db is larger, and in this case slower, than Google's.
> Do you have any idea how to remove these duplicates?
> Regards,
> Ferenc

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
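
The MD5 idea raised in the comment can be illustrated with a minimal sketch. This is not Nutch code: the function names, the in-memory `pages` mapping, and the "prefer the shortest URL" rule are all assumptions made for the example. It groups already-fetched pages by the MD5 digest of their content and keeps one URL per group.

```python
import hashlib
from collections import defaultdict


def content_digest(content: bytes) -> str:
    """Return the hex MD5 digest of a page body."""
    return hashlib.md5(content).hexdigest()


def group_by_content(pages: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each content digest to the URLs that served exactly that content."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url, content in pages.items():
        groups[content_digest(content)].append(url)
    return dict(groups)


def pick_representatives(pages: dict[str, bytes]) -> list[str]:
    """Keep one URL per distinct content.

    Assumption for this sketch: the shortest URL wins, with ties broken
    alphabetically. A real crawler might prefer a different rule.
    """
    groups = group_by_content(pages)
    return sorted(min(urls, key=lambda u: (len(u), u)) for urls in groups.values())


if __name__ == "__main__":
    # Hypothetical page bodies; the host names come from the issue report.
    pages = {
        "http://www.origo.hu/index.html": b"<html>front page</html>",
        "http://origo.hu/index.html": b"<html>front page</html>",
        "http://origo.matav.hu/index.html": b"<html>front page</html>",
        "http://origo.matav.hu/other.html": b"<html>only on this vhost</html>",
    }
    for url in pick_representatives(pages):
        print(url)
```

Two limits are worth noting. Every page body has to be fetched before it can be hashed, so this reduces what is stored and shown, not what is crawled, which matches the concern in the comment. An exact hash also only catches byte-identical pages; any per-host difference in the markup yields a different digest.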
