Lucifersam wrote:
> Finally - I seem to have a problem with identical pages with different URLs
> - i.e.
>
> http://website/
> http://website/default.htm
>
> I was under the impression that these would be removed by the dedup process,
> but this does not seem to be working. Is there something I'm missing?

Most likely the pages are slightly different. You can save them to files and then run a diff utility to check for differences.

> (I also have a similar problem with the external site as it carries session IDs
> around in the URL which change - although the content of the duplicate pages
> is identical).

You can remove session IDs using URLNormalizers. See e.g. regex-urlnormalizer.xml for an example of how to do this.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
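
To illustrate the "save and diff" check, here is a minimal Python sketch. It is not part of Nutch. The two page bodies are made-up stand-ins for the saved files, and the helper names `content_digest` and `page_diff` are hypothetical. It assumes that the dedup step compares pages by an exact hash of the content, so a difference of a single byte (here, a timestamp comment) is enough to keep both pages.

```python
import difflib
import hashlib
from typing import List


def content_digest(data: bytes) -> str:
    """Return the MD5 hex digest of the raw page bytes."""
    return hashlib.md5(data).hexdigest()


def page_diff(a: str, b: str, name_a: str = "a", name_b: str = "b") -> List[str]:
    """Return unified-diff lines for two page bodies (empty list if identical)."""
    return list(
        difflib.unified_diff(
            a.splitlines(),
            b.splitlines(),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
        )
    )


# Hypothetical page bodies standing in for the two saved files.
root_page = "<html><body>\n<p>Welcome</p>\n<!-- served 10:00:01 -->\n</body></html>\n"
default_page = "<html><body>\n<p>Welcome</p>\n<!-- served 10:00:02 -->\n</body></html>\n"

# An exact-hash comparison treats these as different pages.
same = content_digest(root_page.encode()) == content_digest(default_page.encode())

# The diff shows which lines actually differ.
diff_lines = page_diff(root_page, default_page, "root.html", "default.html")

if __name__ == "__main__":
    print("identical digests:", same)
    print("\n".join(diff_lines))
```

If the diff comes back empty and the digests match, the pages really are byte-identical and the cause lies elsewhere.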

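
The idea behind normalizing away session IDs can be sketched in Python as follows. This only illustrates the regex approach and is not Nutch's normalizer or its configuration format. The parameter names (`jsessionid`, `sid`, `phpsessid`, `sessionid`) and the function `strip_session_id` are assumptions chosen for the example; the external site may use a different name.

```python
import re

# Assumed session markers: a ";jsessionid=..." path parameter, or a
# "sid" / "phpsessid" / "sessionid" query parameter.
_PATH_SESSION = re.compile(r";jsessionid=[^?#&]*", re.IGNORECASE)
_QUERY_SESSION = re.compile(
    r"([?&])(?:sid|phpsessid|sessionid)=[^&#]*&?", re.IGNORECASE
)
# A "?" or "&" left dangling at the end of the URL or before a fragment.
_DANGLING = re.compile(r"[?&](?=#|$)")


def strip_session_id(url: str) -> str:
    """Return the URL with the assumed session-ID parameters removed."""
    url = _PATH_SESSION.sub("", url)
    # Keep the leading separator so any following parameters stay attached.
    url = _QUERY_SESSION.sub(r"\1", url)
    return _DANGLING.sub("", url)


if __name__ == "__main__":
    for u in (
        "http://website/page.htm?sid=abc123",
        "http://website/page.htm?x=1&sid=abc123&y=2",
        "http://website/page.htm;jsessionid=ABC123?x=1",
    ):
        print(u, "->", strip_session_id(u))
```

With a rule like this, two URLs that differ only in their session ID normalize to the same string, so they are treated as one page.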