Andrzej Bialecki wrote:
>
> Lucifersam wrote:
>> Finally - I seem to have a problem with identical pages with different
>> urls - i.e.
>>
>> http://website/
>> http://website/default.htm
>>
>> I was under the impression that these would be removed by the dedup
>> process, but this does not seem to be working. Is there something I'm
>> missing?
>
> Most likely the pages are slightly different - you can save them to
> files, and then run a diff utility to check for differences.
>
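For anyone following the thread, the "save both pages and diff them" check could be sketched roughly like this in Python. This is only an illustration and not part of Nutch: the two page bodies are made-up stand-ins for the saved files, and the second timing value is invented for the example.

```python
import difflib

def page_diff(a_text, b_text, a_name="page_a.html", b_name="page_b.html"):
    """Return the unified-diff lines between two saved page bodies."""
    return list(difflib.unified_diff(
        a_text.splitlines(), b_text.splitlines(),
        fromfile=a_name, tofile=b_name, lineterm=""))

# Hypothetical bodies standing in for http://website/ and
# http://website/default.htm after saving them to files.
root = "<html>\n<body>Hello</body>\n<!--Exec time = 265.625-->\n</html>"
default = "<html>\n<body>Hello</body>\n<!--Exec time = 301.250-->\n</html>"

for line in page_diff(root, default):
    print(line)
```

An empty result means the two bodies are identical line for line; any `-`/`+` pair shows where they differ.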
You're right, there was a small difference in the HTML: a timing comment,
e.g.

<!--Exec time = 265.625-->

As this is not strictly content, is there a simple way to ignore anything
within comments when looking at the content of a page?

Andrzej Bialecki wrote:
>
>> (I also have a similar problem with the external site as it carries
>> session ids around in the URL which change - although the content of
>> the duplicate pages is identical).
>>
>
> You can remove session IDs using URLNormalizers - see e.g. the
> regex-urlnormalizer.xml for an example how to do this.
>

Thanks - I will look into this.

--
View this message in context: http://www.nabble.com/Quick-questions---merging-deduping-tf3267849.html#a9084994
Sent from the Nutch - User mailing list archive at Nabble.com.
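To make the two ideas in this message concrete (ignoring HTML comments when comparing content, and dropping a session ID from a URL), here is a rough Python sketch. It is not Nutch code and does not show how Nutch computes its signatures or how regex-urlnormalizer.xml is written. The parameter name `sid`, the helper names, and the choice of SHA-256 are all assumptions made for illustration, and only the query-string form of a session ID is handled.

```python
import hashlib
import re

# Matches HTML comments, including ones that span several lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

# Hypothetical session parameter name ("sid"); a real site may use another.
SESSION_RE = re.compile(r"([?&])sid=[^&#]*&?")

def content_digest(html):
    """Hash the page body after removing HTML comments."""
    stripped = COMMENT_RE.sub("", html)
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()

def strip_session_id(url):
    """Remove a 'sid=...' query parameter, keeping the other parameters."""
    url = SESSION_RE.sub(r"\1", url)
    # Drop a separator left dangling at the end of the URL.
    return url.rstrip("?&")

# Two bodies that differ only in the timing comment (second value invented).
a = "<html><body>Hello</body><!--Exec time = 265.625--></html>"
b = "<html><body>Hello</body><!--Exec time = 301.250--></html>"
print(content_digest(a) == content_digest(b))
print(strip_session_id("http://website/page?sid=abc123&x=1"))
```

With the comments removed before hashing, the two bodies above produce the same digest, which is the property a dedup step would need. The same regex idea would have to be expressed in whatever form the URL normalizer configuration expects.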
