Thank you Andrzej, it helped! Another problem I am facing right now is limiting the total number of URLs crawled on a single website. The "generate.max.per.host" value doesn't seem to work as it is supposed to: the value is set to 300, but the total number of crawled URLs varies from 600 to ~3000 depending on the "crawl.link.depth" value.
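
One possible reading of these numbers, offered as an assumption rather than confirmed behaviour: if "generate.max.per.host" caps the URLs per host in each generated fetchlist rather than over the whole crawl, and the crawl runs one generate/fetch round per depth level, then the per-host total can reach roughly the cap multiplied by the depth. A minimal Python sketch of that arithmetic follows; the function name and the depth values are hypothetical and chosen only for illustration.

```python
def max_urls_per_host(max_per_host: int, depth: int) -> int:
    """Upper bound on URLs fetched from one host over a whole crawl.

    Assumes the cap applies to each generate round separately and that
    there is exactly one round per depth level (both are assumptions).
    """
    if max_per_host < 0:
        raise ValueError("a negative cap is taken to mean 'unlimited'; no finite bound")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return max_per_host * depth


if __name__ == "__main__":
    cap = 300  # the "generate.max.per.host" value from the message
    for depth in (2, 5, 10):  # hypothetical "crawl.link.depth" values
        print(f"depth={depth}: at most {max_urls_per_host(cap, depth)} URLs per host")
```

With a cap of 300, depths between 2 and 10 would give a ceiling of 600 to 3000 URLs, which is at least consistent with the range reported here.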
Is there any way to limit the total number of URLs crawled per site (under Nutch 0.8)?

Andrzej Bialecki wrote:
>
> [EMAIL PROTECTED] wrote:
>> I am using Nutch 0.8 to crawl a list of websites, and I have found
>> that Nutch cannot find all the links on a page.
>>
>> For example: http://www.artbrown.com/
>>
>> According to Google, this website has approximately 4,450 pages.
>> However, no matter how I change Nutch's config, it won't crawl more
>> than 9 pages.
>>
>> I've tried changing "crawl.link.depth" and "http.content.limit",
>> and using the TagSoup HTML parser instead of NekoHTML as described here:
>> http://www.mail-archive.com/[email protected]/msg03141.html
>> but it doesn't help.
>>
>> Any ideas?
>>
>
> Check your URL filters - most likely you have the default rule that
> discards URLs with special characters.
>
> --
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com

--
View this message in context: http://www.nabble.com/Nutch-0.8-cannot-find-all-the-links-on-a-page-tf3033338.html#a8446081
Sent from the Nutch - User mailing list archive at Nabble.com.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
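
The URL filter check suggested in the quoted reply can be tried outside Nutch. The sketch below assumes that the default rule in question is the character-class exclusion `-[?*!@=]` found in the stock URL filter files (verify this against your own conf directory), and that a rule applies when its pattern is found anywhere in the URL. The helper name and the sample query URL are made up for illustration.

```python
import re

# Assumed default exclusion rule: skip URLs containing any of ? * ! @ =
# (treated by the stock filter as probable queries). Check your own filter file.
SPECIAL_CHARS = re.compile(r"[?*!@=]")


def is_discarded(url: str) -> bool:
    """Return True if the assumed exclusion rule would drop this URL."""
    return SPECIAL_CHARS.search(url) is not None


if __name__ == "__main__":
    samples = [
        "http://www.artbrown.com/",                  # URL from the message
        "http://www.artbrown.com/page.asp?id=1",     # hypothetical dynamic URL
    ]
    for url in samples:
        verdict = "discarded" if is_discarded(url) else "kept"
        print(f"{verdict}: {url}")
```

If most links on a site carry query strings, a rule like this would leave only a handful of crawlable pages, which would fit the symptom described in the original question.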
