Well, db.max.outlinks.per.page appears to have made a world of difference. Nutch is now crawling deeply through the site.
Many thanks for all your input, I'm sure I'll be back asking some more useless questions soon!

Annona Keene wrote:
>
> Thus saith Mike Howarth:
>> I've already played around with differing depths, generally from 3 to 10,
>> and have had no distinguishable difference in results
>> ......
>> Any more ideas?
>
> I fought with a similar problem for quite a while. I suggest changing 2
> things in your nutch-site.xml.
>
> First, setting http.content.limit to -1 will prevent nutch from truncating
> the page. As long as your pages aren't so big that you're going to kill the
> machine you're using, removing the truncation should work.
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content, in bytes.
>   If this value is nonnegative (>=0), content longer than it will be
>   truncated; otherwise, no truncation at all.
>   </description>
> </property>
>
> Second, by default, nutch only crawls the first 100 links it encounters on
> a page. So if you set db.max.outlinks.per.page to -1, it will crawl all
> the links.
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>
>   <description>The maximum number of outlinks that we'll process for a
>   page. If this value is nonnegative (>=0), at most
>   db.max.outlinks.per.page outlinks will be processed for a page;
>   otherwise, all outlinks will be processed.
>   </description>
> </property>
>
> I hope this helps!
>
> Ann

--
View this message in context: http://www.nabble.com/Crawl-not-crawling-entire-page-tf3446522.html#a9619061
Sent from the Nutch - User mailing list archive at Nabble.com.
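
For anyone who wants to double-check that both overrides are actually present, here is a minimal sketch in Python (it is not part of Nutch) that reads a nutch-site.xml-style document and reports whether the two settings quoted above are set to a negative value. The helper names read_overrides and is_unlimited and the sample XML are made up for illustration, and the sketch assumes the file uses a <configuration> root with <property> children.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; assumes a <configuration>/<property> layout.
SAMPLE = """<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>
"""


def read_overrides(xml_text):
    """Return {property name: value} for every <property> in the XML text."""
    root = ET.fromstring(xml_text)
    overrides = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            overrides[name.strip()] = (value or "").strip()
    return overrides


def is_unlimited(overrides, name):
    """True if the property is present and negative, i.e. no limit is applied."""
    try:
        return int(overrides[name]) < 0
    except (KeyError, ValueError):
        return False


if __name__ == "__main__":
    found = read_overrides(SAMPLE)
    for key in ("http.content.limit", "db.max.outlinks.per.page"):
        state = "unlimited" if is_unlimited(found, key) else "limited or unset"
        print(key, state)
```

To check a real file, read its text and pass it to read_overrides in place of SAMPLE. A property that is missing from nutch-site.xml falls back to whatever nutch-default.xml specifies, which this sketch does not read.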
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
