Thus saith Mike Howarth:
> I've already played around with differing depths generally from 3 to 10 and
> have had no distinguishable difference in results
>......
> Any more ideas?
I fought with a similar problem for quite a while. I suggest changing two things
in your nutch-site.xml:
First, setting http.content.limit to -1 stops Nutch from truncating fetched pages.
As long as your pages aren't so big that fetching them whole will kill the machine
you're using, removing the truncation should work.
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>
Second, by default Nutch only processes the first 100 outlinks it encounters on a
page. If you set db.max.outlinks.per.page to -1, it will process all of the links.
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
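In case it helps, here is roughly what the complete file would look like with both
overrides in place. This is just a sketch: nutch-site.xml uses the standard
Hadoop-style configuration format with a <configuration> root element, and it
normally lives in Nutch's conf/ directory (adjust for your install); the
<description> elements are optional, so I've replaced them with comments.

<?xml version="1.0"?>
<configuration>
  <!-- Don't truncate fetched pages, no matter how large. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <!-- Process every outlink found on a page instead of only the first 100. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>

Keep in mind that pages already fetched under the old limit will stay truncated
in their segments; you'll need to re-fetch for the change to take effect.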
I hope this helps!
Ann