Hi,

On Mon, Jan 12, 2009 at 7:03 PM, ahammad <[email protected]> wrote:
>
> I just started using Nutch to crawl an intranet site. In my urls file, I have
> a single link that refers to a jhtml page, which contains roughly 2000 links
> in it. The links contain characters like '?' and '=', so I removed the
> following from the crawl-urlfilter.txt file:
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> I finally got the crawl to work, but I only see 111 results under "TOTAL
> urls:" when I run the following command:
>
> bin/nutch readdb crawlTest/crawldb -stats
>
> I'm not sure where to look at this point. Any ideas?
>
> BTW what's the command that dumps all the links? Every one that I found
> online doesn't work...
>
Nutch only considers the first 100 links from a page by default. You can
change this with the following property:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

You can dump the inverted links with the "readlinkdb" command. To see all the
links from a page you can do:

bin/nutch readseg -get <segment> <url> -nocontent -nofetch -noparse -nogenerate -noparsetext
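For example (just a sketch, assuming the usual conf/ layout), the override goes
inside the <configuration> element of conf/nutch-site.xml; per the description
above, a negative value removes the limit entirely:

<!-- conf/nutch-site.xml: process every outlink on a page -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>

And assuming your crawl directory is crawlTest, as in the readdb command you
quoted, dumping the inverted links would look something like this (the URL in
the second form is only a placeholder):

# dump the whole linkdb as plain text under linkdump/
bin/nutch readlinkdb crawlTest/linkdb -dump linkdump

# or list the inlinks of a single page
bin/nutch readlinkdb crawlTest/linkdb -url http://intranet.example.com/page.jhtml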
--
Doğacan Güney