You can dump the linkdb and analyze where it differs.
My guess is that you have different URLs there because the "crawl" command uses
crawl-urlfilter.txt to filter URLs,
while the individual fetch tools use regex-urlfilter.txt.
So different filters apply.
I can't explain why it is this way; I didn't implement it, I have only run into
the difference myself.
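For illustration, the two filter files use the same syntax (one regex per line, prefixed with + to accept or - to reject), but typically ship with different defaults; the domain name below is a placeholder:

```
# crawl-urlfilter.txt (intranet crawl): usually restricted to your own domain
-^(file|ftp|mailto):
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
-.

# regex-urlfilter.txt (individual commands): usually accepts everything else
-^(file|ftp|mailto):
+.
```

If the accept/reject rules differ like this, the set of URLs that survive filtering differs too, and so does the resulting linkdb.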
How to dump the linkdb:
reinh...@thord:>bin/nutch readlinkdb
Usage: LinkDbReader <linkdb> (-dump <out_dir> | -url <url>)
-dump <out_dir> dump whole link db to a text file in <out_dir>
-url <url> print information about <url> to System.out
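To compare two linkdbs, dump each one with `bin/nutch readlinkdb <linkdb> -dump <dir>` and diff the URL columns of the resulting tab-separated text files. The paths and sample records below are hypothetical stand-ins for real dump output:

```shell
#!/bin/sh
# In practice you would first run, e.g.:
#   bin/nutch readlinkdb crawl1/linkdb -dump dump1
#   bin/nutch readlinkdb crawl2/linkdb -dump dump2
# Here, sample data stands in for the real dumps.
mkdir -p dump1 dump2
printf 'http://a.example/\tInlinks: 2\nhttp://b.example/\tInlinks: 1\n' > dump1/part-00000
printf 'http://a.example/\tInlinks: 2\nhttp://c.example/\tInlinks: 3\n' > dump2/part-00000

# Extract the URL column (tab-separated), deduplicate and sort for comm.
cut -f1 dump1/part-00000 | sort -u > urls1.txt
cut -f1 dump2/part-00000 | sort -u > urls2.txt

echo "only in linkdb 1:"
comm -23 urls1.txt urls2.txt
echo "only in linkdb 2:"
comm -13 urls1.txt urls2.txt
```

The URLs that show up on only one side are the ones to check against the two filter files.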
Hrishikesh Agashe wrote:
> Hi,
>
> I am observing that the size of the LinkDB is different when I do a run for the same
> URLs with the "crawl" command (intranet crawling) as compared to running the
> individual commands (inject, generate, fetch, invertlinks, etc., i.e. an
> Internet crawl).
> Are there any parameters that Nutch passes to invertlinks when running with
> the "crawl" option?
>
> TIA,
> --Hrishi
>