You can dump the linkdb and analyze where it differs.
My guess is that you end up with different URLs in there because the "crawl"
command uses crawl-urlfilter.txt to filter URLs, while the individual commands
use regex-urlfilter.txt, so the two runs apply different filters.
I can't explain why it works this way; I did not implement it, I have only
run into the difference myself.
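
A quick way to check is to diff the two filter files (the paths below are the
default conf locations, adjust them if your setup overrides them):

  diff conf/crawl-urlfilter.txt conf/regex-urlfilter.txt

If the patterns differ, the set of URLs admitted into the linkdb will differ too.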

How to dump the linkdb:

reinh...@thord:>bin/nutch readlinkdb
Usage: LinkDbReader <linkdb> (-dump <out_dir> | -url <url>)
        -dump <out_dir> dump whole link db to a text file in <out_dir>
        -url <url>      print information about <url> to System.out
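
For example, to dump the linkdb from both runs and compare them (the crawl
directory names below are just placeholders for your two runs, and I am
assuming the dump ends up in the usual part-* text files):

  bin/nutch readlinkdb crawl-cmd/linkdb -dump dump-cmd
  bin/nutch readlinkdb crawl-steps/linkdb -dump dump-steps
  sort dump-cmd/part-* > cmd.txt
  sort dump-steps/part-* > steps.txt
  diff cmd.txt steps.txt

The URLs that show up in only one of the dumps should point you to the filter
that dropped them.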




Hrishikesh Agashe schrieb:
> Hi,
>
> I am observing that the size of the LinkDB is different when I do a run for the same
> URLs with the "crawl" command (intranet crawling) compared to running the
> individual commands (inject, generate, fetch, invertlink, etc., i.e. an
> internet crawl).
> Are there any parameters that Nutch passes to invertlink when running with
> the "crawl" option?
>
> TIA,
> --Hrishi
>
