are you sure that you have used the same config?

in nutch-default.xml and nutch-site.xml you have, or may have, this
config property:

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

i'm only aware of the difference described below.
you may look into the Crawl.java code to check whether there are other
differences.

ok, i have done this now.
Crawl.java uses crawl-tool.xml as an additional config file,
and there i have (it is the default, i guess)

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>
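and in Crawl.java (from memory, so the exact lines may differ between
nutch versions) the extra file is pulled in with something like:

  // classes: org.apache.hadoop.conf.Configuration,
  //          org.apache.nutch.util.NutchConfiguration
  Configuration conf = NutchConfiguration.create();
  // crawl-tool.xml is layered on top of nutch-default.xml
  // and nutch-site.xml
  conf.addResource("crawl-tool.xml");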

this conforms to your observation: the "crawl" command does not ignore
internal links because of this additional crawl-tool.xml config option,
which seems to override nutch-default.xml and nutch-site.xml.
if you set this property to false in nutch-site.xml as well, both ways
of crawling should behave the same.
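concretely, a sketch of the override in nutch-site.xml (property name
and value taken from crawl-tool.xml above; the description text is
just mine):

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>Do not ignore links from the same host, so that
  internal (relative) links are also recorded in the linkdb. This
  matches the behaviour of the crawl tool.
  </description>
</property>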

reinhard



Hrishikesh Agashe wrote:
> Thanks Reinhard. I checked this, but both files are the same.
>
> Just to elaborate more, I am downloading images using Nutch, so I have 
> changed both files and removed jpg, gif, png etc. from the extensions to be 
> skipped. What I see is that if I use the "crawl" command, I get all image URLs 
> in LinkDB, but if I execute the commands separately, I see only absolute links 
> to images. All relative links are missing from LinkDB. (i.e. if an HTML page 
> has a URL like "http://www.abc.com/img/img.jpg" for an image, I can see it in 
> LinkDB in both cases, but if it has a URL like "/img/img.jpg" for an image, it 
> is missing from LinkDB when executing the commands separately.)
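>
> For reference, the suffix-skip line I edited looks roughly like this in
> the stock regex-urlfilter.txt / crawl-urlfilter.txt (the exact list may
> differ per Nutch version):
>
> # skip image and other binary suffixes (default)
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # after removing the image extensions:
> -\.(css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$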
>
> Any thoughts?
>
> TIA,
> --Hrishi
>
> -----Original Message-----
> From: reinhard schwab [mailto:[email protected]] 
> Sent: Tuesday, September 01, 2009 3:19 PM
> To: [email protected]
> Subject: Re: LinkDB size difference
>
> you can dump the linkdb and analyze where it differs.
> my guess is that you have different urls there because crawl uses
> crawl-urlfilter.txt to filter urls
> and fetch uses regex-urlfilter.txt.
> so different filters.
> i can't explain why. i have not implemented this. i have only experienced
> the difference myself.
>
> how to dump the linkdb:
>
> reinh...@thord:>bin/nutch readlinkdb
> Usage: LinkDbReader <linkdb> {-dump <out_dir> | -url <url>)
>         -dump <out_dir> dump whole link db to a text file in <out_dir>
>         -url <url>      print information about <url> to System.out
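>
> for example, to dump the whole linkdb to a text directory (paths here
> are just an example):
>
> bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump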
>
>
>
>
> Hrishikesh Agashe wrote:
>   
>> Hi,
>>
>> I am observing that the size of LinkDB is different when I do a run for the 
>> same URLs with the "crawl" command (intranet crawling) as compared to running 
>> the individual commands (like inject, generate, fetch, invertlinks etc., i.e. 
>> an Internet crawl).
>> Are there any parameters that Nutch passes to invertlinks while running with 
>> the "crawl" option?
>>
>> TIA,
>> --Hrishi
>>
