Are you sure that you have used the same config? In nutch-default.xml and nutch-site.xml you have, or may have, a config property:
<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from the
  same host are ignored. This is an effective way to limit the size of
  the link database, keeping only the highest quality links.
  </description>
</property>

I'm only aware of the difference described below. You may look into the Crawl.java code to check whether there are other differences.

OK, I have done this now. Crawl.java uses crawl-tool.xml as an additional config file, and there I have (it is the default, I guess):

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from the
  same host are ignored. This is an effective way to limit the size of
  the link database, keeping only the highest quality links.
  </description>
</property>

This is consistent with your observation: the "crawl" command does not ignore internal links because of this additional crawl-tool.xml config option, which seems to override nutch-default.xml and nutch-site.xml. If you add this property to nutch-site.xml, both should behave the same.
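For example (just a sketch, untested; the property name is the one from nutch-default.xml above, and the value is the one crawl-tool.xml uses):

<!-- conf/nutch-site.xml: keep internal links when running the commands
     step by step, as the "crawl" command effectively does via
     crawl-tool.xml -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>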
Reinhard

Hrishikesh Agashe wrote:
> Thanks Reinhard. I checked this, but both files are the same.
>
> Just to elaborate: I am downloading images using Nutch, so I have
> changed both files and removed jpg, gif, png etc. from the extensions
> to be skipped. What I see is that if I use the "crawl" command, I get
> all image URLs in the LinkDB, but if I execute the commands separately
> I see only absolute links to images. All relative links are missing
> from the LinkDB. (I.e. if an HTML page references an image by an
> absolute URL like "http://www.abc.com/img/img.jpg", I can see it in
> the LinkDB in both cases, but if it uses a relative URL like
> "/img/img.jpg", it is missing from the LinkDB when the commands are
> executed separately.)
>
> Any thoughts?
>
> TIA,
> --Hrishi
>
> -----Original Message-----
> From: reinhard schwab [mailto:[email protected]]
> Sent: Tuesday, September 01, 2009 3:19 PM
> To: [email protected]
> Subject: Re: LinkDB size difference
>
> You can dump the linkdb and analyze where it differs.
> My guess is that you have different URLs there, because "crawl" uses
> crawl-urlfilter.txt to filter URLs while fetch uses
> regex-urlfilter.txt, so different filters. I can't explain why; I have
> not implemented this, I have only experienced the difference myself.
>
> How to dump the linkdb:
>
> reinh...@thord:> bin/nutch readlinkdb
> Usage: LinkDbReader <linkdb> {-dump <out_dir> | -url <url>)
>         -dump <out_dir> dump whole link db to a text file in <out_dir>
>         -url <url>      print information about <url> to System.out
>
> Hrishikesh Agashe wrote:
>> Hi,
>>
>> I am observing that the size of the LinkDB is different when I do a
>> run for the same URLs with the "crawl" command (intranet crawling)
>> compared to running the individual commands (inject, generate, fetch,
>> invertlinks etc., i.e. internet crawl).
>> Are there any parameters that Nutch passes to invertlinks while
>> running with the "crawl" option?
>>
>> TIA,
>> --Hrishi
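A concrete form of the dump-and-compare step suggested above (an untested sketch; the two crawl directory names are made-up examples):

# dump both linkdbs to plain text and diff them
bin/nutch readlinkdb crawl-tool/linkdb -dump dump-tool
bin/nutch readlinkdb crawl-manual/linkdb -dump dump-manual
diff -r dump-tool dump-manual

Since -dump writes the linkdb as text files, an ordinary diff shows exactly which URLs the two runs disagree on.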
