As I was indexing about 600 sites with Nutch 0.9, I noticed that at least one of
them was returning fewer results than expected. That site was www.nrc.gov. As a
test I tried to index only the NRC site, allowing only internal links in the
"site.xml" conf. file, using "crawl-urlfilter.txt" with
"+^http://([a-z0-9]*\.)*www.nrc.gov/" and also "regex-urlfilter.txt" with
"+^http\:\/\/www\.nrc\.gov\/" (to avoid indexing the Google site, which was
being fetched when only crawl-urlfilter.txt was used).
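
For reference, the filter files looked roughly like the sketch below (a
reconstruction from memory; the comment lines and the final "-." catch-all rule
are based on the default conf files, not copied verbatim from mine):

# crawl-urlfilter.txt (sketch)
# accept anything under www.nrc.gov, skip everything else
+^http://([a-z0-9]*\.)*www.nrc.gov/
-.

# regex-urlfilter.txt (sketch)
# accept only URLs starting with http://www.nrc.gov/
+^http\:\/\/www\.nrc\.gov\/
-.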

I used the crawl command with a depth of 10, but when Nutch reached level 5 it
reported that there were no more URLs to fetch. The total number of URLs in the
crawldb was only 124.
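
For reference, the crawl was launched and the crawldb counted more or less like
this ("urls" and "crawl-nrc" are example names, not my exact paths):

# one-step crawl to a depth of 10
bin/nutch crawl urls -dir crawl-nrc -depth 10
# print crawldb statistics (total URL count, status breakdown)
bin/nutch readdb crawl-nrc/crawldb -stats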

When I checked nrc.gov/robots.txt, I found:

User-agent: *
Disallow: /acrs/
------
Disallow: /what-we-do/

So it seemed that robots.txt could be blocking the fetch of pages in a lot of
directories. But when I checked for a particular class of document on the NRC
site, using the query "nureg site:www.nrc.gov", I found about 11,000 results in
Google and about 7,000 in Gigablast.
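
In case it is useful, I suppose the crawldb could be dumped and grepped against
the Disallow entries to see which paths actually made it in (just a sketch;
"crawl-nrc" and "crawldb-dump" are example names):

# dump the crawldb as plain text and look for a disallowed path
bin/nutch readdb crawl-nrc/crawldb -dump crawldb-dump
grep -r "/what-we-do/" crawldb-dump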

 

So, I would like to get some help with this issue.

 

Thanks