While indexing about 600 sites with Nutch 0.9, I noticed that at least one of
them, www.nrc.gov, was returning fewer results than expected. As a
test I tried to index only the NRC site, allowing only internal links in the
"site.xml" conf file, using "crawl-urlfilter.txt" with
"+^http://([a-z0-9]*\.)*www.nrc.gov/" and also "regex-urlfilter.txt" with
"+^http\:\/\/www\.nrc\.gov\/" (the latter to avoid indexing the Google site,
which was still being fetched when I restricted only crawl-urlfilter.txt).
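For clarity, this is roughly what the relevant configuration looks like on my
side (as far as I understand, the one-step crawl command reads
crawl-urlfilter.txt while the individual tools read regex-urlfilter.txt; the
db.ignore.external.links property is my understanding of how the "internal
links only" restriction is normally set, so please correct me if that is not
the right knob):

  # conf/crawl-urlfilter.txt (used by the one-step crawl command)
  +^http://([a-z0-9]*\.)*www.nrc.gov/

  # conf/regex-urlfilter.txt (used by the step-by-step tools)
  +^http\:\/\/www\.nrc\.gov\/

  <!-- conf/nutch-site.xml -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>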
I used the crawl command with a depth of 10, but when Nutch reached level 5 it
reported that there were no more URLs to fetch. The total number of URLs in
the crawldb was only 124.
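In case it helps, this is roughly how the crawl was run and how I counted the
URLs (the directory name and -topN value below are just illustrative):

  bin/nutch crawl urls -dir crawl.nrc -depth 10 -topN 100000
  bin/nutch readdb crawl.nrc/crawldb -stats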
When I checked nrc.gov/robots.txt, I found:
User-agent: *
Disallow: /acrs/
------
------
Disallow: /what-we-do/
So it seemed that robots.txt could be blocking the fetch of pages in a lot of
directories. However, when I searched for a particular class of document on
the NRC site, using the query "nureg site:www.nrc.gov", I found about 11,000
results in Google and about 7,000 in Gigablast.
I would appreciate some help with this issue.
Thanks