Hi,
            I am trying to run the Nutch crawler for the first time and
am getting an exception whose cause I cannot track down. The details
follow.
 
The urls file contains:
 
http://facweb.iitkgp.ernet.in/
 
The conf/crawl-urlfilter.txt contains:
 
+^http://([a-z0-9]*\.)*iitkgp.ernet.in/
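 
(In case it matters: the dots in my pattern are unescaped, so "."
matches any character. I believe the stricter form would be the line
below, though I doubt this explains the failure, since both versions
match the seed URL:)
 
+^http://([a-z0-9]*\.)*iitkgp\.ernet\.in/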
 
The command I specified was:
 
bin/nutch crawl urls -dir crawl.my -depth 10
 
 
And here is the relevant part of the log, including the exception:
 
060321 164802 logging at INFO
060321 164802 fetching http://facweb.iitkgp.ernet.in/
060321 164802 http.proxy.host = 10.5.17.147
060321 164802 http.proxy.port = 8080
060321 164802 http.timeout = 100000
060321 164802 http.content.limit = 65536
060321 164802 http.agent = NutchCVS/0.7.1 (Nutch;
http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060321 164802 fetcher.server.delay = 1000
060321 164802 http.max.delays = 100
060321 164802 fetching http://facweb.iitkgp.ernet.in/robots.txt
060321 164802 fetched 1060 bytes from
http://facweb.iitkgp.ernet.in/robots.txt
060321 164812 fetch of http://facweb.iitkgp.ernet.in/ failed with:
java.lang.Exception: org.apache.nutch.protocol.http.HttpException:
java.net.UnknownHostException: facweb.iitkgp.ernet.in:
facweb.iitkgp.ernet.in
060321 164813 status: segment 20060321164801, 0 pages, 1 errors, 0
bytes, 11175 ms
060321 164813 status: 0.0 pages/s, 0.0 kb/s, NaN bytes/page
060321 164814 Updating /home/anindyac/crawl/nutch-0.7.1/crawl.my/db
060321 164814 Updating for
/home/anindyac/crawl/nutch-0.7.1/crawl.my/segments/20060321164801
060321 164814 Processing document 0
 
As can be seen, the fetch is failing. I also checked the log of the
proxy (at 10.5.17.147:8080). It contains:
[2006-03-21 16:49:14] 10.5.17.146 unknown Web GET
http://lucene.apache.org/robots.txt 404 Not Found
[2006-03-21 16:51:23] 10.5.17.146 unknown Web GET
http://facweb.iitkgp.ernet.in/robots.txt
[2006-03-21 16:51:23] 10.5.17.146 unknown Web GET
http://facweb.iitkgp.ernet.in/robots.txt Object not found!
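 
For completeness, the proxy itself is configured in my
conf/nutch-site.xml with entries along these lines (the http.proxy.*
values match what the Nutch log prints above):
 
  <!-- proxy used for HTTP fetches -->
  <property>
    <name>http.proxy.host</name>
    <value>10.5.17.147</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>8080</value>
  </property>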
 
Does that mean that a site cannot be crawled with Nutch if it does not
have a robots.txt file?
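 
One thing I have not been able to rule out is name resolution on the
crawl machine itself, since the underlying error is a
java.net.UnknownHostException. I plan to try a direct lookup from that
machine, something like:
 
nslookup facweb.iitkgp.ernet.in
 
If that fails as well, I suppose the machine can only reach hosts
through the proxy, and the question becomes why Nutch tries to resolve
the host at all when http.proxy.host is set.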
 
Note that the same thing happened when I tried to crawl the Nutch
website itself.
 
            Please tell me what to do in order to get Nutch going.
 
Thanks and Regards,
Anindya
 
