Re: Hosts File & Nutch 1.0+

Alex Tue, 19 Apr 2011 05:45:13 -0700

Dear Lewis:

According to the application developer, the reason Nutch does not crawl is either that our local environment (our server) is not able to search itself, or that there is a hosts file issue.

So currently, my hostname is different from my domain name, and that is what is causing the issue. When I do an nslookup from the server, it resolves to the correct IP address. My question was more about what I have to put in the hosts file for Nutch to crawl correctly.


Thank YOU for all your help!

Alex



On Apr 19, 2011, at 5:52 AM, McGibbney, Lewis John wrote:

Hi Alex,
By referring to the 'hosts' file, I assume you mean some seed list that you are providing to Nutch for a crawl?
If this is the case then I understand (and have experienced) what you are referring to. In the past, and in my case, I managed to trace the difference between host name and domain name to the fact that the actual page you are trying to fetch, e.g. the domain name, is a redirect from the original host name. There is a property in nutch-default which you can try adding to nutch-site which deals with HTTP redirects. I think I set the redirect to a value of >2 or >3 and it was at this stage that Nutch began to fetch more URLs from the initial seed list.
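As a rough sketch of that suggestion: the redirect property in nutch-default.xml is, as far as I recall, http.redirect.max, whose default of 0 means the fetcher records redirect targets for a later round rather than following them immediately. Assuming that is the property meant here, an override in conf/nutch-site.xml could look like the following. The value 3 is only an illustration, not a recommendation.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: http.redirect.max is the redirect property referred to above. -->
  <property>
    <name>http.redirect.max</name>
    <!-- Illustrative value: follow up to 3 redirects within the same fetch. -->
    <value>3</value>
  </property>
</configuration>
```

Any properties already present in nutch-site.xml would stay inside the same configuration element alongside this one.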
Can you please try the above and post whether this is the solution. If it is not, then I am unsure; I could maybe try crawling the domain you are referring to if you would post it.
HTH

Lewis
________________________________________
From: Alex [[email protected]]
Sent: 19 April 2011 05:07
To: [email protected]
Subject: Re: Hosts File  & Nutch 1.0+

Can anyone help me here?  Or, am I asking in the wrong place?

On Apr 14, 2011, at 9:57 PM, Alex wrote:
Hi,

I am new to Nutch.  I have an application that uses Nutch to
search.  I have configured the application so that Nutch can run.
However, after a lot of troubleshooting I have been pointed to the
fact that there is something wrong with my hosts file.  My hostname
is different from my domain name and that "seems" to make Nutch stop
at depth 1.  Does anyone know the correct configuration of the
hosts file so that Nutch runs properly?

My domain name resolves fine.  Please help me!

Here are the logs of the indexing:

Stopping at depth=1 - no more URLs to fetch.

INFO sitesearch.CrawlerUtil: indexHost : Starting an Site Search index on host www.mydomain.com
INFO sitesearch.CrawlerUtil: site search crawl started in: /path/to/search_index/www.mydomain.com/1-XXX_temp/crawl-index
INFO sitesearch.CrawlerUtil: rootUrlDir = /path/to/directory/search_index/www.mydomain.com/url_folder
INFO sitesearch.CrawlerUtil: threads = 10
INFO sitesearch.CrawlerUtil: depth = 20
INFO sitesearch.CrawlerUtil: indexer=lucene

INFO sitesearch.CrawlerUtil: Stopping at depth=1 - no more URLs to fetch.
INFO sitesearch.CrawlerUtil: site search crawl finished: /directorypath/search_index/www.mydomain.com/1xxx/crawl-index
INFO sitesearch.CrawlerUtil: indexHost : Finished Site Search index on host www.mydomain.com
