RE: Hosts File & Nutch 1.0+

McGibbney, Lewis John Tue, 19 Apr 2011 04:53:05 -0700

Hi Alex,

By referring to 'hosts' file, I assume you mean some seed list that you are 
providing to Nutch for a crawl?


If this is the case then I understand (and have experienced) what you are 
referring to. In my case, I traced the mismatch between host name and domain 
name to the fact that the page you are actually trying to fetch (the domain 
name) is a redirect from the original host name. There is a property in 
nutch-default.xml that deals with HTTP redirects, which you can try overriding 
in nutch-site.xml. I think I set the redirect limit to a value above 2 or 3, 
and it was at this stage that Nutch began to fetch more URLs from the initial 
seed list.
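
As a rough sketch of what that override might look like: I believe the 
property in question is http.redirect.max, but please confirm the name against 
your own nutch-default.xml. The value of 3 below is only an illustration, not 
a recommendation. As far as I know the default is 0, in which case the fetcher 
does not follow redirects immediately but records them for a later fetch round.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides for values in nutch-default.xml -->
<configuration>
  <property>
    <!-- Assumed property name; confirm it in your nutch-default.xml -->
    <name>http.redirect.max</name>
    <!-- Illustrative value only: follow up to 3 redirects per fetch -->
    <value>3</value>
  </property>
</configuration>
```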

Could you please try the above and post whether it solves the problem? If it 
does not, then I am unsure, but I could try crawling the domain you are 
referring to if you post it.

HTH

Lewis
________________________________________
From: Alex [[email protected]]
Sent: 19 April 2011 05:07
To: [email protected]
Subject: Re: Hosts File  & Nutch 1.0+

Can anyone help me here?  Or, am I asking in the wrong place?

On Apr 14, 2011, at 9:57 PM, Alex wrote:

> Hi,
>
> I am new to Nutch.  I have an application that uses Nutch to
> search.  I have configured the application so that Nutch can run.
> However, after a lot of troubleshooting I have been pointed to the
> fact that there is something wrong with my hosts file.  My hostname
> is different from my domain name and that "seems" to make Nutch stop
> at depth 1.  Does anyone have any idea what the correct
> configuration of the hosts file is so that Nutch runs properly?
>
> My domain name resolves fine.  Please help me!
>
> Here are the logs of the indexing:
>
> Stopping at depth=1 - no more URLs to fetch.
>
> INFO sitesearch.CrawlerUtil: indexHost : Starting an Site Search
> index on host www.mydomain.com
> INFO sitesearch.CrawlerUtil: site search crawl started in: /path/to/
> search_index/www.mydomain.com/1-XXX_temp/crawl-index
> INFO sitesearch.CrawlerUtil: rootUrlDir = /path/to/directory/
> search_index/www.mydomain.com/url_folder
> INFO sitesearch.CrawlerUtil: threads = 10
> INFO sitesearch.CrawlerUtil: depth = 20
> INFO sitesearch.CrawlerUtil: indexer=lucene
>
> INFO sitesearch.CrawlerUtil: Stopping at depth=1 - no more URLs to
> fetch.
> INFO sitesearch.CrawlerUtil: site search crawl finished: /
> directorypath/search_index/www.mydomain.com/1xxx/crawl-index
> INFO sitesearch.CrawlerUtil: indexHost : Finished Site Search index
> on host www.mydomain.com


