RE: Crawling process - Fetching

McGibbney, Lewis John Tue, 31 May 2011 02:48:50 -0700

In your original message on this topic you mentioned that you appear to be 
using a custom crawl script, and that you were getting about 2/5 of the seed 
urls. If the fetch intervals shown in the log below have not been exceeded, 
then it is unlikely that you will get a re-fetch of these exact urls, but you 
would be very likely to get outlink (or inlink) urls, depending on the 
configuration in nutch-site. Is it possible that you have been experimenting 
with the default options and have asked Nutch to behave irregularly?
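
To put numbers on those intervals, here is a rough sketch in Python. It is 
only a simplified model of the idea, not Nutch's actual scheduling code. The 
two interval values are copied from the log quoted below; the function name 
and the example dates are made up for illustration.

```python
from datetime import datetime, timedelta

# Interval values in seconds, copied from the quoted log output.
DEFAULT_INTERVAL_S = 2592000
MAX_INTERVAL_S = 7776000

SECONDS_PER_DAY = 86400


def due_for_refetch(last_fetch, now, interval_s=DEFAULT_INTERVAL_S):
    """Hypothetical helper: a url is treated as due for re-fetch once
    interval_s seconds have passed since it was last fetched."""
    return now >= last_fetch + timedelta(seconds=interval_s)


# Convert the logged intervals to days.
print(DEFAULT_INTERVAL_S / SECONDS_PER_DAY)  # 30.0
print(MAX_INTERVAL_S / SECONDS_PER_DAY)      # 90.0

# Assumed example: a url fetched the day before the crawl in the log.
last_fetch = datetime(2011, 5, 29, 12, 0, 0)
print(due_for_refetch(last_fetch, datetime(2011, 5, 30, 11, 21, 25)))  # False
print(due_for_refetch(last_fetch, datetime(2011, 6, 29, 12, 0, 0)))    # True
```

In other words, under this simplified model a url fetched recently would not 
be selected again until roughly 30 days had passed.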


Personally I would try your crawl script with a fresh nutch-site/nutch-default 
file; after that you can go back and fine-tune it the way you wish.

Finally, and this is merely a suggestion, is it possible that the urls you are 
fetching have page redirects? If so, this will need to be accounted for in 
your nutch-site configuration.

HTH
Lewis
---------------------------------------------------------------------------------------
2011-05-30 11:21:25,148 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2011-05-30 11:21:25,149 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2011-05-30 11:21:25,589 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...


I have started a new crawl and, despite there being 5 domains in the "seed 
file", Nutch hasn't discovered any urls for fetching.

-----
Regards,
Jotta

PS. Sorry for my English :)
