Hi Jotta,

Do you have any log information you could post? If you could set the
http.verbose and fetcher.verbose properties to true and then provide your log
data, it would help greatly.
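
Once you have the log, a quick per-host tally of the fetch lines would show
which of your sites are actually being fetched. This is only a rough sketch:
hosts_fetched is a made-up helper name, and it assumes the fetcher writes lines
containing "fetching <url>" to a log file such as logs/hadoop.log, so adjust
the pattern and path to whatever your setup produces.

```shell
# Hypothetical helper: count "fetching <url>" log lines per host.
# $1 is the path to the log file (assumed here to be logs/hadoop.log).
hosts_fetched() {
  grep -oE 'fetching https?://[^/[:space:]]+' "$1" \
    | sed -E 's#^fetching https?://##' \
    | sort | uniq -c | sort -rn
}
```

Running hosts_fetched logs/hadoop.log prints one line per host with the number
of matching log lines in front. Any seed site that is missing from that list
was never handed to the fetcher.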
________________________________________
From: jotta [[email protected]]
Sent: 26 May 2011 09:49
To: [email protected]
Subject: Re: Crawling process - Fetching

Hi again!
I have another problem with injecting URLs for fetching.
I'm crawling several sites (about 6) at the same time, but only one or two
of them are processed by Nutch. The rest are omitted, and when that one
site is finished, Nutch won't inject any further URLs...

I'm using this script for crawling:
while [[ $i -lt $depth ]]
do
  echo
  echo "inject urls"
  bin/nutch inject crawl/crawldb $seedsDir

  echo "generate-fetch-updatedb-invertlinks-solrindex iteration "$i":"

  cmd="bin/nutch generate crawl/crawldb crawl/segments -topN 500"
  output=`$cmd`

  if [[ $output == *'0 records selected for fetching'* ]]
    then
    break;
  fi

  s1=`ls -d crawl/segments/2* | tail -1`

  bin/nutch fetch $s1
  bin/nutch updatedb crawl/crawldb $s1 -filter -normalize
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch solrindex http://127.0.0.1:8080/crawltest/ crawl/crawldb \
    crawl/linkdb $s1
  rm -r $s1
  ((i++))
done

I keep the domains to inject in a text file.
I'm also using regex-urlfilter.txt to allow URLs in these domains.

-----
Regards,
Jotta

PS. Sorry for my English :)