Hi again!
I have another problem with injecting URLs for fetching.
I'm crawling a couple of sites (about 6) at the same time, but only one or two
of them are processed by Nutch. The rest are omitted, and once that one
site is finished, Nutch won't inject the other URLs...
I'm using this script for crawling:
# $i, $depth and $seedsDir are set earlier in the script
while [[ $i -lt $depth ]]
do
    echo
    echo "inject urls"
    bin/nutch inject crawl/crawldb $seedsDir
    echo "generate-fetch-updatedb-invertlinks-solrindex iteration $i:"
    cmd="bin/nutch generate crawl/crawldb crawl/segments -topN 500"
    output=`$cmd`
    if [[ $output == *'0 records selected for fetching'* ]]
    then
        break
    fi
    s1=`ls -d crawl/segments/2* | tail -1`
    bin/nutch fetch $s1
    bin/nutch updatedb crawl/crawldb $s1 -filter -normalize
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch solrindex http://127.0.0.1:8080/crawltest/ crawl/crawldb crawl/linkdb $s1
    rm -r $s1
    ((i++))
done
I keep the domains to inject in a txt file.
I'm also using regex-urlfilter.txt to allow only URLs in these domains.
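For reference, the two files look roughly like this (the example.com names are placeholders, not my real domains):

```
# seeds file passed to bin/nutch inject (one URL per line)
http://www.example1.com/
http://www.example2.com/

# regex-urlfilter.txt: "+" accepts, "-" rejects, first match wins
+^http://([a-z0-9]*\.)*example1\.com/
+^http://([a-z0-9]*\.)*example2\.com/
# reject everything else
-.
```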
-----
Regards,
Jotta
PS. Sorry for my English :)
--
View this message in context:
http://lucene.472066.n3.nabble.com/Crawling-process-Fetching-tp2873786p2987988.html
Sent from the Nutch - User mailing list archive at Nabble.com.