Hi Jotta,
Do you have any log information you could post? If you could set the
http.verbose and fetcher.verbose properties to true and then provide your log
data, it would greatly help.
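As a rough sketch of what that could look like, assuming the overrides go in conf/nutch-site.xml (the usual place for local settings that override nutch-default.xml), the two properties named above might be set like this. The file location is an assumption about your setup; only the property names come from this message.

```xml
<!-- conf/nutch-site.xml (assumed location): local overrides for nutch-default.xml -->
<configuration>
  <property>
    <name>http.verbose</name>
    <value>true</value>
  </property>
  <property>
    <name>fetcher.verbose</name>
    <value>true</value>
  </property>
</configuration>
```

In a default local setup the extra output typically ends up in logs/hadoop.log, though that depends on how logging is configured.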
________________________________________
From: jotta [[email protected]]
Sent: 26 May 2011 09:49
To: [email protected]
Subject: Re: Crawling process - Fetching
Hi again!
I have another problem with injecting URLs for fetching.
I'm crawling a couple of sites (about 6) at the same time, but only one or two
of them are processed by Nutch. The rest are omitted, and when that one
site is finished, Nutch won't inject any more URLs...
I'm using this script for crawling:
while [[ $i -lt $depth ]]
do
    echo
    echo "inject urls"
    bin/nutch inject crawl/crawldb "$seedsDir"
    echo "generate-fetch-updatedb-invertlinks-solrindex iteration $i:"
    # Capture generate's output so an empty fetch list can be detected.
    output=$(bin/nutch generate crawl/crawldb crawl/segments -topN 500)
    if [[ $output == *'0 records selected for fetching'* ]]
    then
        break
    fi
    # Pick the newest segment (segment names are timestamps, so it sorts last).
    s1=$(ls -d crawl/segments/2* | tail -1)
    bin/nutch fetch "$s1"
    bin/nutch updatedb crawl/crawldb "$s1" -filter -normalize
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch solrindex http://127.0.0.1:8080/crawltest/ crawl/crawldb \
        crawl/linkdb "$s1"
    rm -r "$s1"
    ((i++))
done
I keep the domains to inject in a txt file.
I'm also using regex-urlfilter.txt to allow URLs in these domains.
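
One quick sanity check for a setup like this is to confirm that every site really appears in the seed list that gets injected. The following is only a minimal sketch: count_seed_hosts is a hypothetical helper (not part of Nutch), and it assumes the seed file, or every file in the seed directory, holds one URL per line.

```shell
#!/usr/bin/env bash
# count_seed_hosts: hypothetical helper, not part of Nutch.
# Prints "<count> <host>" for every host found in a seed file, or in all
# files of a seed directory, sorted by count (descending) and then by host.
count_seed_hosts() {
    local src=$1
    local files=("$src")
    if [[ -d $src ]]; then
        files=("$src"/*)
    fi
    # Split each line on "/": for "scheme://host[:port]/path" the third
    # field is "host[:port]". Lines that do not start with a scheme are skipped.
    awk -F/ '
        /^[a-zA-Z][a-zA-Z0-9+.-]*:\/\// {
            host = tolower($3)
            sub(/:.*/, "", host)   # drop an optional port
            count[host]++
        }
        END { for (h in count) print count[h], h }
    ' "${files[@]}" | sort -k1,1nr -k2,2
}
```

Calling count_seed_hosts "$seedsDir" lists the hosts the script above would inject. Comparing that list with the status counts from bin/nutch readdb crawl/crawldb -stats may help show whether the missing sites were never injected or were injected but not selected for fetching.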
-----
Regards,
Jotta
PS. Sorry for my English :)