Hi again!
I have another problem with injecting URLs for fetching.
I'm crawling a couple of sites (about 6) at the same time, but only one or two
of them are processed by Nutch. The rest are omitted, and once that one
site is finished, Nutch won't inject the other URLs...
I'm using this script for crawling:
# $i, $depth and $seedsDir are set earlier in the script
while [[ $i -lt $depth ]]
do
    echo
    echo "inject urls"
    bin/nutch inject crawl/crawldb $seedsDir
    echo "generate-fetch-updatedb-invertlinks-solrindex iteration $i:"
    cmd="bin/nutch generate crawl/crawldb crawl/segments -topN 500"
    output=`$cmd`
    if [[ $output == *'0 records selected for fetching'* ]]
    then
        break
    fi
    s1=`ls -d crawl/segments/2* | tail -1`
    bin/nutch fetch $s1
    bin/nutch updatedb crawl/crawldb $s1 -filter -normalize
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch solrindex http://127.0.0.1:8080/crawltest/ crawl/crawldb crawl/linkdb $s1
    rm -r $s1
    ((i++))
done
I keep the domains to inject in a txt file.
I'm also using regex-urlfilter.txt to allow only URLs in these domains.
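For reference, the two files look roughly like this (the example.com names are placeholders, not my real domains):

```
# seeds file passed to bin/nutch inject (one URL per line)
http://www.example1.com/
http://www.example2.com/

# regex-urlfilter.txt: "+" accepts, "-" rejects, first match wins
+^http://([a-z0-9]*\.)*example1\.com/
+^http://([a-z0-9]*\.)*example2\.com/
# reject everything else
-.
```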
-----
Regards,
Jotta
PS. Sorry for my English :)
--
View this message in context:
http://lucene.472066.n3.nabble.com/Crawling-process-Fetching-tp2873786p2987988.html
Sent from the Nutch - User mailing list archive at Nabble.com.