Hi,

I'm using Nutch 1.5. My goal is to crawl every URL in a domain. I also
want to index everything with Solr but, since it is a very big domain,
instead of indexing only at the end of the process I would like to call
the Solr indexing command every X URLs (for example, every 10000 URLs).
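
What I have in mind is roughly this (just a sketch of the counter idea;
the batch size and the loop values are placeholders, and the echo stands
in for the real indexing commands):

#!/bin/bash
# sketch only: run the indexing step once every BATCH fetched URLs
BATCH=10000
fetched=0
for crawled in 4000 5000 6000 7000; do  # stand-ins for the real fetch round sizes
    fetched=$((fetched + crawled))
    if [ "$fetched" -ge "$BATCH" ]; then
        echo "run invertlinks + solrindex here"
        fetched=$((fetched - BATCH))
    fi
done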

So far, this is the script I have been able to put together:

#!/bin/bash
# inject the initial seed into the crawldb
bin/nutch inject test/crawldb urls

# initialize the variables
counter=1   # meant for batching the Solr calls (currently unused)
error=0

# loop until generate reports no more URLs
while [ $error -ne 1 ]
do
    # generate a new segment with up to 10000 URLs
    echo "[ Script ] Starting generate phase"
    bin/nutch generate test/crawldb test/segments -topN 10000
    if [ $? -ne 0 ]; then
        echo "[ Script ] Stopping: no more URLs to fetch."
        error=1
        break
    fi

    # pick the newest segment
    segment=$(ls -d test/segments/2* | tail -1)

    # fetch phase
    echo "[ Script ] Starting fetch phase"
    bin/nutch fetch "$segment" -threads 20
    if [ $? -ne 0 ]; then
        echo "[ Script ] Fetch of $segment failed. Deleting it."
        rm -rf "$segment"
        continue
    fi

    # parse phase
    echo "[ Script ] Starting parse phase"
    bin/nutch parse "$segment"

    # updatedb phase
    echo "[ Script ] Starting updatedb phase"
    bin/nutch updatedb test/crawldb "$segment"

    # index into Solr
    bin/nutch invertlinks test/linkdb -dir test/segments
    bin/nutch solrindex http://127.0.0.1:8983/solr/ test/crawldb \
        -linkdb test/linkdb test/segments/*
done


but it does not seem to work. Testing on the apache.org domain, crawling
with the single command:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 20

gives me more URLs than the script does (command: 1676, script: 1658).
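
(The totals above can be read from the crawldb with the standard readdb
tool, e.g.:

bin/nutch readdb test/crawldb -stats

which prints a "TOTAL urls" count among its statistics.)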
Can anyone tell me what's wrong with my script? Is there a better way to
solve my problem?

Thanks,

Matteo
