Hi, I'm using Nutch version 1.5. My goal is to crawl every URL in a domain. I also want to index everything with Solr but, instead of doing that at the end of the process, since it is a very large domain, I would like to call the Solr indexing command every X URLs (for example, every 10000 URLs).
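To check that each batch actually reaches the index, my plan is to query Solr's document count after every indexing call, roughly like this (the URL matches my local Solr instance; the numFound value in the response is what I would compare between batches):

curl 'http://127.0.0.1:8983/solr/select?q=*:*&rows=0&wt=json'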
So far, this is the script I have come up with:

#!/bin/bash

# inject the initial seed into the crawlDB
bin/nutch inject test/crawldb urls

# initialization of the variables
counter=1
error=0

# loop until an error occurs
while [ $error -ne 1 ]
do
    # generate a segment of at most 10000 URLs
    echo "[ Script ] Starting generating phase"
    bin/nutch generate test/crawldb test/segments -topN 10000
    if [ $? -ne 0 ]
    then
        echo "[ Script ] Stopping: no more URLs to fetch."
        error=1
        break
    fi
    # pick the most recently generated segment
    segment=$(ls -d test/segments/2* | tail -1)

    # fetching phase
    echo "[ Script ] Starting fetching phase"
    bin/nutch fetch $segment -threads 20
    if [ $? -ne 0 ]
    then
        echo "[ Script ] Fetch of $segment failed. Deleting it."
        rm -rf $segment
        continue
    fi

    # parsing phase
    echo "[ Script ] Starting parsing phase"
    bin/nutch parse $segment

    # updateDB phase
    echo "[ Script ] Starting updateDB phase"
    bin/nutch updatedb test/crawldb $segment

    # indexing with Solr
    bin/nutch invertlinks test/linkdb -dir test/segments
    bin/nutch solrindex http://127.0.0.1:8983/solr/ test/crawldb -linkdb test/linkdb test/segments/*
done

However, it does not seem to work correctly. When I instead crawl with the single command:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 20

and test both on the apache.org domain, the command indexes more URLs than the script (command: 1676, script: 1658).

Can anyone tell me what is wrong with my script? Is there a better way to solve my problem?

Thanks,
Matteo
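P.S. For the indexing step, I was also considering inverting links and indexing only the newest segment on each pass, instead of re-sending everything under test/segments/* every time. This is only a sketch and I have not verified that solrindex can be called incrementally like this, but it would replace the last two lines of the loop body:

    # invert links and index just the segment fetched in this pass
    bin/nutch invertlinks test/linkdb $segment
    bin/nutch solrindex http://127.0.0.1:8983/solr/ test/crawldb -linkdb test/linkdb $segment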