RE: Crawl a whole domain with indicization

Markus Jelsma Wed, 29 Aug 2012 15:20:07 -0700

There is nothing wrong with your script, but how many URLs are generated 
depends on your data. The difference between your script and the crawl command 
(the two are almost identical) could also be explained by the state of your 
CrawlDb.
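
One way to look at the state of a CrawlDb is to print its statistics and compare the status counts between the two runs. The following is only a minimal sketch: the `NUTCH` variable and the `crawldb_stats` helper are hypothetical names introduced for illustration, and the CrawlDb path is assumed to be the one used in the script.

```shell
#!/bin/bash
# Path to the nutch launcher. NUTCH is a hypothetical variable used only
# in this sketch; override it if the launcher lives elsewhere.
NUTCH="${NUTCH:-bin/nutch}"

# Print the statistics of one CrawlDb with the readdb tool, so that the
# per-status counts of two crawls can be compared side by side.
crawldb_stats() {
    local crawldb="$1"
    "$NUTCH" readdb "$crawldb" -stats
}
```

For example, `crawldb_stats test/crawldb` reports on the CrawlDb built by the script. Running it against the CrawlDb written by the crawl command as well shows whether the two runs differ in fetched or unfetched counts.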



 
 
-----Original message-----
> From:Matteo Simoncini <sicc...@gmail.com>
> Sent: Thu 30-Aug-2012 00:16
> To: user@nutch.apache.org
> Subject: Crawl a whole domain with indicization
> 
> Hi,
> 
> I'm using Nutch version 1.5. My goal is to crawl every URL in a domain.
> I also want to index everything using Solr but, since it is a very big
> domain, instead of doing that at the end of the process I would like to
> run the Solr indexing command every X URLs (for example, every 10000
> URLs).
> 
> So far, all I have been able to come up with is this script:
> 
> #!/bin/bash
> # inject the initial seed into crawlDB
> bin/nutch inject test/crawldb urls
> 
> # initialization of the variables
> counter=1
> error=0
> 
> # loop until generate reports that there is nothing left to fetch
> while [ $error -ne 1 ]
> do
>     # generating phase: select up to 10000 URLs to fetch
>     echo [ Script ] Starting generating phase
>     bin/nutch generate test/crawldb test/segments -topN 10000
>     if [ $? -ne 0 ]
>     then
>         echo [ Script ] Stopping: No more URLs to fetch.
>         error=1
>         break
>     fi
>     segment=`ls -d test/segments/2* | tail -1`
> 
>     # fetching phase
>     echo [ Script ] Starting fetching phase
>     bin/nutch fetch $segment -threads 20
>     if [ $? -ne 0 ]
>     then
>         echo [ Script ] Fetch $segment failed. Deleting it.
>         rm -rf $segment
>         continue
>     fi
> 
>     # parsing phase
>     echo [ Script ] Starting parsing phase
>     bin/nutch parse $segment
> 
>     # updateDB phase
>     echo [ Script ] Starting updateDB phase
>     bin/nutch updatedb test/crawldb $segment
> 
>     # indexing with Solr
>     bin/nutch invertlinks test/linkdb -dir test/segments
>     bin/nutch solrindex http://127.0.0.1:8983/solr/ test/crawldb \
>         -linkdb test/linkdb test/segments/*
> done
> 
> 
> but it does not seem to work. In fact, crawling with the command:
> 
> bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 20
> 
> 
> and testing on the apache.org domain, I get more URLs than with the script
> (command: 1676, script: 1658).
> Can anyone tell me what's wrong with my script? Is there a better way to
> solve my problem?
> 
> Thanks,
> 
> Matteo
>
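
The quoted script sends every segment under test/segments to Solr on each pass through the loop. A possible variation, shown here only as a sketch, is to pass just the segment that was fetched in the current iteration, so that each batch is indexed once. The `index_segment` helper and the `NUTCH` and `SOLR_URL` variables are hypothetical names, not part of Nutch; the paths and the Solr URL are the ones from the script.

```shell
#!/bin/bash
# Hypothetical settings for this sketch; the defaults mirror the script.
NUTCH="${NUTCH:-bin/nutch}"
SOLR_URL="${SOLR_URL:-http://127.0.0.1:8983/solr/}"

# Update the LinkDb from a single segment, then send only that segment
# to Solr. Stops early if the LinkDb update fails.
index_segment() {
    local crawldb="$1"
    local linkdb="$2"
    local segment="$3"
    "$NUTCH" invertlinks "$linkdb" "$segment" || return 1
    "$NUTCH" solrindex "$SOLR_URL" "$crawldb" -linkdb "$linkdb" "$segment"
}
```

Inside the loop this would be called as `index_segment test/crawldb test/linkdb "$segment"` after the updatedb step. Whether it changes the number of indexed documents depends on the data, so the totals are worth checking against a full run.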
