On Thursday 22 March 2012 14:10:41 webdev1977 wrote:
> Thanks for the quick response Markus!
>
> How would that fit into this continuous crawling scenario? (I am trying to
> get the updates as quickly as possible into Solr :-)
>
> If I am doing the generate --> fetch $SEGMENT --> parse $SEGMENT -->
> updatedb crawldb $SEGMENT --> solrindex --> solrdedup cycle, and I am
> generating an "on the fly" segment and I just happen to be generating it
> (and not done) when the updatedb command runs (changing it to the -dir
> option), isn't that bad?
You can just fetch and parse that tiny segment and have it updated in the
crawldb together with another segment; you don't have to update with only
one segment. -dir is ok, but you can also list the segments (rough sketch
at the bottom of this mail).

> Has anyone tested the mergedb command with potentially hundreds and
> hundreds of dbs to merge (one per changed url)?

I wouldn't try that. More scripting and locking horror, and it's an I/O
consumer.

> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/crawl-and-update-one-url-already-in-crawldb-tp3848358p3848423.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Markus Jelsma - CTO - Openindex
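
For what it's worth, one pass of such a loop could look roughly like this.
This is an untested sketch: it assumes a Nutch 1.x local runtime, the paths
and the Solr URL are placeholders, and the exact solrindex arguments differ
between versions.

  #!/bin/sh
  # Untested sketch of one pass of the loop (Nutch 1.x local runtime).
  # Paths and the Solr URL are placeholders.
  CRAWLDB=crawl/crawldb
  SEGMENTS=crawl/segments
  SOLR=http://localhost:8983/solr

  bin/nutch generate $CRAWLDB $SEGMENTS -topN 1000
  SEGMENT=$SEGMENTS/`ls $SEGMENTS | sort | tail -1`   # newest segment dir

  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT

  # updatedb accepts several segments, either listed or via -dir, so a tiny
  # "on the fly" segment can be folded into the crawldb together with others
  bin/nutch updatedb $CRAWLDB -dir $SEGMENTS

  bin/nutch invertlinks crawl/linkdb -dir $SEGMENTS
  # exact solrindex arguments differ between Nutch versions
  bin/nutch solrindex $SOLR $CRAWLDB crawl/linkdb $SEGMENT
  bin/nutch solrdedup $SOLR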