Re: sanity check - large, long running index updates and concurrent read-only service

Yonik Seeley Tue, 10 May 2005 18:54:52 -0700

Could you explain why you need to copy the index?  It doesn't seem
like that buys you anything (except maybe if the copy is to a
physically separate disk)


-Yonik


On 5/10/05, Naomi Dushay <[EMAIL PROTECTED]> wrote:
> Context:  our index is currently around 6 gig and takes about an hour just to
> optimize.  Updating it, even in batches, can involve active updating for 15
> or more minutes.
> 
> Index updates are done with two different batch processes as there are
> currently two different workflows to update the index.  No obvious indexing
> partitioning is suggested by our workflows.** The index is used in a read
> only fashion by our REST search service, which runs under tomcat.
> 
> Issues:  we don't want our REST service to have slow or strange results while
> the index is being updated.
> 
> Proposed solution:
> 
> The world starts with REST service pointing to index A .   Our REST service
> is read only, so the index file(s) themselves can be read only.
> 
> To update the index, the following sequence occurs:
> 
> 1.      lock index for update (via a lock/flag file)
> 2.      copy index A into another directory, and make it writable.  This new
> index will become A'
> 3.      update index A', including optimization
> 4.      set index A' to read only
> 5.      gracefully change REST service to point use index A';  verify that
> REST service is working properly with A'.
> 6.      optional, but likely:  remove old, out-of-date index A.
> 7.      unlock index for update (remove lock/flag file)
> 
> The index updates could keep re-using the same two index directories, or they
> could create new directories.
> 
> Does anyone see any problems or have any suggestions for how to improve this?
> 
> - Naomi Dushay
> 
> National Science Digital Library - Core Integration
> 
> Cornell University
> 
> ** workflow 1:  we get bibliographic metadata that needs to go in the index.
> This metadata may be new, or it may be an update to existing metadata.  The
> metadata records refer to resource URLs.   Workflow 2:  we fetch the content
> for the resource URLs, using Nutch as our crawler and content database.
> Fetched content may be for a new URL or updated content for a known URL.  The
> Lucene Documents are a combination of bibliographic metadata and fetched
> content.
> 
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: sanity check - large, long running index updates and concurrent read-only service

Reply via email to