RE: Managing a large archival (and constantly changing) database

James Pine Thu, 06 Jul 2006 12:09:42 -0700

Hey,

I found this thread to be very useful when deciding
upon an indexing strategy.


http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12700.html

The system I work on has 3 million or so documents and
it was (until a non-Lucene performance issue came up)
set up to add/delete documents every 15 minutes, in
a manner similar to the one described in the thread. We
were adding/deleting a few thousand documents every 15
minutes, during peak traffic. We have a dedicated
indexing machine and distribute portions of our index
across multiple machines, but you could still follow
the pattern all on one box, just with separate
processes/threads.

Even though Lucene allows certain types of index
operations to happen concurrently with search
activity, IMHO, if you can decouple the indexing
process from the searching process, your system as a
whole will be more flexible and scalable, with only a
little extra maintenance overhead.
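
As a rough illustration of that decoupling on a single box, here is a
minimal Python sketch (not Lucene code). It assumes the indexer builds
each new index generation in its own directory and then publishes it by
atomically replacing a small pointer file that the search process reads.
The names publish_generation, current_generation, CURRENT and the
build_index callback are all hypothetical, invented for this example.

```python
import os
import tempfile
from pathlib import Path


def publish_generation(root: Path, build_index) -> Path:
    """Build a new index generation off to the side, then publish it."""
    generations = root / "generations"
    generations.mkdir(parents=True, exist_ok=True)

    # Build in a private staging directory so searchers never see a
    # half-written index.
    staging = Path(tempfile.mkdtemp(prefix="staging-", dir=generations))
    build_index(staging)  # hypothetical callback that writes the index files

    # Number the generations so the newest one is easy to identify.
    existing = [int(p.name) for p in generations.iterdir() if p.name.isdigit()]
    final = generations / str(max(existing, default=0) + 1)
    staging.rename(final)

    # Publish by atomically replacing the pointer file.
    pointer_tmp = root / "CURRENT.tmp"
    pointer_tmp.write_text(final.name)
    os.replace(pointer_tmp, root / "CURRENT")
    return final


def current_generation(root: Path) -> Path:
    """What a search process would call before (re)opening its reader."""
    name = (root / "CURRENT").read_text().strip()
    return root / "generations" / name
```

The search side only ever reads CURRENT and opens whatever directory it
names, so the indexer can take as long as it likes without searches
seeing a partial index. Cleaning up old generations is left out here.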

JAMES

--- Larry Ogrodnek <[EMAIL PROTECTED]> wrote:

> We have a similar setup, although probably only 1/5th the number of
> documents and updates.  I'd suggest just making periodic index backups.
> 
> I've been storing my index as follows:
> 
> <workdir>/<index-name>/data/ (Lucene index directory)
> <workdir>/<index-name>/backups/
> 
> The "data" directory is what's passed into IndexWriter/IndexReader.
> Additionally, I create/update a .last_update file, which just contains
> the timestamp of when the last update was started, so when the app
> starts up it only needs to retrieve updates from the db since then.
> 
> Periodically the app copies the contents of data into a new directory in
> backups named by the date/time, e.g. backups/2007-07-04.110051.  If
> needed, I can delete data and replace the contents with the latest
> backup, and the app will only retrieve records updated since the backup
> was made (using the backup's .last_update)...
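
The backup/restore cycle described above could be sketched roughly as
follows, in plain Python rather than the poster's actual code. It assumes
the .last_update file lives inside data/ (so it is copied along with each
backup) and that nothing is writing to the index while the copy runs.
The function names backup_index and restore_latest are hypothetical.

```python
import shutil
import time
from pathlib import Path

LAST_UPDATE = ".last_update"  # assumed to be stored inside data/


def backup_index(index_dir: Path, now=None) -> Path:
    """Copy data/ into backups/<date.time>, e.g. backups/2007-07-04.110051."""
    stamp = time.strftime("%Y-%m-%d.%H%M%S", time.localtime(now))
    target = index_dir / "backups" / stamp
    shutil.copytree(index_dir / "data", target)
    return target


def restore_latest(index_dir: Path) -> str:
    """Replace data/ with the newest backup and return its .last_update."""
    backups = sorted(p for p in (index_dir / "backups").iterdir() if p.is_dir())
    if not backups:
        raise FileNotFoundError("no backups to restore from")
    latest = backups[-1]  # date/time names sort chronologically

    data = index_dir / "data"
    if data.exists():
        shutil.rmtree(data)
    shutil.copytree(latest, data)

    # The caller would fetch only records updated since this timestamp.
    return (data / LAST_UPDATE).read_text().strip()
```

After a restore, the app would ask the db for everything changed since
the returned timestamp and re-index just those records.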
> 
> I'd recommend making the complete index creation from scratch a normal
> operation as much as possible (but you're right, for that number of
> documents it will take a while).  It's been really helpful here when
> doing additional deploys for testing, or deciding we want to index
> things differently, etc...
> 
> -larry
> 
> 
> -----Original Message-----
> From: Scott Smith [mailto:[EMAIL PROTECTED]
> 
> Sent: Thursday, July 06, 2006 1:48 PM
> To: lucene-user@jakarta.apache.org
> Subject: Managing a large archival (and constantly changing) database
> 
> I've been asked to do a project which provides full-text search for a
> large database of articles.  The expectation is that most of the
> articles are fairly small (<2k bytes).  There will be an initial
> population of around 400,000 articles.  There will then be approximately
> 2000 new articles added each day (they need to be added in "real time"
> (within a few minutes of arrival), but will be spread out during the
> day).  So, roughly another 700,000 articles each year.
>
> I've read enough to believe that having a Lucene database of several
> million articles is doable.  And, adding 2000 articles per day wouldn't
> seem to be that many.  My concern is the real-time nature of the
> application.  I'm a bit nervous (perhaps without justification) about
> simply growing one monolithic Lucene database.  Should there be a crash,
> the database will be unusable and I'll have to rebuild from scratch
> (which, based on my experience, would be hours of time).
>
> Some of my thoughts were:
> 
> 1)     Having monthly databases and using MultiSearcher to search across
> them.  That way my exposure for a corrupted database is limited to this
> month's database.  This would also seem to give me somewhat better
> control: if the search was generating lots of hits, I could display
> the results a month at a time and not bury the user in output.  It
> would also spread that user's search CPU out better and not prevent
> other individuals from doing a search.  If there were very few results,
> I could sleep between each month's search and, again, not lock everyone
> else out from searches.
> 
> 2)     Having a "this month's" searchable and an "everything else"
> searchable.  At the beginning of each month, I would consolidate the
> previous month's database into the "everything else" searchable.  This
> would give more consistent results for relevancy-ranked searches.  But
> it means that a bad search could return lots of results.
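
To make the month-at-a-time idea in option 1 concrete, here is a toy
Python sketch. It is only a stand-in: it uses plain in-memory lists and
substring matching instead of Lucene indexes and MultiSearcher, and the
class name MonthlyIndexes and its methods are made up for illustration.

```python
from collections import defaultdict
from datetime import date


class MonthlyIndexes:
    """Toy stand-in for one index per month (plain lists, not Lucene)."""

    def __init__(self):
        self._by_month = defaultdict(list)  # "YYYY-MM" -> [(doc_id, text)]

    def add(self, doc_id: str, text: str, arrived: date) -> None:
        # Route each article to the bucket for its arrival month.
        self._by_month[arrived.strftime("%Y-%m")].append((doc_id, text))

    def search_by_month(self, term: str):
        """Yield (month, matching doc ids), newest month first."""
        needle = term.lower()
        for month in sorted(self._by_month, reverse=True):
            hits = [d for d, text in self._by_month[month] if needle in text.lower()]
            if hits:
                yield month, hits
```

Because search_by_month is a generator, older months are only scanned
when the caller asks for the next page of results, which is where a
pause between months could be added. Option 2 corresponds to the same
structure with just two buckets.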
>
> Has anyone else dealt with a similar problem?  Am I expecting too much
> from Lucene running on a single machine (or should I be looking at
> Hadoop?).  Any comments or links to previous discussions on this topic
> would be appreciated.
>
> Scott



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
