We have a similar setup, although probably only 1/5th the number of
documents and updates.  I'd suggest just making periodic index backups.

I've been storing my index as follows:

<workdir>/<index-name>/data/ (lucene index directory)
<workdir>/<index-name>/backups/

The "data" is what's passed into IndexWriter/IndexReader.  Additionally,
I create/update a .last_update file, which just contains the timestamp
of when the last update was started, so when the app starts up it only
needs to retrieve updates from the db since then.
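
In case it helps, here's roughly what that bookkeeping looks like.
Just a sketch using plain java.io; I store millis-since-epoch, and
the names match the layout above:

import java.io.*;

public class LastUpdate {
    // record when an update pass *started* (not finished), so a crash
    // mid-update means re-fetching a few rows rather than missing some
    public static void write(File workDir, long startedAtMillis)
            throws IOException {
        Writer w = new FileWriter(new File(workDir, ".last_update"));
        w.write(Long.toString(startedAtMillis));
        w.close();
    }

    public static long read(File workDir) throws IOException {
        BufferedReader r = new BufferedReader(
                new FileReader(new File(workDir, ".last_update")));
        try {
            return Long.parseLong(r.readLine().trim());
        } finally {
            r.close();
        }
    }
}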

Periodically the app copies the contents of data into a new directory in
backups named by the date/time, e.g. backups/2007-07-04.110051.  If
needed, I can delete data and replace the contents with the latest
backup, and the app will only retrieve records updated since the backup
was made (using the backup's .last_update)...
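
The copy itself is nothing fancy.  A sketch with plain java.io; the
one thing to be careful about is pausing updates (or at least closing
the IndexWriter) while the copy runs, so the backup is a consistent
snapshot.  I'm also assuming here that .last_update sits next to
data/ under the work directory:

import java.io.*;
import java.text.SimpleDateFormat;
import java.util.Date;

public class IndexBackup {
    public static void backup(File workDir) throws IOException {
        File data = new File(workDir, "data");
        String stamp = new SimpleDateFormat("yyyy-MM-dd.HHmmss")
                .format(new Date());
        File dest = new File(new File(workDir, "backups"), stamp);
        dest.mkdirs();
        // a Lucene index directory is flat, so a shallow copy is enough
        File[] files = data.listFiles();
        for (int i = 0; i < files.length; i++) {
            copy(files[i], new File(dest, files[i].getName()));
        }
        // snapshot the bookkeeping file alongside the index files
        copy(new File(workDir, ".last_update"),
             new File(dest, ".last_update"));
    }

    private static void copy(File src, File dst) throws IOException {
        InputStream in = new FileInputStream(src);
        OutputStream out = new FileOutputStream(dst);
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) > 0) {
            out.write(buf, 0, n);
        }
        in.close();
        out.close();
    }
}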

I'd recommend making complete index creation from scratch a normal,
well-rehearsed operation as much as possible (though you're right
that for that number of documents it will take a while).  It's been
really helpful here when doing additional deploys for testing, or
when we decide we want to index things differently, etc...
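
FWIW, the rebuild loop itself can stay pretty small.  A sketch
against the Lucene 1.9/2.x API; the Article class and the field
names are stand-ins for whatever your db layer gives you:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class Rebuild {
    // stand-in for whatever the real db row looks like
    static class Article { String id; String body; }

    public static void rebuild(String indexPath,
                               java.util.List articles)
            throws Exception {
        // create=true wipes any existing index at indexPath
        IndexWriter writer = new IndexWriter(
                indexPath, new StandardAnalyzer(), true);
        for (java.util.Iterator it = articles.iterator();
                it.hasNext();) {
            Article a = (Article) it.next();
            Document doc = new Document();
            doc.add(new Field("id", a.id, Field.Store.YES,
                              Field.Index.UN_TOKENIZED));
            doc.add(new Field("body", a.body, Field.Store.NO,
                              Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
    }
}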

-larry


-----Original Message-----
From: Scott Smith [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 06, 2006 1:48 PM
To: lucene-user@jakarta.apache.org
Subject: Managing a large archival (and constantly changing) database

I've been asked to do a project which provides full-text search for a
large database of articles.  The expectation is that most of the
articles are fairly small (<2k bytes).  There will be an initial
population of around 400,000 articles.  There will then be
approximately 2000 new articles added each day; they need to be added
in "real time" (within a few minutes of arrival), but will be spread
out over the day.  So, roughly another 700,000 articles each year.

 

I've read enough to believe that having a Lucene database of several
million articles is doable.  And adding 2000 articles per day
wouldn't seem to be that many.  My concern is the real-time nature of
the application.  I'm a bit nervous (perhaps without justification)
about simply growing one monolithic Lucene database.  Should there be
a crash, the database would be unusable and I'd have to rebuild from
scratch (which, based on my experience, would take hours).

 

Some of my thoughts were:

1)     Have monthly databases and use MultiSearcher to search across
them.  That way my exposure to a corrupted database is limited to the
current month's database.  This would also seem to give me somewhat
better control: if a search generated lots of hits, I could display
the results a month at a time and not bury the user in output.  It
would also spread the search CPU load out better and not prevent
other individuals from doing a search.  If there were very few
results, I could sleep between each month's search and, again, not
lock everyone else out from searches.  (A MultiSearcher sketch
follows after this list.)

2)     Have a "this month's" searchable and an "everything else"
searchable.  At the beginning of each month, I would consolidate the
previous month's database into the "everything else" searchable.
This would give more consistent results for relevancy-ranked
searches, but it means that a bad search could return lots of
results.  (A consolidation sketch follows after this list.)
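
On 1): MultiSearcher does make the fan-out straightforward.  A rough
sketch against the Lucene 1.9/2.x API; the index paths, field name,
and query string are made up for illustration:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class MonthlySearch {
    public static void main(String[] args) throws Exception {
        // one IndexSearcher per monthly index directory
        Searchable[] monthly = new Searchable[] {
            new IndexSearcher("/indexes/2006-05"),
            new IndexSearcher("/indexes/2006-06"),
            new IndexSearcher("/indexes/2006-07"),
        };
        MultiSearcher searcher = new MultiSearcher(monthly);

        Query q = new QueryParser("body", new StandardAnalyzer())
                .parse("full-text search");
        Hits hits = searcher.search(q);
        System.out.println("total hits: " + hits.length());
        searcher.close();  // closes the underlying searchers too
    }
}

On 2): the monthly consolidation can be done with
IndexWriter.addIndexes().  Again just a sketch with made-up paths;
note that in these versions addIndexes(Directory[]) optimizes the
target index as part of the merge, so it's not a cheap call:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Consolidate {
    public static void main(String[] args) throws Exception {
        // open the existing archive index (create=false)
        IndexWriter writer = new IndexWriter(
                "/indexes/everything-else",
                new StandardAnalyzer(), false);
        Directory lastMonth =
                FSDirectory.getDirectory("/indexes/2006-06", false);
        writer.addIndexes(new Directory[] { lastMonth });
        writer.close();
    }
}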

 

Has anyone else dealt with a similar problem?  Am I expecting too
much from Lucene running on a single machine (or should I be looking
at Hadoop)?  Any comments or links to previous discussions on this
topic would be appreciated.

 

Scott