We have a similar setup, although probably only 1/5th the number of documents and updates. I'd suggest just making periodic index backups.
I've been storing my index as follows:

    <workdir>/<index-name>/data/     (lucene index directory)
    <workdir>/<index-name>/backups/

The "data" directory is what's passed into IndexWriter/IndexReader. Additionally, I create/update a .last_update file, which just contains the timestamp of when the last update was started, so when the app starts up it only needs to retrieve updates from the db since then.

Periodically the app copies the contents of data into a new directory in backups named by the date/time, e.g. backups/2007-07-04.110051. If needed, I can delete data and replace the contents with the latest backup, and the app will only retrieve records updated since the backup was made (using the backup's .last_update)...

I'd recommend making the complete index creation from scratch a normal operation as much as possible (but you're right, for that number of documents it will take a while). It's been really helpful here when doing additional deploys for testing, or deciding we want to index things differently, etc...

-larry

-----Original Message-----
From: Scott Smith [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 06, 2006 1:48 PM
To: lucene-user@jakarta.apache.org
Subject: Managing a large archival (and constantly changing) database

I've been asked to do a project which provides full-text search for a large database of articles. The expectation is that most of the articles are fairly small (<2k bytes). There will be an initial population of around 400,000 articles. There will then be approximately 2000 new articles added each day (they need to be added in "real time" (within a few minutes of arrival), but will be spread out during the day). So, roughly another 700,000 articles each year.

I've read enough to believe that having a lucene database of several million articles is doable. And, adding 2000 articles per day wouldn't seem to be that many. My concern is the real-time nature of the application.
I'm a bit nervous (perhaps without justification) at simply growing one monolithic lucene database. Should there be a crash, the database will be unusable and I'll have to rebuild from scratch (which, based on my experience, would be hours of time).

Some of my thoughts were:

1) having monthly databases and using MultiSearcher to search across them. That way my exposure for a corrupted database is limited to this month's database. This would also seem to give me somewhat better control--meaning a) if the search was generating lots of hits, I could display the results a month at a time and not bury them with output. It would also spread their search CPU out better and not prevent other individuals from doing a search. If there were very few results, I could sleep between each month's search and again, not lock everyone else out from searches.

2) Have a "this month's" searchable and an "everything else" searchable. At the beginning of each month, I would consolidate the previous month's database into the "everything else" searchable. This would give more consistent results for relevancy ranked searches. But, it means that a bad search could return lots of results.

Has anyone else dealt with a similar problem? Am I expecting too much from Lucene running on a single machine (or should I be looking at Hadoop?). Any comments or links to previous discussions on this topic would be appreciated.

Scott
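
For reference, the backup rotation described in the reply at the top of this thread could be sketched roughly as follows. This is only an illustration in Python using plain file copies, not the code from the reply: the function names (backup_index, restore_latest, read_last_update) are hypothetical, and it assumes that .last_update lives inside the data directory (so it is copied along with each backup) and that nothing is writing to the index while the copy runs.

```python
import shutil
import time
from pathlib import Path
from typing import Optional

LAST_UPDATE = ".last_update"  # marker file name from the reply; its location inside data/ is an assumption


def backup_index(index_root: Path, now: Optional[float] = None) -> Path:
    """Copy <index_root>/data into <index_root>/backups/<date.time>.

    Assumes no writer is modifying data/ during the copy.
    """
    stamp = time.strftime("%Y-%m-%d.%H%M%S", time.localtime(now))
    target = index_root / "backups" / stamp
    shutil.copytree(index_root / "data", target)  # also creates backups/ if missing
    return target


def restore_latest(index_root: Path) -> Path:
    """Replace data/ with the most recent backup and return that backup's path."""
    backups = sorted(p for p in (index_root / "backups").iterdir() if p.is_dir())
    if not backups:
        raise FileNotFoundError("no backups found")
    latest = backups[-1]  # date.time names sort chronologically
    data = index_root / "data"
    if data.exists():
        shutil.rmtree(data)
    shutil.copytree(latest, data)
    return latest


def read_last_update(index_root: Path) -> Optional[str]:
    """Return the stored timestamp, or None if the marker file is absent."""
    marker = index_root / "data" / LAST_UPDATE
    return marker.read_text().strip() if marker.exists() else None
```

After restore_latest, the value from read_last_update is the one saved with that backup, so the app would only need to fetch records changed in the db since then, as the reply describes.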