
I found this thread to be very useful when deciding
upon an indexing strategy. 


The system I work on has 3 million or so documents and
it was (until a non-Lucene performance issue came up)
set up to add and delete documents every 15 minutes in
a manner similar to the one described in the thread.
Each run added or deleted a few thousand documents,
even during peak traffic. We have a dedicated
indexing machine and distribute portions of our index
across multiple machines, but you could still follow
the pattern all on one box, just with separate

Even though Lucene allows certain types of index
operations to happen concurrently with search
activity, IMHO, if you can decouple the indexing
process from the searching process, your system as a
whole will be more flexible and scalable, with only a
little extra maintenance overhead.
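
To make the decoupling idea concrete, here is a rough sketch of one
way it could look on a single box. It is not our actual setup: the
function name, the directory layout and the "current" symlink are all
made up for illustration, and it assumes a POSIX filesystem. The
indexer finishes its work in its own build directory, and only then is
a copy published where the search process can see it.

```python
import os
import shutil
from pathlib import Path


def publish_snapshot(build_dir: Path, serve_root: Path, label: str) -> Path:
    """Copy a finished index build into serve_root/<label> and repoint
    the serve_root/current symlink at it.

    All names here are hypothetical; nothing in this sketch is a Lucene API.
    """
    serve_root.mkdir(parents=True, exist_ok=True)
    snapshot = serve_root / label
    # The search side never reads build_dir, only published snapshots.
    shutil.copytree(build_dir, snapshot)

    # Create the new link under a temporary name, then rename it over
    # "current". On POSIX the rename replaces the old link atomically.
    tmp_link = serve_root / "current.tmp"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    tmp_link.symlink_to(label)  # relative target inside serve_root
    os.replace(tmp_link, serve_root / "current")
    return snapshot
```

The search process would open serve_root/current and, as far as I
know, would still have to reopen its reader to see a newly published
snapshot. Old snapshots are left in place here; cleaning them up is
part of the extra maintenance overhead.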


> We have a similar setup, although probably only 1/5th the number of
> documents and updates.  I'd suggest just making periodic index backups.
>
> I've been storing my index as follows:
>
>   <workdir>/<index-name>/data/     (lucene index directory)
>   <workdir>/<index-name>/backups/
>
> The "data" is what's passed into IndexWriter/IndexReader.  Additionally,
> I create/update a .last_update file, which just contains the timestamp
> of when the last update was started, so when the app starts up it only
> needs to retrieve updates from the db since then.
>
> Periodically the app copies the contents of data into a new directory in
> backups named by the date/time, e.g. backups/2007-07-04.110051.  If
> needed, I can delete data and replace the contents with the latest
> backup, and the app will only retrieve records updated since the backup
> was made (using the backup's .last_update)...
>
> I'd recommend making the complete index creation from scratch a normal
> operation as much as possible (but you're right, for that number of
> documents it will take awhile).  It's been really helpful here when
> doing additional deploys for testing, or deciding we want to index
> things differently, etc...
>
> -larry
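
For anyone who wants to try the backup scheme quoted above, here is a
minimal sketch of how I read it. It is not larry's code. It assumes
the .last_update file lives inside the data directory (so that it gets
copied along with each backup) and that it holds a timestamp in
seconds as plain text; both are my guesses. The function names are
made up, and nothing here touches the Lucene API. It also assumes no
writer is in the middle of an update while the copy runs.

```python
import shutil
import time
from pathlib import Path
from typing import Optional

LAST_UPDATE = ".last_update"  # assumed to sit inside <index>/data/


def backup_index(index_root: Path, now: Optional[float] = None) -> Path:
    """Copy <index_root>/data into <index_root>/backups/<date.time>."""
    # Same naming shape as the example above, e.g. 2007-07-04.110051
    stamp = time.strftime("%Y-%m-%d.%H%M%S", time.localtime(now))
    dest = index_root / "backups" / stamp
    shutil.copytree(index_root / "data", dest)
    return dest


def restore_latest(index_root: Path) -> float:
    """Replace data with the newest backup and return that backup's
    .last_update value, so the caller knows where to resume from."""
    backups = sorted(p for p in (index_root / "backups").iterdir() if p.is_dir())
    if not backups:
        raise FileNotFoundError("no backups under %s" % index_root)
    latest = backups[-1]  # these names sort chronologically as text
    data = index_root / "data"
    if data.exists():
        shutil.rmtree(data)
    shutil.copytree(latest, data)
    return float((data / LAST_UPDATE).read_text().strip())
```

After a restore, the app would fetch from the db only the records
changed since the returned timestamp and re-index those.
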
>
> I've been asked to do a project which provides full-text search for a
> large database of articles.  The expectation is that most of the
> articles are fairly small (<2k bytes).  There will be an initial
> population of around 400,000 articles.  There will then be approximately
> 2000 new articles added each day (they need to be added in "real time"
> (within a few minutes of arrival), but will be spread out during the
> day).  So, roughly another 700,000 articles each year.
>
> I've read enough to believe that having a lucene database of several
> million articles is doable.  And, adding 2000 articles per day wouldn't
> seem to be that many.  My concern is the real-time nature of the
> application.  I'm a bit nervous (perhaps without justification) at
> simply growing one monolithic lucene database.  Should there be a crash,
> the database will be unusable and I'll have to rebuild from scratch
> (which, based on my experience, would be hours of time).
>
> Some of my thoughts were:
>
> 1)     having monthly databases and using MultiSearcher to search across
> them.  That way my exposure for a corrupted database is limited to this
> month's database.  This would also seem to give me somewhat better
> control--meaning a) if the search was generating lots of hits, I could
> display the results a month at a time and not bury them with output.  It
> would also spread their search CPU out better and not prevent other
> individuals from doing a search.  If there were very few results, I
> could sleep between each month's search and again, not lock everyone
> else out from searches.
>
> 2)     Have a "this month's" searchable and an "everything else"
> searchable.  At the beginning of each month, I would consolidate the
> previous month's database into the "everything else" searchable.  This
> would give more consistent results for relevancy ranked searches.  But,
> it means that a bad search could return lots of results.
>
> Has anyone else dealt with a similar problem?  Am I expecting too much
> from Lucene running on a single machine (or should I be looking at
> Hadoop?).  Any comments or links to previous discussions on this topic
> would be appreciated.
>
> Scott

