Hey, I found this thread to be very useful when deciding upon an indexing strategy.
http://www.mail-archive.com/[email protected]/msg12700.html

The system I work on has 3 million or so documents, and it was (until a
non-Lucene performance issue came up) set up to add/delete documents
every 15 minutes in a manner similar to the one described in the thread.
We were adding/deleting a few thousand documents every 15 minutes,
during peak traffic. We have a dedicated indexing machine and distribute
portions of our index across multiple machines, but you could still
follow the pattern all on one box, just with separate processes/threads.

Even though Lucene allows certain types of index operations to happen
concurrently with search activity, IMHO, if you can decouple the
indexing process from the searching process, your system as a whole will
be more flexible and scalable with only a little extra maintenance
overhead.

JAMES

--- Larry Ogrodnek <[EMAIL PROTECTED]> wrote:

> We have a similar setup, although probably only 1/5th the number of
> documents and updates. I'd suggest just making periodic index backups.
>
> I've been storing my index as follows:
>
> <workdir>/<index-name>/data/ (lucene index directory)
> <workdir>/<index-name>/backups/
>
> The "data" is what's passed into IndexWriter/IndexReader. Additionally,
> I create/update a .last_update file, which just contains the timestamp
> of when the last update was started, so when the app starts up it only
> needs to retrieve updates from the db since then.
>
> Periodically the app copies the contents of data into a new directory in
> backups named by the date/time, e.g. backups/2007-07-04.110051. If
> needed, I can delete data and replace the contents with the latest
> backup, and the app will only retrieve records updated since the backup
> was made (using the backup's .last_update)...
>
> I'd recommend making the complete index creation from scratch a normal
> operation as much as possible (but you're right, for that number of
> documents it will take awhile).
> It's been really helpful here when doing additional deploys for
> testing, or deciding we want to index things differently, etc...
>
> -larry
>
> -----Original Message-----
> From: Scott Smith [mailto:[EMAIL PROTECTED]
> Sent: Thursday, July 06, 2006 1:48 PM
> To: [email protected]
> Subject: Managing a large archival (and constantly changing) database
>
> I've been asked to do a project which provides full-text search for a
> large database of articles. The expectation is that most of the
> articles are fairly small (<2k bytes). There will be an initial
> population of around 400,000 articles. There will then be approximately
> 2000 new articles added each day (they need to be added in "real time"
> (within a few minutes of arrival), but will be spread out during the
> day). So, roughly another 700,000 articles each year.
>
> I've read enough to believe that having a lucene database of several
> million articles is doable. And, adding 2000 articles per day wouldn't
> seem to be that many. My concern is the real-time nature of the
> application. I'm a bit nervous (perhaps without justification) at
> simply growing one monolithic lucene database. Should there be a crash,
> the database will be unusable and I'll have to rebuild from scratch
> (which, based on my experience, would be hours of time).
>
> Some of my thoughts were:
>
> 1) having monthly databases and using MultiSearcher to search across
> them. That way my exposure for a corrupted database is limited to this
> month's database. This would also seem to give me somewhat better
> control--meaning a) if the search was generating lots of hits, I could
> display the results a month at a time and not bury them with output. It
> would also spread their search CPU out better and not prevent other
> individuals from doing a search.
> If there were very few results, I could sleep between each month's
> search and again, not lock everyone else out from searches.
>
> 2) Have a "this month's" searchable and an "everything else"
> searchable. At the beginning of each month, I would consolidate the
> previous month's database into the "everything else" searchable. This
> would give more consistent results for relevancy ranked searches. But,
> it means that a bad search could return lots of results.
>
> Has anyone else dealt with a similar problem? Am I expecting too much
> from Lucene running on a single machine (or should I be looking at
> Hadoop?). Any comments or links to previous discussions on this topic
> would be appreciated.
>
> Scott
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
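
As a rough illustration of the backup layout Larry describes in the
quoted message (a data/ directory, timestamped copies under backups/,
and a .last_update marker), here is a minimal Java sketch using only the
standard library. The class and method names (IndexBackup,
markUpdateStarted, backup, restoreLatest) are made up for this example
and come neither from the thread nor from Lucene. It also assumes
nothing is writing to data/ while the copy runs, since copying an index
mid-write could capture an inconsistent set of files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for the <index-name>/data + <index-name>/backups layout.
final class IndexBackup {
    static final String LAST_UPDATE = ".last_update";
    // Produces directory names such as 2007-07-04.110051.
    static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd.HHmmss");

    // Record when an update run started. The marker lives inside data/,
    // so every backup carries its own copy of it.
    static void markUpdateStarted(Path dataDir, Instant startedAt) throws IOException {
        Files.createDirectories(dataDir);
        Files.write(dataDir.resolve(LAST_UPDATE),
                startedAt.toString().getBytes(StandardCharsets.UTF_8));
    }

    // Read the marker back; empty if the index has never been updated.
    static Optional<Instant> lastUpdate(Path dataDir) throws IOException {
        Path marker = dataDir.resolve(LAST_UPDATE);
        if (!Files.exists(marker)) {
            return Optional.empty();
        }
        String text = new String(Files.readAllBytes(marker), StandardCharsets.UTF_8);
        return Optional.of(Instant.parse(text.trim()));
    }

    // Copy data/ into backups/<timestamp>/ and return the new directory.
    static Path backup(Path indexRoot, LocalDateTime now) throws IOException {
        Path target = indexRoot.resolve("backups").resolve(STAMP.format(now));
        copyTree(indexRoot.resolve("data"), target);
        return target;
    }

    // Replace data/ with the newest backup. The timestamp format sorts
    // lexicographically, so the greatest name is the latest backup.
    static void restoreLatest(Path indexRoot) throws IOException {
        Path backups = indexRoot.resolve("backups");
        Path latest;
        try (Stream<Path> dirs = Files.list(backups)) {
            latest = dirs.filter(Files::isDirectory)
                    .max(Comparator.comparing((Path p) -> p.getFileName().toString()))
                    .orElseThrow(() -> new IOException("no backups under " + backups));
        }
        Path data = indexRoot.resolve("data");
        deleteTree(data);
        copyTree(latest, data);
    }

    private static void copyTree(Path from, Path to) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(from)) {
            paths = walk.collect(Collectors.toList()); // parents come before children
        }
        for (Path p : paths) {
            Path dest = to.resolve(from.relativize(p).toString());
            if (Files.isDirectory(p)) {
                Files.createDirectories(dest);
            } else {
                Files.copy(p, dest, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    private static void deleteTree(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order so files are deleted before their directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

After a restoreLatest(), the app would read lastUpdate() from the
restored data/ and pull only the records changed since that timestamp
from the db, which is the recovery path described in the quoted message.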
