I've been curious for a while about this scheme, and I'm hoping you implement it and tell me if it works <G>. In truth, my data is pretty static so I haven't had to worry about it much. That said...
Would it work (and perhaps be less complex) to have an FSDirectory and a RAMDirectory that you search, and another FSDirectory that gets updated in the background? Here's the scheme as I see it:

- FSDirectorySearch: holds the bulk of your index. Everything up until, say, midnight the night before.
- FSDirectoryBack: starts out as a copy of FSDirectorySearch, and is where you add your new stories. NOTE: you don't search this.
- RAMDirectory: where you stash your new documents dating from the time your two FSDirectories were identical. It contains the delta between your two FSDirectories.

Your searcher opens FSDirectorySearch for searching.

A new document comes in: you add it to both FSDirectoryBack and the RAMDirectory.

A search request comes in: use a MultiSearcher (or a variant) to search FSDirectorySearch and the RAMDirectory together. You probably have to re-open your RAMDirectory searcher each time to pick up the most recent additions. NOTE: you might want to search the archives for performance data on MultiSearcher; I'm not all that familiar with it.

At some interval (daily? hourly? at some pre-determined number of new stories?) you close everything up, copy FSDirectoryBack over FSDirectorySearch, empty the RAMDirectory (the two FSDirectories are identical again, so the delta is empty), and restart things.

I'm wondering whether this kind of scheme lets you keep speed up and memory requirements down by processing requests faster. You might also be able to get some advantage from caching. I don't know if this is actually a viable scheme, but I thought I'd mention it. And I'm sure you can see several variations on it that might fit your problem space better.

On a side note: I ran some tests at one point throwing a variable number of (sorted) searches at my searcher via XmlRpc. I never had out-of-memory errors. The index was on the order of 1.4G, 870K documents. What I did see was the speed eventually take a dive, but at least it degraded gracefully.
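The bookkeeping of the scheme can be sketched in plain Java. This is a minimal sketch only, using simple in-memory collections as stand-ins for the two FSDirectories and the RAMDirectory; the class and method names are mine for illustration, not Lucene API:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the dual-FSDirectory-plus-RAM-delta scheme. Plain lists
// stand in for the Lucene Directory classes; names are illustrative.
class DeltaIndexScheme {
    private final List<String> fsSearch = new ArrayList<String>(); // FSDirectorySearch: searched, never written
    private final List<String> fsBack = new ArrayList<String>();   // FSDirectoryBack: written, never searched
    private final List<String> ramDelta = new ArrayList<String>(); // RAMDirectory: delta since last rollover

    DeltaIndexScheme(List<String> initial) {
        fsSearch.addAll(initial);
        fsBack.addAll(initial); // the back copy starts out identical to the search copy
    }

    // A new document comes in: add it to the back directory AND the RAM delta.
    void addDocument(String doc) {
        fsBack.add(doc);
        ramDelta.add(doc);
    }

    // A search request: "MultiSearcher" over the search directory plus the delta.
    Set<String> search() {
        Set<String> hits = new HashSet<String>(fsSearch);
        hits.addAll(ramDelta);
        return hits;
    }

    // At some interval: copy back over search and reset the delta to empty.
    void rollover() {
        fsSearch.clear();
        fsSearch.addAll(fsBack);
        ramDelta.clear();
    }
}
```

The point of the sketch is the invariant: fsSearch plus ramDelta always sees every document, so new stories are searchable immediately, while fsSearch itself only changes at rollover time.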
I have no idea what was going on in the background, specifically how XmlRpc was handling memory, so I'm not sure how much that helps. I was servicing 100 simultaneous threads as fast as I could spawn them... So I wonder if your out-of-memory issue is really related to the number of requests you're servicing. But only you will be able to figure that out <G>. These problems are...er...unpleasant to track down.

I also wonder a bit about what "large result sets" is really about. That is, do your users really care about results 100-10,000, or do they just want to page through them on demand? I'm sure you see where this is going; if you're already returning, say, 100 documents out of N and letting them page, ignore this part. If you don't already know about the inefficiency inherent in iterating over a Hits object, you might want to search the archives and/or look at TopDocs and HitCollector.

Best
Erick

On 10/17/06, Paul Waite <[EMAIL PROTECTED]> wrote:
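The paging idea above can be shown in a few lines. This is a generic sketch (the helper name is mine): it hands back only one page of an already-sorted hit list instead of materializing results 100-10,000 for every request. With Lucene itself you would collect just the top page of documents via TopDocs or a HitCollector rather than iterating a Hits object:

```java
import java.util.List;

// Paging sketch: return only one page of a sorted hit list on demand.
// Illustrative only; not Lucene API.
class Pager {
    static <T> List<T> page(List<T> sortedHits, int pageIndex, int pageSize) {
        int from = Math.min(pageIndex * pageSize, sortedHits.size());
        int to = Math.min(from + pageSize, sortedHits.size());
        return sortedHits.subList(from, to); // a view, not a copy
    }
}
```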
Hi chaps,

Just looking for some ideas/experience as to how to improve our current architecture. We have a single-index system containing approx. 2.5 million docs of about 1-3k each. The Lucene implementation is a daemon which services requests on a port in a multi-threaded manner, and it runs on a fairly new dual-CPU box with 2G of RAM.

Although I have the JVM using ~1.5G, this system fairly regularly crashes with 'out of memory' errors. It's hard to see the exact conditions at that point as to cause, but I'm guessing it's simply a number of users executing queries which return large result sets and then require sorting (just about all queries are sorted by reverse date, using a field), so chewing up too much memory.

This index is updated frequently, since it is a news site, so this makes the use of caching filters problematic. Typically about 1500 articles come in per day, and during working hours you'd see them popping in maybe every few seconds, with longer periods interspersed fairly randomly. Access to these new articles is expected to be 'immediate' for folks doing searches.

The nature of this area is such that a great deal of activity focusses on 'recent' news: in particular the last 24 hours, then the last week, and perhaps the last month, in that order. With that in mind I had the idea of creating a dual-index architecture, "recent" and "archive", where the "recent" index holds approx. the most recent 30 days and the "archive" holds the rest. But there are several refinements on this, and I wondered if anyone else out there has already solved, or at least tackled, this problem and has any suggestions.

For example, here is one idea for how the above might operate:

At a defined point in time, the 30-day index is generated. For us this is easy. Our article bodies are all stored out on disk, timestamped, and we can simply generate a list newer than a certain date and index these into a brand-new index.
At the same time, the "archive" index is merged with the existing 30-day index, to make an updated "archive" index.

The system then operates by indexing to the 30-day index and directing searches to it where the date range is appropriate, otherwise to the archive index. We would then operate in this mode for a week or so before refreshing the indexes again.

Searching and sorting would then mostly be done on an index which has around 45,000 docs in it rather than 2.5 million. I'm supposing that this will be massively faster for both indexing and searching/sorting.

Any comments from anyone on this would be very much appreciated.

Cheers,
Paul.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
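The routing decision Paul describes could be sketched as follows. This is an assumption-laden illustration (class and method names are mine, not from any library): send a query to the small "recent" index when its date range falls entirely within the last 30 days, otherwise fall back to the full archive index:

```java
// Routing sketch for the dual-index idea: queries whose date range is
// entirely within the last 30 days hit the small "recent" index;
// everything else goes to the full "archive" index. Names are illustrative.
class IndexRouter {
    static final long THIRTY_DAYS_MS = 30L * 24 * 60 * 60 * 1000;

    // queryFromMs is the oldest timestamp the query needs to see.
    static String route(long queryFromMs, long nowMs) {
        boolean withinWindow = nowMs - queryFromMs <= THIRTY_DAYS_MS;
        return withinWindow ? "recent" : "archive";
    }
}
```

Since most activity focusses on the last 24 hours to one month, most queries would take the "recent" branch and touch only ~45,000 docs.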