Hi chaps,

Just looking for some ideas/experience as to how to improve our
current architecture.

We have a single-index system containing approx. 2.5 million docs of
about 1-3k each.

The Lucene implementation is a daemon that services requests on
a port in a multi-threaded manner, and it runs on a fairly new dual-CPU
box with 2 GB of RAM. Although I have the JVM using ~1.5 GB, this system
does fairly regularly crash with 'out of memory' errors. It's hard to
pin down the exact cause at the moment of a crash, but I'm guessing it's
simply a number of users executing queries which return large
result sets that then require sorting (just about all queries are sorted
by reverse date, using a field), and so chew up too much memory.

This index is updated frequently, since it is a news site, so this makes
the use of caching filters problematic. Typically about 1500 articles
come in per day, and during working hours you'd see them popping
in maybe every few seconds, with longer gaps interspersed fairly
randomly. Access to these new articles is expected to be 'immediate'
for folks doing searches.

The nature of this area is such that a great deal of activity focusses
on 'recent' news, in particular the last 24 hours, then the last week,
and perhaps the last month in that order.

With that in mind I had the idea of creating a dual-index architecture,
"recent" and "archive", where the "recent" index holds approx. the most
recent 30 days and the "archive" holds the rest.

But there are several refinements on this, and I wondered if anyone
else out there has already solved or at least tackled this problem and
has any suggestions.


For example, here is one idea for how the above might operate:

At a defined point in time, the 30-day index is generated. For us this is
easy. Our article bodies are all stored out on disk, timestamped, and
we can simply generate a list of those newer than a certain date and
index these to a brand new index.
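
To make the "list newer than a certain date" step concrete, here is a
minimal sketch in plain Java. It assumes the article timestamp is the
file's last-modified time, which may not match how the articles are
actually timestamped; the class and method names are hypothetical, and
the actual indexing call is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: collects the article files that belong in the
// freshly built "recent" index.
final class RecentArticles {

    // Returns every regular file under root whose last-modified time is
    // strictly after cutoff, sorted by path.
    // ASSUMPTION: last-modified time stands in for the article timestamp.
    static List<Path> newerThan(Path root, Instant cutoff) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> modifiedAfter(p, cutoff))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    private static boolean modifiedAfter(Path p, Instant cutoff) {
        try {
            return Files.getLastModifiedTime(p).toInstant().isAfter(cutoff);
        } catch (IOException e) {
            // Lambdas in a stream cannot throw checked exceptions.
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned list would then be fed to whatever code adds documents to
the new index.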

At the same time, the "archive" index is merged with the existing 30-day
index, to make an updated "archive" index.

The system then operates by indexing to the 30-day index and directing
searches to it where the date range is appropriate, otherwise to the
archive index. We would then operate in this mode for a week or so
before refreshing the indexes again.
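
The routing rule could look something like the sketch below. This is
only an illustration of the idea, not Lucene code: IndexRouter, Target
and recentStart are hypothetical names, and the rule assumed here is
that a search goes to "recent" only when the lower bound of its date
range is no older than the oldest article in that index.

```java
import java.time.Instant;

// Hypothetical router deciding which index should serve a search.
final class IndexRouter {

    enum Target { RECENT, ARCHIVE }

    // ASSUMPTION: date of the oldest article held in the "recent" index.
    private final Instant recentStart;

    IndexRouter(Instant recentStart) {
        this.recentStart = recentStart;
    }

    // queryFrom is the lower bound of the search's date range;
    // null means the search has no lower bound.
    Target route(Instant queryFrom) {
        if (queryFrom != null && !queryFrom.isBefore(recentStart)) {
            return Target.RECENT;   // range lies wholly inside the recent window
        }
        return Target.ARCHIVE;      // unbounded, or reaches back past the window
    }
}
```

One thing this makes visible: a search that reaches back past the
window goes only to the archive, which has not seen anything indexed
since the last refresh. A real router might need a third "search both"
case for that, which is beyond the simple rule described above.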

So searching and sorting would then mostly be done on an index
which has around 45,000 docs in it rather than 2.5 million. I'm assuming
that this will be much faster for both indexing and
searching/sorting.
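
For reference, the 45,000 figure is just the daily rate times the
window length, which makes the "recent" index roughly 55 times smaller
than the full one. A throwaway sketch of that arithmetic (the class and
method names are hypothetical, and it assumes the 1500/day rate holds
on every day of the window):

```java
// Hypothetical back-of-the-envelope helper for the index sizes above.
final class IndexSizeEstimate {

    // Documents expected in the "recent" index for a given window.
    static long recentDocs(long articlesPerDay, int windowDays) {
        return articlesPerDay * windowDays;
    }

    // How many times smaller the "recent" index is than the full index.
    static double shrinkFactor(long totalDocs, long recentDocs) {
        return (double) totalDocs / recentDocs;
    }

    public static void main(String[] args) {
        long recent = recentDocs(1500, 30);
        double factor = shrinkFactor(2_500_000L, recent);
        System.out.println("recent docs: " + recent);
        System.out.println("full/recent ratio: " + factor);
    }
}
```

This says nothing about actual query times, only about how many
documents a sort over the "recent" index would have to consider.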

Any comments from anyone on this would be very much appreciated.

Cheers,
Paul.
