Re: one huge index or many small ones?

Sergiu Gordea Thu, 04 Nov 2004 10:00:11 -0800

javier muguruza wrote:

Hi Javier,

I think your optimization should focus on the response time of search queries. I assume that this is the variable you need to optimize. It would probably be a good idea to read the Lucene benchmarks first: http://jakarta.apache.org/lucene/docs/benchmarks.html

If you have a mandatory date constraint for each of your searches, you can split the index on a time basis. I assume that one index per month will be enough; with 10,000 emails I think it will be fast enough if you then search in only one index. But I think this is not such a good idea?
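To make the time-based split concrete, here is a minimal sketch of how a mandatory date range could be mapped to the monthly indexes that need to be searched. The function name and the "mail-index-YYYY-MM" naming scheme are assumptions made up for illustration; nothing here is part of Lucene.

```python
from datetime import date
from typing import List


def monthly_index_names(start: date, end: date, prefix: str = "mail-index") -> List[str]:
    """Return one (hypothetical) index name per calendar month touched by [start, end]."""
    if end < start:
        raise ValueError("end must not be before start")
    names = []
    year, month = start.year, start.month
    # Walk month by month until we pass the month containing `end`.
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names
```

A query restricted to a single month touches exactly one index; a query spanning a year touches twelve, so the benefit depends on how narrow the typical date constraint is.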

What about creating one index per user? If your search requires a user or a sender, you can get the name from the database and apply only the other constraints on an index dedicated to that user. I think the Lucene search will then be much faster.

The database search will also be fast. I don't think you will have more than 1,000-10,000 user names.

or maybe 1 index/user/year

or 1 index/receiver/year + 1 index/sender/year

What about this solution? Is it feasible for your system?
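A rough sketch of the "1 index/receiver/year + 1 index/sender/year" layout, assuming each index is identified by a path-like key. The key format and both function names are hypothetical and only illustrate which indexes a single message would have to be added to.

```python
import re
from typing import List


def user_year_index(user: str, year: int, role: str = "user") -> str:
    """Build a hypothetical index key such as 'sender/john_doe_example_com/2004'."""
    # Reduce the address to lowercase letters, digits and underscores.
    safe = re.sub(r"[^a-z0-9]+", "_", user.lower()).strip("_")
    return f"{role}/{safe}/{year}"


def indexes_for_message(sender: str, receivers: List[str], year: int) -> List[str]:
    """List every index a message belongs to: one for the sender, one per receiver."""
    keys = [user_year_index(sender, year, "sender")]
    keys.extend(user_year_index(r, year, "receiver") for r in receivers)
    return keys
```

Note that under this layout every message is indexed once per receiver plus once for the sender, so the total index size grows with the number of recipients.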

All the best,

 Sergiu

Thanks Erik and Giulio for the fast reply.

I am just starting to look at Lucene, so forgive me if I got some ideas
wrong. I understand your concerns about one index per email. But
having only one index is also (I guess) out of the question.

I am building an email archive. Email will be kept indefinitely
available for search, adding new email every day. Imagine a company
with millions of emails per day (been there), keep it growing for
years, adding stuff to the index while using it for searches
continuously...

That's why my idea is to decide on a time frame (a day, a month... an
extreme would be an instant, that is a single email, my original idea)
and build the index for all the email in that time frame. After the
time frame is finished, nothing more will ever be added.

Before the Lucene search, emails are selected based on other conditions
(we store the from, to, date etc. in the database as well, and these
conditions are enforced with a SQL query first, so I would not need to
enforce them in the Lucene search again; also, that query can be quite
sophisticated and I guess it would not be easy to do in Lucene by
itself). That first db step gives me a group of emails that I may have
to narrow down further with a Lucene search (of body and attachment
contents). Having an index for more than one email means that after
the search I would have to keep only the emails common to both
searches... Maybe this is better than keeping the same info I have in
the db in Lucene fields as well.

An example: I want all the email from [EMAIL PROTECTED] from Jan
to Dec containing the word 'money'. I run the db query that returns a
list with John's emails for that period of time, then (let's assume I
have one index per day) I iterate over every day, looking for emails
that contain 'money'. From the results returned by Lucene I keep only
those that are also in the first list.
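The example above can be sketched as follows. This is only an illustration of the control flow, not Lucene code: each per-day index is modelled as a plain dictionary from word to a set of email ids, and all names (search_archive, db_ids, daily_indexes) are made up for the sketch.

```python
from datetime import date, timedelta
from typing import Dict, Set


def search_archive(
    db_ids: Set[int],
    daily_indexes: Dict[date, Dict[str, Set[int]]],
    start: date,
    end: date,
    word: str,
) -> Set[int]:
    """Intersect the ids from the db query with the full-text hits of each daily index."""
    matches: Set[int] = set()
    day = start
    while day <= end:
        index = daily_indexes.get(day)  # a day without email has no index
        if index is not None:
            hits = index.get(word.lower(), set())
            # Keep only the hits that the db query also selected.
            matches |= hits & db_ids
        day += timedelta(days=1)
    return matches
```

With one index per day, a Jan to Dec query opens up to 365 indexes and discards every hit that is not in the db list, which is the cost the replies in this thread are concerned about.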

Does that sound better?

On Thu, 4 Nov 2004 17:26:21 +0100, Giulio Cesare Solaroli <[EMAIL PROTECTED]> wrote:

Hi Javier,

I suggest you build a single index, with all the information you
need to find the right mail you are looking for. You can then use
Lucene alone to find your messages.

Giulio Cesare

On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza <[EMAIL PROTECTED]> wrote:

Hi,

We are going to move from a just-in-time Perl-based search to using
Lucene in our project. I have to index emails (bodies and also
attachments). I keep all the bodies and attachments in the filesystem
for a long period of time. I have to find emails that fulfill certain
conditions; some of the conditions are taken care of at a different
level, so in the end I have a SUBSET of emails I have to run through
Lucene.

I was assuming that the best way would be to create an index for each
email. Having a single index for a group of emails (say a day's worth
of email) seems too coarse-grained: imagine a day has 10000 emails,
and some queries will want to look in only a handful of those
emails... But the problem with having one index per email is the
massive number of emails... imagine having 100000 indexes.

Anyway, any ideas about that? I just wanted to check whether someone
feels I am wrong.

Thanks

---------------------------------------------------------------------

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


