javier muguruza wrote:
Hi Javier,
I think the your optimization should take care of the response time of search queries. I asume that this is the
variable you need to optimize. Probably it will be a good thing to read first the lucene benchmarks:
http://jakarta.apache.org/lucene/docs/benchmarks.html. <http://jakarta.apache.org/lucene/docs/benchmarks.html>
If you have a mandatory date constraint for each of your indexes you can split the index on time basis, I asume that
one index per month will be enough I think ... 10.000 emails I think it will be fast enough if you will search in only one index afterwards.
But I think this is not such a good Idea?
What about creating one index per user? If your search require a user or a sender, and you can get its name from database, and apply only
the other constrains on an index dedicated to that user .. I think the lucene search will be much more faster.
Also the database search will be fast .. I don'T think you will have more then 1.000-10.000 user names.
or maybe 1 index/user/year
or 1 index/receiver/year + 1index/sender/year
What about this solution is it feasible for your system?
All the best,
Sergiu
Thanks Erik and Giulio for the fast reply.
I am just starting to look at lucene so forgive me if I got some ideas wrong. I understand your concerns about one index per email. But having one index only is also (I guess) out of question.
I am building an email archive. Email will be kept indefinitely available for search, adding new email every day. Imagine a company with millions of emails per day (been there), keep it growing for years, adding stuff to the index while using it for searches continuously...
That's why my idea is to decide on a time frame (a day, a month...an extreme would be an instant, that is a single email, my original idea) and build the index for all the email in that timeframe. After the timeframe is finished no more stuff will be ever added.
Before the lucene search emails are selected based on other conditions (we store the from, to, date etc in database as well, and these conditions are enforced with a sql query first, so I would not need to enforce them in the lucene search again, also that query can be quite sophisticated and I guess would not be easyly possible to do it in lucene by itself). That first db step gives me a group of emails that maybe I have to further narrow down based on a lucene search (of body and attachment contents). Having an index for more than one emails means that after the search I would have to get only the overlaping emails from the two searches...Maybe this is better than keeping the same info I have in the db in lucene fields as well.
An example: I want all the email from [EMAIL PROTECTED] from Jan to Dec containing the word 'money'. I run the db query that returns a list with john's email for that period of time, then (lets assume I have one index per day) I iterate on every day, looking for emails that contain 'money', from the results returned by lucene I keep only these that are also in the first list.
Does that sound better?
On Thu, 4 Nov 2004 17:26:21 +0100, Giulio Cesare Solaroli
<[EMAIL PROTECTED]> wrote:
Hi Javier,
I suggest you to build a single index, with all the information you need to find the right mail you are looking for. You than can use Lucene alone to find you messages.
Giulio Cesare
On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza <[EMAIL PROTECTED]> wrote:
Hi,
We are going to move from a just-in-time perl based search to using lucene in our project. I have to index emails (bodies and also attachements). I keep in the filesystem all the bodies and attachments for a long period of time. I have to find emails that fullfil certain conditions, some of the conditions are take care of at a different level, so in the end I have a SUBSET of emails I have to run through lucene.
I was assuming that the best way would be to create an index for each email. Having an unique index for a group of emails (say a day worth of email) seems too coarse grained, imagine a day has 10000 emails, and some queries will like to look in only a handful of the emails...But the problem with having one index per emails is the massive number of emails...imagine having 100000 indexes
Anyway, any idea about that? I just wanted to check wether someones feels I am wrong.
Thanks
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]