We are actually grabbing emails by becoming part of the SMTP stream.
This part is figured out and we have archived over 600k emails into a
MySQL database. The problem is that, since we currently store the blobs
in the DB, the databases are getting large and searching takes plenty of
time. We want to convert the searching to Lucene to add more advanced
features.
Can I have multiple "to", "from" and "bcc" fields?
-Michael
Rob Staveley (Tom) wrote:
you cannot index PST files standalone
You can with LibPST (a C library - see
http://sourceforge.net/projects/ol2mbox), if they are 97-2002 format.
-----Original Message-----
From: Mike Streeton [mailto:[EMAIL PROTECTED]
Sent: 19 June 2006 08:33
To: java-user@lucene.apache.org
Subject: RE: indexing emails
When you talk about indexing emails are you indexing Outlook mails? We have
only found a few libraries that will do this and all require Outlook to be
online at the time i.e. you cannot index PST files standalone.
As far as indexing goes, index each address in a separate un-tokenized field,
not space-delimited in a single field. It is also useful to put the To, CC
and BCC addresses in a single combined field, so that you can search for all
email you have sent to a person. I would also recommend some processing on
the Subject field to remove FW and RE; this will allow you to search by
subject and pick up all emails in the thread.
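
The subject clean-up suggested above could be sketched roughly as follows.
This is only an illustration, not code from anyone on this thread: the class
name SubjectNormalizer, the method name, and the exact set of prefixes
("RE", "FW", "FWD") are assumptions, and real mail may carry localized or
numbered prefixes that this does not handle.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips leading reply/forward markers from a subject
// so that all messages of a thread end up with the same indexed subject.
class SubjectNormalizer {
    // One or more leading "RE:", "FW:" or "FWD:" markers, any case,
    // with optional whitespace around them. The prefix list is an assumption.
    private static final Pattern PREFIXES =
        Pattern.compile("^(?:\\s*(?:re|fw|fwd)\\s*:\\s*)+", Pattern.CASE_INSENSITIVE);

    static String normalize(String subject) {
        if (subject == null) {
            return "";
        }
        // Remove the run of markers at the start, then trim leftover spaces.
        return PREFIXES.matcher(subject).replaceFirst("").trim();
    }
}
```

The normalized string would then go into its own subject field at index
time, while the original subject can still be stored for display.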
Mike
-----Original Message-----
From: Michael Wechner [mailto:[EMAIL PROTECTED]
Sent: 19 June 2006 08:21
To: java-user@lucene.apache.org
Subject: Re: indexing emails
Rob Staveley (Tom) wrote:
Having spent a lot of time getting this wrong myself in an e-mail
indexer(!), I urge you to consider whether in your query interface you will
need to look for mail to "john*" rather than [EMAIL PROTECTED], because
"john*" may have been addressed to [EMAIL PROTECTED] or [EMAIL PROTECTED].
If you index only [EMAIL PROTECTED] (untokenised) you will have to use a
PrefixQuery to look for "john*", and you are liable to hit
BooleanQuery.TooManyClauses problems, if you have more than 1024 (or
BooleanQuery.getMaxClauseCount()) e-mail addresses in your index starting
with "john".
I'm trying to figure out a good design for this now for my own e-mail
indexing application,
btw, is your code available somewhere, I mean as Open Source ;-) ?
Thanks
Michi
considering also whether I should cater for searches for "*smith*".
I'm coming round to the realisation that WildcardQuery and PrefixQuery are
not great things to depend upon for getting e-mail addresses from an index
and the right thing to do is to break the address up into natural tokens
(split on '.' or '-') in one field and leave them intact in another field.
It isn't ideal; e-mail addresses with no separator between initials or
first names and last name still need a PrefixQuery or WildcardQuery, if you
want to search for last names, but it does make some queries possible which
would otherwise blow up.
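
A minimal sketch of the "natural tokens" idea, for illustration only. The
class name AddressTokens and the example.com addresses are made up, and
splitting on '@' as well as on '.' and '-' is an added assumption. The
tokens would feed one field, with the intact address kept in another.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: breaks an e-mail address into its natural tokens.
class AddressTokens {
    static List<String> tokens(String address) {
        List<String> out = new ArrayList<>();
        // Split on runs of '.', '-' or '@'; the '@' is an assumption
        // beyond the separators named in the message above.
        for (String part : address.toLowerCase(Locale.ROOT).split("[.@-]+")) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }
}
```

With this, an address like john.smith@example.com yields the tokens john,
smith, example and com, so "smith" can be matched as a plain term. An
address like jsmith@example.com still yields jsmith as a single token,
which is the case that continues to need a PrefixQuery or WildcardQuery.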
-----Original Message-----
From: karl wettin [mailto:[EMAIL PROTECTED]
Sent: 16 June 2006 21:13
To: java-user@lucene.apache.org
Subject: Re: indexing emails
On Fri, 2006-06-16 at 15:20 -0400, Michael J. Prichard wrote:
I am working on indexing emails and want to have a "to" field. I am
currently putting all the email addresses on one line separated with
spaces. Example:
[EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]
Then I index that with a StandardAnalyzer as follows:
doc.add(new Field("to", (String) itemContent.get("to"),
Field.Store.YES, Field.Index.UN_TOKENIZED));
Question is...is this the best way to do it? I want to be able to
search for [EMAIL PROTECTED] and pick out just those Documents, etc.
You can either do it as above (but you want to TOKENIZE the field) or you
could create a new UN_TOKENIZED field for each email address.
The second will require less CPU as it does not involve any lexical
analysis. It will also create larger distance between the addresses in the
index (see span queries and term positions).
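
For the second option, the space-separated "to" line first has to be split
into individual addresses. A rough sketch of that step in plain Java is
below; the class name RecipientSplitter is made up, and lower-casing the
addresses is an assumption (it only matters because an untokenized field
is matched exactly).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: turns "a@x b@y c@z" into a list of single addresses.
class RecipientSplitter {
    static List<String> split(String toLine) {
        List<String> out = new ArrayList<>();
        if (toLine == null) {
            return out;
        }
        // Split on any run of whitespace and skip empty pieces.
        for (String address : toLine.trim().split("\\s+")) {
            if (!address.isEmpty()) {
                // Lower-casing is an assumption, see the note above.
                out.add(address.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }
}
```

Each returned address would then get its own doc.add(new Field("to", ...,
Field.Store.YES, Field.Index.UN_TOKENIZED)) call, as in the quoted snippet.
Lucene allows several fields with the same name on one Document, so a
message can carry as many "to" values as it has recipients.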
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--
Michael Wechner
Wyona - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
[EMAIL PROTECTED] [EMAIL PROTECTED]
+41 44 272 91 61