Re: indexing emails

John Haxby Mon, 19 Jun 2006 04:49:05 -0700

Michael J. Prichard wrote:

I am working on indexing emails and want to have a "to" field. I am currently putting all the emails on one line separated w/spaces...example:
[EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]

Then I index that with a StandardAnalyzer as follows:
doc.add(new Field("to", (String) itemContent.get("to"), Field.Store.YES, Field.Index.UN_TOKENIZED));
Question is...is this the best way to do it? I want to be able to search for [EMAIL PROTECTED] and pick out just those Documents, etc.

I took a slightly different approach. Using javamail, given a To: line like this:

To: Fred Smith <[EMAIL PROTECTED]>,=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <[EMAIL PROTECTED]>


I re-constructed the address list to look like this:

   Fred Smith [EMAIL PROTECTED] Keld Jørn Simonsen [EMAIL PROTECTED]

and fed that to the analyser. I forget which analyser we eventually settled on, but the "[EMAIL PROTECTED]" turns into the tokens "fred", "example" and "com". This actually gives rise to a remarkably natural way of searching for addresses. People do things like searching for "lucene.apache.org" to look for mail sent to the lucene lists; they search for me variously as "jch", "john haxby" and "haxby"; they even, occasionally, search for complete mail addresses. They all work.
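To make the tokenising effect concrete, here is a minimal sketch in plain Java. It is not the Lucene analyser from the original setup (which one that was is not stated); it only imitates the behaviour described, splitting on anything that is not a letter or digit and lower-casing the result. The class name AddressTokens and the sample address are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: imitates an analyser that splits on every character
// that is not a letter or digit and lower-cases each token.
// This is NOT a Lucene analyser; the class name is hypothetical.
class AddressTokens {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // \p{L} matches any Unicode letter, \p{N} any Unicode digit.
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {   // split() can yield a leading empty string
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical sample address, chosen to match the tokens named above.
        System.out.println(tokenize("Fred Smith fred@example.com"));
        System.out.println(tokenize("lucene.apache.org"));
    }
}
```

Because a query string such as "lucene.apache.org" goes through the same splitting, it breaks into the same tokens as the indexed address, which is why partial names, domains and complete addresses can all match.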

The RFC2047 syntax in the example above gives one hint as to the minefield that address parsing can be. If you look at the javamail spec, you'll also see reference to group-syntax -- it's often seen as


   undisclosed-recipients:;

but you'll also occasionally see

   example-group: [EMAIL PROTECTED], [EMAIL PROTECTED];

Javamail knows how to parse these and I threw away the group name and just indexed the messages. It might've been better to keep the group name, but groups aren't that widely used so it probably doesn't make much difference.
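Purely to show what "throwing away the group name" amounts to, here is a deliberately naive sketch in plain Java. It is an assumption-laden illustration, not what javamail does internally: it only handles a header value that is a single simple group, and it would be fooled by a colon inside a quoted display name. For real mail, leave the parsing to javamail. The class name GroupSyntax and the sample addresses are hypothetical.

```java
// Illustration only: strips the group name from a value of the form
// "name: member, member;". Anything else is returned unchanged.
// Real address parsing has many more cases; this is not a full parser.
class GroupSyntax {
    static String members(String value) {
        String v = value.trim();
        int colon = v.indexOf(':');
        if (colon < 0 || !v.endsWith(";")) {
            return v;   // not in simple group form
        }
        // Keep only what sits between the colon and the closing semicolon.
        return v.substring(colon + 1, v.length() - 1).trim();
    }

    public static void main(String[] args) {
        System.out.println("[" + members("undisclosed-recipients:;") + "]");
        // Hypothetical member addresses.
        System.out.println(members("example-group: a@example.com, b@example.com;"));
    }
}
```

An empty group such as "undisclosed-recipients:;" yields an empty string, so nothing is indexed for it.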

Other headers cause headaches as well. Things like the subject can be RFC2047 encoded so you'll need to decode them. The various message-id headers are also slightly problematic. If you're using "message-id" and "references" and "in-reply-to" you'll need to be careful -- the individual message-ids will need their angle brackets removed and they really ought not to be tokenized.
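As a rough sketch of the message-id handling, the following plain Java pulls each bracketed id out of a header value and drops the angle brackets. It assumes ids contain no whitespace and ignores any other text in the header; the class name MessageIds and the sample ids are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: extracts message-ids from a Message-ID, References
// or In-Reply-To header value. The class name is hypothetical.
class MessageIds {
    // One id: '<', then characters that are not brackets or whitespace, then '>'.
    private static final Pattern ID = Pattern.compile("<([^<>\\s]+)>");

    static List<String> extract(String headerValue) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(headerValue);
        while (m.find()) {
            ids.add(m.group(1));   // group 1 is the id without its brackets
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical References header with two ids.
        System.out.println(extract("<a1@example.com> <b2@example.com>"));
    }
}
```

Each extracted id would then go into the index as a single untokenized term (for instance with Field.Index.UN_TOKENIZED, as in the original question), so that a lookup matches the whole id exactly.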

It's also worth indexing *all* the message headers. People do do searches on some odd things. I also index the raw content-type -- those huge presentations can be found and deleted by searching for "content-type:application/vnd.ms-powerpoint". Or at least I could. It seems to be broken at the moment :-(


jch
