Michael J. Prichard wrote:
I am working on indexing emails and want to have a "to" field. I am currently putting all the emails on one line separated w/ spaces...example:

[EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]

Then I index that with a StandardAnalyzer as follows:

doc.add(new Field("to", (String) itemContent.get("to"), Field.Store.YES, Field.Index.UN_TOKENIZED));

Question is...is this the best way to do it? I want to be able to search for [EMAIL PROTECTED] and pick out just those Documents, etc.

I took a slightly different approach. Using javamail, given a To: line like this:

To: Fred Smith <[EMAIL PROTECTED]>, =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <[EMAIL PROTECTED]>

I re-constructed the address list to look like this:

   Fred Smith [EMAIL PROTECTED] Keld Jørn Simonsen [EMAIL PROTECTED]

and fed that to the analyser. I forget which analyser we eventually settled on, but the "[EMAIL PROTECTED]" turns into the tokens "fred", "example" and "com". This actually gives rise to a remarkably natural way of searching for addresses. People search for "lucene.apache.org" to look for mail sent to the lucene lists; they search for me variously as "jch", "john haxby" and "haxby"; they even, occasionally, search for complete mail addresses. They all work.
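To make the token behaviour concrete, here is a minimal sketch in plain Java. It is not the analyser we used (I don't remember which one that was); it only approximates one that splits on anything other than letters and digits and lower-cases the pieces. The class name AddressTokens and the address fred@example.com are made up for illustration, the latter chosen to match the tokens mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: approximates a letter/digit-splitting, lower-casing
// analyser. It is NOT the analyser from the original setup.
final class AddressTokens {

    // Split on every run of characters that is neither a letter nor a digit,
    // then lower-case each remaining piece.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) { // split() can yield a leading empty string
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical address list entry, not taken from real mail.
        System.out.println(tokenize("Fred Smith fred@example.com"));
        // A query string run through the same splitting.
        System.out.println(tokenize("lucene.apache.org"));
    }
}
```

Because the query goes through the same splitting, a search for "lucene.apache.org" becomes the tokens "lucene", "apache" and "org", which is why partial and complete addresses both find the message.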

The RFC2047 syntax in the example above gives one hint as to the minefield that address parsing can be. If you look at the javamail spec, you'll also see reference to group-syntax -- it's often seen as

   undisclosed-recipients:;

but you'll also occasionally see

   example-group: [EMAIL PROTECTED], [EMAIL PROTECTED];

Javamail knows how to parse these and I threw away the group name and just indexed the member addresses. It might've been better to keep the group name, but groups aren't that widely used so it probably doesn't make much difference.
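For anyone not using javamail, the "throw away the group name" step might look roughly like the sketch below. It is a simplification and not what javamail does internally: it assumes no quoted display names containing commas or colons, and the class name GroupSyntax and the sample addresses are invented for the example. A real parser is the better choice for live mail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified sketch: drops an RFC 822 style group wrapper ("name: a, b;")
// and returns the member list. Quoted strings are NOT handled.
final class GroupSyntax {

    // group name, a colon, an optional member list, a closing semicolon
    private static final Pattern GROUP =
            Pattern.compile("^\\s*([^:;<>@\"]+):(.*);\\s*$", Pattern.DOTALL);

    static List<String> members(String value) {
        Matcher m = GROUP.matcher(value);
        // If the value is not a group, treat it as a plain address list.
        String list = m.matches() ? m.group(2) : value;
        List<String> out = new ArrayList<>();
        for (String part : list.split(",")) {
            String address = part.trim();
            if (!address.isEmpty()) {
                out.add(address);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(members("undisclosed-recipients:;"));
        // Hypothetical group members.
        System.out.println(members("example-group: a@example.com, b@example.org;"));
    }
}
```

An empty group such as "undisclosed-recipients:;" simply yields no addresses, so nothing is indexed for it.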

Other headers cause headaches as well. Things like the subject can be RFC2047 encoded, so you'll need to decode them. The various message-id headers are also slightly problematic. If you're using "message-id", "references" and "in-reply-to" you'll need to be careful: the individual message-ids will need their angle brackets removed and they really ought not to be tokenized.
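As a rough illustration of the message-id handling, the sketch below pulls the individual ids out of a header value and drops the angle brackets. It is plain Java with no mail library; the class name MessageIds and the sample ids are hypothetical, and anything outside angle brackets (comments or stray text in old In-Reply-To headers) is simply ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: extract bare message-ids from a Message-ID, References or
// In-Reply-To header value. Text outside <...> is skipped.
final class MessageIds {

    static List<String> extract(String headerValue) {
        List<String> ids = new ArrayList<>();
        int open = headerValue.indexOf('<');
        while (open >= 0) {
            int close = headerValue.indexOf('>', open + 1);
            if (close < 0) {
                break; // unterminated id: ignore the remainder
            }
            String id = headerValue.substring(open + 1, close).trim();
            if (!id.isEmpty()) {
                ids.add(id);
            }
            open = headerValue.indexOf('<', close + 1);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical folded References header with two ids.
        System.out.println(extract("<a1@example.com>\r\n <b2@example.org>"));
    }
}
```

Each id returned here would then go into the index as a single untokenized term (Field.Index.UN_TOKENIZED in the code quoted above), so that a lookup by message-id is an exact match.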

It's also worth indexing *all* the message headers. People do do searches on some odd things. I index the raw content-type as well, so those huge presentations can be found and deleted by searching for "content-type:application/vnd.ms-powerpoint". Or at least I could. It seems to be broken at the moment :-(

jch
