Michael J. Prichard wrote:
I am working on indexing emails and want to have a "to" field. I am
currently putting all the emails on one line separated w/
spaces...example:
[EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]
Then I index that with a StandardAnalyzer as follows:
doc.add(new Field("to", (String) itemContent.get("to"),
Field.Store.YES, Field.Index.UN_TOKENIZED));
Question is...is this the best way to do it? I want to be able to
search for [EMAIL PROTECTED] and pick out just those Documents, etc.
I took a slightly different approach. Using javamail, given a To: line
like this:
To: Fred Smith <[EMAIL PROTECTED]>,
=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <[EMAIL PROTECTED]>
I re-constructed the address list to look like this:
Fred Smith [EMAIL PROTECTED] Keld Jørn Simonsen [EMAIL PROTECTED]
and fed that to the analyser. I forget which analyser we eventually
settled on, but the "[EMAIL PROTECTED]" turns into the tokens "fred"
"example" and "com". This actually gives rise to a remarkably natural
way of searching for addresses. People do things like searching for
"lucene.apache.org" to look for mail sent to the lucene lists; they
search for me variously as "jch", "john haxby" and "haxby"; they even,
occasionally, search for complete mail addresses. They all work.
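
As a rough illustration of that tokenisation: the sketch below is not the
analyser referred to above (which one that was is not recorded here), just a
plain-Java stand-in that breaks text at every non-alphanumeric character and
lower-cases the pieces. The class name AddressTokens, the method tokenize and
the address fred@example.com are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a Lucene class: mimics an analyser that splits
// at every non-alphanumeric character and lower-cases each piece.
final class AddressTokens {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on runs of anything that is not a Unicode letter or digit,
        // so '@', '.', '<', '>' and spaces all act as separators.
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed sample input in the reconstructed "name address" form.
        System.out.println(tokenize("Fred Smith fred@example.com"));
        // prints [fred, smith, fred, example, com]
    }
}
```

If the same splitting is applied to the query, a search for
"lucene.apache.org" becomes the three tokens "lucene", "apache" and "org",
which is why partial names, domains and whole addresses can all match.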
The RFC2047 syntax in the example above gives one hint as to the
minefield that address parsing can be. If you look at the javamail
spec, you'll also see reference to group-syntax -- it's often seen as
undisclosed-recipients:;
but you'll also occasionally see
example-group: [EMAIL PROTECTED], [EMAIL PROTECTED];
Javamail knows how to parse these and I threw away the group name and
just indexed the messages. It might've been better to keep the group
name, but groups aren't that widely used so it probably doesn't make
much difference.
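
To make the RFC2047 point concrete, here is a hand-rolled sketch of decoding
an encoded-word such as the one in the To: example. It is only an
illustration: in practice a mail library does this (JavaMail has
MimeUtility.decodeText for the purpose), and this version assumes well-formed
input, ignores the rule about dropping whitespace between adjacent
encoded-words, and will throw on an unknown charset or a bad hex escape. The
class name EncodedWords is made up.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a mail library's RFC2047 decoder.
final class EncodedWords {
    // =?charset?encoding?text?=  where encoding is Q or B
    private static final Pattern WORD =
        Pattern.compile("=\\?([^?\\s]+)\\?([QqBb])\\?([^?\\s]*)\\?=");

    static String decode(String header) {
        Matcher m = WORD.matcher(header);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Charset cs = Charset.forName(m.group(1));
            byte[] raw = m.group(2).equalsIgnoreCase("B")
                ? Base64.getDecoder().decode(m.group(3))
                : decodeQ(m.group(3));
            m.appendReplacement(out, Matcher.quoteReplacement(new String(raw, cs)));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Q encoding: '_' means space, "=XX" is a hex-encoded byte,
    // everything else is taken literally.
    private static byte[] decodeQ(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '_') {
                bytes.write(' ');
            } else if (c == '=' && i + 2 < text.length()) {
                bytes.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                bytes.write(c);
            }
        }
        return bytes.toByteArray();
    }
}
```

With this, EncodedWords.decode("=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?=")
yields "Keld Jørn Simonsen", which is the form that ends up in the
reconstructed address list.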
Other headers cause headaches as well. Things like the subject can be
RFC2047 encoded so you'll need to decode them. The various message-id
headers are also slightly problematic. If you're using "message-id"
and "references" and "in-reply-to" you'll need to be careful -- the
individual message-ids will need their angle brackets removed and they
really ought not to be tokenized.
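
A minimal sketch of the message-id handling, assuming each id is a single
<...> unit with no embedded angle brackets or whitespace (real headers can be
messier, and comments in parentheses are not handled). MessageIds and extract
are illustrative names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the individual ids out of a "message-id",
// "in-reply-to" or "references" header value.
final class MessageIds {
    // One <...> unit; the capture group is the id without its brackets.
    private static final Pattern ID = Pattern.compile("<([^<>\\s]+)>");

    static List<String> extract(String headerValue) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(headerValue);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }
}
```

Each returned id would then be added to the document as its own field value,
indexed without tokenizing (Field.Index.UN_TOKENIZED, as in the snippet
quoted at the top), so that a lookup by id is an exact match.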
It's also worth indexing *all* the message headers. People do do
searches on some odd things. I also index the raw content-type as well
-- those huge presentations can be found and deleted by searching for
"content-type:application/vnd.ms-powerpoint". Or at least I could. It
seems to be broken at the moment :-(
jch