Kewl :)
I updated the filter, for anyone interested. If anyone wants it, I can
zip it up and send it over...just let me know.
-------- EmailFilter
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Token;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Stack;
public class EmailFilter extends TokenFilter {
    public static final String TOKEN_TYPE_EMAIL = "EMAILPART";
    private Stack emailTokenStack;

    public EmailFilter(TokenStream in) {
        super(in);
        emailTokenStack = new Stack();
    }

    public Token next() throws IOException {
        if (emailTokenStack.size() > 0) {
            return (Token) emailTokenStack.pop();
        }
        Token token = input.next();
        if (token == null) {
            return null;
        }
        addEmailPartsToStack(token);
        return token;
    }

    private void addEmailPartsToStack(Token token) throws IOException {
        String[] parts = getEmailParts(token.termText());
        if (parts == null) return;
        for (int i = 0; i < parts.length; i++) {
            Token synToken = new Token(parts[i],
                                       token.startOffset(),
                                       token.endOffset(),
                                       TOKEN_TYPE_EMAIL);
            synToken.setPositionIncrement(0);
            emailTokenStack.push(synToken);
        }
    }
    /*
     * Parses an email address into its parts for tokenization.
     * For example, [EMAIL PROTECTED] would be broken into
     *
     * [EMAIL PROTECTED]
     * [john]
     * [foo.com]
     * [foo]
     * [com]
     *
     */
    private String[] getEmailParts(String email) {
        // split on the @ sign
        String[] splitOnAt = email.split("@");
        // no user@host pair: return null so the caller adds nothing
        // (otherwise every plain token would be emitted twice)
        if (splitOnAt.length < 2 || splitOnAt[0].length() == 0) {
            return null;
        }
        // so i can add them before calling toArray
        ArrayList partsList = new ArrayList();
        // add the username
        partsList.add(splitOnAt[0]);
        // add the full host name
        partsList.add(splitOnAt[1]);
        // split the host name into pieces
        String[] splitOnDot = splitOnAt[1].split("\\.");
        // add all pieces from splitOnDot, unless the host has no dot
        // (in that case the single piece is the host name added above)
        if (splitOnDot.length > 1) {
            for (int i = 0; i < splitOnDot.length; i++) {
                partsList.add(splitOnDot[i]);
            }
        }
        /*
         * if there are more than 2 pieces then we need to add the
         * domain name, which should be the last two pieces
         */
        if (splitOnDot.length > 2) {
            String domain = splitOnDot[splitOnDot.length - 2] + "."
                    + splitOnDot[splitOnDot.length - 1];
            // add domain
            partsList.add(domain);
        }
        return (String[]) partsList.toArray(new String[0]);
    }
}
---- end EmailFilter
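
In case anyone wants to try the splitting on its own without a Lucene jar on the classpath, here is a rough standalone sketch of the same idea. The class name EmailPartsSketch, the split method, and the sample addresses are made up for illustration and are not part of Lucene or of the filter above. It also assumes that a string without exactly one @ should produce no parts at all, which is stricter than what the filter does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for experimenting with the splitting rules only.
class EmailPartsSketch {

    // Returns the sub-parts of an address, or an empty list when the
    // text does not look like user@host (assumption: exactly one '@'
    // with text on both sides).
    static List<String> split(String email) {
        List<String> parts = new ArrayList<String>();
        int at = email.indexOf('@');
        if (at <= 0 || at != email.lastIndexOf('@') || at == email.length() - 1) {
            return parts;
        }
        String user = email.substring(0, at);
        String host = email.substring(at + 1);
        parts.add(user);
        parts.add(host);

        // Add each dot-separated piece of the host, skipping the case
        // where the host has no dot (it would only repeat the host).
        String[] labels = host.split("\\.");
        if (labels.length > 1) {
            for (String label : labels) {
                if (label.length() > 0) {
                    parts.add(label);
                }
            }
        }
        // With more than two pieces, also add the last two as the domain.
        if (labels.length > 2) {
            parts.add(labels[labels.length - 2] + "." + labels[labels.length - 1]);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("john@foo.com"));      // [john, foo.com, foo, com]
        System.out.println(split("john@mail.foo.com")); // [john, mail.foo.com, mail, foo, com, foo.com]
        System.out.println(split("lucene"));            // []
    }
}
```

The full address is not in the returned list because, in the filter, the original token already carries it; only the extra parts get stacked at position increment 0.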
Otis Gospodnetic wrote:
No, you're not missing anything. :)
That JavaMail API is good for getting the whole email, but you then need to
chop it up with your EmailAnalyzer, so you're doing the right thing.
Otis
----- Original Message ----
From: Michael J. Prichard <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Saturday, July 29, 2006 2:51:59 PM
Subject: Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)
Hasan Diwan wrote:
Michael:
On 7/28/06, Michael J. Prichard <[EMAIL PROTECTED]> wrote:
Howdy....not sure if anyone else wants this but here is my first attempt
at writing an analyzer for an email address...modifications, updates,
fixes welcome.
Why reinvent the wheel? See
http://java.sun.com/products/javamail/javadocs/javax/mail/internet/InternetAddress.html#parse(java.lang.String)
and use as:
InternetAddress valid = InternetAddress.parse(string)[0]; // far simpler than rewriting it
I don't see where I can break an email address into simpler pieces for
tokens. I use JavaMail when parsing the message and then pull the
email out using InternetAddress. I don't see where I can break an email
address like [EMAIL PROTECTED] into "[EMAIL PROTECTED]", "john", "foo.com", "foo"
and "com" without splitting it. Am I missing something?
Thanks!
Michael
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]