Kewl :)

I updated the filter (for anyone interested). If anyone wants it, I can zip it up and send it along. Just let me know.

-------- EmailFilter

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Token;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Stack;

public class EmailFilter extends TokenFilter {
   public static final String TOKEN_TYPE_EMAIL = "EMAILPART";

   private Stack emailTokenStack;

   public EmailFilter(TokenStream in) {
       super(in);
       emailTokenStack = new Stack();
   }

   public Token next() throws IOException {

       if (emailTokenStack.size() > 0) {
           return (Token) emailTokenStack.pop();
       }
       Token token = input.next();
       if (token == null) {
           return null;
       }

       addEmailPartsToStack(token);

       return token;
   }

   private void addEmailPartsToStack(Token token) throws IOException {
       String[] parts = getEmailParts(token.termText());

       if (parts == null) return;

       for (int i = 0; i < parts.length; i++) {
           Token synToken = new Token(parts[i],
                                token.startOffset(),
                                token.endOffset(),
                                TOKEN_TYPE_EMAIL);
           // increment of 0 puts the part at the same position as the original token
           synToken.setPositionIncrement(0);

           emailTokenStack.push(synToken);
       }
   }

   /*
    * Parses an email address into its parts for tokenization.
    * For example [EMAIL PROTECTED] would be broken into
    *
    *    [EMAIL PROTECTED]
    *    [john]
    *    [foo.com]
    *    [foo]
    *    [com]
    *
    * Returns null if the text contains no '@', so that ordinary
    * tokens are not emitted a second time.
    */
   private String[] getEmailParts(String email) {

       // not an email address: nothing to add
       if (email.indexOf('@') < 0) return null;

       // collect the parts here before calling toArray
       ArrayList partsList = new ArrayList();

       // split on the @
       String[] splitOnAt = email.split("@");

       // add the username
       if (splitOnAt.length > 0) {
           partsList.add(splitOnAt[0]);
       }

       if (splitOnAt.length > 1) {
           // add the full host name
           partsList.add(splitOnAt[1]);

           // split the host name into pieces
           String[] splitOnDot = splitOnAt[1].split("\\.");
           // add all pieces from splitOnDot
           for (int i = 0; i < splitOnDot.length; i++) {
               partsList.add(splitOnDot[i]);
           }

           /*
            * if there are more than 2 pieces then we need to add the
            * domain name, which should be the last two
            */
           if (splitOnDot.length > 2) {
               String domain = splitOnDot[splitOnDot.length - 2] + "."
                       + splitOnDot[splitOnDot.length - 1];
               // add domain
               partsList.add(domain);
           }
       }

       return (String[]) partsList.toArray(new String[0]);
   }

}

---- end EmailFilter
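
For anyone who wants to see what comes out without setting up Lucene, here is a minimal standalone sketch that models the filter on plain strings. The class name EmailPartsDemo, its two helper methods, and the address john@mail.foo.com are all made up for illustration. It also assumes that a token without an '@' gets no extra parts. Because the parts are pushed on a stack, they come back out in reverse order after the original token.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical standalone demo; not part of Lucene or of EmailFilter itself.
class EmailPartsDemo {

    // Splits an address into user, host, host labels and the last-two-label domain.
    static List<String> emailParts(String text) {
        List<String> parts = new ArrayList<>();
        if (text.indexOf('@') < 0) {
            return parts; // assumption: tokens without '@' get no extra parts
        }
        String[] userAndHost = text.split("@");
        if (userAndHost.length > 0) {
            parts.add(userAndHost[0]);
        }
        if (userAndHost.length > 1) {
            String host = userAndHost[1];
            parts.add(host);
            String[] labels = host.split("\\.");
            for (String label : labels) {
                parts.add(label);
            }
            if (labels.length > 2) {
                parts.add(labels[labels.length - 2] + "." + labels[labels.length - 1]);
            }
        }
        return parts;
    }

    // Models next(): emit the original token, then pop the stacked parts.
    static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>();
        for (String token : tokens) {
            out.add(token);
            for (String part : emailParts(token)) {
                stack.push(part);
            }
            while (!stack.isEmpty()) {
                out.add(stack.pop());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String t : expand(Arrays.asList("mail", "john@mail.foo.com"))) {
            System.out.println(t);
        }
    }
}
```

In the real filter each of the extra parts would be a Token with position increment 0, so they all sit at the same position as the full address.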




Otis Gospodnetic wrote:

No, you're not missing anything. :)
That JavaMail API is good for getting the whole email, but you then need to 
chop it up with your EmailAnalyzer, so you're doing the right thing.

Otis

----- Original Message ----
From: Michael J. Prichard <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Saturday, July 29, 2006 2:51:59 PM
Subject: Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)

Hasan Diwan wrote:

Michael:

On 7/28/06, Michael J. Prichard <[EMAIL PROTECTED]> wrote:

Howdy....not sure if anyone else wants this but here is my first attempt
at writing an analyzer for an email address...modifications, updates,
fixes welcome.

Why reinvent the wheel? See
http://java.sun.com/products/javamail/javadocs/javax/mail/internet/InternetAddress.html#parse(java.lang.String)
and use as:

InternetAddress valid = InternetAddress.parse(string)[0]; // far simpler than rewriting it

I don't see where I can break an email address into simpler pieces for tokens. I use JavaMail when parsing the message and then pull the email address using InternetAddress. I don't see how I can break an email address like [EMAIL PROTECTED] into "[EMAIL PROTECTED]", "john", "foo.com", "foo" and "com" without splitting it. Am I missing something?

Thanks!
Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




