A good place for that is JIRA. Could you put it there? We have a bunch of analyzers in Lucene's contrib, so if you are okay with putting the Apache license on top of the source code, we can include it there. Same for EmailAnalyzer.

Otis

----- Original Message ----
From: Michael J. Prichard <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Sunday, July 30, 2006 1:37:57 PM
Subject: Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)

Kewl :)

I updated the Filter (for anyone interested). If anyone wants, I can zip it up and send it to them. Let me know.

-------- EmailFilter

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Token;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Stack;

public class EmailFilter extends TokenFilter {

    public static final String TOKEN_TYPE_EMAIL = "EMAILPART";

    private Stack emailTokenStack;

    public EmailFilter(TokenStream in) {
        super(in);
        emailTokenStack = new Stack();
    }

    public Token next() throws IOException {
        // emit any queued email parts before reading the next token
        if (emailTokenStack.size() > 0) {
            return (Token) emailTokenStack.pop();
        }

        Token token = input.next();
        if (token == null) {
            return null;
        }

        addEmailPartsToStack(token);

        return token;
    }

    private void addEmailPartsToStack(Token token) throws IOException {
        String[] parts = getEmailParts(token.termText());

        if (parts == null) return;

        for (int i = 0; i < parts.length; i++) {
            // same offsets and position as the original token
            Token synToken = new Token(parts[i],
                                       token.startOffset(),
                                       token.endOffset(),
                                       TOKEN_TYPE_EMAIL);
            synToken.setPositionIncrement(0);
            emailTokenStack.push(synToken);
        }
    }

    /*
     * Parses an email address into its parts for tokenization.
     * For example [EMAIL PROTECTED] would be broken into
     *
     * [EMAIL PROTECTED]
     * [john]
     * [foo.com]
     * [foo]
     * [com]
     *
     */
    private String[] getEmailParts(String email) {

        // so i can add them before calling toArray
        ArrayList partsList = new ArrayList();

        // split on the @
        String[] splitOnAt = email.split("@");

        // no user@host pair means this is not an email address; returning
        // null keeps ordinary tokens from being emitted a second time
        if (splitOnAt.length < 2) {
            return null;
        }

        // add the username
        partsList.add(splitOnAt[0]);

        // add the full host name
        partsList.add(splitOnAt[1]);

        // split the host name into pieces
        String[] splitOnDot = splitOnAt[1].split("\\.");

        // add all pieces from splitOnDot
        for (int i = 0; i < splitOnDot.length; i++) {
            partsList.add(splitOnDot[i]);
        }

        /*
         * if this is greater than 2 then we need to add the domain name,
         * which should be the last two pieces
         */
        if (splitOnDot.length > 2) {
            String domain = splitOnDot[splitOnDot.length - 2] + "."
                    + splitOnDot[splitOnDot.length - 1];
            // add domain
            partsList.add(domain);
        }

        return (String[]) partsList.toArray(new String[0]);
    }
}

---- end EmailFilter

Otis Gospodnetic wrote:

>No, you're not missing anything. :)
>That JavaMail API is good for getting the whole email, but you then need to
>chop it up with your EmailAnalyzer, so you're doing the right thing.
>
>Otis
>
>----- Original Message ----
>From: Michael J. Prichard <[EMAIL PROTECTED]>
>To: java-user@lucene.apache.org
>Sent: Saturday, July 29, 2006 2:51:59 PM
>Subject: Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)
>
>Hasan Diwan wrote:
>
>>Michael:
>>
>>On 7/28/06, Michael J. Prichard <[EMAIL PROTECTED]> wrote:
>>
>>>Howdy....not sure if anyone else wants this but here is my first attempt
>>>at writing an analyzer for an email address...modifications, updates,
>>>fixes welcome.
>>>
>>Why reinvent the wheel?
>>See
>>http://java.sun.com/products/javamail/javadocs/javax/mail/internet/InternetAddress.html#parse(java.lang.String)
>>
>>and use as:
>>
>>InternetAddress valid = InternetAddress.parse(string)[0]; // far simpler than rewriting it
>>
>
>I don't see where I can break an email address into simpler pieces for
>tokens. I use JavaMail when parsing the message and then pull the
>email address using InternetAddress. I don't see where I can break an email
>address like [EMAIL PROTECTED] into "[EMAIL PROTECTED]", "john", "foo.com",
>"foo" and "com" without splitting it. Am I missing something?
>
>Thanks!
>Michael
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]
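
For anyone who wants to try the splitting rules discussed in this thread without a Lucene dependency, here is a minimal standalone sketch. The class name EmailPartsDemo, the helper emailParts, and the sample addresses are assumptions made up for illustration; none of them come from the thread or from Lucene. It differs from the posted filter in that it returns an empty list for text without an '@' and does not repeat a host name that has only one label.

```java
import java.util.ArrayList;
import java.util.List;

public class EmailPartsDemo {

    // Hypothetical helper (not a Lucene API): returns the sub-parts of an
    // address, or an empty list when the text contains no '@'.
    public static List<String> emailParts(String email) {
        List<String> parts = new ArrayList<>();

        int at = email.indexOf('@');
        if (at < 0) {
            return parts; // not an email address, nothing to add
        }

        String user = email.substring(0, at);
        String host = email.substring(at + 1);

        if (!user.isEmpty()) {
            parts.add(user);
        }
        if (host.isEmpty()) {
            return parts;
        }
        parts.add(host);

        // Split the host name into its dot-separated labels.
        String[] labels = host.split("\\.");
        if (labels.length > 1) {
            for (String label : labels) {
                parts.add(label);
            }
        }

        // With more than two labels, also add the last two joined
        // together (for example "foo.com" from "mail.foo.com").
        if (labels.length > 2) {
            parts.add(labels[labels.length - 2] + "." + labels[labels.length - 1]);
        }

        return parts;
    }

    public static void main(String[] args) {
        System.out.println(emailParts("john@foo.com"));      // [john, foo.com, foo, com]
        System.out.println(emailParts("john@mail.foo.com")); // [john, mail.foo.com, mail, foo, com, foo.com]
        System.out.println(emailParts("lucene"));            // []
    }
}
```

In a filter like the one above, each returned part would become an extra token with a position increment of 0, so a search for any single part can match the position of the full address.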