1. Sure, just have an analyzer that splits on all non letter characters.
2. Phrase queries keep the order intact. (And yes, the positional information for the terms is kept, which is what allows span queries to work)

So searching on the following "foo bar com" will match f...@bar.com but not b...@foo.com

Matt

Phil Whelan wrote:
Hi,

We have a very large lucene index that we're developing that has a
field of email addresses. (Actually mulitple fields with multiple
emails addresses, but I'll simplify here)

Each document will have one "email" field containing multiple email addresses.

I am indexing email addresses only using WhitespaceAnalyzer, so to
preserve the exact adresses and store multiple emails for one
document.

Example...
doc.add(new Field("email", "f...@bar.com b...@foo.com c...@bar.foo",
Field.Store.YES, Field.Index.ANALYZED ));

Terms for this document will then be...
email:f...@bar.com
email:b...@foo.com
email:c...@bar.foo

The problem I having is that these terms are rarely re-used in other
documents. There is little overlap with email usage, and there is a
lot of very long emails addresses. Because of this, the number of
terms in my index is very big and I think it's is causing performance
issues and bloating the index.

I think I'm not using Lucene optimally here.


A couple of questions...

1) Is there a way I can analyze these emails down to smaller terms but
still search for the exact email address? For instance, if I used a
different analyzer and broke these down to the terms "foo", "bar", and
"com", is Lucene able to find "email:f...@bar.com" without matching
"email:c...@foo.bar"?

2) Does Lucene retain the positional information of tokens in the
index? Knowing this will help me anwer question 1.

Thanks,
Phil

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to