Hi,

We are developing a very large Lucene index that has a field of email addresses. (Actually multiple fields with multiple email addresses, but I'll simplify here.)
Each document will have one "email" field containing multiple email addresses. I am indexing the email addresses using only WhitespaceAnalyzer, so as to preserve the exact addresses and store multiple emails for one document.

Example:

    doc.add(new Field("email", "f...@bar.com b...@foo.com c...@bar.foo", Field.Store.YES, Field.Index.ANALYZED));

The terms for this document will then be:

    email:f...@bar.com
    email:b...@foo.com
    email:c...@bar.foo

The problem I'm having is that these terms are rarely reused in other documents. There is little overlap in email usage, and there are a lot of very long email addresses. Because of this, the number of terms in my index is very large, and I think it is causing performance issues and bloating the index. I don't think I'm using Lucene optimally here.

A couple of questions:

1) Is there a way I can analyze these emails down to smaller terms but still search for the exact email address? For instance, if I used a different analyzer and broke these down to the terms "foo", "bar", and "com", would Lucene be able to find "email:f...@bar.com" without matching "email:c...@foo.bar"?

2) Does Lucene retain the positional information of tokens in the index? Knowing this will help me answer question 1.

Thanks,
Phil
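
P.S. To make question 1 concrete, here is a rough sketch in plain Java of the matching behaviour I am after. It does not call Lucene at all; the class name, the method names, and the sample addresses are made up for illustration, and the tokenizer is just an assumption about how a "different analyzer" might split an address.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustration only: plain Java, no Lucene calls. All names are hypothetical.
class EmailPhraseSketch {

    // Split on anything that is not a letter or digit, preserving token order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Order-insensitive match: every token of the address occurs somewhere in the field.
    static boolean containsAllTerms(String fieldValue, String email) {
        return tokenize(fieldValue).containsAll(tokenize(email));
    }

    // Position-aware match: the address tokens occur at consecutive positions, in order.
    static boolean containsPhrase(String fieldValue, String email) {
        List<String> phrase = tokenize(email);
        if (phrase.isEmpty()) {
            return false;
        }
        return Collections.indexOfSubList(tokenize(fieldValue), phrase) >= 0;
    }

    public static void main(String[] args) {
        // Made-up stand-ins for the elided addresses in the example above.
        String field = "ann@bar.com bob@foo.com";

        System.out.println(containsPhrase(field, "ann@bar.com"));   // same tokens, same order
        System.out.println(containsAllTerms(field, "com@bar.ann")); // same tokens, order ignored
        System.out.println(containsPhrase(field, "com@bar.ann"));   // same tokens, wrong order
        System.out.println(containsPhrase(field, "com@bob.foo"));   // spans two adjacent addresses
    }
}
```

The last call shows a weakness of this naive version: because the tokens of neighbouring addresses sit at adjacent positions, a "phrase" can match across the boundary between two addresses. Whatever the Lucene equivalent is, it would need some way to keep addresses apart.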