Hi,

We are developing a very large Lucene index that has a field of email
addresses. (Actually multiple fields with multiple email addresses,
but I'll simplify here.)

Each document will have one "email" field containing multiple email addresses.

I am indexing the email addresses using only WhitespaceAnalyzer, so as
to preserve the exact addresses and store multiple emails for one
document.

Example...
doc.add(new Field("email", "f...@bar.com b...@foo.com c...@bar.foo",
Field.Store.YES, Field.Index.ANALYZED ));

Terms for this document will then be...
email:f...@bar.com
email:b...@foo.com
email:c...@bar.foo
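
To make the current behaviour concrete, here is a rough plain-Java sketch of the whitespace splitting I'm relying on. It is not Lucene code: the class name, the method, and the sample addresses are made up for illustration. It only shows that every whole address ends up as one term.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: mimics splitting a field
// value on whitespace so that each email address becomes one term.
final class WhitespaceTermsSketch {

    static List<String> terms(String field, String value) {
        List<String> out = new ArrayList<>();
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {          // an empty value yields no terms
                out.add(field + ":" + token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up addresses, for illustration only.
        for (String t : terms("email", "alice@example.com bob@example.org")) {
            System.out.println(t);
        }
    }
}
```

Since each distinct address is its own term, the number of terms grows roughly with the number of distinct addresses.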

The problem I'm having is that these terms are rarely re-used in other
documents. There is little overlap in email usage, and there are a
lot of very long email addresses. Because of this, the number of
terms in my index is very large, and I think it is causing performance
issues and bloating the index.

I think I'm not using Lucene optimally here.


A couple of questions...

1) Is there a way I can analyze these emails down to smaller terms but
still search for the exact email address? For instance, if I used a
different analyzer and broke these down to the terms "foo", "bar", and
"com", is Lucene able to find "email:f...@bar.com" without matching
"email:c...@foo.bar"?

2) Does Lucene retain the positional information of tokens in the
index? Knowing this will help me answer question 1.
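
To show what I have in mind for question 1, here is a toy plain-Java sketch. It is not Lucene's API, and all names and addresses in it are hypothetical. It assumes that each address is split on non-alphanumeric characters, that token positions are recorded, and that a query only matches when its tokens occur consecutively and in order. It also assumes a position gap between addresses so a match cannot straddle two of them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy index for ONE document's email field; not Lucene code.
final class PositionalEmailSketch {

    // term -> positions at which the term occurs in the field
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Break "alice@example.com" into ["alice", "example", "com"].
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Index a whitespace-separated list of addresses with token positions.
    void index(String fieldValue) {
        int pos = 0;
        for (String address : fieldValue.trim().split("\\s+")) {
            for (String t : tokenize(address)) {
                postings.computeIfAbsent(t, k -> new ArrayList<>()).add(pos++);
            }
            pos += 10; // assumed gap so a match cannot span two addresses
        }
    }

    // True if the query's tokens appear at consecutive positions, in order.
    boolean containsAddress(String address) {
        List<String> q = tokenize(address);
        if (q.isEmpty()) {
            return false;
        }
        List<Integer> none = Collections.emptyList();
        for (int start : postings.getOrDefault(q.get(0), none)) {
            boolean ok = true;
            for (int i = 1; i < q.size() && ok; i++) {
                ok = postings.getOrDefault(q.get(i), none).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PositionalEmailSketch idx = new PositionalEmailSketch();
        idx.index("alice@example.com bob@example.org"); // made-up addresses
        System.out.println(idx.containsAddress("alice@example.com"));
        System.out.println(idx.containsAddress("com@example.alice"));
    }
}
```

One caveat with this toy: because punctuation is dropped and only adjacency is checked, a query would also match when it is a contiguous part of a longer address, so it is not a strictly exact match.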

Thanks,
Phil
