Just to add to that, as I said before, in my case, I found more useful not to use UN_Tokenized. Instead, I used Tokenized with a custom analyzer that uses the KeywordTokenizer (entire input as only one token) with the LowerCaseFilter: This way I get the best of both worlds.
public class KeywordLowerAnalyzer extends Analyzer { public KeywordLowerAnalyzer() { } public TokenStream tokenStream(String fieldName, Reader reader) { TokenStream result = new KeywordTokenizer(reader); result = new LowerCaseFilter(result); return result; } } On Wed, Aug 20, 2008 at 10:21 AM, Dino Korah <[EMAIL PROTECTED]> wrote: > Hi Steve, > > Thanks a lot for that. > > I have a question on TokenStreams and email addresses, but I will post them > on a separate thread. > > Many thanks, > Dino > > > -----Original Message----- > From: Steven A Rowe [mailto:[EMAIL PROTECTED] > Sent: 19 August 2008 17:43 > To: java-user@lucene.apache.org > Subject: RE: Case Sensitivity > > Hi Dino, > > I think you'd benefit from reading some FAQ answers, like: > > "Why is it important to use the same analyzer type during indexing and > search?" > < http://wiki.apache.org/lucene-java/LuceneFAQ#head-0f374b0fe1483c90fe7d6f2c4 > 4472d10961ba63c> > > Also, have a look at the AnalysisParalysis wiki page for some hints: > <http://wiki.apache.org/lucene-java/AnalysisParalysis> > > On 08/19/2008 at 8:57 AM, Dino Korah wrote: >> From the discussion here what I could understand was, if I am using >> StandardAnalyzer on TOKENIZED fields, for both Indexing and Querying, >> I shouldn't have any problems with cases. > > If by "shouldn't have problems with cases" you mean "can match > case-insensitively", then this is true. > >> But if I have any UN_TOKENIZED fields there will be problems if I do >> not case-normalize them myself before adding them as a field to the >> document. > > Again, assuming that by "case-normalize" you mean "downcase", and that you > want case-insensitive matching, and that you use the StandardAnalyzer (or > some other downcasing analyzer) at query-time, then this is true. > >> In my case I have a mixed scenario. I am indexing emails and the email >> addresses are indexed UN_TOKENIZED. I do have a second set of custom >> tokenized field, which keep the tokens in individual fields with same >> name. > [...] >> Does it mean that where ever I use UN_TOKENIZED, they do not get >> through the StandardAnalyzer before getting Indexed, but they do when >> they are searched on? > > This is true. > >> If that is the case, Do I need to normalise them before adding to >> document? > > If you want case-insensitive matching, then yes, you do need to normalize > them before adding them to the document. > >> I also would like to know if it is better to employ an EmailAnalyzer >> that makes a TokenStream out of the given email address, rather than >> using a simplistic function that gives me a list of string pieces and >> adding them one by one. With searches, would both the approaches give >> same result? > > Yes, both approaches give the same result. When you add string pieces > one-by-one, you are adding multiple same-named fields. By contrast, the > EmailAnalyzer approach would add a single field, and would allow you to > control positions (via Token.setPositionIncrement(): > < http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/Token.ht > ml#setPositionIncrement(int)>), e.g. to improve phrase handling. Also, if > you make up an EmailAnalyzer, you can use it to search against your > tokenized email field, along with other analyzer(s) on other field(s), using > the PerFieldAnalyzerWrapper > < http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/PerField > AnalyzerWrapper.html>. > > Steve > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >