Looks like my question got unnoticed among the more important Jira discussion. :(
On the same topic, what would be the effect of the following code. Document doc = new Document(); Field f = new Field("body", bodyText, Field.Store.NO, Field.Index.TOKENIZED); f.setOmitNorms(true); Would that be equivalent to Document doc = new Document(); Field f = new Field("body", bodyText, Field.Store.NO ,Field.Index.NO_NORMS); And Field.Index.TOKENIZED has no effect after f.setOmitNorms(true); ? Many thanks, Dino -----Original Message----- From: Michael McCandless [mailto:[EMAIL PROTECTED] Sent: 27 August 2008 10:37 To: java-user@lucene.apache.org Subject: Re: Case Sensitivity Actually, as confusing as it is, Field.Index.NO_NORMS means Field.Index.UN_TOKENIZED plus field.setOmitNorms(true). Probably we should rename it to Field.Index.UN_TOKENiZED_NO_NORMS? Mike Otis Gospodnetic wrote: > Dino, you lost me half-way through your email :( > > NO_NORMS does not mean the field is not tokenized. > UN_TOKENIZED does mean the field is not tokenized. > > > Otis-- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > ----- Original Message ---- >> From: Dino Korah <[EMAIL PROTECTED]> >> To: java-user@lucene.apache.org >> Sent: Tuesday, August 26, 2008 9:17:49 AM >> Subject: RE: Case Sensitivity >> >> I think I should rephrase my question. >> >> [ Context: Using out of the box StandardAnalyzer for indexing and >> searching. >> ] >> >> Is it right to say that a field, if either UN_TOKENIZED or NO_NORMS- >> ized ( >> field.setOmitNorms(true) ), it doesn't get analyzed while indexing? >> Which means that when we search, it gets thru the analyzer and we >> need to analyze them differently in the analyzer we use for >> searching? >> Doesn't it mean that a setOmitNorms(true) field also doesn't get >> tokenized? >> >> What is the best solution if one was to add a set of fields >> UN_TOKENIZED and others TOKENIZED, of the later set a few with >> setOmitNorms(true) (the index writer is plain StandardAnalyzer >> based)? A per field analyzer at query time ?! >> >> Many thanks, >> Dino >> >> >> -----Original Message----- >> From: Dino Korah [mailto:[EMAIL PROTECTED] >> Sent: 26 August 2008 12:12 >> To: 'java-user@lucene.apache.org' >> Subject: RE: Case Sensitivity >> >> A little more case sensitivity questions. >> >> Based on the discussion on >> http://markmail.org/message/q7dqr4r7o6t6dgo5 >> and >> on this thread, is it right to say that a field, if either >> UN_TOKENIZED or NO_NORMS-ized, it doesn't get analyzed while >> indexing? Which means we need to case-normalize (down-case) those >> fields before hand? >> >> Doest it mean that if I can afford, I should use norms. >> >> Many thanks, >> Dino >> >> >> >> -----Original Message----- >> From: Steven A Rowe [mailto:[EMAIL PROTECTED] >> Sent: 19 August 2008 17:43 >> To: java-user@lucene.apache.org >> Subject: RE: Case Sensitivity >> >> Hi Dino, >> >> I think you'd benefit from reading some FAQ answers, like: >> >> "Why is it important to use the same analyzer type during indexing >> and search?" >> >> 4472d10961ba63c> >> >> Also, have a look at the AnalysisParalysis wiki page for some hints: >> >> >> On 08/19/2008 at 8:57 AM, Dino Korah wrote: >>> From the discussion here what I could understand was, if I am using >>> StandardAnalyzer on TOKENIZED fields, for both Indexing and >>> Querying, I shouldn't have any problems with cases. >> >> If by "shouldn't have problems with cases" you mean "can match >> case-insensitively", then this is true. >> >>> But if I have any UN_TOKENIZED fields there will be problems if I do >>> not case-normalize them myself before adding them as a field to the >>> document. >> >> Again, assuming that by "case-normalize" you mean "downcase", and >> that you want case-insensitive matching, and that you use the >> StandardAnalyzer (or some other downcasing analyzer) at query-time, >> then this is true. >> >>> In my case I have a mixed scenario. I am indexing emails and the >>> email addresses are indexed UN_TOKENIZED. I do have a second set of >>> custom tokenized field, which keep the tokens in individual fields >>> with same name. >> [...] >>> Does it mean that where ever I use UN_TOKENIZED, they do not get >>> through the StandardAnalyzer before getting Indexed, but they do >>> when they are searched on? >> >> This is true. >> >>> If that is the case, Do I need to normalise them before adding to >>> document? >> >> If you want case-insensitive matching, then yes, you do need to >> normalize them before adding them to the document. >> >>> I also would like to know if it is better to employ an EmailAnalyzer >>> that makes a TokenStream out of the given email address, rather than >>> using a simplistic function that gives me a list of string pieces >>> and adding them one by one. With searches, would both the approaches >>> give same result? >> >> Yes, both approaches give the same result. When you add string >> pieces one-by-one, you are adding multiple same-named fields. By >> contrast, the EmailAnalyzer approach would add a single field, and >> would allow you to control positions (via >> Token.setPositionIncrement(): >> >> ml#setPositionIncrement(int)>), e.g. to improve phrase handling. >> Also, if >> you make up an EmailAnalyzer, you can use it to search against your >> tokenized email field, along with other analyzer(s) on other >> field(s), using the PerFieldAnalyzerWrapper >> >> AnalyzerWrapper.html>. >> >> Steve >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]