Hi Guys,

From the discussion here, what I could understand was: if I am using
StandardAnalyzer on TOKENIZED fields, for both indexing and querying, I
shouldn't have any problems with case. But if I have any UN_TOKENIZED
fields, there will be problems unless I case-normalize them myself before
adding them as a field to the document.

In my case I have a mixed scenario. I am indexing emails, and the email
addresses are indexed UN_TOKENIZED. I also have a second set of
custom-tokenized fields, which keep the tokens in individual fields with
the same name. For example, if the email had a from address
"John Smith" <[EMAIL PROTECTED]>, my document looks like this:

------------------8<----------------
to:             ...                 - UN_TOKENIZED
from:           [EMAIL PROTECTED]   - UN_TOKENIZED
From-tokenized: John                - UN_TOKENIZED
From-tokenized: Smith               - UN_TOKENIZED
From-tokenized: J                   - UN_TOKENIZED
From-tokenized: Smith               - UN_TOKENIZED
From-tokenized: world.net           - UN_TOKENIZED
From-tokenized: world               - UN_TOKENIZED
From-tokenized: net                 - UN_TOKENIZED
Subject:        ...                 - TOKENIZED
Body:           ...                 - TOKENIZED
------------------8<----------------

Does this mean that wherever I use UN_TOKENIZED, the values do not go
through the StandardAnalyzer before being indexed, but they do when they
are searched on? If that is the case, do I need to normalise them myself
before adding them to the document?

I would also like to know whether it is better to employ an EmailAnalyzer
that makes a TokenStream out of the given email address, rather than using
a simplistic function that gives me a list of string pieces which I then
add one by one. Would both approaches give the same results at search time?
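For reference, this is roughly what my indexing code does at the moment.
It is a simplified sketch, not my real code: the field names match the
example above, fromAddress and fromPieces come from my own header-parsing
helper, and the toLowerCase() calls are exactly the part I am unsure about.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class MailDocBuilder {

        // Builds a document shaped like the example above.
        public static Document build(String fromAddress, String[] fromPieces,
                                     String subject, String body) {
            Document doc = new Document();

            // Whole address kept as a single term. UN_TOKENIZED never goes
            // through the analyzer, so I lowercase it here myself - is
            // that actually needed?
            doc.add(new Field("from", fromAddress.toLowerCase(),
                              Field.Store.YES, Field.Index.UN_TOKENIZED));

            // One field instance per piece, all under the same name.
            for (String piece : fromPieces) {
                doc.add(new Field("From-tokenized", piece.toLowerCase(),
                                  Field.Store.NO, Field.Index.UN_TOKENIZED));
            }

            // These two go through StandardAnalyzer as usual.
            doc.add(new Field("Subject", subject,
                              Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("Body", body,
                              Field.Store.NO, Field.Index.TOKENIZED));
            return doc;
        }
    }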
Many thanks,
Dino

-----Original Message-----
From: Doron Cohen [mailto:[EMAIL PROTECTED]
Sent: 16 August 2008 21:01
To: java-user@lucene.apache.org
Subject: Re: Case Sensitivity

Hi Sergey,

It seems like cases 4 and 5 are equivalent - both mean a case-insensitive
search, right? Otherwise please explain the difference.

If it is required to support both case-sensitive search (cases 1, 2, 3)
and case-insensitive search (case 4/5), then both forms must be saved in
the index, in two separate fields (as Erick mentioned, I think).
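For example, something along these lines (an untested sketch - "text" and
"text_lc" are made-up field names, and you would call these helpers from
your own indexing and search code):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class CaseFields {

        // Index time: the same value goes into two fields - one kept
        // as-is for the case-sensitive queries 1-3, one lowercased for
        // the case-insensitive queries 4/5.
        public static void addTextFields(Document doc, String content) {
            doc.add(new Field("text", content,
                              Field.Store.NO, Field.Index.NO_NORMS));
            doc.add(new Field("text_lc", content.toLowerCase(),
                              Field.Store.NO, Field.Index.NO_NORMS));
        }

        // Queries 1-3 stay exactly as you build them today.
        public static Query caseSensitive(String value) {
            return new TermQuery(new Term("text", value));
        }

        // Query 4/5: lowercase the value and search the lowercased field,
        // so all three documents match.
        public static Query caseInsensitive(String value) {
            return new TermQuery(new Term("text_lc", value.toLowerCase()));
        }
    }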
Hope this helps,
Doron

On Fri, Aug 15, 2008 at 10:51 AM, Sergey Kabashnyuk <[EMAIL PROTECTED]> wrote:

> Hello
>
> Here's my use case. Content of the field:
>
>   Doc1 - Field "text" - "Field Without Norms"
>   Doc2 - Field "text" - "field without norms"
>   Doc3 - Field "text" - "FIELD WITHOUT NORMS"
>
> Query                                            Expected result
> 1. new Term("text", "Field Without Norms")       doc1
> 2. new Term("text", "field without norms")       doc2
> 3. new Term("text", "FIELD WITHOUT NORMS")       doc3
> 4. lowercase("text", "field without norms")      doc1, doc2, doc3
> 5. uppercase("text", "FIELD WITHOUT NORMS")      doc1, doc2, doc3
>
> I store the "text" field like this:
>
>   new Field("text", content, Field.Store.NO,
>             Field.Index.NO_NORMS, Field.TermVector.NO)
>
> using StandardAnalyzer, and queries 1-3 work perfectly, just as I need.
> The question is how to create queries 4 and 5.
>
> Thanks
>
> Sergey Kabashnyuk
> eXo Platform SAS
>
>> Be aware that StandardAnalyzer lowercases all the input, both at index
>> and query times. Field.Store.YES will store the original text without
>> any transformations, so doc.get(<field>) will return the original
>> text. However, no matter what the Field.Store value, the *indexed*
>> tokens (using TOKENIZED, as in Field.Index.TOKENIZED) are passed
>> through the analyzer.
>>
>> For instance, indexing "MIXed CasE TEXT" in a field called "myfield"
>> with Field.Store.YES, Field.Index.TOKENIZED would index the following
>> tokens (with StandardAnalyzer):
>>
>>   mixed
>>   case
>>   text
>>
>> and searches (with StandardAnalyzer) would match any case in the query
>> terms (e.g. MIXED would hit, as would mixed, as would CaSE).
>>
>> However, doc.get("myfield") would return "MIXed CasE TEXT".
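>> Here is a quick, self-contained illustration of that behaviour (an
>> untested sketch against the 2.x API, using a RAMDirectory just to keep
>> it short):
>>
>>   import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>   import org.apache.lucene.document.Document;
>>   import org.apache.lucene.document.Field;
>>   import org.apache.lucene.index.IndexWriter;
>>   import org.apache.lucene.queryParser.QueryParser;
>>   import org.apache.lucene.search.Hits;
>>   import org.apache.lucene.search.IndexSearcher;
>>   import org.apache.lucene.search.Query;
>>   import org.apache.lucene.store.RAMDirectory;
>>
>>   public class CaseDemo {
>>       public static void main(String[] args) throws Exception {
>>           RAMDirectory dir = new RAMDirectory();
>>
>>           // Index one document; StandardAnalyzer lowercases the tokens.
>>           IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
>>           Document doc = new Document();
>>           doc.add(new Field("myfield", "MIXed CasE TEXT",
>>                             Field.Store.YES, Field.Index.TOKENIZED));
>>           writer.addDocument(doc);
>>           writer.close();
>>
>>           // QueryParser runs the query text through the same analyzer,
>>           // so "MIXED" becomes the term "mixed" and matches.
>>           IndexSearcher searcher = new IndexSearcher(dir);
>>           Query q = new QueryParser("myfield",
>>                                     new StandardAnalyzer()).parse("MIXED");
>>           Hits hits = searcher.search(q);
>>           System.out.println(hits.length());               // 1
>>           System.out.println(hits.doc(0).get("myfield"));   // MIXed CasE TEXT
>>           searcher.close();
>>       }
>>   }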
>> As Doron said, though, a few use cases would help us provide better
>> answers.
>>
>> Best
>> Erick
>>
>> On Thu, Aug 14, 2008 at 10:31 AM, Sergey Kabashnyuk <[EMAIL PROTECTED]> wrote:
>>
>>> Thanks for your reply, Erick.
>>>
>>>> About the only way to do this that I know of is to index the data
>>>> three times: once without any case changing, once uppercased and
>>>> once lowercased. You'll have to watch your analyzer, probably making
>>>> up your own (easily done, see the synonym analyzer in Lucene in
>>>> Action).
>>>>
>>>> Your example doesn't tell us anything, since the critical
>>>> information is the *analyzer* you use, both at query and at index
>>>> times. The analyzer is responsible for any transformations, like
>>>> case folding, tokenizing, etc.
>>>
>>> In the example I wanted to show that I store the field with
>>> Field.Index.NO_NORMS. As I understand it, that means the field
>>> contains the original string regardless of which analyzer I choose
>>> (StandardAnalyzer by default).
>>>
>>> All queries I build myself, without using parsers. For example:
>>>
>>>   new TermQuery(new Term("field", "MaMa"));
>>>
>>> I agree with you about the possible implementation, but it increases
>>> the size of the index several times over.
>>>
>>> But are there other possibilities, such as using a custom query,
>>> possibly similar to RegexQuery/RegexTermEnum, that would compare
>>> terms at its own discretion?
>>>
>>>> But what is your use case for needing both upper- and lower-case
>>>> comparisons? I have a hard time coming up with a reason to do both
>>>> that wouldn't be satisfied by just a caseless search.
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Thu, Aug 14, 2008 at 4:47 AM, Sergey Kabashnyuk <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hello.
>>>>>
>>>>> I have a similar question. I need to implement:
>>>>>
>>>>> 1. Case-sensitive search.
>>>>> 2. Lower-case search on a concrete field.
>>>>> 3. Upper-case search on a concrete field.
>>>>>
>>>>> For now I use
>>>>>
>>>>>   new Field("PROPERTIES", content, Field.Store.NO,
>>>>>             Field.Index.NO_NORMS, Field.TermVector.NO)
>>>>>
>>>>> for the original string and do case-sensitive search.
>>>>>
>>>>> But does anyone have an idea how to implement the second and third
>>>>> types of search?
>>>>>
>>>>> Thanks
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> Once I index a bunch of documents with a StandardAnalyzer (and if
>>>>>> the effort I would need to put in to reindex the documents is not
>>>>>> worth it), is there a way to search on the index without case
>>>>>> sensitivity? I do not use any sophisticated analyzer that makes
>>>>>> use of LowerCaseTokenizer. Please let me know if there is a
>>>>>> solution to circumvent this case-sensitivity problem.
>>>>>>
>>>>>> Many thanks
>>>>>> Dino
>>>>>
>>>>> --
>>>>> Sergey Kabashnyuk
>>>>> eXo Platform SAS
>>>
>>> --
>>> Sergey Kabashnyuk
>>> eXo Platform SAS
>
> --
> Using Opera's revolutionary e-mail client: http://www.opera.com/mail/