It sounds to me like you'll have to pre-process your text, then use something like KeywordAnalyzer. The idea is to lowercase the strings (at both index and query time), remove all non-letter (or whatever) characters, and normalize whitespace (e.g. strip leading and trailing whitespace, collapse every run of whitespace into a single space, etc.), then go from there.
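Here's a rough sketch of the pre-processing I mean, in plain Java with no Lucene dependency. The class and method names (ChoiceNormalizer, normalize) are made up for illustration, and the "letters only" rule is just one possible choice; adjust the regex to whatever you want to keep. The idea would be to run the same method over each field value before indexing it as a single token (KeywordAnalyzer or NOT_ANALYZED) and over the user's input before building the query.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene: normalizes a field value or a
// query string so that whole-value (single-token) matching behaves predictably.
class ChoiceNormalizer {

    static String normalize(String raw) {
        // Fold case so "groupC" and "groupc" compare equal.
        String s = raw.toLowerCase(Locale.ROOT);
        // Drop everything that is neither a letter nor whitespace (commas etc.).
        s = s.replaceAll("[^\\p{L}\\s]", "");
        // Collapse runs of whitespace to one space and strip both ends.
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Field values taken from the example in the quoted thread.
        String[] values = {
            "groupA, morning", "groupB, noon", "groupC, night",
            "morning", "noon", "night"
        };
        String query = normalize(" Night ");
        for (String value : values) {
            // Whole-value comparison stands in for a single-token index lookup.
            if (normalize(value).equals(query)) {
                System.out.println(value); // prints only "night"
            }
        }
    }
}
```

With the values from your example, a normalized query for "night" only equals the normalized value "night", not "groupc night". Since QueryParser splits on whitespace before the analyzer sees the text, I'd probably build a TermQuery from the normalized string for multi-word values, but treat that as a suggestion to test rather than a recipe.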
HTH
Erick

On Wed, Feb 24, 2010 at 2:10 PM, Murdoch, Paul <paul.b.murd...@saic.com> wrote:

> I manually change all indexed and searched content to lowercase. The
> whole groupC thing was just for the example...sorry. My main problem is
> with the comma and whitespace. I would like to query for "night" and
> only get the one hit. The only reason changing StandardAnalyzer "may"
> :-) not be an option is due to project scheduling constraints. However,
> if another analyzer solves my problem and passes all of our unit tests
> within those constraints then I'm all for it. I looked at the
> PerFieldAnalyzerWrapper some time ago. I like it, but my index has
> hundreds of fields so I'm looking for a more generic approach instead of
> handling them on a case by case basis.
>
> I tried the WhitespaceAnalyzer and liked the way the comma (among other
> punctuation) was preserved. I'm running tests with that right now.
> Unfortunately, if I want to look for "groupC" I have to append the comma
> which won't make sense to a user. Also the query choice:"groupC, night"
> didn't give me a hit. Does the WhitespaceAnalyzer split on whitespaces
> in phrases?
>
> Thanks,
> Paul
>
> -----Original Message-----
> From: java-user-return-45137-paul.b.murdoch=saic....@lucene.apache.org
> [mailto:java-user-return-45137-paul.b.murdoch=saic....@lucene.apache.org]
> On Behalf Of Erick Erickson
> Sent: Wednesday, February 24, 2010 1:40 PM
> To: java-user@lucene.apache.org
> Subject: Re: StandardAnalyzer and comma
>
> OK, I'm confused. In your original message, you said that
> changing analyzers is NOT an option. Then you said you'll
> give WhitespaceAnalyzer a shot....
>
> Assuming your original constraint is accurate,
> why isn't changing analyzers an option? Are you aware of
> PerFieldAnalyzerWrapper which allows you to specify different
> analyzers for different fields? If absolutely necessary, you could
> copy the field indicated into another field that you use for this case,
> which would isolate this change from any other part of your index.
>
> Be aware that WhitespaceAnalyzer does NOT fold case, so
> groupc would not match groupC.
>
> But it's easy to fix this. You can either take care to lowercase
> your input and query streams, or compose your own analyzer
> from, say, lowerCaseFilter and WhiteSpaceTokenizer to handle
> all that automatically.
>
> HTH
> Erick
>
> On Wed, Feb 24, 2010 at 12:10 PM, Murdoch, Paul
> <paul.b.murd...@saic.com> wrote:
>
> > Thanks for the input. I'll give the WhitespaceAnalyzer a shot. Also,
> > AFAIK, Field.Index.NOT_ANALYZED means that the content you index is not
> > split into separate tokens so it is searchable, but only for exact
> > matches. I may be able to get what I want with the WhitespaceAnalyzer
> > and Field.Index.NOT_ANALYZED. Thanks again.
> >
> > Paul
> >
> > -----Original Message-----
> > From: java-user-return-45134-paul.b.murdoch=saic....@lucene.apache.org
> > [mailto:java-user-return-45134-paul.b.murdoch=saic....@lucene.apache.org]
> > On Behalf Of Max Lynch
> > Sent: Wednesday, February 24, 2010 11:42 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: StandardAnalyzer and comma
> >
> > Personally punctuation matters in my queries so I use
> > WhitespaceAnalyzer. I also only want exact hits, so that
> > analyzer works well for me.
> >
> > Also, AFAIK you don't set NOT_ANALYZED if you want to search through it.
> >
> > On Wed, Feb 24, 2010 at 10:33 AM, Murdoch, Paul
> > <paul.b.murd...@saic.com> wrote:
> >
> > > I'm using Lucene 2.9. How do I make a comma behave like a regular
> > > character using the StandardAnalyzer? Example:
> > >
> > > I have a field called "choice" and some field values:
> > >
> > > groupA, morning
> > > groupB, noon
> > > groupC, night
> > > morning
> > > noon
> > > night
> > >
> > > So a query choice:night returns "groupC, night" and "night". Well, I
> > > only wanted "night". The StandardAnalyzer strips the commas from
> > > phrases and splits on whitespace. A phrase query choice:"night"
> > > produces the same results. I think indexing the field values as
> > > NOT_ANALYZED and making the comma behave as a regular character will
> > > solve this.
> > >
> > > Of course I have thought about choice:(night -groupC). That is not an
> > > option because the contents of the index are hidden from the front end
> > > where queries are made by users. I looked into changing
> > > StandardTokenizerImpl punctuation, but I'm hoping for a more simple
> > > solution. Also, changing analyzers is not an option. I could possibly
> > > extend the StandardAnalyzer, but how do I set the punctuation settings?
> > > Any help here would be great. This seems like it should be an easy fix
> > > so I hope I've missed something simple.
> > >
> > > Thanks,
> > >
> > > Paul
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org