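Hello, RTL languages like Arabic are also encoded in "logical" order, i.e. the order in which the characters are read and typed, just like left-to-right text. Only your display reverses the order, so the tokenizers see the characters in the same sequence either way.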
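To make that concrete, here is a small sketch in Python (not Solr code). It assumes a plain whitespace split is a fair stand-in for what a whitespace tokenizer does, and the sample phrase ("hello world" in Arabic) is my own, written as escapes so that a bidi-aware terminal cannot reorder it on screen.

```python
import unicodedata

# Arabic for "hello world", written as code points in logical (reading) order.
# The sample phrase is an assumption chosen for illustration only.
text = "\u0645\u0631\u062d\u0628\u0627 \u0628\u0627\u0644\u0639\u0627\u0644\u0645"

# Index 0 is the first letter a reader encounters, even though a
# bidi-aware display draws it at the right-hand edge of the line.
first = text[0]
first_name = unicodedata.name(first)
first_bidi_class = unicodedata.bidirectional(first)

# Splitting on whitespace works on the stored sequence, so the tokens
# come out in reading order regardless of display direction.
tokens = text.split()

if __name__ == "__main__":
    print(first_name, first_bidi_class)
    for token in tokens:
        print([f"U+{ord(ch):04X}" for ch in token])
```

The first token is the first word of the phrase as read, which is why a whitespace-based tokenizer can be expected to behave the same for Arabic as for English. What a given Solr tokenizer then does with the characters inside each token is a separate question.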
On Fri, Jan 15, 2010 at 2:07 PM, David Seltzer <dselt...@tveyes.com> wrote:
> Hi Erick,
>
> Thanks for your thoughtful reply!
>
>> It's actually quite rare for simple tokenizers like these to be satisfactory
>> unless it's a field you can guarantee is indexed/searched in a very
>> controlled manner, say part numbers or words from a list. In your
>> example above, none of the three variants would get a hit if the
>> user searched for "nation". Is that what you want?
>
> Yes, this is what I want. The reason for this behavior is that the
> output of SOLR needs to closely match the search results provided by a
> different legacy system. Our users have rigidly defined queries. A user
> who was interested in "nation's" is required either to search for
> "nations" or "nation*".
>
>> But no, Standard* don't have any stemming built in. And
>> what do you mean by "language specific functionality"?
>> They do NOT fold accents for instance if that's what
>> you're getting at.
>
> I asked that because I'm not super comfortable I know what's going on
> under the hood inside these tokenizers. Do they work the same on
> RightToLeft languages (such as Arabic) as they do in LeftToRight
> languages? (My assumption regarding the WhitespaceTokenizer is that it
> would be very language/direction neutral)
>
>> Could you explain a bit about *why* you want this behavior?
>
> In short we have to support multiple languages and match the behavior of
> an existing non-solr system.
>
> -Dave
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Friday, January 15, 2010 1:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Stripping Punctuation in a fieldType
>
> If you haven't seen it, this page is invaluable for this kind of
> question:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LetterTokenizerFactory
>
> LetterTokenizerFactory may well be your friend here, followed by
> LowerCaseFilterFactory. There is a problem that it would
> split "nation's" up into "nation" and "s", so searching on "nations"
> wouldn't get a hit.
>
> But you have equally ugly stuff with WhitespaceTokenizerFactory
> as you're finding out.
>
> It's actually quite rare for simple tokenizers like these to be satisfactory
> unless it's a field you can guarantee is indexed/searched in a very
> controlled manner, say part numbers or words from a list. In your
> example above, none of the three variants would get a hit if the
> user searched for "nation". Is that what you want?
>
> But no, Standard* don't have any stemming built in. And
> what do you mean by "language specific functionality"?
> They do NOT fold accents for instance if that's what
> you're getting at.
>
> Could you explain a bit about *why* you want this behavior?
>
> HTH
> Erick
>
> On Fri, Jan 15, 2010 at 1:17 PM, David Seltzer <dselt...@tveyes.com> wrote:
>
>> I'm hesitant to change Tokenizers at the moment because what we have is
>> working so nicely - or so I thought.
>>
>> What I'm looking for is case-insensitive search for words and numbers
>> without any of the stemming features turned on. The new requirement is
>> that we take punctuation out of the mix.
>>
>> Right now when I search for "Obama" I'm not getting any hits on "Obama."
>>
>> So I'm basically looking to strip punctuation. The consequence would be
>> that "nation's", "nations" and "nations," would all be represented the
>> same way.
>>
>> Would the StandardTokenizerFactory accomplish this?
>> Does it have any language specific functionality?
>> Does it do anything with stemming?
>>
>> Thanks for everyone's input!
>>
>> -Dave
>>
>> -----Original Message-----
>> From: Ahmet Arslan [mailto:iori...@yahoo.com]
>> Sent: Friday, January 15, 2010 12:42 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Stripping Punctuation in a fieldType
>>
>> > I'm trying to find the best way to set up a fieldType that
>> > strips punctuation.
>>
>> Use solr.StandardTokenizerFactory that strips punctuations.
>>
>> Or if you do not care about alphanumeric or numeric queries use
>> solr.LowerCaseTokenizerFactory that uses LetterTokenizer.
>>
>> > I think the right way to do this is using a CharacterFilter
>> > of some type, but I can't seem to find any examples of how
>> > to set this up in a schema.xml file.
>>
>> If you want to use solr.MappingCharFilterFactory you need to write all
>> punctuation characters to a text file manually. e.g. "," => ""
>>

--
Robert Muir
rcm...@gmail.com