Actually, WhitespaceTokenizer won't work. Too many person names and it won't do anything with punctuation. Something had to have changed in StandardTokenizer, and we need some of the 2.4 fixes/features, so we are kind of stuck.
-----Original Message----- From: Philip Puffinburger [mailto:ppuffinbur...@tlcdelivers.com] Sent: Monday, February 16, 2009 7:19 PM To: java-user@lucene.apache.org Subject: 2.3.2 -> 2.4.0 StandardTokenizer issue We have our own Analyzer which has the following Public final TokenStream tokenStream(String fieldname, Reader reader) { TokenStream result = new StandardTokenizer(reader); result = new StandardFilter(result); result = new MyAccentFilter(result); result = new LowerCaseFilter(result); result = new StopFilter(result); return result; } In 2.3.2 if the token ‘Cómo’ came through this it would get changed to ‘como’ by the time it made it through the filters. In 2.4.0 this isn’t the case. It treats this one token as two so we get ‘co’ and ‘mo’. So instead of search ‘como’ or ‘Cómo’ to get all the hits we now have to do them separately. I switched to the WhitespaceTokenizer as a test and that is indexing and searching the way we expect it, but I haven’t looked into what we lost by using that tokenizer. Were we relying on a bug to get what we wanted from StandardTokenizer or did something break in 2.4.0? --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org