Actually, WhitespaceTokenizer won't work for us. Our data has too many
person names, and WhitespaceTokenizer does nothing with punctuation, so
commas and periods stay attached to the tokens. Something must have changed
in StandardTokenizer, and since we need some of the 2.4 fixes/features, we
are kind of stuck.
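
For example (a minimal sketch; the sample text is made up, run from a
throwaway main() that throws Exception), WhitespaceTokenizer splits on
whitespace only, so punctuation stays glued to the name tokens:

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Prints "Smith," (comma attached) and then "John".
TokenStream ts = new WhitespaceTokenizer(new StringReader("Smith, John"));
final Token reusable = new Token();
for (Token t = ts.next(reusable); t != null; t = ts.next(reusable)) {
    System.out.println(t.term());
}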


-----Original Message-----
From: Philip Puffinburger [mailto:ppuffinbur...@tlcdelivers.com] 
Sent: Monday, February 16, 2009 7:19 PM
To: java-user@lucene.apache.org
Subject: 2.3.2 -> 2.4.0 StandardTokenizer issue

We have our own Analyzer, whose tokenStream method looks like this:

public final TokenStream tokenStream(String fieldName, Reader reader) {
  TokenStream result = new StandardTokenizer(reader);
  result = new StandardFilter(result);
  result = new MyAccentFilter(result);    // our own accent-stripping filter
  result = new LowerCaseFilter(result);
  // StopFilter has no single-argument constructor; our stop-word set
  // (shown here as stopSet) is elided from this snippet.
  result = new StopFilter(result, stopSet);
  return result;
}

In 2.3.2, if the token ‘Cómo’ came through this chain, it would be changed
to ‘como’ by the time it made it through the filters. In 2.4.0 that is no
longer the case: the tokenizer treats this one token as two, so we get ‘co’
and ‘mo’. So instead of a single search for ‘como’ or ‘Cómo’ getting all
the hits, we now have to search for the forms separately.

 

I switched to WhitespaceTokenizer as a test, and that indexes and searches
the way we expect, but I haven’t yet looked into what we lose by using that
tokenizer.
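
For reference, the test swap is just the first line of tokenStream():

// WhitespaceTokenizer in place of StandardTokenizer for the test.
TokenStream result = new WhitespaceTokenizer(reader);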

 

Were we relying on a bug to get what we wanted from StandardTokenizer, or
did something break in 2.4.0?


