Hi all. I created a test using Lucene 2.3. When run, this generates a single token:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public static void main(String[] args) throws Exception {
        // Cyrillic surname with U+0301 (combining acute accent) after U+0438
        String string = "\u0412\u0430\u0441\u0438\u0301\u043B\u044C\u0435\u0432";
        StandardAnalyzer analyser = new StandardAnalyzer();
        TokenStream stream = analyser.tokenStream("text", new StringReader(string));
        Token token;
        while ((token = stream.next()) != null) {
            System.out.println(new String(token.termBuffer(), 0, token.termLength()));
        }
    }

I then wrote a very similar test on Lucene 3.0, but specifying the version of StandardAnalyzer behaviour to use:

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public static void main(String[] args) throws Exception {
        // Same input string as in the 2.3 test
        String string = "\u0412\u0430\u0441\u0438\u0301\u043B\u044C\u0435\u0432";
        // Ask for the 2.3-compatible behaviour explicitly
        StandardAnalyzer analyser = new StandardAnalyzer(Version.LUCENE_23);
        TokenStream stream = analyser.tokenStream("text", new StringReader(string));
        TermAttribute termAttribute = stream.getAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
            System.out.println(termAttribute.term());
        }
    }

But this generates two tokens, splitting at the accent. (I assume that the accent issue itself has already been fixed as of v3.1.)

I was under the impression that the Version parameter exists to support exactly this sort of backwards compatibility, so that indexes created in the past can still be searched meaningfully with an updated version of Lucene. Have I found a gap in the backwards compatibility support here?

TX
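
For anyone reproducing this, the position of the accent in the test string can be pinned down with plain JDK calls, without Lucene on the classpath. The following is only a minimal sketch and is not part of either Lucene test: the class name CombiningMarkCheck and the helper firstCombiningMark are made up for illustration. It shows that the char at index 4 of the test string is U+0301, which the JDK classifies as a non-spacing mark. It says nothing about how StandardAnalyzer itself treats that character in either version.

```java
public class CombiningMarkCheck {
    // Same test string as in the two Lucene tests above.
    public static final String INPUT =
            "\u0412\u0430\u0441\u0438\u0301\u043B\u044C\u0435\u0432";

    /**
     * Hypothetical helper: returns the index of the first char whose Unicode
     * general category is Mn (non-spacing mark), or -1 if there is none.
     */
    public static int firstCombiningMark(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.getType(s.charAt(i)) == Character.NON_SPACING_MARK) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int idx = firstCombiningMark(INPUT);
        if (idx < 0) {
            System.out.println("no combining mark found");
            return;
        }
        // Report where the mark sits and what surrounds it.
        System.out.println("combining mark at index " + idx
                + " (" + String.format("U+%04X", (int) INPUT.charAt(idx)) + ")");
        System.out.println("chars before the mark: " + idx);
        System.out.println("chars after the mark:  " + (INPUT.length() - idx - 1));
    }
}
```

Running it should report the mark at index 4, with four chars before it and four after it, which is consistent with a split "at the accent" falling in the middle of the word.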