Hi all.

I created a test using Lucene 2.3.  When run, this generates a single token:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public static void main(String[] args) throws Exception {
        // "Васи́льев", with the acute accent as a combining character (U+0301)
        String string = "\u0412\u0430\u0441\u0438\u0301\u043B\u044C\u0435\u0432";
        StandardAnalyzer analyser = new StandardAnalyzer();
        TokenStream stream = analyser.tokenStream("text", new StringReader(string));
        Token token;
        while ((token = stream.next()) != null)
        {
            System.out.println(new String(token.termBuffer(), 0, token.termLength()));
        }
    }

I then wrote a similar test against Lucene 3.0, this time specifying
the version of StandardAnalyzer behaviour to use:

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public static void main(String[] args) throws Exception {
        // Same input: "Васи́льев" with the combining acute accent (U+0301)
        String string = "\u0412\u0430\u0441\u0438\u0301\u043B\u044C\u0435\u0432";
        StandardAnalyzer analyser = new StandardAnalyzer(Version.LUCENE_23);
        TokenStream stream = analyser.tokenStream("text", new StringReader(string));
        TermAttribute termAttribute = stream.getAttribute(TermAttribute.class);
        while (stream.incrementToken())
        {
            System.out.println(termAttribute.term());
        }
    }

But this generates two tokens, splitting the word at the combining
accent.  (I assume the accent issue itself has already been fixed as
of v3.1.)
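
To show exactly where the split happens, here is the same 3.0 test
again, sketched with an extra OffsetAttribute (which StandardTokenizer
fills in) so that each token's character offsets are printed alongside
the term:

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public static void main(String[] args) throws Exception {
        String string = "\u0412\u0430\u0441\u0438\u0301\u043B\u044C\u0435\u0432";
        StandardAnalyzer analyser = new StandardAnalyzer(Version.LUCENE_23);
        TokenStream stream = analyser.tokenStream("text", new StringReader(string));
        TermAttribute termAttribute = stream.addAttribute(TermAttribute.class);
        OffsetAttribute offsetAttribute = stream.addAttribute(OffsetAttribute.class);
        while (stream.incrementToken())
        {
            // prints each term with its [start, end) character offsets in the input
            System.out.println(termAttribute.term() + " ["
                    + offsetAttribute.startOffset() + ", "
                    + offsetAttribute.endOffset() + ")");
        }
    }

(Here I use addAttribute rather than getAttribute, since addAttribute
returns the existing instance if one is already registered and never
throws.)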

I was under the impression that the Version parameter was there to
support exactly this sort of backwards compatibility, so that indexes
created with an older release could still be searched meaningfully
using an updated version of Lucene.  Have I found a gap in the
backwards-compatibility support here?
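
For reference, here is a minimal sketch (the class name VersionCheck
is just illustrative) that feeds the same input through
StandardAnalyzer once per Version constant, so the behaviour of each
can be compared directly:

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public class VersionCheck {
        public static void main(String[] args) throws Exception {
            String string = "\u0412\u0430\u0441\u0438\u0301\u043B\u044C\u0435\u0432";
            for (Version version : new Version[] { Version.LUCENE_23, Version.LUCENE_30 }) {
                StandardAnalyzer analyser = new StandardAnalyzer(version);
                TokenStream stream = analyser.tokenStream("text", new StringReader(string));
                TermAttribute termAttribute = stream.addAttribute(TermAttribute.class);
                int count = 0;
                while (stream.incrementToken()) {
                    count++;
                    System.out.println(version + ": " + termAttribute.term());
                }
                // report how many tokens this Version constant produced
                System.out.println(version + " -> " + count + " token(s)");
            }
        }
    }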

TX
