[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

Uwe Schindler (JIRA) Thu, 03 Dec 2009 09:52:57 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785416#action_12785416
 ]


Uwe Schindler commented on LUCENE-2074:
---------------------------------------

bq. I'm actually surprised that the DFAs are identical, since I'm almost 
certain that the set of characters matching [:letter:] changed between Unicode 
3.0 and Unicode 4.0 (maybe [:digit:] too). I'll take a look this weekend.

That is exactly why we have the patch: we now have two flex files. One uses 
%unicode 3.0 and produces the same DFA as the old flex file did when processed 
with Java 1.4 (as it was in Lucene 2.x). It is used for backwards 
compatibility (selected via the matchVersion parameter of the ctor).

For later Lucene versions we will have a new jflex file (currently unicode 4.0) 
that, at the moment, produces the same matrix as jflex 1.4 does on Java 1.5.

This simply makes regenerating the parser independent of the developer's 
JVM. At the moment, this issue is about nothing more than that.
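
To illustrate the matchVersion selection, here is a minimal sketch. It is not 
the code from the patch: the class, enum and interface names are hypothetical, 
and the LUCENE_30 cutoff is an assumption taken from the issue description.

```java
// Hypothetical sketch: pick one of two generated scanners by matchVersion.
class VersionedScannerSketch {

    // Stand-in for Lucene's Version constants (assumed ordering: oldest first).
    enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31 }

    // Hypothetical common interface for the two JFlex-generated scanners.
    interface Scanner {
        String describe();
    }

    // Would wrap the scanner generated from the %unicode 3.0 grammar.
    static final class Unicode3Scanner implements Scanner {
        public String describe() { return "unicode 3.0 grammar (Lucene 2.x behaviour)"; }
    }

    // Would wrap the scanner generated from the unicode 4.0 grammar.
    static final class Unicode4Scanner implements Scanner {
        public String describe() { return "unicode 4.0 grammar"; }
    }

    // Assumed cutoff: LUCENE_30 and later get the new scanner,
    // everything older keeps the backwards compatible one.
    static Scanner select(MatchVersion matchVersion) {
        return matchVersion.compareTo(MatchVersion.LUCENE_30) >= 0
                ? new Unicode4Scanner()
                : new Unicode3Scanner();
    }

    public static void main(String[] args) {
        for (MatchVersion v : MatchVersion.values()) {
            System.out.println(v + " -> " + select(v).describe());
        }
    }
}
```

Because both grammars pin their Unicode version, the choice depends only on 
matchVersion and not on the JVM that ran jflex.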

> Use a separate JFlex generated Unicode 4 by Java 5 compatible 
> StandardTokenizer
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-2074
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2074
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
> LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
> LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch
>
>
> The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
> (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
> regenerate the file.
> After regeneration the Tokenizer behaves differently for some characters. 
> Because of that we should only use the new TokenizerImpl when 
> Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

