[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

Steven Rowe (JIRA) Thu, 03 Dec 2009 09:47:53 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785414#action_12785414
 ]


Steven Rowe commented on LUCENE-2074:
-------------------------------------

bq. Do you see a problem with just requiring Flex 1.5 for Lucene trunk at the 
moment?

I think it's fine to do that.

bq. The new parsers (see patch) are pre-generated in SVN, so somebody compiling 
lucene from source does need to use jflex. And the parsers for 
StandardTokenizer are verified to work correct and are even identical (DFA 
wise) for the old Java 1.4 / Unicode 3.0 case.

Most of the StandardTokenizerImpl.jflex grammar is expressed in absolute terms 
- the only JVM-/Unicode-version-sensistive usages are [:letter:] and [:digit:], 
which under JFlex <1.5 were expanded using the scanner-generation-time JVM's 
Character.isLetter() and .isDigit() definitions, but under JFlex 1.5-SNAPSHOT 
depend on the declared Unicode version definitions (i.e., [:letter:] = 
\p{Letter}).

I'm actually surprised that the DFAs are identical, since I'm almost certain 
that the set of characters matching [:letter:] changed between Unicode 3.0 and 
Unicode 4.0 (maybe [:digit:] too).  I'll take a look this weekend.


> Use a separate JFlex generated Unicode 4 by Java 5 compatible 
> StandardTokenizer
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-2074
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2074
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
> LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
> LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch
>
>
> The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
> (according to the warning). In Java 3.0 we switch to Java 1.5, so we should 
> regenerate the file.
> After regeneration the Tokenizer behaves different for some characters. 
> Because of that we should only use the new TokenizerImpl when 
> Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

Reply via email to