[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785414#action_12785414 ]
Steven Rowe commented on LUCENE-2074:
-------------------------------------

bq. Do you see a problem with just requiring Flex 1.5 for Lucene trunk at the moment?

I think it's fine to do that.

bq. The new parsers (see patch) are pre-generated in SVN, so somebody compiling lucene from source does need to use jflex. And the parsers for StandardTokenizer are verified to work correct and are even identical (DFA wise) for the old Java 1.4 / Unicode 3.0 case.

Most of the StandardTokenizerImpl.jflex grammar is expressed in absolute terms. The only JVM-/Unicode-version-sensitive usages are [:letter:] and [:digit:]. Under JFlex <1.5 these were expanded using the scanner-generation-time JVM's Character.isLetter() and .isDigit() definitions, but under JFlex 1.5-SNAPSHOT they depend on the declared Unicode version definitions (i.e., [:letter:] = \p{Letter}).

I'm actually surprised that the DFAs are identical, since I'm almost certain that the set of characters matching [:letter:] changed between Unicode 3.0 and Unicode 4.0 (maybe [:digit:] too). I'll take a look this weekend.

> Use a separate JFlex generated Unicode 4 by Java 5 compatible
> StandardTokenizer
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-2074
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2074
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch,
> LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch,
> LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch
>
>
> The current trunk version of StandardTokenizerImpl was generated by Java 1.4
> (according to the warning). In Java 3.0 we switch to Java 1.5, so we should
> regenerate the file.
> After regeneration the Tokenizer behaves different for some characters.
> Because of that we should only use the new TokenizerImpl when
> Version.LUCENE_30 or LUCENE_31 is used as matchVersion.
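
For anyone who wants to check the [:letter:] / [:digit:] point from the comment on their own JVM, here is a rough sketch. It is not JFlex code: CharClassProbe, ranges() and size() are hypothetical names made up for this illustration. It only assumes that a pre-1.5 JFlex effectively asked the running JVM's Character.isLetter() and Character.isDigit() about each 16-bit char, which is how the comment describes it. The sketch collapses the accepted chars into ranges so the output of two different JVMs can be diffed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical probe, not part of Lucene or JFlex. */
public class CharClassProbe {

    /** Collects maximal [start, end] ranges of 16-bit char values accepted by the predicate. */
    public static List<int[]> ranges(IntPredicate accepts) {
        List<int[]> out = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a range"
        for (int c = 0; c <= 0xFFFF; c++) {
            boolean in = accepts.test(c);
            if (in && start < 0) {
                start = c;
            } else if (!in && start >= 0) {
                out.add(new int[] {start, c - 1});
                start = -1;
            }
        }
        if (start >= 0) {
            out.add(new int[] {start, 0xFFFF}); // close a range that runs to the end
        }
        return out;
    }

    /** Total number of char values covered by the given ranges. */
    public static int size(List<int[]> ranges) {
        int n = 0;
        for (int[] r : ranges) {
            n += r[1] - r[0] + 1;
        }
        return n;
    }

    public static void main(String[] args) {
        // Assumption: these two JVM predicates stand in for what [:letter:] and
        // [:digit:] expanded to when the scanner was generated under JFlex <1.5.
        List<int[]> letters = ranges(c -> Character.isLetter((char) c));
        List<int[]> digits = ranges(c -> Character.isDigit((char) c));

        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("letters: " + size(letters) + " chars in " + letters.size() + " ranges");
        System.out.println("digits:  " + size(digits) + " chars in " + digits.size() + " ranges");
    }
}
```

The counts depend on the Unicode version bundled with the JVM that runs the probe, so differing numbers between two JVMs would be consistent with the expectation above that the [:letter:] set changed between Unicode versions. The lambdas need Java 8 or later; on an older JVM the predicates would have to be written as small classes instead.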