[ https://issues.apache.org/jira/browse/LUCENE-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Rowe updated LUCENE-5770:
-------------------------------
Attachment: LUCENE-5770.patch
Preliminary patch modifying the specifications for StandardTokenizer,
UAX29URLEmailTokenizer and HTMLStripCharFilter, the three JFlex specifications
that use ICU4J-generated supplementary code points.
When I manually generate the scanners for these using JFlex 1.6.0-SNAPSHOT,
some tests fail; I haven't looked into them yet:
{noformat}
[junit4] Tests with failures:
[junit4] - org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest.testSupplementaryCharsInTags
[junit4] - org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest.testRandomHugeStrings
[junit4] - org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest.testRandom
[junit4] - org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings
[junit4] - org.apache.lucene.analysis.core.TestFactories.test
[junit4] - org.apache.lucene.analysis.core.TestRandomChains.testRandomChain
{noformat}
The Standard and UAX29URLEmail tests are all passing, including
WordBreakTestUnicode_6_3_0, which is built from the Unicode test data for the
UAX#29 word break rules, and includes some tests with supplementary code points.
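
For context on the supplementary code points mentioned above: these are the
code points beyond the BMP (above U+FFFF), which Java strings store as a
surrogate pair of two UTF-16 code units. The following is only an illustrative
sketch using java.lang.Character; the class name SurrogatePairDemo and its
helper methods are made up for this example and are not part of the patch or
of JFlex.

```java
public class SurrogatePairDemo {

    /** Returns the UTF-16 code units (one or two chars) encoding a code point. */
    public static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    /** True if the code point lies outside the BMP, i.e. above U+FFFF. */
    public static boolean isSupplementary(int codePoint) {
        return Character.isSupplementaryCodePoint(codePoint);
    }

    /** Formats the UTF-16 code units of a code point as hex, space-separated. */
    public static String describe(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : toUtf16(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int bmpCodePoint = 0x00E9;            // inside the BMP: one code unit
        int supplementaryCodePoint = 0x1F600; // outside the BMP: two code units

        System.out.println(describe(bmpCodePoint));
        System.out.println(describe(supplementaryCodePoint));

        // Iterating by code point rather than by char keeps the pair together.
        String text = new String(toUtf16(supplementaryCodePoint));
        System.out.println(text.length() + " chars, "
            + text.codePointCount(0, text.length()) + " code point");
    }
}
```

A scanner that only understands the BMP sees such a character as two separate
chars, which is why the pre-1.6 specifications had to spell out surrogate pairs
explicitly.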
> Upgrade JFlex to 1.6.0
> ----------------------
>
> Key: LUCENE-5770
> URL: https://issues.apache.org/jira/browse/LUCENE-5770
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Steve Rowe
> Assignee: Steve Rowe
> Priority: Minor
> Fix For: 5.0, 4.10
>
> Attachments: LUCENE-5770.patch
>
>
> JFlex 1.6, to be released shortly, will have direct support for supplementary
> code points - JFlex 1.5 and earlier only support code points in the BMP.
> We can drop the use of ICU4J to generate surrogate pairs to extend our JFlex
> scanner specifications to handle supplementary code points.