[ https://issues.apache.org/jira/browse/LUCENE-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Rowe updated LUCENE-5770:
-------------------------------

    Attachment: LUCENE-5770.patch

Preliminary patch modifying the specifications for StandardTokenizer, 
UAX29URLEmailTokenizer and HTMLStripCharFilter, the three JFlex specifications 
that use ICU4J-generated supplementary code points.

When I manually generate the scanners for these using JFlex 1.6.0-SNAPSHOT, 
some tests fail; I haven't looked into them yet:

{noformat}
   [junit4] Tests with failures:
   [junit4]   - org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest.testSupplementaryCharsInTags
   [junit4]   - org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest.testRandomHugeStrings
   [junit4]   - org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest.testRandom
   [junit4]   - org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings
   [junit4]   - org.apache.lucene.analysis.core.TestFactories.test
   [junit4]   - org.apache.lucene.analysis.core.TestRandomChains.testRandomChain
{noformat}

The Standard and UAX29URLEmail tests are all passing, including 
WordBreakTestUnicode_6_3_0, which is built from the Unicode test data for the 
UAX#29 word break rules, and includes some tests with supplementary code points.
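
For context on what "supplementary code points" means here: a code point above 
U+FFFF lies outside the BMP, and Java's UTF-16 strings represent it as a pair of 
surrogate code units. The following is only a minimal sketch of that encoding 
using the standard java.lang.Character API; the class name, the helper method 
and the sample code points are illustrative and are not taken from the patch.

```java
// Illustrative only: class and method names are hypothetical, not from the patch.
class SurrogatePairDemo {

    // Returns the UTF-16 code units of a code point as space-separated hex.
    static String utf16Units(int codePoint) {
        char[] units = Character.toChars(codePoint);
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int bmp = 0x00E9;            // LATIN SMALL LETTER E WITH ACUTE, inside the BMP
        int supplementary = 0x1D11E; // MUSICAL SYMBOL G CLEF, outside the BMP

        // A BMP code point needs a single UTF-16 code unit.
        System.out.println(Character.isSupplementaryCodePoint(bmp) + " " + utf16Units(bmp));

        // A supplementary code point needs a high and a low surrogate.
        System.out.println(Character.isSupplementaryCodePoint(supplementary)
                + " " + utf16Units(supplementary));
    }
}
```

A scanner that only understands individual UTF-16 code units has to have such 
code points spelled out as explicit surrogate pairs in its specification, 
whereas a scanner that works on whole code points can match them directly.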

> Upgrade JFlex to 1.6.0
> ----------------------
>
>                 Key: LUCENE-5770
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5770
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Steve Rowe
>            Assignee: Steve Rowe
>            Priority: Minor
>             Fix For: 5.0, 4.10
>
>         Attachments: LUCENE-5770.patch
>
>
> JFlex 1.6, to be released shortly, will have direct support for supplementary 
> code points - JFlex 1.5 and earlier only support code points in the BMP.
> We can drop the use of ICU4J to generate surrogate pairs to extend our JFlex 
> scanner specifications to handle supplementary code points.  



