[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241213#comment-14241213 ]
Steve Rowe commented on LUCENE-6103:
------------------------------------

0. In Lucene 4.7 through 4.10, yes, it implements the revision of UAX#29 associated with Unicode 6.3. I thought there was a JIRA to upgrade Lucene to Unicode 7.0, but I can't find it ATM. JFlex 1.6 and ICU 54.1 support Unicode 7.0.
1. I recommend a language-specific tailoring of UAX#29. There are tailoring notes in the standard you'll want to look at.
2. Unfortunately, I think the correct approach here is lobbying to change the standard.

> StandardTokenizer doesn't tokenize word:word
> --------------------------------------------
>
>                 Key: LUCENE-6103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6103
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.9
>            Reporter: Itamar Syn-Hershko
>            Assignee: Steve Rowe
>
> StandardTokenizer (and by result most default analyzers) will not tokenize
> word:word and will preserve it as one token. This can be easily seen using
> Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic
> behind it.
> If not, I'll be happy to join in the effort of fixing this.
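For readers wondering why word:word stays in one piece: in the Unicode 6.3 data, ':' has the Word_Break value MidLetter, and UAX#29 rules WB6 and WB7 forbid a break around a MidLetter character that sits between two letters. The Python sketch below is only a rough illustration of that rule and of what a tailoring that drops ':' from MidLetter would do. It is not Lucene's implementation. The function name `sketch_tokenize` and its `mid_letter` parameter are made up for this example, `str.isalpha()` stands in for the real ALetter property, and digits and all other rules are ignored.

```python
# Rough illustration of UAX#29 rules WB5-WB7 only; NOT Lucene's code.
# Assumptions: str.isalpha() approximates Word_Break=ALetter, and the
# MidLetter set is reduced to ':' alone (the real set is larger).

MID_LETTER = {":"}


def sketch_tokenize(text, mid_letter=MID_LETTER):
    """Split text into letter runs, keeping letter + MidLetter + letter together."""
    tokens = []
    current = []
    for i, ch in enumerate(text):
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        if ch.isalpha():
            # WB5: no break between adjacent letters.
            current.append(ch)
        elif ch in mid_letter and current and next_is_letter:
            # WB6/WB7: no break around a MidLetter between two letters.
            current.append(ch)
        else:
            # Anything else ends the current token and is discarded.
            if current:
                tokens.append("".join(current))
                current = []
    if current:
        tokens.append("".join(current))
    return tokens


# Default rules: the colon between letters does not split the token.
default_tokens = sketch_tokenize("word word:word")

# Hypothetical tailoring: treat ':' as an ordinary break character.
tailored_tokens = sketch_tokenize("word word:word", mid_letter=set())
```

With the default set, the input from the issue yields two tokens, the second being word:word. With the empty set, it yields three separate tokens, which is roughly the effect a language-specific tailoring would aim for.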