[
https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241236#comment-14241236
]
Steve Rowe commented on LUCENE-6103:
------------------------------------
bq. Maybe out of scope of this ticket, but how do we go about #2? will be happy
to take this discussion offline as well
Yeah, I'm not sure where the discussion should go; here's fine for me.
Prior to releasing new Unicode versions, PRIs (Public Review Issues) are
created for proposed changes to individual standards:
[http://www.unicode.org/review/] - people can then submit comments, which are
considered for incorporation into the final standard. I don't see one there
for UAX#29, but there have been for previous releases.
I think [~rcmuir] is an individual member of the Unicode consortium - maybe
he'll have some ideas on how to proceed?
> StandardTokenizer doesn't tokenize word:word
> --------------------------------------------
>
> Key: LUCENE-6103
> URL: https://issues.apache.org/jira/browse/LUCENE-6103
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 4.9
> Reporter: Itamar Syn-Hershko
> Assignee: Steve Rowe
>
> StandardTokenizer (and as a result most default analyzers) will not tokenize
> word:word and will preserve it as one token. This can easily be seen using
> Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic
> behind it.
> If not, I'll be happy to join in the effort of fixing this.
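For context on why this happens: StandardTokenizer implements the word break rules of UAX#29, in which the colon (U+003A) carries the Word_Break=MidLetter property, and rules WB6/WB7 suppress a break between two letters separated by a single MidLetter character (the property exists for cases like Swedish abbreviations). The sketch below is a rough, hypothetical illustration of that rule in plain Python - it is not Lucene code, and it handles only a toy subset of the real segmentation algorithm:

```python
# Illustrative subset of Word_Break=MidLetter characters (the real set is
# defined in the Unicode Character Database, not hard-coded like this).
MIDLETTER = {":", "\u00B7", "\u2027"}

def uax29ish_words(text):
    """Very rough word segmenter mimicking UAX#29 rules WB6/WB7:
    a single MidLetter character flanked by letters does not break the word."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if text[i].isalpha():
            j = i + 1
            while j < n:
                if text[j].isalpha():
                    j += 1
                # MidLetter only joins when a letter follows it (WB6/WB7)
                elif text[j] in MIDLETTER and j + 1 < n and text[j + 1].isalpha():
                    j += 2
                else:
                    break
            tokens.append(text[i:j])
            i = j
        else:
            i += 1  # everything else is a break opportunity here
    return tokens

print(uax29ish_words("word word:word"))  # colon between letters does not split
print(uax29ish_words("word: word"))      # trailing colon does split
```

So the behavior reported in the issue is "spec-conformant by default", which is why changing it means either overriding the grammar in Lucene or lobbying Unicode through the PRI process discussed above.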
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]