[
https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241236#comment-14241236
]
Steve Rowe commented on LUCENE-6103:
------------------------------------
bq. Maybe out of scope of this ticket, but how do we go about #2? will be happy
to take this discussion offline as well
Yeah, I'm not sure where the discussion should go; here's fine for me.
Prior to releasing new Unicode versions, PRIs (Public Review Issues) are
created for proposed changes to individual standards:
[http://www.unicode.org/review/] - people can then submit comments, which are
considered for incorporation into the final standard. I don't see one there
for UAX#29, but there have been for previous releases.
I think [~rcmuir] is an individual member of the Unicode consortium - maybe
he'll have some ideas on how to proceed?
> StandardTokenizer doesn't tokenize word:word
> --------------------------------------------
>
> Key: LUCENE-6103
> URL: https://issues.apache.org/jira/browse/LUCENE-6103
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 4.9
> Reporter: Itamar Syn-Hershko
> Assignee: Steve Rowe
>
> StandardTokenizer (and as a result most default analyzers) will not tokenize
> word:word and will preserve it as one token. This can easily be seen using
> Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic
> behind it.
> If not, I'll be happy to join in the effort of fixing this.
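For context on why this happens: StandardTokenizer implements the word break rules of UAX#29, in which the colon (U+003A) carries the Word_Break=MidLetter property, and rules WB6/WB7 suppress a break between two letters separated by a single MidLetter character (the property exists for cases like Swedish abbreviations). The sketch below is a rough, hypothetical illustration of that rule in plain Python - it is not Lucene code, and it handles only a toy subset of the real segmentation algorithm:

```python
# Illustrative subset of Word_Break=MidLetter characters (the real set is
# defined in the Unicode Character Database, not hard-coded like this).
MIDLETTER = {":", "\u00B7", "\u2027"}

def uax29ish_words(text):
    """Very rough word segmenter mimicking UAX#29 rules WB6/WB7:
    a single MidLetter character flanked by letters does not break the word."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if text[i].isalpha():
            j = i + 1
            while j < n:
                if text[j].isalpha():
                    j += 1
                # MidLetter only joins when a letter follows it (WB6/WB7)
                elif text[j] in MIDLETTER and j + 1 < n and text[j + 1].isalpha():
                    j += 2
                else:
                    break
            tokens.append(text[i:j])
            i = j
        else:
            i += 1  # everything else is a break opportunity here
    return tokens

print(uax29ish_words("word word:word"))  # colon between letters does not split
print(uax29ish_words("word: word"))      # trailing colon does split
```

So the behavior reported in the issue is "spec-conformant by default", which is why changing it means either overriding the grammar in Lucene or lobbying Unicode through the PRI process discussed above.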
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]