[ 
https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239784#comment-14239784
 ] 

Itamar Syn-Hershko commented on LUCENE-6103:
--------------------------------------------

Yes, I figured it would come down to some Unicode rules. Can you give a 
rationale for this, mainly out of curiosity?

I'm not a Unicode expert, but just as you wouldn't want English words to stay 
joined across Hebrew Punctuation Gershayim (e.g. Test"Word is actually 2 tokens 
while מנכ"לים is one), maybe this rule is meant for specific scenarios and not 
for the general use case?

On another note, any type of Gershayim should be preserved within Hebrew words, 
not only U+05F4. This is mainly because the keyboards and editors in common use 
produce the standard " character in most cases. I had a chat with Robert a while 
back in which he said that's the case; I'm just making sure you didn't follow 
the specs to the letter in that regard...
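
To make the cases above concrete, here is a rough Python sketch of the joining 
behavior as I read the UAX #29 word-break rules. It is a toy, not Lucene's 
implementation: the character classes are deliberately tiny, digits and most 
scripts are ignored, and the names HEBREW, LETTER, MID_LETTER and tokenize are 
made up for this example. The assumption is that ":" and U+05F4 join any two 
letters, while the standard " joins only when both neighbours are Hebrew 
letters.

```python
import re

# Simplified, hypothetical character classes (NOT the full UAX #29 tables).
HEBREW = "\u05D0-\u05EA"          # Hebrew letters alef..tav
LETTER = "A-Za-z" + HEBREW        # Latin and Hebrew letters only
MID_LETTER = ":\u05F4"            # colon and HEBREW PUNCTUATION GERSHAYIM

# A token is a run of letters, optionally continued by a joiner plus more
# letters. A joiner is either a mid-letter character, or the ASCII double
# quote when it sits directly between two Hebrew letters.
TOKEN = re.compile(
    rf'[{LETTER}]+(?:(?:[{MID_LETTER}]|(?<=[{HEBREW}])"(?=[{HEBREW}]))[{LETTER}]+)*'
)


def tokenize(text):
    """Return the tokens found in text under the simplified rules above."""
    return TOKEN.findall(text)


print(tokenize("word word:word"))   # ['word', 'word:word']
print(tokenize('Test"Word'))        # ['Test', 'Word']
print(tokenize('\u05DE\u05E0\u05DB"\u05DC\u05D9\u05DD'))  # one token
```

Under these assumptions word:word stays one token for the same reason a Hebrew 
word containing U+05F4 does, and the standard " only joins between Hebrew 
letters, which is the behavior I'm asking about.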

> StandardTokenizer doesn't tokenize word:word
> --------------------------------------------
>
>                 Key: LUCENE-6103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6103
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.9
>            Reporter: Itamar Syn-Hershko
>            Assignee: Steve Rowe
>
> StandardTokenizer (and, as a result, most default analyzers) will not tokenize 
> word:word and will preserve it as one token. This can be easily seen using 
> Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic 
> behind it.
> If not, I'll be happy to join in the effort of fixing this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
