[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239928#comment-14239928 ]
Steve Rowe commented on LUCENE-6103:
------------------------------------

bq. Yes, I figured it will be down to some Unicode rules. Can you give a rationale for this, mainly out of curiosity?

The comment in the {{MidLetter}} list says it's for Swedish. If you look at the [revision history at the bottom of the page|http://www.unicode.org/reports/tr29/#Modifications], the colon was temporarily removed from {{MidLetter}} between Unicode versions 6.2 and 6.3, but then put back before 6.3 was released (I guess this should be read from the bottom upward):

{quote}
* Restored colon and equivalents (removed in previous draft).
* Removed colon from MidLetter, so that it is no longer contained within words. Handling of colon for word boundary determination in Swedish would be done by tailoring, instead – for example by a Swedish localization definition in CLDR.
{quote}

I guess the Swedish contingent among Unicoders is strong?

{quote}
Not a Unicode expert, but I'd assume just like you wouldn't want English words to not-break on Hebrew Punctuation Gershayim (e.g. Test"Word is actually 2 tokens and מנכ"לים is one), maybe this rule is meant for specific scenarios and not for the general use case?
{quote}

StandardTokenizer is not intended to be English-centric - instead it should do something reasonable with any text.

{quote}
On another note, any type of Gershayim should be preserved within Hebrew words, not only U+05F4. This is mainly because the keyboards and editors in use produce the standard " character in most cases. I had a chat with Robert a while back where he said that's the case, I'm just making sure you didn't follow the specs to the letter in that regard...
{quote}

I did follow the specs to the letter, and it does the right thing: rules [WB7b and WB7c|http://www.unicode.org/reports/tr29/#WB7b] forbid breaks around the ASCII double quote character, but only when it is surrounded by Hebrew letters (a quick sketch is appended below).

> StandardTokenizer doesn't tokenize word:word
> --------------------------------------------
>
>                 Key: LUCENE-6103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6103
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.9
>            Reporter: Itamar Syn-Hershko
>            Assignee: Steve Rowe
>
> StandardTokenizer (and by result most default analyzers) will not tokenize word:word and will preserve it as one token. This can be easily seen using Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic behind it.
> If not, I'll be happy to join in the effort of fixing this.
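For reference, here's a minimal sketch showing how to check this directly against StandardTokenizer. The class name {{StandardTokenizerDemo}} is just illustrative, and it assumes the Lucene analysis classes are on the classpath and a Lucene 5+ style API (no-arg constructor plus {{setReader}}); on 4.x the constructor takes a {{Version}} and a {{Reader}} instead.

{code:java}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StandardTokenizerDemo {

  // Prints each token StandardTokenizer emits for the given text.
  static void printTokens(String text) throws IOException {
    StandardTokenizer tokenizer = new StandardTokenizer(); // Lucene 5+ style; 4.x takes (Version, Reader)
    tokenizer.setReader(new StringReader(text));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    StringBuilder sb = new StringBuilder(text).append(" =>");
    while (tokenizer.incrementToken()) {
      sb.append(" [").append(term).append(']');
    }
    tokenizer.end();
    tokenizer.close();
    System.out.println(sb);
  }

  public static void main(String[] args) throws IOException {
    printTokens("word:word");   // ':' is UAX#29 MidLetter, so this should stay one token
    printTokens("Test\"Word");  // '"' only joins Hebrew letters (WB7b/WB7c), so two tokens here
    printTokens("מנכ\"לים");    // '"' surrounded by Hebrew letters, so one token
  }
}
{code}

Per the rules discussed above, {{word:word}} and {{מנכ"לים}} should each come out as a single token, while {{Test"Word}} should split in two.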