[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240133#comment-14240133 ]
Itamar Syn-Hershko commented on LUCENE-6103:
--------------------------------------------

OK, so I did some homework. In Swedish, the colon is used to shorten words: "C:a" is in fact "cirka", which means "approximately". I guess it can be thought of as similar to English acronyms, though apparently it's far less commonly used in Swedish (my source says it is "very very seldom used; old style and not used in modern Swedish at all"). Not only is it hardly used, it's apparently only legal in 3-letter combinations (c:a, but not c:ka). On top of that, its effects are quite severe at the moment: two words with a colon between them and no space will be output as one token, even though that is 100% not applicable to Swedish, since each word has more than 2 characters.

I'm not aiming to change the Unicode standard, that's way beyond my limited powers, but:

1. Given the above, does it really make sense to use this tokenizer in all the language-specific analyzers as well? E.g. https://github.com/apache/lucene-solr/blob/lucene_solr_4_9_1/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L105 I'd think language-specific analyzers should use tokenizers aimed at that language, with limited support for others. In this case, the colon would then always be considered a tokenizing char (a sketch of such an analyzer follows below the quoted issue).

2. Can we change the JFlex definition to at least limit the effects of this, e.g. only treat the colon as MidLetter if the overall token length == 3, so that c:a is a valid token and word:word is not? (A prototype of this rule also follows below.)

> StandardTokenizer doesn't tokenize word:word
> --------------------------------------------
>
>                 Key: LUCENE-6103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6103
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.9
>            Reporter: Itamar Syn-Hershko
>            Assignee: Steve Rowe
>
> StandardTokenizer (and as a result most default analyzers) will not tokenize
> word:word, preserving it as one token instead. This can easily be seen using
> Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic
> behind it.
> If not, I'll be happy to join in the effort of fixing this.
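
For illustration, here's a minimal sketch of what point 1 could look like: an analyzer that always treats the colon as a tokenizing char by mapping it to a space before the tokenizer ever sees it. The class name ColonSplittingAnalyzer is made up for this example, and it's written against the 4.9-era API (Version-taking constructors) this issue targets; it's not a proposal for how EnglishAnalyzer itself should be patched.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical example analyzer, not part of Lucene.
public final class ColonSplittingAnalyzer extends Analyzer {

  private static final NormalizeCharMap COLON_TO_SPACE;
  static {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add(":", " ");
    COLON_TO_SPACE = builder.build();
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Rewrite every ':' to ' ' before tokenization, so the tokenizer
    // never sees a colon and "word:word" becomes two tokens.
    // MappingCharFilter also corrects offsets, so highlighting still works.
    return new MappingCharFilter(COLON_TO_SPACE, reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(Version.LUCENE_4_9, reader);
    return new TokenStreamComponents(source);
  }
}

With this in place, "word:word" comes out as the two tokens [word] [word]. The trade-off is that "c:a" splits too, which is exactly what the second sketch below tries to preserve.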
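
The proper fix for point 2 would be in the JFlex grammar itself, but the length == 3 rule can be prototyped without regenerating the grammar, as a TokenFilter that post-splits colon-joined tokens. A sketch, assuming the filter runs directly on StandardTokenizer output so the term text maps 1:1 onto the original offsets; ColonSplitFilter is a hypothetical name:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Hypothetical prototype of the proposed length == 3 rule, not part of Lucene.
public final class ColonSplitFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  private String[] parts;   // remaining pieces of a token being split
  private int partIndex;
  private int partOffset;   // start offset of the next piece

  public ColonSplitFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (parts != null) {
      // Emit the next piece of a previously split token; clearAttributes()
      // resets the position increment to the default of 1.
      clearAttributes();
      emitPart();
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String term = termAtt.toString();
    // Keep colon-free tokens, and 3-char tokens like the Swedish
    // abbreviation "c:a", exactly as the tokenizer produced them.
    if (term.indexOf(':') < 0 || term.length() == 3) {
      return true;
    }
    parts = term.split(":");
    partIndex = 0;
    partOffset = offsetAtt.startOffset();
    emitPart();
    return true;
  }

  private void emitPart() {
    String part = parts[partIndex++];
    termAtt.setEmpty().append(part);
    // Assumes term text maps 1:1 onto the original offsets, which
    // holds when this filter sits directly on StandardTokenizer.
    offsetAtt.setOffset(partOffset, partOffset + part.length());
    partOffset += part.length() + 1;  // +1 skips the colon
    if (partIndex == parts.length) {
      parts = null;
    }
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    parts = null;
  }
}

Chained as new ColonSplitFilter(new StandardTokenizer(Version.LUCENE_4_9, reader)), "word:word" yields [word] [word] while "c:a" stays a single token. Doing it in the grammar would still be cleaner, since a filter can't benefit from the tokenizer's own end-of-token lookahead.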