[
https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240133#comment-14240133
]
Itamar Syn-Hershko commented on LUCENE-6103:
--------------------------------------------
Ok so I did some homework. In Swedish, the colon is used to abbreviate words:
"c:a" is in fact "cirka", which means "approximately". I guess it can be
thought of as something like English acronyms, only apparently it's far less
commonly used in Swedish (my source says "very very seldomly used; old style
and not used in modern Swedish at all").
Not only is it hardly used, apparently it's also only legal in three-character
combinations (c:a, but not c:ka).
On top of that, its effects are quite severe at the moment: two words with a
colon between them and no space will be output as a single token, even when
the Swedish contraction clearly can't apply because each word has more than
two characters.
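To make the behavior concrete, here is a minimal sketch against the 4.9-era
API (the class name ColonDemo is mine; newer Lucene versions pass the Reader
via setReader() instead of the constructor):

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class ColonDemo {
      public static void main(String[] args) throws Exception {
        StandardTokenizer ts = new StandardTokenizer(Version.LUCENE_4_9,
            new StringReader("word word:word"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          System.out.println(term);  // prints "word", then "word:word" unsplit
        }
        ts.end();
        ts.close();
      }
    }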
I'm not aiming to change the Unicode standard, that's way beyond my limited
powers, but:
1. Given the above, does it really make sense to use this tokenizer in all the
language-specific analyzers as well? E.g.
https://github.com/apache/lucene-solr/blob/lucene_solr_4_9_1/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L105
I'd think that for language-specific analyzers we'd want tokenizers aimed at
that language, with only limited support for others. In this case, the colon
would then always be considered a tokenizing char (see the char-filter sketch
after point 2 for one way to get that behavior today).
2. Can we change the jflex definition to at least limit the effects of this,
e.g. only accept colon as MidLetter when the overall token length == 3, so
that c:a is a valid token and word:word is not? (A token-filter sketch
approximating that rule follows below.)
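For point 1, short of swapping tokenizers, one workaround I can think of is a
char filter that rewrites the colon to a space before StandardTokenizer sees
the text. A minimal sketch (hypothetical analyzer name, 4.9-era API):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.charfilter.MappingCharFilter;
    import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public class ColonSplittingAnalyzer extends Analyzer {
      // Map ':' to ' ' so the colon always acts as a tokenizing char.
      private static final NormalizeCharMap COLON_TO_SPACE;
      static {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add(":", " ");
        COLON_TO_SPACE = builder.build();
      }

      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        return new MappingCharFilter(COLON_TO_SPACE, reader);
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName,
                                                       Reader reader) {
        return new TokenStreamComponents(
            new StandardTokenizer(Version.LUCENE_4_9, reader));
      }
    }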
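And for point 2, while the real fix would belong in
StandardTokenizerImpl.jflex, the proposed rule could be approximated after
tokenization with a filter along these lines (untested sketch, hypothetical
class name; note it makes no attempt to correct offsets of the split parts):

    import java.io.IOException;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class ColonSplitFilter extends TokenFilter {
      private final CharTermAttribute termAtt =
          addAttribute(CharTermAttribute.class);
      private final Deque<String> pending = new ArrayDeque<String>();

      public ColonSplitFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
          // Emit the remaining parts of a previously split token.
          termAtt.setEmpty().append(pending.removeFirst());
          return true;
        }
        if (!input.incrementToken()) {
          return false;
        }
        String term = termAtt.toString();
        // Keep 3-char contractions like "c:a"; split any longer colon token.
        if (term.indexOf(':') >= 0 && term.length() != 3) {
          for (String part : term.split(":")) {
            if (!part.isEmpty()) {
              pending.addLast(part);
            }
          }
          if (pending.isEmpty()) {
            return incrementToken();  // token was nothing but colons
          }
          termAtt.setEmpty().append(pending.removeFirst());
        }
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        pending.clear();
      }
    }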
> StandardTokenizer doesn't tokenize word:word
> --------------------------------------------
>
> Key: LUCENE-6103
> URL: https://issues.apache.org/jira/browse/LUCENE-6103
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 4.9
> Reporter: Itamar Syn-Hershko
> Assignee: Steve Rowe
>
> StandardTokenizer (and as a result most default analyzers) will not tokenize
> word:word and will preserve it as one token. This can easily be seen using
> Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic
> behind it.
> If not, I'll be happy to join in the effort of fixing this.