[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240133#comment-14240133 ]

Itamar Syn-Hershko commented on LUCENE-6103:
--------------------------------------------

Ok so I did some homework. In Swedish, the colon is a way to shorten the 
writing of words, so "c:a" is in fact "cirka", which means "approximately". I 
guess it can be thought of as something like English acronyms, only apparently 
it's far less commonly used in Swedish (my source says it is "very very seldom 
used; old style and not used in modern Swedish at all").

Not only is it hardly used, apparently it's only legal in three-character 
combinations (c:a, but not c:ka).

On top of that, the effect it has right now is quite severe: two words joined 
by a colon with no space in between will be output as a single token, even 
though the Swedish abbreviation clearly doesn't apply, since each of the words 
has more than 2 characters.
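
You can see this with the tokenizer directly. A minimal sketch against the 
Lucene 4.9 analysis API (the class name and sample text are just for 
illustration):

    import java.io.StringReader;

    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class ColonDemo {
      public static void main(String[] args) throws Exception {
        // With the current grammar this prints "word" and then "word:word":
        // the colon is treated as MidLetter and does not break the second token.
        StandardTokenizer tokenizer =
            new StandardTokenizer(Version.LUCENE_4_9, new StringReader("word word:word"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
          System.out.println(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
      }
    }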

I'm not aiming at changing the Unicode standard (that's way beyond my limited 
powers), but:

1. Given the above, does it really make sense to use this tokenizer in all the 
language-specific analyzers as well? E.g. 
https://github.com/apache/lucene-solr/blob/lucene_solr_4_9_1/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L105

I'd think that for language-specific analyzers we'd want tokenizers aimed at 
that language, with limited support for others. So, in this case, the colon 
would always be considered a tokenizing char (see the first sketch after 
point 2).

2. Can we change the JFlex definition to at least limit the effects of this, 
e.g. only allow colon as MidLetter when the overall token length == 3, so that 
c:a is a valid token and word:word is not? (See the second sketch below.)
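
To illustrate point 1: a sketch of an analyzer that always treats colon as a 
tokenizing char by mapping it to a space before StandardTokenizer runs. The 
class name and the char-filter approach here are mine, for illustration only, 
not a proposed patch:

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.charfilter.MappingCharFilter;
    import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    // Hypothetical language-specific analyzer: ':' is rewritten to ' '
    // before tokenization, so "word:word" becomes two tokens.
    public class ColonBreakingAnalyzer extends Analyzer {
      private static final NormalizeCharMap COLON_MAP;
      static {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add(":", " ");
        COLON_MAP = builder.build();
      }

      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        // MappingCharFilter corrects offsets for the replaced characters.
        return new MappingCharFilter(COLON_MAP, reader);
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        StandardTokenizer source = new StandardTokenizer(Version.LUCENE_4_9, reader);
        return new TokenStreamComponents(source, new LowerCaseFilter(Version.LUCENE_4_9, source));
      }
    }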
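
And for point 2, short of changing the JFlex grammar itself, the length == 3 
rule could be approximated after the fact with a token filter. A hypothetical 
sketch (offsets of the split pieces are left uncorrected for brevity):

    import java.io.IOException;
    import java.util.LinkedList;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    // Hypothetical filter mirroring the suggested rule: keep the colon only
    // when the overall token length == 3 (e.g. "c:a"); otherwise split at colons.
    public final class ColonSplitFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PositionIncrementAttribute posIncAtt =
          addAttribute(PositionIncrementAttribute.class);
      private final LinkedList<String> pending = new LinkedList<>();

      public ColonSplitFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
          // Emit the next piece of a split token; the other attributes
          // (offsets, type) keep the original token's values in this sketch.
          termAtt.setEmpty().append(pending.removeFirst());
          posIncAtt.setPositionIncrement(1);
          return true;
        }
        if (!input.incrementToken()) {
          return false;
        }
        String term = termAtt.toString();
        if (term.length() == 3 || term.indexOf(':') < 0) {
          return true; // no colon, or a legal abbreviation like "c:a"
        }
        for (String part : term.split(":")) {
          if (!part.isEmpty()) {
            pending.add(part);
          }
        }
        if (pending.isEmpty()) {
          return incrementToken(); // token was only colons; drop it
        }
        termAtt.setEmpty().append(pending.removeFirst());
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        pending.clear();
      }
    }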

> StandardTokenizer doesn't tokenize word:word
> --------------------------------------------
>
>                 Key: LUCENE-6103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6103
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.9
>            Reporter: Itamar Syn-Hershko
>            Assignee: Steve Rowe
>
> StandardTokenizer (and as a result most default analyzers) will not tokenize 
> word:word and will preserve it as one token. This can be easily seen using 
> Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic 
> behind it.
> If not, I'll be happy to join in the effort of fixing this.


