[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240392#comment-14240392 ]
Itamar Syn-Hershko commented on LUCENE-6103:
--------------------------------------------

0. You mean it implements UAX#29 version 6.3 :)

1. I'll likely be sending a PR for #1 sometime soon. Would you recommend using UAX#29 minus the specific non-English tweaks, falling back to ClassicStandardTokenizer (which is English-specific), or something else?

2. Here's the thing: the standard is wrong, or buggy. Ask any Swede and they will tell you, and any non-Swedish corpus wouldn't care. Basically, this is a bug in every Lucene-based system today because of the word:word scenario; it's a bit of an edge case, but I bet I can find multiple occurrences in every big enough system. What can we do about that? We already solved this using char filters, converting colons to commas (a sketch follows the quoted issue below). It feels a bit hacky though, and again - this _is_ a flaw in Lucene's analysis even though it conforms to a Unicode standard.

> StandardTokenizer doesn't tokenize word:word
> --------------------------------------------
>
>                 Key: LUCENE-6103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6103
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.9
>            Reporter: Itamar Syn-Hershko
>            Assignee: Steve Rowe
>
> StandardTokenizer (and as a result most default analyzers) will not tokenize
> word:word and will preserve it as one token. This can easily be seen using
> Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic
> behind it.
> If not, I'll be happy to join the effort of fixing this.
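For reference, the reported behavior can be reproduced directly against StandardTokenizer; a minimal sketch, assuming a recent Lucene version where Tokenizer has a no-arg constructor and setReader() (in 4.x the Reader is passed to the constructor instead):

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class WordColonWordDemo {
  public static void main(String[] args) throws Exception {
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader("word word:word"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
    // Prints:
    //   word
    //   word:word   <- the colon does not split the token,
    //                  matching the behavior reported in the issue
  }
}
{code}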
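And here is a minimal sketch of the char-filter workaround mentioned in point 2 above, mapping ':' to ',' before the tokenizer sees the text. The analyzer class name is made up for illustration, and createComponents(String) is the 5.x+ signature (4.x also takes a Reader):

{code:java}
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Hypothetical analyzer for illustration: rewrites ':' to ',' ahead of
// StandardTokenizer, so "word:word" comes out as two tokens.
public class ColonSplittingAnalyzer extends Analyzer {
  private static final NormalizeCharMap COLON_TO_COMMA;
  static {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add(":", ",");
    COLON_TO_COMMA = builder.build();
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Runs before tokenization; the comma breaks the word:word run.
    return new MappingCharFilter(COLON_TO_COMMA, reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    return new TokenStreamComponents(new StandardTokenizer());
  }
}
{code}

The trade-off is the one conceded above: this splits word:word everywhere, at the cost of deliberately diverging from the UAX#29 behavior the standard prescribes.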