[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930681#action_12930681 ]
Robert Muir commented on LUCENE-2747:
-------------------------------------

bq. That is, SimpleAnalyzer is not appropriate for many languages. If it were based upon a variation of UAX29Tokenizer, but didn't handle NUM or ALPHANUM, but WORD instead, it would be the same type of token stream, just alpha words.

Ok, now I understand you, and yes, I agree... My question is, should we even bother fixing it? Would anyone who actually cares about Unicode really want only some hacked subset of UAX#29?

These simple ones like SimpleAnalyzer, WhitespaceAnalyzer, and StopAnalyzer are all really bad for Unicode text in different ways, though Simple/Stop are the bigger offenders, I think, because they will separate a base character from its combining characters (in my opinion, this should always be avoided) and, worse, they will break tokens on them. But people using them are probably happy? E.g. you can do like Solr: use WhitespaceAnalyzer and follow through with something like WordDelimiterFilter, and it's mostly ok depending upon the options, except for cases like CJK where it's a death trap.

Personally I just don't use these things since I know the problems, but we could document "this is simplistic and won't work well for many languages" and keep them around for people that don't care?

And yeah, I suppose it's confusing that these really "simple" ones are in the .core package, but to me the package is meaningless; I was just trying to keep the analyzers arranged in some kind of order (e.g. pattern-based analysis in the .pattern package, etc.). We could just as well call the package .basic or .simple or something else; it's just a name.

> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-2747
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2747
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2747.patch, LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by the UAX#29-based StandardTokenizer (i.e., deprecate them in 3.1 and remove them in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer.
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
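
For the PersianAnalyzer suggestion in the issue description above (a char filter folding ZWNJ to a space ahead of StandardTokenizer), a minimal sketch follows. It is written against a recent Lucene analysis API rather than the 3.x API discussed in this issue, the analyzer class name is made up for illustration, and the details are only a sketch; MappingCharFilter and NormalizeCharMap are the stock Lucene classes for this kind of pre-tokenization character mapping.

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Hypothetical sketch: fold ZWNJ (U+200C) to a space before tokenization,
// so StandardTokenizer sees a word boundary where ArabicLetterTokenizer
// used to break tokens.
public class ZwnjToSpaceAnalyzer extends Analyzer {

  private static final NormalizeCharMap ZWNJ_TO_SPACE;
  static {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\u200C", " "); // ZERO WIDTH NON-JOINER -> space
    ZWNJ_TO_SPACE = builder.build();
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Char filters run before the tokenizer consumes the text.
    return new MappingCharFilter(ZWNJ_TO_SPACE, reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    return new TokenStreamComponents(source);
  }
}
{code}

A full PersianAnalyzer replacement would add the usual normalization, lowercasing, and stopword filters after the tokenizer; the sketch only shows the ZWNJ-to-space mapping itself.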
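
The Solr-style workaround mentioned in the comment (whitespace tokenization followed by WordDelimiterFilter) could look roughly like the sketch below, again hedged and written against a recent API: WordDelimiterGraphFilter stands in here for the WordDelimiterFilter of the 3.x era, the analyzer name and flag choices are arbitrary examples, and, as the comment notes, none of this helps for CJK text, which has no whitespace to split on.

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;

// Hypothetical sketch of the whitespace + word-delimiter chain: split only on
// whitespace, then let the word-delimiter filter subdivide the resulting tokens.
public class WhitespaceWordDelimiterAnalyzer extends Analyzer {

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
              | WordDelimiterGraphFilter.GENERATE_NUMBER_PARTS
              | WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE;
    // null = no protected words that should be left unsplit
    TokenStream result = new WordDelimiterGraphFilter(source, flags, null);
    return new TokenStreamComponents(source, result);
  }
}
{code}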