[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930681#action_12930681 ]

Robert Muir commented on LUCENE-2747:
-------------------------------------

bq. That is, SimpleAnalyzer is not appropriate for many languages. If it were 
based upon a variation of UAX29Tokenizer, but didn't handle NUM or ALPHANUM, 
but WORD instead, it would be the same type of token stream, just alpha words.

OK, now I understand you, and yes, I agree... My question is, should we even 
bother fixing it? Would anyone who actually cares about Unicode really want 
only some hacked subset of UAX#29?
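
To make that "hacked subset" concrete, a rough sketch (hypothetical class, written against the 3.x-style Analyzer API; 4.0 does this differently) would be to run StandardTokenizer and throw away its <NUM> tokens:

{code:java}
// Hypothetical sketch, not a proposal: approximate "UAX#29 words, letters only"
// by running StandardTokenizer and dropping tokens typed <NUM>.
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

public final class AlphaOnlyUAX29Analyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream uax29 = new StandardTokenizer(Version.LUCENE_31, reader);
    return new TokenFilter(uax29) {
      private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
      @Override
      public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
          // keep everything that is not a pure number
          if (!"<NUM>".equals(typeAtt.type())) {
            return true;
          }
        }
        return false;
      }
    };
  }
}
{code}

That gives you "UAX#29, but only wordish tokens", which is exactly the kind of subset I'm skeptical anyone who cares about Unicode actually wants.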

These simple ones like SimpleAnalyzer, WhitespaceAnalyzer, and StopAnalyzer are 
all really bad for Unicode text in different ways, though Simple/Stop are the 
bigger offenders, I think, because they will separate a base character from its 
combining characters (in my opinion, this should always be avoided) and, worse, 
they will break tokens on those combining characters.
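
To see the combining-character problem, take NFD text, where accents are stored as separate combining marks (hypothetical snippet, 3.x-style API):

{code:java}
// Hypothetical illustration, 3.x-style API. In NFD form, "école" is
// 'e' + U+0301 (combining acute accent) + "cole". Character.isLetter(U+0301)
// is false, so the letter-based tokenizer drops the accent and splits the word.
import java.io.StringReader;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class CombiningMarkDemo {
  public static void main(String[] args) throws Exception {
    String nfd = "e\u0301cole"; // "école" with the accent as a combining mark
    TokenStream ts = new SimpleAnalyzer(Version.LUCENE_31)
        .tokenStream("field", new StringReader(nfd));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term); // prints "e", then "cole"
    }
    ts.end();
    ts.close();
  }
}
{code}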

But people using them are probably happy? E.g. you can do what Solr does: use 
WhitespaceAnalyzer and follow through with something like WordDelimiterFilter, 
and it's mostly OK, depending upon the options, except for cases like CJK where 
it's a death trap.
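
For the CJK case, a hypothetical snippet (3.x-style API): there is no whitespace in the text, so the whole sentence comes back as one token, and WordDelimiterFilter has nothing useful to split on:

{code:java}
// Hypothetical illustration, 3.x-style API. Chinese is written without spaces,
// so whitespace tokenization returns the whole sentence as a single token.
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class WhitespaceCjkDemo {
  public static void main(String[] args) throws Exception {
    String text = "我购买了道具和服装"; // "I bought props and clothing"
    TokenStream ts = new WhitespaceAnalyzer(Version.LUCENE_31)
        .tokenStream("field", new StringReader(text));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term); // one token: the entire sentence
    }
    ts.end();
    ts.close();
  }
}
{code}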

Personally I just don't use these things since I know the problems, but we 
could document "this is simplistic and won't work well for many languages" and 
keep them around for people who don't care?

And yeah, I suppose it's confusing that these really "simple" ones are in the 
.core package, but to me the package is meaningless; I was just trying to keep 
the analyzers arranged in some kind of order (e.g. pattern-based analysis in 
the .pattern package, etc.).

We could just as well call the package .basic or .simple or something else; 
it's just a name.


> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-2747
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2747
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2747.patch, LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to 
> provide language-neutral tokenization.  Lucene contains several 
> language-specific tokenizers that should be replaced by UAX#29-based 
> StandardTokenizer (deprecated in 3.1 and removed in 4.0).  The 
> language-specific *analyzers*, by contrast, should remain, because they 
> contain language-specific post-tokenization filters.  The language-specific 
> analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond 
> just replacing the tokenizer in the language-specific analyzer.  
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and 
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character 
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ 
> is not a word boundary.  Robert Muir has suggested using a char filter 
> converting ZWNJ to spaces prior to StandardTokenizer in the converted 
> PersianAnalyzer.
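
A minimal sketch of that char filter suggestion (hypothetical class; MappingCharFilter and NormalizeCharMap are used as they appear in the 3.x core API, which changed in later versions): map U+200C to a space before StandardTokenizer sees the text.

{code:java}
// Hypothetical sketch of the char filter idea from the issue description;
// MappingCharFilter and NormalizeCharMap follow the 3.x core API.
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public final class ZwnjToSpaceAnalyzer extends Analyzer {
  private static final NormalizeCharMap ZWNJ_MAP = new NormalizeCharMap();
  static {
    // Map ZERO WIDTH NON-JOINER (U+200C) to a space so UAX#29 sees a word break.
    ZWNJ_MAP.add("\u200C", " ");
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new StandardTokenizer(Version.LUCENE_31,
        new MappingCharFilter(ZWNJ_MAP, reader));
  }
}
{code}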
