[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930580#action_12930580 ]

Robert Muir commented on LUCENE-2747:
-------------------------------------

bq. I meant o.a.l.analysis.core. I'd expect the premier analyzers to be in core.

I guess the package doesn't make a big difference to me; all the analyzers are 
in one place and the same. It's true we mixed the "core" analysis stuff with 
contrib and Solr, but if there are warts with the contrib stuff, we should be 
able to clean it up (I think this has been happening).

bq. I guess I meant: Shouldn't the SimpleAnalyzer just be constructed the same 
as StandardAnalyzer, with the addition of a Filter that pitches tokens that are 
not needed?

I don't think so; this seems like a trap. Lots of people use StandardTokenizer 
without any filter, and I don't think it should emit trash (punctuation). If 
you want to do the other stuff, you can use ClassicTokenizer, or even 
WhitespaceTokenizer, and filter to your heart's content.
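
To make the difference concrete, here is a rough sketch (against the 3.1 analysis API; the sample text and class below are mine, but the tokenizers are the real ones): StandardTokenizer's UAX#29 rules never emit punctuation-only tokens, while WhitespaceTokenizer passes everything through for downstream filters to deal with.

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenizerComparison {

  // Print each token the stream emits, in brackets.
  static void dump(TokenStream ts) throws Exception {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.print("[" + term.toString() + "] ");
    }
    ts.end();
    ts.close();
    System.out.println();
  }

  public static void main(String[] args) throws Exception {
    String text = "Hello, world! (test)";
    // UAX#29 word boundaries: prints [Hello] [world] [test], no punctuation.
    dump(new StandardTokenizer(Version.LUCENE_31, new StringReader(text)));
    // Whitespace splitting: prints [Hello,] [world!] [(test)] verbatim,
    // leaving any cleanup to whatever TokenFilters you chain on afterwards.
    dump(new WhitespaceTokenizer(Version.LUCENE_31, new StringReader(text)));
  }
}
{code}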


> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-2747
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2747
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2747.patch, LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to 
> provide language-neutral tokenization.  Lucene contains several 
> language-specific tokenizers that should be deprecated in 3.1, removed in 
> 4.0, and replaced by the UAX#29-based StandardTokenizer.  The 
> language-specific *analyzers*, by contrast, should remain, because they 
> contain language-specific post-tokenization filters.  The language-specific 
> analyzers should switch to StandardTokenizer in 3.1.
>
> Some usages of language-specific tokenizers will need additional work beyond 
> just replacing the tokenizer in the language-specific analyzer.  
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and 
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character 
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ 
> is not a word boundary.  Robert Muir has suggested using a char filter 
> converting ZWNJ to spaces prior to StandardTokenizer in the converted 
> PersianAnalyzer (sketched below).
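
For illustration, that char filter idea might look something like the following (a sketch only, against the 3.1 analysis API; the persianTokenStream wiring is my assumption, not the actual patch, though MappingCharFilter and NormalizeCharMap are real classes):

{code:java}
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ZwnjCharFilterSketch {

  /** Hypothetical chain: map ZWNJ (U+200C) to a space, then run
   *  StandardTokenizer, since UAX#29 does not treat ZWNJ as a word boundary. */
  public static TokenStream persianTokenStream(Reader reader) {
    NormalizeCharMap map = new NormalizeCharMap();
    map.add("\u200C", " "); // ZWNJ -> space, so UAX#29 sees a boundary
    return new StandardTokenizer(Version.LUCENE_31,
        new MappingCharFilter(map, CharReader.get(reader)));
  }

  public static void main(String[] args) throws Exception {
    // "می\u200Cخواهم" contains a ZWNJ between its two parts; with the char
    // filter in place, StandardTokenizer emits two tokens instead of one.
    TokenStream ts = persianTokenStream(new StringReader("می\u200Cخواهم"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}
{code}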

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

