[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930578#action_12930578 ]
DM Smith commented on LUCENE-2747:
----------------------------------

{quote}
bq. Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?)

In trunk (the 4.x codeline) there is no core, contrib, or solr for analyzer components any more; they are all combined into modules/analysis. In branch_3x (the 3.x codeline) we did not make this rather disruptive refactor: there UAX29Tokenizer is in fact in Lucene core.
{quote}

I meant o.a.l.analysis.core. I'd expect the *premier* analyzers to be in core.

{quote}
bq. Is there a point to having SimpleAnalyzer?

I guess so; a lot of people can use this if they have English-only content and are probably happy to discard numbers etc. It's not a big loss to me if it goes, though.
{quote}

I guess I meant: shouldn't SimpleAnalyzer just be constructed the same as StandardAnalyzer, with the addition of a filter that pitches the token types that are not needed? With the suggestion in LUCENE-2167 to use UAX29Tokenizer for StandardAnalyzer, effectively deprecating EMAIL and URL and possibly adding some kind of PUNCTUATION type (so that URLs/emails/acronyms/... can be reconstructed, if someone desires), StandardAnalyzer is about as simple as one can get while properly handling non-English/non-Western languages. It just creates the ALPHANUM, NUM and PUNCTUATION (if added) token types that SimpleAnalyzer does not care about.

> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-2747
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2747
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2747.patch, LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to
> provide language-neutral tokenization.
> Lucene contains several language-specific tokenizers that should be replaced
> by the UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0).
> The language-specific *analyzers*, by contrast, should remain, because they
> contain language-specific post-tokenization filters. The language-specific
> analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond
> just replacing the tokenizer in the language-specific analyzer.
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ
> is not a word boundary. Robert Muir has suggested using a char filter
> converting ZWNJ to spaces prior to StandardTokenizer in the converted
> PersianAnalyzer.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
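The ZWNJ-to-space mapping suggested for the converted PersianAnalyzer can be illustrated with a minimal plain-Java sketch. This is only an illustration of the idea: in Lucene itself the mapping would live in a char filter placed in front of StandardTokenizer, not in a String helper, and the Persian sample word below is an assumption chosen to show a ZWNJ-joined form.

```java
// Sketch: map ZWNJ (U+200C) to a space before tokenization, so a UAX#29
// tokenizer (which does NOT treat ZWNJ as a word boundary) still splits
// where ArabicLetterTokenizer used to break tokens.
public class ZwnjMappingDemo {
    static final char ZWNJ = '\u200C'; // zero-width non-joiner

    // Replace every ZWNJ with a plain space; a real Lucene char filter
    // would do this on the Reader feeding StandardTokenizer.
    public static String mapZwnjToSpace(String input) {
        return input.replace(ZWNJ, ' ');
    }

    public static void main(String[] args) {
        // Illustrative Persian form: prefix + ZWNJ + stem.
        String text = "\u0645\u06CC" + ZWNJ + "\u062E\u0648\u0627\u0647\u0645";
        String mapped = mapZwnjToSpace(text);
        // After mapping, whitespace-based splitting yields two tokens.
        System.out.println(mapped.split("\\s+").length); // prints 2
    }
}
```

The same effect could be obtained inside an analyzer chain by wrapping the input Reader, which is what the char-filter suggestion amounts to.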