[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930137#action_12930137 ]

Robert Muir commented on LUCENE-2747:
-------------------------------------

DM, thanks, I see exactly where you are coming from.

I see your point: previously it was much easier to take something like 
SimpleAnalyzer and 'adapt' it to a given language based on things like Unicode 
properties.
In fact that's exactly what we did in the cases here (Arabic, Persian, Hindi, 
etc.).

But now we can actually tokenize "correctly" for more languages with JFlex, 
thanks to its improved Unicode support, and it's superior to these previous 
hacks :)

To try to answer some of your questions (all my opinion):

bq. Is there a point to having SimpleAnalyzer

I guess so; a lot of people can use this if they have English-only content and 
are probably happy with discarding numbers etc. It's not a big loss to me if it 
goes, though.

bq. Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?)

In trunk (the 4.x codeline) there is no core, contrib, or Solr for analyzer 
components any more; they are all combined into modules/analysis.
In branch_3x (the 3.x codeline) we did not make this rather disruptive 
refactor: there UAX29Tokenizer is in fact in Lucene core.

bq. Would there be a way to plugin ICUTokenizer as a replacement for 
UAX29Tokenizer into StandardTokenizer, such that all Analyzers using 
StandardTokenizer would get the alternate implementation?

Personally, I would prefer if we move towards a factory model where things like 
these supplied "language analyzers" are actually xml/json/properties snippets.
In other words, they are just example configurations that build your analyzer, 
like Solr does.
This is nice, because then you don't have to write code to easily customize how 
your analyzer works.

I think we have been making slow steps towards this, just doing basic things 
like moving stopword lists to .txt files.
But I think the next step would be LUCENE-2510, where we have factories/config 
attribute parsers for all these analysis components already written.

Then we could have support for declarative analyzer specification via 
xml/json/.properties/whatever, and move all these analyzers to that.
I still think you should be able to code up your own analyzer, but in my 
opinion the declarative route is much easier and preferred for the ones we 
supply.
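As a rough sketch of what such a declarative specification could look like (the element and factory names here are hypothetical, loosely modeled on Solr's schema.xml analyzer syntax, not an actual Lucene API):

```xml
<!-- Hypothetical declarative analyzer spec; names are illustrative only.
     A factory layer (as proposed in LUCENE-2510) would read a snippet
     like this and wire up the tokenizer and filter chain. -->
<analyzer>
  <tokenizer class="StandardTokenizerFactory"/>
  <filter class="LowerCaseFilterFactory"/>
  <filter class="StopFilterFactory" words="stopwords.txt"/>
</analyzer>
```

A user who wanted the exact old behavior would then keep their old snippet (and ideally the old analysis jar) rather than depending on a compiled-in Analyzer class.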

Also I think this would solve a lot of analyzer backwards-compatibility 
problems, because then our supplied analyzers are really just example 
configuration files, and we can change our examples however we want... someone 
can use their old config file (and hopefully their old analysis module jar 
file!) to guarantee the exact same behavior if they want.

Finally, most of the benefits of ICUTokenizer are actually in the UAX#29 
support... the tokenizers are pretty close, with some minor differences:
* the JFlex-based implementation is faster, and better in my opinion.
* the ICU-based implementation allows tailoring, and supplies tailored 
tokenization for several complex scripts (JFlex doesn't have this... yet).
* the ICU-based implementation works with all of Unicode; at the moment JFlex 
is limited to the Basic Multilingual Plane.

In my opinion the last two points will probably be resolved eventually... I 
could see our ICUTokenizer possibly becoming obsolete down the road thanks to 
better JFlex support, though that would probably have to have hooks into ICU 
for the complex script support (so we get it for free from ICU).


> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-2747
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2747
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2747.patch, LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to 
> provide language-neutral tokenization.  Lucene contains several 
> language-specific tokenizers that should be replaced by UAX#29-based 
> StandardTokenizer (deprecated in 3.1 and removed in 4.0).  The 
> language-specific *analyzers*, by contrast, should remain, because they 
> contain language-specific post-tokenization filters.  The language-specific 
> analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond 
> just replacing the tokenizer in the language-specific analyzer.  
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and 
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character 
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ 
> is not a word boundary.  Robert Muir has suggested using a char filter 
> converting ZWNJ to spaces prior to StandardTokenizer in the converted 
> PersianAnalyzer.
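The ZWNJ char-filter idea described in the issue can be illustrated outside Lucene with a few lines (a hedged sketch in Python, not the actual Lucene char filter or tokenizer code):

```python
# Sketch of the char-filter approach: map the zero-width non-joiner
# (U+200C) to a plain space *before* tokenization, so a tokenizer that
# follows UAX#29 rules (where ZWNJ is not a word boundary) still splits
# at the positions the old language-specific tokenizer did.
ZWNJ = "\u200c"

def zwnj_charfilter(text: str) -> str:
    """Replace every ZWNJ with a space."""
    return text.replace(ZWNJ, " ")

def simple_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer: split the filtered text on whitespace."""
    return zwnj_charfilter(text).split()

# A Persian form written with ZWNJ between prefix and stem now yields
# two tokens instead of one:
tokens = simple_tokenize("می" + ZWNJ + "روم")
```

In Lucene terms, the same effect would come from putting a character-mapping filter in front of StandardTokenizer in the converted PersianAnalyzer.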

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

