[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

Steven Rowe (JIRA) Mon, 10 May 2010 06:26:14 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865757#action_12865757 ]


Steven Rowe commented on LUCENE-2167:
-------------------------------------

bq. should we look at any tailorings to this? The first thing that comes to 
mind is full-width forms, which have no WordBreak property

Looks like Latin full-width letters are included (from 
http://www.unicode.org/Public/5.2.0/ucd/auxiliary/WordBreakProperty.txt):

FF21..FF3A    ; ALetter # L&  [26] FULLWIDTH LATIN CAPITAL LETTER A..FULLWIDTH LATIN CAPITAL LETTER Z
FF41..FF5A    ; ALetter # L&  [26] FULLWIDTH LATIN SMALL LETTER A..FULLWIDTH LATIN SMALL LETTER Z

But as you mention in a code comment in TestICUTokenizer, no full-width 
characters have WordBreak:Numeric, so I think we could just add the full-width 
digits to the {NumericEx} macro.
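
To make the idea concrete, here is a minimal sketch in plain Java (not JFlex, and not code from the patch). The letter ranges are copied from the WordBreakProperty.txt excerpt above; the digit range U+FF10..U+FF19 is FULLWIDTH DIGIT ZERO..NINE. The class and method names are hypothetical, and ASCII digits are only a stand-in for the real WordBreak:Numeric set that {NumericEx} is built on.

```java
// Illustrative only: names are hypothetical, not part of Lucene or the patch.
final class FullWidthTailoringSketch {
    private FullWidthTailoringSketch() {}

    // Ranges listed as ALetter in WordBreakProperty.txt (Unicode 5.2.0).
    static boolean isFullWidthLatinLetter(int cp) {
        return (cp >= 0xFF21 && cp <= 0xFF3A)   // FULLWIDTH LATIN CAPITAL LETTER A..Z
            || (cp >= 0xFF41 && cp <= 0xFF5A);  // FULLWIDTH LATIN SMALL LETTER A..Z
    }

    // FULLWIDTH DIGIT ZERO..NINE: the characters the tailoring would add.
    static boolean isFullWidthDigit(int cp) {
        return cp >= 0xFF10 && cp <= 0xFF19;
    }

    // Assumption: ASCII digits stand in for the untailored Numeric class.
    static boolean isTailoredNumeric(int cp) {
        return (cp >= '0' && cp <= '9') || isFullWidthDigit(cp);
    }

    public static void main(String[] args) {
        // Full-width "A", "b", "1", "9", written as escapes to avoid encoding issues.
        String sample = "\uFF21\uFF42\uFF11\uFF19";
        sample.codePoints().forEach(cp ->
            System.out.printf("U+%04X letter=%b tailoredNumeric=%b%n",
                cp, isFullWidthLatinLetter(cp), isTailoredNumeric(cp)));
    }
}
```

In the grammar itself this would presumably be a one-line change to the macro definition rather than a Java predicate; the sketch only shows which code points would change class.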

Was there anything else you were thinking of?

bq. is it simple, or would it be messy, to apply this to the existing grammar 
(English/EuroTokenizer)? Another way to say it, is it possible for 
English/EuroTokenizer (StandardTokenizer today) to instead be a tailoring to 
UAX#29, for companies, acronyms, etc., such that if it encounters say some 
Hindi or Thai text it will behave better?

Not sure about difficulty level, but it should be possible.

Naming will require some thought, though - I don't like EnglishTokenizer or 
EuropeanTokenizer - both seem to exclude valid constituencies.

What do you think about adding tailorings for Thai, Lao, Myanmar, Chinese, and 
Japanese?  (Are there others like these that aren't well served by UAX#29 
without customizations?)

I'm thinking of leaving UAX29Tokenizer as-is, and adding tailorings as separate 
classes - what do you think?

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the European 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


