[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

DM Smith (JIRA) Mon, 17 May 2010 07:12:09 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868230#action_12868230 ]


DM Smith commented on LUCENE-2167:
----------------------------------

{quote}
bq. Naming will require some thought, though - I don't like EnglishTokenizer or 
EuropeanTokenizer - both seem to exclude valid constituencies.
What valid constituencies do you refer to?
{quote}
{quote}
Well, we can't call it English/EuropeanTokenizer (maybe 
EnglishAndEuropeanAnalyzer? seems too long), and calling it either only English 
or only European seems to leave the other out. Americans, e.g., don't consider 
themselves European, maybe not even linguistically (however incorrect that 
might be).
{quote}

Tongue in cheek:
By and large, these are Romance languages (i.e., Latin derivatives), and the 
constructs being considered for special processing are, for the most part, 
fairly recent additions to those languages. So how about 
*ModernRomanceAnalyzer*?

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167.benchmark.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric handling (acronyms, company names, apostrophes) 
> can stay with that EuropeanTokenizer, which the European analyzers could 
> then use.

