[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

Robert Muir (JIRA) Mon, 10 May 2010 06:56:13 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865763#action_12865763 ]


Robert Muir commented on LUCENE-2167:
-------------------------------------

{quote}
But as you mention in a code comment in TestICUTokenizer, there are no 
full-width WordBreak:Numeric characters, so we could just add these to the 
{NumericEx} macro, I think.

Was there anything else you were thinking of?
{quote}

No, that's it!

bq. Naming will require some thought, though - I don't like EnglishTokenizer or 
EuropeanTokenizer - both seem to exclude valid constituencies.

What valid constituencies do you refer to? In general, the acronym, company,
and possessive handling here is very English/Euro-specific.
Bugs get opened in JIRA if it doesn't handle these correctly for English, but
it doesn't work at all for a lot of languages.
Personally, I think it's great to rip this stuff out of what should be a
"default" language-independent tokenizer based on standards
(StandardTokenizer), and put it into the language-specific package where it
belongs. Otherwise we have to worry about these sorts of things overriding and
screwing up the UAX#29 rules for words in real languages.

bq. What do you think about adding tailorings for Thai, Lao, Myanmar, Chinese, 
and Japanese? (Are there others like these that aren't well served by UAX#29 
without customizations?)

It gets a little tricky: we should be careful about how we interpret what is
"reasonable" for a language-independent default tokenizer.
I think it's "enough" to output the best indexing unit that is possible and
relatively unambiguous to identify. I think this is a shortcut we can take,
because we are trying to tokenize things for information retrieval, not for
other purposes. The approach for Lao, Myanmar, Khmer, CJK, etc. in
ICUTokenizer is to just output syllables as the indexing unit, since words are
ambiguous. Thai is based on words, not syllables, in ICUTokenizer, which is
inconsistent with this, but we get this for free, so it's just a laziness
thing.

By the way: none of those syllable-grammars in ICUTokenizer used chained rules, 
so you are welcome to steal what you want!

bq. I'm thinking of leaving UAX29Tokenizer as-is, and adding tailorings as 
separate classes - what do you think?

Well, either way, I again strongly feel this logic should be tied into the
"Standard" tokenizer, so that it has better Unicode behavior. I think it makes
sense for us to have a reasonable, language-independent, standards-based
tokenizer that works well for most languages.
I think it also makes sense to have the English/Euro-centric stuff that's
language-specific sitting in the analysis.en package, just like we do with
other languages.
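
To illustrate what a language-independent, boundary-rule-driven tokenizer looks
like in its simplest form, here is a minimal sketch. It uses the JDK's
java.text.BreakIterator, which performs its own word-boundary analysis and is
not guaranteed to match UAX#29 exactly. It is not the StandardTokenizer,
UAX29Tokenizer, or ICUTokenizer code discussed in this issue. The class name
WordBreakSketch and the "keep segments containing a letter or digit" filter
are assumptions made for illustration only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene or the attached patches.
public class WordBreakSketch {

    // Splits text at the JDK's word boundaries and keeps only segments that
    // contain at least one letter or digit (assumed filter, for illustration).
    static List<String> tokenize(String text) {
        // Locale.ROOT: no language-specific tailoring is requested.
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Drop segments that are only whitespace or punctuation.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world 42"));
    }
}
```

Language-specific behavior such as acronym, company, or possessive handling
would then sit in a separate, language-specific component rather than in this
boundary step.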


> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.
