[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

Robert Muir (JIRA) Sun, 07 Nov 2010 11:19:28 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929385#action_12929385 ]


Robert Muir commented on LUCENE-2167:
-------------------------------------

{quote}
You've convinced me, though I don't think this idea has been around long enough 
to qualify as intuitive.
{quote}

Well, obviously I don't have hard references for this, but from my 
interaction with my own users, most of them don't even think of double quotes 
as doing phrases, nor are they technical enough to know what a phrase is or 
what that means for a search... they just think of it as more exact.

{quote}
I think if we remove EMAIL/HOSTNAME recognition, we need to have an alternative 
that provides the same thing. So we would have UAX#29 tokenizer as default; a 
UAX29+EMAIL+HOSTNAME tokenizer as the equivalent to the pre-3.1 
StandardTokenizer; and a UAX29+URL+EMAIL tokenizer (current StandardTokenizer). 
Or maybe the last two could be combined: a UAX29+URL+EMAIL tokenizer that 
provides a configurable feature to not output URLs, but instead HOSTNAMEs and 
URL component tokens?
{quote}

Well, like I said, I'm not particularly picky, especially since someone can 
always use ClassicTokenizer to get the old behavior, which no one could ever 
agree on: there were constantly issues along the lines of "it doesn't 
recognize my company's name", etc.

To some extent, I like UAX#29 because someone else is making and 
standardizing the decisions, validating that it's not going to annoy users of 
major languages, and making sure it works well by default. It's not going to 
be the most full-featured tokenizer, but there's little chance it will be 
really annoying: I think this is great for "defaults".
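
To illustrate what "someone else's word-break rules as the default" looks 
like in practice, here is a minimal sketch. It uses the JDK's 
java.text.BreakIterator as a stand-in, not Lucene's StandardTokenizer; its 
word rules are similar in spirit to UAX#29 but are not guaranteed to be 
identical. The class and method names (WordBreakSketch, words) are made up 
for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakSketch {

    /**
     * Splits text at the JDK's default word boundaries and keeps only the
     * segments that contain at least one letter or digit, so whitespace and
     * punctuation-only segments are dropped.
     */
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // No language-specific or company-name rules are involved here.
        System.out.println(words("hello world"));
    }
}
```

The point is only that the caller makes no tokenization decisions of its 
own: the boundary rules come from a shared, externally maintained rule set.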

As for all the other "bonus" stuff, we can always make options, especially if 
it's some pluggable thing (sorry, I'm not sure how this could work in jflex) 
where you could have options as to what you want to do.
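
As a rough sketch of what such an option could look like if it were done as 
a post-processing step rather than inside the jflex grammar: the code below 
takes a token already recognized as a URL and either emits it whole or emits 
the hostname plus the path components. Everything here 
(UrlTokenOptionSketch, emit, the keepWholeUrl flag) is hypothetical and is 
not an existing Lucene API; it uses only java.net.URI from the JDK.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class UrlTokenOptionSketch {

    /**
     * Hypothetical option: when keepWholeUrl is true the URL is emitted as a
     * single token; otherwise the hostname and each non-empty path segment
     * are emitted as separate tokens.
     * URI.create throws IllegalArgumentException for a malformed URL.
     */
    static List<String> emit(String urlToken, boolean keepWholeUrl) {
        List<String> out = new ArrayList<>();
        if (keepWholeUrl) {
            out.add(urlToken);
            return out;
        }
        URI uri = URI.create(urlToken);
        if (uri.getHost() != null) {
            out.add(uri.getHost());
        }
        String path = uri.getPath();
        if (path != null) {
            for (String part : path.split("/")) {
                if (!part.isEmpty()) {
                    out.add(part);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/docs/index.html";
        System.out.println(emit(url, true));  // [http://www.example.com/docs/index.html]
        System.out.println(emit(url, false)); // [www.example.com, docs, index.html]
    }
}
```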

But again, I think UAX#29 itself is more than sufficient by default, and even 
hostname etc. is pretty dangerous *by default* (again, my example of searching 
partial hostnames being flexible to the end user and not baked in, by letting 
them use quotes).
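
A toy model of the partial-hostname point, under a loud assumption: it 
pretends the analysis chain indexes a hostname as its dot-separated parts, 
which is not a claim about what any particular Lucene tokenizer emits. The 
names (PartialHostnameSketch, parts, matchesTerm, matchesPhrase) are invented 
for this sketch, and phrase matching is modeled as simple adjacency in a 
list.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class PartialHostnameSketch {

    /** Assumed analysis: lowercase the hostname and split it on dots. */
    static List<String> parts(String hostname) {
        return Arrays.asList(hostname.toLowerCase(Locale.ROOT).split("\\."));
    }

    /** A plain term query: matches if the term occurs anywhere. */
    static boolean matchesTerm(List<String> indexed, String term) {
        return indexed.contains(term);
    }

    /** A quoted (phrase) query: matches only if the terms occur adjacently, in order. */
    static boolean matchesPhrase(List<String> indexed, List<String> phrase) {
        return Collections.indexOfSubList(indexed, phrase) >= 0;
    }

    public static void main(String[] args) {
        List<String> indexed = parts("www.Example.com");
        System.out.println(matchesTerm(indexed, "example"));                        // true
        System.out.println(matchesPhrase(indexed, Arrays.asList("example", "com"))); // true
        System.out.println(matchesPhrase(indexed, Arrays.asList("www", "com")));     // false
    }
}
```

With the hostname kept as one baked-in token, the partial match on 
"example" would not be possible in this model; with the parts indexed 
separately, the user chooses loose or "more exact" matching by adding quotes.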


> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1, 4.0
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the European 
> analyzers.
