[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929385#action_12929385 ]
Robert Muir commented on LUCENE-2167:
-------------------------------------

{quote}
You've convinced me, though I don't think this idea has been around long enough to qualify as intuitive.
{quote}

Well, obviously I don't have hard references for this stuff, but from my interaction with my own users, most of them don't even think of double quotes as doing phrases, nor are they technical enough to know what a phrase is or what that means for a search... they just think of it as more exact.

{quote}
I think if we remove EMAIL/HOSTNAME recognition, we need to have an alternative that provides the same thing. So we would have UAX#29 tokenizer as default; a UAX29+EMAIL+HOSTNAME tokenizer as the equivalent to the pre-3.1 StandardTokenizer; and a UAX29+URL+EMAIL tokenizer (current StandardTokenizer). Or maybe the last two could be combined: a UAX29+URL+EMAIL tokenizer that provides a configurable feature to not output URLs, but instead HOSTNAMEs and URL component tokens?
{quote}

Well, like I said, I'm not particularly picky, especially since someone can always use ClassicTokenizer to get the old behavior, which no one could ever agree on, and there were constantly issues about it not recognizing my company's name, etc.

To some extent, I like UAX#29 because there's someone else making and standardizing the decisions, validating that it's not gonna annoy users of major languages, and making sure it works well by default. It's not gonna be the most full-featured tokenizer, but there's little chance it will be really annoying: I think this is great for "defaults".

As for all the other "bonus" stuff, we can always make options, especially if it's some pluggable thing somehow (sorry, not sure how this could work in JFlex) where you could have options as to what you want to do.

But again, I think UAX#29 itself is more than sufficient by default, and even hostname etc. is pretty dangerous *by default* (again, my example of partial-hostname searching being flexible for the end user and not baked in, by letting them use quotes).

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/analyzers
> Affects Versions: 3.1, 4.0
> Reporter: Shyamal Prasad
> Assignee: Steven Rowe
> Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch,
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch,
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch,
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch,
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch,
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
> Original Estimate: 0.5h
> Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the
> standard as much as we can with jflex. Then its name would actually make
> sense.
> Such a transition would involve renaming the old StandardTokenizer to
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff
> can stay with that EuropeanTokenizer, and it could be used by the european
> analyzers.
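
As a rough illustration of the "options" idea discussed in the comment (a plain word-level default, with e-mail recognition as an opt-in extra), here is a minimal Java sketch. It is not Lucene code and does not implement UAX#29: the class name, the `keepEmails` flag, and both regular expressions are assumptions made for illustration, and the letter/digit word pattern is only a crude stand-in for real word-break rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch only; not Lucene code and not a UAX#29 implementation. */
final class ConfigurableTokenizerSketch {

    // Deliberately simplified e-mail pattern (assumption: for illustration only).
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    // Crude stand-in for word segmentation: maximal runs of letters and digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /**
     * Splits text into tokens. When keepEmails is true, anything matching
     * EMAIL is emitted as one token; everything else is split into words.
     */
    static List<String> tokenize(String text, boolean keepEmails) {
        List<String> tokens = new ArrayList<>();
        if (!keepEmails) {
            addWords(text, tokens);
            return tokens;
        }
        Matcher m = EMAIL.matcher(text);
        int last = 0;
        while (m.find()) {
            // Word-split the text before the address, then keep the address whole.
            addWords(text.substring(last, m.start()), tokens);
            tokens.add(m.group());
            last = m.end();
        }
        addWords(text.substring(last), tokens);
        return tokens;
    }

    // Appends every letter/digit run found in text to out.
    private static void addWords(String text, List<String> out) {
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
    }

    public static void main(String[] args) {
        String text = "mail user@example.com today";
        System.out.println(tokenize(text, false)); // plain word splitting
        System.out.println(tokenize(text, true));  // e-mail kept as one token
    }
}
```

With the flag off, `mail user@example.com today` comes out as five word tokens; with it on, the address survives as a single token. In Lucene itself the old behavior stays available through ClassicTokenizer, as noted in the comment.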