Marvin Humphrey wrote on 11/16/11 11:09 PM:
> On Wed, Nov 16, 2011 at 11:24:22PM +0100, Nick Wellnhofer wrote:

[snip]

>>
>> The default analyzer chain would be tokenize, normalize, stem.
>
> The gist of your proposal seems sound. It's great to see that you are
> thinking about all these things, and to see them all laid out here.
>
> I don't see much to disagree with in your API choices, aside from the
> questions of what the default analyzer order should be and whether
> case_fold should be a boolean. Neither of those quibbles block the
> proposal.
>

+1 to that. I've enjoyed following this thread, having wrestled with
utf-8 analysis a lot in libswish3[0].

I think robust utf-8 string handling in core is a win, especially if it
includes a relatively lightweight way of dealing with the Unicode tables
in a portable way.

+1 to utf8proc

Thanks for initiating this thread, Nick.

[0] http://s.apache.org/722

--
Peter Karman . http://peknet.com/ . [email protected]
