Re: [lucy-dev] Unicode integration

Peter Karman Thu, 17 Nov 2011 18:06:55 -0800

Marvin Humphrey wrote on 11/16/11 11:09 PM:
> On Wed, Nov 16, 2011 at 11:24:22PM +0100, Nick Wellnhofer wrote:


[snip]

>>
>> The default analyzer chain would be tokenize, normalize, stem.
> 
> The gist of your proposal seems sound.  It's great to see that you are
> thinking about all these things, and to see them all laid out here.
> 
> I don't see much to disagree with in your API choices, aside from the
> questions of what the default analyzer order should be and whether
> case_fold should be a
> boolean.  Neither of those quibbles block the proposal.
> 

+1 to that.

I've enjoyed following this thread, having wrestled with utf-8 analysis a lot in
libswish3[0]. I think robust utf-8 string handling in core is a win, especially
if it includes a relatively lightweight way of dealing with the Unicode tables
in a portable way.

+1 to utf8proc
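
For anyone skimming the thread, here is a rough sketch of the tokenize, normalize, stem order quoted above. It is only an illustration under stated assumptions: it uses Python's standard unicodedata module as a stand-in for utf8proc, the function names (tokenize, normalize, stem, analyze) are hypothetical and not Lucy's API, and the stemmer is a toy suffix stripper rather than a real Snowball stemmer.

```python
import re
import unicodedata


def tokenize(text):
    # Split into runs of word characters; \w is Unicode-aware for str in Python 3.
    return re.findall(r"\w+", text)


def normalize(token):
    # Apply NFKC normalization, then Unicode case folding.
    # (Assumption: the thread does not fix a normalization form here.)
    return unicodedata.normalize("NFKC", token).casefold()


def stem(token):
    # Toy suffix stripper, standing in for a real stemmer.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    # The chain in the proposed default order: tokenize, normalize, stem.
    return [stem(normalize(tok)) for tok in tokenize(text)]
```

Running the stemmer after normalization means it only ever sees case-folded, composed text, which is one argument for that order; whether tokenizing should come before or after normalizing is the open question mentioned in the quoted text.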

Thanks for initiating this thread, Nick.

[0] http://s.apache.org/722

-- 
Peter Karman  .  http://peknet.com/  .  [email protected]
