A quick grep of my code base turns up these:

'[^ยก]+'                    -- crazy Unicode char, picked to be unique
'[^\x{1}]+'                 -- another crazy unique char
'\S+'                       -- we use this a lot so strings with hyphens
                               in them stay single tokens
'\w+(?:[\'\x{2019}]\w+)*'   -- the default
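These all just get handed to RegexTokenizer's pattern parameter; a minimal
sketch (assuming the stock Lucy Perl API):

    use Lucy::Analysis::RegexTokenizer;
    use Lucy::Plan::FullTextType;

    # Build a tokenizer from one of the patterns above (here, the default).
    my $tokenizer = Lucy::Analysis::RegexTokenizer->new(
        pattern => '\w+(?:[\'\x{2019}]\w+)*',
    );

    # ... and hang it off a field type as usual.
    my $type = Lucy::Plan::FullTextType->new( analyzer => $tokenizer );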

-Dan

On Nov 22, 2011, at 2:10 PM, Nick Wellnhofer wrote:

> Currently, Lucy only provides the RegexTokenizer, which is implemented on top
> of the Perl regex engine. With the help of utf8proc we could implement a
> simple but more efficient tokenizer in core, without external dependencies.
> Most importantly, we'd have to implement something similar to the \w regex
> character class. The Unicode standard [1,2] recommends that \w be equivalent
> to [\pL\pM\p{Nd}\p{Nl}\p{Pc}\x{24b6}-\x{24e9}], that is, the Unicode
> categories Letter, Mark, Decimal_Number, Letter_Number, and
> Connector_Punctuation, plus the circled letters. That's exactly how Perl
> implements \w. Other implementations like .NET's seem to differ slightly [3].
> So we could look up Unicode categories with utf8proc, and a Perl-compatible
> check for a word character would then be as easy as
> (cat <= 10 || cat == 12 || (c >= 0x24b6 && c <= 0x24e9)).
> 
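That equivalence is easy to spot-check on whatever perl you have handy; a
brute-force sketch over the whole code space, which prints nothing if \w and
the recommended class agree (needs perl 5.14+ for the /u modifier):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Compare \w against the recommended compatibility class for every
    # code point; report any code point where the two disagree.
    my $class = qr/\A[\pL\pM\p{Nd}\p{Nl}\p{Pc}\x{24b6}-\x{24e9}]\z/u;
    for my $cp (0 .. 0x10FFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;   # surrogates aren't characters
        my $ch       = chr $cp;
        my $is_w     = $ch =~ /\A\w\z/u ? 1 : 0;
        my $in_class = $ch =~ $class    ? 1 : 0;
        printf "mismatch at U+%04X\n", $cp if $is_w != $in_class;
    }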
> The default regex in RegexTokenizer also handles apostrophes, which I
> personally don't find very useful. But this could be implemented in the core
> tokenizer as well.
> 
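For what it's worth, here's what the apostrophe handling buys you: the default
pattern keeps contractions whole (ASCII apostrophe or U+2019), while plain \w+
still splits on hyphens. A standalone sketch, applying the pattern by hand:

    #!/usr/bin/perl
    use strict;
    use warnings;
    binmode STDOUT, ':encoding(UTF-8)';

    # The default RegexTokenizer pattern, run directly over a sample string.
    my @tokens = "it\x{2019}s Dan's code-base" =~ /\w+(?:['\x{2019}]\w+)*/g;
    print "$_\n" for @tokens;   # it's / Dan's / code / base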
> I'm wondering what other kinds of regexes people are using with
> RegexTokenizer, and whether this simple core tokenizer should be customizable
> for some of these use cases.
> 
> Nick
> 
> [1] http://www.unicode.org/reports/tr18/#Compatibility_Properties
> [2] http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
> [3] http://msdn.microsoft.com/en-us/library/20bw873z.aspx#WordCharacter
