A quick grep of my code base turns up these:
'[^¡]+' --- a crazy Unicode char, picked to be unique
'[^\x{1}]+' --- another crazy unique char
'\S+' --- we use this a lot so we don't get bitten by strings with
hyphens in them.
'\w+(?:[\'\x{2019}]\w+)*' --- the default
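
For context, these all get handed to RegexTokenizer's constructor as the
pattern argument; the whitespace one looks roughly like this (a sketch from
memory, so double-check against the docs):

    use Lucy::Analysis::RegexTokenizer;

    # '\S+' as the pattern: split on whitespace only, so hyphenated
    # terms like "full-text" survive as single tokens.
    my $tokenizer = Lucy::Analysis::RegexTokenizer->new(
        pattern => '\S+',
    );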
-Dan
On Nov 22, 2011, at 2:10 PM, Nick Wellnhofer wrote:
> Currently, Lucy only provides the RegexTokenizer, which is implemented on top
> of the Perl regex engine. With the help of utf8proc, we could implement a
> simple but more efficient tokenizer in core, without external dependencies.
> Most importantly, we'd have to implement something similar to the \w regex
> character class. The Unicode standard [1,2] recommends that \w be equivalent
> to [\pL\pM\p{Nd}\p{Nl}\p{Pc}\x{24b6}-\x{24e9}], that is, the Unicode
> categories Letter, Mark, Decimal_Number, Letter_Number, and
> Connector_Punctuation plus the circled letters. That's exactly how Perl
> implements \w. Other implementations like .NET seem to differ slightly [3].
> So we could look up Unicode categories with utf8proc, and then a
> Perl-compatible check for a word character would be as easy as
> (cat <= 10 || cat == 12 || (c >= 0x24b6 && c <= 0x24e9)).
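
That equivalence is easy to sanity-check from Perl, by matching a handful of
code points against both \w and the recommended class (a quick sketch; the
sample characters are arbitrary):

    use strict;
    use warnings;
    use utf8;
    binmode STDOUT, ':encoding(UTF-8)';

    # The compatibility class recommended by UTS #18 for \w.
    my $class = qr/\A[\pL\pM\p{Nd}\p{Nl}\p{Pc}\x{24b6}-\x{24e9}]\z/;

    # 'a' (letter), '_' (Pc), '9' (Nd), U+24B6 (circled A), '-', ' '
    for my $c ('a', '_', '9', "\x{24b6}", '-', ' ') {
        printf "U+%04X  \\w=%d  class=%d\n", ord $c,
            ($c =~ /\A\w\z/) ? 1 : 0,
            ($c =~ $class)   ? 1 : 0;
    }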
>
> The default regex in RegexTokenizer also handles apostrophes, which I
> personally don't find very useful. But this could also be implemented in the
> core tokenizer.
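
To make the apostrophe handling concrete, here's what the default pattern does
compared to a bare \w+ (a quick sketch):

    use strict;
    use warnings;
    use utf8;
    binmode STDOUT, ':encoding(UTF-8)';

    my $default = qr/\w+(?:['\x{2019}]\w+)*/;
    my $text    = "don\x{2019}t split O'Reilly";

    # The default keeps contractions and possessives whole:
    print join('|', $text =~ /$default/g), "\n";  # don’t|split|O'Reilly
    # A bare \w+ splits at every apostrophe:
    print join('|', $text =~ /\w+/g), "\n";       # don|t|split|O|Reilly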
>
> I'm wondering what other kinds of regexes people are using with
> RegexTokenizer, and whether this simple core tokenizer should be customizable
> for some of those use cases.
>
> Nick
>
> [1] http://www.unicode.org/reports/tr18/#Compatibility_Properties
> [2] http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
> [3] http://msdn.microsoft.com/en-us/library/20bw873z.aspx#WordCharacter