Re: [pcre-dev] \b bug with extended Unicode characters?

Philip Hazel Sat, 28 Mar 2009 10:26:01 -0700

On Wed, 21 Jan 2009, Ralf Junker wrote:

> it appears the word boundary anchor fails to work when it bounds a
> word using extended Unicode characters (PCRE 7.8, UTF-8 enabled):
> 
>   ÅÅ°ÅÅ± -> Matches
>   \bÅÅ°ÅÅ±\b -> Fails
>   \bNAME\b -> Matches
> 
> Can anybody confirm this?


Did this ever get answered? The answer is that this is a limitation of 
PCRE. I have updated the documentation about \b to make it even 
clearer. It now says this:

  In UTF-8 mode, characters with values greater than 128 never match
  \d, \s, or \w, and always match \D, \S, and \W. This is true
  even when Unicode character property support is available. These
  sequences retain their original meanings from before UTF-8 support was
  available, mainly for efficiency reasons. Note that this also affects
  \b, because it is defined in terms of \w and \W.
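
The effect of the rule quoted above can be imitated outside PCRE. The
sketch below is only an illustration under stated assumptions: it uses
Python's re module, not PCRE, and relies on the re.ASCII flag to restrict
\w to ASCII letters, digits and underscore, which is roughly the behaviour
the documentation describes. The sample word and the helper name
"search" are hypothetical and chosen for this example only.

```python
import re

# Hypothetical sample word made only of non-ASCII letters, written with
# escapes so the example does not depend on the file encoding.
word = "\u0150\u0170\u0151\u0171"


def search(pattern, subject, flags=0):
    """Return True if pattern matches somewhere in subject."""
    return re.search(pattern, subject, flags) is not None


# Without \b, the literal characters simply match themselves.
plain = search(word, word, re.ASCII)

# With re.ASCII, every character of the word counts as \W. A \b needs a
# \w character on exactly one side, so no position in the subject is a
# word boundary and the pattern cannot match.
bounded_ascii = search(r"\b" + word + r"\b", word, re.ASCII)

# An all-ASCII word is unaffected: \b matches at both ends.
ascii_name = search(r"\bNAME\b", "NAME", re.ASCII)

# For contrast, Python's default (Unicode-aware) \w treats these letters
# as word characters, so the bounded pattern matches.
bounded_unicode = search(r"\b" + word + r"\b", word)

print(plain, bounded_ascii, ascii_name, bounded_unicode)
```

The first three results mirror the three cases in the original report:
the bare word matches, the same word wrapped in \b does not, and an ASCII
word wrapped in \b does.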

Philip

-- 
Philip Hazel

-- 
## List details at http://lists.exim.org/mailman/listinfo/pcre-dev
