Edit report at http://bugs.php.net/bug.php?id=52971&edit=1
 ID:                 52971
 User updated by:    marc dot bennewitz at giata dot de
 Reported by:        marc dot bennewitz at giata dot de
 Summary:            PCRE-Meta-Characters not working with utf-8
 Status:             Closed
 Type:               Bug
 Package:            PCRE related
 Operating System:   Linux
 PHP Version:        5.3.3
 Assigned To:        felipe
 Block user comment: N

 New Comment:

now it works fine :)

thanks

Previous Comments:
------------------------------------------------------------------------
[2010-10-03 18:02:19] fel...@php.net

This bug has been fixed in SVN.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.

Thank you for the report, and for helping us make PHP better.

The latest version of PCRE added a PCRE_UCP flag. As the documentation
states:

"In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
characters, even in UTF-8 mode. However, this can be changed by setting
the PCRE_UCP option."

With the flag set, we get:

array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "Wasser"
      [1]=>
      int(61)
    }
  }
}
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(7) " Wasser"
      [1]=>
      int(60)
    }
  }
}

------------------------------------------------------------------------
[2010-10-03 18:01:40] fel...@php.net

Automatic comment from SVN on behalf of felipe
Revision: http://svn.php.net/viewvc/?view=revision&revision=303963
Log: - Fixed bug #52971 (PCRE-Meta-Characters not working with utf-8)
# In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
# characters, even in UTF-8 mode. However, this can be changed by setting
# the PCRE_UCP option.

------------------------------------------------------------------------
[2010-10-03 11:02:15] cataphr...@php.net

I'm reopening, as there is indeed a different behavior on Windows that I
can't quite explain yet.

------------------------------------------------------------------------
[2010-10-03 10:21:34] marc dot bennewitz at giata dot de

There are some problems with it:

1. On Windows it works as expected.
2. With Unicode properties there is no word boundary (\w \W).
3. With the modifier "u", PHP knows that the subject is UTF-8.
4. http://php.net/manual/regexp.reference.escape.php has no note about
   a UTF-8 incompatibility.

php.exe -i
...
iconv

iconv support => enabled
iconv implementation => "libiconv"
iconv library version => 1.11

Directive => Local Value => Master Value
iconv.input_encoding => ISO-8859-1 => ISO-8859-1
iconv.internal_encoding => ISO-8859-1 => ISO-8859-1
iconv.output_encoding => ISO-8859-1 => ISO-8859-1
...
pcre

PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 8.02 2010-03-19

Directive => Local Value => Master Value
pcre.backtrack_limit => 100000 => 100000
pcre.recursion_limit => 100000 => 100000
...

------------------------------------------------------------------------
[2010-10-02 20:26:05] cataphr...@php.net

This is by design; it is the way \b and \w are defined in PCRE. You will
have to use another strategy, such as lookbehind and Unicode character
properties.

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at

    http://bugs.php.net/bug.php?id=52971
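
For illustration, here is a minimal sketch of the two approaches mentioned in the comments above: relying on \b under the "u" modifier (which, per the fix, sets PCRE_UCP), and the lookbehind with Unicode character properties suggested as a workaround. The sample string and the helper name find_matches() are hypothetical and are not taken from the original report. The sketch assumes the file is saved as UTF-8 and that PHP is linked against a PCRE build with Unicode property support.

```php
<?php
// Hypothetical helper: returns every match as a [text, byte offset] pair.
function find_matches(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $m, PREG_OFFSET_CAPTURE);
    return $m[0];
}

// Hypothetical sample text (UTF-8), not the string from the bug report.
$subject = 'Süßwasser ist Wasser';

// Approach 1: \b with the "u" modifier. On builds that contain the fix,
// "ß" counts as a word character, so there is no boundary inside
// "Süßwasser" and only the standalone word should match.
$withBoundary = find_matches('/\bwasser/iu', $subject);

// Approach 2: the workaround from the thread. The boundary is spelled out
// as "not preceded by a letter, digit or underscore" via Unicode
// properties, so it does not depend on PCRE_UCP.
$withLookbehind = find_matches('/(?<![\p{L}\p{N}_])wasser/iu', $subject);

// Offsets are byte offsets, so the two-byte "ü" and "ß" each count twice.
echo $withLookbehind[0][0], ' @ ', $withLookbehind[0][1], "\n"; // Wasser @ 16
```

On a build without PCRE_UCP, the first pattern would presumably also match the "wasser" inside "Süßwasser", which is the behavior originally reported. The second pattern should give the same single match either way.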