Edit report at http://bugs.php.net/bug.php?id=52971&edit=1

 ID:                 52971
 User updated by:    marc dot bennewitz at giata dot de
 Reported by:        marc dot bennewitz at giata dot de
 Summary:            PCRE-Meta-Characters not working with utf-8
 Status:             Closed
 Type:               Bug
 Package:            PCRE related
 Operating System:   Linux
 PHP Version:        5.3.3
 Assigned To:        felipe
 Block user comment: N

 New Comment:

now it works fine :)

thanks


Previous Comments:
------------------------------------------------------------------------
[2010-10-03 18:02:19] fel...@php.net

This bug has been fixed in SVN.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

The latest version of PCRE added a PCRE_UCP flag. As the documentation
states:

"In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
characters, even in UTF-8 mode. However, this can be changed by setting
the PCRE_UCP option."
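
For illustration, here is a minimal PHP sketch of the effect described in the
quote. It assumes a PHP build that includes this fix and a PCRE library new
enough to provide PCRE_UCP, and it assumes the flag is switched on together
with the "u" modifier. The subject string is hypothetical and is not the one
from the original report.

```php
<?php
// Hypothetical UTF-8 subject; not the string from the original report.
$subject = "Größe und Wasser";

// Assumption: with the "u" modifier, a build containing this fix also sets
// PCRE_UCP, so \w should match non-ASCII letters such as "ö" and "ß".
preg_match_all('/\w+/u', $subject, $matches);

// $matches[0] holds every run of word characters found in the subject.
var_dump($matches[0]);
```

On a build without PCRE_UCP, \w would be expected to stop at the first
non-ASCII letter, which splits "Größe" into separate pieces.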



With the flag set, we get:

array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "Wasser"
      [1]=>
      int(61)
    }
  }
}

array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(7) " Wasser"
      [1]=>
      int(60)
    }
  }
}

------------------------------------------------------------------------
[2010-10-03 18:01:40] fel...@php.net

Automatic comment from SVN on behalf of felipe
Revision: http://svn.php.net/viewvc/?view=revision&revision=303963
Log: - Fixed bug #52971 (PCRE-Meta-Characters not working with utf-8)
#   In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
#   characters, even in UTF-8 mode. However, this can be changed by setting
#   the PCRE_UCP option.

------------------------------------------------------------------------
[2010-10-03 11:02:15] cataphr...@php.net

I'm reopening this, as there is indeed different behavior on Windows that I
can't quite explain yet.

------------------------------------------------------------------------
[2010-10-03 10:21:34] marc dot bennewitz at giata dot de

There are some problems with that:

1. On Windows it works as expected.
2. With Unicode properties there is no word boundary (\w \W).
3. With the "u" modifier, PHP knows that the subject is UTF-8.
4. http://php.net/manual/regexp.reference.escape.php has no note about
   UTF-8 incompatibility.



php.exe -i
...
iconv

iconv support => enabled
iconv implementation => "libiconv"
iconv library version => 1.11

Directive => Local Value => Master Value
iconv.input_encoding => ISO-8859-1 => ISO-8859-1
iconv.internal_encoding => ISO-8859-1 => ISO-8859-1
iconv.output_encoding => ISO-8859-1 => ISO-8859-1
...
pcre

PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 8.02 2010-03-19

Directive => Local Value => Master Value
pcre.backtrack_limit => 100000 => 100000
pcre.recursion_limit => 100000 => 100000
...

------------------------------------------------------------------------
[2010-10-02 20:26:05] cataphr...@php.net

This is by design; it's the way \b and \w are defined in PCRE.

You'll have to use another strategy, such as lookbehind and Unicode
character properties.
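
One possible shape of that strategy, as a rough sketch only: the subject
string and the choice of \p{L}, \p{N} and "_" as "word characters" are
assumptions made for illustration, not taken from the report. Negative
lookbehind and lookahead stand in for \b on both sides of the word.

```php
<?php
// Hypothetical UTF-8 subject: only the middle "Wasser" is a standalone word.
$subject = "äWasser Wasser Wasserö";

// Emulate \b with lookarounds: no letter, digit, or underscore may sit
// directly before or after the word. The "u" modifier makes PCRE read the
// pattern and subject as UTF-8, so "ä" and "ö" count as single letters.
$pattern = '/(?<![\p{L}\p{N}_])Wasser(?![\p{L}\p{N}_])/u';

// PREG_OFFSET_CAPTURE adds the byte offset of each match to the result.
$count = preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);

var_dump($count, $matches[0]);
```

This relies only on Unicode property support in PCRE, so it should not depend
on PCRE_UCP being available. Offsets are reported in bytes, not characters.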

------------------------------------------------------------------------


The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at

    http://bugs.php.net/bug.php?id=52971

