Edit report at https://bugs.php.net/bug.php?id=62562&edit=1

 ID:                 62562
 Comment by:         a...@php.net
 Reported by:        magog dot the dot ogre at gmail dot com
 Summary:            preg_replace mangles UTF8 string - Windows only
 Status:             Analyzed
 Type:               Bug
 Package:            *Regular Expressions
 Operating System:   Windows x86
 PHP Version:        5.3.14
 Block user comment: N
 Private report:     N

 New Comment:

Btw, the PCRE version reported by PHP is 8.12, but the current release is 8.30.
Maybe a simple upgrade would solve this.
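
For anyone wanting to check their own build, the PCRE version PHP reports can be printed with the built-in PCRE_VERSION constant. This is a minimal sketch; the output depends on the installation.

```php
<?php
// PCRE_VERSION holds the version string of the PCRE library this PHP build uses.
echo PCRE_VERSION, "\n";
```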


Previous Comments:
------------------------------------------------------------------------
[2012-07-16 15:19:54] a...@php.net

I've tested your PHP snippet on Win7, but it's probably the same on any Windows
version. The behaviour is as you describe. But there is another point: the
string to be matched is hardcoded into the script as UTF-8. If you open that
file in ASCII mode, you'll see each byte. See here (saved to a file, as the bug
tracker ruins the view): http://belsky.info/phpz/bugz/62562/62562_3.txt

Switch the encoding to UTF-8 in your browser and then to a non-multibyte one. 
Another way to do that is to open the file under Linux with

vim -c 'set encoding=latin1' 62562_3.txt

In both cases one can see that one byte is interpreted as a space. With no
UTF-8 modifier on the pattern, that behaviour is expected; furthermore, Windows
seems to do it right :)
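
The byte in question can be located with a short PHP sketch. It assumes the file is saved as UTF-8, as the script in the report is. That the byte meant here is 0xA0 (a no-break space in Latin-1 and Windows-1252) is an inference from the byte dump, not something stated above.

```php
<?php
// Assumption: this file is saved as UTF-8, like the script in the report.
$subject = "ინფორმაცია";

// Hex dump, one group per 3-byte UTF-8 sequence (every letter here is 3 bytes).
echo implode(' ', str_split(bin2hex($subject), 6)), "\n";

// Byte offset of the first 0xA0 byte, or false if there is none.
$offset = strpos($subject, "\xA0");
var_dump($offset);
```

The offset falls on the last byte of the fifth letter, which is the position where the reporter's pcretest control string has its inserted space.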

I've also debugged this under VS and it's definitely something coming back from
PCRE itself. Here: http://lxr.php.net/xref/PHP_5_4/ext/pcre/php_pcre.c#621

count is > 0 there, so matched is incremented and eventually returned.
Nevertheless, it could be a locale thing forcing PCRE to do UTF-8, but I
actually don't see any locale-dependent places in PCRE. Booting Linux with the
C locale might repro this there as well; I have no such machines though.
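
One way to probe the locale idea without rebooting is to compare the result under the current LC_CTYPE setting and under the C locale. This is only a sketch: it assumes the file is saved as UTF-8, and whether the locale is really what makes the difference is the open question here.

```php
<?php
// Assumption: this file is saved as UTF-8.
$subject = "ინფორმაცია";

// Passing "0" only queries the current LC_CTYPE setting without changing it.
echo "LC_CTYPE: ", setlocale(LC_CTYPE, "0"), "\n";
echo "match: ", preg_match('/\s+/', $subject), "\n";

// Force the C locale and run the same match again.
setlocale(LC_CTYPE, "C");
$inC = preg_match('/\s+/', $subject);
echo "match in C locale: ", $inC, "\n";
```

If the two results differ on a given machine, that would point at the locale rather than at the platform as such.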

------------------------------------------------------------------------
[2012-07-16 01:39:06] magog dot the dot ogre at gmail dot com

Yeah, it works on SunOS and Ubuntu for me too.

Well if/when you get access to a Windows distro or another developer who has 
one comes along, then I guess you can work on this bug. :)

------------------------------------------------------------------------
[2012-07-15 22:43:01] ras...@php.net

Well, I have looked at the code. We take the raw binary string and pass it 
straight to PCRE both on Windows and UNIX. So something along the way isn't the 
same. But I am not a Windows guy, so I can't help you on the Windows side of 
things. It works fine on my Linux box here.

------------------------------------------------------------------------
[2012-07-15 22:32:03] magog dot the dot ogre at gmail dot com

OK then, after doing some more poking around, it appears that it still might
be a PHP issue. Correct me if I'm wrong, but here are my findings:

Create a php file with only the following content:
  <?php
  echo preg_match("/\s+/", "ინფორმაცია")?"1":"0";

Running this on Windows will return "1", running on Unix returns "0".
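
For comparison, here is a sketch of the same match with the `u` (UTF-8) pattern modifier added. The modifier is not part of the snippet above; it is shown only to contrast byte-wise matching with matching on decoded code points, and it assumes the file is saved as UTF-8.

```php
<?php
// Assumption: this file is saved as UTF-8.
$subject = "ინფორმაცია";

// Byte mode: each byte is tested on its own (the case from the report).
$byteMode = preg_match('/\s+/', $subject);

// UTF-8 mode: the subject is decoded into code points before matching.
$utf8Mode = preg_match('/\s+/u', $subject);

echo "without u: ", $byteMode, "\n";
echo "with u:    ", $utf8Mode, "\n";
```

With `u`, the subject consists of ten Georgian letters and no whitespace code points, so the pattern should not match on either platform.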

I've now run this through PCRE directly, and PCRE reported no match. Thus,
it may be a PHP issue. Here is the output:
***Contents of test.txt
/\s+/
ინფორმაცია
ინფორ მაცია

***Output via Cygwin, running the Windows native pcretest.exe
(redacted)@(redacted)-PC /cygdrive/c/Program Files (x86)/pcre-7.0-bin/bin
$ ./pcretest.exe test.txt
PCRE version 7.0 18-Dec-2006

/\s+/
ინფორმაცია
No match
ინფორ მაცია
 0:

(I included the second example above with a space purposefully added, just to 
show that the tool is functioning properly and will catch the space when it's 
properly there).

------------------------------------------------------------------------
[2012-07-15 21:48:18] ras...@php.net

No, PCRE is a Perl-Compatible-Regex library but it is not the code used by Perl 
itself. Many (most?) open source things that have regex support will use PCRE.

------------------------------------------------------------------------


The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at

    https://bugs.php.net/bug.php?id=62562

