Edit report at http://bugs.php.net/bug.php?id=52923&edit=1

 ID:                 52923
 Updated by:         ras...@php.net
 Reported by:        masteram at gmail dot com
 Summary:            parse_url corrupts some UTF-8 strings
 Status:             Open
-Type:               Bug
+Type:               Feature/Change Request
 Package:            *URL Functions
 Operating System:   MS Windows XP
 PHP Version:        5.3.3
 Block user comment: N

 New Comment:

Reclassifying as a feature request.  parse_url has never been
multibyte-aware.
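
Until parse_url handles multibyte input, one possible workaround is to
percent-encode every non-ASCII byte before parsing and to decode the
components afterwards. The sketch below is only an illustration of that
idea: parse_url_utf8 is a hypothetical helper name, not part of PHP, and
it assumes the input is a complete URL string.

```php
<?php
// Hypothetical helper (not a PHP built-in): percent-encode non-ASCII bytes
// so that parse_url only ever sees ASCII, then decode each string component.
function parse_url_utf8($url)
{
    // No /u modifier here: the pattern matches raw bytes 0x80-0xFF.
    $encoded = preg_replace_callback(
        '/[^\x00-\x7F]+/',
        function ($m) { return rawurlencode($m[0]); },
        $url
    );

    $parts = parse_url($encoded);
    if ($parts === false) {
        return false;
    }

    // Integer components such as 'port' are left untouched.
    return array_map(
        function ($p) { return is_string($p) ? rawurldecode($p) : $p; },
        $parts
    );
}

// "\xD7\x9D" and "\xD7\x90" are the UTF-8 bytes of the two affected characters.
$path  = "/he/\xD7\x9D\xD7\x90/ByYear.html";
$parts = parse_url_utf8('http://www.mysite.org' . $path);
var_dump($parts['path'] === $path);
```

Note that rawurldecode also decodes percent escapes that were already
present in the original URL (for example %20), so this sketch is not
suitable where that distinction matters.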


Previous Comments:
------------------------------------------------------------------------
[2010-09-25 11:09:39] masteram at gmail dot com

Description:
------------
I have tested this with PHP 5.2.9 and 5.3.3.

parse_url does not process some UTF-8 strings correctly.

In the given example, strings that contain the characters 'ם' or 'א'
come back corrupted, whereas the string 'מישהו' (which contains neither
character) is processed correctly.

The affected characters (in UTF-8) consist of the following bytes:

ם - d7|9d

א - d7|90

Each of them is converted to the following byte sequence: d7|5f.
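
The byte 0x5f is the ASCII underscore, so in both cases the second byte
of the character appears to have been replaced, leaving a lead byte
(0xd7) without a valid continuation byte. A minimal sketch to check
this, using a hypothetical helper is_valid_utf8 built on PCRE's /u
modifier, which rejects malformed UTF-8:

```php
<?php
// Hypothetical helper: true if $s is well-formed UTF-8.
// preg_match with /u returns false (not 1) when the subject is invalid UTF-8.
function is_valid_utf8($s)
{
    return preg_match('//u', $s) === 1;
}

$finalMem = "\xD7\x9D"; // U+05DD, the first affected character
$alef     = "\xD7\x90"; // U+05D0, the second affected character
$corrupt  = "\xD7\x5F"; // the byte sequence reported after parse_url

var_dump(is_valid_utf8($finalMem)); // bool(true)
var_dump(is_valid_utf8($alef));     // bool(true)
var_dump(is_valid_utf8($corrupt));  // bool(false)
```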



In addition to ruining the URL, the resulting byte sequence is not safe
to pass to preg_replace.

If we merge the result of parse_url back into a string and then attempt
to replace, say, spaces with underscores using preg_replace, we get an
empty string.
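
The empty result is consistent with how preg_replace treats malformed
UTF-8 under the /u modifier: it returns NULL, which prints as an empty
string. A small sketch, where $corruptPath is a hypothetical stand-in
for the corrupted path rather than actual parse_url output:

```php
<?php
// Hypothetical corrupted path containing the reported d7|5f sequence.
$corruptPath = "/he/\xD7\x5F/ByYear.html";

// With /u, PCRE validates the subject as UTF-8 before matching.
$result = preg_replace('/\s+/u', '_', $corruptPath);

var_dump($result);                                    // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)
```

Checking preg_last_error() after a NULL return distinguishes this case
from a pattern that simply matched nothing.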



I believe that this is similar to bug #26391.

Test script:
---------------
<?php
$url = 'http://www.mysite.org/he/פרויקטים/ByYear.html';

$parts = parse_url($url); // $parts['path'] is now corrupt

// With the corrupt path, preg_replace fails and $path is NULL
$path = preg_replace('/\s+/u', '_', $parts['path']);

Expected result:
----------------
The path portion of the URL, with its UTF-8 bytes unchanged.

Actual result:
--------------
Corrupt string (or blank after using preg_replace).


------------------------------------------------------------------------


