[PHP-BUG] Bug #52923 [NEW]: parse_url corrupts some UTF-8 strings

2010-09-25 Thread masteram at gmail dot com
Operating system: MS Windows XP
PHP version:  5.3.3
Package:  *URL Functions
Bug Type: Bug
Bug description: parse_url corrupts some UTF-8 strings

Description:

I have tested this with PHP 5.2.9 and 5.3.3.

Some UTF-8 strings are not being processed correctly by parse_url.

In the example below, strings that contain the characters 'ם' or 'א'
come back corrupted, whereas the string 'מישהו' (which contains neither
character) is processed correctly.

The affected characters consist of the following bytes in UTF-8:

ם - d7|9d

א - d7|90



Each of them is converted to the byte pair d7|5f, which is not a valid
UTF-8 sequence.
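
To check which bytes a component actually contains, a hex dump is usually enough. The following is a minimal sketch; `hex_bytes` and `is_valid_utf8` are hypothetical helper names, not PHP built-ins, and the "corrupted" value is simply the byte pair quoted in this report.

```php
<?php
// Hypothetical helper: format a string's bytes as "d7|9d", the notation used above.
function hex_bytes($s)
{
    return implode('|', str_split(bin2hex($s), 2));
}

// Hypothetical helper: an empty pattern with the /u modifier matches only
// when the subject is well-formed UTF-8.
function is_valid_utf8($s)
{
    return preg_match('//u', $s) === 1;
}

$mem_sofit = "\xd7\x9d"; // 'ם'
$alef      = "\xd7\x90"; // 'א'
$corrupted = "\xd7\x5f"; // byte pair the report says parse_url produced

echo hex_bytes($mem_sofit), "\n"; // d7|9d
echo hex_bytes($alef), "\n";      // d7|90
echo hex_bytes($corrupted), "\n"; // d7|5f

var_dump(is_valid_utf8($mem_sofit)); // bool(true)
var_dump(is_valid_utf8($corrupted)); // bool(false)
```

Running `hex_bytes()` on `$url['path']` before and after `parse_url` should show whether the second byte of a character was replaced by 5f (an underscore).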



In addition to ruining the URL, this byte sequence is not safe to pass to
preg_replace.

Therefore, if we merge the result of parse_url back into a string and then
try to replace, say, spaces with underscores using preg_replace with the /u
modifier, we get an empty result (preg_replace returns NULL).
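
The preg_replace failure can be reproduced on its own, without parse_url. This is a small sketch; the path below is a made-up stand-in that merely contains the invalid pair d7|5f described above.

```php
<?php
// Stand-in for a corrupted path: contains the invalid byte pair d7|5f.
$corrupt_path = "/he/\xd7\x5f/ByYear.html";

// With /u the subject must be valid UTF-8, so the call fails and returns NULL.
$result = preg_replace('/\s+/u', '_', $corrupt_path);
$error  = preg_last_error();

var_dump($result);                        // NULL
var_dump($error === PREG_BAD_UTF8_ERROR); // bool(true)

// Without /u the pattern works on raw bytes, so the call succeeds
// (the string is returned as-is because it contains no whitespace).
$bytewise = preg_replace('/\s+/', '_', $corrupt_path);
var_dump($bytewise === $corrupt_path);    // bool(true)
```

Checking `preg_last_error()` after a NULL return distinguishes malformed input from a pattern that simply matched nothing.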



I believe that this is similar to bug #26391.

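Test script:
---
<?php
$url = 'http://www.mysite.org/he/פרויקטים/ByYear.html';

// On the affected setup, $parts['path'] comes back with corrupted bytes.
$parts = parse_url($url);

// With the /u modifier, preg_replace() returns NULL for invalid UTF-8 input.
$path = preg_replace('/\s+/u', '_', $parts['path']);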

Expected result:

The path component of the URL, with its bytes unchanged.

Actual result:
--
Corrupt string (or blank after using preg_replace).




Req #52923 [Opn]: parse_url corrupts some UTF-8 strings

2010-09-25 Thread masteram at gmail dot com

 ID:                 52923
 User updated by:    masteram at gmail dot com
 Reported by:        masteram at gmail dot com
 Summary:            parse_url corrupts some UTF-8 strings
 Status:             Open
 Type:               Feature/Change Request
 Package:            *URL Functions
 Operating System:   MS Windows XP
 PHP Version:        5.3.3
 Block user comment: N

 New Comment:

I tend to agree with Pajoye.

Although RFC 3986 calls for percent-encoding in URLs, I believe that it
also mentions the IDN format. And the way things look today, plenty of
websites use UTF-8 in their URLs, which makes internationalized URLs
more readable.

I admit I am not an expert in URL encoding, but it seems to me that
corrupting a string, even one that does not meet the current standards,
is a bad habit.

In addition, UTF-8 encoded URLs seem to be quite common in practice. Take
the international versions of Wikipedia as an example.

If I'm wrong about that, I would be more than happy to know it.



I am not sure that the encode-analyze-merge-decode procedure is really
the best choice. Perhaps the streamlined alternative should be
considered. It sure wouldn't hurt.
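
For reference, the encode-analyze-decode procedure can be sketched roughly as follows. This is only an illustration under stated assumptions: `parse_url_utf8_sketch` is a hypothetical name (it is not the parse_url_utf8 proposed in this thread, which does not exist), and it handles only raw bytes at or above 0x80.

```php
<?php
// Hypothetical helper, not part of PHP: parse a URL that contains raw UTF-8
// by percent-encoding non-ASCII bytes first and restoring them afterwards.
function parse_url_utf8_sketch($url)
{
    // Encode: replace every byte >= 0x80 with its %XX form.
    $encoded = preg_replace_callback(
        '/[\x80-\xff]/',
        function ($m) {
            return rawurlencode($m[0]);
        },
        $url
    );

    // Analyze: parse_url() now sees ASCII only.
    $parts = parse_url($encoded);
    if ($parts === false) {
        return false;
    }

    // Decode: turn %80..%FF back into raw bytes in each string component.
    // ASCII escapes such as %2F are deliberately left untouched.
    foreach ($parts as $key => $value) {
        if (is_string($value)) {
            $parts[$key] = preg_replace_callback(
                '/%([89A-Fa-f][0-9A-Fa-f])/',
                function ($m) {
                    return chr(hexdec($m[1]));
                },
                $value
            );
        }
    }
    return $parts;
}

$parts = parse_url_utf8_sketch('http://www.mysite.org/he/פרויקטים/ByYear.html');
echo $parts['path'], "\n";
```

One caveat of this design: escapes in the %80 to %FF range that were already present in the input are decoded as well, so the result does not always round-trip exactly.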

I, for one, am currently using 'ASCII-only' URLs.


Previous Comments:

[2010-09-25 14:34:34] paj...@php.net

It is not a bogus request. The idea would also be to get the decoded
(UTF-8) URL elements as the result. It is also a good complement to IDN
support.


[2010-09-25 14:19:40] cataphr...@php.net

I'd say this request/bug is bogus because such a URL is not valid
according to RFC 3986. He should first percent-encode all the characters
that are not allowed in a URI (perhaps after doing some Unicode
normalization) and only then parse the URL.


[2010-09-25 12:15:15] paj...@php.net

What about a parse_url_utf8, like what we have for IDN? It could be
easy to implement using either native OS APIs (when available) or
external libraries (there are a couple of good ones out there).


[2010-09-25 11:42:29] ras...@php.net

Reclassifying as a feature request.  parse_url has never been
multibyte-aware.


[2010-09-25 11:09:39] masteram at gmail dot com
