Req #52923 [Opn]: parse_url corrupts some UTF-8 strings

2010-09-25 Thread masteram at gmail dot com

Edit report at http://bugs.php.net/bug.php?id=52923&edit=1

 ID: 52923
 User updated by:masteram at gmail dot com
 Reported by:masteram at gmail dot com
 Summary:parse_url corrupts some UTF-8 strings
 Status: Open
 Type:   Feature/Change Request
 Package:*URL Functions
 Operating System:   MS Windows XP
 PHP Version:5.3.3
 Block user comment: N

 New Comment:

I tend to agree with Pajoye.

Although RFC 3986 calls for the use of percent-encoding in URLs, I
believe that it also mentions the IDN format. And the way things look
today, there is a host of websites that use UTF-8 encoding, which
benefits the readability of internationalized URLs.

I admit not being an expert in URL encoding, but it seems to me that
corrupting a string, even if it does not meet the current standards, is
a bad habit.

In addition, UTF-8 encoded URLs seem to be quite common in reality. Take
the international versions of Wikipedia as an example.

If I'm wrong about that, I would be more than happy to know it.



I am not sure that the encode-analyze-merge-decode procedure is really
the best choice. Perhaps the streamlined alternative should be
considered. It sure wouldn't hurt.

I, for one, am currently using 'ASCII-only' URLs.


Previous Comments:

[2010-09-25 14:34:34] paj...@php.net

It is not a bogus request. The idea would also be to get the decoded (to
UTF-8) URL elements as a result. It is also a good complement to IDN
support.
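
A userland approximation of that idea, as a rough sketch only: the function name `parse_url_utf8_sketch` is hypothetical and is not an existing PHP function. It hides non-ASCII bytes from parse_url as %XX escapes, parses, then decodes each string component back to UTF-8.

```php
<?php
// Hypothetical userland stand-in for the proposed parse_url_utf8().
function parse_url_utf8_sketch($url)
{
    // Encode: turn every non-ASCII byte into a %XX escape.
    $encoded = preg_replace_callback('/[\x80-\xFF]/', function ($m) {
        return rawurlencode($m[0]);
    }, $url);

    // Analyze: parse_url now only sees ASCII.
    $parts = parse_url($encoded);
    if ($parts === false) {
        return false;
    }

    // Decode: restore UTF-8 in each string component; the integer port is left alone.
    foreach ($parts as $key => $value) {
        if (is_string($value)) {
            $parts[$key] = rawurldecode($value);
        }
    }
    return $parts;
}

$parts = parse_url_utf8_sketch("http://www.mysite.org/he/\xD7\x90\xD7\x9D/ByYear.html");
var_dump($parts['path'] === "/he/\xD7\x90\xD7\x9D/ByYear.html"); // bool(true)
```

One caveat of this shortcut: rawurldecode also decodes escapes that were already present in the input (for example %2F), so the result is not always safe to merge back into a URL unchanged.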


[2010-09-25 14:19:40] cataphr...@php.net

I'd say this request/bug is bogus because such a URL is not valid
according to RFC 3986. He should first percent-encode all the characters
that are not allowed in a URI (perhaps after doing some Unicode
normalization) and only then parse the URL.
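
A minimal sketch of that encode-first approach, under some assumptions: the helper name `encode_non_ascii` is hypothetical, the example URL uses only the two affected characters (written as hex escapes), and the Unicode normalization step is left out.

```php
<?php
// Hypothetical helper: percent-encode every byte outside the ASCII range,
// leaving the URL's ASCII structure (scheme, slashes, dots) untouched.
function encode_non_ascii($url)
{
    return preg_replace_callback(
        '/[\x80-\xFF]/',
        function ($m) {
            return rawurlencode($m[0]);
        },
        $url
    );
}

// Path segment made of alef (d7 90) and final mem (d7 9d).
$url   = "http://www.mysite.org/he/\xD7\x90\xD7\x9D/ByYear.html";
$parts = parse_url(encode_non_ascii($url));

var_dump($parts['path'] === '/he/%D7%90%D7%9D/ByYear.html'); // bool(true)
```

Because parse_url only ever sees ASCII here, its handling of high bytes no longer matters; the caller can apply rawurldecode to a component when the UTF-8 form is needed.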


[2010-09-25 12:15:15] paj...@php.net

What about a parse_url_utf8, like what we have for IDN? It could be
easy to implement it using either native OS APIs (when available) or
using external libraries (there are a couple of good ones out there).


[2010-09-25 11:42:29] ras...@php.net

Reclassifying as a feature request.  parse_url has never been
multibyte-aware.

----
[2010-09-25 11:09:39] masteram at gmail dot com

Description:

I have tested this with PHP 5.2.9 and 5.3.3.

Some UTF-8 strings are not being processed correctly by parse_url.

In the given example, strings that contain the chars 'ם' or 'א' come
out corrupt, whereas the string '×××©××' (which does not contain the
above chars) is processed correctly.

The affected characters (in UTF-8) consist of the following
bytes:

ם - d7|9d

א - d7|90

Each of those is converted to a char which contains the following bytes:
d7|5f.
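
The byte values above can be checked with a short sketch like the following. The helper name `is_continuation_byte` is hypothetical, and the strings are written as hex escapes so the result does not depend on the script file's encoding. It suggests that 5f is a plain ASCII underscore left behind after the d7 lead byte, so the pair is no longer a well-formed UTF-8 sequence.

```php
<?php
// True when $byte has the bit pattern 10xxxxxx of a UTF-8 continuation byte.
function is_continuation_byte($byte)
{
    return (ord($byte) & 0xC0) === 0x80;
}

$finalMem  = "\xD7\x9D"; // U+05DD HEBREW LETTER FINAL MEM
$alef      = "\xD7\x90"; // U+05D0 HEBREW LETTER ALEF
$corrupted = "\xD7\x5F"; // the byte pair reported after parse_url

var_dump(bin2hex($finalMem));                  // string(4) "d79d"
var_dump(is_continuation_byte($finalMem[1]));  // bool(true)
var_dump(is_continuation_byte($alef[1]));      // bool(true)
var_dump($corrupted[1] === '_');               // bool(true)
var_dump(is_continuation_byte($corrupted[1])); // bool(false)
```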



In addition to ruining the URL, this char is not safe with
preg_replace.

Therefore, if we merge the result of parse_url back into a string and
then attempt to replace, say, spaces with underscores using
preg_replace, we get an empty string.
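
The failure mode described here can be sketched as follows. Assumption: the path below is made up and simply contains the d7|5f byte pair from this report. With the /u modifier, preg_replace returns NULL for a subject that is not valid UTF-8, and NULL prints as an empty string.

```php
<?php
// Made-up path containing the corrupt byte pair d7 5f and one space.
$path = "/he/\xD7\x5F/By Year.html";

$result = preg_replace('/\s+/u', '_', $path);

var_dump($result);                                   // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)

// Without /u the subject is treated as raw bytes and the call succeeds.
$raw = preg_replace('/\s+/', '_', $path);
var_dump($raw === "/he/\xD7\x5F/By_Year.html");      // bool(true)
```

Checking preg_last_error() after a NULL result is one way to tell this case apart from a genuinely empty replacement.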



I believe that this is similar to bug #26391.

Test script:
---
<?php
$url = 'http://www.mysite.org/he/פרוייקטים/ByYear.html';

// Split the URL into its components; $url['path'] is now corrupt.
$url = parse_url($url);

// With /u, preg_replace returns NULL for invalid UTF-8, so $url is now NULL.
$url = preg_replace('/\s+/u', '_', $url['path']);

Expected result:

The correct portion of the url.

Actual result:
--
Corrupt string (or blank after using preg_replace).







[PHP-BUG] Bug #52923 [NEW]: parse_url corrupts some UTF-8 strings

2010-09-25 Thread masteram at gmail dot com

From: 
Operating system: MS Windows XP
PHP version:  5.3.3
Package:  *URL Functions
Bug Type: Bug
Bug description:parse_url corrupts some UTF-8 strings


