Edit report at http://bugs.php.net/bug.php?id=52923&edit=1
ID: 52923
User updated by: masteram at gmail dot com
Reported by: masteram at gmail dot com
Summary: parse_url corrupts some UTF-8 strings
Status: Open
Type: Feature/Change Request
Package: *URL Functions
Operating System: MS Windows XP
PHP Version: 5.3.3
Block user comment: N
New Comment:
I tend to agree with Pajoye.
Although RFC 3986 calls for the use of percent-encoding for URLs, I
believe that it also mentions the IDN format (and, the way things look
today, there is a host of websites that use UTF-8 encoding, which
benefits the readability of internationalized URLs).
I admit not being an expert in URL encoding, but it seems to me that
corrupting a string, even if it does not meet the current standards, is
a bad habit.
In addition, UTF-8 encoded URLs seem to be quite common in practice. Take
the international versions of Wikipedia as an example.
If I'm wrong about that, I would be more than happy to know it.
I am not sure that the encode-analyze-merge-decode procedure is really
the best choice. Perhaps the streamlined alternative should be
considered. It sure wouldn't hurt.
I, for one, am currently using 'ASCII-only' URLs.
Previous Comments:
[2010-09-25 14:34:34] paj...@php.net
It is not a bogus request. The idea would also be to get the decoded
(to UTF-8) URL elements as the result. It would also be a good
complement to IDN support.
[2010-09-25 14:19:40] cataphr...@php.net
I'd say this request/bug is bogus because such a URL is not valid
according to RFC 3986. He should first percent-encode all the characters
that are not allowed in a URI (perhaps after doing some Unicode
normalization) and only then parse the URL.
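The encode-then-parse approach described above could be sketched roughly
as follows. This is only an illustration under stated assumptions:
parse_url_utf8_sketch is a hypothetical helper name, not an existing PHP
function, and the example URL uses the two byte sequences quoted in the
report (d7 9d and d7 90) written as escapes.

```php
<?php
// Hypothetical helper (not part of PHP): percent-encode, parse, decode.
function parse_url_utf8_sketch($url)
{
    // Without the /u modifier the pattern works on single bytes, so every
    // byte outside printable ASCII (0x21-0x7E) is percent-encoded.
    $encoded = preg_replace_callback(
        '/[^\x21-\x7e]/',
        function ($m) { return rawurlencode($m[0]); },
        $url
    );

    $parts = parse_url($encoded);
    if ($parts === false) {
        return false;
    }

    // Decode the string components back to their original bytes.
    // The port, if present, is an integer and is left alone.
    foreach ($parts as $key => $value) {
        if (is_string($value)) {
            $parts[$key] = rawurldecode($value);
        }
    }
    return $parts;
}

$parts = parse_url_utf8_sketch("http://www.mysite.org/he/\xd7\x9d\xd7\x90/ByYear.html");
var_dump($parts['path'] === "/he/\xd7\x9d\xd7\x90/ByYear.html"); // bool(true)
```

One caveat of this sketch: the final decoding step also decodes sequences
that were already percent-encoded in the input (for example %2F in a
path), so the components may not round-trip exactly for such URLs.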
[2010-09-25 12:15:15] paj...@php.net
What about a parse_url_utf8, like what we have for IDN? It could be
easy to implement it using either native OS APIs (when available) or
using external libraries (there are a couple of good ones out there).
[2010-09-25 11:42:29] ras...@php.net
Reclassifying as a feature request. parse_url has never been
multibyte-aware.
----
[2010-09-25 11:09:39] masteram at gmail dot com
Description:
I have tested this with PHP 5.2.9 and 5.3.3.
Some UTF-8 strings are not being processed correctly by parse_url.
In the given example, the result of evaluating strings which
contain the chars 'ם' or 'א' is corrupt, whereas the string
'××ש××'(which does not contain the above chars) is being processed
correctly.
The affected characters (in UTF-8) are comprised of the following
bytes:
ם - d7|9d
א - d7|90
Those are converted to a char which contains the following bytes:
d7|5f.
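To make the byte-level claim concrete, here is a small sketch that
checks the three byte sequences quoted above for UTF-8 validity. The
helper name is_valid_utf8_sketch is an assumption for illustration; it
relies on the common idiom of matching an empty pattern with the /u
modifier. Note that 0x5f is the ASCII underscore, so d7|5f is a UTF-8
lead byte followed by a non-continuation byte.

```php
<?php
// Hypothetical helper: preg_match with /u returns 1 for a valid UTF-8
// subject and false when the subject is not valid UTF-8.
function is_valid_utf8_sketch($s)
{
    return preg_match('//u', $s) === 1;
}

$finalMem = "\xd7\x9d"; // first affected character, bytes from the report
$alef     = "\xd7\x90"; // second affected character, bytes from the report
$corrupt  = "\xd7\x5f"; // bytes the report says come out of parse_url

foreach (array($finalMem, $alef, $corrupt) as $s) {
    echo bin2hex($s), ' valid UTF-8: ',
        var_export(is_valid_utf8_sketch($s), true), "\n";
}
```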
In addition to ruining the url, this char is not safe with
preg_replace.
Therefore, if we merge the result of parse_url back into a string, and
then attempt to replace, say, spaces with underscores using
preg_replace, we will get NULL (which prints as an empty string).
I believe that this is similar to bug #26391.
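The preg_replace failure described above can be reproduced without
parse_url by feeding it a path that already contains the d7|5f bytes.
This is a minimal sketch; the $corruptPath value is an assumed stand-in
for what parse_url is reported to return, not captured output.

```php
<?php
// Assumed stand-in for the corrupted path: d7 followed by 5f ('_').
$corruptPath = "/he/\xd7\x5f/ByYear.html";

// With the /u modifier the subject must be valid UTF-8, otherwise
// preg_replace gives up and returns NULL.
$withU = preg_replace('/\s+/u', '_', $corruptPath);
var_dump($withU);                                    // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)

// Without /u the pattern works on bytes, so the string passes through.
$withoutU = preg_replace('/\s+/', '_', $corruptPath);
var_dump($withoutU === $corruptPath);                // bool(true)
```

Checking preg_last_error() after a NULL result is one way to tell a
UTF-8 problem apart from other failures.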
Test script:
---
$url = 'http://www.mysite.org/he/פרויקטים/ByYear.html';
$url = parse_url($url); // $url['path'] is now corrupt
$url = preg_replace('/\s+/u', '_', $url['path']); // $url is now NULL
Expected result:
The correct portion of the url.
Actual result:
--
Corrupt string (or blank after using preg_replace).