Req #52923 [Opn]: parse_url corrupts some UTF-8 strings
ID: 52923
Updated by: cataphr...@php.net
Reported by: masteram at gmail dot com
Summary: parse_url corrupts some UTF-8 strings
Status: Open
Type: Feature/Change Request
Package: *URL Functions
Operating System: MS Windows XP
PHP Version: 5.3.3
Block user comment: N

New Comment:

The problem is that nothing guarantees that a percent-encoded URL should be interpreted as containing UTF-8 data, or that an (invalid) URL containing non-encoded non-ASCII characters should be converted to UTF-8 before being percent-encoded. In fact, while most browsers use UTF-8 to build URLs entered in the address bar, for anchors in HTML pages they prefer the encoding of the page itself, provided it is also an ASCII superset.

That said, the corruption you describe seems uncalled for. In fact, I am unable to reproduce it. This is the value of $url I get in the end:

string(32) "/he/פר××ק×××/ByYear.html"

Previous Comments:

[2010-09-25 16:22:19] masteram at gmail dot com

I tend to agree with Pajoye. Although RFC 3986 calls for the use of percent-encoding for URLs, I believe that it also mentions the IDN format (and, the way things look today, there is a host of websites that use UTF-8 encoding, which benefits the readability of internationalized URLs).

I admit to not being an expert in URL encoding, but it seems to me that corrupting a string, even if it does not meet the current standards, is a bad habit. In addition, UTF-8 encoded URLs seem to be quite common in reality. Take the international versions of Wikipedia as an example. If I'm wrong about that, I would be more than happy to know it.

I am not sure that the encode-analyze-merge-decode procedure is really the best choice. Perhaps the streamlined alternative should be considered. It sure wouldn't hurt. I, for one, am currently using 'ASCII-only' URLs.

[2010-09-25 14:34:34] paj...@php.net

It is not a bogus request. The idea would also be to get the decoded (to UTF-8) URL elements as a result. It is also a good complement to IDN support.

[2010-09-25 14:19:40] cataphr...@php.net

I'd say this request/bug is bogus because such a URL is not valid according to RFC 3986. He should first percent-encode all the characters that are not allowed in a URI (perhaps after doing some Unicode normalization) and only then parse the URL.

[2010-09-25 12:15:15] paj...@php.net

What about a parse_url_utf8, like what we have for IDN? It could be easy to implement using either native OS APIs (when available) or external libraries (there are a couple of good ones out there).

[2010-09-25 11:42:29] ras...@php.net

Reclassifying as a feature request. parse_url has never been multibyte-aware.

The remainder of the comments for this report are too long. To view the rest of the comments, please view the bug report online at http://bugs.php.net/bug.php?id=52923
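The point about encodings can be made concrete with a small sketch. It assumes a page encoded in ISO-8859-8 (an ASCII superset chosen here purely for illustration; the thread names no specific charset) and shows that the same Hebrew letter percent-encodes to different bytes, and that decoding gives back bytes without saying which charset they belong to.

```php
<?php
// The Hebrew letter alef as raw bytes in two ASCII-superset encodings.
// ISO-8859-8 is an illustrative assumption, not something named in the thread.
$alefUtf8 = "\xD7\x90"; // UTF-8
$alefIso  = "\xE0";     // ISO-8859-8

$encodedUtf8 = rawurlencode($alefUtf8);
$encodedIso  = rawurlencode($alefIso);

// Decoding only restores bytes; whether they form valid UTF-8 has to be
// checked separately (an empty /u pattern matches only valid UTF-8).
$decoded       = rawurldecode($encodedIso);
$looksLikeUtf8 = preg_match('//u', $decoded) === 1;

var_dump($encodedUtf8);   // string(6) "%D7%90"
var_dump($encodedIso);    // string(3) "%E0"
var_dump($looksLikeUtf8); // bool(false)
```

A parser that receives "%E0" therefore cannot hand back UTF-8 without being told the source charset.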
Bug-Req #52923 [Opn]: parse_url corrupts some UTF-8 strings
ID: 52923
Updated by: ras...@php.net
Reported by: masteram at gmail dot com
Summary: parse_url corrupts some UTF-8 strings
Status: Open
-Type: Bug
+Type: Feature/Change Request
Package: *URL Functions
Operating System: MS Windows XP
PHP Version: 5.3.3
Block user comment: N

New Comment:

Reclassifying as a feature request. parse_url has never been multibyte-aware.

Previous Comments:

[2010-09-25 11:09:39] masteram at gmail dot com

Description:

I have tested this with PHP 5.2.9 and 5.3.3. Some UTF-8 strings are not being processed correctly by parse_url. In the given example, the result of evaluating strings that contain the chars 'ם' or 'א' is corrupt, whereas the string '××ש××' (which does not contain the above chars) is processed correctly.

The affected characters (in UTF-8) are comprised of the following bytes:

ם - d7|9d
א - d7|90

Those are converted to a char which contains the following bytes: d7|5f.

In addition to ruining the URL, this char is not safe with preg_replace. Therefore, if we merge the result of parse_url back into a string and then attempt to replace, say, spaces with underscores using preg_replace, we will get an empty string. I believe that this is similar to bug #26391.

Test script:
---
$url = 'http://www.mysite.org/he/פר××ק×××/ByYear.html';
$url = parse_url($url); // $url['path'] is now corrupt
$url = preg_replace('/\s+/u', '_', $url['path']); // $url is now undefined

Expected result:
The correct portion of the url.

Actual result:
--
Corrupt string (or blank after using preg_replace).
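The byte values in the report explain the blank preg_replace result. The sketch below uses only the byte sequences quoted above, written as escapes so the file's own encoding does not matter. The path and the helper name `is_valid_utf8` are made up for illustration. Note that 0x5F is the ASCII underscore, so d7|5f is a UTF-8 lead byte followed by a non-continuation byte. The thread does not say why the second byte was replaced.

```php
<?php
// Byte sequences quoted in the report.
$finalMem = "\xD7\x9D"; // U+05DD, valid UTF-8
$alef     = "\xD7\x90"; // U+05D0, valid UTF-8
$corrupt  = "\xD7\x5F"; // reported result: lead byte followed by '_'

// Hypothetical helper: an empty pattern with the /u modifier matches only
// if the subject is valid UTF-8.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8($finalMem)); // bool(true)
var_dump(is_valid_utf8($alef));     // bool(true)
var_dump(is_valid_utf8($corrupt));  // bool(false)

// Illustrative path, not the one from the report.
$path   = '/he/' . $corrupt . '/By Year.html';
$result = preg_replace('/\s+/u', '_', $path);
$error  = preg_last_error();

var_dump($result);                        // NULL
var_dump($error === PREG_BAD_UTF8_ERROR); // bool(true)
```

With the /u modifier, preg_replace refuses an invalid UTF-8 subject and returns NULL, which is consistent with the "blank" value the reporter saw.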
Req #52923 [Opn]: parse_url corrupts some UTF-8 strings
ID: 52923
Updated by: paj...@php.net
Reported by: masteram at gmail dot com
Summary: parse_url corrupts some UTF-8 strings
Status: Open
Type: Feature/Change Request
Package: *URL Functions
Operating System: MS Windows XP
PHP Version: 5.3.3
Block user comment: N

New Comment:

What about a parse_url_utf8, like what we have for IDN? It could be easy to implement using either native OS APIs (when available) or external libraries (there are a couple of good ones out there).
Req #52923 [Opn]: parse_url corrupts some UTF-8 strings
ID: 52923
Updated by: cataphr...@php.net
Reported by: masteram at gmail dot com
Summary: parse_url corrupts some UTF-8 strings
Status: Open
Type: Feature/Change Request
Package: *URL Functions
Operating System: MS Windows XP
PHP Version: 5.3.3
Block user comment: N

New Comment:

I'd say this request/bug is bogus because such a URL is not valid according to RFC 3986. He should first percent-encode all the characters that are not allowed in a URI (perhaps after doing some Unicode normalization) and only then parse the URL.
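The "encode first, then parse" suggestion could look roughly like the following. This is a minimal sketch, not code from the thread: `encode_non_ascii` is a hypothetical helper, the URL is an illustrative one built from the two bytes sequences named in the report, and the optional Unicode normalization step is left out.

```php
<?php
// Hypothetical helper: percent-encode every byte outside printable ASCII
// (0x21-0x7E), leaving URL delimiters such as ':' and '/' untouched.
// The pattern has no /u modifier, so it works byte by byte.
function encode_non_ascii(string $url): string
{
    return preg_replace_callback(
        '/[^\x21-\x7E]/',
        function (array $m): string {
            return rawurlencode($m[0]);
        },
        $url
    );
}

// Illustrative URL containing the bytes d7 90 and d7 9d.
$url   = "http://www.mysite.org/he/\xD7\x90\xD7\x9D/ByYear.html";
$parts = parse_url(encode_non_ascii($url));

var_dump($parts['path']); // string(28) "/he/%D7%90%D7%9D/ByYear.html"

// Decoding the component afterwards restores the original bytes.
$decodedPath = rawurldecode($parts['path']);
var_dump($decodedPath === "/he/\xD7\x90\xD7\x9D/ByYear.html"); // bool(true)
```

Because parse_url only ever sees ASCII here, its handling of high bytes no longer matters.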
Req #52923 [Opn]: parse_url corrupts some UTF-8 strings
ID: 52923
Updated by: paj...@php.net
Reported by: masteram at gmail dot com
Summary: parse_url corrupts some UTF-8 strings
Status: Open
Type: Feature/Change Request
Package: *URL Functions
Operating System: MS Windows XP
PHP Version: 5.3.3
Block user comment: N

New Comment:

It is not a bogus request. The idea would also be to get the decoded (to UTF-8) URL elements as a result. It is also a good complement to IDN support.
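For illustration, a userland stand-in for the proposed function might be sketched as below. No parse_url_utf8() exists in PHP. The name is borrowed from the proposal, and the behaviour shown (components returned percent-decoded, input assumed to be UTF-8 already) is an assumption about what was intended.

```php
<?php
// Hypothetical userland sketch of the proposed parse_url_utf8().
// Assumes $url is already UTF-8; returns false when parse_url() fails.
function parse_url_utf8(string $url)
{
    // Hide non-ASCII bytes from parse_url() by percent-encoding them.
    $ascii = preg_replace_callback(
        '/[^\x21-\x7E]/',
        function (array $m): string {
            return rawurlencode($m[0]);
        },
        $url
    );

    $parts = parse_url($ascii);
    if ($parts === false) {
        return false;
    }

    // Decode every string component; the port stays an integer.
    foreach ($parts as $key => $value) {
        if (is_string($value)) {
            $parts[$key] = rawurldecode($value);
        }
    }
    return $parts;
}

// Illustrative URL, not the one from the report.
$parts = parse_url_utf8("http://www.mysite.org/he/\xD7\x90\xD7\x9D/ByYear.html?q=\xD7\xA9");
var_dump(preg_match('//u', $parts['path']) === 1); // bool(true)
```

One caveat of this design is that escapes already present in the input, such as %2F, are decoded as well, so a caller who needs to tell them apart would have to keep the encoded form.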
Req #52923 [Opn]: parse_url corrupts some UTF-8 strings
ID: 52923
User updated by: masteram at gmail dot com
Reported by: masteram at gmail dot com
Summary: parse_url corrupts some UTF-8 strings
Status: Open
Type: Feature/Change Request
Package: *URL Functions
Operating System: MS Windows XP
PHP Version: 5.3.3
Block user comment: N

New Comment:

I tend to agree with Pajoye. Although RFC 3986 calls for the use of percent-encoding for URLs, I believe that it also mentions the IDN format (and, the way things look today, there is a host of websites that use UTF-8 encoding, which benefits the readability of internationalized URLs).

I admit to not being an expert in URL encoding, but it seems to me that corrupting a string, even if it does not meet the current standards, is a bad habit. In addition, UTF-8 encoded URLs seem to be quite common in reality. Take the international versions of Wikipedia as an example. If I'm wrong about that, I would be more than happy to know it.

I am not sure that the encode-analyze-merge-decode procedure is really the best choice. Perhaps the streamlined alternative should be considered. It sure wouldn't hurt. I, for one, am currently using 'ASCII-only' URLs.