ID:               48360
 Updated by:       lbarn...@php.net
 Reported By:      martin2007 at laposte dot net
-Status:           Bogus
+Status:           Open
 Bug Type:         URL related
 Operating System: Linux
 PHP Version:      5.2.9


Previous Comments:
------------------------------------------------------------------------

[2009-05-22 11:47:07] lbarn...@php.net

Sorry, but your problem does not imply a bug in PHP itself.  For a
list of more appropriate places to ask for help using PHP, please
visit http://www.php.net/support.php as this bug system is not the
appropriate forum for asking support questions.  Due to the volume
of reports we can not explain in detail here why your report is not
a bug.  The support channels will be able to provide an explanation
for you.

Thank you for your interest in PHP.

From the RFC:
   Usually a URL has the same interpretation when an octet is
   represented by a character and when it encoded. [...]

   [...] characters that are not required to be encoded
   (including alphanumerics) may be encoded within the scheme-specific
   part of a URL, as long as they are not being used for a reserved
   purpose.


This means urlencode() may encode everything, including alphanumerics,
and still be RFC1738 compliant.

www.example.com/$!*'(), === www.example.com/%24%21%2A%27%28%29%2C
www.example.com/%24%21%2A%27%28%29%2C === www.example.com/$!*'(),
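
The equivalence above can be checked from PHP itself. The following is a
minimal sketch, using only the seven characters from this report; the
variable names are illustrative and not taken from the report.

```php
<?php
// The literal characters from the report and their percent-encoded form.
$literal = "\$!*'(),";
$encoded = '%24%21%2A%27%28%29%2C';

// Decoding the encoded form gives back the literal characters ...
var_dump(rawurldecode($encoded) === $literal);

// ... and encoding the literal characters gives the encoded form.
var_dump(rawurlencode($literal) === $encoded);
```

Both comparisons should evaluate to true, which is the sense in which the
two spellings are interchangeable: a server that decodes the path sees the
same octets either way.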

For your experiment, you may want to try linking to the same page
twice, with each link encoded differently, and then check whether
Google indexes the page twice under two different URLs.

Search engines are smart enough to canonicalize every URL they have to
work with. Two URLs that differ only in how these characters are
encoded still refer to the same resource.

------------------------------------------------------------------------

[2009-05-22 10:17:17] martin2007 at laposte dot net

Description:
------------
urlencode and rawurlencode are not RFC-1738 compliant.

RFC-1738 states:
" Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL."
Later on, the grammar is as follows:

unreserved     = alpha | digit | safe | extra
safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","


However, urlencode and rawurlencode encode $!*'(),

Note that, except for "$" and ",", this is also true for RFC-2396
(URI).

The main problem is that Google uses another encoding scheme. When you
have URLs containing these characters, your weblogs contain several
different URLs for the same resource. It might also confuse some web
server implementations.


See: http://www.monperrus.net/martin/googenc/


Reproduce code:
---------------
echo urlencode("$!*'(),");
echo rawurlencode("$!*'(),");

Expected result:
----------------
$!*'(),
$!*'(),

Actual result:
--------------
%24%21%2A%27%28%29%2C
%24%21%2A%27%28%29%2C
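
If the unencoded form is wanted in userland code, one possible workaround
is to post-process the output of rawurlencode(). The sketch below is only
an illustration: rfc1738_path_encode() is a hypothetical helper, not a PHP
function, and it restores only the seven characters listed in this report.

```php
<?php
// Hypothetical helper (not part of PHP): encode with rawurlencode(), then
// turn the seven characters from this report back into their literal form.
function rfc1738_path_encode($s)
{
    $keep = array(
        '%24' => '$', '%21' => '!', '%2A' => '*', '%27' => "'",
        '%28' => '(', '%29' => ')', '%2C' => ',',
    );
    // rawurlencode() emits "%" only as the start of an escape triple, so
    // these replacements cannot match across two different escapes.
    return strtr(rawurlencode($s), $keep);
}

echo rfc1738_path_encode("\$!*'(),"), "\n";
echo rfc1738_path_encode("a b/\$c"), "\n";
```

With this helper the first call should print the characters unchanged,
while other octets such as the space and "/" in the second call are still
percent-encoded.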


------------------------------------------------------------------------

