Hi all,
The functions utf8_encode and utf8_decode are historical oddities, which
almost certainly would not be accepted if proposed today:
* Their names do not describe their functionality, which is to convert
to/from one specific single-byte encoding. This leads to a common
confusion that they can be used to "fix" UTF-8 encoding problems, which
they generally make worse.
* That single-byte encoding is ISO 8859-1, not its common cousins
Windows-1252 or ISO 8859-15. This means, for instance, that they do not
handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable),
not "\x80" (Windows-1252) or "\xA4" (8859-15).
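To make both points concrete, here is a minimal sketch. It assumes a PHP
version that still ships utf8_encode / utf8_decode and has ext/mbstring
loaded; the mb_convert_encoding() calls are only there for comparison.

```php
<?php
// "é" as valid UTF-8 (bytes C3 A9).
$utf8 = "\xC3\xA9";

// utf8_encode() treats its input as ISO 8859-1, so each of the two
// bytes is re-encoded on its own: the string is now double-encoded.
$double = utf8_encode($utf8);
echo bin2hex($double), "\n"; // c383c2a9, which displays as "Ã©"

// The Euro sign (U+20AC) as UTF-8. ISO 8859-1 has no slot for it.
$euro = "\xE2\x82\xAC";
$latin1 = utf8_decode($euro);
echo $latin1, "\n"; // ?

// The "cousin" encodings do map it (assumes ext/mbstring).
$win1252 = mb_convert_encoding($euro, 'Windows-1252', 'UTF-8');
$iso15   = mb_convert_encoding($euro, 'ISO-8859-15', 'UTF-8');
echo bin2hex($win1252), "\n"; // 80
echo bin2hex($iso15), "\n";   // a4
```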
On the other hand, they are commonly used, both correctly and
incorrectly, so removing them is not easy.
A previous proposal to remove them [1] resulted in Andrea making two
significant improvements: moving them from ext/xml to ext/standard [2]
and rewriting the documentation to explain them properly [3]. My genuine
thanks for that.
However, it hasn't stopped people misunderstanding them, and quite
reasonably: you shouldn't need to look up every function you use in the
manual, to make sure it actually does what its name suggests.
I can see three ways forward:
A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
a specific replacement, but recommend people look at iconv() or
mb_convert_encoding(). There is precedent for this, such as
convert_cyr_string(), but it may frustrate those who are using the
functions correctly.
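For those correct uses, the migration under option A could look roughly
like the sketch below. It assumes ext/mbstring is available and that the
input is valid in its source encoding; behaviour on malformed input may
differ from the utf8_* functions, so treat this as an illustration
rather than a drop-in guarantee.

```php
<?php
// "café" in ISO 8859-1 (the final byte E9 is "é").
$latin1 = "caf\xE9";

// In place of utf8_encode($latin1):
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
// Alternative with ext/iconv: iconv('ISO-8859-1', 'UTF-8', $latin1)

// In place of utf8_decode($utf8):
$back = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

var_dump($utf8 === "caf\xC3\xA9"); // bool(true)
var_dump($back === $latin1);       // bool(true)
```

Both replacements name the source and target encodings explicitly, which
is what the old function names fail to do.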
B) Introduce new names, such as utf8_to_iso_8859_1 and
iso_8859_1_to_utf8; immediately make those the primary names in the
manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
notices for the old names, either immediately or in some future release.
This gives a smoother upgrade path, but commits us to having these
functions as outliers in our standard library.
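As a rough idea of how the names in option B would read at a call site,
here is a hypothetical userland stand-in. These functions do not exist
in PHP; the bodies assume ext/mbstring and are not the proposed
implementation.

```php
<?php
// Hypothetical stand-ins for the names floated in option B.
if (!function_exists('iso_8859_1_to_utf8')) {
    function iso_8859_1_to_utf8(string $latin1): string
    {
        return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
    }
}

if (!function_exists('utf8_to_iso_8859_1')) {
    function utf8_to_iso_8859_1(string $utf8): string
    {
        return mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    }
}

// "ü" in ISO 8859-1 is the single byte FC.
$encoded = iso_8859_1_to_utf8("\xFC");
echo bin2hex($encoded), "\n";                     // c3bc
echo bin2hex(utf8_to_iso_8859_1($encoded)), "\n"; // fc
```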
C) Leave them alone forever. Treat it as the user's fault if they mess
things up by misunderstanding them.
I am happy to put together an RFC for either A or B, if it has a chance
of reaching consensus. I would really like to avoid option C.
[1] https://externals.io/message/95166
[2] https://github.com/php/php-src/pull/2160
[3] https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238
Regards,
--
Rowan Tommins
[IMSoP]
--
PHP Internals - PHP Runtime Development Mailing List