Hi all,

The functions utf8_encode and utf8_decode are historical oddities, which almost certainly would not be accepted if proposed today:

* Their names do not describe their functionality, which is to convert to/from one specific single-byte encoding. This leads to a common confusion that they can be used to "fix" UTF-8 encoding problems, which they generally make worse.
* That single-byte encoding is ISO 8859-1, not its common cousins Windows-1252 or ISO 8859-15. This means, for instance, that they do not handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable), not "\x80" (Windows-1252) or "\xA4" (8859-15).
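
To make both points concrete, here is a rough sketch. It uses mb_convert_encoding() as a stand-in for utf8_encode / utf8_decode so that the three target encodings can be compared side by side. It assumes the mbstring extension is loaded with its default substitute character ('?'); the variable names are purely illustrative.

```php
<?php
// The Euro sign (U+20AC) as UTF-8 bytes, written out explicitly so the
// example does not depend on the encoding of this source file.
$euro = "\xE2\x82\xAC";

// ISO 8859-1 has no Euro sign, so the character is unmappable and is
// replaced by mbstring's substitute character (assumed to be the default "?").
$latin1 = mb_convert_encoding($euro, 'ISO-8859-1', 'UTF-8');

// The two "cousin" encodings do have a Euro sign, at different byte values.
$cp1252 = mb_convert_encoding($euro, 'Windows-1252', 'UTF-8');
$latin9 = mb_convert_encoding($euro, 'ISO-8859-15', 'UTF-8');

// Treating text that is already UTF-8 as ISO 8859-1 and converting it
// "to UTF-8" again double-encodes it: "é" (C3 A9) becomes "Ã©" (C3 83 C2 A9).
$eAcute = "\xC3\xA9";
$double = mb_convert_encoding($eAcute, 'UTF-8', 'ISO-8859-1');

var_dump($latin1);          // string(1) "?"
var_dump(bin2hex($cp1252)); // string(2) "80"
var_dump(bin2hex($latin9)); // string(2) "a4"
var_dump(bin2hex($double)); // string(8) "c383c2a9"
```

The last case is the "fix" that people reach for by name, and it shows why the result is generally worse than the input.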

On the other hand, they are commonly used, both correctly and incorrectly, so removing them is not easy.

A previous proposal to remove them [1] resulted in Andrea making two significant improvements: moving them from ext/xml to ext/standard [2] and rewriting the documentation to explain them properly [3]. My genuine thanks for that.

However, it hasn't stopped people misunderstanding them, and quite reasonably: you shouldn't need to look up every function you use in the manual to make sure it actually does what its name suggests.


I can see three ways forward:

A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide a specific replacement, but recommend people look at iconv() or mb_convert_encoding(). There is precedent for this, such as convert_cyr_string(), but it may frustrate those who are using the functions correctly.

B) Introduce new names, such as utf8_to_iso_8859_1 and iso_8859_1_to_utf8; immediately make those the primary names in the manual, with utf8_encode / utf8_decode as aliases. Raise deprecation notices for the old names, either immediately or in some future release. This gives a smoother upgrade path, but commits us to having these functions as outliers in our standard library.

C) Leave them alone forever. Treat it as the user's fault if they mess things up by misunderstanding them.
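
For illustration only, here is roughly what code would look like under options A and B. The two function names are the ones floated in option B and do not exist in PHP today; they are defined in userland here, on top of mb_convert_encoding() as the kind of replacement option A would recommend. This assumes mbstring is available, and is not claimed to match utf8_encode / utf8_decode byte for byte on malformed input.

```php
<?php
// Hypothetical userland versions of the names suggested in option B.
// Each one spells out both encodings, so the call site documents itself.
function utf8_to_iso_8859_1(string $string): string
{
    // Option A style replacement: an explicit conversion call.
    return mb_convert_encoding($string, 'ISO-8859-1', 'UTF-8');
}

function iso_8859_1_to_utf8(string $string): string
{
    return mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
}

// "café" with é as the single ISO 8859-1 byte E9.
$latin1 = "caf\xE9";

$utf8 = iso_8859_1_to_utf8($latin1);
$back = utf8_to_iso_8859_1($utf8);

var_dump(bin2hex($utf8));    // string(10) "636166c3a9"
var_dump($back === $latin1); // bool(true)
```

Either way, the reader of the calling code can see which encodings are involved without consulting the manual.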


I am happy to put together an RFC for either A or B, if it has a chance of reaching consensus. I would really like to avoid option C.


[1] https://externals.io/message/95166
[2] https://github.com/php/php-src/pull/2160
[3] https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238

Regards,

--
Rowan Tommins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
