On Tue, Dec 21, 2021 at 3:21 PM Wade Rossmann <wrossm...@gmail.com> wrote:
> On Sun, Mar 21, 2021 at 9:52 AM Larry Garfield <la...@garfieldtech.com> wrote:
>
> > On Sun, Mar 21, 2021, at 9:18 AM, Rowan Tommins wrote:
> > > Hi all,
> > >
> > > The functions utf8_encode and utf8_decode are historical oddities, which
> > > almost certainly would not be accepted if proposed today:
> > >
> > > * Their names do not describe their functionality, which is to convert
> > >   to/from one specific single-byte encoding. This leads to a common
> > >   confusion that they can be used to "fix" UTF-8 encoding problems, which
> > >   they generally make worse.
> > > * That single-byte encoding is ISO 8859-1, not its common cousins
> > >   Windows-1252 or ISO 8859-15. This means, for instance, that they do not
> > >   handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable)
> > >   not "\x80" (Windows-1252) or "\xA4" (8859-15)
> > >
> > > On the other hand, they are commonly used, both correctly and
> > > incorrectly, so removing them is not easy.
> > >
> > > A previous proposal to remove them [1] resulted in Andrea making two
> > > significant improvements: moving them from ext/xml to ext/standard [2]
> > > and rewriting the documentation to explain them properly [3]. My genuine
> > > thanks for that.
> > >
> > > However, it hasn't stopped people misunderstanding them, and quite
> > > reasonably: you shouldn't need to look up every function you use in the
> > > manual, to make sure it actually does what its name suggests.
> > >
> > >
> > > I can see three ways forward:
> > >
> > > A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
> > > a specific replacement, but recommend people look at iconv() or
> > > mb_convert_encoding(). There is precedent for this, such as
> > > convert_cyr_string(), but it may frustrate those who are using the
> > > functions correctly.
> > >
> > > B) Introduce new names, such as utf8_to_iso_8859_1 and
> > > iso_8859_1_to_utf8; immediately make those the primary names in the
> > > manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
> > > notices for the old names, either immediately or in some future release.
> > > This gives a smoother upgrade path, but commits us to having these
> > > functions as outliers in our standard library.
> > >
> > > C) Leave them alone forever. Treat it as the user's fault if they mess
> > > things up by misunderstanding them.
> > >
> > >
> > > I am happy to put together an RFC for either A or B, if it has a chance
> > > of reaching consensus. I would really like to avoid option C.
> > >
> > >
> > > [1] https://externals.io/message/95166
> > > [2] https://github.com/php/php-src/pull/2160
> > > [3] https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238
> > >
> > > Regards,
> >
> > I lost several days of my life to exactly this problem, many years ago. I
> > am still triggered by it.
> >
> > I am mostly OK with option A, but with a big caveat:
> >
> > The root problem here is "You keep using that function. I do not think it
> > means what you think it means."
> >
> > As Rowan notes, what people actually *want* most of the time is "I got
> > this string from a user and have NFI what its encoding is, but my system
> > needs UTF-8, so gimmie this string in UTF-8." And they use utf8_encode(),
> > which then fails *sometimes* in exciting and mysterious ways, because
> > that's not what it is.
> >
> > Removing utf8_encode() may keep people from misusing it, but that doesn't
> > mean the problem space they were trying to solve goes away. If anything,
> > people who still don't realize that it's the wrong solution will get angry
> > that we're taking away a "useful" tool and replacing it with "meh, go look
> > at library X," which is admittedly a pretty rude answer.
> >
> > If we're removing a bad answer to the problem, we should also replace it
> > with a good answer.
> >
> > Someone will, I'm sure, pop in at this point and declare "if you don't
> > know the character encoding you're receiving, then you're doing it wrong
> > and are already lost and we can't help you." While that may be technically
> > correct, it's also an entirely useless answer because strings received over
> > HTTP very frequently do not tell you what their encoding is, or they lie
> > about what their encoding is. (The header may say it's ISO8859, or UTF8,
> > or whatever, but someone copy-pasted from MS Word into a text box and now
> > it's Windows-1252 within a wrapper that says ISO8859 but is mostly UTF8
> > except for the Windows-1252 part. Like, that's literally the problem I
> > lost several days to.) "Your own fault" is not even an accurate answer at
> > that point.
> >
> > So if we're going to take away people's broken hammer, we need to be very
> > clear about what hammer to use instead.
> >
> > The initial answer is probably "here's how to use a series of mbstring
> > functions together to produce a reasonably good
> > guess-my-encoding-and-convert-to-utf8 routine" documentation. Which... may
> > exist, but if it does I've never found it. So at bare minimum the
> > utf8_encode() documentation needs to include a "use this code snippet
> > instead" description, and not just link to the mbstring extension.
> > Glancing through the mbstring docs right now, it looks like it's not
> > already a single function call, but some combination of several, and has
> > some global flags that get set (via mb_detect_order()), I think. It's not
> > as easy to use as utf8_encode(), even if utf8_encode() is wrong. That
> > suggests we may want to try and simplify the mbstring API, or internalize
> > some function that handles the most common case in a way that doesn't rely
> > on global flags.
> >
> > So, let's make that easier to use, so that we can change "this function is
> > wrong, we're taking it away from you" to "this function is wrong, here's a
> > way better alternative that you can use instead (while we quietly take the
> > wrong one away from you while you're distracted by the new shiny)."
> >
> > I don't know the mbstring API well enough to say what that alternative
> > ideally looks like, but if we can answer that it would make killing off the
> > old functions much more palatable.
> >
> > --Larry Garfield
> >
> > --
> > PHP Internals - PHP Runtime Development Mailing List
> > To unsubscribe, visit: https://www.php.net/unsub.php
> >
>
> As an encoding nerd and perennial complainer regarding these functions I
> would like nothing more than to see them immediately disappear, but I do
> recognize the BC-breaking potential for something like that. However, I do
> have a suggestion that I've not seen mentioned yet that should at least
> address some of the misconceptions that people get from the current
> functions.
>
> I would suggest adding optional source/destination encoding parameters to
> the functions, eg:
>
> utf8_encode(string $string, string $source_encoding = "ISO-8859-1")
> utf8_decode(string $string, string $destination_encoding = "ISO-8859-1")
>
> and, if you'll forgive the hand-waving due to my unfamiliarity with PHP
> internals, they could simply be passed through to an underlying
> mb_convert_encoding() call. Eg:
>
> mb_convert_encoding($string, 'UTF-8', $source_encoding)
> mb_convert_encoding($string, $destination_encoding, 'UTF-8')
>
> This would preserve BC while also making the function header and
> documentation much more descriptive of what the function actually does,
> allow more flexible use of the functions, and potentially drive people to
> use the mb_* functions instead.
> This could also be used as a gradual
> pathway to deprecating the functions, where, for example, a deprecation
> notice could be raised when the function is called without the
> source/destination encoding explicitly given.
>
> I know that there is also some resistance to the idea of requiring mbstring
> as it is an optional extension, as well as resistance to bringing mbstring
> into core due to design and/or history. This could be worked around by
> [once again, apology for handwaving] only requiring mbstring for
> conversions involving an encoding other than ISO-8859-1 and falling back to
> the existing implementation otherwise.
>

Now might be a good time to make this into an RFC. :)

--Kris
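
For anyone drafting that RFC, here is a rough userland sketch of the optional-parameter idea quoted above. It is only an illustration under stated assumptions: the names utf8_encode_from() and utf8_decode_to() are made up so they do not clash with the built-ins, mbstring is assumed to be loaded, and the "fall back to the existing implementation for ISO-8859-1" part of the suggestion is not modelled.

```php
<?php
// Hypothetical stand-ins for the proposed signatures. These names do not
// exist in PHP; they are invented here to avoid redeclaring the built-ins.
// Assumes the mbstring extension is available.

function utf8_encode_from(string $string, string $source_encoding = 'ISO-8859-1'): string
{
    // Convert from the given source encoding to UTF-8.
    return mb_convert_encoding($string, 'UTF-8', $source_encoding);
}

function utf8_decode_to(string $string, string $destination_encoding = 'ISO-8859-1'): string
{
    // Convert from UTF-8 to the given destination encoding.
    return mb_convert_encoding($string, $destination_encoding, 'UTF-8');
}

// The Euro sign example from the original post, as raw UTF-8 bytes.
$euro = "\xE2\x82\xAC";

echo bin2hex(utf8_decode_to($euro, 'Windows-1252')), "\n"; // 80
echo bin2hex(utf8_decode_to($euro, 'ISO-8859-15')), "\n";  // a4

// With the defaults, the behaviour mirrors the ISO 8859-1 direction:
echo bin2hex(utf8_encode_from("\xE9")), "\n";              // c3a9
```

Spelling the encoding out in the signature is the whole point: the call site then says which single-byte encoding is involved, instead of implying that the function "fixes" UTF-8.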