Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-12-22 Thread Rowan Tommins

On 22/12/2021 14:45, Hans Henrik Bergan wrote:

I wonder if anyone depends on utf8_* without also depending on mb_* ? I
imagine that is exceedingly rare



On the contrary, anyone who uses mb_* functions is likely to use 
mb_convert_encoding rather than utf8_encode and utf8_decode.


In fact, the only legitimate uses of the functions I've seen are as a 
fallback for when ext/mbstring is not loaded, since they are always 
available (since PHP 7.2; before that, they were oddly part of ext/xml). 
There is a very small set of use cases where you really do know you have 
or want ISO 8859-1, and they are the most portable implementation.


Regards,

--
Rowan Tommins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-12-22 Thread Hans Henrik Bergan
I wonder if anyone depends on utf8_* without also depending on mb_* ? I
imagine that is exceedingly rare

On Wed, Dec 22, 2021, 15:26 Rowan Tommins  wrote:

> On 22/12/2021 10:45, Andreas Heigl wrote:
> > I just dug a bit deeper on the subject and found this RFC from 2016:
> >
> > https://wiki.php.net/rfc/remove_utf_8_decode_encode
> >
> > Perhaps we can just revive that one!
>
>
> As I say, I have a draft with lots more detail in, which I will tidy up
> after Christmas. I deliberately didn't link to it, because I want to
> re-read it myself before letting other people comment on it, and don't
> have the time right now.
>
> My current inclination is to deprecate in 8.next, and remove in 9.0, but
> I want to make sure the argument for that is solid before putting it to
> a vote.
>
> Regards,
>
> --
> Rowan Tommins
> [IMSoP]


Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-12-22 Thread Rowan Tommins

On 22/12/2021 10:45, Andreas Heigl wrote:

I just dug a bit deeper on the subject and found this RFC from 2016:

https://wiki.php.net/rfc/remove_utf_8_decode_encode

Perhaps we can just revive that one! 



As I say, I have a draft with lots more detail in, which I will tidy up 
after Christmas. I deliberately didn't link to it, because I want to 
re-read it myself before letting other people comment on it, and don't 
have the time right now.


My current inclination is to deprecate in 8.next, and remove in 9.0, but 
I want to make sure the argument for that is solid before putting it to 
a vote.


Regards,

--
Rowan Tommins
[IMSoP]




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-12-22 Thread Andreas Heigl

Hey All.

On 22.12.21 10:08, Andreas Heigl wrote:

Hey all.

On 22.12.21 10:00, Rowan Tommins wrote:

[...]


On 22/12/2021 00:31, Kris Craig wrote:

Now might be a good time to make this into an RFC.  :)



I have a draft kicking around with a lot of analysis of current usage. 
I will try to pick it back up after Christmas.


I just dug a bit deeper on the subject and found this RFC from 2016:

https://wiki.php.net/rfc/remove_utf_8_decode_encode

Perhaps we can just revive that one!

Cheers

Andreas
--
  ,,,
 (o o)
+-ooO-(_)-Ooo-+
| Andreas Heigl   |
| mailto:andr...@heigl.org  N 50°22'59.5" E 08°23'58" |
| https://andreas.heigl.org   |
+-+
| https://hei.gl/appointmentwithandreas   |
+-+




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-12-22 Thread Andreas Heigl

Hey all.

On 22.12.21 10:00, Rowan Tommins wrote:

On 21/12/2021 23:20, Wade Rossmann wrote:

I would suggest adding optional source/destination encoding parameters to
the functions, eg:

utf8_encode(string $string, string $source_encoding = "ISO-8859-1")
utf8_decode(string $string, string $destination_encoding = "ISO-8859-1")



That's an interesting idea, and definitely worth considering. In the 
much longer term, we could make the parameter mandatory rather than 
deprecating the entire function.


As you say, the challenge is how to implement the other encodings / what 
to do if ext/mbstring is not installed. It would be very tempting to 
support Windows-1252 directly, because it's just a few characters on top 
of the existing mappings, and is so commonly mistaken for ISO-8859-1. 
Anything else could then perhaps give a run-time error if ext/mbstring 
wasn't found.


On 22/12/2021 00:31, Kris Craig wrote:

Now might be a good time to make this into an RFC.  :)



I have a draft kicking around with a lot of analysis of current usage. I 
will try to pick it back up after Christmas.



Regards,

To be quite honest: Despite the huge outcry that might provoke: I'd 
rather remove them today than keep them or deprecate them. And I'd 
declare the removal as a bug-fix!


Due to the way those functions currently work, they have caused more 
harm than good. One has to know very explicitly what one is doing to use 
them in the right way. And when they worked as expected, that was most 
likely due to sheer luck rather than because someone knew what they were 
doing.


So giving those functions a continued lifetime either as an alias to 
mb_convert_encoding or by implementing the conversion to/from 
Windows-1252 would still leave people under the impression that it is a 
magic function.


I'd rather get rid of them and point people to the proper way 
of converting one character set to another, with all the possible 
mishaps that can occur.


Just my 0.02€

Cheers

Andreas
--
  ,,,
 (o o)
+-ooO-(_)-Ooo-+
| Andreas Heigl   |
| mailto:andr...@heigl.org  N 50°22'59.5" E 08°23'58" |
| https://andreas.heigl.org   |
+-+
| https://hei.gl/appointmentwithandreas   |
+-+




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-12-22 Thread Rowan Tommins

On 21/12/2021 23:20, Wade Rossmann wrote:

I would suggest adding optional source/destination encoding parameters to
the functions, eg:

utf8_encode(string $string, string $source_encoding = "ISO-8859-1")
utf8_decode(string $string, string $destination_encoding = "ISO-8859-1")



That's an interesting idea, and definitely worth considering. In the 
much longer term, we could make the parameter mandatory rather than 
deprecating the entire function.


As you say, the challenge is how to implement the other encodings / what 
to do if ext/mbstring is not installed. It would be very tempting to 
support Windows-1252 directly, because it's just a few characters on top 
of the existing mappings, and is so commonly mistaken for ISO-8859-1. 
Anything else could then perhaps give a run-time error if ext/mbstring 
wasn't found.
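
To illustrate the "few characters on top of the existing mappings" point, here is a minimal sketch in pure PHP. The function name windows1252_to_utf8_sketch() and the constant CP1252_EXTRAS_SKETCH are hypothetical, made up for this example, and the table is deliberately partial: it lists only a handful of the Windows-1252 bytes in the 0x80-0x9F range. Any byte not in the table falls back to the plain ISO 8859-1 mapping.

```php
<?php
// Partial table, for illustration only: a few Windows-1252 bytes from the
// 0x80-0x9F range, mapped to the UTF-8 encoding of their code points.
// A real implementation would need the complete table.
const CP1252_EXTRAS_SKETCH = [
    0x80 => "\xE2\x82\xAC", // U+20AC EURO SIGN
    0x85 => "\xE2\x80\xA6", // U+2026 HORIZONTAL ELLIPSIS
    0x91 => "\xE2\x80\x98", // U+2018 LEFT SINGLE QUOTATION MARK
    0x92 => "\xE2\x80\x99", // U+2019 RIGHT SINGLE QUOTATION MARK
    0x93 => "\xE2\x80\x9C", // U+201C LEFT DOUBLE QUOTATION MARK
    0x94 => "\xE2\x80\x9D", // U+201D RIGHT DOUBLE QUOTATION MARK
];

// Hypothetical helper, not a PHP built-in.
function windows1252_to_utf8_sketch(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $byte = ord($bytes[$i]);
        if ($byte < 0x80) {
            // ASCII is unchanged.
            $out .= $bytes[$i];
        } elseif (array_key_exists($byte, CP1252_EXTRAS_SKETCH)) {
            // The extra characters Windows-1252 adds.
            $out .= CP1252_EXTRAS_SKETCH[$byte];
        } else {
            // Everything else uses the ISO 8859-1 mapping: the byte value
            // is the code point, encoded as two UTF-8 bytes.
            $out .= chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        }
    }
    return $out;
}

echo bin2hex(windows1252_to_utf8_sketch("\x80 5")), "\n"; // e282ac2035
```

Only the middle branch differs from a straight Latin-1 conversion, which is why supporting this one extra encoding without ext/mbstring looks cheap.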



On 22/12/2021 00:31, Kris Craig wrote:

Now might be a good time to make this into an RFC.  :)



I have a draft kicking around with a lot of analysis of current usage. I 
will try to pick it back up after Christmas.



Regards,

--
Rowan Tommins
[IMSoP]




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-12-21 Thread Kris Craig
On Tue, Dec 21, 2021 at 3:21 PM Wade Rossmann  wrote:

> On Sun, Mar 21, 2021 at 9:52 AM Larry Garfield 
> wrote:
>
> > On Sun, Mar 21, 2021, at 9:18 AM, Rowan Tommins wrote:
> > > Hi all,
> > >
> > > The functions utf8_encode and utf8_decode are historical oddities,
> which
> > > almost certainly would not be accepted if proposed today:
> > >
> > > * Their names do not describe their functionality, which is to convert
> > > to/from one specific single-byte encoding. This leads to a common
> > > confusion that they can be used to "fix" UTF-8 encoding problems, which
> > > they generally make worse.
> > > * That single-byte encoding is ISO 8859-1, not its common cousins
> > > Windows-1252 or ISO 8859-15. This means, for instance, that they do
> > > not handle the Euro sign: utf8_decode('€') returns '?' (i.e.
> > > unmappable), not "\x80" (Windows-1252) or "\xA4" (8859-15).
> > >
> > > On the other hand, they are commonly used, both correctly and
> > > incorrectly, so removing them is not easy.
> > >
> > > A previous proposal to remove them [1] resulted in Andrea making two
> > > significant improvements: moving them from ext/xml to ext/standard [2]
> > > and rewriting the documentation to explain them properly [3]. My
> genuine
> > > thanks for that.
> > >
> > > However, it hasn't stopped people misunderstanding them, and quite
> > > reasonably: you shouldn't need to look up every function you use in the
> > > manual, to make sure it actually does what its name suggests.
> > >
> > >
> > > I can see three ways forward:
> > >
> > > A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
> > > a specific replacement, but recommend people look at iconv() or
> > > mb_convert_encoding(). There is precedent for this, such as
> > > convert_cyr_string(), but it may frustrate those who are using the
> > > functions correctly.
> > >
> > > B) Introduce new names, such as utf8_to_iso_8859_1 and
> > > iso_8859_1_to_utf8; immediately make those the primary names in the
> > > manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
> > > notices for the old names, either immediately or in some future
> release.
> > > This gives a smoother upgrade path, but commits us to having these
> > > functions as outliers in our standard library.
> > >
> > > C) Leave them alone forever. Treat it as the user's fault if they mess
> > > things up by misunderstanding them.
> > >
> > >
> > > I am happy to put together an RFC for either A or B, if it has a chance
> > > of reaching consensus. I would really like to avoid option C.
> > >
> > >
> > > [1] https://externals.io/message/95166
> > > [2] https://github.com/php/php-src/pull/2160
> > > [3]
> > >
> >
> https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238
> > >
> > > Regards,
> >
> > I lost several days of my life to exactly this problem, many years ago.
> I
> > am still triggered by it.
> >
> > I am mostly OK with option A, but with a big caveat:
> >
> > The root problem here is "You keep using that function.  I do not think
> it
> > means what you think it means."
> >
> > As Rowan notes, what people actually *want* most of the time is "I got
> > this string from a user and have NFI what its encoding is, but my system
> > needs UTF-8, so gimmie this string in UTF-8."  And they use
> > utf8_encode(), which then fails *sometimes* in exciting and mysterious
> > ways, because that's not what it is.
> >
> > Removing utf8_encode() may keep people from misusing it, but that doesn't
> > mean the problem space they were trying to solve goes away.  If anything,
> > people who still don't realize that it's the wrong solution will get
> angry
> > that we're taking away a "useful" tool and replacing it with "meh, go
> look
> > at library X," which is admittedly a pretty rude answer.
> >
> > If we're removing a bad answer to the problem, we should also replace it
> > with a good answer.
> >
> > Someone will, I'm sure, pop in at this point and declare "if you don't
> > know the character encoding you're receiving, then you're doing it wrong
> > and are already lost and we can't help you."  While that may be
> technically
> > correct, it's also an entirely useless answer because strings received
> over
> > HTTP very frequently do not tell you what their encoding is, or they lie
> > about what their encoding is.  (The header may say it's ISO8859, or UTF8,
> > or whatever, but someone copy-pasted from MS Word into a text box and now
> > it's Windows-1252 within a wrapper that says ISO8859 but is mostly UTF8
> > except for the Windows-1252 part.  Like, that's literally the problem I
> > lost several days to.)  "Your own fault" is not even an accurate answer
> at
> > that point.
> >
> > So if we're going to take away people's broken hammer, we need to be very
> > clear about what hammer to use instead.
> >
> > The initial answer is probably "here's how to use a series of mb_string
> > functions together to produce a reasonably good
> > 

Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-12-21 Thread Wade Rossmann
On Sun, Mar 21, 2021 at 9:52 AM Larry Garfield 
wrote:

> On Sun, Mar 21, 2021, at 9:18 AM, Rowan Tommins wrote:
> > Hi all,
> >
> > The functions utf8_encode and utf8_decode are historical oddities, which
> > almost certainly would not be accepted if proposed today:
> >
> > * Their names do not describe their functionality, which is to convert
> > to/from one specific single-byte encoding. This leads to a common
> > confusion that they can be used to "fix" UTF-8 encoding problems, which
> > they generally make worse.
> > * That single-byte encoding is ISO 8859-1, not its common cousins
> > Windows-1252 or ISO 8859-15. This means, for instance, that they do not
> > handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable)
> > not "\x80" (Windows-1252) or "\xA4" (8859-15)
> >
> > On the other hand, they are commonly used, both correctly and
> > incorrectly, so removing them is not easy.
> >
> > A previous proposal to remove them [1] resulted in Andrea making two
> > significant improvements: moving them from ext/xml to ext/standard [2]
> > and rewriting the documentation to explain them properly [3]. My genuine
> > thanks for that.
> >
> > However, it hasn't stopped people misunderstanding them, and quite
> > reasonably: you shouldn't need to look up every function you use in the
> > manual, to make sure it actually does what its name suggests.
> >
> >
> > I can see three ways forward:
> >
> > A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
> > a specific replacement, but recommend people look at iconv() or
> > mb_convert_encoding(). There is precedent for this, such as
> > convert_cyr_string(), but it may frustrate those who are using the
> > functions correctly.
> >
> > B) Introduce new names, such as utf8_to_iso_8859_1 and
> > iso_8859_1_to_utf8; immediately make those the primary names in the
> > manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
> > notices for the old names, either immediately or in some future release.
> > This gives a smoother upgrade path, but commits us to having these
> > functions as outliers in our standard library.
> >
> > C) Leave them alone forever. Treat it as the user's fault if they mess
> > things up by misunderstanding them.
> >
> >
> > I am happy to put together an RFC for either A or B, if it has a chance
> > of reaching consensus. I would really like to avoid option C.
> >
> >
> > [1] https://externals.io/message/95166
> > [2] https://github.com/php/php-src/pull/2160
> > [3]
> >
> https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238
> >
> > Regards,
>
> I lost several days of my life to exactly this problem, many years ago.  I
> am still triggered by it.
>
> I am mostly OK with option A, but with a big caveat:
>
> The root problem here is "You keep using that function.  I do not think it
> means what you think it means."
>
> As Rowan notes, what people actually *want* most of the time is "I got
> this string from a user and have NFI what its encoding is, but my system
> needs UTF-8, so gimmie this string in UTF-8."  And they use utf8_encode(),
> which then fails *sometimes* in exciting and mysterious ways, because
> that's not what it is.
>
> Removing utf8_encode() may keep people from misusing it, but that doesn't
> mean the problem space they were trying to solve goes away.  If anything,
> people who still don't realize that it's the wrong solution will get angry
> that we're taking away a "useful" tool and replacing it with "meh, go look
> at library X," which is admittedly a pretty rude answer.
>
> If we're removing a bad answer to the problem, we should also replace it
> with a good answer.
>
> Someone will, I'm sure, pop in at this point and declare "if you don't
> know the character encoding you're receiving, then you're doing it wrong
> and are already lost and we can't help you."  While that may be technically
> correct, it's also an entirely useless answer because strings received over
> HTTP very frequently do not tell you what their encoding is, or they lie
> about what their encoding is.  (The header may say it's ISO8859, or UTF8,
> or whatever, but someone copy-pasted from MS Word into a text box and now
> it's Windows-1252 within a wrapper that says ISO8859 but is mostly UTF8
> except for the Windows-1252 part.  Like, that's literally the problem I
> lost several days to.)  "Your own fault" is not even an accurate answer at
> that point.
>
> So if we're going to take away people's broken hammer, we need to be very
> clear about what hammer to use instead.
>
> The initial answer is probably "here's how to use a series of mb_string
> functions together to produce a reasonably good
> guess-my-encoding-and-convert-to-utf8 routine" documentation.  Which... may
> exist, but if it does I've never found it.  So at bare minimum the
> utf8_encode() documentation needs to include a "use this code snippet
> instead" description, and not just link to the mbstring extension.
> Glancing 

Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Sara Golemon
On Mon, Mar 22, 2021 at 10:04 AM Aleksander Machniak  wrote:

> $str = "グーグル谷歌中信фδοκιμήóźdźрöß";
>
> $this->assertSame($str, utf8_decode(utf8_encode($str)));
>
>
Woah. Yeah. No.  Don't do that.
Doing that is what's wrong with utf8_en/decode().
Doing that convinces me that Rowan is right and we should deprecate then
remove those functions without offering a simple replacement.
Christ's sake... no.


Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Rowan Tommins

On 22/03/2021 18:18, Chase Peeler wrote:


Even if it is by accident, removing or changing the behavior of the 
function is guaranteed to take something that currently works (by 
skill or by luck) and risk it no longer working.



This is absolutely true. However, at some point you have to draw the 
line between supported use cases, and requests to re-enable spacebar 
heating: https://xkcd.com/1172/


I think using utf8_encode to store binary data in a text column crosses 
that line: the code was added because of a misunderstanding of the 
function, it works by accident, and there are plenty of better ways to 
solve the actual problem.


Just to be clear, the trick Aleksander and Alexandru stumbled on doesn't 
just work for "corrupted UTF-8"; you could store a JPEG in a text column 
by using utf8_encode(file_get_contents($image_file)). It's probably best 
not to, though.


I *also* agree that users should have a clear guide to how to replace 
their current usages. Fortunately, there are at least 4 other ways of 
writing this functionality in PHP (iconv, mbstring, intl, and the 
Symfony polyfill).
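
As a rough sketch of what those replacements look like for the Latin-1 to UTF-8 direction: the snippet below assumes nothing about which extensions are loaded and simply tries each one that is available. The encoding names 'ISO-8859-1' and 'UTF-8' are the ones I would expect all three to accept; the Symfony polyfill is not shown, since it provides the original function names rather than a new call.

```php
<?php
$latin1   = "caf\xE9";     // "café" encoded as ISO 8859-1
$expected = "caf\xC3\xA9"; // the same text encoded as UTF-8

$results = [];

// ext/iconv: source encoding first, then target encoding.
if (function_exists('iconv')) {
    $results['iconv'] = iconv('ISO-8859-1', 'UTF-8', $latin1);
}

// ext/mbstring: target encoding first, then source encoding.
if (function_exists('mb_convert_encoding')) {
    $results['mbstring'] = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}

// ext/intl: target encoding first, then source encoding.
if (class_exists('UConverter')) {
    $results['intl'] = UConverter::transcode($latin1, 'UTF-8', 'ISO-8859-1');
}

foreach ($results as $extension => $utf8) {
    echo $extension, ': ', bin2hex($utf8), "\n";
}
```

Note the argument order differs between iconv() and the other two, which is an easy thing to get wrong when migrating.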


Regards,

--
Rowan Tommins
[IMSoP]




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Chase Peeler
On Mon, Mar 22, 2021 at 1:22 PM Rowan Tommins 
wrote:

> On 22/03/2021 16:52, Aleksander Machniak wrote:
> > On 22.03.2021 16:41, Rowan Tommins wrote:
> >> That code will never do anything useful.
> > I already proved it is useful, regardless of its name/intention.
>
>
> You have proven no such thing. If that function is saving you from
> errors, it is completely by accident.
>
>
Even if it is by accident, removing or changing the behavior of the
function is guaranteed to take something that currently works (by skill or
by luck) and risk it no longer working.


> The same effect can be achieved using base64_encode() and
> base64_decode(), or bin2hex() and hex2bin(), or any other function that
> takes a series of bytes and applies an arbitrary encoding to it.
>
> It could also be achieved by using a binary column type in the database,
> because the values you have stored are not useful as strings; they might
> as well be encrypted.
>
> Given the sequence of bytes "\xE3\x82\xB0", which is a valid UTF-8
> string representing U+30B0 KATAKANA LETTER GU グ calling utf8_encode()
> will result in the sequence of bytes "\xC3\xA3\xC2\x82\xC2\xB0", which
> is the UTF-8 representation of the following Unicode code points:
>
> - U+00E3 LATIN SMALL LETTER A WITH TILDE ã
> - U+0082 CONTROL: BREAK PERMITTED HERE
> - U+00B0 DEGREE SIGN °
>
> This is clearly gibberish, and bears no relationship to the original
> string; it is what is generally referred to as "mojibake".
>
> Regards,
>
> --
> Rowan Tommins
> [IMSoP]

-- 
Chase Peeler
chasepee...@gmail.com


Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Rowan Tommins

On 22/03/2021 17:38, Alexandru Pătrănescu wrote:

As Rowan mentioned, base64_encode would have worked. But that means one
quarter of the available max column space would be lost as a downside.



Depending on the data, abusing Latin1-to-UTF8 translation can easily 
result in a longer string than base64.



$str = '嵐嵐';

echo strlen(base64_encode($str));
// 8

echo strlen(utf8_encode($str));
// 12


Regards,

--
Rowan Tommins
[IMSoP]




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Alexandru Pătrănescu
On Mon, Mar 22, 2021 at 7:24 PM Alexandru Pătrănescu 
wrote:

>
> There could have been better ways to fix it.
> json_encode / json_decode would have worked just the same.
>
> Nope, strings in a json object must be UTF-8.
As Rowan mentioned, base64_encode would have worked. But that means one
quarter of the available max column space would be lost as a downside.

>
> Regards,
> Alex
>


Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Alexandru Pătrănescu
On Mon, Mar 22, 2021 at 6:52 PM Aleksander Machniak  wrote:

> On 22.03.2021 16:41, Rowan Tommins wrote:
> > That code will never do anything useful.
>
> I already proved it is useful, regardless of its name/intention.
>
> This is old code, not even mine, so maybe when it's been written the PHP
> documentation wasn't that clear about the function(s) intention. Or the
> intention was different.
>
> ps. to Kamil,
>
> We use utf8_encode() to make the string safe to be put in utf-8 database
> column/table. We use utf8_decode() to convert that back to what it was
> before.
>

I just searched and found a hotfix I did a few years ago (when I was also
dumber) and the fix was just adding a utf8_encode to some data received in
$_POST before being sent to a logging service. And a utf8_decode after
reading it for further parsing.
The logging service storage was using a mysql database and the specific
column was declared `TEXT` instead of `BLOB`.
Apparently the fix is still in place.


>
> The tests prove that the conversion is lossless.
>

There could have been better ways to fix it.
json_encode / json_decode would have worked just the same.

The problem was that the quickly identified cause was a non-utf8 string
trying to be stored in an utf8 text column and the solution was implemented
based on the fact that utf8_decode/encode sounded like a good idea when
time is limited; and also knowledge in my case.
I think it would be great to deprecate them somehow.

Regards,
Alex


Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Rowan Tommins

On 22/03/2021 16:52, Aleksander Machniak wrote:

On 22.03.2021 16:41, Rowan Tommins wrote:

That code will never do anything useful.

I already proved it is useful, regardless of its name/intention.



You have proven no such thing. If that function is saving you from 
errors, it is completely by accident.


The same effect can be achieved using base64_encode() and 
base64_decode(), or bin2hex() and hex2bin(), or any other function that 
takes a series of bytes and applies an arbitrary encoding to it.


It could also be achieved by using a binary column type in the database, 
because the values you have stored are not useful as strings; they might 
as well be encrypted.


Given the sequence of bytes "\xE3\x82\xB0", which is a valid UTF-8 
string representing U+30B0 KATAKANA LETTER GU グ, calling utf8_encode() 
will result in the sequence of bytes "\xC3\xA3\xC2\x82\xC2\xB0", which 
is the UTF-8 representation of the following Unicode code points:


- U+00E3 LATIN SMALL LETTER A WITH TILDE ã
- U+0082 CONTROL: BREAK PERMITTED HERE
- U+00B0 DEGREE SIGN °

This is clearly gibberish, and bears no relationship to the original 
string; it is what is generally referred to as "mojibake".
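
The byte sequences above can be reproduced without calling utf8_encode() at all. The following is a minimal sketch; latin1_to_utf8_sketch() is a hypothetical helper written for this illustration, not a PHP built-in, and it assumes the input is read one byte at a time as ISO 8859-1, which is what utf8_encode() is documented to do.

```php
<?php
// Hypothetical helper, not a PHP built-in: treat every input byte as one
// ISO 8859-1 character and emit that character's UTF-8 encoding.
function latin1_to_utf8_sketch(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $byte = ord($bytes[$i]);
        if ($byte < 0x80) {
            // ASCII is identical in Latin-1 and UTF-8.
            $out .= chr($byte);
        } else {
            // U+0080..U+00FF always takes two bytes in UTF-8.
            $out .= chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        }
    }
    return $out;
}

$gu       = "\xE3\x82\xB0"; // U+30B0 KATAKANA LETTER GU, already valid UTF-8
$mojibake = latin1_to_utf8_sketch($gu);

echo bin2hex($gu), "\n";       // e382b0
echo bin2hex($mojibake), "\n"; // c3a3c282c2b0
```

Nothing in the function looks at whether the input was already UTF-8, so every high byte doubles in size and the original three-byte character becomes three unrelated two-byte characters.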


Regards,

--
Rowan Tommins
[IMSoP]




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Aleksander Machniak
On 22.03.2021 16:41, Rowan Tommins wrote:
> That code will never do anything useful.

I already proved it is useful, regardless of its name/intention.

This is old code, not even mine, so maybe when it's been written the PHP
documentation wasn't that clear about the function(s) intention. Or the
intention was different.

ps. to Kamil,

We use utf8_encode() to make the string safe to be put in utf-8 database
column/table. We use utf8_decode() to convert that back to what it was
before.

The tests prove that the conversion is lossless.

-- 
Aleksander Machniak
Kolab Groupware Developer[https://kolab.org]
Roundcube Webmail Developer  [https://roundcube.net]

PGP: 19359DC1 # Blog: https://kolabian.wordpress.com




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Rowan Tommins

On 22/03/2021 15:04, Aleksander Machniak wrote:


I'm using utf8_encode()/utf8_decode() to make input string safe to be
stored in DB, and back. In most cases the input is utf-8, but it
occasionally may contain "broken characters".



That is not what this function does, at all. The fact that its name 
makes you think that is exactly why I want to get rid of that name.




 $str = "グーグル谷歌中信фδοκιμήóźdźрöß";

 $this->assertSame($str, utf8_decode(utf8_encode($str)));



Let's write that out with a more descriptive function name:

$str = "グーグル谷歌中信фδοκιμήóźdźрöß";

$this->assertSame($str, utf8_to_latin1(latin1_to_utf8($str)));


Since Latin-1 does not contain any Chinese, Japanese, or Emoji 
characters, running latin1_to_utf8 on that string is clearly nonsensical.


The only reason it doesn't give you any errors is that every possible 
byte is a valid character in Latin1, and every Latin1 character has a 
Unicode code point. So the "グ" is interpreted as three Latin-1 
characters: E3, 82, and B0; those then become the corresponding Unicode 
code points U+00E3, U+0082, and U+00B0, represented in UTF-8. You then 
run utf8_to_latin1, and they get converted back.


That code will never do anything useful.

Regards,

--
Rowan Tommins
[IMSoP]




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Kamil Tekiela
>
> I'm using utf8_encode()/utf8_decode() to make input string safe to be
> stored in DB, and back. In most cases the input is utf-8, but it
> occasionally may contain "broken characters".
>

What exactly do you mean by making the input string safe? If I understand
correctly, utf8_decode(utf8_encode($str)) should just be an identity
function. Could you please explain the purpose of using these functions
in such a way?


Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Aleksander Machniak
On 22.03.2021 15:30, Rowan Tommins wrote:
> - Make utf8_decode() throw errors for unrepresentable characters.

I'm not sure I understand this, but it sounds like it would be a BC
break for my case.

I'm using utf8_encode()/utf8_decode() to make input string safe to be
stored in DB, and back. In most cases the input is utf-8, but it
occasionally may contain "broken characters".

$str = '';
for ($x = 0; $x < 256; $x++) {
    $str .= chr($x);
}

$this->assertSame($str, utf8_decode(utf8_encode($str)));

$str = "グーグル谷歌中信фδοκιμήóźdźрöß";

$this->assertSame($str, utf8_decode(utf8_encode($str)));

Could anyone point to a sample input that will not work with my use-case?

-- 
Aleksander Machniak
Kolab Groupware Developer[https://kolab.org]
Roundcube Webmail Developer  [https://roundcube.net]

PGP: 19359DC1 # Blog: https://kolabian.wordpress.com




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Rowan Tommins

On 22/03/2021 13:18, Nicolas Grekas wrote:


Shameless plug: the polyfill exists, without any dependency, see
https://github.com/symfony/polyfill-php72/blob/main/Php72.php 




Ah, thanks for sharing that. I realised while trying to get to sleep 
that a pure-PHP implementation would be fairly straight-forward because 
of the relationship between Latin1 and Unicode.


Regards,

--
Rowan Tommins
[IMSoP]




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Rowan Tommins

On 22/03/2021 13:10, Sara Golemon wrote:


> * People who just want to replace calls to utf8_decode won't want to go
> through every call and make it exception safe.
>

Then they shouldn't use these replacements, it's not for them. It's 
for people using iso-8859-1.



This is a non-sequitur. Someone using the function correctly to convert 
to ISO 8859-1 may also be relying on the documented and consistent 
error-handling behaviour. Substituting the character may not always be 
the best approach, but in some cases it's more useful than discarding 
the entire string, let alone aborting the entire process with an 
unhandled Throwable.
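
To make the difference concrete, here is a rough sketch of both behaviours in one function. utf8_to_latin1_sketch() and its $strict flag are hypothetical, invented for this illustration; the non-strict branch approximates the substitute-with-'?' behaviour, and the handling of malformed input is simplified, so it should not be read as an exact reimplementation of utf8_decode().

```php
<?php
// Hypothetical helper, not a PHP built-in: convert UTF-8 to ISO 8859-1.
// $strict = false substitutes '?' for anything Latin-1 cannot represent;
// $strict = true throws instead.
function utf8_to_latin1_sketch(string $utf8, bool $strict = false): string
{
    $out = '';
    $n = strlen($utf8);
    for ($i = 0; $i < $n; ) {
        $lead = ord($utf8[$i]);
        if ($lead < 0x80) {
            // ASCII passes through unchanged.
            $out .= $utf8[$i];
            $i++;
            continue;
        }
        // C2 and C3 are the lead bytes for U+0080..U+00FF.
        if (($lead === 0xC2 || $lead === 0xC3) && $i + 1 < $n
            && (ord($utf8[$i + 1]) & 0xC0) === 0x80) {
            $out .= chr((($lead & 0x03) << 6) | (ord($utf8[$i + 1]) & 0x3F));
            $i += 2;
            continue;
        }
        if ($strict) {
            throw new UnexpectedValueException(
                "Sequence at byte offset $i is not representable in ISO 8859-1"
            );
        }
        // Substitute, then skip the lead byte and its continuation bytes.
        $out .= '?';
        $i++;
        while ($i < $n && (ord($utf8[$i]) & 0xC0) === 0x80) {
            $i++;
        }
    }
    return $out;
}

// "café €5" in UTF-8: the Euro sign has no Latin-1 equivalent.
$input = "caf\xC3\xA9 \xE2\x82\xAC5";

echo bin2hex(utf8_to_latin1_sketch($input)), "\n"; // 636166e9203f35

try {
    utf8_to_latin1_sketch($input, true);
} catch (UnexpectedValueException $e) {
    echo "rejected: ", $e->getMessage(), "\n";
}
```

With substitution the caller still gets the rest of the string back; with the exception, every existing call site needs a try/catch to avoid losing the whole value.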



The goal is only to not punish users by taking away a valid API that 
they were using correctly (for those users who were using it correctly).



I'm sympathetic to that aim, but if the new function is not the same, 
you *are* taking away the existing API, and introducing a new one. 
Neither of the following seems like it would be accepted:


- Make utf8_decode() throw errors for unrepresentable characters.
- Introduce a function specifically for converting from UTF-8 to 
Latin-1, if we didn't already have one.


So it feels questionable to me to design a new function, which is 
neither compatible with what we have, nor a reasonable addition on its 
own merits.



Regards,

--
Rowan Tommins
[IMSoP]




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Björn Larsson

On 2021-03-22 at 14:10, Sara Golemon wrote:

> On Mon, Mar 22, 2021 at 5:24 AM Rowan Tommins wrote:
> > I'm strongly against any concept of "indefinite deprecation". I consider
> > any deprecation notice a commitment to remove the feature in the future,
> > even if a specific timeline for that removal is not given.
>
> I don't feel strongly about indefinite deprecation.  If you wanna nuke it
> in 9.0, have fun.  I'm just saying I don't necessarily see the need to do
> so.  The problem being addressed here is that *some* users of this function
> are probably misusing it, so it's worth putting guardrails on.  I'm
> hesitant to punish the ones who know exactly what they're doing as a result
> of that well-meaning intention.
>
> > * People who just want to replace calls to utf8_decode won't want to go
> > through every call and make it exception safe.
>
> Then they shouldn't use these replacements, it's not for them. It's for
> people using iso-8859-1.
>
> > * People who want to write a polyfill couldn't use it, because they
> > wouldn't be able to recover the remainder of the string after an error
> > is thrown.
>
> If you're writing a polyfill, then write a polyfill.   The polyfill for the
> old functions is trivial, I could have written it a dozen times in the
> course of writing this email reply.
> So this replacement is also not for them.
>
> > * People who want transcoding without any optional extensions will be
> > disappointed to find only this one encoding supported.
>
> This function isn't for them. It's for people using iso-8859-1.
>
> There's a theme in here. :)
>
> > You'd effectively be adding a completely new core function just for
> > those people who work with Latin1 text, and are confident that it's not
> > Windows-1252 in disguise.
>
> Yes.  I'm specifically addressing the people who have been using
> utf8_en/decode() correctly all this time.  They shouldn't be punished for
> the stupidity of others.
>
> > It's tempting to make any C1 control characters an error as well -
> > although technically valid in Latin1, these are very rarely used, and
> > it's much more likely that any bytes in that range are intended as
> > characters in Windows-1252. But that would feel very odd without having
> > a corresponding utf8_from_windows1252 function to use instead, at which
> > point we're into designing a whole new conversion library. And of
> > course, once you've got that UTF-8 string, you can't do much with it,
> > because PHP's native string functions are all byte-based, so you've
> > basically got to re-invent large chunks of ext/mbstring...
>
> I disagree that you'd need to add utf8_from/to_windows1252 "for
> completeness".  The goal isn't to provide all possible conversion
> utilities.  The goal is only to not punish users by taking away a valid API
> that they were using correctly (for those users who were using it
> correctly).
>
> -Sara




Think I'm one such user :-) So keeping them and improving a little would
be fine with me!

r//Björn L




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Nicolas Grekas
On Mon, 22 Mar 2021 at 14:14, Sara Golemon wrote:

> On Mon, Mar 22, 2021 at 6:12 AM Rowan Tommins 
> wrote:
> > I realise you can't speak for anyone else, but as a point of interest,
> > would you be OK with a polyfill having a requirement on ext/mbstring or
> > ext/iconv, or would you have a strong preference for a replacement built
> > into the core (i.e. guaranteed available without any optional packages)?
> >
>
> Can you clarify what *YOU* mean by a polyfill?  Because you're talking
> about dependence on iconv/mbstring/icu which implies you want a polyfill
> that does something other than what the original utf8_en/decode() functions
> do, and those functions certainly do not need external dependencies.
> They're really just not that complex.
>

Shameless plug: the polyfill exists, without any dependency, see
https://github.com/symfony/polyfill-php72/blob/main/Php72.php

;)
Nicolas


Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Sara Golemon
On Mon, Mar 22, 2021 at 6:12 AM Rowan Tommins 
wrote:
> I realise you can't speak for anyone else, but as a point of interest,
> would you be OK with a polyfill having a requirement on ext/mbstring or
> ext/iconv, or would you have a strong preference for a replacement built
> into the core (i.e. guaranteed available without any optional packages)?
>

Can you clarify what *YOU* mean by a polyfill?  Because you're talking
about dependence on iconv/mbstring/icu which implies you want a polyfill
that does something other than what the original utf8_en/decode() functions
do, and those functions certainly do not need external dependencies.
They're really just not that complex.
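
To make the "not that complex" point concrete, here is a minimal
dependency-free sketch of the two conversions. The function names are
hypothetical, the handling of malformed UTF-8 is deliberately simplified, and
this is neither the php-src implementation nor the Symfony polyfill:

```php
<?php
// Illustrative names only; a simplified sketch, not the php-src code.

/** ISO-8859-1 to UTF-8: every byte maps directly to U+0000..U+00FF. */
function latin1_to_utf8_sketch(string $latin1): string
{
    $out = '';
    $len = strlen($latin1);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($latin1[$i]);
        if ($byte < 0x80) {
            $out .= $latin1[$i];                  // ASCII is unchanged
        } else {
            $out .= chr(0xC0 | ($byte >> 6))      // lead byte: 0xC2 or 0xC3
                  . chr(0x80 | ($byte & 0x3F));   // continuation byte
        }
    }
    return $out;
}

/** UTF-8 to ISO-8859-1, substituting "?" for anything unrepresentable. */
function utf8_to_latin1_sketch(string $utf8): string
{
    // Alternatives, in order: an ASCII run; a 2-byte sequence for
    // U+0080..U+00FF; any other lead byte plus its continuation bytes;
    // a stray continuation byte.
    return (string) preg_replace_callback(
        '/[\x00-\x7F]+|[\xC2\xC3][\x80-\xBF]|[\xC0-\xFF][\x80-\xBF]*|[\x80-\xBF]/',
        static function (array $m): string {
            $s = $m[0];
            if (ord($s[0]) < 0x80) {
                return $s;
            }
            if (strlen($s) === 2 && ($s[0] === "\xC2" || $s[0] === "\xC3")) {
                return chr(((ord($s[0]) & 0x03) << 6) | (ord($s[1]) & 0x3F));
            }
            return '?';
        },
        $utf8
    );
}
```

With this sketch, "caf\xE9" becomes "caf\xC3\xA9" in one direction, and the
Euro sign ("\xE2\x82\xAC") becomes "?" in the other.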

-Sara


Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Sara Golemon
On Mon, Mar 22, 2021 at 5:24 AM Rowan Tommins 
wrote:
> I'm strongly against any concept of "indefinite deprecation". I consider
> any deprecation notice a commitment to remove the feature in the future,
> even if a specific timeline for that removal is not given.
>

I don't feel strongly about indefinite deprecation.  If you wanna nuke it
in 9.0, have fun.  I'm just saying I don't necessarily see the need to do
so.  The problem being addressed here is that *some* users of this function
are probably misusing it, so it's worth putting guardrails on.  I'm
hesitant to punish the ones who know exactly what they're doing as a result
of that well-meaning intention.

> * People who just want to replace calls to utf8_decode won't want to go
> through every call and make it exception safe.
>

Then they shouldn't use these replacements, it's not for them. It's for
people using iso-8859-1.

> * People who want to write a polyfill couldn't use it, because they
> wouldn't be able to recover the remainder of the string after an error
> is thrown.
>

If you're writing a polyfill, then write a polyfill.   The polyfill for the
old functions is trivial, I could have written it a dozen times in the
course of writing this email reply.
So this replacement is also not for them.

> * People who want transcoding without any optional extensions will be
> disappointed to find only this one encoding supported.
>
This function isn't for them. It's for people using iso-8859-1.

There's a theme in here. :)

> You'd effectively be adding a completely new core function just for
> those people who work with Latin1 text, and are confident that it's not
> Windows-1252 in disguise.
>

Yes.  I'm specifically addressing the people who have been using
utf8_en/decode() correctly all this time.  They shouldn't be punished for
the stupidity of others.

> It's tempting to make any C1 control characters an error as well -
> although technically valid in Latin1, these are very rarely used, and
> it's much more likely that any bytes in that range are intended as
> characters in Windows-1252. But that would feel very odd without having
> a corresponding utf8_from_windows1252 function to use instead, at which
> point we're into designing a whole new conversion library. And of
> course, once you've got that UTF-8 string, you can't do much with it,
> because PHP's native string functions are all byte-based, so you've
> basically got to re-invent large chunks of ext/mbstring...
>

I disagree that you'd need to add utf8_from/to_windows1252 "for
completeness".  The goal isn't to provide all possible conversion
utilities.  The goal is only to not punish users by taking away a valid API
that they were using correctly (for those users who were using it
correctly).

-Sara


Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Björn Larsson

On 2021-03-22 at 12:12, Rowan Tommins wrote:

> Hi Björn,
>
> On 22/03/2021 10:28, Björn Larsson wrote:
> > In our case we use the utf8_decode functions to convert from UTF8 in
> > the client to ISO-8859-1 on the server, since the site is encoded in
> > latin1.
> >
> > Our usage of that function is working flawlessly, so for us it's super
> > important to have a clear migration path with a good polyfill!
>
> I realise you can't speak for anyone else, but as a point of interest,
> would you be OK with a polyfill having a requirement on ext/mbstring or
> ext/iconv, or would you have a strong preference for a replacement built
> into the core (i.e. guaranteed available without any optional packages)?
>
> Regards,



Well, both these extensions are part of our environment so I presume it
will also be so in the future.

Now if we have a polyfill dependent on these libraries, it's a question
of how these libraries are maintained and whether they are EOL. Just
speaking from a general point here. I'm slightly in favour of mbstring,
since I have a little experience with it.

What's important for us is that the polyfill has a simple API and
doesn't have any surprises / side effects. I think, though, that there is
a case for improving these functions and keeping them in the core.

We wrap these functions in one place, so it's relatively easy to change
the wrapper to accommodate new functionality in the utf8_* functions as
long as we get the same end result.
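
A single-place wrapper of that kind might look roughly like the sketch below.
The function name, the order in which backends are tried, and the decision to
throw on failure are all assumptions for illustration, not a description of
any real code base:

```php
<?php
// Hypothetical application-level wrapper; name and backend order are
// assumptions made for this sketch.
function app_utf8_to_latin1(string $utf8): string
{
    if (function_exists('mb_convert_encoding')) {
        // mbstring replaces unmappable characters with its configured
        // substitute character ("?" unless changed).
        $result = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    } elseif (function_exists('iconv')) {
        // iconv reports unmappable input with a notice and typically
        // returns false; details depend on the underlying library.
        $result = iconv('UTF-8', 'ISO-8859-1', $utf8);
    } else {
        throw new RuntimeException('No transcoding backend available');
    }

    if (!is_string($result)) {
        throw new RuntimeException('Conversion to ISO-8859-1 failed');
    }
    return $result;
}
```

Callers keep using one function, so swapping the backend later only touches
this file. Note that the backends do not agree on what happens to characters
outside ISO-8859-1, so "the same end result" only holds for input that is
representable.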

I also think one should consider which open-source libraries are using
these functions. E.g. the Revive ad server v5.2 uses both.

r//Björn L




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Rowan Tommins

Hi Björn,

On 22/03/2021 10:28, Björn Larsson wrote:

> In our case we use the utf8_decode functions to convert from UTF8 in
> the client to ISO-8859-1 on the server, since the site is encoded in
> latin1.
>
> Our usage of that function is working flawlessly, so for us it's super
> important to have a clear migration path with a good polyfill!



I realise you can't speak for anyone else, but as a point of interest, 
would you be OK with a polyfill having a requirement on ext/mbstring or 
ext/iconv, or would you have a strong preference for a replacement built 
into the core (i.e. guaranteed available without any optional packages)?


Regards,

--
Rowan Tommins
[IMSoP]




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Björn Larsson

On 2021-03-21 at 22:39, Rowan Tommins wrote:

> On 21/03/2021 21:00, Max Semenik wrote:
> > Just a quick reminder that it's possible to compile PHP without
> > mbstring and intl, which means that some hosts will provide PHP
> > without these extensions, and some packagers make them available as
> > separate packages that users can't or don't know how to install. Maybe
> > we've got an opportunity to think about making these extensions
> > mandatory?
>
> It's somewhat relevant that until PHP 7.2, it was also possible for
> utf8_encode and utf8_decode to be missing, because they were in ext/xml,
> which is also optional.
>
> Bundling mbstring sounds great, until you look into the details of
> what's in there and how it works. Its origin as a PHP 4 extension for
> handling Japanese-specific character encodings is visible in parts of
> its design - there's a lot of global state, and very little support for
> the nuances of Unicode.
>
> Bundling intl would be great, but it's a wrapper around ICU, which is
> huge (because Unicode is complicated). I have read that incorporating
> that into core was one of the icebergs that sunk PHP 6. It's also
> extremely sparsely documented (if someone's looking for a project, it
> would be great to fill in all the manual stubs with a few details from
> the corresponding ICU documentation).
>
> For what it's worth, it seems these would be the relevant polyfills:
>
> function utf8_encode(string $string) {
>     return UConverter::transcode($string, 'UTF8', 'ISO-8859-1');
> }
> function utf8_decode(string $string) {
>     return UConverter::transcode($string, 'ISO-8859-1', 'UTF8');
> }
>
> Regards,



In our case we use the utf8_decode functions to convert from UTF8 in
the client to ISO-8859-1 on the server, since the site is encoded in
latin1.

Our usage of that function is working flawlessly, so for us it's super
important to have a clear migration path with a good polyfill!

r//Björn L




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Rowan Tommins

On 22/03/2021 01:15, Sara Golemon wrote:
> My preference is for a deprecation notice (but not necessarily removal
> ever -- We can argue that part a little).



I'm strongly against any concept of "indefinite deprecation". I consider 
any deprecation notice a commitment to remove the feature in the future, 
even if a specific timeline for that removal is not given.


If we want to have a separate status of "will be kept indefinitely, but 
you shouldn't use it", then we need a separate E_DISCOURAGED, or some 
boilerplate in the manual which doesn't use the word "deprecated".



> As for details, I don't love iso_8859_1_to_utf8(), but we can use the
> common alias for iso-8859-1 known as latin1 and call the new
> functions: utf8_from_latin1() and utf8_to_latin1() with the caveat
> that the latter will throw a ValueError for codepoints which are out of
> range (one of the more problematic issues with utf8_decode()). That
> makes this not just a simple rename for clarity, but what I'd consider
> a bug-fix for an unfortunately unfixable function.



While I can see the temptation here, I'm not sure who the target 
audience for the new function would be:


* People who just want to replace calls to utf8_decode won't want to go 
through every call and make it exception safe.
* People who want to write a polyfill couldn't use it, because they 
wouldn't be able to recover the remainder of the string after an error 
is thrown.
* People who want transcoding without any optional extensions will be 
disappointed to find only this one encoding supported.


You'd effectively be adding a completely new core function just for 
those people who work with Latin1 text, and are confident that it's not 
Windows-1252 in disguise.


It's tempting to make any C1 control characters an error as well - 
although technically valid in Latin1, these are very rarely used, and 
it's much more likely that any bytes in that range are intended as 
characters in Windows-1252. But that would feel very odd without having 
a corresponding utf8_from_windows1252 function to use instead, at which 
point we're into designing a whole new conversion library. And of 
course, once you've got that UTF-8 string, you can't do much with it, 
because PHP's native string functions are all byte-based, so you've 
basically got to re-invent large chunks of ext/mbstring...
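
As a small illustration of the "byte-based" point, the sketch below applies
three native string functions to a UTF-8 string. The helper is_valid_utf8()
is a hypothetical name; it relies on preg_match() with the /u modifier
failing on malformed UTF-8:

```php
<?php
// "café" encoded as UTF-8: 4 characters, 5 bytes.
$utf8 = "caf\xC3\xA9";

$byteLength = strlen($utf8);       // counts bytes, not characters
$firstFour  = substr($utf8, 0, 4); // cuts the 2-byte "é" in half
$reversed   = strrev($utf8);       // reverses bytes, scrambling "é"

// Hypothetical helper: the /u modifier makes PCRE reject malformed UTF-8.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

var_dump($byteLength);                // int(5)
var_dump(is_valid_utf8($utf8));       // bool(true)
var_dump(is_valid_utf8($firstFour));  // bool(false)
var_dump(is_valid_utf8($reversed));   // bool(false)
```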



Regards,

--
Rowan Tommins
[IMSoP]




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-22 Thread Hans Henrik Bergan
I would prefer to soft-deprecate them as we did with the mysql_ API: they
would not generate E_DEPRECATED for quite some time, but the documentation
would say "this function is deprecated; instead use
mb_convert_encoding($str, "UTF-8", "ISO-8859-1") or
iconv("ISO-8859-1", "UTF-8", $str)", and they would start raising
E_DEPRECATED only in the distant future.


Rowan said "they are commonly used, both correctly and incorrectly". In my
experience they are not used correctly: the people using them are
incorrectly converting Windows-1252 to UTF-8, not ISO-8859-1.



On Mon, 22 Mar 2021 at 02:15, Sara Golemon  wrote:

> On Sun, Mar 21, 2021 at 9:18 AM Rowan Tommins 
> wrote:
>
> > A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
> > a specific replacement, but recommend people look at iconv() or
> > mb_convert_encoding(). There is precedent for this, such as
> > convert_cyr_string(), but it may frustrate those who are using the
> > functions correctly.
> >
> > B) Introduce new names, such as utf8_to_iso_8859_1 and
> > iso_8859_1_to_utf8; immediately make those the primary names in the
> > manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
> > notices for the old names, either immediately or in some future release.
> > This gives a smoother upgrade path, but commits us to having these
> > functions as outliers in our standard library.
> >
> > C) Leave them alone forever. Treat it as the user's fault if they mess
> > things up by misunderstanding them.
> >
> >
> My preference is for a deprecation notice (but not necessarily removal ever
> -- We can argue that part a little).
>
> As for what users should use instead, obviously there are multiple options
> already in core (which you referenced), but those all have third party deps
> and can't be guaranteed the way utf8_en/decode() can (this was the point of
> moving them from xml).
>
> While I'm normally in favor of userspace things belonging in userspace
> (this particular conversion is trivial since it's a 1:1 mapping), I'm
> actually willing to see this added under a new, clearer name in
> ext/standard since this is something that's in long use, but used
> incorrectly.
>
> As for details, I don't love iso_8859_1_to_utf8(), but we can use the
> common alias for iso-8859-1 known as latin1 and call the new functions:
> utf8_from_latin1() and utf8_to_latin1() with the caveat that the latter will
> throw a ValueError for codepoints which are out of range (one of the more
> problematic issues with utf8_decode()).  That makes this not just a simple
> rename for clarity, but what I'd consider a bug-fix for an unfortunately
> unfixable function.
>
> -Sara
>


Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-21 Thread Sara Golemon
On Sun, Mar 21, 2021 at 9:18 AM Rowan Tommins 
wrote:

> A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
> a specific replacement, but recommend people look at iconv() or
> mb_convert_encoding(). There is precedent for this, such as
> convert_cyr_string(), but it may frustrate those who are using the
> functions correctly.
>
> B) Introduce new names, such as utf8_to_iso_8859_1 and
> iso_8859_1_to_utf8; immediately make those the primary names in the
> manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
> notices for the old names, either immediately or in some future release.
> This gives a smoother upgrade path, but commits us to having these
> functions as outliers in our standard library.
>
> C) Leave them alone forever. Treat it as the user's fault if they mess
> things up by misunderstanding them.
>
>
My preference is for a deprecation notice (but not necessarily removal ever
-- We can argue that part a little).

As for what users should use instead, obviously there are multiple options
already in core (which you referenced), but those all have third party deps
and can't be guaranteed the way utf8_en/decode() can (this was the point of
moving them from xml).

While I'm normally in favor of userspace things belonging in userspace
(this particular conversion is trivial since it's a 1:1 mapping), I'm
actually willing to see this added under a new, clearer name in
ext/standard since this is something that's in long use, but used
incorrectly.

As for details, I don't love iso_8859_1_to_utf8(), but we can use the
common alias for iso-8859-1 known as latin1 and call the new functions:
utf8_from_latin1() and utf8_to_latin1() with the caveat that the latter will
throw a ValueError for codepoints which are out of range (one of the more
problematic issues with utf8_decode()).  That makes this not just a simple
rename for clarity, but what I'd consider a bug-fix for an unfortunately
unfixable function.
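
For illustration only, a userland approximation of that idea could look like
the sketch below. No such core function exists; the name
utf8_to_latin1_strict() is made up here so it cannot be mistaken for the
proposed one, and ValueError requires PHP 8.0 or later:

```php
<?php
// Hypothetical userland stand-in for the proposed behaviour; not a real
// PHP function and not an agreed design.
function utf8_to_latin1_strict(string $utf8): string
{
    // preg_match() with /u fails on malformed UTF-8.
    if (preg_match('//u', $utf8) !== 1) {
        throw new ValueError('Input is not valid UTF-8');
    }
    // Any code point above U+00FF has no ISO-8859-1 representation.
    if (preg_match('/[^\x{00}-\x{FF}]/u', $utf8, $m) === 1) {
        throw new ValueError(
            'Code point outside ISO-8859-1 range (UTF-8 bytes: ' . bin2hex($m[0]) . ')'
        );
    }
    // Only 1-byte and 2-byte sequences remain; fold each 2-byte sequence
    // (lead byte 0xC2 or 0xC3) into a single Latin-1 byte.
    return (string) preg_replace_callback(
        '/[\xC2\xC3][\x80-\xBF]/',
        static fn (array $m): string =>
            chr(((ord($m[0][0]) & 0x03) << 6) | (ord($m[0][1]) & 0x3F)),
        $utf8
    );
}
```

Under this sketch, "caf\xC3\xA9" converts to "caf\xE9", while the Euro sign
raises a ValueError instead of silently becoming "?".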

-Sara


Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-21 Thread Rowan Tommins

On 21/03/2021 21:00, Max Semenik wrote:
> Just a quick reminder that it's possible to compile PHP without
> mbstring and intl, which means that some hosts will provide PHP
> without these extensions, and some packagers make them available as
> separate packages that users can't or don't know how to install. Maybe
> we've got an opportunity to think about making these extensions mandatory?



It's somewhat relevant that until PHP 7.2, it was also possible for 
utf8_encode and utf8_decode to be missing, because they were in ext/xml, 
which is also optional.


Bundling mbstring sounds great, until you look into the details of 
what's in there and how it works. Its origin as a PHP 4 extension for 
handling Japanese-specific character encodings is visible in parts of 
its design - there's a lot of global state, and very little support for 
the nuances of Unicode.


Bundling intl would be great, but it's a wrapper around ICU, which is 
huge (because Unicode is complicated). I have read that incorporating 
that into core was one of the icebergs that sunk PHP 6. It's also 
extremely sparsely documented (if someone's looking for a project, it 
would be great to fill in all the manual stubs with a few details from 
the corresponding ICU documentation).


For what it's worth, it seems these would be the relevant polyfills:

// Requires ext/intl (UConverter).
function utf8_encode(string $string) {
    return UConverter::transcode($string, 'UTF8', 'ISO-8859-1');
}
function utf8_decode(string $string) {
    return UConverter::transcode($string, 'ISO-8859-1', 'UTF8');
}



Regards,

--
Rowan Tommins
[IMSoP]




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-21 Thread Max Semenik
On Sun, Mar 21, 2021 at 10:08 PM Kamil Tekiela  wrote:

> I think we really do not need to keep these functions. As for the
> alternative that we can offer, iconv seems to be doing exactly the same
> thing and even better. mb_convert_encoding does the same but also silently
> ignores invalid characters. So we already offer plenty of alternatives. We
> don't need to add anything new.
>

Just a quick reminder that it's possible to compile PHP without mbstring
and intl, which means that some hosts will provide PHP without these
extensions, and some packagers make them available as separate packages
that users can't or don't know how to install. Maybe we've got an
opportunity to think about making these extensions mandatory?

-- 
Best regards,
Max Semenik


Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-21 Thread Kamil Tekiela
Option A, please.

I have never had a reason to use either of these two functions. I assume
there's plenty of valid applications for converting between ISO-8859-1 and
UTF-8, but that function causes more harm than good.
I have seen plenty of people use it, but I have never seen anyone use it
properly. Most of the time people use it to fix their mojibake text when
they forget to set the connection charset in PDO or mysqli. I was a little
surprised to learn that these functions had something to do with XML.

The reason why I consider them dangerous is that people using them are most
likely solving the wrong problem. The problem isn't the conversion from ISO
to UTF but having the text in the wrong format in the first place. They are
used as some kind of magical solution that fixes an annoying problem. I
would have no quarrel with them if they were named correctly though.
Another reason why I do not like these functions is that they let you shoot
yourself in the foot very easily. They don't warn about invalid or missing
code points, which often leads to more data corruption. When doing the same
with iconv you at least get a notice.
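
A rough sketch of how that notice can be turned into an explicit failure,
assuming ext/iconv is loaded. The helper name is hypothetical, and the exact
behaviour on unmappable input depends on the iconv implementation PHP was
built against:

```php
<?php
// Illustrative helper: returns null instead of passing damaged data along.
function utf8_to_latin1_or_null(string $utf8): ?string
{
    if (!function_exists('iconv')) {
        return null;                 // ext/iconv is not loaded
    }

    $notice = null;
    set_error_handler(
        static function (int $severity, string $message) use (&$notice): bool {
            $notice = $message;      // capture the notice instead of printing it
            return true;
        }
    );
    try {
        $result = iconv('UTF-8', 'ISO-8859-1', $utf8);
    } finally {
        restore_error_handler();
    }

    // Treat both a false return and any captured notice as a failure.
    return ($result === false || $notice !== null) ? null : $result;
}
```

A caller can then decide what to do with a null result, rather than storing
a string in which unmappable characters were quietly replaced.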

I think we really do not need to keep these functions. As for the
alternative that we can offer, iconv seems to be doing exactly the same
thing and even better. mb_convert_encoding does the same but also silently
ignores invalid characters. So we already offer plenty of alternatives. We
don't need to add anything new.

-- Kamil


Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-21 Thread Rowan Tommins

On 21/03/2021 16:51, Larry Garfield wrote:

> As Rowan notes, what people actually *want* most of the time is "I got this
> string from a user and have NFI what its encoding is, but my system needs
> UTF-8, so gimmie this string in UTF-8."  And they use utf8_encode(), which
> then fails *sometimes* in exciting and mysterious ways, because that's not
> what it is.
>
> [...]
>
> If we're removing a bad answer to the problem, we should also replace it
> with a good answer.



This is indeed my main concern with complete deprecation. The problem is 
that detecting string encoding is a Really Hard Problem™


The fundamental problem is that any sequence of bytes is valid in any 
single-byte encoding. If you're expecting printable characters only, you 
can rule out some candidates if you're lucky - e.g. if your string 
contains a byte in the range 0x80 to 0x9F, it's not any part of ISO 8859 
- but the string "\xB0\xC0\xD0" is both valid and printable in any of 
dozens of 8-bit encodings.
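
The "rule out some candidates" step can be sketched in a couple of lines.
The helper name below is made up for illustration, and the check only says
that a string is unlikely to be printable ISO 8859 text, not what it
actually is:

```php
<?php
// Illustrative helper: true if the byte string contains any byte in
// 0x80-0x9F, the range ISO 8859 reserves for C1 control characters.
function contains_c1_bytes(string $bytes): bool
{
    return preg_match('/[\x80-\x9F]/', $bytes) === 1;
}

var_dump(contains_c1_bytes("\x80500"));      // bool(true)
var_dump(contains_c1_bytes("\xB0\xC0\xD0")); // bool(false)
```

The second string passes the check, but that tells us nothing about which of
the many 8-bit encodings it was written in.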


I recently came across a Python library implementing a clever approach 
to the problem, which originated at Mozilla. Its concise FAQ is worth 
reading: https://chardet.readthedocs.io/en/latest/faq.html The approach 
Mozilla came up with is to decide which encoding leads to something most 
likely to be natural human text - e.g. don't suggest an encoding common 
for Cyrillic if the result would be completely unpronounceable in Russian.



The only function I know of which even attempts encoding detection in 
PHP is mb_detect_encoding, and it does a pretty bad job. For instance:


echo mb_detect_encoding("\x80500", ['Windows-1252', 'ISO-8859-15', 
'ISO-8859-1']);


...picks ISO-8859-15, where 0x80 is a rarely-used control character, 
rather than Windows-1252, where it's the Euro symbol.



On the other hand, if you know what encoding you do have, either of the 
following will work fine:


echo mb_convert_encoding("\x80500", 'UTF-8', 'Windows-1252');
echo iconv('Windows-1252', 'UTF-8', "\x80500");

Either of these functions (passed ISO-8859-1) can be used as a polyfill 
for correct uses of utf8_encode/utf8_decode, but neither is going to do 
the magic trick which people always *hope* those functions will.
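
Spelled out, the ISO-8859-1 variants might look like the sketch below. The
function names are hypothetical, and each one assumes the corresponding
extension is loaded:

```php
<?php
// Illustrative names; stand-ins for correct uses of utf8_encode().

function latin1_to_utf8_via_mbstring(string $latin1): string
{
    // Argument order: string, target encoding, source encoding.
    $result = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
    if (!is_string($result)) {
        throw new RuntimeException('mb_convert_encoding failed');
    }
    return $result;
}

function latin1_to_utf8_via_iconv(string $latin1): string
{
    // Argument order: source encoding, target encoding, string.
    $result = iconv('ISO-8859-1', 'UTF-8', $latin1);
    if ($result === false) {
        throw new RuntimeException('iconv failed');
    }
    return $result;
}
```

The reverse direction swaps the two encoding names, but the functions differ
in how they treat characters that ISO-8859-1 cannot represent, so they are
only interchangeable for input that is known to fit.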



Regards,

--
Rowan Tommins
[IMSoP]




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-21 Thread Larry Garfield
On Sun, Mar 21, 2021, at 9:18 AM, Rowan Tommins wrote:
> Hi all,
> 
> The functions utf8_encode and utf8_decode are historical oddities, which 
> almost certainly would not be accepted if proposed today:
> 
> * Their names do not describe their functionality, which is to convert 
> to/from one specific single-byte encoding. This leads to a common 
> confusion that they can be used to "fix" UTF-8 encoding problems, which 
> they generally make worse.
> * That single-byte encoding is ISO 8859-1, not its common cousins 
> Windows-1252 or ISO 8859-15. This means, for instance, that they do not 
> handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable)  
> not "\x80" (Windows-1252) or "\xA4" (8859-15)
> 
> On the other hand, they are commonly used, both correctly and 
> incorrectly, so removing them is not easy.
> 
> A previous proposal to remove them [1] resulted in Andrea making two 
> significant improvements: moving them from ext/xml to ext/standard [2] 
> and rewriting the documentation to explain them properly [3]. My genuine 
> thanks for that.
> 
> However, it hasn't stopped people misunderstanding them, and quite 
> reasonably: you shouldn't need to look up every function you use in the 
> manual, to make sure it actually does what its name suggests.
> 
> 
> I can see three ways forward:
> 
> A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide 
> a specific replacement, but recommend people look at iconv() or 
> mb_convert_encoding(). There is precedent for this, such as 
> convert_cyr_string(), but it may frustrate those who are using the 
> functions correctly.
> 
> B) Introduce new names, such as utf8_to_iso_8859_1 and 
> iso_8859_1_to_utf8; immediately make those the primary names in the 
> manual, with utf8_encode / utf8_decode as aliases. Raise deprecation 
> notices for the old names, either immediately or in some future release. 
> This gives a smoother upgrade path, but commits us to having these 
> functions as outliers in our standard library.
> 
> C) Leave them alone forever. Treat it as the user's fault if they mess 
> things up by misunderstanding them.
> 
> 
> I am happy to put together an RFC for either A or B, if it has a chance 
> of reaching consensus. I would really like to avoid option C.
> 
> 
> [1] https://externals.io/message/95166
> [2] https://github.com/php/php-src/pull/2160
> [3] 
> https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238
> 
> Regards,

I lost several days of my life to exactly this problem, many years ago.  I am 
still triggered by it.

I am mostly OK with option A, but with a big caveat:

The root problem here is "You keep using that function.  I do not think it 
means what you think it means."

As Rowan notes, what people actually *want* most of the time is "I got this 
string from a user and have NFI what its encoding is, but my system needs 
UTF-8, so gimmie this string in UTF-8."  And they use utf8_encode(), which then 
fails *sometimes* in exciting and mysterious ways, because that's not what it 
is.

Removing utf8_encode() may keep people from misusing it, but that doesn't mean 
the problem space they were trying to solve goes away.  If anything, people who 
still don't realize that it's the wrong solution will get angry that we're 
taking away a "useful" tool and replacing it with "meh, go look at library X," 
which is admittedly a pretty rude answer.

If we're removing a bad answer to the problem, we should also replace it with a 
good answer.

Someone will, I'm sure, pop in at this point and declare "if you don't know the 
character encoding you're receiving, then you're doing it wrong and are already 
lost and we can't help you."  While that may be technically correct, it's also 
an entirely useless answer because strings received over HTTP very frequently 
do not tell you what their encoding is, or they lie about what their encoding 
is.  (The header may say it's ISO8859, or UTF8, or whatever, but someone 
copy-pasted from MS Word into a text box and now it's Windows-1252 within a 
wrapper that says ISO8859 but is mostly UTF8 except for the Windows-1252 part.  
Like, that's literally the problem I lost several days to.)  "Your own fault" 
is not even an accurate answer at that point.

So if we're going to take away people's broken hammer, we need to be very clear 
about what hammer to use instead.

The initial answer is probably "here's how to use a series of mbstring 
functions together to produce a reasonably good 
guess-my-encoding-and-convert-to-utf8 routine" documentation.  Which... may 
exist, but if it does I've never found it.  So at bare minimum the 
utf8_encode() documentation needs to include a "use this code snippet instead" 
description, and not just link to the mbstring extension.  Glancing through the 
mbstring docs right now, it looks like it's not already a single function call, 
but some combination of several, and has some global flags that get set (via 

Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-21 Thread Ben Ramsey
> On Mar 21, 2021, at 09:32, Benjamin Morel  wrote:
> 
> On Sun, 21 Mar 2021 at 15:18, Rowan Tommins  wrote:
> 
>> I can see three ways forward:
>> 
>> A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
>> a specific replacement, but recommend people look at iconv() or
>> mb_convert_encoding(). There is precedent for this, such as
>> convert_cyr_string(), but it may frustrate those who are using the
>> functions correctly.
>> 
>> B) Introduce new names, such as utf8_to_iso_8859_1 and
>> iso_8859_1_to_utf8; immediately make those the primary names in the
>> manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
>> notices for the old names, either immediately or in some future release.
>> This gives a smoother upgrade path, but commits us to having these
>> functions as outliers in our standard library.
>> 
> 
> Hi, I'm personally fine with A or B, both of which have pros & cons:
> 
> - A is probably the cleanest way as, as you said, these functions should
> never have existed (locked to a single encoding that will only benefit a
> portion of users), but that's quite a BC break
> - B is less of a BC break as it gives users a chance to rename their
> function calls, but leaves an oddity in the standard library
> 
> I'm a bit worried that either way, we'll start seeing some "polyfills"
> appear on Packagist to re-introduce the old functions, but at least they
> will be gone from the core.


I prefer option A, and the emergence of userland polyfills doesn’t worry
me. IMO, that’s the right way for the community to handle the BC break.

Cheers,
Ben







Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-21 Thread Ayesh Karunaratne
Thank you for opening this conversation, these functions have stung me
in the past, and I would be so happy to see them gone :)

Personally, I would very much like to go with Plan A.

- XML parsers that often deal with non-UTF-8 character encodings
frequently use these functions. However, any parser worth its salt
is better off using mbstring or iconv, because these functions lack the
Windows-1252 support that is assumed elsewhere for ISO-8859-1. If we
had a `utf8_encode` that supported Windows-1252 as often expected, I
think plan B would be the smoother upgrade.

 - On Packagist top 1000 downloads, stripe-php, phpcpd, pdepend,
carbon, monolog, php-cs-fixer, htmlpurifier, and aws-php-sdk use
`utf8_encode`. Some of these libraries depend on `ext-mbstring` or
Symfony mbstring polyfill, so we are left with even fewer libraries
that cannot assume `iconv()` or `mb_convert_encoding` availability.

On Sun, Mar 21, 2021 at 7:48 PM Rowan Tommins  wrote:
>
> Hi all,
>
> The functions utf8_encode and utf8_decode are historical oddities, which
> almost certainly would not be accepted if proposed today:
>
> * Their names do not describe their functionality, which is to convert
> to/from one specific single-byte encoding. This leads to a common
> confusion that they can be used to "fix" UTF-8 encoding problems, which
> they generally make worse.
> * That single-byte encoding is ISO 8859-1, not its common cousins
> Windows-1252 or ISO 8859-15. This means, for instance, that they do not
> handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable),
> not "\x80" (Windows-1252) or "\xA4" (8859-15).
>
> On the other hand, they are commonly used, both correctly and
> incorrectly, so removing them is not easy.
>
> A previous proposal to remove them [1] resulted in Andrea making two
> significant improvements: moving them from ext/xml to ext/standard [2]
> and rewriting the documentation to explain them properly [3]. My genuine
> thanks for that.
>
> However, it hasn't stopped people misunderstanding them, and quite
> reasonably: you shouldn't need to look up every function you use in the
> manual, to make sure it actually does what its name suggests.
>
>
> I can see three ways forward:
>
> A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
> a specific replacement, but recommend people look at iconv() or
> mb_convert_encoding(). There is precedent for this, such as
> convert_cyr_string(), but it may frustrate those who are using the
> functions correctly.
>
> B) Introduce new names, such as utf8_to_iso_8859_1 and
> iso_8859_1_to_utf8; immediately make those the primary names in the
> manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
> notices for the old names, either immediately or in some future release.
> This gives a smoother upgrade path, but commits us to having these
> functions as outliers in our standard library.
>
> C) Leave them alone forever. Treat it as the user's fault if they mess
> things up by misunderstanding them.
>
>
> I am happy to put together an RFC for either A or B, if it has a chance
> of reaching consensus. I would really like to avoid option C.
>
>
> [1] https://externals.io/message/95166
> [2] https://github.com/php/php-src/pull/2160
> [3] https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238
>
> Regards,
>
> --
> Rowan Tommins
> [IMSoP]
>
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: https://www.php.net/unsub.php
>




Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-21 Thread Benjamin Morel
On Sun, 21 Mar 2021 at 15:18, Rowan Tommins  wrote:

> I can see three ways forward:
>
> A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
> a specific replacement, but recommend people look at iconv() or
> mb_convert_encoding(). There is precedent for this, such as
> convert_cyr_string(), but it may frustrate those who are using the
> functions correctly.
>
> B) Introduce new names, such as utf8_to_iso_8859_1 and
> iso_8859_1_to_utf8; immediately make those the primary names in the
> manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
> notices for the old names, either immediately or in some future release.
> This gives a smoother upgrade path, but commits us to having these
> functions as outliers in our standard library.
>

Hi, I'm personally fine with A or B, both of which have pros & cons:

- A is probably the cleanest way as, as you said, these functions should
never have existed (locked to a single encoding that will only benefit a
portion of users), but that's quite a BC break
- B is less of a BC break as it gives users a chance to rename their
function calls, but leaves an oddity in the standard library

I'm a bit worried that either way, we'll start seeing some "polyfills"
appear on Packagist to re-introduce the old functions, but at least they
will be gone from the core.
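
A rough sketch of what such a userland polyfill might look like, assuming it only wants to cover utf8_encode and must not depend on any extension. The helper name latin1_bytes_to_utf8 is invented for this example; the guard means the definition is skipped wherever the built-in still exists.

```php
<?php
// Hypothetical helper: ISO 8859-1 bytes to UTF-8 in plain PHP.
// Each byte 0x80-0xFF is the code point of the same number, which UTF-8
// encodes as exactly two bytes; bytes below 0x80 pass through unchanged.
function latin1_bytes_to_utf8(string $latin1): string
{
    return preg_replace_callback(
        '/[\x80-\xFF]/',
        static function (array $m): string {
            $byte = ord($m[0]);
            return chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        },
        $latin1
    );
}

// Only define the old name if core no longer provides it.
if (!function_exists('utf8_encode')) {
    function utf8_encode(string $string): string
    {
        return latin1_bytes_to_utf8($string);
    }
}
```

A real polyfill would also need utf8_decode, including its '?' substitution for characters outside ISO 8859-1, which is left out here.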

— Benjamin


[PHP-DEV] What should we do with utf8_encode and utf8_decode?

2021-03-21 Thread Rowan Tommins

Hi all,

The functions utf8_encode and utf8_decode are historical oddities, which 
almost certainly would not be accepted if proposed today:


* Their names do not describe their functionality, which is to convert 
to/from one specific single-byte encoding. This leads to a common 
confusion that they can be used to "fix" UTF-8 encoding problems, which 
they generally make worse.
* That single-byte encoding is ISO 8859-1, not its common cousins 
Windows-1252 or ISO 8859-15. This means, for instance, that they do not 
handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable), 
not "\x80" (Windows-1252) or "\xA4" (8859-15).
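
To make the Euro sign case concrete, here is a small sketch using mb_convert_encoding() rather than the functions under discussion. It assumes ext/mbstring is loaded with its default substitute character ('?'); the variable names are only for illustration.

```php
<?php
// The Euro sign, U+20AC, as a UTF-8 string.
$euro = "\u{20AC}";

// ISO 8859-1 has no Euro sign, so the character is unmappable.
$latin1 = mb_convert_encoding($euro, 'ISO-8859-1', 'UTF-8');

// Windows-1252 and ISO 8859-15 each assign it a single byte.
$cp1252 = mb_convert_encoding($euro, 'Windows-1252', 'UTF-8');
$latin9 = mb_convert_encoding($euro, 'ISO-8859-15', 'UTF-8');

echo bin2hex($latin1), "\n"; // 3f, i.e. '?'
echo bin2hex($cp1252), "\n"; // 80
echo bin2hex($latin9), "\n"; // a4
```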


On the other hand, they are commonly used, both correctly and 
incorrectly, so removing them is not easy.


A previous proposal to remove them [1] resulted in Andrea making two 
significant improvements: moving them from ext/xml to ext/standard [2] 
and rewriting the documentation to explain them properly [3]. My genuine 
thanks for that.


However, it hasn't stopped people misunderstanding them, and quite 
reasonably: you shouldn't need to look up every function you use in the 
manual, to make sure it actually does what its name suggests.



I can see three ways forward:

A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide 
a specific replacement, but recommend people look at iconv() or 
mb_convert_encoding(). There is precedent for this, such as 
convert_cyr_string(), but it may frustrate those who are using the 
functions correctly.


B) Introduce new names, such as utf8_to_iso_8859_1 and 
iso_8859_1_to_utf8; immediately make those the primary names in the 
manual, with utf8_encode / utf8_decode as aliases. Raise deprecation 
notices for the old names, either immediately or in some future release. 
This gives a smoother upgrade path, but commits us to having these 
functions as outliers in our standard library.


C) Leave them alone forever. Treat it as the user's fault if they mess 
things up by misunderstanding them.
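
As a rough illustration of what A and B would mean at a call site, here is a sketch assuming ext/mbstring is available (iconv('ISO-8859-1', 'UTF-8', $s) would be the iconv spelling of the first conversion). The two functions under option B use the names suggested above, but they are defined in userland here purely for illustration; they do not exist in PHP.

```php
<?php
// Option A: call sites switch to an existing general-purpose converter.
$utf8   = mb_convert_encoding("caf\xE9", 'UTF-8', 'ISO-8859-1');     // was utf8_encode()
$latin1 = mb_convert_encoding("caf\xC3\xA9", 'ISO-8859-1', 'UTF-8'); // was utf8_decode()

// Option B: same behaviour under names that say what they do.
// Hypothetical userland stand-ins for the proposed built-ins.
function iso_8859_1_to_utf8(string $string): string
{
    return mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
}

function utf8_to_iso_8859_1(string $string): string
{
    return mb_convert_encoding($string, 'ISO-8859-1', 'UTF-8');
}

echo bin2hex($utf8), "\n";
echo bin2hex($latin1), "\n";
```

Under A each call site spells out both encodings, which makes the ISO 8859-1 assumption visible; under B the call sites only change name.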



I am happy to put together an RFC for either A or B, if it has a chance 
of reaching consensus. I would really like to avoid option C.



[1] https://externals.io/message/95166
[2] https://github.com/php/php-src/pull/2160
[3] https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238


Regards,

--
Rowan Tommins
[IMSoP]
