Re: [OT] distinction between resource charset and format octet decoding

2020-02-06 Thread Garret Wilson


On 2/6/2020 12:43 PM, Christopher Schultz wrote:

…

* Therefore `web.xml` settings, HTTP headers, etc. are all
irrelevant, as this is an issue dealing with the file format
itself, and the latest spec for the file format says to use UTF-8,
so everyone should use UTF-8 already.

Except for everyone who already uses something else and expects
everything to be backward-compatible.


I think there comes a time when we have to move forward after some 
critical level of usage is reached. I think we've passed that point.


Modern browsers in the sense that you mention are not 
backwards-compatible for `application/x-www-form-urlencoded`. So what 
are we being compatible with by not using UTF-8 decoding? Do we have 
anything besides browsers consuming output from legacy JSP apps? As 
noted the browsers break when we try to be "backwards-compatible" in the 
sense you mention.



The problem is that you don't get to declare what's "best" for
everyone and then the whole world does what you want.


But here I would imagine that everyone already agrees what's best; the 
debate is whether we should do something different from what we know is 
best because of some outdated specs. (And I say that as a huge proponent 
of following standards.)


I'll give you an example that is directly relevant. Over 10 years ago I 
strongly advocated to the RDF group that the Internet should abandon the 
outdated practice of requiring that `text/*` media types default to 
US-ASCII; otherwise there would be no point in using `text/*` for 
anything going forward! (That's why we went through a sad phase where 
everyone was using `application/*` for text formats because they wanted 
to default to something other than US-ASCII.)


 * https://www.w3.org/2008/01/rdf-media-types
 * https://lists.w3.org/Archives/Public/www-archive/2007Dec/0059.html

Sure enough, eventually someone saw the light (I won't claim I had 
anything to do with it, but it is exactly what I was arguing for) and 
created https://tools.ietf.org/html/rfc6657, which says that individual 
`text/*` types can choose a default other than ASCII. Finally we're not 
stuck in the past anymore!


I would say that someone needs to create an updated 
`application/x-www-form-urlencoded` specification prescribing UTF-8 
decoding of encoded octets, except that the WhatWG has already done 
that! So I'm not declaring that everyone should do it "my" way. I'm 
saying everyone should follow the latest spec which already exists.


Anyway, thanks for listening. I think it's a fun discussion, and I 
wasn't being combative---I just wanted to tell a bit of the story. I 
need to get back to work now. :)


Thanks again for the change in Tomcat 10!

Garret

Re: [OT] distinction between resource charset and format octet decoding

2020-02-06 Thread Christopher Schultz

Garret,

On 2/6/20 10:25 AM, Garret Wilson wrote:
> On 2/6/2020 11:46 AM, André Warnier (tomcat/perl) wrote:
>> …
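>>> As of Tomcat 10, conf/web.xml contains the following:
>>> 
>>> <request-character-encoding>UTF-8</request-character-encoding>
>>> <response-character-encoding>UTF-8</response-character-encoding>
>>> 
>>> That *should* have the effect you are looking for but I confess I
>>> haven't tested it in any great detail.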
>>> 
>> 
>> As I am sure many people (Christopher included) would agree, the
>> real solution would be for browsers and other HTTP clients to
>> indicate clearly in the request, the charset/encoding of each
>> text parameter that they are sending. There are even HTTP headers
>> already defined for that.
> 
> 
> Which HTTP headers are you referring to? `Content-Type`? It is my 
> opinion that this is irrelevant and not applicable.
> 
> As I explained (extensively) in my original post for this thread
> back on 2019-01-08, the issue is not the charset of 
> `application/x-www-form-urlencoded`. That media type is made up of
> ASCII characters. It doesn't matter whether you say it's ASCII,
> ISO-8859-1, UTF-8, or whatever, the actual characters stay 100% the
> same.

Hmm. Not always. While it may be true that:

1. ASCII, ISO-8859-1, and UTF-8 are very common
2. ASCII, ISO-8859-1, and UTF-8 encode the first 128 code points
   identically

It is not true that:

3. All character encodings encode the first 128 code points identically.

UTF-16 doesn't follow that pattern.
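
To make the point above concrete, here is a minimal Java sketch (not part of the original exchange) that compares the bytes each charset produces for an ASCII-only form body. The class name, the helper `sameBytesAsAscii`, and the sample body are assumptions chosen for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiCompatibility {

    /** Returns true if the charset encodes the text to exactly the bytes US-ASCII produces. */
    static boolean sameBytesAsAscii(String text, Charset charset) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return Arrays.equals(ascii, text.getBytes(charset));
    }

    public static void main(String[] args) {
        // Hypothetical form body: only ASCII characters, with the non-ASCII data percent-encoded.
        String body = "name=caf%C3%A9";
        Charset[] charsets = {
            StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8, StandardCharsets.UTF_16BE
        };
        for (Charset cs : charsets) {
            // Prints true for ISO-8859-1 and UTF-8, false for UTF-16BE (two bytes per character).
            System.out.println(cs + ": " + sameBytesAsAscii(body, cs));
        }
    }
}
```

So both statements hold at once: the serialized body is byte-identical across the ASCII-compatible charsets, yet an encoding such as UTF-16 would produce different bytes for the same characters.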

> At issue is when certain octets are encoded (as specified by the 
> `application/x-www-form-urlencoded` media type itself), what
> charset to use when decoding them. This is independent of the
> encoding of the media type itself; rather this is defined by the
> specification for the format.

Correct. And there is a lack of agreement for URLs, so browsers decided
to make it up. It's not possible to guess what the browser has chosen
because it does not advertise it in any way (absent a standard). The
only 100% reliable way to do it would be to add a parameter to every
request which has a known-correct value that can be unambiguously
decoded. You just keep re-decoding the whole URL until that parameter
value matches the known-correct value. Sounds like a lot of fun to
implement across a whole application, right?
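
For illustration only, the "known-correct parameter" trick described above could be sketched in Java roughly as follows. The sentinel value, the candidate list, and the names `SentinelCharsetProbe` and `detect` are hypothetical; `URLDecoder.decode(String, Charset)` requires Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

public class SentinelCharsetProbe {

    // Assumed sentinel: every form submits a hidden field whose decoded value
    // must be U+00E9 (e with acute accent). Written as an escape so the source
    // file's own encoding cannot interfere.
    static final String EXPECTED = "\u00e9";

    // Candidate charsets to try, in order of preference (an assumption).
    static final List<Charset> CANDIDATES =
            List.of(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

    /** Re-decodes the raw sentinel with each candidate until it matches the expected value. */
    static Optional<Charset> detect(String rawSentinel) {
        for (Charset cs : CANDIDATES) {
            if (EXPECTED.equals(URLDecoder.decode(rawSentinel, cs))) {
                return Optional.of(cs);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(detect("%C3%A9")); // client encoded the sentinel as UTF-8
        System.out.println(detect("%E9"));    // client encoded the sentinel as ISO-8859-1
    }
}
```

A single sentinel character cannot tell apart charsets that encode it identically (ISO-8859-1 and windows-1252 both use `%E9` here), so a real sentinel would need characters that differ across every candidate.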

> Unfortunately https://tools.ietf.org/html/rfc1866 actually says we 
> should use ASCII when decoding the octets, but this is severely 
> antiquated and doesn't fit with modern practice. The WhatWG
> essentially redefines the format to say that the octets must be
> interpreted as UTF-8:
> 
> https://url.spec.whatwg.org/#application/x-www-form-urlencoded
> 
> So to summarize my view:
> 
> * The decoding of the `application/x-www-form-urlencoded` media
> type encoded octets is completely independent of the charset
> indicated in the `Content-Type` header, and rather goes to the
> specification of the format itself.

It's strange, because Content-Type can contain a charset parameter,
but MIME specifically says that "charset" parameters are only
appropriate for "text/*" MIME types. So for
application/x-www-form-urlencoded, you "shouldn't" add that parameter.
But there's no particular reason NOT to include it (it doesn't
actually violate any spec) and adding it COMPLETELY AND UNAMBIGUOUSLY
indicates what the browser chose as the encoding.
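
As a rough sketch of what honoring such a parameter could look like on the receiving side, the following Java helper pulls a `charset` parameter out of a `Content-Type` value and falls back to a default when none is present. The class and method names are hypothetical, and the parsing is deliberately naive (it does not handle quoted strings containing semicolons); it is not how Tomcat does it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentTypeCharset {

    /** Returns the charset named in the Content-Type value, or the fallback if absent or unknown. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].trim().split("=", 2);
            if (kv.length == 2 && kv[0].trim().equalsIgnoreCase("charset")) {
                String name = kv[1].trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf(
                "application/x-www-form-urlencoded; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf(
                "application/x-www-form-urlencoded", StandardCharsets.UTF_8));
    }
}
```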

> * RFC 1866 is severely out of date and out of step, and the
> WhatWG's specification of the `application/x-www-form-urlencoded`
> media type should be used instead. (Modern browser practice would
> seem to agree with me.)

RFC 1866 has been very much superseded. Also, HTML specs shouldn't be
defining HTTP semantics. So ignore whatever is in RFC 1866 on multiple
grounds.

> * Therefore `web.xml` settings, HTTP headers, etc. are all
> irrelevant, as this is an issue dealing with the file format
> itself, and the latest spec for the file format says to use UTF-8,
> so everyone should use UTF-8 already.

Except for everyone who already uses something else and expects
everything to be backward-compatible.

The problem is that you don't get to declare what's "best" for
everyone and then the whole world does what you want. I happen to
agree with you (Everyone should move to UTF-8 for everything.
Everywhere. Forever.), but you have to recognize that there is history
and entrenched systems, environments, and mindsets.

> The new default `web.xml` in Tomcat 10 is a wonderful step in the
> right direction.

+1

-chris

Re: [OT] distinction between resource charset and format octet decoding

2020-02-06 Thread Garret Wilson


On 2/6/2020 11:46 AM, André Warnier (tomcat/perl) wrote:

…

As of Tomcat 10, conf/web.xml contains the following:


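<request-character-encoding>UTF-8</request-character-encoding>
<response-character-encoding>UTF-8</response-character-encoding>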

That *should* have the effect you are looking for but I confess I
haven't tested it in any great detail.



As I am sure many people (Christopher included) would agree, the real 
solution would be for browsers and other HTTP clients to indicate 
clearly in the request, the charset/encoding of each text parameter 
that they are sending.

There are even HTTP headers already defined for that.



Which HTTP headers are you referring to? `Content-Type`? It is my 
opinion that this is irrelevant and not applicable.


As I explained (extensively) in my original post for this thread back on 
2019-01-08, the issue is not the charset of 
`application/x-www-form-urlencoded`. That media type is made up of ASCII 
characters. It doesn't matter whether you say it's ASCII, ISO-8859-1, 
UTF-8, or whatever, the actual characters stay 100% the same. At issue 
is when certain octets are encoded (as specified by the 
`application/x-www-form-urlencoded` media type itself), what charset to 
use when decoding them. This is independent of the encoding of the media 
type itself; rather this is defined by the specification for the format.
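
A small Java sketch (added for illustration, not from the original post) shows the distinction: the same ASCII-only encoded string yields different text depending on which charset is used to interpret the decoded octets. The sample value and class name are assumptions; `URLDecoder.decode(String, Charset)` requires Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class OctetDecodingDemo {

    /** Percent-decodes the value, interpreting the resulting octets in the given charset. */
    static String decode(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) {
        // The encoded form consists solely of ASCII characters.
        String encoded = "caf%C3%A9";

        String asUtf8 = decode(encoded, StandardCharsets.UTF_8);        // octets C3 A9 -> U+00E9
        String asLatin1 = decode(encoded, StandardCharsets.ISO_8859_1); // octets C3 A9 -> U+00C3 U+00A9

        // Print lengths rather than the text so the result does not depend on console encoding.
        System.out.println(asUtf8.length());   // 4
        System.out.println(asLatin1.length()); // 5
    }
}
```

The input bytes never change; only the charset applied after percent-decoding does, which is why the charset of the media type itself is beside the point.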


Unfortunately https://tools.ietf.org/html/rfc1866 actually says we 
should use ASCII when decoding the octets, but this is severely 
antiquated and doesn't fit with modern practice. The WhatWG essentially 
redefines the format to say that the octets must be interpreted as UTF-8:


https://url.spec.whatwg.org/#application/x-www-form-urlencoded

So to summarize my view:

 * The decoding of the `application/x-www-form-urlencoded` media type
   encoded octets is completely independent of the charset indicated in
   the `Content-Type` header, and rather goes to the specification of
   the format itself.
 * RFC 1866 is severely out of date and out of step, and the WhatWG's
   specification of the `application/x-www-form-urlencoded` media type
   should be used instead. (Modern browser practice would seem to agree
   with me.)
 * Therefore `web.xml` settings, HTTP headers, etc. are all irrelevant,
   as this is an issue dealing with the file format itself, and the
   latest spec for the file format says to use UTF-8, so everyone
   should use UTF-8 already.

The new default `web.xml` in Tomcat 10 is a wonderful step in the right 
direction.


See my original post for more in-depth explanation.

Garret

Re: [OT] distinction between resource charset and format octet decoding

2020-02-06 Thread tomcat/perl

On 06.02.2020 14:44, Mark Thomas wrote:

On 06/02/2020 13:39, Garret Wilson wrote:

On 2/6/2020 10:36 AM, Mark Thomas wrote:

…

Whether Tomcat should ship with this setting present in conf/web.xml
by default is something that should probably be discussed for Tomcat
10. Given the current state of the web, there is a reasonable case for
doing so. I'll add that to the TOMCAT-NEXT discussion list.

Is this still on the list for discussion for Tomcat 10?

No, because it has already been implemented for Tomcat 10 and is in the
milestone release currently being voted on.

Waitasec. I'm not used to good news, so I want to make sure I understand
what you're saying. Are you saying that the proposed Tomcat 10
implementation already interprets encoded octets in web form submissions
using UTF-8 by default?!! :O

As of Tomcat 10, conf/web.xml contains the following:

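<request-character-encoding>UTF-8</request-character-encoding>
<response-character-encoding>UTF-8</response-character-encoding>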

That *should* have the effect you are looking for but I confess I
haven't tested it in any great detail.

As I am sure many people (Christopher included) would agree, the real solution would be
for browsers and other HTTP clients to indicate clearly in the request, the
charset/encoding of each text parameter that they are sending.

There are even HTTP headers already defined for that.
(Nowadays the default could be Unicode/UTF-8).

The problem is that browsers and other agents don't do that, although they undoubtedly
always know themselves, and although it would solve a series of issues that have literally
been there forever at the server and application level (*).

I have often wondered whether, and if not why, the Apache Foundation does not carry enough
influence over the HTTP/HTML specifications process and over browser producers to achieve that.

(And if not the Apache Foundation, then who?)

(*) My own guess is that this basic thing (or lack of it) has cost over the years many
thousands of lines of unnecessary code and many thousands of unproductive developer hours.
As a tiny example, just consider the above web.xml parameters, and how much time in total
was dedicated to their definition and implementation. Never mind all the previous related
filters and valves and their discussions on this list. And that's only for Tomcat.

