Re: Sanity Check

tomcat Tue, 22 Nov 2016 02:53:14 -0800

On 21.11.2016 18:09, Christopher Schultz wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256


AndrÃ©,

:)

On 11/19/16 12:31 PM, André Warnier (tomcat) wrote:

With respect, this is not only "André's problem".


Agreed. I apologize if it seemed like I was suggesting that you are
the only one complaining.

I would also posit that this being an English-language forum, the
posters here would tend to be predominently English-speaking
developers, who are quite likely not the ones most affected by such
issues. So the above numbers are quite likely to be
unrepresentative of the number of people really affected by such
matters.


Also agreed: we are a self-selected group. But while we are
predominantly English-speaking (even as a second of third language),
we are all serving user populations that fall outside of that realm.

For instance, my software is overwhelmingly deployed in the United
States, but we have full support for Simplified and Traditional
Chinese script (except for top-to-bottom and right-to-left rendering,
which we don't do quite yet).

So ISO-8859-1 has basically never worked for us, and we've been UTF-8
since roughly the beginning.

And one could also look at the amount of code in applications and
in Tomcat e.g., which is dedicated to working around linked
issues. (Think "UseBodyEncodingForURL",
"org.apache.catalina.filters.AddDefaultCharsetFilter" etc.)

Basically what I'm saying is that this
"posted-parameters-encoding-issue" is far from being "licked",
despite the fact that native English-speaking developers may have a
tendency to believe that it is.


Aah, I meant that *my* problem with *this* vendor is now an
open-and-shut case: they are squarely in violation of the
specifications. They may decide not to change, but at least we know
the truth of the matter and can move forward from there.

When it's unclear which party is at fault, the party with the bigger
bank account wins. (In that case, it's the vendor who has all the
money, not me :) But being able to claim that they advertise support
for this specification and clearly do not correctly-support it means
that really THEY should be making a change to their software, not me.

The only problem now is that it's not clear how to turn %C2%AE
into a character because you have to know that UTF-8 and not
Shift-JS or whatever is being used.

-> Required parameters : No parameters -> Optional parameters :
No parameters

OK. So no charset= parameter is allowed. My advise to specify
the charset parameter was wrong.


No, it wasn't, not really.  I believe that you were on a good track
there. It is the spec that is wrong, really.

One is allowed to question a spec if it appears wrong, or ? After
all, RFC means "Request For Comment".


Sure. The problem is that the app can only do so much, especially when
the browsers behave in a very weird way... specifically by flatly
refusing to provide a charset parameter to the Content-Type when it's
appropriate.

Being allows (spec-wise) to include a charset along with that
Content-Type would be nice. An alternative would be to keep the spec
in-fact and add a new spec that introduces a new header e.g.
Encoded-Content-Type that would be a stand-in for the missing
"charset" parameter for a/xfwu.

Agreed: it is always against the spec(s) to specify a charset for
any MIME type that is not text/*.


Agreed. It just makes no sense for data that is not fundamentally
"text". (Whether some such text data has or not a MIME type whose
designation starts with "text/" is quite another matter. For
example : the MIME type "application/ecmascript" refers to text
data (javascript code) - and allows a charset attribute - even
though its type name does not start with "text/"; there are many
other types like that).


I think the real problem is that many application/* MIME types really
should be text/* types instead. Javascript is another good example.
a/xwfu is also, by definition, text. If you want to upload binaries,
you use application/binary or multipart/form-data with a subtype of
application/binary.

Apache Tomcat supports the use of charset parameter with
Content-Type application/x-www-form-urlencoded in POST
requests.


Good for Tomcat.  That /is/ the intelligent thing to do, MIME-type
notwithstanding. Because if ever, clients such as standard web
browsers would come to pay more attention and apply this attribute,
much of the current confusion would go away.

Even better would be, if the RFC for
"application/x-www-form-urlencoded" would be amended, to specify
that this charset attribute SHOULD be provided, and that by default
its value would be "ISO-8859-1" (for now; but there is a good case
to make it UTF-8 nowadays).


Weirdly, the current behavior of web browsers is to:

a) Use the charset of the page that presented the form
and
b) Not report it to the server when submitting the POST request

So everybody loses, and you can't just claim "the standard should be
X". The standard default should be "undefined" :)

In fact, if Tomcat was to strictly respect the MIME type definition
of "application/x-www-form-urlencoded" and thus, after
percent-decoding the POST body, interpret any byte of the resulting
string strictly as being a character in the US-ASCII character set,
that /would/ instantly break thousands of applications.


It would break everything, and I don't think it would be a "strict"
following of the spec. There is a hole in the spec because the server
can't (per spec) know the intended character encoding of the text
after it has been url-decoded.

I'm saying that the a/xwfu raw body itself must be (per spec)
US-ASCII. But once url-decoded, those bytes can be interpreted as
pretty much anything, UTF-8 being the most sensible these days, but
evidently ISO-8859-1 gets used a lot. Hence your Andr†® problem.
Again, not YOUR problem. :)

it would now seem (unless I misinterpret, which is a distinct
possibility) that the content of a
"application/x-www-form-urlencoded" POST, *after*
URL-percent-decoding, *may* be a UTF-8 encoded Unicode string (it
may also be something else). (There is even a provision for
including a hidden "_charset_" parameter naming the
charset/encoding. Yet another muddle ?) (This also applies only to
HTML 5 <form> documents, but let's skip this for a moment).

Still, as far as I can tell, there is still to some extent the
same "chicken-and-egg" problem, in the sense that in order to parse
the above parameter, one would first have to decode the
"application/x-www-form-urlencoded" POST body, using some character
set. For which one would need to know ditto character set before
decoding.


The _charset_ thing is an horrible hack. It's worse than XML, but at
least the XML parser can prove to itself that the character set of the
bytes it's looking for are fairly close to the beginning of the
stream. There's no requirement that the _charset_ parameter, for
example, be the first parameter sent in the body of the request. :(

Pretty much the same solution applies to POSTs in the
"multipart/form-data" format, where each posted parameter already
has its own section with a MIME header.  Whenever one of these
parameters is text, it should specify a charset. (And if it
doesn't, then the current muddle applies).


The problem is that most of these parts don't have a text/* MIME type.
That's what I meant when I said you've "moved the problem" because
a/xwfu can still hide in there and nothing has been solved.

The only remaining muddle is with the parameters passed inside the
URL, as a query-string.

+1

But for those, one could apply for example the same mechanism as
is already applied for non-ASCII email header values (see
https://tools.ietf.org/html/rfc2047). This is not really ideal in
terms of simplicity, but 1) the code exists and works and 2) it
would certainly be preferable to the current muddled situation and
recurrent parameter encoding problems. (And again, for clients
which do not use this, then the current muddle applies).


UTF-8 is pretty much the agreed-upon standard these days, except where
it isn't :)

Altogether, to me it looks like there are 2 bodies of experts, one
on the HTML-and-client side and one on the HTTP-and-webserver side
(or maybe these are 4 bodies), who have not really been talking to
eachother constructively on this issue for years.


Yes and, oddly enough, they are all working under the W3C umbrella.

The result being that instead of agreeing on some simple rules,
each one of them kind of patched together its own separate set of
rules (and a lot of complex software), to obtain finally something
which still does not really solve the interoperability problem
fundamentally.

The current situation is nothing short of ridiculous : - there are
many character sets/encodings in use, but most/all of them are
clearly defined and named - there are millions of webservers, and
billions of web clients But fundamentally : - currently, a client
has no way to know for sure what character set/encoding it should
use, when it first tries to send some piece of text data to a
webserver - currently, a webserver has no way to know for sure in
what character set/encoding a client is sending text data to it


All true.

I'm sure that we can do better.  But someone somewhere has to take
the initiative.  And who better than an open-source software
foundation whose products already dominates the worldwide webserver
market ?


https://xkcd.com/927/


Yep, that is what I would be afraid of. (Great site by the way, thanks for the 
pointer).

But thus, it seems that
- we are agreed that this is (still) a problem, for web users and web 
developers worldwide
- we are agreed that solving it, would not break existing 
applications/webservers

- we are agreed that solving it would not require fundamentally new rules or RFCs, justmaybe "tweaking" a couple of them (for example : allow a "charset" attribute for someadditional MIME types which fundamentally concern text data (such as"application/x-www-form-urlencoded"), or allow a query-string to include an encoding asper rfc2047)(which does not per se require any tweaking of any RFC)- we are agreed that solving the problem, would not require the writing of a lot of newcode, as such code already exists in a much-used and debugged form- we are agreed that solving this issue would save many people a lot of aggravation, andsave the web-developer community a lot of basically superfluous coding and codemaintenance, and documentation and on-line support, which could reasonably be evaluatedtogether at thousands of man/hours per year- and we are agreed that we cannot think of any organisation or group of persons that sucha proposal could actually hurt or even inconvenience

So now, considering that such a thing would seem to have overall an overwhelming positiveeffect and no negative effect that we can think of, how would one go about proposing it ?

For one, which would be the proper instance(s) to approach and how ?

I mean, having it adopted by acclamation would be very nice, but I am sure that there aresome additional formalities to respect here.




---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Sanity Check

Reply via email to