On 18.11.2016 05:56, Christopher Schultz wrote:

Konstantin,

On 11/17/16 4:58 PM, Konstantin Kolinko wrote:
2016-11-17 17:21 GMT+03:00 Christopher Schultz
<ch...@christopherschultz.net>:
All,

I've got a problem with a vendor and I'd like another opinion
just to make sure I'm not crazy. The vendor and I have a
difference of opinion about how a character should be encoded in
an HTTP POST request.

The vendor's API officially should accept requests in UTF-8
encoding. We are using application/x-www-form-urlencoded content
type.

I'm trying to send a message with a non-ASCII character -- for
example, a ® (that's (R), the registered trademark symbol).

The Java code being used to package-up this POST looks something
like this:

OutputStream out = httpurlconnection.getOutputStream();
out.print("notetext=");
out.print(URLEncoder.encode("test®", "UTF-8"));
out.close();

So the POST payload ends up being notetext=test%C2%AE or, on the
wire, the bytes are 6e 6f 74 65 74 65 78 74 3d 74 65 73 74 25 43
32 25 41 45.

The final bytes 25 43 32 25 41 45 are the characters % C 2 % A
E.

Can someone verify that I'm encoding everything correctly?
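(For what it's worth, the encoding step can be checked in isolation; a minimal sketch, with a class name of my choosing:)

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodeCheck {
    public static void main(String[] args) throws Exception {
        // URLEncoder percent-encodes each UTF-8 byte of a non-ASCII character.
        System.out.println(URLEncoder.encode("test®", "UTF-8")); // test%C2%AE

        // ® is the two-byte UTF-8 sequence 0xC2 0xAE.
        for (byte b : "®".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02x ", b & 0xFF); // c2 ae
        }
        System.out.println();
    }
}
```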

The vendor is claiming that ® can be sent "directly" like one
might do using curl:

$ curl -d 'notetext=®' [url]

and the bytes on the wire are 6e 6f 74 65 74 65 78 74 3d c2 ae
(note that c2 and ae are "bare" and not %-encoded).

1. That is the wrong way to use curl.  The manual says that the
argument to -d should be properly urlencoded, so the value above is
incorrect.

https://curl.haxx.se/docs/manual.html See "POST (HTTP)" and below.

+1

The curl manual says that -d is the same as --data-ascii, which is
totally wrong here if they are accepting UTF-8.

2. If you are submitting data programmatically, I wonder why you
are using simple "application/x-www-form-urlencoded".

I think it would be better to use an explicit charset parameter in the
Content-Type value, as it is easy to do so with Java clients.

Their API expects application/x-www-form-urlencoded. Everything else
they do is in JSON... I have no idea why they don't accept JSON as
input, but that's the deal.

MIME types that aren't text/* aren't supposed to have Content-Type
parameters.

Maybe more precisely: there SHOULD be a Content-Type header, but a "charset" attribute only makes sense if the content type is, generally speaking, "text". ("text/plain" certainly qualifies; one may argue about "text/html" and variants, e.g., since those formats may carry their own embedded charset indications.)


3. The application/x-www-form-urlencoded encoding was originally
specified in HTML specification.

Current specification:
https://www.w3.org/TR/html51/sec-forms.html#urlencoded-form-data

It defers details to
https://url.spec.whatwg.org/#concept-urlencoded-serializer

Historic, HTML 4.01:
https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1

All true, but the spec argues with itself over the character encoding,
and browsers make this worse with their stupid "I'll use whatever
character encoding was used to load the page containing the form"
behavior. With a software-client API, there basically is no spec.

Their assertion is that their character encoding "is UTF-8". But it
looks like they aren't doing it right.

My opinion is that the correct value on the wire is 25 43 32 25 41
45 = % C 2 % A E.

So, the same bytes as I had, right?

If a vendor accepts non-encoded "c2 ae", it may technically work
(in some versions of some software), but this is not a standard
feature and one had better not rely on it.

Technically, if non-encoded bytes ("c2 ae") are accepted, they
won't be confused with the special characters ("=", "&", "+", "%",
CRLF), as all bytes of multi-byte UTF-8 sequences have the 0x80 bit set.
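(That property is easy to demonstrate; a small sketch, with sample characters of my choosing:)

```java
import java.nio.charset.StandardCharsets;

public class HighBitCheck {
    public static void main(String[] args) {
        // Every byte of a multi-byte UTF-8 sequence has the 0x80 bit set,
        // so raw UTF-8 bytes can never collide with the ASCII separators
        // '=', '&', '+', '%', CR or LF.
        for (byte b : "®€漢".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02x high-bit=%b%n", b & 0xFF, (b & 0x80) != 0);
        }
    }
}
```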

Their non-%-encoded bytes could be considered legitimate, because the
application/x-www-form-urlencoded rules say that any character "in the
character set of the request" can be dropped-into the request without
being %-encoded. But then we are back to the problem of not knowing
what the encoding of the request is.

Since UTF-8 is supposed to be the "official" character encoding,

Now where is that specified? As far as I know, the default charset for everything HTTP- and HTML-wise is still iso-8859-1, no? (And unfortunately so.)

 I
would expect that a properly-encoded request would contain nothing but
valid ASCII characters, which means that 0xc2 0xae need to be
%-encoded to become "%c2%ae".

4. Your code fragment is broken and won't compile: there are no
"print" methods in java.io.OutputStream.

OutputStream works with byte[] and the method name is "write".

Yes, it was hastily-typed from memory. The true code compiles and runs
as expected.
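(For reference, a working version of that fragment; the connection handling is sketched in comments since it needs a live endpoint, and the helper name is mine:)

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PostBody {
    // Build the form body; URLEncoder does the %-escaping.
    static byte[] buildBody(String value) throws Exception {
        String body = "notetext=" + URLEncoder.encode(value, "UTF-8");
        // The encoded body is pure ASCII, so the string-to-byte
        // conversion is unambiguous.
        return body.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws Exception {
        byte[] payload = buildBody("test®");
        // With a real HttpURLConnection one would then do:
        //   conn.setDoOutput(true);
        //   try (OutputStream out = conn.getOutputStream()) {
        //       out.write(payload);  // OutputStream has write(byte[]), not print()
        //   }
        StringBuilder hex = new StringBuilder();
        for (byte b : payload) hex.append(String.format("%02x ", b));
        System.out.println(hex.toString().trim());
        // 6e 6f 74 65 74 65 78 74 3d 74 65 73 74 25 43 32 25 41 45
    }
}
```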

5. Wikipedia:
https://en.wikipedia.org/wiki/Percent-encoding#The_application.2Fx-www-form-urlencoded_type

  Wikipedia mentions XForms spec, ->
https://www.w3.org/TR/2007/REC-xforms-20071029/#serialize-urlencode

Thanks for the XForms reference... it's nice that it has a real
example (including a non-ASCII character) instead of the usual trivial
examples in the HTTP and HTML specs.

6. You can test with real browsers.

I will certainly be doing that.

https://www.w3.org/TR/2007/REC-xforms-20071029/#serialize-urlencode

The vendor has responded with (paraphrasing) "it seems we don't
completely follow this standard; we're considering what to do next,
which may include no change". This is a big vendor with *lots* of
software clients, so maintaining backward compatibility is going to be
a big deal for them. I've got some tricks up my sleeve if they decide
not to change anything. Hooray for specs. :(


What I never understood in all of this is why browsers and other clients never seem to respect (and servers never seem to enforce) what is indicated here:

https://www.ietf.org/rfc/rfc2388.txt
4.5 Charset of text in form data

This would be a simple way to get rid of umpteen character set/encoding issues encountered when trying to interpret <form> data POSTed to web applications.

It seems to me contrary to common sense that, in our day and age, the rules for this could not be set once and for all to something like:

1) the default character set/encoding of HTTP and HTML is Unicode/UTF-8
   (instead of the current really archaic iso-8859-1)
2) URLs (including query-strings) should be by default interpreted as Unicode/UTF-8, encoded as per https://tools.ietf.org/html/rfc3986#section-2
3) for POST requests:
- for the Content-type "application/x-www-form-urlencoded", there SHOULD be a charset attribute indicating the charset and encoding. By default, this is "text/plain; charset=UTF-8"
- for the Content-type "multipart/form-data", each "part" MUST have a Content-type header. If this Content-type is a "text" type, then the Content-type header SHOULD contain a charset attribute. If omitted, by default this is "charset=UTF-8".

and be done with it once and for all.

