-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

André,

On 3/16/2009 8:30 PM, André Warnier wrote:
> Christopher Schultz wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> André,
>>
>> (Man, I need to get a keyboard mapping for "é". This copy-and-paste
>> thing is such a drag...)
> 
> Well, you can use Andre, I don't mind and I'm used to all kinds of
> spellings.  Or you can use André , the special form for people who
> haven't dominated their MIME charsets yet ;-)

Or for those whose charsets are mismatched (ha!).

> Well yes [size does matter], in a number of situations.  Think for example
> about webserver logs, where these things then appear as a very very long
> string, percent-escaped to boot.

Eh, so you'd get your data enlarged to some extent. Again, the exact
Content-Type is not really relevant, since the problem is the same in
either case. The only difference is whether the servlet spec says it'll
expose that data to you through getParameter and friends.

> There is no "Content-Type of the request".  Try it : make a GET request
> (or a POST with application/x-www-form-urlencoded), and look for a
> request Content-Type with a charset.
> For a GET there is no content-type (because there is no request body).
> For a POST there is a content-type, but without charset.

That's the browser's fault, not the spec's. A request /does/ have a
Content-Type, whether implied or explicit. The problem is when the
client encodes the POST body with a charset other than the default,
and refuses to advertise it (which is the root of the problem).

> The gist of it is : for an "enctype=application/x-www-form-urlencoded"
> (whether explicit or by default), the URL is encoded in whatever charset
> the browser feels like encoding it. Which MAY depend on what the browser
> thinks the charset of the page is, which contains the <form>; or the
> "accept-charset" attribute of the form tag, or the user's preferences.
> But whatever the browser is in the end sending you, it does not say.

Agreed. My interpretation of the spec is that most clients are
non-compliant. When I use the filter attached to one of my other posts,
I most certainly *do* get POST content in UTF-8 encoding, yet the
browser fails to inform me with a Content-Type header.

If I POST "gregör", the POST body (again, without charset indicated in
the content-type) is this:

query=greg%C3%B6r

Note that if ISO-8859-1 had been used, the string should have been:

query=greg%F6r

So, the browser is patently violating the spec: it is using UTF-8 to
encode the request body yet not advertising it (RFC 2616 section 3.7.1).
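
To make the difference concrete, here is a minimal Java sketch of the two
encodings above, plus what happens when UTF-8 bytes get decoded as
ISO-8859-1. The class and method names are made up for illustration, and
the Charset overloads of URLEncoder/URLDecoder assume Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tomcat or of any filter in this thread.
final class FormEncodingDemo {

    // Percent-encode a form value using the given charset.
    static String encode(String value, Charset charset) {
        return URLEncoder.encode(value, charset);
    }

    // Percent-decode a raw form value using the given charset.
    static String decode(String raw, Charset charset) {
        return URLDecoder.decode(raw, charset);
    }

    public static void main(String[] args) {
        // "gregör", written with an escape so the source file encoding does not matter.
        String value = "greg\u00f6r";

        System.out.println("query=" + encode(value, StandardCharsets.UTF_8));      // query=greg%C3%B6r
        System.out.println("query=" + encode(value, StandardCharsets.ISO_8859_1)); // query=greg%F6r

        // Decoding the UTF-8 body as ISO-8859-1 turns one character into two.
        String wrong = decode("greg%C3%B6r", StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length()); // 7 instead of 6
    }
}
```

Without a charset on the Content-Type, the server has to guess which of
the two decodings to apply, and a wrong guess gives you the 7-character
version.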

Technically speaking, there is /no/ default charset unless the primary
media type is "text". My interpretation of the HTTP spec is that both
multipart/form-data /and/ application/x-www-form-urlencoded /require/ a
charset to be declared, even if the charset is "raw" or something like
that (for binary files, for instance).

> But $filename is also ("magically") a /filehandle/, as
> soon as you treat it like one and read from it.  That filehandle is
> connected to a temporary file in which the module has already read and
> saved the file part as uploaded by the browser.

Yeah, this is commons-upload for Java peeps:
http://commons.apache.org/fileupload/

> So, no, it is not a 10 MB string in memory.
> If the programmer closes that filehandle, the file is automatically
> deleted from whatever temporary space it occupied.
> Keep reading, and don't miss the
>  $type = uploadInfo($filename)->{'Content-Type'};

Note that the encoding for a file upload should always be
application/octet-stream. Otherwise, you'll get things like newline
conversions such that md5(source) != md5(target). The Content-Type
should be the mime-type for the file.

> In our applications, we are the ones sending the forms to the client,
> and we know the type of encoding to expect from them.

If that's the case, why not simply force Java to always use a certain
encoding? That's essentially what you're doing in Perl, whether you know
it or not.

> Just to keep people honest, we also always include a hidden parameter
> containing a UTF-8 string with non-US-ASCII characters, and check the
> returned length (in bytes and in characters) when the form is submitted.
> If there is a discrepancy between them, we know that the form
> parameter's encoding is not what it should be, and reject the post.

That can easily be done in Java, too.
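
Something along these lines, as a rough sketch only: the sentinel value,
the class name and the method names are all invented here, and it works on
the raw percent-encoded value rather than on whatever getParameter would
hand back. It assumes Java 10+ for the Charset overload of URLDecoder.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names and sentinel value are assumptions for illustration.
final class SentinelCheck {

    // Hidden-field value with non-US-ASCII characters: "é€" (2 chars, 5 bytes in UTF-8).
    static final String SENTINEL = "\u00e9\u20ac";

    // Compare character count and UTF-8 byte count against the sentinel.
    static boolean lengthsMatch(String decoded) {
        int expectedBytes = SENTINEL.getBytes(StandardCharsets.UTF_8).length;
        int actualBytes = decoded.getBytes(StandardCharsets.UTF_8).length;
        return decoded.length() == SENTINEL.length() && actualBytes == expectedBytes;
    }

    // Decode the raw (percent-encoded) field with the charset we expect, then check it.
    static boolean encodingLooksRight(String rawValue, Charset expected) {
        return lengthsMatch(URLDecoder.decode(rawValue, expected));
    }

    public static void main(String[] args) {
        String rawUtf8 = "%C3%A9%E2%82%AC"; // the sentinel as a UTF-8 browser would send it
        System.out.println(encodingLooksRight(rawUtf8, StandardCharsets.UTF_8));      // true
        System.out.println(encodingLooksRight(rawUtf8, StandardCharsets.ISO_8859_1)); // false
    }
}
```

Comparing the decoded value to the sentinel with equals() would be a
stricter test than comparing lengths; the length check is shown only
because that is what you described.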

> It doesn't [currently fail] because so far I am not processing form posts in
> Java servlets.
> This discussion started because I need to do it now, in relation with
> the same external application for which I posted some questions about
> BufferedInputStreamReader's and such a while ago.

Yup, I remember.

> Now I have the problem in reverse : the application gets input from an
> iso-8859-2 form, in iso-8859-2, but is interpreting it as iso-8859-1.
> I was just wondering if by changing the form to use the
> multipart/form-data encoding type, the servlet would "magically" realise
> the errors of his ways, and read the data properly.
> Apparently however, browsers and HTTP and Servlet Specs conspire to make
> my life difficult.

Yeah. Hey, if you're sure the data will be in ISO-8859-2, then I would
just use a filter like the one I posted (and you've already played with)
and call it a day. You can rant and rave about the specs all you want,
but it's not going to solve your problem :)
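
For what it's worth, here is a small sketch of what the ISO-8859-2 versus
ISO-8859-1 mix-up does to a string, and how the damage can be undone after
the fact. This is not the filter from my earlier post; the class and
method names are made up, and it assumes your JRE ships the ISO-8859-2
charset. The re-decoding trick only works because ISO-8859-1 maps every
byte value to exactly one character, so no information is lost.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not the filter discussed in this thread.
final class Latin2Repair {

    static final Charset LATIN1 = StandardCharsets.ISO_8859_1;
    static final Charset LATIN2 = Charset.forName("ISO-8859-2");

    // Simulate the bug: bytes sent as ISO-8859-2 but read as ISO-8859-1.
    static String misread(String original) {
        return new String(original.getBytes(LATIN2), LATIN1);
    }

    // Undo it: recover the original bytes, then decode them as ISO-8859-2.
    static String repair(String misdecoded) {
        return new String(misdecoded.getBytes(LATIN1), LATIN2);
    }

    public static void main(String[] args) {
        String original = "\u0151";          // o with double acute, byte 0xF5 in ISO-8859-2
        String garbled = misread(original);  // becomes o with tilde (U+00F5) in ISO-8859-1
        System.out.println(garbled.equals("\u00f5"));          // true
        System.out.println(repair(garbled).equals(original));  // true
    }
}
```

Setting the request character encoding in a filter before anything reads
the parameters is still the cleaner fix, since it avoids the wrong
decoding in the first place.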

Seriously, though, look into commons-fileupload if you want to actually
upload files (or even if you just really want to use multipart/form-data).

>> You can use commons-upload, which was intended to be used with file
>> uploads, and will probably read "simple" multipart/form-data fields as
>> well.
>>
> That's interesting, in a general sense.  I didn't know that one.  Where
> does it live ?

Sorry, I had the name (slightly) wrong and made an assumption that you'd
know what the heck "commons-foo" would mean. Apache commons is an Apache
site that hosts lots of small and super-useful Java libraries. The home
page is http://commons.apache.org/ (worth checking out everything they
have available) and commons-fileupload can be found here
http://commons.apache.org/fileupload/

> Unfortunately here, since I cannot modify the servlet, I'm stuck.
> But the setRequestCharacterEncoding filter will help in this case.

Hmm, if you can't modify the servlet you might be out of luck. Or, you
could always write a filter.... muhahaha!

> Ok, I found it. It is FileUpload, at http://commons.apache.org/fileupload/
> and it looks like Java may be as smart as perl after all ;-)

You could always switch to Python:

import Brain;

- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkm/A6YACgkQ9CaO5/Lv0PAhpACgpV4REQiO7u1cQHyLJ1nA8m5C
8isAoL6NpxeQyUGUR1/7rK3l0SAv3/FB
=FDtt
-----END PGP SIGNATURE-----

