Thanks to Michael, Michael, Lloyd, Cees,
your answers and insights have made things clearer for me.
I think I'll use a combination of all of that for this new application we're
writing.
In other words, to program "defensively", I propose to do this:
when sending the HTML page with the <form>:
- create the page and save it as UTF-8
- have the proper charset indications in it
- include a hidden test field with some known UTF-8 sequence (e.g. "ÄÖÜ")
- make sure that the application and the webserver send out the page with the proper
Content-type and charset (HTTP headers)
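The sending side of the steps above could look something like this small CGI sketch (the hidden-field name '_charset_test', the action URL and the page skeleton are invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;   # this source file is itself saved as UTF-8

# Sketch of the sending side: HTTP charset header, <meta> charset,
# and a hidden test field with a known UTF-8 sequence.
sub form_page {
    return <<"HTML";
Content-type: text/html; charset=UTF-8

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<form method="POST" action="/cgi-bin/receive.pl">
  <!-- hidden test field with a known UTF-8 sequence -->
  <input type="hidden" name="_charset_test" value="ÄÖÜ">
  <input type="text" name="comment">
  <input type="submit" value="Send">
</form>
</body>
</html>
HTML
}

binmode STDOUT, ':encoding(UTF-8)';
print form_page();
```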
But since we still don't know what the browser (and the user) will actually do
with this,
upon reception of the POST:
- get the test field and check how it was received
a) check whether its UTF-8 flag is already set (Encode::is_utf8(); probably not)
b) if not (a), check whether it at least contains the correct UTF-8 bytes
(six, not three: each of Ä, Ö, Ü encodes as two bytes in UTF-8)
c) if neither (a) nor (b), reject with an error (then we don't know what we received)
d) if not (a) but (b), set a flag 'must_decode'
- get the other parameters, and
- if the 'must_decode' flag is not set, leave them 'as is'
- if the flag is set, Encode::decode('utf8', ...) all received
parameters, except for file uploads (*)
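The checks (a) through (d) above could be sketched like this, using only the core Encode module (the sub name 'charset_policy' and its return values are invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Decide, from the received hidden test field, what to do with the
# other parameters.  Returns 'decoded' if they already arrived as
# character strings, 'must_decode' if they are raw UTF-8 bytes;
# dies if we can't tell what the browser sent.
sub charset_policy {
    my ($test) = @_;

    # (a) the UTF-8 flag is already set (unlikely with a plain CGI read)
    return 'decoded' if is_utf8($test);

    # (b) the raw bytes are correct: "ÄÖÜ" is six bytes in UTF-8
    #     (two per character), not the three bytes Latin-1 would give
    my $expected = "\xC3\x84\xC3\x96\xC3\x9C";
    return 'must_decode' if $test eq $expected;

    # (c) neither: reject, we don't know what it is
    die "charset test field is neither decoded nor valid UTF-8 bytes\n";
}

# (d) hypothetical usage with CGI.pm parameters:
#   my $policy = charset_policy( scalar $q->param('_charset_test') );
#   if ($policy eq 'must_decode') {
#       $params{$_} = decode('utf8', $params{$_}) for keys %params;
#       # ... skipping file-upload parameters
#   }
```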
That's of course in the hope that, some day, browsers will send multipart data with the
proper charset indication, and that CGI.pm will take it into account and do the right thing.
(*) An open question, though, is how a Polish browser would send the filename
attribute of an upload, assuming the name is originally something like
"Qualitätsübersicht.pdf"