Thanks to Michael, Michael, Lloyd, Cees,

your answers and insights have made things clearer for me.
I think I'll use a combination of all of that for this new application we're 
writing.

In other words, to program "defensively", I propose to do this:

when sending the html page with the <form> :
- create the page and save it as UTF-8
- have the proper charset indications in it
- include a hidden test field with some known UTF-8 sequence (e.g. "ÄÖÜ")
- make sure that the application and the webserver send out the page with the proper Content-type and charset (HTTP headers)
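The sending side could be sketched roughly like this (the field name "utf8_check" and the page layout are my own invention, just for illustration — note that the source file is deliberately *not* under "use utf8", so the "ÄÖÜ" literal stays as 6 raw UTF-8 bytes, which is exactly what we want to emit):

```perl
#!/usr/bin/perl
use strict;
use warnings;
# No "use utf8" on purpose: the literal below is kept as raw
# UTF-8 bytes (6 of them), not as 3 decoded characters.

my $TEST_VALUE = "ÄÖÜ";    # 6 bytes in UTF-8, would be 3 in Latin-1

sub build_form_page {
    # Repeat the charset indication inside the page itself,
    # and ask the browser to submit in UTF-8 via accept-charset.
    return <<"END_HTML";
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>
<body>
<form method="post" action="process.pl" accept-charset="UTF-8">
  <!-- hidden test field with a known UTF-8 sequence -->
  <input type="hidden" name="utf8_check" value="$TEST_VALUE">
  <input type="text" name="comment">
  <input type="submit">
</form>
</body>
</html>
END_HTML
}

# The HTTP header must announce the same charset as the page:
print "Content-Type: text/html; charset=UTF-8\r\n\r\n";
print build_form_page();
```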

But since we still don't know what the browser (and the user) will actually do 
with this,

upon reception of the POST :
- get the test field and check how it was received:
        a) check whether it has the UTF-8 flag set, via "is_utf8()" (probably not)
        b) if not (a), check whether it at least contains the correct UTF-8 bytes 
(6 of them, not 3)
        c) if neither (a) nor (b), reject with an error (we don't know what it is then)
        d) if not (a), but (b), then set a flag 'must_decode'

- get the other parameters, and
        - if the 'must_decode' flag is not set, leave them 'as is'
        - if the flag is set, Encode::decode('utf8',..) all received
                parameters, except for file uploads (*)
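The receiving side might then look like the sketch below (the helper names and the hashref-of-parameters shape are my own assumptions; a real script would pull the values out of CGI.pm first and keep file uploads out of the hash). I use the strict 'UTF-8' decoder here rather than the lax 'utf8' one, which is a small deviation from the Encode::decode('utf8',..) call above:

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Classify how the hidden test field arrived.
# Returns 'utf8-flagged' (case a), 'utf8-bytes' (case b, must decode),
# or dies (case c).  "\xC3\x84\xC3\x96\xC3\x9C" is "ÄÖÜ" as UTF-8 bytes.
sub check_test_field {
    my ($received) = @_;

    # (a) already carries the UTF-8 flag? (probably not)
    return 'utf8-flagged' if is_utf8($received);

    # (b) at least the correct UTF-8 byte sequence? 6 bytes, not 3
    return 'utf8-bytes' if $received eq "\xC3\x84\xC3\x96\xC3\x9C";

    # (c) neither: we don't know what we got
    die "test field arrived neither flagged nor as UTF-8 bytes\n";
}

# (d) if we only got raw UTF-8 bytes, decode every parameter ourselves.
# $params is a hashref of name => value (file uploads excluded upstream).
sub decode_params {
    my ($params) = @_;
    my $state = check_test_field( $params->{utf8_check} );
    if ( $state eq 'utf8-bytes' ) {
        # "values" aliases the hash values, so this decodes in place
        $_ = decode( 'UTF-8', $_ ) for values %$params;
    }
    return $params;
}
```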

That's of course in the hope that, some day, browsers will send multipart data with the proper charset indication, and that CGI.pm will take it into account and do the right thing.



(*) although an open question then is how e.g. a Polish browser would send the filename attribute, assuming the original name is something like "Qualitätsübersicht.pdf"