A. Pagaltzis wrote 2006-09-16 19:38 (+0200):
> * Darren Duncan <[EMAIL PROTECTED]> [2006-09-09 20:40]:
> > 4. Make UTF-8 the default HTTP response character encoding, and the
> > default declared charset for text/* MIME types, and explicitly
> > declare that this is what the charset is. The only time that output
> > should be anything else, even Latin-1, is if the programmer
> > specifies such.
> No, please don’t. For unknown MIME types, the charset should be
> undeclared. In particular, `application/octet-stream` should never
> have a charset forced on it if one is not assigned by the client code
> explicitly. Likewise, for `application/xml` and `application/*+xml`, a
> charset should NEVER be explicitly declared, as XML documents are
> self-describing, whereas declaring a charset forces using the charset
> declared in the HTTP header. This is very unwise (cf. Ruby’s
> Postulate).

Darren discussed the *default* encoding. Just as text/html is a nice
default for the MIME type, UTF-8 is a nice default encoding. Both should
be overridable.

My thoughts:

* Default Content-Type header of "text/html; charset=UTF-8".
* Default output encoding of UTF-8.
* When a new Content-Type is set, but no new encoding:
    * Keep the default output encoding of UTF-8
    * Warn if it's text/* without /charset=/
    * Use the specified charset as the output encoding
    * Change the output encoding to raw bytes if it's not text/*
* When a new Content-Type is set, and a new encoding is given:
    * Use the supplied encoding
    * Warn if it's text/* without /charset=/
    * Warn if supplied encoding and charset aren't equal enough

I think it's important to realise that only text/* types have a charset,
and that Content-Type is MIME type plus charset in one value. We
shouldn't be "clever" and separate these: they're one string.

For XML, you'd have to explicitly mention Content-Type and encoding,
because the encoding can no longer be taken from the Content-Type, and
the default for non-text/* is raw bytes.

> > 5. Similarly, default to trying to treat the HTTP request as
> > UTF-8 if it doesn't specify a character encoding; fallback to
> > Latin-1 only if the text parts of the HTTP request don't look
> > like valid UTF-8.
> This is not just unwise, it is actually wrong. Latin-1 is the
> default for `text/*` MIME types if no charset is declared. Using
> a different charset in violation of the HTTP RFCs is __BROKEN__.

Agreed.

> In fact, now that I’m writing all this out, I am starting to
> think that maybe CGI.pm6 should simply punt on charsets as CGI.pm
> does. Otherwise, the code and API would have to be able to deal
> with the full complexity of charsets in HTTP, and the docs would
> have to explain it, which is no picnic at all.

Simple schemes can always be documented equally simply.
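
The rules in the "My thoughts" list can be sketched as one small function. This is only an illustration, written in Python rather than Perl 6 so that it can be run today. The names `parse_content_type`, `same_encoding` and `resolve_encoding` are made up for this sketch and are not part of any proposed CGI.pm6 or Web::Response API. "Equal enough" is assumed here to mean "both names resolve to the same codec", and `None` stands for raw bytes.

```python
import codecs

DEFAULT_TYPE = "text/html; charset=UTF-8"


def parse_content_type(value):
    """Split a Content-Type value into (mime_type, charset or None)."""
    mime_type, *params = [part.strip() for part in value.split(";")]
    charset = None
    for param in params:
        name, _, val = param.partition("=")
        if name.strip().lower() == "charset" and val:
            charset = val.strip().strip('"')
    return mime_type.lower(), charset


def same_encoding(a, b):
    """Assumed meaning of "equal enough": both names map to one codec."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        # Unknown to Python's codec registry: fall back to a plain compare.
        return a.lower() == b.lower()


def resolve_encoding(content_type=DEFAULT_TYPE, encoding=None):
    """Return (output_encoding, warnings); None means raw bytes."""
    mime_type, charset = parse_content_type(content_type)
    is_text = mime_type.startswith("text/")
    warnings = []

    # Both branches of the list warn about text/* without a charset.
    if is_text and charset is None:
        warnings.append("text/* type without charset")

    # An explicitly supplied encoding always wins.
    if encoding is not None:
        if charset is not None and not same_encoding(encoding, charset):
            warnings.append("encoding and charset differ")
        return encoding, warnings

    # No explicit encoding: take it from the charset if there is one,
    # otherwise UTF-8 for text/* and raw bytes for everything else.
    if charset is not None:
        return charset, warnings
    return ("UTF-8" if is_text else None), warnings
```

With this sketch, `resolve_encoding("application/xml")` yields raw bytes, while `resolve_encoding("application/xml", "UTF-8")` yields UTF-8, which is the XML case described above: the encoding has to be named explicitly because it cannot be taken from the Content-Type.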
A first attempt at such documentation:

    The default value for the C<Content-Type> header is
    C<text/html; charset=UTF-8>.

    The encoding that $module uses for output data is taken from the
    C<charset> attribute in the C<Content-Type> header. If there is no
    charset in the C<Content-Type> header, UTF-8 is used for all text/*
    types, and raw for everything else.

    It is possible to explicitly force an output encoding. When you're
    not sending a text/* document, you need to do this if the document
    does contain text. This is the case with most XML formats.

        $response1.type = 'text/html; charset=iso-8859-1';
        # implies: $response1.encoding = 'iso-8859-1';

        $response2.type = 'application/xml';
        $response2.encoding = 'UTF-8';

        my $response3 = Web::Response.new :type('text/html; charset=iso-8859-1');

        my $response4 = Web::Response.new :type<application/xml>, :encoding<UTF-8>;

-- 
kind regards,

juerd waalboer: perl hacker <[EMAIL PROTECTED]> <http://juerd.nl/sig>
convolution: ict solutions and consultancy <[EMAIL PROTECTED]>