Gisle,

Gisle Aas wrote:
> Jacques Deguest <[EMAIL PROTECTED]> writes:
>
>> One thing too that would be really nice to have is a convenient method to get the content charset. Right now, as far as I can tell, one needs to split the content-type to get it, either from the HTTP header if it exists or from the document meta information.
>
> Both will end up in the headers of the HTTP::Response, since LWP by default will parse out extra headers from the <head> of HTML content.

Yes, but still the developer needs to parse it to get the content-charset. I was thinking it would have been really nice to have a content_charset sub like there is a content_encoding one.

> One problem with this is that we might end up with duplicate headers. E.g.:
>
>   $ GET -Sed http://www.activestate.com | grep Content-Type
>   Content-Type: text/html
>   Content-Type: text/html; charset=iso-8859-1
>
> The first one is the one in the real response headers while the second was picked up from the:
>
>   <meta HTTP-EQUIV="Content-Type" content="text/html; charset=iso-8859-1" />
>
> tag of the HTML document returned. This is just annoying. I have not found a good way to avoid this yet.

I am not sure this is the best solution, but I take the one that contains the most information, since the HTTP headers usually contain less information than the HTML meta information.
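
To make the "take the one with the most information" idea concrete, here is a rough sketch of what I do. The helper name richest_content_type is made up for illustration and is not part of LWP. It assumes that "most information" simply means "most parameters after the media type", which may well be too naive.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP): given every Content-Type value
# seen on a response, return the one that carries the most parameters.
# On a tie the earliest value wins; an empty list yields undef.
sub richest_content_type {
    my @values = @_;
    my ($best, $best_count) = (undef, -1);
    for my $value (@values) {
        # Count the "; name=" parameter introducers after the media type.
        my $count = () = $value =~ /;\s*[^\s;=]+\s*=/g;
        ($best, $best_count) = ($value, $count) if $count > $best_count;
    }
    return $best;
}

# The two values from the activestate.com example above.
my @seen = ('text/html', 'text/html; charset=iso-8859-1');
print richest_content_type(@seen), "\n";
```

With a real response object, the list of values would presumably come from calling $r->header('Content-Type') in list context.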
Or you could change the behavior of HTTP::Headers::content_type and have it return a list of array references (each containing the broken-down tokens of one Content-Type value) in list context, or the first array reference in scalar context. But I fear this would not be very backward compatible.
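
Something along these lines is what I have in mind for the list/scalar behavior. This is only a sketch: content_type_tokens is a hypothetical standalone function taking the raw header values, not the actual HTTP::Headers method. It splits naively on ";", so a quoted parameter value containing a semicolon would be broken apart.

```perl
use strict;
use warnings;

# Hypothetical sketch of the proposed return convention: one array
# reference of tokens per Content-Type value in list context, and only
# the first array reference in scalar context.
sub content_type_tokens {
    my @parsed;
    for my $value (@_) {
        # Naive split on ";" (assumption: no quoted semicolons).
        my @tokens = split /\s*;\s*/, $value;
        s/^\s+|\s+$//g for @tokens;    # trim stray whitespace
        push @parsed, \@tokens;
    }
    return wantarray ? @parsed : $parsed[0];
}

# List context: every header value, broken down.
my @all = content_type_tokens('text/html', 'text/html; charset=iso-8859-1');

# Scalar context: only the first one.
my $first = content_type_tokens('text/html; charset=utf-8');
print join(' | ', @$first), "\n";
```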
> Let's imagine a method like:
>
>   $r->content_charset();         # get
>   $r->content_charset("utf-8");  # set
>
> that extracts or updates the charset parameter of the Content-Type field.

Since working with Unicode makes it very convenient to work with many different charsets and to write some nice regular expressions such as \p{InKatakana}, we still need to tell Perl which charset to use to decode the data. I do not think there would be much overhead, since you are already processing the content-type field somewhere in your code.

> RFC 2616 says:
>
>   The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.
>
> so I think it would make sense for $r->content_charset() to return "ISO-8859-1" when there is no explicit charset for "text/*" content.
I agree.
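
For what it is worth, the getter side could be sketched roughly as below. This is a plain function working on the raw Content-Type string rather than a real method on the response object, and the name and regular expressions are only my assumptions about how it might look.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed $r->content_charset() getter.
# It takes the raw Content-Type value instead of a response object.
sub content_charset {
    my ($content_type) = @_;
    return unless defined $content_type && length $content_type;

    # An explicit charset parameter wins; the value may be quoted.
    if ($content_type =~ /;\s*charset\s*=\s*"?([^\s;"]+)"?/i) {
        return $1;
    }

    # Default suggested by RFC 2616 for text/* received via HTTP.
    return 'ISO-8859-1' if $content_type =~ m{^\s*text/}i;

    # Any other media type without a charset: nothing to report.
    return;
}

for my $ct ('text/html; charset=utf-8', 'text/html', 'image/png') {
    my $charset = content_charset($ct);
    printf "%-26s => %s\n", $ct, defined $charset ? $charset : '(none)';
}
```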
> The main trouble I have with a method like this is what to do when a charset is set when there is no previous Content-Type field present. We could assume "text/plain" or something but it does not feel right.

A charset needs to be associated with a content type, right? If no content type is defined, the charset should be ignored, since its parent information is missing. I would not try to be too smart about this, or it may very well lead to unexpected results from the developer's standpoint.
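
In code, the "ignore it when there is no content type" behavior might look like the sketch below. Again, set_content_charset is a hypothetical function on the raw header string, only meant to illustrate the rule, not a proposal for the actual implementation.

```perl
use strict;
use warnings;

# Hypothetical setter sketch: returns the updated Content-Type value.
# When there is no content type to attach the parameter to, the input
# is returned unchanged and the charset is silently ignored.
sub set_content_charset {
    my ($content_type, $charset) = @_;
    return $content_type
        unless defined $content_type && $content_type =~ /\S/;

    # Replace an existing charset parameter if there is one...
    if ($content_type =~ s/(;\s*charset\s*=\s*)"?[^\s;"]+"?/$1$charset/i) {
        return $content_type;
    }

    # ...otherwise append a new one.
    return "$content_type; charset=$charset";
}

print set_content_charset('text/html; charset=iso-8859-1', 'utf-8'), "\n";
print set_content_charset('text/html', 'utf-8'), "\n";
```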
--
Kind Regards,
Jacques Deguest,