Gisle,

Gisle Aas wrote:

Jacques Deguest <[EMAIL PROTECTED]> writes:

One more thing that would be really nice to have is a convenient method
to get the content charset. Right now, as far as I can tell, one needs
to split the Content-Type value to get it, either from the HTTP header
if it exists or from the document meta information.
Both will end up in the headers of the HTTP::Response since LWP by
default will parse out extra headers from the <head> of HTML content.
One problem with this is that we might end up with duplicate
headers. E.g.:

  $ GET -Sed http://www.activestate.com | grep Content-Type
  Content-Type: text/html
  Content-Type: text/html; charset=iso-8859-1

The first one is the one in the real response headers while the second
was picked up from the:

  <meta HTTP-EQUIV="Content-Type" content="text/html; charset=iso-8859-1" />
Yes, but the developer still needs to parse it to get the content charset. I was thinking it would be really nice to have a content_charset sub, like the existing content_encoding one.

tag of the HTML document returned.  This is just annoying.  I have not
found a good way to avoid this yet.
I am not sure this is the best solution, but I take the one that contains the most information, since the HTTP headers usually contain less information than the HTML meta information.
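
A minimal sketch of that choice in plain Perl. pick_content_type is a hypothetical helper, not part of LWP, and "most information" is assumed here to mean "most parameters after the media type":

```perl
use strict;
use warnings;

# Hypothetical helper, not part of LWP: given every Content-Type value
# found on a response, return the one carrying the most parameters.
# On a tie the earliest value wins, i.e. the real HTTP header.
sub pick_content_type {
    my @values = @_;
    my ($best, $best_count);
    for my $value (@values) {
        # Everything after the first ";" is a parameter such as charset=...
        my @parts = split /\s*;\s*/, $value;
        my $count = @parts - 1;
        if (!defined $best or $count > $best_count) {
            ($best, $best_count) = ($value, $count);
        }
    }
    return $best;
}

print pick_content_type(
    'text/html',
    'text/html; charset=iso-8859-1',
), "\n";
```

With a real response, the list of values could come from $r->header('Content-Type') called in list context.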

Or you could change the behavior of HTTP::Headers::content_type and have it return a list of array references (each holding the broken-down tokens of one Content-Type value) in list context, or the first array reference in scalar context. But I fear this would not be very backward compatible.

Working with Unicode makes it very convenient to handle many
different charsets and to write some nice regexes such as \p{InKatakana},
but we still need to tell Perl which charset to use to decode the data.
I do not think there would be much overhead since you are already
processing the content-type field somewhere in your code.
Let's imagine a method like:

   $r->content_charset();         # get
   $r->content_charset("utf-8");  # set

that extracts or updates the charset parameter of the Content-Type
field.

RFC 2616 says:

   The "charset" parameter is used with some media types to define the
   character set (section 3.4) of the data. When no explicit charset
   parameter is provided by the sender, media subtypes of the "text"
   type are defined to have a default charset value of "ISO-8859-1" when
   received via HTTP. Data in character sets other than "ISO-8859-1" or
   its subsets MUST be labeled with an appropriate charset value. See
   section 3.4.1 for compatibility problems.

so I think it would make sense for $r->content_charset() to return
"ISO-8859-1" when there is no explicit charset for "text/*" content.
I agree.
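
To make the proposed default concrete, here is a rough sketch of the getter side. charset_of is a hypothetical stand-alone function working on a single Content-Type string, not an existing HTTP::Headers method:

```perl
use strict;
use warnings;

# Hypothetical helper: extract the charset parameter from one
# Content-Type value. For text/* without an explicit charset it falls
# back to ISO-8859-1, following the RFC 2616 text quoted above.
sub charset_of {
    my ($content_type) = @_;
    return undef unless defined $content_type && length $content_type;
    # Accept both charset=utf-8 and charset="utf-8".
    if ($content_type =~ /;\s*charset\s*=\s*"?([^\s;"]+)"?/i) {
        return $1;
    }
    return 'ISO-8859-1' if $content_type =~ m{^\s*text/}i;
    return undef;    # non-text type with no charset: nothing to report
}

for my $ct ('text/html; charset=utf-8', 'text/html', 'image/png') {
    my $charset = charset_of($ct);
    printf "%-26s => %s\n", $ct, defined $charset ? $charset : '(none)';
}
```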

The main trouble I have with a method like this is what to do when a
charset is set when there is no previous Content-Type field present.
We could assume "text/plain" or something but it does not feel right.
A charset needs to be associated with a content type, right? If no Content-Type is defined, the charset should be ignored, since its parent information is missing. I would not try to be too smart about this, or it may very well lead to unexpected results from the developer's standpoint.
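
In code, the "ignore it" behavior I have in mind could look roughly like this. with_charset is a hypothetical helper that works on a plain string, only meant to illustrate the setter semantics:

```perl
use strict;
use warnings;

# Hypothetical helper: return $content_type with its charset parameter
# set to $charset. When there is no media type to attach the charset to,
# the value is returned unchanged, i.e. the request is silently ignored.
sub with_charset {
    my ($content_type, $charset) = @_;
    return $content_type
        unless defined $content_type && $content_type =~ m{\S};
    # Replace an existing charset parameter in place...
    return $content_type
        if $content_type =~ s/(;\s*charset\s*=\s*)"?[^\s;"]+"?/$1$charset/i;
    # ...or append one if none was present.
    return "$content_type; charset=$charset";
}

print with_charset('text/html', 'utf-8'), "\n";
print with_charset('text/html; charset=iso-8859-1', 'utf-8'), "\n";
print defined with_charset(undef, 'utf-8') ? "set\n" : "ignored\n";
```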

--
Kind Regards,
Jacques Deguest,

