Re: Doing character encoding/decoding within libwww?
On Sun, Sep 23, 2007 at 01:22:21AM +0200, Bjoern Hoehrmann wrote:
> * Bill Moseley wrote:
> >sub decoded_content {
> >
> >
> >$content_ref = \Encode::decode($charset, $$content_ref,
> >    Encode::FB_CROAK() | Encode::LEAVE_SRC());
>
> The documentation I re-read earlier even says that... This is still a
> far cry from being generally useful though, it only works for text/*
> and only if the encoding is specified in the header, or the format does
> not use some kind of inline label that is inconsistent with the default.
> Most of the time this is not the case, however.

It will also find content-type in the markup, IIRC. It's been a long day.

What other mime types are you thinking of other than text/*?

--
Bill Moseley
[EMAIL PROTECTED]
Re: Doing character encoding/decoding within libwww?
* Bill Moseley wrote:
>sub decoded_content {
>
>
>$content_ref = \Encode::decode($charset, $$content_ref,
>    Encode::FB_CROAK() | Encode::LEAVE_SRC());

The documentation I re-read earlier even says that... This is still a
far cry from being generally useful though, it only works for text/*
and only if the encoding is specified in the header, or the format does
not use some kind of inline label that is inconsistent with the default.
Most of the time this is not the case, however.
--
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Re: Doing character encoding/decoding within libwww?
On Sat, Sep 22, 2007 at 11:50:53PM +0200, Bjoern Hoehrmann wrote:
> * Bill Moseley wrote:
> >On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:
> >> For most uses of libwww, developers do little with character encoding.
> >> Indeed, for general-case use of LWP::Simple, they can't, because that
> >> information isn't even exposed. Has any thought gone into doing this
> >> internally within libwww, so that when I fetch content, I get back text
> >> instead of octets?
> >
> >If you have the response object:
> >
> >$response->decoded_content;
>
> That removes content encodings like gzip and deflate, but David is
> asking about character encodings like utf-8 and iso-8859-1. Content
> encodings are applied after character encodings.

sub decoded_content {

    $content_ref = \Encode::decode($charset, $$content_ref,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());

--
Bill Moseley
[EMAIL PROTECTED]
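For readers unfamiliar with the two flags in the excerpt above, here is a minimal standalone sketch of what they do. It uses only the core Encode module; the sample octet strings and the fixed `$charset` value are made up for illustration, and this is not the surrounding libwww code.

```perl
use strict;
use warnings;
use Encode ();

my $charset = 'UTF-8';                  # assumed here; libwww would derive it from the response
my $good    = "caf\xC3\xA9";            # valid UTF-8 octets (5 octets, 4 characters)
my $bad     = "caf\xE9 au lait";        # ISO-8859-1 octets, malformed when read as UTF-8

# FB_CROAK: die on malformed input instead of substituting characters.
# LEAVE_SRC: do not modify the source octets while decoding.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $text = Encode::decode($charset, $good, $check);
printf "decoded %d characters, source still has %d octets\n",
    length $text, length $good;

# The malformed string makes decode() die, so trap the exception.
my $result = eval { Encode::decode($charset, $bad, $check) };
my $error  = $@;
print "decode failed as expected\n" if !defined $result && $error;
```

Without LEAVE_SRC, a decode call with a check value may modify the source string, which is presumably why the excerpt sets it when decoding through a reference.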
Re: Doing character encoding/decoding within libwww?
* Bill Moseley wrote:
>On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:
>> For most uses of libwww, developers do little with character encoding.
>> Indeed, for general-case use of LWP::Simple, they can't, because that
>> information isn't even exposed. Has any thought gone into doing this
>> internally within libwww, so that when I fetch content, I get back text
>> instead of octets?
>
>If you have the response object:
>
>$response->decoded_content;

That removes content encodings like gzip and deflate, but David is
asking about character encodings like utf-8 and iso-8859-1. Content
encodings are applied after character encodings.
--
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
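The layering described above (character encoding first, content encoding on top of it) can be illustrated with a small sketch. It uses only core modules (Encode, IO::Compress::Gzip, IO::Uncompress::Gunzip) and does no HTTP at all; the sample string and variable names are invented for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Sender side: characters -> octets (character encoding),
# then octets -> compressed octets (content encoding).
my $text   = "Bj\x{f6}rn";                 # 5 characters
my $octets = encode('UTF-8', $text);        # 6 octets, the o-umlaut takes two
my $wire;
gzip(\$octets => \$wire) or die "gzip failed: $GzipError";

# Receiver side: undo the two steps in reverse order.
my $unzipped;
gunzip(\$wire => \$unzipped) or die "gunzip failed: $GunzipError";
my $octet_count = length $unzipped;
my $chars = decode('UTF-8', $unzipped, Encode::FB_CROAK | Encode::LEAVE_SRC);

printf "%d octets, %d characters\n", $octet_count, length $chars;
# prints "6 octets, 5 characters"
```

Removing only the gzip layer leaves the caller with octets, which is the situation the original question is about.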
Re: Doing character encoding/decoding within libwww?
On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:
> For most uses of libwww, developers do little with character encoding.
> Indeed, for general-case use of LWP::Simple, they can't, because that
> information isn't even exposed. Has any thought gone into doing this
> internally within libwww, so that when I fetch content, I get back text
> instead of octets?

If you have the response object:

$response->decoded_content;

--
Bill Moseley
[EMAIL PROTECTED]
Re: Doing character encoding/decoding within libwww?
* David Nesting wrote:
>For most uses of libwww, developers do little with character encoding.
>Indeed, for general-case use of LWP::Simple, they can't, because that
>information isn't even exposed. Has any thought gone into doing this
>internally within libwww, so that when I fetch content, I get back text
>instead of octets?

Generally speaking, this is rather difficult as some content may not be
textual at all, and textual formats vary in how applications are to
detect the encoding (e.g., XML has different rules than HTML, text/plain
has no rules beyond looking at the charset parameter, and so on). If you
want a general-purpose solution, a good start would be a module taking
an HTTP::Response object and detecting the encoding, possibly decoding
it on request.

>I'd be happy to help work on some of this, but the fact that I see no
>use of character encodings within libwww makes me wonder if this is more
>of a policy decision not to do it.

There was a bit of a discussion to somehow use HTML::Encoding for some
parts of it, which pretty much solves the problem for HTML and XML, cf.
the list archives. Help on improving HTML::Encoding would be welcome, I
have little time to work on it at the moment.
--
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
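To make the "different rules per format" point concrete, here is a rough sketch of the kind of per-format dispatch such a module would need. The function `guess_charset` is hypothetical: it is not part of libwww or HTML::Encoding, it takes a raw Content-Type header value and the body octets rather than an HTTP::Response object, and its regexes are deliberate simplifications that ignore byte order marks, UTF-16 documents and many edge cases.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only.
sub guess_charset {
    my ($content_type, $octets) = @_;
    $content_type = '' unless defined $content_type;

    # 1. A charset parameter in the header is the only rule text/plain has.
    return lc $1 if $content_type =~ /;\s*charset\s*=\s*"?([^\s";]+)/i;

    my ($media_type) = $content_type =~ m{^\s*([^\s;]+)};
    $media_type = lc(defined $media_type ? $media_type : '');

    # 2. XML formats: look at the encoding label in the XML declaration.
    if ($media_type =~ m{(?:^|[/+])xml$}) {
        return lc $1
            if $octets =~ /^<\?xml[^>]*?\bencoding\s*=\s*["']([\w.:-]+)["']/;
    }

    # 3. HTML: look for a <meta> charset label near the start of the body.
    if ($media_type eq 'text/html') {
        my $head = substr $octets, 0, 1024;
        return lc $1 if $head =~ /<meta[^>]+charset\s*=\s*["']?([\w.:-]+)/i;
    }

    # 4. No label found: leave the decision to the caller.
    return;
}

my $html = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
my $xml  = '<?xml version="1.0" encoding="UTF-16"?><a/>';

print guess_charset('text/html; charset=ISO-8859-1', ''), "\n";
print guess_charset('text/html', $html), "\n";
print guess_charset('application/xml', $xml), "\n";
print defined guess_charset('text/plain', 'hello') ? "found" : "unknown", "\n";
```

A caller could pass the result to Encode::decode when it is defined, and keep the octets otherwise.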
Doing character encoding/decoding within libwww?
For most uses of libwww, developers do little with character encoding.
Indeed, for general-case use of LWP::Simple, they can't, because that
information isn't even exposed. Has any thought gone into doing this
internally within libwww, so that when I fetch content, I get back text
instead of octets?

I'd be happy to help work on some of this, but the fact that I see no
use of character encodings within libwww makes me wonder if this is more
of a policy decision not to do it.

David