Re: Doing character encoding/decoding within libwww?

2007-09-22 Thread Bill Moseley
On Sun, Sep 23, 2007 at 01:22:21AM +0200, Bjoern Hoehrmann wrote:
> * Bill Moseley wrote:
> >sub decoded_content {
> >
> >
> >$content_ref = \Encode::decode($charset, $$content_ref,
> >   Encode::FB_CROAK() | Encode::LEAVE_SRC());
> 
> The documentation I re-read earlier even says that... This is still a
> far cry from being generally useful, though: it only works for text/*
> and only if the encoding is specified in the header, or the format does
> not use some kind of inline label that is inconsistent with the default.
> Most of the time this is not the case, however.

It will also look for a content-type declaration in the markup, IIRC.

It's been a long day.  What other mime types are you thinking of other
than text/*?

-- 
Bill Moseley
[EMAIL PROTECTED]



Re: Doing character encoding/decoding within libwww?

2007-09-22 Thread Bjoern Hoehrmann
* Bill Moseley wrote:
>sub decoded_content {
>
>
>$content_ref = \Encode::decode($charset, $$content_ref,
>   Encode::FB_CROAK() | Encode::LEAVE_SRC());

The documentation I re-read earlier even says that... This is still a
far cry from being generally useful, though: it only works for text/*
and only if the encoding is specified in the header, or the format does
not use some kind of inline label that is inconsistent with the default.
Most of the time this is not the case, however.
-- 
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: Doing character encoding/decoding within libwww?

2007-09-22 Thread Bill Moseley
On Sat, Sep 22, 2007 at 11:50:53PM +0200, Bjoern Hoehrmann wrote:
> * Bill Moseley wrote:
> >On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:
> >> For most uses of libwww, developers do little with character encoding.
> >> Indeed, for general-case use of LWP::Simple, they can't, because that
> >> information isn't even exposed.  Has any thought gone into doing this
> >> internally within libwww, so that when I fetch content, I get back text
> >> instead of octets?
> >
> >If you have the response object:
> >
> >$response->decoded_content;
> 
> That removes content encodings like gzip and deflate, but David is
> asking about character encodings like utf-8 and iso-8859-1. Content
> encodings are applied after character encodings.

sub decoded_content {


$content_ref = \Encode::decode($charset, $$content_ref,
   Encode::FB_CROAK() | Encode::LEAVE_SRC());
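For readers unfamiliar with the two flags in that excerpt, here is a minimal sketch of what they do, using only the core Encode module. The sample strings and variable names are assumptions made for illustration and do not come from libwww.

```perl
use strict;
use warnings;
use Encode ();

# The same CHECK value as in the excerpt above: die on malformed input,
# and do not modify the source scalar while decoding.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# "été" as UTF-8 octets (5 octets, 3 characters once decoded).
my $octets = "\xC3\xA9t\xC3\xA9";
my $text   = Encode::decode('UTF-8', $octets, $check);

# "été" as ISO-8859-1 octets: not valid UTF-8, so decode() croaks.
my $latin1 = "\xE9t\xE9";
my $failed = eval { Encode::decode('UTF-8', $latin1, $check); 1 } ? 0 : 1;
my $error  = $@;

printf "decoded %d characters from %d octets\n", length $text, length $octets;
print  "mislabelled input was rejected\n" if $failed;
```

Because of FB_CROAK, a wrong charset label surfaces as an exception rather than as silently corrupted text, so a caller has to wrap the decode in eval or otherwise handle the failure.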




Re: Doing character encoding/decoding within libwww?

2007-09-22 Thread Bjoern Hoehrmann
* Bill Moseley wrote:
>On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:
>> For most uses of libwww, developers do little with character encoding.
>> Indeed, for general-case use of LWP::Simple, they can't, because that
>> information isn't even exposed.  Has any thought gone into doing this
>> internally within libwww, so that when I fetch content, I get back text
>> instead of octets?
>
>If you have the response object:
>
>$response->decoded_content;

That removes content encodings like gzip and deflate, but David is
asking about character encodings like utf-8 and iso-8859-1. Content
encodings are applied after character encodings.
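The layering described here can be sketched with core Perl modules only. This is an illustration under assumptions, not libwww code: the sample string is made up, and gzip stands in for whatever content encoding a server might apply.

```perl
use strict;
use warnings;
use Encode ();
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Sender side: characters -> octets (character encoding),
# then octets -> compressed octets (content encoding).
my $original = "Bj\x{F6}rn";                          # 5 characters
my $encoded  = Encode::encode('UTF-8', $original);    # 6 octets
my $wire;
gzip(\$encoded => \$wire) or die "gzip failed: $GzipError";

# Receiver side: undo the layers in reverse order.
my $octets;
gunzip(\$wire => \$octets) or die "gunzip failed: $GunzipError";
my $received = Encode::decode('UTF-8', $octets,
    Encode::FB_CROAK() | Encode::LEAVE_SRC());

print "round trip ok\n" if $received eq $original;
```

Removing only the content encoding leaves the caller with $octets, which is still a byte string; the text David is asking for only exists after the second step.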


Re: Doing character encoding/decoding within libwww?

2007-09-22 Thread Bill Moseley
On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:
> For most uses of libwww, developers do little with character encoding.
> Indeed, for general-case use of LWP::Simple, they can't, because that
> information isn't even exposed.  Has any thought gone into doing this
> internally within libwww, so that when I fetch content, I get back text
> instead of octets?

If you have the response object:

$response->decoded_content;




Re: Doing character encoding/decoding within libwww?

2007-09-22 Thread Bjoern Hoehrmann
* David Nesting wrote:
>For most uses of libwww, developers do little with character encoding.
>Indeed, for general-case use of LWP::Simple, they can't, because that
>information isn't even exposed.  Has any thought gone into doing this
>internally within libwww, so that when I fetch content, I get back text
>instead of octets?

Generally speaking, this is rather difficult as some content may not be
textual at all, and textual formats vary in how applications are to
detect the encoding (e.g., XML has different rules than HTML, text/plain
has no rules beyond looking at the charset parameter, and so on). If you
want a general-purpose solution, a good start would be a module taking
an HTTP::Response object and detecting the encoding, possibly decoding
it on request.
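To make the per-format differences concrete, here is a rough sketch of such a detection step in core Perl. The function name sniff_charset and its two-argument interface are hypothetical, and the rules are deliberately simplified; it is not how HTML::Encoding or libwww works, and the real specifications have more cases than the regular expressions below cover.

```perl
use strict;
use warnings;

# Hypothetical helper: guess a charset label from a Content-Type header
# value and the body octets. Returns undef when nothing is found.
sub sniff_charset {
    my ($content_type, $octets) = @_;
    $content_type = '' unless defined $content_type;

    # 1. An explicit charset parameter in the HTTP header.
    if ($content_type =~ /;\s*charset\s*=\s*"?([^\s";]+)/i) {
        return lc $1;
    }

    # 2. A byte order mark at the start of the body.
    return 'utf-8'    if $octets =~ /\A\xEF\xBB\xBF/;
    return 'utf-16be' if $octets =~ /\A\xFE\xFF/;
    return 'utf-16le' if $octets =~ /\A\xFF\xFE/;

    # 3. Format-specific inline labels, chosen by media type.
    my ($type) = $content_type =~ m{\A\s*([^\s;]+)};
    $type = defined $type ? lc $type : '';
    my $head = substr($octets, 0, 1024);

    if ($type =~ m{\A(?:text|application)/(?:[\w.-]+\+)?xml\z}) {
        return lc $1
            if $head =~ /\A<\?xml[^>]*?\bencoding\s*=\s*["']([\w.:-]+)["']/;
        return 'utf-8';    # XML default without a declaration or BOM
    }
    if ($type eq 'text/html') {
        return lc $1
            if $head =~ /<meta\b[^>]*?\bcharset\s*=\s*["']?([\w.:-]+)/i;
    }
    return;                # text/plain and others: no inline rules
}

my $label = sniff_charset('application/xml',
    '<?xml version="1.0" encoding="ISO-8859-1"?><a/>');
print "detected: $label\n";
```

A module built around HTTP::Response would pass in the Content-Type header and the body with content encodings already removed, then hand the detected label to Encode::decode only when the caller asks for text.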

>I'd be happy to help work on some of this, but the fact that I see no
>use of character encodings within libwww makes me wonder if this is more
>of a policy decision not to do it.

There was some discussion about using HTML::Encoding for parts of this,
which pretty much solves the problem for HTML and XML; cf. the list
archives. Help on improving HTML::Encoding would be welcome, as I have
little time to work on it at the moment.


Doing character encoding/decoding within libwww?

2007-09-22 Thread David Nesting
For most uses of libwww, developers do little with character encoding.
Indeed, for general-case use of LWP::Simple, they can't, because that
information isn't even exposed.  Has any thought gone into doing this
internally within libwww, so that when I fetch content, I get back text
instead of octets?

I'd be happy to help work on some of this, but the fact that I see no use of
character encodings within libwww makes me wonder if this is more of a
policy decision not to do it.

David