Re: Doing character encoding/decoding within libwww?

David Nesting Sun, 23 Sep 2007 08:59:00 -0700

On 9/22/07, Bill Moseley <[EMAIL PROTECTED]> wrote:
>
> It's been a long day.  What other mime types are you thinking of other
> than text/*?
>


The most complete implementation imaginable would start with at least these:

text/html (html-specific rules)
text/xml (xml-specific rules)
text/* (general-purpose text rules)
application/*+xml (xml-specific rules)

You'd probably also want this to be extensible, so that I can add my own
media types at run-time to guarantee my non-obvious textual media type is
handled properly.
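
A minimal sketch of what such an extensible table might look like, in Python purely for illustration (LWP itself is Perl). The rule names "html", "xml" and "text", and the functions register and rule_for, are hypothetical and do not correspond to anything in libwww:

```python
from fnmatch import fnmatchcase

# Hypothetical rule table, most specific patterns first.
# The rule names are placeholders for "which decoding rules to apply".
RULES = [
    ("text/html", "html"),
    ("text/xml", "xml"),
    ("application/*+xml", "xml"),
    ("text/*", "text"),
]


def register(pattern, rule, rules=RULES):
    """Add a caller-supplied media type ahead of the built-in rules."""
    rules.insert(0, (pattern.lower(), rule))


def rule_for(media_type, rules=RULES):
    """Return the rule name for a media type, or None to leave it as octets."""
    # Drop any parameters such as "; charset=..." before matching.
    media_type = media_type.split(";", 1)[0].strip().lower()
    for pattern, rule in rules:
        if fnmatchcase(media_type, pattern):
            return rule
    return None
```

With this, a call such as register("application/x-mytext", "text") would be the run-time hook for a non-obvious textual type; anything that matches no pattern is treated as opaque octets.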

On the other hand, I'm less convinced now that dipping into the HTML or XML
content to figure out the proper encoding is necessarily the proper thing to
do here.  My complaint about LWP::Simple was that the HTTP Content-Type
(charset) information is lost by the time it gets to the caller.  If the
data hasn't been decoded to text by that point, it never reliably will be.  But for
HTML and XML, if the character encoding is actually specified in the
content rather than in the HTTP headers, then it isn't as important to deal
with it up front.  I could see a case then for dealing with text/* only and
returning octets for everything else, since text/* is the only media type
that has character encoding details in the HTTP headers.  That being said,
applications based on LWP::Simple are likely to work better with HTML and
XML "assistance" for the reason I gave earlier: users of LWP::Simple
probably aren't going to take the time to do the proper parsing and
decoding.  Yes, it's still "their fault" for not coding a robust
application, but helping them do that is, I think, still a valid goal, if we
can do it safely.
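
The narrower text/*-only option could be sketched roughly as follows, again in Python for illustration rather than as a proposed LWP patch. The function name decode_body is hypothetical, and the ISO-8859-1 fallback is an assumption based on the HTTP/1.1 default for text types without a charset parameter:

```python
from email.message import Message


def decode_body(content_type, body, default_charset="iso-8859-1"):
    """Decode text/* bodies using the header charset; return octets otherwise."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if not media_type.startswith("text/"):
        return body  # not text/*: hand back the raw octets untouched

    # Let the standard library parse the charset parameter (handles quoting).
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or default_charset

    try:
        return body.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: safer to return octets than to guess.
        return body
```

Falling back to octets on any failure is one way to keep this "safe": the caller never receives text that was decoded under a guessed encoding.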

David
