Re: [Python-Dev] Bytes path support

Isaac Morland Sat, 23 Aug 2014 08:28:24 -0700

On Sat, 23 Aug 2014, Marko Rauhamaa wrote:

"Stephen J. Turnbull" <[email protected]>:

Just read as bytes and decode piecewise in one way or another. For
Oleg's HTML case, there's a well-understood structure that can be used
to determine retry points


HTML and XML are interesting examples since their encoding is initially
unknown:

 <?xml version="1.0"?>
                     ^
                     +--- Now I know it is UTF-8

 <?xml version="1.0" encoding="UTF-16"?>
                                     ^
                                     +--- Now I know it was UTF-16
                                          all along!

Then we have:


 HTTP/1.1 200 OK
 Content-Type: text/html; charset=ISO-8859-1

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <html>
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=utf-16">

See how deep you have to parse the TCP stream before you realize the
content encoding is UTF-16.


For HTML it's not quite so bad.  According to the HTML 4 standard:

http://www.w3.org/TR/html4/charset.html

The Content-Type header takes precedence over a <meta> element. I thought I read once that the reason was to allow proxy servers to transcode documents but I don't have a cite for that. Also, the <meta> element "must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters" so the initial UTF-16 example wouldn't be conformant in HTML.
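As a rough illustration of that precedence rule, here is a minimal Python sketch. The function names, the 1024-byte scan limit, and the ISO-8859-1 fallback are my own assumptions for the example, not something the HTML 4 text prescribes, and the regular expression is a crude stand-in for a real HTML parser.

```python
import re
from email.message import Message

# Crude pattern for a charset inside a <meta> tag; assumes the bytes before
# the declaration are ASCII-compatible, as HTML 4 requires for <meta>.
META_RE = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
                     re.IGNORECASE)

def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()  # lowercased, or None if absent

def charset_from_meta(body, limit=1024):
    """Look for a <meta> charset declaration in the first `limit` bytes."""
    match = META_RE.search(body[:limit])
    return match.group(1).decode("ascii").lower() if match else None

def pick_charset(content_type, body, default="iso-8859-1"):
    """HTTP header first, then <meta>, then an assumed default."""
    return (charset_from_content_type(content_type)
            or charset_from_meta(body)
            or default)

body = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-16">'
print(pick_charset("text/html; charset=ISO-8859-1", body))
print(pick_charset("text/html", body))
```

With the header from the example above, the header's charset is chosen and the <meta> element is never consulted.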

HTML 5 allows non-ASCII-compatible encodings as long as U+FEFF (byte order mark) is used:


http://www.w3.org/TR/html-markup/syntax.html#encoding-declaration
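A byte order mark check is cheap to do before any parsing. The sketch below uses the BOM constants from Python's codecs module; the function name and return shape are just for illustration, and UTF-32 is deliberately left out (its little-endian BOM begins with the same two bytes as UTF-16-LE, so a fuller version would need to test for it first).

```python
import codecs

# BOMs to look for, paired with the codec name to decode the rest with.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return (encoding, bom_length) if data starts with a known BOM,
    otherwise (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

raw = "\ufeff<!DOCTYPE html>".encode("utf-16-le")
encoding, skip = sniff_bom(raw)
if encoding is not None:
    print(encoding, raw[skip:].decode(encoding))
```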

Not sure about XML.

Of course this whole area is a bit of an "arms race" between programmers competing to get away with being as sloppy as possible and other programmers who have to deal with their mess.


Isaac Morland                   CSCF Web Guru
DC 2554C, x36650                WWW Software Specialist
