Re: character sets in HTML files?

2001-10-19 Thread David A. Desrosiers


 One of the advantages of using Python 2 for parsing is that it can work
 with a complete 32-bit Unicode charset encoding (UTF-8), rather than
 just a locale-specific subset, and includes support for transforming
 many (most) subsets into UTF-8.

My understanding is that you need the catalogs and NLS support built
into Python to take advantage of that, and that means ensuring that the
package maintainer (or you, if you do your own source builds) did not use
the --disable-nls switch when compiling. Many do (and there's good reason to).
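
I'm not certain how that switch interacts with the charset codecs, so one way to find out what a given build actually offers is to ask the codecs module directly. A small sketch follows, written for a current Python rather than the 2.x series; the helper name has_codec is made up for illustration.

```python
import codecs


def has_codec(name: str) -> bool:
    """Return True if this Python build can find a codec with this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        # codecs.lookup raises LookupError for unknown encodings.
        return False
    return True


# Check a few of the charsets mentioned in this thread.
for charset in ("utf-8", "iso-8859-1", "iso-8859-5", "utf-16-le"):
    print(charset, has_codec(charset))
```

If any of those print False, the conversions discussed here would not work on that installation.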



/d





Re: character sets in HTML files?

2001-10-18 Thread MJ Ray

Bill Janssen [EMAIL PROTECTED] writes:

 As soon as we add an XML component to the parser...  It's on my list.

Should plucker just parse XML and feed non-xml stuff to tidy to
reformat?  Just an idea to simplify things.  I think it simplifies
things, at least.

 Actually, if you read the XHTML specs, you'll see that they refer you
 back to the HTML specs for many, even most, things.

Indeed, but I thought XML was in unicode?  Or did I dream that?
Probably did, as I'm sure I've seen encoding=iso-8859-1 in some
files, actually.

-- 
MJR



Re: character sets in HTML files?

2001-10-18 Thread Bill Janssen

   Remember, implementing an XML parser is no trivial matter. If the
 XML page or application fails validation, the page is bitbucketed. In the
 current scheme, Plucker tries to make sense of what's left of the broken
 HTML, but with XML, that's not allowed.

Luckily, Python 2 comes with three XML parsers.  I've been reading up on
them and trying to figure out which is the simplest to use for Plucker.
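
To illustrate the strictness point, here is a rough sketch using the expat binding from the standard library (xml.parsers.expat). It only checks well-formedness, not validity against a DTD. The helper name is_well_formed is hypothetical, and the snippet is written for a current Python rather than 2.x.

```python
import xml.parsers.expat


def is_well_formed(document: bytes) -> bool:
    """Return True if expat parses the whole document without an error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this chunk as the final one.
        parser.Parse(document, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


good = b"<p>Hello <b>world</b></p>"
bad = b"<p>Hello <b>world</p>"  # unclosed <b>, typical of sloppy HTML

print(is_well_formed(good))
print(is_well_formed(bad))
```

An XML parser gives up on the second document, where the current HTML parser would try to carry on.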

  Indeed, but I thought XML was in unicode?  Or did I dream that? Probably
  did, as I'm sure I've seen encoding=iso-8859-1 in some files,
  actually.
 
   It is indeed Unicode; however, you can override it.

There are two things going on.

Every XML and/or HTML document allows the full Unicode character set.
Period.  Every HTML document can contain any Unicode character.  But
they are expressed differently in the document depending on what
charset encoding is being used.  If a simple encoding like US-ASCII is
used, characters not in that character set are expressed as &#NNNN;,
where NNNN is the decimal value for the Unicode character code.
That's why you sometimes see things like &#8212; (em-dash) in HTML
files.
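
As a small sketch of that mechanism (written for a current Python, not 2.x): the standard "xmlcharrefreplace" error handler writes any character the target encoding lacks as a decimal character reference, and html.unescape turns it back. The sample string is just an illustration.

```python
import html

# U+2014 is the em-dash; its decimal code point is 8212.
text = "wait \u2014 what?"

# US-ASCII cannot represent U+2014, so it becomes a numeric reference.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)  # b'wait &#8212; what?'

# Reading it back: decode the ASCII bytes, then resolve the references.
restored = html.unescape(ascii_bytes.decode("ascii"))
print(restored == text)  # True
```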

See http://www.w3.org/TR/2000/REC-xml-20001006#charsets for XML, and
http://www.w3.org/TR/html4/charset.html#h-5.1 for HTML, for more on
all this.

One of the practical consequences of all this is that when you receive
an HTML document, for example, in UTF-16LE or ISO-8859-5 charset
encoding, you need to transform it to a local charset encoding (say
US-ASCII or ISO-8859-1) before you can even parse it.  One of the
advantages of using Python 2 for parsing is that it can work with a
complete 32-bit Unicode charset encoding (UTF-8), rather than just a
locale-specific subset, and includes support for transforming many
(most) subsets into UTF-8.
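
A minimal sketch of that transformation, assuming the source charset is already known (written for a current Python; the function name to_utf8 is made up for illustration):

```python
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Decode bytes from the given charset and re-encode them as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


# A Cyrillic word as it would arrive in an ISO-8859-5 document.
word = "\u041f\u0440\u0438\u0432\u0435\u0442"
cyrillic_bytes = word.encode("iso-8859-5")

# The same idea works for a UTF-16LE document.
cafe_bytes = "caf\u00e9".encode("utf-16-le")

print(to_utf8(cyrillic_bytes, "iso-8859-5"))
print(to_utf8(cafe_bytes, "utf-16-le"))
```

Once everything is in one encoding, the parser only has to deal with that one.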

Bill




Re: character sets in HTML files?

2001-10-17 Thread MJ Ray

Bill Janssen [EMAIL PROTECTED] writes:

 I've been reading the HTTP and HTML specs about character sets.

Shouldn't you be using the xhtml specs now?
-- 
MJR



Re: character sets in HTML files?

2001-10-17 Thread Bill Janssen

  I've been reading the HTTP and HTML specs about character sets.
 
 Shouldn't you be using the xhtml specs now?
 -- 
 MJR

As soon as we add an XML component to the parser...  It's on my list.

Actually, if you read the XHTML specs, you'll see that they refer you
back to the HTML specs for many, even most, things.

Bill




character sets in HTML files?

2001-10-16 Thread Bill Janssen

I've been reading the HTTP and HTML specs about character sets.

The HTTP spec says, "If a page is of type 'text/*', and the HTTP headers
don't specify a character set, assume ISO-8859-1".

The HTML spec says, "Don't follow the HTTP spec rules about the
default being ISO-8859-1", and "Use the HTTP-specified character set
first, if any, and after that the character set specified in the HTML
itself with a META tag".

Though I personally think that the in-document character set
specification should override the one specified in the HTTP headers,
I'm following those rules for HTML.  Does anyone know of any
interesting problems with those rules?
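
For concreteness, the precedence I mean can be sketched like this. This is not the actual Plucker code: the function names, the regular expression, and the 2048-byte scan limit are all assumptions made for illustration, it is written for a current Python, and the fallback is left to the caller because the rules above don't say what to use when neither source names a charset.

```python
import re
from email.message import Message
from typing import Optional

# Loose pattern for <META ... charset=...>; an assumption, not a full HTML parse.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def charset_from_http(content_type: Optional[str]) -> Optional[str]:
    """Pull the charset parameter out of a Content-Type header value."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None if absent


def charset_from_meta(body: bytes) -> Optional[str]:
    """Look for a charset in a META tag near the start of the document."""
    match = META_CHARSET.search(body[:2048])
    return match.group(1).decode("ascii").lower() if match else None


def pick_charset(content_type: Optional[str], body: bytes,
                 fallback: Optional[str] = None) -> Optional[str]:
    """HTTP header first, then the META tag, then the caller's fallback."""
    return charset_from_http(content_type) or charset_from_meta(body) or fallback


page = (b'<html><head><META http-equiv="Content-Type" '
        b'content="text/html; charset=ISO-8859-5"></head></html>')
print(pick_charset("text/html; charset=UTF-8", page))
print(pick_charset("text/html", page))
```

The first call takes the header's charset even though the document disagrees, which is exactly the case I'm uneasy about.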

Bill