I had been told by the W3C people that the reason for forbidding control
characters in XML and HTML was for compatibility with SGML. I've never
checked it, since unfortunately the SGML standard is not online. If not
true, that's very interesting.

When you think of XML as a general transmission mechanism for data (not
just as a text document), the need for control characters becomes clear.
Suppose that you have a database, of any sort. Some fields may contain
control characters, since control characters are perfectly legal in many
if not all databases. You want to query that database and get a selection,
packaged as XML.

Unfortunately, you have to invent your own home-brew quoting mechanism for
the control characters, since standard XML does not permit you to
represent all of the -- perfectly valid -- characters in that database. And
such a home-brew mechanism will not interoperate with anything else.
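
To make the problem concrete, here is a minimal sketch in Python. It
assumes the standard library's xml.etree.ElementTree as the XML 1.0
parser. The names escape_field, unescape_field and to_xml, and the
backslash-u quoting convention itself, are hypothetical: they are exactly
the kind of private invention described above, which no other XML tool
would know how to undo.

```python
import re
import xml.etree.ElementTree as ET

# Characters that XML 1.0 does not allow in content: the C0 controls
# other than TAB, LF and CR, plus U+FFFE and U+FFFF.
_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")
_ESCAPED = re.compile(r"\\(\\|u[0-9a-f]{4})")


def escape_field(value: str) -> str:
    """Hypothetical home-brew quoting: '\\' becomes '\\\\' and each
    forbidden character becomes a backslash-u escape with 4 hex digits."""
    value = value.replace("\\", "\\\\")
    return _FORBIDDEN.sub(lambda m: "\\u%04x" % ord(m.group()), value)


def unescape_field(value: str) -> str:
    """Inverse of escape_field; only a receiver that knows this private
    convention can apply it."""
    def restore(m):
        token = m.group(1)
        return "\\" if token == "\\" else chr(int(token[1:], 16))
    return _ESCAPED.sub(restore, value)


def to_xml(value: str) -> str:
    """Package a single database field as an XML element."""
    elem = ET.Element("field")
    elem.text = value
    return ET.tostring(elem, encoding="unicode")


raw = "abc\x01def"  # a field value containing U+0001

# Sent as-is, the document is not well-formed and the parser rejects it.
try:
    ET.fromstring(to_xml(raw))
except ET.ParseError as err:
    print("rejected:", err)

# With the private quoting the document parses, but only code that
# also calls unescape_field gets the original value back.
received = ET.fromstring(to_xml(escape_field(raw))).text
print(unescape_field(received) == raw)
```

A numeric character reference does not help either: an XML 1.0 parser
rejects a reference to U+0001 just as it rejects the raw character.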

Alternatively, you could filter out the control characters. That, of
course, would corrupt the data, which is generally considered a bad thing.

Mark

—————

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

----- Original Message -----
From: "Lars Marius Garshol" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, July 17, 2001 02:28
Subject: Re: Is there Unicode mail out there?


>
> * Mark Davis
> |
> | The HTML spec depends on the SGML spec for a characterization of
> | allowable characters. The latter, unfortunately, disallows some
> | valid Unicode characters (most C0 controls), but inconsistently
> | allows other similar characters (C1 controls).
>
> SGML is silent on the issue of what characters are allowed. It is the
> SGML declaration used by each application which decides this, and you
> can easily make an SGML declaration which allows every Unicode
> character.
>
> To wit:
>
> <!SGML  "ISO 8879:1986 (WWW)"
>      CHARSET
>           BASESET  "ISO Registration Number 177//CHARSET
>                     ISO/IEC 10646-1:1993 UCS-4 with
>                     implementation level 3//ESC 2/5 2/15 4/6"
>          DESCSET 0       55296   0
>                  55296   2048    UNUSED  -- SURROGATES --
>                  57344   1056768 57344
>
> CAPACITY        SGMLREF
>                 TOTALCAP        150000
>                 GRPCAP          150000
>                 ENTCAP          150000
>
> SCOPE    DOCUMENT
> SYNTAX
>          SHUNCHAR NONE
>          BASESET  "ISO 646IRV:1991//CHARSET
>                    International Reference Version
>                    (IRV)//ESC 2/8 4/2"
>          DESCSET  0 128 0          FUNCTION
>                   RE            13
>                   RS            10
>                   SPACE         32
>                   TAB SEPCHAR    9
>
>          NAMING   LCNMSTRT ""
>                   UCNMSTRT ""
>                   LCNMCHAR ".-_:"
>                   UCNMCHAR ".-_:"
>                   NAMECASE GENERAL YES
>                            ENTITY  NO
>
>          DELIM    GENERAL  SGMLREF
>                   HCRO "&#38;#x"   -- 38 is the number for ampersand --
>                   SHORTREF SGMLREF
>          NAMES    SGMLREF
>          QUANTITY SGMLREF
>                   ATTCNT   60      -- increased --
>                   ATTSPLEN 65536   -- These are the largest values --
>                   LITLEN   65536   -- permitted in the declaration --
>                   NAMELEN  65536   -- Avoid fixed limits in actual --
>                   PILEN    65536   -- implementations of HTML UA's --
>                   TAGLVL   100
>                   TAGLEN   65536
>                   GRPGTCNT 150
>                   GRPCNT   64
>
> FEATURES
>   MINIMIZE
>     DATATAG  NO
>     OMITTAG  YES
>     RANK     NO
>     SHORTTAG YES
>   LINK
>     SIMPLE   NO
>     IMPLICIT NO
>     EXPLICIT NO
>   OTHER
>     CONCUR   NO
>     SUBDOC   NO
>     FORMAL   YES
>   APPINFO NONE
> >
>
> | That means that it is not possible in HTML (or more importantly, in
> | XML) to represent all valid Unicode characters in data fields.
>
> What would you want to use control characters for in an XML document?
>
> --Lars M.
>
>
>

