This was an issue for us when we were converting SGML documents to XML. Our
solution was to declare the character entities in the XML DTD so that they
expand into the corresponding Unicode characters.
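
For illustration only, here is a minimal sketch of that approach. It assumes
Python's xml.dom.minidom as a stand-in parser, and the document and the
"nbsp" declaration are made up for the example; a real conversion would
declare the full SGML entity set in the DTD.

```python
from xml.dom.minidom import parseString

# Hypothetical document: the internal DTD subset declares the SGML-style
# entity "nbsp" as the Unicode character U+00A0 (no-break space).
SOURCE = '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]><p>a&nbsp;b</p>'

doc = parseString(SOURCE)

# The parser replaces &nbsp; with the declared character, so the DOM text
# holds the Unicode character itself rather than the entity name.
text = doc.documentElement.firstChild.data
print(repr(text))
```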

-----Original Message-----
From: Joseph Kesselman/CAM/Lotus [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 20, 2002 10:42 AM
To: Steve Carton
Cc: [EMAIL PROTECTED]
Subject: Re: Character Entities



The DOM has no concept of "character entities" per se.

Named references to characters (such as &lt;) are treated as predefined
parsed entity references, just as if you had defined them yourself in the
DTD. However, the DOM spec allows parsed entities to be "fully expanded"
and leaves the question of which (if any) are treated that way up to the
parser. Most parsers I've seen _do_ fully expand these predefined entities,
but that's optional.

Numeric character references (such as &#160;) are always expanded into their
corresponding Unicode characters.
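
A small sketch of what that expansion looks like in practice, assuming
Python's xml.dom.minidom as the parser (other DOM implementations may choose
differently for the predefined entities, as noted above). The sample string
is invented for the example.

```python
from xml.dom.minidom import parseString

# Two predefined named references (&lt; and &amp;) followed by a decimal
# and a hexadecimal numeric character reference.
doc = parseString('<p>1 &lt; 2 &amp;&#160;&#x41;</p>')

# This parser expands all four references, so the text node contains the
# characters '<', '&', U+00A0 and 'A' directly.
text = doc.documentElement.firstChild.data
print(repr(text))
```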

And the DOM's requirement of text normalization means that if expansion was
done, the resulting character will be merged with any adjacent text
node(s).
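
The merging can be observed directly. This is only a sketch, again assuming
Python's xml.dom.minidom; the variable names are mine. It also shows that
text nodes added by hand stay separate until normalize() is called.

```python
from xml.dom.minidom import parseString

doc = parseString('<p>a&lt;b</p>')
p = doc.documentElement

# The expanded '<' is merged with the text on either side of it,
# so the element has a single text child.
parsed_children = len(p.childNodes)

# Adjacent text nodes created through the DOM API are not merged
# automatically; normalize() joins them.
p.appendChild(doc.createTextNode(' and '))
p.appendChild(doc.createTextNode('more'))
before = len(p.childNodes)
p.normalize()
after = len(p.childNodes)

print(parsed_children, before, after, repr(p.firstChild.data))
```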


Question: Why would you _want_ to stop their expansion? These are used
where it would otherwise be impossible to insert the character directly,
and are intended to be read as the character when processing the document's
contents. When you serialize the DOM back into XML format, it's the
serializer's responsibility to convert them back into their escaped
form.
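
As a quick round-trip sketch (assuming Python's xml.dom.minidom and its
built-in toxml() serializer; other serializers may escape a different set of
characters):

```python
from xml.dom.minidom import parseString

doc = parseString('<p>1 &lt; 2 &amp; 3</p>')

# In memory, the text holds the plain characters.
in_memory = doc.documentElement.firstChild.data

# On output, the serializer escapes '<' and '&' again so that the
# result is still well-formed XML.
serialized = doc.documentElement.toxml()

print(repr(in_memory))
print(serialized)
```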
