On Sun, Jul 22, 2007 at 11:25:23PM +0200, Matthias Wimmer wrote:
Hi Robin!
Robin Redeker wrote:
Why at all do these characters have to be escaped?
I guess because many people implemented their own broken XML parsers
in the past and many of those couldn't handle real XML, so escaping
that character was enforced for backward compatibility. (just a guess)
I can't believe that there are any such problems. There is already
software producing XML that is valid but does not escape all possible
characters.
Examples of this are jabberd2 (up to a very recent SVN version), jadc2s
(up to today), Psi (still not escaping and ' in text nodes).
Hm, ok, then I also have absolutely no idea why these chars have to be
escaped. (And I'm also very curious!)
So there is a lot of software out there that has worked for years, but
introducing this unnecessary restriction in RFC 3920 made it broken.
If you use expat you could get the original string from a text node
and look for a '>' in that string. But this is an ugly hack that I also
consider unnecessary.
How do I do this with expat? I have never seen anything like this.
Normally, at least, expat is a SAX-style parser on which you set a
CharacterDataHandler. And the function you register as the
CharacterDataHandler gets passed unescaped UTF-8 data. Within the
CharacterDataHandler I see no way to determine whether a > has been
transferred as > or as &gt;.
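
To illustrate the point about the CharacterDataHandler, here is a minimal sketch using Python's standard xml.parsers.expat binding rather than the C API discussed in the thread (that choice, the helper name text_seen_by_handler and the sample documents are assumptions made for illustration). It feeds the same text once with a literal > and once with &gt; and compares what the handler receives:

```python
import xml.parsers.expat


def text_seen_by_handler(document: str) -> str:
    """Concatenate everything expat passes to the CharacterDataHandler."""
    pieces = []
    parser = xml.parsers.expat.ParserCreate()
    # The handler only ever sees decoded character data, never the markup.
    parser.CharacterDataHandler = pieces.append
    parser.Parse(document, True)
    return "".join(pieces)


literal = text_seen_by_handler("<body>a > b</body>")
escaped = text_seen_by_handler("<body>a &gt; b</body>")

# Both spellings arrive as the same character data.
print(literal == escaped)
```

Under these assumptions the two results are indistinguishable, which is why the handler alone cannot tell how the character was written on the wire.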
The function I mean is XML_GetInputContext; the Perl API uses it
to give me the portion of the XML document that was parsed. Once you
have that, you might be able to find out whether an unescaped > is in it.
(see eg.
http://www.math.ucla.edu/computing/docindex/expat-html-2/reference.html#XML_GetInputContext
)
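
A rough sketch of this idea, written against Python's xml.parsers.expat binding, whose GetInputContext() method wraps XML_GetInputContext (the helper name has_unescaped_gt is hypothetical). It relies on two assumptions that are observed behaviour rather than documented guarantees: expat reports each entity or character reference as its own character-data event, and the returned context starts at the current event. It also needs buffer_text left at its default of False.

```python
import xml.parsers.expat


def has_unescaped_gt(document: bytes) -> bool:
    """Return True if some '>' in character data was written literally."""
    found = False
    parser = xml.parsers.expat.ParserCreate()

    def on_text(data: str) -> None:
        nonlocal found
        if ">" not in data:
            return
        # Raw input starting at the current event, as bytes. This is None
        # if the underlying expat was built without context support.
        context = parser.GetInputContext()
        # A raw '&' cannot start literal text, so a context that begins
        # with '&' means this event came from a reference such as &gt;.
        if context is not None and not context.startswith(b"&"):
            found = True

    parser.CharacterDataHandler = on_text
    parser.Parse(document, True)
    return found
```

Calling has_unescaped_gt(b"<body>a > b</body>") is expected to report the literal character, while the &gt; spelling is not flagged. A > inside a CDATA section would also count as literal in this sketch.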
You could also use XML_SetDefaultHandler and XML_DefaultCurrent to get
nearly the same data without the limit of XML_CONTEXT_BYTES (I don't
actually know very precisely what the difference is; I'm not that
familiar with the C API of expat).
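
The default-handler route can be sketched with Python's xml.parsers.expat binding, where the DefaultHandler attribute corresponds to XML_SetDefaultHandler. XML_DefaultCurrent does not appear to be exposed by that binding, so this sketch simply registers no other handlers and lets everything fall through to the default handler. The helper name raw_pieces is hypothetical.

```python
import xml.parsers.expat


def raw_pieces(document: str) -> list:
    """Collect the pieces of the document exactly as they were written."""
    pieces = []
    parser = xml.parsers.expat.ParserCreate()
    # With no more specific handler set, expat hands every part of the
    # input to the default handler without expanding references.
    parser.DefaultHandler = pieces.append
    parser.Parse(document, True)
    return pieces


doc = "<body>a &gt; b</body>"
pieces = raw_pieces(doc)

# The raw pieces join back into the original text, escapes included.
print("".join(pieces) == doc)
```

The cost of this approach is that the caller has to separate tags from text itself, which is a large part of why it counts as a hack.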
These are all, of course, very weird ways to get at the original XML
character data of a recognized element, and I would love it if XMPP
didn't require me to need such access to my XML parser at all.
All I want is a DOM tree, and all I want to care about when writing XMPP
out is passing my DOM tree to an XML generator, without having to convince
the generator that _I_ know better than it does how XML should be
generated and what it should look like so that an XMPP server doesn't get
confused.
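
As a small illustration of that workflow, the sketch below builds a tree with Python's standard xml.etree.ElementTree and leaves all escaping decisions to the serializer. The element names and the helper serialize_body are assumptions for the example, not anything taken from an XMPP library.

```python
import xml.etree.ElementTree as ET


def serialize_body(text: str) -> str:
    """Build a tiny message tree and let the library generate the XML."""
    message = ET.Element("message")
    body = ET.SubElement(message, "body")
    body.text = text  # plain text; no manual escaping by the caller
    return ET.tostring(message, encoding="unicode")


original = "it's 2 > 1 & \"fine\""
wire = serialize_body(original)

# Any conforming XML parser recovers the same text, whichever characters
# the generator chose to escape.
print(ET.fromstring(wire).find("body").text == original)
```

Which characters end up escaped in text nodes is the generator's decision. A conforming receiver does not need to care, and that is the property the paragraph above asks XMPP to rely on.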
(Let's not start with the namespace madness here; we don't want to open
the can of worms that is still left over from the 'About stream
namespaces' discussion from 4 months ago. Hmm... IMO the issue should be
brought up regularly, even though discussions about it end in:
"We know, but we won't fix it, because the protocol should stay broken
and underspecified, because if we fix it we break the implementations.")
The RFC should be fixed, and software that doesn't parse an unescaped >
in text nodes should be fixed (no one is forced in today's world to write
their own XML parser; libxml2 (afaik) and expat (for sure) can be
convinced to handle partially transferred XML documents these days).
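
A minimal sketch of such incremental parsing with expat, using Python's xml.parsers.expat binding. The chunk boundaries, element names and the helper feed_in_chunks are made up for illustration. The stream element is deliberately never closed, as on a long-lived XMPP connection.

```python
import xml.parsers.expat


def feed_in_chunks(chunks):
    """Feed partial input to expat and record the events as they fire."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = lambda data: events.append(("text", data))
    for chunk in chunks:
        # False means "not final": more data may follow later.
        parser.Parse(chunk, False)
    return events


events = feed_in_chunks([
    b"<stream><message><bo",    # cut in the middle of a start tag
    b"dy>a > b</body></mess",   # literal '>' in text, cut in an end tag
    b"age>",
])
for event in events:
    print(event)
```

The parser holds on to an incomplete tag until the rest arrives and accepts the literal > in the text node without complaint, so nothing here requires a hand-written parser.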
Yes ... I'd also say that because of reusing standards and
implementations of them, we should not force software to not accept
unescaped entities. We should even encourage software to accept these
unescaped entities.
I agree, we should encourage conformance to the W3C recommendation.
Robin