For reference: the XhtmlBaseParser in Doxia 1.1.1 emits entities as text, unless
they are not recognized (i.e. have not been declared), in which case they are
emitted as unknown events.
-Lukas
Vincent Siveton wrote:
Hi Lukas,
2009/5/4 Lukas Theussl <ltheu...@apache.org>:
Vincent,
I'm trying to understand some of the issues we have with entities in the
XmlParser. Is there a special reason why entities are emitted as rawText and
not text?
The text used by XhtmlBaseParser#handleEntity() could contain
predefined entities [1] and numeric code entities (i.e. &#198; will
become Æ by XmlPullParser).
XhtmlBaseSink#text() escapes characters and XhtmlBaseSink#rawText() does not.
So rawText() is used to make sure that text containing entities is not escaped.
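A minimal sketch of that difference, using a toy sink rather than the real
XhtmlBaseSink (the class ToyXhtmlSink and the sample input "&#198;" are
assumptions made up for illustration):

```java
// Toy stand-in for an XHTML-style sink. This is NOT the Doxia XhtmlBaseSink;
// it only mimics the "text() escapes, rawText() does not" behaviour.
class ToyXhtmlSink {
    private final StringBuilder out = new StringBuilder();

    // text(): escape markup-significant characters before writing.
    void text(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
    }

    // rawText(): write the string exactly as given.
    void rawText(String s) {
        out.append(s);
    }

    String result() {
        return out.toString();
    }
}

class EscapeDemo {
    public static void main(String[] args) {
        ToyXhtmlSink viaText = new ToyXhtmlSink();
        viaText.text("&#198;");     // the ampersand gets escaped
        ToyXhtmlSink viaRaw = new ToyXhtmlSink();
        viaRaw.rawText("&#198;");   // the reference survives untouched

        System.out.println(viaText.result()); // &amp;#198;
        System.out.println(viaRaw.result());  // &#198;
    }
}
```

With text() the character reference would be double-escaped in the output,
which is what rawText() avoids.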
I think they should be emitted as text:
First, custom entities can be used simply to define some replacement text
inside documents (e.g. <!ENTITY version "1.0">).
Second, the resulting events should be consumable by all sinks, not just
x(ht)ml based ones. Consider for instance the text "&amp;&AElig;" (where
AElig is defined as <!ENTITY AElig "&#198;">). Currently it is emitted by
the XhtmlBaseParser as one text event "&" and one rawText event "&#198;".
This means that e.g. the Latex Sink will produce wrong output (the AElig
should be converted to "\AE" in latex).
IMO the resolved entity should be emitted in a format-independent way, e.g. as
one (Unicode?) character, just like & is emitted as one character above.
The consuming sink then has to transform that into a format-specific
representation.
That could be an alternative implementation:
XhtmlBaseParser#handleEntity() could unescape the XML and call only sink.text().
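A rough sketch of that idea, under stated assumptions: TextSink, XhtmlLikeSink,
LatexLikeSink and EntityHandlerSketch are hypothetical toy types, not the Doxia
API, and the real handleEntity() has a different signature. The handler
resolves predefined entities and numeric character references to plain
characters, leaves anything else alone, and only ever calls text():

```java
// Toy sketch only: these types are hypothetical and are NOT the Doxia API.
interface TextSink {
    void text(String s);
    String result();
}

// XHTML-like sink: escapes markup characters, passes everything else through.
class XhtmlLikeSink implements TextSink {
    private final StringBuilder out = new StringBuilder();

    public void text(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
    }

    public String result() { return out.toString(); }
}

// LaTeX-like sink: maps characters to LaTeX commands (only two shown here).
class LatexLikeSink implements TextSink {
    private final StringBuilder out = new StringBuilder();

    public void text(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':      out.append("\\&"); break;
                case '\u00C6': out.append("\\AE{}"); break; // AE ligature
                default:       out.append(c);
            }
        }
    }

    public String result() { return out.toString(); }
}

class EntityHandlerSketch {
    // Resolve "&amp;", "&#198;", "&#xC6;" etc. to the character they stand for.
    // Unknown or malformed input is returned unchanged.
    static String unescape(String entity) {
        if (!entity.startsWith("&") || !entity.endsWith(";")) {
            return entity;
        }
        String name = entity.substring(1, entity.length() - 1);
        switch (name) {
            case "amp":  return "&";
            case "lt":   return "<";
            case "gt":   return ">";
            case "quot": return "\"";
            case "apos": return "'";
            default:     break;
        }
        if (name.startsWith("#")) {
            try {
                int codePoint = name.startsWith("#x")
                        ? Integer.parseInt(name.substring(2), 16)
                        : Integer.parseInt(name.substring(1));
                return new String(Character.toChars(codePoint));
            } catch (IllegalArgumentException e) {
                return entity; // not a valid number or code point
            }
        }
        return entity;
    }

    // Emit the resolved entity as ordinary text; each sink does its own escaping.
    static void handleEntity(String entity, TextSink sink) {
        sink.text(unescape(entity));
    }

    public static void main(String[] args) {
        TextSink latex = new LatexLikeSink();
        handleEntity("&amp;", latex);
        handleEntity("&#198;", latex);
        System.out.println(latex.result()); // \&\AE{}

        TextSink xhtml = new XhtmlLikeSink();
        handleEntity("&amp;", xhtml);
        handleEntity("&#198;", xhtml);
        System.out.println(xhtml.result());
    }
}
```

The point of the sketch is that the parser side stays format-independent and
the format-specific representation is chosen by whichever sink consumes the
text event.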
Cheers,
Vincent
[1] http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-predefined-ent