Make XML based parsers better handle whitespace
-----------------------------------------------
Key: DOXIA-226
URL: http://jira.codehaus.org/browse/DOXIA-226
Project: Maven Doxia
Issue Type: Improvement
Reporter: Benjamin Bentmann

Regarding whitespace in XML documents, one needs to consider the following aspects:
- ignorable whitespace, i.e. view "{{<tr> <td/> </tr>}}" and "{{<tr><td/></tr>}}" as equivalent
- collapsible whitespace, i.e. view "{{Text   Text}}" and "{{Text Text}}" as equivalent
- trimmable whitespace, i.e. view "{{<p> Text </p>}}" and "{{<p>Text</p>}}" as equivalent

Those distinctions require a DTD/XSD in combination with a validating parser and/or application-specific knowledge. For robustness, Doxia parsers for XML-based formats should not depend on the existence of a schema definition, so that they reliably deliver events into the sinks. Hence I suggest hard-coding the required logic for proper whitespace handling into each parser.

Currently, whitespace handling is rather static, e.g. {{XhtmlBaseParser}} pushes all input whitespace into the sink. This might cause trouble with sinks that are not expected to receive ignorable whitespace.

To address this issue, it seems helpful if {{AbstractXmlParser}} provided a default implementation of {{handleText()}} that subclasses can simply control via state flags instead of implementing {{handleText()}} from scratch in each parser. Copy and paste, which caused DOXIA-225, needs to be avoided.

More precisely, I imagine the following changes:
- Have {{AbstractXmlParser}} maintain a stack of tuples (ignorable, collapsible, trimmable) where each tuple describes the whitespace handling for the currently parsed element
- Have {{AbstractXmlParser}} push/pop a tuple on this stack before/after calling {{handleStartTag()}}/{{handleEndTag()}}
- Have {{AbstractXmlParser}} provide setters to allow subclasses to control the desired whitespace handling in their {{handleStartTag()}} implementation
- Have {{AbstractXmlParser}} implement {{handleText()}} where it evaluates the top-most tuple from the stack
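
To illustrate the proposed stack of (ignorable, collapsible, trimmable) tuples, here is a minimal Java sketch. It is not the actual Doxia code: {{WhitespaceSketch}}, {{Policy}}, {{SketchXmlParser}}, {{DemoParser}}, {{startElement()}} and {{endElement()}} are hypothetical names, the method signatures are simplified to plain strings, and a list of strings stands in for a real sink. The choice of flags for {{tr}} and {{p}} is also just an assumption for the demo.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the proposal; names and signatures are simplified
// and do not match the real Doxia AbstractXmlParser API.
class WhitespaceSketch {

    /** The (ignorable, collapsible, trimmable) tuple for one element. */
    static final class Policy {
        boolean ignorable;   // drop text that consists of whitespace only
        boolean collapsible; // replace each whitespace run by a single space
        boolean trimmable;   // strip leading and trailing whitespace
    }

    /** Stand-in for AbstractXmlParser; the list stands in for a sink. */
    abstract static class SketchXmlParser {
        private final Deque<Policy> policies = new ArrayDeque<>();
        final List<String> sink = new ArrayList<>();

        // Setters that subclasses call from handleStartTag().
        protected void setIgnorable(boolean value)   { policies.peek().ignorable = value; }
        protected void setCollapsible(boolean value) { policies.peek().collapsible = value; }
        protected void setTrimmable(boolean value)   { policies.peek().trimmable = value; }

        protected abstract void handleStartTag(String name);

        protected void handleEndTag(String name) { }

        // Push a fresh tuple before the start-tag hook runs ...
        final void startElement(String name) {
            policies.push(new Policy());
            handleStartTag(name);
        }

        // ... and pop it after the end-tag hook has run.
        final void endElement(String name) {
            handleEndTag(name);
            policies.pop();
        }

        // Default text handling, driven by the top-most tuple on the stack.
        final void handleText(String text) {
            Policy p = policies.peek();
            if (p != null) {
                if (p.ignorable && text.trim().isEmpty()) {
                    return; // whitespace-only text between child elements
                }
                if (p.collapsible) {
                    text = text.replaceAll("\\s+", " ");
                }
                if (p.trimmable) {
                    text = text.trim();
                }
            }
            if (!text.isEmpty()) {
                sink.add(text);
            }
        }
    }

    /** Example subclass: controls whitespace via flags only. */
    static class DemoParser extends SketchXmlParser {
        @Override
        protected void handleStartTag(String name) {
            if (name.equals("tr")) {
                setIgnorable(true);
            } else if (name.equals("p")) {
                setCollapsible(true);
                setTrimmable(true);
            }
        }
    }

    public static void main(String[] args) {
        DemoParser parser = new DemoParser();
        // <tr> <td/> </tr>
        parser.startElement("tr");
        parser.handleText(" ");
        parser.startElement("td");
        parser.endElement("td");
        parser.handleText(" ");
        parser.endElement("tr");
        // <p>  Text   Text </p>
        parser.startElement("p");
        parser.handleText("  Text   Text ");
        parser.endElement("p");
        System.out.println(parser.sink); // prints [Text Text]
    }
}
```

Elements for which no flag is set keep the default tuple (all false), so their text is passed through unchanged, which matches the current behaviour described for {{XhtmlBaseParser}}.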