Make XML based parsers better handle whitespace
-----------------------------------------------

                 Key: DOXIA-226
                 URL: http://jira.codehaus.org/browse/DOXIA-226
             Project: Maven Doxia
          Issue Type: Improvement
            Reporter: Benjamin Bentmann


Regarding whitespace in XML documents, one needs to consider the following 
aspects:
- ignorable whitespace, i.e. view "{{<tr> <td/> </tr>}}" and 
"{{<tr><td/></tr>}}" as equivalent
- collapsible whitespace, i.e. view "{{Text   Text}}" and "{{Text Text}}" 
as equivalent
- trimmable whitespace, i.e. view "{{<p>  Text  </p>}}" and "{{<p>Text</p>}}" 
as equivalent
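
As a rough illustration of the three aspects, here is a minimal Java sketch. 
The class and method names are hypothetical and not part of Doxia, and it 
assumes that "whitespace" means the characters removed by {{String.trim()}} 
or matched by the regex class {{\s}}.

```java
/** Hypothetical helpers illustrating the three whitespace aspects; not Doxia API. */
class WhitespaceKinds {

    /** Ignorable: a text node made up only of whitespace can be dropped entirely. */
    static boolean isIgnorable(String text) {
        return text.trim().isEmpty();
    }

    /** Collapsible: every run of whitespace is reduced to a single space. */
    static String collapse(String text) {
        return text.replaceAll("\\s+", " ");
    }

    /** Trimmable: leading and trailing whitespace is removed. */
    static String trim(String text) {
        return text.trim();
    }

    public static void main(String[] args) {
        // The text node between <tr> and <td/> in "<tr> <td/> </tr>"
        System.out.println(isIgnorable(" "));
        // "Text   Text" versus "Text Text"
        System.out.println("[" + collapse("Text   Text") + "]");
        // "<p>  Text  </p>" versus "<p>Text</p>"
        System.out.println("[" + trim("  Text  ") + "]");
    }
}
```

Which of these operations applies to a given text node depends on the 
enclosing element, which is why the parser needs per-element knowledge.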

Making these distinctions requires a DTD/XSD in combination with a validating 
parser and/or application-specific knowledge. For robustness, Doxia parsers for 
XML-based formats should not depend on the existence of a schema definition 
in order to reliably deliver events into the sinks. Hence I suggest 
hard-coding the required logic for proper whitespace handling into each parser.

Currently, whitespace handling is rather static, e.g. {{XhtmlBaseParser}} 
pushes all input whitespace into the sink. This might cause trouble with sinks 
that do not expect to receive ignorable whitespace. To address this issue, 
it would help if {{AbstractXmlParser}} provided a default implementation of 
{{handleText()}} that subclasses can simply control via state flags instead of 
implementing {{handleText()}} from scratch in each parser. Copy & paste, which 
caused DOXIA-225, needs to be avoided.

More precisely, I imagine the following changes:
- Have {{AbstractXmlParser}} maintain a stack of tuples (ignorable, 
collapsible, trimmable) where each tuple describes the whitespace handling for 
the currently parsed element
- Have {{AbstractXmlParser}} push a tuple onto this stack before calling 
{{handleStartTag()}} and pop it after calling {{handleEndTag()}}
- Have {{AbstractXmlParser}} provide setters to allow subclasses to control the 
desired whitespace handling in their {{handleStartTag()}} implementation
- Have {{AbstractXmlParser}} implement {{handleText()}} where it evaluates the 
top-most tuple from the stack
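
The changes above could be sketched roughly as follows. This is a 
self-contained illustration, not the actual {{AbstractXmlParser}}: the class 
names, the setter names, the {{startElement()}}/{{endElement()}} entry points, 
and the list standing in for a sink are all assumptions made for this example. 
It also assumes that a freshly pushed tuple has all flags off (the current 
pass-through behaviour) and that collapsing is applied before trimming.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for the proposed base class logic; not Doxia code. */
class WhitespaceAwareParserSketch {

    /** One (ignorable, collapsible, trimmable) tuple per open element. */
    static final class Mode {
        boolean ignorable;
        boolean collapsible;
        boolean trimmable;
    }

    private final Deque<Mode> stack = new ArrayDeque<Mode>();

    /** Stands in for a sink: collects the text events that get through. */
    final List<String> sink = new ArrayList<String>();

    // Setters that a subclass would call from handleStartTag().
    void setIgnorable(boolean value) { stack.peek().ignorable = value; }
    void setCollapsible(boolean value) { stack.peek().collapsible = value; }
    void setTrimmable(boolean value) { stack.peek().trimmable = value; }

    /** Push a tuple before handing the start tag to the subclass. */
    void startElement(String name) {
        stack.push(new Mode());
        handleStartTag(name);
    }

    /** Pop the tuple after the subclass has handled the end tag. */
    void endElement(String name) {
        handleEndTag(name);
        stack.pop();
    }

    void handleStartTag(String name) { /* subclass hook */ }

    void handleEndTag(String name) { /* subclass hook */ }

    /** Default text handling driven by the top-most tuple. */
    void handleText(String text) {
        Mode mode = stack.peek();
        if (mode == null) {          // text outside any element: pass through
            sink.add(text);
            return;
        }
        if (mode.ignorable && text.trim().isEmpty()) {
            return;                  // drop whitespace-only text
        }
        if (mode.collapsible) {
            text = text.replaceAll("\\s+", " ");
        }
        if (mode.trimmable) {
            text = text.trim();
        }
        sink.add(text);
    }

    public static void main(String[] args) {
        WhitespaceAwareParserSketch parser = new XhtmlLikeParserSketch();
        parser.startElement("tr");
        parser.handleText(" ");      // dropped: ignorable inside <tr>
        parser.endElement("tr");
        parser.startElement("p");
        parser.handleText("  Text   Text  ");
        parser.endElement("p");
        System.out.println(parser.sink);
    }
}

/** Hypothetical subclass: only sets flags, never reimplements handleText(). */
class XhtmlLikeParserSketch extends WhitespaceAwareParserSketch {
    @Override
    void handleStartTag(String name) {
        if ("tr".equals(name)) {
            setIgnorable(true);
        } else if ("p".equals(name)) {
            setCollapsible(true);
            setTrimmable(true);
        }
    }
}
```

With this shape, each concrete parser only declares per-element whitespace 
rules, so the text handling itself lives in one place and does not need to be 
copied between parsers.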



        
