The HTML parser should emit attributes as HTML.Attribute objects and not as strings.

This is true for the final, user-accessible parser interface. htmlAttributeSet is part of the internal implementation, used at the stage where the attribute strings have already been extracted from the text but not yet converted into the matching attribute constants. It is a highly specialized class that also handles case insensitivity (per the W3C HTML specification, both HTML tag and attribute names are case insensitive). Replacing it directly with SimpleAttributeSet will break the case insensitivity, and you will need to rework code elsewhere to restore it.
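To illustrate the case-insensitivity point, here is a minimal sketch using only the standard javax.swing.text.SimpleAttributeSet. The class and method names (CaseSensitivityDemo, lookup) are made up for this example and are not part of the parser code. It only shows that a plain SimpleAttributeSet treats string keys as case sensitive.

```java
import javax.swing.text.SimpleAttributeSet;

// Hypothetical demo class, not part of GNU Classpath.
public class CaseSensitivityDemo {

    // Stores one attribute under the name exactly as it appeared in the
    // HTML text, then looks it up under the requested name.
    static Object lookup(String storedName, String requestedName) {
        SimpleAttributeSet set = new SimpleAttributeSet();
        set.addAttribute(storedName, "index.html");
        return set.getAttribute(requestedName);
    }

    public static void main(String[] args) {
        // Same spelling: the value is found.
        System.out.println(lookup("HREF", "HREF"));
        // Different case: SimpleAttributeSet compares keys with equals(),
        // so the lookup returns null.
        System.out.println(lookup("HREF", "href"));
    }
}
```

Any replacement for htmlAttributeSet would therefore have to normalize the names somewhere before they are stored or queried.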

Also, the HTML may contain non-standard attributes that have no corresponding attribute constant. These should be handled as strings.
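As a rough sketch of what such a conversion could look like (this is an assumption for illustration, not what htmlAttributeSet actually does internally), the standard HTML.getAttributeKey(String) lookup can be combined with a string fallback. The class and method names (AttributeKeyDemo, toKey) are hypothetical.

```java
import java.util.Locale;
import javax.swing.text.html.HTML;

// Hypothetical helper, not part of GNU Classpath.
public class AttributeKeyDemo {

    // Maps a raw attribute name to its HTML.Attribute constant if one
    // exists; otherwise returns the lower-cased name as a plain String.
    static Object toKey(String rawName) {
        // Lower-case first so that "HREF", "Href" and "href" all match.
        String name = rawName.toLowerCase(Locale.ROOT);
        HTML.Attribute known = HTML.getAttributeKey(name);
        return known != null ? known : name;
    }

    public static void main(String[] args) {
        // Standard attribute: resolves to the HTML.Attribute.HREF constant.
        System.out.println(toKey("HREF") == HTML.Attribute.HREF);
        // Non-standard attribute: stays a String.
        System.out.println(toKey("X-Custom") instanceof String);
    }
}
```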

To produce less garbage, htmlAttributeSet may end up exposed to the user through the AttributeSet interface that it implements. I did not see this as a big problem, but, if needed, an intermediate class can certainly be instantiated instead. Mauve and GNU Classpath itself contain numerous automated tests for HTML parser regressions. These should be used when working with the parser code.
Good luck.
Audrius


