I was a little worried about the performance impact of adding
an infoset parameter to the XNI callbacks, so I ran a small test
to measure how much the extra parameter costs. The results are
at the bottom.
What I Did
First, I created an XMLInfoSet interface (which is basically
just a map) and wrote a simple default implementation. Then I
added an "infoset" parameter to the following methods in the
XMLDocumentHandler interface (for completeness, it would also
need to be added to the XMLDocumentFragmentHandler interface):
startElement(QName,XMLAttributes,XMLInfoSet)
emptyElement(QName,XMLAttributes,XMLInfoSet)
characters(XMLString,XMLInfoSet)
ignorableWhitespace(XMLString,XMLInfoSet)
endElement(QName,XMLInfoSet)
I didn't add the parameter to any of the other methods because
it didn't seem necessary. [Does anyone see offhand any other
method that would need this parameter?]
Then I updated the Xerces2 implementation to pass this infoset
through the pipeline. This took a little bit of time but was
pretty straightforward.
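
For anyone who wants a concrete picture, here is a minimal sketch
of the idea, not the actual patch. The method names on XMLInfoSet,
the DefaultInfoSet class, the simplified handler interface (plain
Strings instead of QName/XMLAttributes/XMLString), the two pipeline
classes, and the "example.visited" property name are all
assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shape of the XMLInfoSet interface: basically a map
// from property names to values.
interface XMLInfoSet {
    Object getProperty(String name);
    void setProperty(String name, Object value);
}

// Simple default implementation backed by a HashMap.
class DefaultInfoSet implements XMLInfoSet {
    private final Map<String, Object> properties = new HashMap<>();
    public Object getProperty(String name) { return properties.get(name); }
    public void setProperty(String name, Object value) { properties.put(name, value); }
}

// Simplified stand-in for XMLDocumentHandler: names and text are plain
// Strings here, and attributes are omitted to keep the sketch short.
interface InfoSetDocumentHandler {
    void startElement(String name, XMLInfoSet infoset);
    void characters(String text, XMLInfoSet infoset);
    void endElement(String name, XMLInfoSet infoset);
}

// A pipeline stage that annotates the infoset on startElement and
// forwards every call, with the same infoset object, to the next stage.
class AnnotatingFilter implements InfoSetDocumentHandler {
    private final InfoSetDocumentHandler next;
    AnnotatingFilter(InfoSetDocumentHandler next) { this.next = next; }
    public void startElement(String name, XMLInfoSet infoset) {
        infoset.setProperty("example.visited", Boolean.TRUE); // made-up property
        next.startElement(name, infoset);
    }
    public void characters(String text, XMLInfoSet infoset) {
        next.characters(text, infoset);
    }
    public void endElement(String name, XMLInfoSet infoset) {
        next.endElement(name, infoset);
    }
}

// Terminal stage that records the events it receives.
class RecordingHandler implements InfoSetDocumentHandler {
    final List<String> events = new ArrayList<>();
    public void startElement(String name, XMLInfoSet infoset) {
        events.add("start:" + name + ":" + infoset.getProperty("example.visited"));
    }
    public void characters(String text, XMLInfoSet infoset) {
        events.add("chars:" + text);
    }
    public void endElement(String name, XMLInfoSet infoset) {
        events.add("end:" + name);
    }
}
```

The only per-callback cost in a stage that ignores the infoset is
passing one more reference along, which is why I hoped the overhead
would be small.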
How I Tested
The sax.Counter example in the Xerces2 package was used to get
the "performance" measurements in the next section. The jars
for Xerces 1.x, Xerces2, and Xerces2 (w/ infoset) were compared
for running time against three files of varying sizes and
content. The following table shows the basic nature of these
files:
File            Elems   Attrs   Spaces     Chars   Tagginess
personal.xml       20      11       89        76         44%
simpsons.pgml    1117    4597     8481         0         54%
ot.xml          71461       0    25170   3236745          7%
The "tagginess" of a file indicates how much of it is markup
(angle brackets, element and attribute names, etc.) as opposed
to textual content (characters, ignorable whitespace, attribute
values, etc.); ot.xml, which is mostly text, scores only 7%.
It's a pretty useful number, and the sax.Counter program can
generate it with the "-t" option.
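
To make the tagginess idea concrete, here is a rough sketch of how
such a number could be computed with a plain SAX handler. This is
not the sax.Counter code; the class name, the method name, and the
exact rules for what counts as a markup character are assumptions,
and the formula assumed is markup / (markup + content).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical tagginess counter: tallies markup characters and
// content characters separately while the document is parsed.
class TagginessCounter extends DefaultHandler {
    private long markupChars;
    private long contentChars;

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) {
        markupChars += 1 + qName.length();                      // '<' and the element name
        for (int i = 0; i < attrs.getLength(); i++) {
            markupChars += 1 + attrs.getQName(i).length() + 3;  // space, name, '=' and two quotes
            contentChars += attrs.getValue(i).length();         // attribute value counts as content
        }
        markupChars += 1;                                       // '>'
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        markupChars += 2 + qName.length() + 1;                  // '</', the name, '>'
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        contentChars += length;
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        contentChars += length;
    }

    // Percentage of counted characters that are markup (integer division).
    long tagginessPercent() {
        long total = markupChars + contentChars;
        return total == 0 ? 0 : markupChars * 100 / total;
    }

    // Convenience helper: parse an XML string and return its tagginess.
    static long tagginessOf(String xml) {
        try {
            TagginessCounter counter = new TagginessCounter();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), counter);
            return counter.tagginessPercent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

With these counting rules, a document that is mostly text comes out
with a low percentage and an attribute-heavy one with a high
percentage, which matches the general pattern in the table above.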
The Results
The following table shows the time required for Xerces 1.x,
Xerces2, and Xerces2 (w/ infoset) to parse the sample files.
Each parser was "warmed up" by parsing the sample document
once; the number shown is the average of the next 10 parses,
performed in a loop using the "-x" option of the sax.Counter
example.
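
The measurement procedure above amounts to something like the
following sketch. It uses the standard JAXP SAX API rather than
sax.Counter itself, and the class and method names are made up
for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical timing helper: one untimed warm-up parse, then the
// average wall-clock time of the following timed parses.
class ParseTimer {
    static double averageParseMillis(String xml, int repetitions) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            DefaultHandler handler = new DefaultHandler();

            // Warm-up parse; its time is deliberately not counted.
            parser.parse(new InputSource(new StringReader(xml)), handler);

            long start = System.nanoTime();
            for (int i = 0; i < repetitions; i++) {
                parser.parse(new InputSource(new StringReader(xml)), handler);
            }
            long elapsedNanos = System.nanoTime() - start;

            // Convert nanoseconds to milliseconds and average over the runs.
            return elapsedNanos / 1_000_000.0 / repetitions;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Calling ParseTimer.averageParseMillis(document, 10) corresponds to
the "warm up once, average the next 10" scheme used for the table.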
                     Time (ms)
File            Xerces 1.x   Xerces2   Xerces2+
personal.xml            39        29         29
simpsons.pgml          396       243        240
ot.xml                2179      2001       2071
"Xerces2+" is Xerces2 with the infoset additions. I expected
the parsing time to increase from Xerces2 to Xerces2+, but
simpsons.pgml shows a slight decrease. I attribute this to
the small number of parses combined with normal variation in
system performance.
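
One way to check whether a difference of a few milliseconds is
just noise would be to keep the individual parse times and look
at their spread, not only the average. A small sketch follows;
the class and method names are hypothetical, and the per-run
timings were not recorded in this test.

```java
// Hypothetical helper for summarizing a set of per-parse timings.
class ParseStats {
    // Arithmetic mean of the samples.
    static double mean(double[] samples) {
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    // Sample standard deviation (divides by n - 1); needs at least 2 samples.
    static double sampleStdDev(double[] samples) {
        double m = mean(samples);
        double squaredDiffs = 0.0;
        for (double s : samples) {
            squaredDiffs += (s - m) * (s - m);
        }
        return Math.sqrt(squaredDiffs / (samples.length - 1));
    }
}
```

If the difference between two averages is smaller than the
standard deviation of the individual runs, it is probably not
meaningful with only 10 parses per file.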
Conclusion
The addition of a single parameter to the most frequently called
methods of the XNI document handler interface doesn't seem
to have adverse effects on performance: the largest increase
was about 3.5%, on ot.xml. Obviously this depends on the nature
of the parsed documents and the length of the parsing pipeline.
However, I think that this approach is reasonable.
What do other people think? Should we add the information?
And if I add the information, should it be on selected
handler methods or on every method?
--
Andy Clark * IBM, TRL - Japan * [EMAIL PROTECTED]