I was a little worried about the performance impact of adding
an infoset parameter to the XNI callbacks, so I ran a small test
to see how much of an impact the added parameter would have.

The results are at the bottom.

What I Did

First, I created an XMLInfoSet interface (which is basically
just a map) and wrote a simple default implementation. Then I
added an "infoset" parameter to the following methods in the
XMLDocumentHandler interface. (For completeness, the parameter
would also need to be added to the XMLDocumentFragmentHandler
interface.)

  startElement(QName,XMLAttributes,XMLInfoSet)
  emptyElement(QName,XMLAttributes,XMLInfoSet)
  characters(XMLString,XMLInfoSet)
  ignorableWhitespace(XMLString,XMLInfoSet)
  endElement(QName,XMLInfoSet)

I didn't add the parameters to any of the other methods because
it didn't seem necessary. [Does anyone see off-hand any other
method that would need this parameter?]

Then I updated the Xerces2 implementation to pass this infoset
through the pipeline. This took a little bit of time but was
pretty straightforward.
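
For illustration only, here is a minimal sketch of what a map-like
infoset interface and a pipeline component that passes it along
could look like. This is not the actual Xerces2 code: the method
names (putItem, getItem, clear), the ElementHandler interface, and
the DepthAnnotator class are all hypothetical, and plain Strings
stand in for QName and XMLAttributes to keep the example
self-contained.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical map-like infoset; method names are assumptions. */
interface XMLInfoSet {
    void putItem(String key, Object value);
    Object getItem(String key);
    void clear();
}

/** Simple default implementation backed by a HashMap. */
class DefaultXMLInfoSet implements XMLInfoSet {
    private final Map<String, Object> items = new HashMap<>();
    public void putItem(String key, Object value) { items.put(key, value); }
    public Object getItem(String key) { return items.get(key); }
    public void clear() { items.clear(); }
}

/** Simplified stand-in for XMLDocumentHandler (String instead of QName). */
interface ElementHandler {
    void startElement(String qname, XMLInfoSet infoset);
    void endElement(String qname, XMLInfoSet infoset);
}

/** Hypothetical pipeline stage: records element depth, then forwards the call. */
class DepthAnnotator implements ElementHandler {
    private final ElementHandler next;
    private int depth;

    DepthAnnotator(ElementHandler next) { this.next = next; }

    public void startElement(String qname, XMLInfoSet infoset) {
        depth++;
        infoset.putItem("depth", Integer.valueOf(depth));
        next.startElement(qname, infoset);
    }

    public void endElement(String qname, XMLInfoSet infoset) {
        infoset.putItem("depth", Integer.valueOf(depth));
        depth--;
        next.endElement(qname, infoset);
    }
}
```

Each stage receives the same infoset object, may add items to it,
and hands it to the next stage, so the only per-call cost of the
extra parameter is passing one more reference.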

How I Tested

The sax.Counter example in the Xerces2 package was used to get
the "performance" measurements in the next section. The jars
for Xerces 1.x, Xerces2, and Xerces2 (w/ infoset) were compared
for running time against three files of varying sizes and
content. The following table shows the basic nature of these
files:

  File           Elems  Attrs  Spaces    Chars  Tagginess
  personal.xml      20     11      89       76        44%
  simpsons.pgml   1117   4597    8481        0        54%
  ot.xml         71461      0   25170  3236745         7%

The "tagginess" of a file gives you an indication of the
ratio of textual content (characters, ignorable whitespace,
attribute values, etc.) vs. the characters required for the
markup (angle brackets, element and attribute names, etc.).
It is a pretty useful number, and the sax.Counter program
can generate it with the "-t" option.
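
As a rough sketch of how such a number could be computed, the
code below assumes tagginess is the share of markup characters
in the total character count, expressed as a whole percentage.
That formula is an assumption that merely fits the table above
(the text-heavy ot.xml scores lowest); the class and parameter
names are hypothetical and this is not the sax.Counter source.

```java
/** Hypothetical helper; the formula is an assumption, not taken from sax.Counter. */
class Tagginess {
    /**
     * Returns markup characters as a whole percentage of all characters.
     * markupChars:  angle brackets, element and attribute names, etc.
     * contentChars: character data, ignorable whitespace, attribute values, etc.
     */
    static long percent(long markupChars, long contentChars) {
        long total = markupChars + contentChars;
        if (total == 0) {
            return 0; // empty input: avoid division by zero
        }
        return markupChars * 100 / total; // integer division truncates
    }
}
```

Under this assumption, a file with 440 markup characters and 560
content characters would score 44%.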

The Results

The following table shows the time required for Xerces 1.x,
Xerces2, and Xerces2 (w/ infoset) to parse the sample files.
Each parser was "warmed up" by parsing the sample document
once and then the number shown is the average of the next
10 parses performed in a loop by using the "-x" option of
the sax.Counter example.
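
The measurement procedure described above (one untimed warm-up
parse, then the average of the following parses) can be sketched
roughly as follows. This is not the sax.Counter code: the
ParseTimer class and its method names are hypothetical, and the
standard JAXP SAX parser is used only so the example is
self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical timing helper: one untimed warm-up run, then an average. */
class ParseTimer {
    /** Runs the task once untimed, then returns the mean time in ms over 'repeat' runs. */
    static double averageMillis(Runnable task, int repeat) {
        task.run(); // warm-up, not included in the measurement
        long start = System.nanoTime();
        for (int i = 0; i < repeat; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / 1_000_000.0 / repeat;
    }

    /** Parses the given XML text once with a do-nothing SAX handler. */
    static void parseOnce(String xml) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><item id='1'>text</item></doc>";
        double ms = averageMillis(() -> parseOnce(xml), 10);
        System.out.println("average parse time (ms): " + ms);
    }
}
```

With only 10 timed runs, differences of a few percent between
two parsers can easily be within the noise of the machine.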

                 Time (ms)
  File           Xerces 1.x   Xerces2  Xerces2+
  personal.xml           39        29        29
  simpsons.pgml         396       243       240
  ot.xml               2179      2001      2071

"Xerces2+" is Xerces2 with the infoset additions. I expected
the parsing time to increase from Xerces2 to Xerces2+, and
"ot.xml" does show an increase of about 3.5% (2001 ms to
2071 ms), but "simpsons.pgml" shows a slight decrease.
I attribute the decrease to the combination of the small
number of parses and normal variation in system performance.

Conclusion

The addition of a single parameter to the most frequently
called methods of the XNI document handler interface doesn't
seem to have adverse effects on performance. Obviously this
depends on the nature of the parsed documents and the length
of the parsing pipeline. However, I think that this 
approach is reasonable.

What do other people think? Should we add the information?
And if I add the information, should it be on selected
handler methods or on every method?

-- 
Andy Clark * IBM, TRL - Japan * [EMAIL PROTECTED]
