(Apologies in advance if this is already answered. I did attempt to search
the archives at lists.apache.org, Stack Overflow, etc., but did not find an
answer.)

We use Xerces-J to validate XML files. (XSD 1.0; we are not yet using XSD
1.1.)

The schemas of these files are huge: think 300+ fairly large XSD files, all
included/imported together. Megabytes of XSD in total.

In contrast, the XML documents we're validating are typically small, a few
hundred to a few thousand bytes of XML each. But there are many of them, so
we call Xerces to parse and validate them in a loop. We are already using
the Xerces APIs in such a way that the XSD is loaded once, and the parser
is then called repeatedly for each input data document.
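For concreteness, here is a minimal sketch of the pattern we're using,
written against the standard JAXP validation API (which Xerces implements).
The tiny inline schema here is just a hypothetical stand-in for our real
300+ file XSD set:

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class ValidateLoop {
    public static void main(String[] args) throws Exception {
        // Hypothetical tiny schema standing in for our large XSD set.
        String xsd =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
            "  <xs:element name='msg' type='xs:string'/>" +
            "</xs:schema>";

        // Load/compile the schema ONCE; the Schema object is
        // immutable and safe to reuse across many validations.
        SchemaFactory sf =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = sf.newSchema(new StreamSource(new StringReader(xsd)));

        // Then validate many small documents in a loop,
        // getting a fresh Validator from the shared Schema each time.
        String[] docs = { "<msg>hello</msg>", "<msg>world</msg>" };
        for (String doc : docs) {
            Validator v = schema.newValidator();
            v.validate(new StreamSource(new StringReader(doc)));
        }
        System.out.println("validated " + docs.length + " documents");
    }
}
```

So the per-document work is only the newValidator()/validate() calls; the
schema-loading cost is paid once, at startup, which is the cost I'm asking
about below.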

I imagine that to validate XML, Xerces does something akin to "compiling"
the XSD into lower-level data structures for faster use when actually
parsing and validating the incoming XML.

Question 1: Is that true? Is there much compiling/lowering of the XSD into
fast parser-runtime structures?

If the answer to that is yes, then my next question applies also.

Question 2: Is it possible to perform this "compilation" of the large XSD
schema once, serialize the resulting Java object to a file, and then reload
that pre-compiled form, so as not to pay this compilation overhead at
startup time?

I have seen some discussion of serializing an XSModel of the XSD schema,
but that's more or less isomorphic to the XSD source files themselves,
i.e., it doesn't really save any "compilation" overhead.

Any advice appreciated.

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com
