Hello,
I am working on a project that aims to validate open data against an open
standard defined in an XML Schema[1]. The document size varies from 13 kB to
2 GB[2]. The basic problem I face is key constraint validation, defined as
key, keyref and unique combinations. The special case here is that most of
our constraints use a compound key: objects have an ID and a version, and a
reference should match a foreign object with that same pair. To
illustrate:
Every [StopPointInJourneyPattern Id + Version + order] must be unique within document.
refer="netex:StopPointInJourneyPattern_AnyVersionedKey_ordered">
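To make the constraint concrete, here is a minimal Python sketch of what the uniqueness rule asks for: no two StopPointInJourneyPattern elements may share the same (id, version, order) triple. The dictionary field names and the sample values are assumptions for illustration only; they are not taken from the schema or from a real document.

```python
from collections import Counter

def duplicate_keys(rows):
    """Return the compound keys that occur more than once."""
    counts = Counter((r["id"], r["version"], r["order"]) for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)

# Hypothetical sample data; real values come from the document being validated.
points = [
    {"id": "SP:1", "version": "1", "order": "1"},
    {"id": "SP:1", "version": "2", "order": "1"},  # same id, other version: fine
    {"id": "SP:1", "version": "1", "order": "1"},  # exact duplicate: violation
]
print(duplicate_keys(points))
```

Only the third entry violates the rule, because it repeats all three components of the first one.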
Because XML Schema validation generally performs terribly, the project has
one XSD root with constraint validation and a separate root file without
constraint validation.
In my view, the syntax validation performance of libxml2 alone is quite
good. It takes about 14s to load the entire XSD, 9s to load a file of
about 400MB, and 3 seconds to validate it. Xerces-c would take 50s total.
The main problem that I am trying to address is constraint validation
itself, which takes unreasonably long. I think improving this would help
the general public, not only this project. Adding just the constraint
illustrated above increases those 3 seconds of syntax validation to 186
seconds.
If we peek into the document using xmllint --shell:
setns netex=http://www.netex.org.uk/netex
xpath count(.//netex:StopPointInJourneyPattern)
Object is a number : 39509
The following is evaluated within 2 seconds:
xpath count(.//netex:StopPointInJourneyPatternRef |
.//netex:FarePointInPatternRef | .//netex:FromPointInPatternRef |
.//netex:ToPointInPatternRef | .//netex:StartPointInPatternRef |
.//netex:EndPointInPatternRef)
Object is a number : 0
I would like to ask some naive questions concerning the schema validation.
1) Considering there is no ref to match a key, why would the refer be
evaluated at all? Removing the key/keyref pair manually reduces the
validation time to 77s. That is still quite high for merely evaluating
uniqueness. For the unique constraint this already seems to be the case: a
selector that matches no elements does not cause overhead.
2) Limiting the uniqueness constraint to merely @id reduces the validation
time to 37s.
3) Considering my count() performance above (a second or two), querying the
document does not really seem to be the issue. Sure, it queries the entire
tree for a single object, but one could argue that the xpath result would be
a one-time effort, or that an index could be placed on all elements to be
queried. For example, each xsd:key could be stored as a hash table, and all
keyrefs could then be looked up in that hash table.
4) Changing the selector xpath to the expression below increases the
evaluation time to 1 minute and 20 seconds. A valid expression without any
result reduces the computation time to 3 seconds. I find it interesting
that a full-path xpath expression (including root) seems to work faster in
the xmllint shell, but performs worse as a selector.
netex:dataObjects/netex:CompositeFrame/netex:frames/netex:ServiceFrame/netex:journeyPatterns/netex:ServiceJourneyPattern/netex:pointsInSequence/netex:StopPointInJourneyPattern
5) Considering that constraint validation is read-only, would it be possible
to parallelize the constraint checks using multithreading?
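The hash-index idea from question 3 can be sketched in a few lines of Python. This is only an illustration of the approach, not a description of how libxml2 works: the element and attribute names (`Point`, `PointRef`, `id`, `version`, `ref`) are hypothetical stand-ins, and the standard-library ElementTree parser is used in place of libxml2. Keys are collected once into a set of (id, version) pairs, after which every reference costs a single hash lookup.

```python
import xml.etree.ElementTree as ET

def check_keyrefs(root, key_tag="Point", ref_tag="PointRef"):
    """Return (duplicate_keys, dangling_refs) for compound (id, version) keys."""
    seen, duplicates = set(), []
    # One pass over the key elements builds the hash index.
    for elem in root.iter(key_tag):
        key = (elem.get("id"), elem.get("version"))
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    # Each reference is resolved with one lookup; with no references
    # in the document this loop does no work at all.
    dangling = []
    for elem in root.iter(ref_tag):
        ref = (elem.get("ref"), elem.get("version"))
        if ref not in seen:
            dangling.append(ref)
    return duplicates, dangling

# Hypothetical miniature document, not NeTEx.
doc = ET.fromstring(
    "<doc>"
    "<Point id='a' version='1'/>"
    "<Point id='a' version='2'/>"
    "<Point id='a' version='1'/>"
    "<PointRef ref='a' version='2'/>"
    "<PointRef ref='b' version='1'/>"
    "</doc>"
)
print(check_keyrefs(doc))
```

In this sketch the cost grows with the number of keys plus the number of references, rather than with their product.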
The top of an oprofile trace for the entire constraint checking document
looks like this:
CPU: AMD64 generic, speed 2000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (CPU Clocks not Halted) with a unit mask of
0x00 (No unit mask) count 10
samples   %        image name         symbol name
16300585  52.9514  libxml2.so.2.9.10  xmlStreamPushInternal
5715003   18.5648  libxml2.so.2.9.10  xmlStreamPop
1547103   5.0257   libxml2.so.2.9.10  xmlSchemaXPathEvaluate
1341334   4.3572   libxml2.so.2.9.10  xmlSchemaXPathProcessHistory
776914    2.5238   libxml2.so.2.9.10  xmlStrchr
636948    2.0691   libxml2.so.2.9.10  xmlSchemaValidatorPopElem
585883    1.9032   libxml2.so.2.9.10  xmlStrEqual
559714    1.8182   libxml2.so.2.9.10  xmlStreamPushAttr
369550    1.2005   libxml2.so.2.9.10  xmlHashLookup3
295317    0.9593   libxml2.so.2.9.10  __xmlRaiseError
295251    0.9591   libxml2.so.2.9.10  xmlSchemaXPathPop
260483    0.8462   libxml2.so.2.9.10  xmlStreamPush
124948    0.4059   libxml2.so.2.9.10  xmlStrlen
114542    0.3721   libxml2.so.2.9.10  xmlFACompareAtoms
98775     0.3209   libxml2.so.2.9.10  xmlFAComputesDeterminism
98228     0.3191   libxml2.so.2.9.10  xmlSchemaVAttributesComplex
90614     0.2944   libxml2.so.2.9.10  xmlRegStrEqualWildcard
81907     0.2661   libc-2.32.so       malloc_consolidate
81235     0.2639   libxml2.so.2.9.10  xmlFARecurseDeterminism
62143     0.2019   libxml2.so.2.9.10  xmlStrdup
60806