[ http://issues.apache.org/jira/browse/XERCESC-1363?page=comments#action_60431 ]
David Earlam commented on XERCESC-1363:
---------------------------------------
I checked out and built on VS.Net 2003 with the new 5-parameter version of
substring called from tokenizeString.
I can confirm that this lets me parse the file in about 18 seconds (or 15.7
seconds when using the option -wfile=NUL so as not to measure console I/O
scrolling).
This is about 36 kB per second.
Even Christian's laptop reaches only 60 kB/s.
Yet I have some C code that can process this at nearly 1 MB a second.
I reckon xerces list validation should be an order of magnitude faster than it
currently is.
I tried avoiding all parameter passing and call/return overhead by replacing
tokenizeString's use of substring with

    {
        // The token spans [index, skip), so its length is already known.
        int copysize = skip - index;
        memcpy(token, &tokenizeStr[index], copysize * sizeof(XMLCh));
        token[copysize] = 0;  // null-terminate the copied token
    }

since tokenizeString has the information to safely do this.
Yet even compiled with /Oi this gave at best a 0.04% improvement on calling
substring.
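
To make the idea concrete, here is a minimal, self-contained sketch of a
whitespace tokenizer that copies each token with a single memcpy because the
scan already knows the token's start and end. It is not the Xerces source:
the names Ch, isXmlSpace and tokenize are hypothetical, char16_t stands in
for XMLCh, and std::u16string replaces the raw token buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Assumption: a 16-bit code unit standing in for Xerces' XMLCh.
using Ch = char16_t;

// True for the XML whitespace characters that separate list items.
static bool isXmlSpace(Ch c) {
    return c == u' ' || c == u'\t' || c == u'\n' || c == u'\r';
}

// Splits `str` (of known length `len`) on whitespace in a single pass.
// No string length is ever recomputed while scanning.
std::vector<std::u16string> tokenize(const Ch* str, std::size_t len) {
    std::vector<std::u16string> tokens;
    std::size_t index = 0;
    while (index < len) {
        // Skip whitespace before the next token.
        while (index < len && isXmlSpace(str[index])) ++index;
        if (index == len) break;

        // Advance `skip` to one past the end of the token.
        std::size_t skip = index;
        while (skip < len && !isXmlSpace(str[skip])) ++skip;

        // The token spans [index, skip), so copy exactly that many units.
        std::size_t copysize = skip - index;
        std::u16string token(copysize, u'\0');
        std::memcpy(&token[0], &str[index], copysize * sizeof(Ch));
        tokens.push_back(std::move(token));

        index = skip;
    }
    return tokens;
}
```

Called as tokenize(text, length), this returns one string per list item. As
the measurement above suggests, the copy itself is cheap either way; the
sketch only shows that the tokenizer has enough information to avoid any
per-token length calculation.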
So I looked elsewhere.
I figured that
BaseRefVectorOf<XMLCh>::ensureExtraCapacity()
is reallocating too often.
I made a change to make it grow by half as much again each time, rather than
by a constant 32 elements regardless of the vector's size.
With this change the test file now parses in under 0.8 seconds. This is
672 kB/s!
This is a speed versus data space trade-off which I believe is justified. Is
there a unit-test suite I can run to ensure I've broken nothing?
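
The effect of the growth policy can be illustrated with a small simulation.
This is only a sketch and not the code in the attached patch: GrowthCost,
simulateAppends, growBy32 and growByHalf are hypothetical names, and the
floor of 32 in growByHalf is an assumption added so that small vectors still
grow. It counts how many reallocations and element copies occur when n
elements are appended one at a time.

```cpp
#include <cassert>
#include <cstddef>

struct GrowthCost {
    std::size_t reallocations;   // how many times the buffer was regrown
    std::size_t elementsCopied;  // total elements moved into new buffers
};

// Simulates appending `n` elements one at a time to an initially empty
// vector. `nextCapacity` picks the new capacity whenever the buffer is full.
template <typename Policy>
GrowthCost simulateAppends(std::size_t n, Policy nextCapacity) {
    GrowthCost cost{0, 0};
    std::size_t capacity = 0;
    for (std::size_t size = 0; size < n; ++size) {
        if (size == capacity) {
            cost.elementsCopied += size;  // existing elements are copied over
            capacity = nextCapacity(capacity);
            ++cost.reallocations;
        }
    }
    return cost;
}

// Constant increment: always add 32 slots.
inline std::size_t growBy32(std::size_t cap) { return cap + 32; }

// Geometric: add half the current capacity (assumed minimum step of 32).
inline std::size_t growByHalf(std::size_t cap) {
    std::size_t extra = cap / 2;
    return cap + (extra < 32 ? 32 : extra);
}
```

Comparing simulateAppends(n, growBy32) with simulateAppends(n, growByHalf)
for a list of a few hundred thousand items shows the constant policy copying
a number of elements that grows roughly with n squared, while the geometric
policy stays roughly proportional to n, at the cost of up to half as much
again in unused capacity.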
regards,
David
> DataTypeListValidator extraordinarily slow for long lists
> -----------------------------------------------------------
>
> Key: XERCESC-1363
> URL: http://issues.apache.org/jira/browse/XERCESC-1363
> Project: Xerces-C++
> Type: Bug
> Components: Validating Parser (Schema) (Xerces 1.5 or up only)
> Versions: 2.5.0, 2.6.0
> Environment: Windows 2000
> Reporter: David Earlam
> Priority: Minor
> Attachments: BaseRefVectorOf.c.patch, XMLString.cpp.patch, pq.zip,
> second_patch_XMLString.cpp.zip
>
> Validating an XML instance against a Schema with an unbounded xsd:list type
> can take much greater than O(n) processing resources, where n is the number
> of items in the list.
> To reproduce use this Schema:
> pq.xsd
> <?xml version="1.0" encoding="utf-8" ?>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>     xmlns:pqns="http://swsis.cambridge.arm.com/~dearlam/xercestest/"
>     targetNamespace="http://swsis.cambridge.arm.com/~dearlam/xercestest/"
>     elementFormDefault="qualified" version="0.1">
>   <xs:annotation>
>     <xs:documentation xml:lang="en">
>       XML schema for Hofstadter's Gödel pq-System.
>
>       Test data for list data type validation.
>     </xs:documentation>
>   </xs:annotation>
>   <xs:element name="pqData" type="pqns:pqDataType"></xs:element>
>   <xs:complexType name="pqDataType">
>     <xs:complexContent>
>       <xs:restriction base="xs:anyType">
>         <xs:sequence minOccurs="1" maxOccurs="1">
>           <xs:element name="dashes" type="pqns:dashBlockType"></xs:element>
>           <xs:element name="p" type="xs:string" xsi:nill="true"></xs:element>
>           <xs:element name="dashes" type="pqns:dashBlockType"></xs:element>
>           <xs:element name="q" type="xs:string" xsi:nill="true"></xs:element>
>           <xs:element name="dashes" type="pqns:dashBlockType"></xs:element>
>         </xs:sequence>
>       </xs:restriction>
>     </xs:complexContent>
>   </xs:complexType>
>   <xs:complexType name="porqType">
>     <xs:simpleContent>
>       <xs:extension base="xs:string"></xs:extension>
>     </xs:simpleContent>
>   </xs:complexType>
>   <xs:complexType name="dashBlockType">
>     <xs:simpleContent>
>       <xs:extension base="pqns:dataDashes"></xs:extension>
>     </xs:simpleContent>
>   </xs:complexType>
>   <xs:simpleType name="Dash">
>     <xs:restriction base="xs:string">
>       <xs:pattern value="[\-]"></xs:pattern>
>     </xs:restriction>
>   </xs:simpleType>
>   <xs:simpleType name="dataDashes">
>     <xs:restriction base="pqns:DashList">
>       <xs:minLength value="0" />
>     </xs:restriction>
>   </xs:simpleType>
>   <xs:simpleType name="DashList">
>     <xs:list itemType="pqns:Dash"></xs:list>
>   </xs:simpleType>
> </xs:schema>
> and this XML file
> pqData0.xml
> <?xml version="1.0" encoding="utf-8" ?>
> <pqData xmlns='http://swsis.cambridge.arm.com/~dearlam/xercestest/'
>     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>     xsi:schemaLocation="http://swsis.cambridge.arm.com/~dearlam/xercestest/
>     http://swsis.cambridge.arm.com/~dearlam/xercestest/pq.xsd">
>   <dashes>
>   - -
>   </dashes>
>   <p/>
>   <dashes>-</dashes>
>   <q/>
>   <dashes>-</dashes>
> </pqData>
> (replacing swsis.cambridge.arm.com/~dearlam/xercestest with your location)
> Then use
> domprint -wfpp=on pqData0.xml
> and
> domprint -n -s -wfpp=on pqData0.xml
> to print the XML non-validating and validating.
> Both print in an equally short time. OK.
> Now, edit pqData0.xml as pqData1.xml and replace
> - -
> with 4000 lines of
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> This gives a 500 kB file (which mimics my real data).
> If you then try
> domprint -wfpp=on pqData1.xml
> and
> domprint -n -s -wfpp=on pqData1.xml
> the first prints instantly (pipe it to NUL if you like), but the second
> consumes 99% CPU for 230 seconds, then prints.
> That's only about 2 kB per second!
> --
> (My suspicion is that XMLString::tokenizeString is using subString() to
> calculate the string length way too many times...)
> kind regards,
> David
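
For anyone reproducing the quoted report, the large test file can be
generated rather than edited by hand. The following is a rough sketch, not
something taken from the issue attachments: writePqData is a hypothetical
helper, and the number of dashes per line is left as a parameter because the
sample line in the report is wrapped by the mailer.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Writes a pqData instance whose first <dashes> element contains `lines`
// lines, each holding `dashesPerLine` dashes separated by single spaces.
void writePqData(std::ostream& out, std::size_t lines,
                 std::size_t dashesPerLine) {
    const std::string ns = "http://swsis.cambridge.arm.com/~dearlam/xercestest/";
    out << "<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n"
        << "<pqData xmlns='" << ns << "'\n"
        << " xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n"
        << " xsi:schemaLocation=\"" << ns << " " << ns << "pq.xsd\">\n"
        << "<dashes>\n";
    for (std::size_t line = 0; line < lines; ++line) {
        for (std::size_t i = 0; i < dashesPerLine; ++i) {
            if (i != 0) out << ' ';
            out << '-';
        }
        out << '\n';
    }
    out << "</dashes>\n"
        << "<p/>\n"
        << "<dashes>-</dashes>\n"
        << "<q/>\n"
        << "<dashes>-</dashes>\n"
        << "</pqData>\n";
}
```

Writing the result of writePqData(file, 4000, 64) to a std::ofstream named
pqData1.xml gives a file of roughly 500 kB; the value 64 is an assumed
figure chosen to land near the size quoted in the report. As in the report,
replace the namespace location with your own.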