When using schema-based validation while SAX-parsing large files that contain
large numbers of minor validation errors, performance degrades much faster
than linearly as the file size increases. In my testing, a 250M file is
processed in about 15 minutes, whereas a 500M file containing proportionately
the same number of errors takes several hours to run through.
Running under OptimizeIt shows that, when working through the largest files,
the application is spending the majority of its time inside the
XMLSchemaValidator$XSIErrorReporter.reportError() method, specifically on
the line
fErrors.addElement(key);
I believe that because the declaration for the XSIErrorReporter's fErrors
field specifies a capacityIncrement of 8, the vector ends up resizing its
underlying buffer on every 8th error, and each resize copies the vector's
entire contents. As the vector grows (with my test data, into the hundreds of
thousands or even millions of entries), this consumes a great deal of CPU and
keeps the garbage collector quite busy.
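To put rough numbers on that: appending n elements with a fixed increment of
8 means a reallocation on every 8th add, each of which copies the current
contents, so the total copying grows roughly as n^2/16, whereas with doubling
it stays at about 2n. A quick back-of-the-envelope sketch (not Xerces code;
the one-million error count is just an assumption based on my test data)
illustrates the difference:

// Counts element copies for a fixed capacityIncrement of 8 versus doubling.
public class GrowthCost {
    public static void main(String[] args) {
        long n = 1000000L;   // assumed number of validation errors

        // Fixed increment of 8: a reallocation on every 8th add, each one
        // copying the current contents of the vector.
        long fixedCopies = 0;
        for (long size = 8; size < n; size += 8) {
            fixedCopies += size;
        }

        // Doubling (capacityIncrement of 0): a reallocation each time the
        // capacity is exhausted, copying at most about 2n elements in total.
        // Vector's default initial capacity is 10.
        long doublingCopies = 0;
        for (long cap = 10; cap < n; cap *= 2) {
            doublingCopies += cap;
        }

        System.out.println("increment of 8: ~" + fixedCopies + " element copies");
        System.out.println("doubling:       ~" + doublingCopies + " element copies");
    }
}

For a million errors that works out to something on the order of 6 x 10^10
element copies with the fixed increment, versus a couple of million with
doubling, which would account for both the CPU time and the garbage-collection
pressure.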
I recompiled with a very minor code change, altering the declaration for
fErrors from the current
Vector fErrors = new Vector(INITIAL_STACK_SIZE, INC_STACK_SIZE);
to use the default constructor
Vector fErrors = new Vector();
thus allowing the vector to use the default capacityIncrement of 0, i.e. the
capacity doubles whenever the vector outgrows its backing array.
With this change, the time required to process my largest test file is
reduced from six or seven hours to a bit over 30 minutes.
I don't pretend to fully understand everything that's going on in this
class. Is there a compelling reason for specifying a capacityIncrement for
fErrors?
In case it's useful, I'm running version 2.6.2 on Solaris. I'm guessing it
wouldn't really be appropriate to attach my test files....