[
https://issues.apache.org/jira/browse/XERCESJ-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Deepak Kumar updated XERCESJ-1276:
----------------------------------
Attachment: xerces-binaries-patched-over-2.11.0.zip
I am experiencing a very similar problem, but with a significantly larger
impact. Attached is a zip holding binaries built with the patch applied (the
patch makes effective use of hashCode() and equals()). Below is a snippet of
the binary manifest:
Manifest-Version: 1.0
Ant-Version: Apache Ant(TM) version 1.8.3 compiled on February 26 2012
Created-By: 1.6.0_32-ea (Sun Microsystems Inc.)
Problem details:
----------------
I have a compressed input file of roughly 25M (24.4 MiB) holding XML.
Compression is done using the java.util.zip compression/decompression APIs
with the default strategy, and I am sure the file could come close to 500M
once decompressed.
A simple piece of code is deployed as a webapp in Tomcat 6.14 - Tomcat 7.0.50
(with java 1.6.30 & java 1.6.32) to read in the compressed file and run an XML
parser on it. Parsing takes nearly 30 minutes to complete on a 4-core i5
2.5 GHz laptop (nothing in this process is parallelized). This has been
checked and confirmed both with the xerces binaries (2.6 and 2.11) placed
explicitly so that xerces takes control of the entire parsing, AND with java's
default parsing implementation, which is very much the same as the one in
xerces.
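
For anyone trying to reproduce the setup, here is a minimal sketch of it. It is
not the webapp code from this report: the class name CompressedXmlParse, the
tiny schema with a single xs:unique constraint, and the generated document are
all assumptions made for illustration. It compresses XML with
DeflaterOutputStream (default level and strategy), reads it back through
InflaterInputStream, and runs a schema-validating SAX parse, which is the kind
of parse that exercises identity-constraint checking.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class CompressedXmlParse {
    // Hypothetical schema: each <item> under <items> must carry a unique id.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='items'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='item' maxOccurs='unbounded'>"
      + "<xs:complexType>"
      + "<xs:attribute name='id' type='xs:string' use='required'/>"
      + "</xs:complexType></xs:element>"
      + "</xs:sequence></xs:complexType>"
      + "<xs:unique name='uniqueId'>"
      + "<xs:selector xpath='item'/><xs:field xpath='@id'/>"
      + "</xs:unique>"
      + "</xs:element></xs:schema>";

    /** Builds a document with n items, each with a distinct id. */
    static String buildXml(int n) {
        StringBuilder sb = new StringBuilder("<items>");
        for (int i = 0; i < n; i++) {
            sb.append("<item id='k").append(i).append("'/>");
        }
        return sb.append("</items>").toString();
    }

    /** Compresses with java.util.zip using the default level and strategy. */
    static byte[] compress(String xml) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStream out = new DeflaterOutputStream(buf)) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    /** Inflates the stream, validates against XSD, returns the item count. */
    static int parse(InputStream compressed) throws Exception {
        Schema schema = SchemaFactory
            .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setSchema(schema);
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                if ("item".equals(local)) {
                    count[0]++;
                }
            }
            @Override
            public void error(SAXParseException e) throws SAXParseException {
                throw e; // surface validation errors instead of ignoring them
            }
        };
        try (InputStream in = new InflaterInputStream(compressed)) {
            factory.newSAXParser().parse(in, handler);
        }
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        byte[] data = compress(buildXml(50_000));
        long start = System.nanoTime();
        int items = parse(new ByteArrayInputStream(data));
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(items + " items validated in " + ms + " ms");
    }
}
```

Raising the item count in main() is one way to see how validation time grows
with the number of values tracked by the uniqueness constraint on a given
parser implementation.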
Over multiple executions, the code below in xerces has been identified as the
potential hotspot (via multiple profiling tools). It chokes up entirely
because of a poorly performing nested loop over significantly large value
indexes (potentially in MB's), which also lines up with the comment in the
source.
org.apache.xerces.internal.impl.xs.XMLSchemaValidator#ValueStoreBase.contains()
// REVISIT: we can improve performance by using hash codes, instead of
// traversing global vector that could be quite large.
..........
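
To make the cost difference concrete, here is a small sketch contrasting the
two lookup strategies. It is not the Xerces code and not the attached patch:
ValueStoreSketch, LinearStore and HashedStore are hypothetical names, and the
field values of one identity-constraint match are modelled simply as a
List<String>. The linear store scans every stored tuple on each lookup, so
inserting n distinct tuples costs on the order of n^2 comparisons; the hashed
store relies on hashCode() and equals() and does each lookup in roughly
constant time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ValueStoreSketch {

    /** Linear lookup: every contains() walks the whole list of tuples. */
    static final class LinearStore {
        private final List<List<String>> tuples = new ArrayList<>();

        boolean contains(List<String> tuple) {
            for (List<String> stored : tuples) {
                if (stored.equals(tuple)) {
                    return true;
                }
            }
            return false;
        }

        /** Returns false if the tuple was already present (a duplicate). */
        boolean addIfAbsent(List<String> tuple) {
            if (contains(tuple)) {
                return false;
            }
            tuples.add(tuple);
            return true;
        }
    }

    /** Hashed lookup: List.hashCode()/equals() drive a HashSet. */
    static final class HashedStore {
        private final Set<List<String>> tuples = new HashSet<>();

        boolean contains(List<String> tuple) {
            return tuples.contains(tuple);
        }

        /** Returns false if the tuple was already present (a duplicate). */
        boolean addIfAbsent(List<String> tuple) {
            return tuples.add(tuple);
        }
    }

    public static void main(String[] args) {
        int n = 10_000; // illustrative size, far smaller than the reported case

        long t0 = System.nanoTime();
        LinearStore linear = new LinearStore();
        for (int i = 0; i < n; i++) {
            linear.addIfAbsent(Arrays.asList("key" + i, "field"));
        }
        long linearMs = (System.nanoTime() - t0) / 1_000_000;

        t0 = System.nanoTime();
        HashedStore hashed = new HashedStore();
        for (int i = 0; i < n; i++) {
            hashed.addIfAbsent(Arrays.asList("key" + i, "field"));
        }
        long hashedMs = (System.nanoTime() - t0) / 1_000_000;

        System.out.println("linear: " + linearMs + " ms, hashed: " + hashedMs + " ms");
    }
}
```

Both stores give the same answers; only the cost per lookup differs, which is
why the gap widens quickly as the number of stored values grows.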
[NOTE] Interestingly, the same piece of code runs perfectly well (with both
the jdk and xerces implementations) within a minute from Eclipse, and even from
a plain "java -classpath ... ParserTest", without any significant JVM hotspot
indications. This raises the worry of whether Tomcat is doing something
internally during the parsing.
As of now I am able to run it within a minute inside Tomcat as well; the
attached binaries can be used as a drop-in replacement by people facing this
problem.
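
When swapping parser binaries in a container, it can help to confirm which
implementation the webapp actually picks up, since that depends on the class
loader it runs under. The following is a small sketch for that check, not part
of the attached binaries; WhichParser and describe() are hypothetical names.
It prints the concrete SAXParserFactory class and the location it was loaded
from.

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParserFactory;

class WhichParser {

    /** Describes a class by name and by the location it was loaded from. */
    static String describe(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        String where = (src == null || src.getLocation() == null)
            ? "no code source (typically a class from the JDK itself)"
            : src.getLocation().toString();
        return c.getName() + " loaded from " + where;
    }

    public static void main(String[] args) {
        // The factory lookup goes through JAXP, so the result can differ
        // between a standalone run and a run inside a servlet container.
        System.out.println(describe(SAXParserFactory.newInstance().getClass()));
    }
}
```

Calling describe() from inside the webapp (for example from a test servlet)
and from a standalone run shows whether both are using the same parser.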
[ATTENTION] On a different note, with the existing xerces binaries, if the
application attempts to re-process the XMLs, even in a different thread, it
severely impacts the execution of other operational threads. The entire
webapp then appears to freeze randomly and, strangely, parsing takes even
longer (close to 2x the time) even with enough memory allocated. I am not
sure whether the issue persists with other application servers like glassfish
or jetty OR is purely bound to Tomcat.
--Deepak
> Improve performance of XML Schema Identity-constraint validation ---
> XMLSchemaValidator$ValueStoreBase.contains() is painfully slow.
> ------------------------------------------------------------------------------------------------------------------------------------
>
> Key: XERCESJ-1276
> URL: https://issues.apache.org/jira/browse/XERCESJ-1276
> Project: Xerces2-J
> Issue Type: Bug
> Components: XML Schema 1.0 Structures
> Affects Versions: 2.6.2, 2.9.1
> Reporter: Kenny MacLeod
> Labels: gsoc, gsoc2013, mentor
> Attachments: XMLSchemaValidator.java,
> Xerces-J-src.2.11.0_patch1276.txt, xerces-binaries-patched-over-2.11.0.zip,
> xerces-value-store.txt
>
>
> Under certain conditions, the contains() method in
> XMLSchemaValidator$ValueStoreBase can cripple the performance of parsing and
> validation.
> I'm not sure what those conditions are, but as a guideline figure I was using
> JAXB2 to deserialize a 22meg XML file. Without schema validation, it took 5
> seconds. With validation, it took over 3 minutes (JDK 1.5.0_10 on win32). My
> profiler pointed the finger squarely at that method XMLSchemaValidator.
> Suspicions were aroused further when seeing this comment in the source:
> public boolean contains() {
> // REVISIT: we can improve performance by using hash codes, instead of
> // traversing global vector that could be quite large.
> This is present in Xerces 2.6.2 contained with JDK1.5.0_10, and also in the
> source for 2.9.1.