[jira] [Updated] (XERCESJ-1276) Improve performance of XML Schema Identity-constraint validation --- XMLSchemaValidator$ValueStoreBase.contains() is painfully slow.

Deepak Kumar (JIRA) Fri, 31 Jan 2014 22:38:31 -0800

     [ https://issues.apache.org/jira/browse/XERCESJ-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Deepak Kumar updated XERCESJ-1276:
----------------------------------

    Attachment: xerces-binaries-patched-over-2.11.0.zip

I am experiencing a very similar problem, but with a significantly larger
impact. Attached is a zip holding binaries built with the patch applied (which
makes effective use of hashCode() and equals()). Below is a snippet of the
binary manifest:

Manifest-Version: 1.0
Ant-Version: Apache Ant(TM) version 1.8.3 compiled on February 26 2012
Created-By: 1.6.0_32-ea (Sun Microsystems Inc.)

Problem details:
----------------

I have a compressed input file of roughly 25 MB (24.4 MiB) holding XML. It is
compressed using the java.util.zip compression/decompression APIs with the
default strategy, and I am fairly sure the data comes close to 500 MB once
decompressed.

A simple piece of code is deployed as a webapp in Tomcat (6.14 through 7.0.50,
with Java 1.6.30 and 1.6.32) to read the compressed file and run an XML parser
on it. Parsing takes nearly 30 minutes on a laptop with a 4-core 2.5 GHz i5
processor (nothing in this process is parallelized). I have confirmed this
both with the Xerces binaries (2.6 and 2.11) put in place explicitly so that
Xerces handles all parsing, and with the JDK's default parser implementation,
which is very much the same code as Xerces.
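
For readers who want to reproduce the setup, here is a minimal sketch of
parsing XML straight from a java.util.zip-compressed stream. It assumes the
data was written with DeflaterOutputStream using default settings; the class
and method names (CompressedXmlParse, deflate, countElements) are hypothetical
and not taken from the original webapp, and the code uses Java 7+ syntax.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class CompressedXmlParse {

    // Compress bytes with java.util.zip using the default Deflater settings.
    static byte[] deflate(byte[] raw) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(raw);
        }
        return out.toByteArray();
    }

    // Inflate on the fly and count start-element events, so the
    // uncompressed document is never held in memory as a whole.
    static int countElements(byte[] compressed) throws Exception {
        final int[] count = {0};
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        try (InputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            factory.newSAXParser().parse(in, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    count[0]++;
                }
            });
        }
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><item id=\"1\"/><item id=\"2\"/></root>";
        byte[] compressed = deflate(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println("elements: " + countElements(compressed));
    }
}
```

This sketch does not enable schema validation, so on its own it would not
exercise the identity-constraint code path discussed in this issue.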

Over multiple runs, several profiling tools identified the code below in
Xerces as the hotspot. It chokes because of a nested loop over a very large
collection of stored values (potentially megabytes of data), which matches the
REVISIT comment in the source.

org.apache.xerces.internal.impl.xs.XMLSchemaValidator#ValueStoreBase.contains()
            // REVISIT: we can improve performance by using hash codes, instead of
            // traversing global vector that could be quite large.
            ..........
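
To illustrate why the REVISIT comment matters, here is a simplified sketch
contrasting a linear duplicate check with a hash-based one. This is not the
Xerces code and not the attached patch: ValueStoreSketch, LinearStore and
HashedStore are hypothetical names, and the real validator compares typed
schema values rather than plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not the actual ValueStoreBase implementation.
class ValueStoreSketch {

    // Each contains() walks every stored tuple, so inserting n distinct
    // tuples costs O(n^2) comparisons overall.
    static final class LinearStore {
        private final List<List<String>> values = new ArrayList<>();

        boolean contains(List<String> tuple) {
            for (List<String> stored : values) {
                if (stored.equals(tuple)) {
                    return true;
                }
            }
            return false;
        }

        // Returns false when the tuple is a duplicate.
        boolean addIfAbsent(List<String> tuple) {
            if (contains(tuple)) {
                return false;
            }
            values.add(tuple);
            return true;
        }
    }

    // Relies on List.hashCode()/equals(), so each lookup is O(1) on average.
    static final class HashedStore {
        private final Set<List<String>> values = new HashSet<>();

        boolean contains(List<String> tuple) {
            return values.contains(tuple);
        }

        // Set.add() returns false when the tuple is already present.
        boolean addIfAbsent(List<String> tuple) {
            return values.add(tuple);
        }
    }

    public static void main(String[] args) {
        int n = 20_000;
        LinearStore linear = new LinearStore();
        HashedStore hashed = new HashedStore();

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            linear.addIfAbsent(Arrays.asList("key" + i));
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            hashed.addIfAbsent(Arrays.asList("key" + i));
        }
        long t2 = System.nanoTime();

        System.out.printf("linear: %d ms, hashed: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

With the linear store the gap widens quadratically as the number of key
values grows, which is consistent with a multi-hundred-megabyte document
taking minutes rather than seconds.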


[NOTE] Interestingly, the same code runs fine, within a minute, with both the
JDK and Xerces implementations when launched from Eclipse or from a plain
"java -classpath ... ParserTest" command line, with no significant JVM hotspot
showing up. This makes me wonder whether Tomcat is doing something internally
during parsing.
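
Since the behaviour differs between Tomcat and a standalone JVM, it may be
worth confirming which parser implementation each environment actually loads.
The following is only a diagnostic sketch using standard JAXP and reflection
calls; ParserImplProbe and describe are hypothetical names.

```java
import java.security.CodeSource;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.validation.SchemaFactory;

// Hypothetical diagnostic helper, for illustration only.
class ParserImplProbe {

    // Reports the concrete class of a JAXP object and where it was loaded
    // from. Classes shipped with the JDK may report no code source at all.
    static String describe(Object jaxpObject) {
        Class<?> type = jaxpObject.getClass();
        CodeSource source = type.getProtectionDomain().getCodeSource();
        String location = (source == null || source.getLocation() == null)
                ? "JDK runtime"
                : source.getLocation().toString();
        return type.getName() + " from " + location;
    }

    public static void main(String[] args) {
        System.out.println(describe(SAXParserFactory.newInstance()));
        System.out.println(describe(
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)));
    }
}
```

Running the same two calls inside the webapp (for example from a servlet) and
from the command line would show whether both pick up the same classes.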

As of now, with the patched binaries, I am able to run it within a minute
inside Tomcat as well. The attached binaries can be used as a drop-in
replacement by anyone facing this problem.

[ATTENTION] On a different note: with the existing Xerces binaries, if the
application attempts to re-process the XML files, even in a different thread,
it severely impacts the other running threads. The entire webapp appears to
freeze randomly and, strangely, parsing takes even longer (close to 2x the
time) even with enough memory allocated. I am not sure whether the issue also
occurs with other application servers such as GlassFish or Jetty, or whether
it is specific to Tomcat.

--Deepak

> Improve performance of XML Schema Identity-constraint validation --- 
> XMLSchemaValidator$ValueStoreBase.contains() is painfully slow.
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: XERCESJ-1276
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1276
>             Project: Xerces2-J
>          Issue Type: Bug
>          Components: XML Schema 1.0 Structures
>    Affects Versions: 2.6.2, 2.9.1
>            Reporter: Kenny MacLeod
>              Labels: gsoc, gsoc2013, mentor
>         Attachments: XMLSchemaValidator.java, 
> Xerces-J-src.2.11.0_patch1276.txt, xerces-binaries-patched-over-2.11.0.zip, 
> xerces-value-store.txt
>
>
> Under certain conditions, the contains() method in 
> XMLSchemaValidator$ValueStoreBase can cripple the performance of parsing and 
> validation.
> I'm not sure what those conditions are, but as a guideline figure I was using 
> JAXB2 to deserialize a 22meg XML file.  Without schema validation, it took 5 
> seconds.  With validation, it took over 3 minutes (JDK 1.5.0_10 on win32). My 
> profiler pointed the finger squarely at that method XMLSchemaValidator.
> Suspicions were aroused further when seeing this comment in the source:
> public boolean contains() {
>             // REVISIT: we can improve performance by using hash codes, instead of
>             // traversing global vector that could be quite large.
> This is present in Xerces 2.6.2 contained with JDK1.5.0_10, and also in the 
> source for 2.9.1.
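
For context, the condition that triggers this code path is a schema that
declares identity constraints (xs:unique, xs:key or xs:keyref). A minimal
sketch using the standard javax.xml.validation API is shown below; the schema,
the element names and the class IdentityConstraintDemo are made up for
illustration and are not taken from the reporter's data.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical example schema and helper, for illustration only.
class IdentityConstraintDemo {

    // Every <item> under <items> must carry a unique @id value.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='items'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='item' maxOccurs='unbounded'>"
      + "<xs:complexType>"
      + "<xs:attribute name='id' type='xs:string' use='required'/>"
      + "</xs:complexType>"
      + "</xs:element>"
      + "</xs:sequence></xs:complexType>"
      + "<xs:unique name='uniqueItemId'>"
      + "<xs:selector xpath='item'/><xs:field xpath='@id'/>"
      + "</xs:unique>"
      + "</xs:element>"
      + "</xs:schema>";

    // Returns true when the document satisfies the schema, including the
    // xs:unique constraint. With no ErrorHandler set, validation errors
    // are thrown as SAXException.
    static boolean isValid(String xml) throws Exception {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<items><item id='a'/><item id='b'/></items>"));
        System.out.println(isValid("<items><item id='a'/><item id='a'/></items>"));
    }
}
```

Each selected item contributes a value that has to be checked against all
values seen so far in the same scope, which is where a linear contains() scan
becomes expensive for large documents.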



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
