[ 
https://issues.apache.org/jira/browse/XERCESJ-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746994#comment-16746994
 ] 

Mukul Gandhi commented on XERCESJ-1705:
---------------------------------------

Thanks for the new test cases. The essential difference I see from your earlier 
example is that your new XML document is much larger (it's about 30 MB; it has 
889056 XML A elements, and <assert> has to run on all of them).

My workstation's capabilities are about the same as yours.

Xerces's XSD 1.1 <assert> implementation performs poorly for certain use 
cases (it needs more memory and more time when input documents are large). I 
guess this is hard to fix. It can be improved to a certain extent by giving 
the JVM more memory with an option like -Xmx. If you can run a Java memory 
profiler on your use case with Xerces XSD 1.1 <assert> and share what you 
find, in particular any clue about where memory leaks occur, we can try to 
fix them.

You may also consider splitting your current XML document into multiple smaller 
XML documents (each with, say, 100-1000 XML A elements). This can be done with 
XSLT, for example. After the split, you can run Xerces XSD 1.1 validation on 
these documents in multiple threads. This may have a greater chance of 
completing successfully.
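
A minimal sketch of such a split, assuming an XSLT 2.0 processor (xsl:for-each-group 
and xsl:result-document are not available in XSLT 1.0) and assuming that the A 
elements are direct children of the document element. The chunk size parameter 
and the chunk-N.xml output names are illustrative, not taken from the 
attachments. Splitting only preserves the validation result if each check 
looks at a single A element in isolation.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Assumed chunk size; the suggestion above is 100-1000 A elements per file. -->
  <xsl:param name="chunk-size" as="xs:integer" select="1000"/>

  <xsl:template match="/*">
    <!-- Remember the document element so each chunk can reuse its name. -->
    <xsl:variable name="root" select="."/>
    <!-- Consecutive A elements get the same key: 0 for the first chunk, 1 for the next. -->
    <xsl:for-each-group select="A"
        group-adjacent="(position() - 1) idiv $chunk-size">
      <!-- Hypothetical output names: chunk-0.xml, chunk-1.xml, and so on. -->
      <xsl:result-document href="chunk-{current-grouping-key()}.xml">
        <xsl:element name="{name($root)}" namespace="{namespace-uri($root)}">
          <xsl:copy-of select="current-group()"/>
        </xsl:element>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

Attributes on the original document element are not copied by this sketch; 
each chunk file can then be validated independently against the same schema.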

Interestingly, I converted your new example into one of my own that uses XSD 
1.1 CTA (<alternative>) to solve the same XSD validation requirement. Please 
see the attachments new_prob_mukul.xml and new_prob_mukul.xsd. I made my XML 
using a simple XSLT transform on your file NEW_PROBLEM.xml.

Instead of your XML A elements like,

<A>
   <B>100</B>
   <C/>
</A>
<A>
   <B>101</B>
   <D/>
</A>

...

my XML has A elements like,

<A val="100">
   <B/>
   <C/>
</A>
<A val="101">
   <B/>
   <D/>
</A>

...
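
The conversion from the first form to the second can be sketched in XSLT 1.0 
as follows. This is not the stylesheet that was actually used (it is not among 
the attachments); it is a plain identity transform with two overrides, and it 
assumes that every A has exactly one B child holding the value.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything that has no more specific rule. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Put the text of B onto a val attribute of the enclosing A. -->
  <xsl:template match="A">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="val">
        <xsl:value-of select="normalize-space(B)"/>
      </xsl:attribute>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Keep B as an empty element, dropping its text content. -->
  <xsl:template match="A/B">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```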

Instead of putting the data inside element B, I'd suggest creating an attribute 
on element A, with which the CTA logic can be written easily. My suggested XSD 
for this is new_prob_mukul.xsd. My XML document new_prob_mukul.xml is the same 
size as yours. The XSD 1.1 CTA validation that I've suggested takes about 3-4 
seconds with Xerces. Please consider using such a modification.
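
For readers without the attachment, a CTA schema of this shape might look 
roughly like the sketch below. It is not a copy of new_prob_mukul.xsd. The 
document element name "root", the type names, and the rule itself (val 100 
requires C, val 101 requires D, anything else is an error) are assumptions 
read off the two sample elements above.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- "root" is a placeholder name for the document element. -->
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="A" type="ABase" maxOccurs="unbounded">
          <!-- The type of each A is selected from its val attribute alone. -->
          <xs:alternative test="@val = '100'" type="AWithC"/>
          <xs:alternative test="@val = '101'" type="AWithD"/>
          <!-- Assumed default: any other val makes the element invalid. -->
          <xs:alternative type="xs:error"/>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- Declared type of A; the alternatives below extend it. -->
  <xs:complexType name="ABase">
    <xs:attribute name="val" type="xs:integer" use="required"/>
  </xs:complexType>

  <xs:complexType name="AWithC">
    <xs:complexContent>
      <xs:extension base="ABase">
        <xs:sequence>
          <xs:element name="B"/>
          <xs:element name="C"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <xs:complexType name="AWithD">
    <xs:complexContent>
      <xs:extension base="ABase">
        <xs:sequence>
          <xs:element name="B"/>
          <xs:element name="D"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

</xs:schema>
```

The test expressions compare val as a string because a CTA test sees the 
attribute before it has been given a schema type. A test can only inspect the 
attributes of the element itself, which is why the value has to move from B 
onto A.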

> Validation against asserts (1.1) is slow and takes up a lot of memory for 
> larger files.
> ---------------------------------------------------------------------------------------
>
>                 Key: XERCESJ-1705
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1705
>             Project: Xerces2-J
>          Issue Type: Bug
>          Components: XML Schema 1.1 Structures
>    Affects Versions: 2.12.0
>            Reporter: Gerben Abbink
>            Priority: Major
>         Attachments: NEW_PROBLEM.xml, NEW_PROBLEM.xsd, PROBLEM.xml, 
> PROBLEM.xsd, SaxonEETester.java, SaxonOutput.txt, XercesOutput.txt, 
> XercesTester.java, new_prob_mukul.xml, new_prob_mukul.xsd
>
>
> The validation of xml against asserts in XMLSchema 1.1 is slow and takes up a 
> lot of memory for larger xml files. I have created a simple test xml file 
> with lots of repetition and a corresponding xml schema to show the problem.
> It takes 20 sec. to validate the xml against the xml schema. When I remove 
> the asserts in the xml schema it takes just 1 second to validate. Testing was 
> done from the command prompt on a modern Windows machine with 8GByte memory.
> To compare, I have also validated the xml file against the xml schema in 
> XMLSpy. With asserts it takes 2 sec., without the asserts 1 sec. (XMLSpy does 
> not use Xerces.)
>  


