Use dom4j. It has built-in XPath support and a SAX filtering mechanism: http://www.dom4j.org. You can plug in the Apache parser to provide support for schema validation.
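As a rough sketch of what the dom4j approach looks like, assuming dom4j (and its XPath engine) is on the classpath -- the class name Dom4jCatalog and the tiny inline document are mine, but the PRODUCT/Title element names come from the original question. For an 80 MB file you would read with dom4j's SAXReader rather than parse a string:

```java
import java.util.List;

import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Node;

public class Dom4jCatalog {

    // Parse an XML string and select all PRODUCT elements via XPath.
    // DocumentHelper.parseText is dom4j's convenience string parser;
    // for large files use: new SAXReader().read(new File(...)).
    public static List selectProducts(String xml) throws Exception {
        Document doc = DocumentHelper.parseText(xml);
        return doc.selectNodes("//PRODUCT");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<CATALOG><PRODUCT><Title>A</Title></PRODUCT>"
                   + "<PRODUCT><Title>B</Title></PRODUCT></CATALOG>";
        List products = selectProducts(xml);
        for (int i = 0; i < products.size(); i++) {
            Node product = (Node) products.get(i);
            // valueOf evaluates a relative XPath and returns its string value.
            System.out.println(product.valueOf("Title"));
        }
    }
}
```
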
Dane Foster
http://www.equitytg.com
954.360.9800

----- Original Message -----
From: "Raghavendra, Karthik" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, February 05, 2002 10:44 PM
Subject: Performance Problems using large XML DOM and XPath

Hi,

I have an XML file containing approximately 14,800 product catalog records; the file size is 80 MB. I am using Xerces 1.4.4 to validate and parse the XML document and create a DOM tree. I am on a Solaris box with 2 GB of RAM, so memory is not an issue. I am using the XPath capabilities of Xalan 2 to traverse the DOM tree and access the required data.

I am noticing that performance degrades over time. For example, initial processing (the first hundred records) returns within 800 milliseconds, while the time for the last set of records (14,000 and above) is about 77 seconds. I am using CachedXPathAPI instead of XPathAPI, since XPathAPI resulted in significantly greater response times. The code snippets are below.

Main method:
-------------
DOMParser parser = new DOMParser();
parser.parse(XMLFile);
Document doc = parser.getDocument();
NodeList list = doc.getElementsByTagName("PRODUCT");
CachedXPathAPI xPath = new CachedXPathAPI();
Node prdNode = null;
for (int index = 0; index < list.getLength(); index++)
{
    prdNode = list.item(index);
    processProduct(prdNode, xPath);
    ...
}

processProduct method (this method uses XPath to get the relevant information; the product node for the current product is passed as a parameter):
-------------
String prdNum = getText(xPath.selectSingleNode(prdNode, "child::ProductNumber"));
String title = getText(xPath.selectSingleNode(prdNode, "child::Title"));
String language = getText(xPath.selectSingleNode(prdNode, "child::Language"));
String publisher = getText(xPath.selectSingleNode(prdNode, "child::Rights/child::Publisher"));
...
getText method (this method concatenates all the text-node values and returns them):
-------------
StringBuffer text = new StringBuffer();
NodeList list = node.getChildNodes();
for (int index = 0; index < list.getLength(); index++)
{
    if (list.item(index).getNodeType() == Node.TEXT_NODE)
        text.append(list.item(index).getNodeValue());
}
return (text.toString());

I executed the main method as shown in the code snippet and observed that the processing time increased gradually. I then reversed the loop in the main method to start from the last product record (14,800) and noticed that the processing time started at 77 seconds. However, it dropped significantly after a few iterations (100 records) to about 25 seconds.

What I do not understand is the reason for this increase in processing time. I have compared the XML records to confirm that there is nothing wrong with the data; the structure and content of all 14,800 records are the same. Am I doing something wrong? Is there an issue with using XPath on large DOMs? Is there an XPath bug? I am passing the product node in the DOM to be traversed, and I would expect the XPath lookup for product #14,000 to cost the same as for product #1. However, the pattern suggests that there is some kind of processing overhead when using XPath on nodes deeper in the DOM.

Any and all help is greatly appreciated.

Thanks in advance,
Karthik

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
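One way to check whether the per-record XPath evaluation is the bottleneck is to replace each selectSingleNode call with a plain scan of the product element's direct children, since all the paths in processProduct are simple child steps. This is a sketch, not the original poster's code: the class name DirectLookup and helper getChildText are mine, and it uses only the JDK's built-in JAXP parser so it can be tried without Xalan:

```java
import java.io.ByteArrayInputStream;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DirectLookup {

    // Returns the concatenated text of the first direct child element
    // with the given tag name -- the same result as
    // getText(xPath.selectSingleNode(prdNode, "child::Name")),
    // but with no XPath machinery involved. Returns null if absent.
    public static String getChildText(Element parent, String name) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && child.getNodeName().equals(name)) {
                StringBuffer text = new StringBuffer();
                NodeList grandChildren = child.getChildNodes();
                for (int j = 0; j < grandChildren.getLength(); j++) {
                    if (grandChildren.item(j).getNodeType() == Node.TEXT_NODE) {
                        text.append(grandChildren.item(j).getNodeValue());
                    }
                }
                return text.toString();
            }
        }
        return null;
    }

    // Convenience parser for the demo below.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<PRODUCT><ProductNumber>42</ProductNumber>"
                + "<Title>Sample</Title></PRODUCT>");
        Element product = doc.getDocumentElement();
        System.out.println(getChildText(product, "Title")); // prints Sample
    }
}
```

If the reversed-loop timing flattens out with this lookup in place of selectSingleNode, the overhead is in the XPath layer rather than in the DOM itself.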
