Hello all

I'm working on a project where I have to parse potentially large
(hundreds of MBs) XML log files. To avoid OutOfMemoryErrors I use an
ElementHandler, as suggested in the dom4j FAQ:

http://dom4j.org/faq.html#large-doc

Concerning memory consumption this approach works fine; regardless of
file size the amount of memory used is very small. However the same
cannot be said about CPU utilization.
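(Heap usage during the parse can be spot-checked with a quick helper like this — just a sketch, nothing dom4j-specific, only Runtime arithmetic:)

```java
public class HeapCheck {

    // Approximate heap currently in use, in megabytes.
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void main(String[] args) {
        // Call this periodically (e.g. from the handler's onEnd) to watch the trend.
        System.out.println("Used heap: " + usedHeapMb() + " MB");
    }
}
```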

Rather than using my own project to illustrate the problem, I've
created a simple test case so you can easily reproduce it. Let's say
you have a very simple document like this:

<?xml version="1.0" encoding="UTF-8"?>
<root>
        <element attribute="testing">
                <name>Testing</name>
                <data realdata="true">This is some real data</data>
        </element>
</root>
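(If you want to reproduce this, a throwaway generator like the following will produce such a file with the element repeated many times; the path and count are arbitrary, of course:)

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class BigXmlGenerator {

    // One copy of the repeated element, exactly as in the sample above.
    static String element() {
        return "\t<element attribute=\"testing\">\n"
                + "\t\t<name>Testing</name>\n"
                + "\t\t<data realdata=\"true\">This is some real data</data>\n"
                + "\t</element>\n";
    }

    public static void main(String[] args) throws IOException {
        // Change the path and the element count to suit your system
        BufferedWriter out = new BufferedWriter(new FileWriter("C:\\tmp\\bigxml.xml"));
        try {
            out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root>\n");
            for (int i = 0; i < 200000; i++) {
                out.write(element());
            }
            out.write("</root>\n");
        } finally {
            out.close();
        }
    }
}
```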

Instead of having just one "element" inside "root", imagine you have
over 200,000 of them, all exactly the same as the one shown above. Now
let's try to parse this XML, using the following code:

package com.mydom4jtest;

import java.io.File;
import java.util.Calendar;

import org.dom4j.Element;
import org.dom4j.ElementHandler;
import org.dom4j.ElementPath;
import org.dom4j.io.SAXReader;

import junit.framework.TestCase;

/**
 * @author Trygve Hardersen
 *
 */
public class XMLParserTest extends TestCase {

        public void testParseXML() throws Exception{
                // Change the path to suit your system
                File xml = new File("C:\\tmp\\bigxml.xml");
                assertTrue(xml.exists());
                assertTrue(xml.isFile());
                assertTrue(xml.canRead());
                
                SAXReader reader = new SAXReader(false);
                reader.addHandler("/root/element", new MyElementHandler());
                reader.read(xml);
        }
        
        
        class MyElementHandler implements ElementHandler{
                
                private int parsed;
                
                private int tparsed;
                
                public MyElementHandler(){
                        parsed = 0;
                        tparsed = 0;
                }

                public void onEnd(ElementPath path) {
                        Element element = path.getCurrent();
                        parsed++;
                        tparsed++;
                        if(parsed >= 5000){
                                System.out.println(Calendar.getInstance().getTime()
                                        + ": Parsed " + parsed + " elements, totally parsed "
                                        + tparsed + " elements");
                                parsed = 0;
                        }
                        // Detach the element so it can be garbage collected
                        element.detach();
                }

                public void onStart(ElementPath path) {}
        }
}

For every 5,000 elements parsed, a line is printed to System.out
showing the current time and the number of elements parsed so far. On
my system I get the following output:

Tue Aug 22 15:02:02 CEST 2006: Parsed 5000 elements, totally parsed 5000 elements
Tue Aug 22 15:02:03 CEST 2006: Parsed 5000 elements, totally parsed 10000 elements
Tue Aug 22 15:02:04 CEST 2006: Parsed 5000 elements, totally parsed 15000 elements
Tue Aug 22 15:02:06 CEST 2006: Parsed 5000 elements, totally parsed 20000 elements
Tue Aug 22 15:02:08 CEST 2006: Parsed 5000 elements, totally parsed 25000 elements
Tue Aug 22 15:02:11 CEST 2006: Parsed 5000 elements, totally parsed 30000 elements
Tue Aug 22 15:02:15 CEST 2006: Parsed 5000 elements, totally parsed 35000 elements
Tue Aug 22 15:02:19 CEST 2006: Parsed 5000 elements, totally parsed 40000 elements

As you can see, the time needed to parse 5,000 elements steadily
increases from about 1 second to 4 seconds, and after a while I get:

Tue Aug 22 15:11:29 CEST 2006: Parsed 5000 elements, totally parsed 235000 elements
Tue Aug 22 15:11:57 CEST 2006: Parsed 5000 elements, totally parsed 240000 elements
Tue Aug 22 15:12:24 CEST 2006: Parsed 5000 elements, totally parsed 245000 elements
Tue Aug 22 15:12:51 CEST 2006: Parsed 5000 elements, totally parsed 250000 elements
Tue Aug 22 15:13:17 CEST 2006: Parsed 5000 elements, totally parsed 255000 elements
Tue Aug 22 15:13:46 CEST 2006: Parsed 5000 elements, totally parsed 260000 elements

As you can see, the time per 5,000 elements has now grown to nearly 30
seconds. In my testing I've never seen the time stabilize, and I'm
concerned about what will happen with really large documents.
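(In case it helps anyone reproduce my measurements: the trend is easier to see if the handler logs elapsed milliseconds per batch instead of wall-clock timestamps. A small helper like this would do it — plain Java, no dom4j dependency, and the names are mine:)

```java
import java.util.ArrayList;
import java.util.List;

// Records how long each batch of N parsed elements takes, in milliseconds.
public class BatchTimer {

    private final int batchSize;
    private long batchStart;
    private int count;
    private final List<Long> batchMillis = new ArrayList<Long>();

    public BatchTimer(int batchSize) {
        this.batchSize = batchSize;
        this.batchStart = System.currentTimeMillis();
    }

    // Call once per parsed element (e.g. from onEnd); prints when a batch completes.
    public void tick() {
        count++;
        if (count % batchSize == 0) {
            long now = System.currentTimeMillis();
            batchMillis.add(Long.valueOf(now - batchStart));
            System.out.println("Batch of " + batchSize + " took " + (now - batchStart) + " ms");
            batchStart = now;
        }
    }

    public List<Long> batches() {
        return batchMillis;
    }
}
```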

Does anyone have an explanation for this, or perhaps a solution? Your
help is very much appreciated!

Thanks in advance!

Trygve Hardersen

_______________________________________________
dom4j-user mailing list
dom4j-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dom4j-user
