Hello all,

I'm working on a project where I have to parse potentially large (hundreds of MB) XML log files. To avoid OutOfMemoryErrors I use an ElementHandler, as suggested in the dom4j FAQ:
http://dom4j.org/faq.html#large-doc

Concerning memory consumption this approach works fine; regardless of file size, the amount of memory used is very small. However, the same cannot be said about CPU utilization. Instead of using my own project to illustrate the problem, I've created a simple test case so you can easily reproduce it.

Let's say you have a very simple document like this:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <element attribute="testing">
    <name>Testing</name>
    <data realdata="true">This is some real data</data>
  </element>
</root>

Instead of having just one "element" inside "root", you have over 200,000, all exactly the same as the one shown above. Now let's try to parse this XML, using the following code:

package com.mydom4jtest;

import java.io.File;
import java.util.Calendar;

import org.dom4j.Element;
import org.dom4j.ElementHandler;
import org.dom4j.ElementPath;
import org.dom4j.io.SAXReader;

import junit.framework.TestCase;

/**
 * @author Trygve Hardersen
 *
 */
public class XMLParserTest extends TestCase {

    public void testParseXML() throws Exception {
        // Change the path to suit your system
        File xml = new File("C:\\tmp\\bigxml.xml");
        assertTrue(xml.exists());
        assertTrue(xml.isFile());
        assertTrue(xml.canRead());

        // Non-validating reader; the handler fires for each /root/element
        SAXReader read = new SAXReader(false);
        read.addHandler("/root/element", new MyElementHandler());
        read.read(xml);
    }

    class MyElementHandler implements ElementHandler {

        private int parsed;  // elements seen since the last progress line
        private int tparsed; // elements seen in total

        public MyElementHandler() {
            parsed = 0;
            tparsed = 0;
        }

        public void onEnd(ElementPath path) {
            Element element = path.getCurrent();
            parsed++;
            tparsed++;
            if (parsed >= 5000) {
                System.out.println(Calendar.getInstance().getTime()
                        + ": Parsed " + parsed + " elements, totally parsed "
                        + tparsed + " elements");
                parsed = 0;
            }
            // Remove the processed element from its parent to keep memory use low
            element.detach();
        }

        public void onStart(ElementPath path) {}
    }
}

For every 5,000th element in the file a line is printed to System.out, showing the current time and the number of elements parsed.

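For anyone who wants to reproduce this without writing the file by hand, the test document described above could be generated with something like the following sketch. The class name BigXmlGenerator, the generate method, and the default count of 260000 are only illustrative assumptions (the description says "over 200,000" elements), and the indentation of the repeated element is a guess at the layout of the real file.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original test case.
public class BigXmlGenerator {

    // One repeated record, copied from the sample document.
    // The whitespace layout is an assumption.
    static final String RECORD =
            "  <element attribute=\"testing\">\n"
          + "    <name>Testing</name>\n"
          + "    <data realdata=\"true\">This is some real data</data>\n"
          + "  </element>\n";

    // Writes a <root> document containing `count` identical <element> children.
    public static void generate(Path target, int count) throws IOException {
        try (BufferedWriter out =
                Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            out.write("<root>\n");
            for (int i = 0; i < count; i++) {
                out.write(RECORD);
            }
            out.write("</root>\n");
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults are assumptions; pass a path and a count to override them.
        Path target = Paths.get(args.length > 0 ? args[0] : "bigxml.xml");
        int count = args.length > 1 ? Integer.parseInt(args[1]) : 260000;
        generate(target, count);
        System.out.println("Wrote " + count + " elements to "
                + target.toAbsolutePath());
    }
}
```

Running it as "java BigXmlGenerator C:\tmp\bigxml.xml 260000" should write the file to the path that the parser test expects.
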
On my system the parser test produces the following output:

Tue Aug 22 15:02:02 CEST 2006: Parsed 5000 elements, totally parsed 5000 elements
Tue Aug 22 15:02:03 CEST 2006: Parsed 5000 elements, totally parsed 10000 elements
Tue Aug 22 15:02:04 CEST 2006: Parsed 5000 elements, totally parsed 15000 elements
Tue Aug 22 15:02:06 CEST 2006: Parsed 5000 elements, totally parsed 20000 elements
Tue Aug 22 15:02:08 CEST 2006: Parsed 5000 elements, totally parsed 25000 elements
Tue Aug 22 15:02:11 CEST 2006: Parsed 5000 elements, totally parsed 30000 elements
Tue Aug 22 15:02:15 CEST 2006: Parsed 5000 elements, totally parsed 35000 elements
Tue Aug 22 15:02:19 CEST 2006: Parsed 5000 elements, totally parsed 40000 elements

As you can see, the time needed to parse 5,000 elements steadily increases from about 1 second to 4 seconds, and after a while I get:

Tue Aug 22 15:11:29 CEST 2006: Parsed 5000 elements, totally parsed 235000 elements
Tue Aug 22 15:11:57 CEST 2006: Parsed 5000 elements, totally parsed 240000 elements
Tue Aug 22 15:12:24 CEST 2006: Parsed 5000 elements, totally parsed 245000 elements
Tue Aug 22 15:12:51 CEST 2006: Parsed 5000 elements, totally parsed 250000 elements
Tue Aug 22 15:13:17 CEST 2006: Parsed 5000 elements, totally parsed 255000 elements
Tue Aug 22 15:13:46 CEST 2006: Parsed 5000 elements, totally parsed 260000 elements

By this point each batch of 5,000 elements takes roughly 26 to 29 seconds. In my testing I have not seen the time stabilize, and I'm concerned about what will happen on really large documents.

Does anyone have an explanation for this? Or maybe a solution?

Your help is very much appreciated! Thanks in advance!

Trygve Hardersen
_______________________________________________
dom4j-user mailing list
dom4j-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dom4j-user