Folks,

I'm baffled.

I've been an avid user of dom4j for a while, and have used the strategy shown below ever since to process/transform huge XML files without consuming much memory. However, this new piece of code gets progressively slower the further it gets into the document, and I can't see what I'm missing.

Here are the timings: the left column is the cumulative number of elements processed, the right column is the number of seconds taken for the most recent batch of 10,000. Note that I eventually need to process documents that have over 5 million of these elements:

10000: 3
20000: 9
30000: 15
40000: 21
50000: 27
60000: 33
70000: 39
80000: 45
90000: 56
100000: 87
110000: 158




// Create a reader and add an Element handler to efficiently iterate
// through the large document
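// (writer, colID, delimiter, count, start and end are fields defined
// elsewhere in the enclosing class; only the reading logic is shown here)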
SAXReader reader = new SAXReader();
reader.addHandler("/n-extract-response/guid-info",
     new ElementHandler() {

        public void onStart(ElementPath path) {
           // do nothing
        }

        public void onEnd(ElementPath path) {
           // Get the guid of the document
           Element guidInfoElement = path.getCurrent();
           String guid = guidInfoElement.valueOf("guid");

           // Get the document size, and possibly metadata size
           int totalSize = 0, docSize = 0, metaSize = 0;
           try {
              docSize = Integer.parseInt(guidInfoElement
                    .valueOf("size"));
              metaSize = Integer.parseInt(guidInfoElement
                    .valueOf("metadatasize"));
           } catch (NumberFormatException nfe) {
              // do nothing
           }

           // print as line
           totalSize = docSize + metaSize;
           writer.println(colID + delimiter + guid + delimiter
                 + totalSize);

           // for debugging purposes, track how long it takes per
           // 10000 guid-info elements
           count++;
           if (count % 10000 == 0) {
              end = System.currentTimeMillis();
              System.out.println(count + ": "
                    + ((end - start) / 1000));
              start = System.currentTimeMillis();
           }

           // make sure to detach to save memory
           guidInfoElement.detach();
        }
     });

// Set the start time, and begin reading
start = System.currentTimeMillis();
reader.read(nxoGuidsFile);
writer.close();
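
As an aside, the only extra instrumentation I can think of adding is a rough heap-usage printout next to the per-batch timing line inside onEnd(), to confirm whether memory really does stay flat while elements are detached. Something along these lines (sketch only, not part of the code above):

// rough heap-usage check, to sit next to the timing println in onEnd()
Runtime rt = Runtime.getRuntime();
long usedBytes = rt.totalMemory() - rt.freeMemory();
System.out.println(count + ": used heap ~" + (usedBytes / (1024 * 1024)) + " MB");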



You could argue that the string concatenation and/or integer parsing is taking up time, but that work is constant per element, so it doesn't explain the gradual increase in the timings.
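To illustrate that point, here is a throwaway standalone sketch (made-up placeholder values instead of the real guid/size strings) that repeats only the per-element work from the handler -- two Integer.parseInt calls plus the string concatenation -- for 5 million iterations. The work is identical on every iteration, so I wouldn't expect it to show any growth in the per-batch times:

public class PerElementCostSketch {
   public static void main(String[] args) {
      long start = System.currentTimeMillis();
      long checksum = 0;
      for (int count = 1; count <= 5000000; count++) {
         // same shape of work as in the handler, with fixed inputs
         int docSize = Integer.parseInt("12345");
         int metaSize = Integer.parseInt("678");
         int totalSize = docSize + metaSize;
         String line = "col1" + "|" + "some-guid" + "|" + totalSize;
         checksum += line.length(); // keep the result live

         // print the time taken per 10000 iterations, in milliseconds
         if (count % 10000 == 0) {
            long end = System.currentTimeMillis();
            System.out.println(count + ": " + (end - start) + " ms");
            start = System.currentTimeMillis();
         }
      }
      System.out.println("checksum: " + checksum);
   }
}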

I've tried compiling and running under both Java 1.4 and 1.5 with various compiler settings, but to no avail.

Help!

Thanks
-Wali
