Thanks Evan. Conceptually, it can be done with SAX. I'm actually doing a rewrite with StAX. This was just a model I got pretty comfortable with.
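
For anyone following along, here is a rough sketch of what a StAX version of the same extraction might look like. It is only an illustration under a few assumptions: the class and method names (GuidInfoStax, extract, parseSize) are made up for this example, the guid, size and metadatasize elements are assumed to contain plain text only, and elements are matched by name alone rather than by their full path. It uses only the javax.xml.stream API that ships with the JDK, so no DOM tree is built and nothing needs to be detached.

```java
import java.io.PrintWriter;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name; not part of the original code.
class GuidInfoStax {

    // Missing or malformed sizes count as 0, mirroring the empty
    // catch block in the original handler.
    static int parseSize(String text) {
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException nfe) {
            return 0;
        }
    }

    // Streams over the document and writes one line per guid-info element.
    // Returns the number of guid-info elements seen.
    static int extract(XMLStreamReader xml, PrintWriter writer,
                       String colID, String delimiter) throws XMLStreamException {
        int count = 0;
        String guid = "";
        int docSize = 0, metaSize = 0;
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = xml.getLocalName();
                if (name.equals("guid-info")) {
                    // Reset per-record state at the start of each record.
                    guid = "";
                    docSize = 0;
                    metaSize = 0;
                } else if (name.equals("guid")) {
                    guid = xml.getElementText();
                } else if (name.equals("size")) {
                    docSize = parseSize(xml.getElementText());
                } else if (name.equals("metadatasize")) {
                    metaSize = parseSize(xml.getElementText());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && xml.getLocalName().equals("guid-info")) {
                writer.println(colID + delimiter + guid + delimiter
                        + (docSize + metaSize));
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Tiny in-memory sample; a real run would open the large file instead.
        String sample = "<n-extract-response>"
                + "<guid-info><guid>abc</guid><size>10</size>"
                + "<metadatasize>5</metadatasize></guid-info>"
                + "</n-extract-response>";
        XMLStreamReader xml = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(sample));
        PrintWriter writer = new PrintWriter(System.out, true);
        int count = extract(xml, writer, "C1", "|");
        writer.flush();
        System.err.println(count + " guid-info elements processed");
    }
}
```

For a real file, the reader would presumably be created from a FileInputStream rather than a StringReader. The per-10000 timing check from the original handler could go right after count++.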

I actually did a little tweaking and realized it's something within the detach() method that's causing this gradual performance hit. The memory was growing, but not as much as it would without the detach() invocation. I'm guessing references to certain objects don't get cleaned up even after detaching a Node from a Document.

And btw, it was a significant hit. Those numbers weren't milliseconds, but seconds!!

Thanks
-Wali



----Original Message Follows----
From: Evan Kirkconnell <[EMAIL PROTECTED]>
To: dom4j-user@lists.sourceforge.net
Subject: Re: [dom4j-user] Huge file, ElementHandler, and performance woes
Date: Mon, 12 Feb 2007 15:23:19 -0600

Oh... Can your code work with just SAX? If you don't need the Document
object, I'd think it'd be much faster to just use SAX directly.
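
To make the plain-SAX suggestion concrete, here is a minimal sketch of a handler that could produce the same output lines without building a Document. The class name GuidInfoSaxHandler and its fields are invented for this illustration, and it assumes a non-namespace-aware parser (the SAXParserFactory default), so elements are matched on their qualified name only and not on their full path.

```java
import java.io.PrintWriter;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; not part of the original code.
class GuidInfoSaxHandler extends DefaultHandler {
    private final PrintWriter writer;
    private final String colID;
    private final String delimiter;
    // Collects character data of the element currently being read.
    private final StringBuilder text = new StringBuilder();
    private String guid = "";
    private int docSize;
    private int metaSize;
    int count;

    GuidInfoSaxHandler(PrintWriter writer, String colID, String delimiter) {
        this.writer = writer;
        this.colID = colID;
        this.delimiter = delimiter;
    }

    // Missing or malformed sizes count as 0, as in the original handler.
    private static int parseSize(String value) {
        try {
            return Integer.parseInt(value);
        } catch (NumberFormatException nfe) {
            return 0;
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        text.setLength(0);
        if (qName.equals("guid-info")) {
            guid = "";
            docSize = 0;
            metaSize = 0;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (qName.equals("guid")) {
            guid = value;
        } else if (qName.equals("size")) {
            docSize = parseSize(value);
        } else if (qName.equals("metadatasize")) {
            metaSize = parseSize(value);
        } else if (qName.equals("guid-info")) {
            writer.println(colID + delimiter + guid + delimiter
                    + (docSize + metaSize));
            count++;
        }
        text.setLength(0);
    }

    public static void main(String[] args) throws Exception {
        // Tiny in-memory sample standing in for the large input file.
        String sample = "<n-extract-response><guid-info><guid>abc</guid>"
                + "<size>10</size></guid-info></n-extract-response>";
        PrintWriter out = new PrintWriter(System.out, true);
        GuidInfoSaxHandler handler = new GuidInfoSaxHandler(out, "C1", "|");
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(sample)), handler);
        out.flush();
    }
}
```

Since the handler keeps only the current record's fields, its memory use should stay flat regardless of how many guid-info elements the file contains.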

Wali Ansary wrote:
> I've actually tweaked it to the extent that I left the onEnd() method
> totally empty, except for the detach() method. As far as I understand,
> the detach() method is preventing the 'detached' element from being a
> part of the resulting Document object. That's why this code consumes
> hardly any memory (I didn't have to change the default HotSpot
> settings to process a 676 MB file).
>
> I will rerun the code with your suggestions and follow up.
>
> Thanks
> -Wali
>
>
>
> ----Original Message Follows----
> From: Evan Kirkconnell <[EMAIL PROTECTED]>
> To: dom4j-user@lists.sourceforge.net
> Subject: Re: [dom4j-user] Huge file, ElementHandler, and performance woes
> Date: Mon, 12 Feb 2007 08:50:13 -0600
>
> It'd be good to make sure the slowness isn't due to memory issues and
> garbage collection. Here's some code that I've used in some speed
> tests (got it off the Java forum or Google Groups, I think). I'd recommend
> stopping your timer after each 10000, running the gc, showing the used
> memory, Thread.sleep() for a bit, then starting the timer, and moving on
> to the next series. Might also want to play around with how much memory
> is allocated to the VM, and see if it shifts your numbers. Seems like it
> was 6 milliseconds for a while, then started to jump up. Wasn't a smooth
> curve, which makes me suspicious.
>
> Also, have you tried tweaking it at all? I'd recommend trying some stuff
> like taking the int declarations out of the loop, doing writer.println
> for each string separately instead of appending them, maybe removing the
> .detach() (I don't really know much about what that means in SAX though).
>
> // Shared Runtime handle used by the helpers below.
> private static final Runtime s_runtime = Runtime.getRuntime();
>
> private static void runGC() throws Exception {
>     // It helps to call Runtime.gc()
>     // using several method calls:
>     for (int r = 0; r < 4; ++r) _runGC();
> }
>
> private static void _runGC() throws Exception {
>     long usedMem1 = usedMemory(), usedMem2 = Long.MAX_VALUE;
>     // Repeat until used memory stops shrinking (or after 500 passes).
>     for (int i = 0; (usedMem1 < usedMem2) && (i < 500); ++i) {
>         s_runtime.runFinalization();
>         s_runtime.gc();
>         Thread.yield();
>
>         usedMem2 = usedMem1;
>         usedMem1 = usedMemory();
>     }
> }
>
> private static void showUsedMemory() {
>     long l = usedMemory();
>     System.out.println("Used memory: " + l);
> }
>
> // Heap currently in use, in bytes.
> private static long usedMemory() {
>     return s_runtime.totalMemory() - s_runtime.freeMemory();
> }
>
> Wali Ansary wrote:
> > Folks,
> >
> > I'm baffled.
> >
> > I've been an avid user of dom4j for a while, and have used the
> > below-mentioned strategy successfully ever since to process/transform
> > huge XML files without consuming much memory. However, this new code I
> > have appears to be getting gradually slower. I'm not sure if I'm
> > missing anything.
> >
> > Here are the timings per 10000 elements processed. Note that I want to
> > process documents that have over 5 million of these elements:
> >
> > 10000: 3
> > 20000: 9
> > 30000: 15
> > 40000: 21
> > 50000: 27
> > 60000: 33
> > 70000: 39
> > 80000: 45
> > 90000: 56
> > 100000: 87
> > 110000: 158
> >
> >
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >
> > // Create a reader and add an Element handler to efficiently iterate
> > // through the large document
> > SAXReader reader = new SAXReader();
> > reader.addHandler("/n-extract-response/guid-info",
> >     new ElementHandler() {
> >
> >         public void onStart(ElementPath path) {
> >             // do nothing
> >         }
> >
> >         public void onEnd(ElementPath path) {
> >             // Get the guid of the document
> >             Element guidInfoElement = path.getCurrent();
> >             String guid = guidInfoElement.valueOf("guid");
> >
> >             // Get the document size, and possibly metadata size
> >             int totalSize = 0, docSize = 0, metaSize = 0;
> >             try {
> >                 docSize = Integer.parseInt(guidInfoElement
> >                     .valueOf("size"));
> >                 metaSize = Integer.parseInt(guidInfoElement
> >                     .valueOf("metadatasize"));
> >             } catch (NumberFormatException nfe) {
> >                 // do nothing
> >             }
> >
> >             // print as line
> >             totalSize = docSize + metaSize;
> >             writer.println(colID + delimiter + guid + delimiter
> >                 + totalSize);
> >
> >             // for debugging purposes, track how long it takes per
> >             // 10000 guid-info elements
> >             count++;
> >             if (count % 10000 == 0) {
> >                 end = System.currentTimeMillis();
> >                 System.out.println(count + ": "
> >                     + ((end - start) / 1000));
> >                 start = System.currentTimeMillis();
> >             }
> >
> >             // make sure to detach to save memory
> >             guidInfoElement.detach();
> >         }
> >     });
> >
> > // Set the start time, and begin reading
> > start = System.currentTimeMillis();
> > reader.read(nxoGuidsFile);
> > writer.close();
> >
> >>>>>>>>>>>>>>>>>>>
> >
> > You can argue that the string-concatenation and/or Integer parsing is
> > taking up time, but it doesn't explain the gradual increase in the
> > timings.
> >
> > I've tried compiling and running in both Java 1.4, 1.5 with various
> > compilation settings, but to no avail.
> >
> > Help!
> >
> > Thanks
> > -Wali
> >
> > _________________________________________________________________
> > MSN Hotmail is evolving – check out the new Windows Live Mail
> > http://ideas.live.com
> >
> >
> >
> > -------------------------------------------------------------------------
> > Using Tomcat but need to do more? Need to support web services, security?
> > Get stuff done quickly with pre-integrated technology to make your job easier.
> > Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
> > http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
> > _______________________________________________
> > dom4j-user mailing list
> > dom4j-user@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/dom4j-user
> >




