Joseph,

This is very helpful, and I really appreciate the time you took to answer my question.
We are using Xerces, which on looking further may not be our biggest culprit. We do a lot of XPath queries against our documents, and we recently switched from the static Xalan XPathAPI calls to a persistent instance of CachedXPathAPI. I haven't fully narrowed it down yet, but this object appears to be chewing up a lot of memory -- about 500K per call, where each call creates a Xerces DOM plus the CachedXPathAPI. Do you think that's possible? If so, I can just go back to the static XPathAPI calls.

Thanks,
Cory

-----Original Message-----
From: Joseph Kesselman [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, October 01, 2002 1:55 PM
To: [EMAIL PROTECTED]
Subject: RE: Using DTMDocumentImpl

>If you think the DTM would be far more efficient

Depends, of course, on which DOM implementation you're comparing it to. The "shoehorned" DTMDocumentImpl was intended to pack a node's core data (not counting strings) into just four integers (plus some amortized overhead). That's certainly more compact than most straightforward DOM implementations, which generally use an object per node -- even a completely empty Object consumes several times that space, last I checked, and then you have to add the member fields.

On the other hand, there are definite downsides. Part of that compression is achieved by de-optimizing certain operations -- there's no link to the previous sibling; to find it, we look at the parent's first child and then scan next-siblings until just before our starting node. The tests and mask-and-shift operations needed to extract the bitfields from those integers also consume some cycles. We tried to avoid de-optimizing the operations most important to XSLT, but others are on their own.

As I say, I have no idea what the current status of DTMDocumentImpl is; I don't think we've actually tried running it in a Very Long Time. Getting it running at all may be the first step...
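[The previous-sibling scan described above can be sketched as follows. This is a toy illustration with invented class and accessor names, not the actual DTMDocumentImpl code; it only shows the shape of the trade-off: no stored back-link, so finding a previous sibling costs a walk from the parent's first child.]

```java
// Sketch of the previous-sibling scan: with no stored back-link, a node's
// previous sibling is found by starting at the parent's first child and
// walking next-sibling links. All names here are hypothetical.
final class SketchNode {
    SketchNode parent, firstChild, nextSibling;
    final String name;

    SketchNode(String name) { this.name = name; }

    SketchNode appendChild(SketchNode child) {
        child.parent = this;
        if (firstChild == null) {
            firstChild = child;
        } else {
            SketchNode last = firstChild;
            while (last.nextSibling != null) last = last.nextSibling;
            last.nextSibling = child;
        }
        return child;
    }

    // O(number of preceding siblings) rather than O(1) with a stored link.
    SketchNode getPreviousSibling() {
        if (parent == null || parent.firstChild == this) return null;
        SketchNode scan = parent.firstChild;
        while (scan.nextSibling != this) scan = scan.nextSibling;
        return scan;
    }
}
```

The saving is one reference (or one integer column) per node; the cost is that every previous-sibling query re-scans the sibling chain.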
Alternatively, there's the current DTM code -- DTMDefaultBase and the SAX2DTM and DOM2DTM classes derived from it. This isn't as compact, but on the other hand it isn't as slow. Rather than a single table of four-integer chunks with subfields extracted via shift-and-mask, it uses a separate table for each "column" of data... and it adds a few columns, such as previous-sibling. DTMDefaultBase also contains a lot of support specifically for Xalan's needs. I'd call this "more efficient" rather than "far more efficient" -- probably a factor of 2 rather than a factor of 3-4. (Note that this refers only to node size; as mentioned earlier, strings aren't compacted... but we do try to share single instances when a string is used repeatedly, and our FastStringBuffer is used to avoid the overhead of an object per string.) DTMDefaultBase will probably handle larger documents than DTMDocumentImpl, if that matters to you... at least, it will do so when teamed up with a DTMManager which understands the overflow-addressing scheme, such as DTMManagerDefault.

NOTE: DTM has been biased toward the XPath view of the document rather than the DOM view. The current DTM in particular tends to elide details which XPath doesn't care about. If you need something that captures all the details of a DOM -- such as Entity Reference Nodes, the Document Type tree, or exactly how text and <![CDATA[]]> have been mixed within a single element -- DTM as it stands will probably not meet your needs.

Similarly, DTM is really designed to be an immutable model. As noted above, changing a single string value is probably possible but may have to account for some interesting interactions. Changing the structure is not something either DTMDefaultBase or DTMDocumentImpl is currently able to handle, though there's a minor step in that direction in the RTF pruning code. DOM2DTM2 hopes to be more flexible in that regard... between stylesheet/XPath passes, not during them.

Experience with DTM has been mixed.
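[The column-per-field layout described above can be sketched as parallel int arrays, one per structural field, indexed by node number. This is a toy illustration -- the class name, field set, and NULL_NODE sentinel are invented, not the actual DTMDefaultBase layout -- but it shows how adding a previous-sibling column trades a little space for O(1) lookups.]

```java
// Toy sketch of a column-per-field node table: each node is an int index,
// and each structural field lives in its own parallel int array.
// Names and the NULL_NODE sentinel are illustrative, not Xalan's code.
final class ColumnTable {
    static final int NULL_NODE = -1;
    private int size = 0;
    private int[] parent = new int[16];
    private int[] firstChild = new int[16];
    private int[] nextSibling = new int[16];
    private int[] previousSibling = new int[16]; // the extra column

    int addNode(int parentIndex) {
        int n = size++;
        ensureCapacity(size);
        parent[n] = parentIndex;
        firstChild[n] = NULL_NODE;
        nextSibling[n] = NULL_NODE;
        previousSibling[n] = NULL_NODE;
        if (parentIndex != NULL_NODE) {
            if (firstChild[parentIndex] == NULL_NODE) {
                firstChild[parentIndex] = n;
            } else {
                int last = firstChild[parentIndex];
                while (nextSibling[last] != NULL_NODE) last = nextSibling[last];
                nextSibling[last] = n;
                previousSibling[n] = last; // stored once, O(1) to read later
            }
        }
        return n;
    }

    int getPreviousSibling(int node) { return previousSibling[node]; }

    private void ensureCapacity(int needed) {
        if (needed <= parent.length) return;
        int cap = parent.length * 2;
        parent = java.util.Arrays.copyOf(parent, cap);
        firstChild = java.util.Arrays.copyOf(firstChild, cap);
        nextSibling = java.util.Arrays.copyOf(nextSibling, cap);
        previousSibling = java.util.Arrays.copyOf(previousSibling, cap);
    }
}
```

Compared with packing four fields into bit-ranges of shared integers, separate columns cost more memory per node but need no mask-and-shift on every access.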
On the one hand, it is a more compact model. On the other hand, you may give up a lot of the power of your compiler and debugger to help you analyse your application: you can no longer just expand an object to see what a node contains, and you can't count on datatypes to help you distinguish between DTM Handles (the integers the application uses), DTM IDs (the integers DTM uses internally), and other integers. In the "do as I say, not as I did" department, I would strongly recommend you adopt a naming convention to keep those value types from getting tangled.

I know, that's a lot of "it depends" -- but that's the best answer I can give you; the choice of data structure really does depend on what your needs are. Hope it helps, anyway.

Good luck...

______________________________________
Joe Kesselman / IBM Research
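[The naming convention recommended above might look like the sketch below. Everything here is invented for illustration -- the prefixes, the bit split, and the method names are not Xalan's actual scheme -- but it shows the idea: both values are plain ints the compiler cannot tell apart, so consistent names are the only guard against mixing them up.]

```java
// Both a handle and an internal identity are bare ints; the compiler
// can't catch a mix-up, so the variable names carry the type information.
// The dtmHandle/nodeIdentity prefixes and the 16-bit split are invented.
final class HandleNaming {
    static final int ID_BITS = 16; // illustrative split, not Xalan's actual layout

    // Pack a DTM index and a node identity into one application-visible handle.
    static int makeHandle(int dtmIndex, int nodeIdentity) {
        return (dtmIndex << ID_BITS) | nodeIdentity;
    }

    // Names say which kind of int goes in and which comes out.
    static int nodeIdentityOf(int dtmHandle) { return dtmHandle & ((1 << ID_BITS) - 1); }
    static int dtmIndexOf(int dtmHandle)     { return dtmHandle >>> ID_BITS; }
}
```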
