Joseph,

This is very helpful, and I really appreciate the time you took to answer my question.
We are using Xerces, which on looking further may not be our biggest culprit. We do a lot of XPath queries against our documents, and we recently switched from the static Xalan XPathAPI calls to a persistent instance of CachedXPathAPI. I haven't fully narrowed it down yet, but this object appears to be chewing up a lot of memory -- about 500K per call, where each call creates a Xerces DOM plus the CachedXPathAPI. Do you think that's possible? If so, I can just go back to the static XPathAPI calls.

Thanks,
Cory

-----Original Message-----
From: Joseph Kesselman [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, October 01, 2002 1:55 PM
To: [EMAIL PROTECTED]
Subject: RE: Using DTMDocumentImpl

>If you think the DTM would be far more efficient

Depends, of course, on which DOM implementation you're comparing it to. The "shoehorned" DTMDocumentImpl was intended to pack a node's core data (not counting strings) into just four integers (plus some amortized overhead). That's certainly more compact than most straightforward DOM implementations, which generally use an object per node -- even a completely empty Object consumes several times that space, last I checked, and then you have to add the member fields.

On the other hand, there are definite downsides. Part of that compression is achieved by de-optimizing certain operations -- there's no link to the previous sibling; to find it, we look at the parent's first child and then scan next-siblings until just before our starting node. The tests and mask-and-shift operations needed to extract the bitfields from those integers also consume some cycles. We tried to avoid de-optimizing the operations most important to XSLT, but others are on their own.

As I say, I have no idea what the current status of DTMDocumentImpl is; I don't think we've actually tried running it in a Very Long Time. Getting it running at all may be the first step...
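[The previous-sibling scan described above can be sketched as follows. This is a toy illustration with invented class and accessor names, not the actual DTMDocumentImpl code; it only shows the shape of the trade-off: no stored back-link, so finding a previous sibling costs a walk from the parent's first child.]

```java
// Sketch of the previous-sibling scan: with no stored back-link, a node's
// previous sibling is found by starting at the parent's first child and
// walking next-sibling links. All names here are hypothetical.
final class SketchNode {
    SketchNode parent, firstChild, nextSibling;
    final String name;

    SketchNode(String name) { this.name = name; }

    SketchNode appendChild(SketchNode child) {
        child.parent = this;
        if (firstChild == null) {
            firstChild = child;
        } else {
            SketchNode last = firstChild;
            while (last.nextSibling != null) last = last.nextSibling;
            last.nextSibling = child;
        }
        return child;
    }

    // O(number of preceding siblings) rather than O(1) with a stored link.
    SketchNode getPreviousSibling() {
        if (parent == null || parent.firstChild == this) return null;
        SketchNode scan = parent.firstChild;
        while (scan.nextSibling != this) scan = scan.nextSibling;
        return scan;
    }
}
```

The saving is one reference (or one integer column) per node; the cost is that every previous-sibling query re-scans the sibling chain.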
Alternatively, there's the current DTM code -- DTMDefaultBase and the SAX2DTM and DOM2DTM classes derived from it. This isn't as compact, but on the other hand it isn't as slow. Rather than a single table of four-integer chunks with subfields extracted via shift-and-mask, it uses a separate table for each "column" of data... and it adds a few columns, such as previous-sibling. DTMDefaultBase also contains a lot of support specifically for Xalan's needs. I'd call this "more efficient" rather than "far more efficient" -- probably a factor of 2 rather than a factor of 3-4. (Note that this refers only to node size; as mentioned earlier, strings aren't compacted... but we do try to share single instances when a string is used repeatedly, and our FastStringBuffer is used to avoid the overhead of an object per string.) DTMDefaultBase will probably handle larger documents than DTMDocumentImpl, if that matters to you... at least, it will do so when teamed up with a DTMManager which understands the overflow-addressing scheme, such as DTMManagerDefault.

NOTE: DTM has been biased toward the XPath view of the document rather than the DOM view. The current DTM in particular tends to elide details which XPath doesn't care about. If you need something that captures all the details of a DOM -- such as Entity Reference Nodes, the Document Type tree, or exactly how text and <![CDATA[]]> have been mixed within a single element -- DTM as it stands will probably not meet your needs.

Similarly, DTM is really designed to be an immutable model. As noted above, changing a single string value is probably possible but may have to account for some interesting interactions. Changing the structure is not something either DTMDefaultBase or DTMDocumentImpl is currently able to handle, though there's a minor step in that direction in the RTF pruning code. DOM2DTM2 hopes to be more flexible in that regard... between stylesheet/XPath passes, not during them.

Experience with DTM has been mixed.
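[The column-per-field layout described above can be sketched as parallel int arrays, one per structural field, indexed by node number. This is a toy illustration -- the class name, field set, and NULL_NODE sentinel are invented, not the actual DTMDefaultBase layout -- but it shows how adding a previous-sibling column trades a little space for O(1) lookups.]

```java
// Toy sketch of a column-per-field node table: each node is an int index,
// and each structural field lives in its own parallel int array.
// Names and the NULL_NODE sentinel are illustrative, not Xalan's code.
final class ColumnTable {
    static final int NULL_NODE = -1;
    private int size = 0;
    private int[] parent = new int[16];
    private int[] firstChild = new int[16];
    private int[] nextSibling = new int[16];
    private int[] previousSibling = new int[16]; // the extra column

    int addNode(int parentIndex) {
        int n = size++;
        ensureCapacity(size);
        parent[n] = parentIndex;
        firstChild[n] = NULL_NODE;
        nextSibling[n] = NULL_NODE;
        previousSibling[n] = NULL_NODE;
        if (parentIndex != NULL_NODE) {
            if (firstChild[parentIndex] == NULL_NODE) {
                firstChild[parentIndex] = n;
            } else {
                int last = firstChild[parentIndex];
                while (nextSibling[last] != NULL_NODE) last = nextSibling[last];
                nextSibling[last] = n;
                previousSibling[n] = last; // stored once, O(1) to read later
            }
        }
        return n;
    }

    int getPreviousSibling(int node) { return previousSibling[node]; }

    private void ensureCapacity(int needed) {
        if (needed <= parent.length) return;
        int cap = parent.length * 2;
        parent = java.util.Arrays.copyOf(parent, cap);
        firstChild = java.util.Arrays.copyOf(firstChild, cap);
        nextSibling = java.util.Arrays.copyOf(nextSibling, cap);
        previousSibling = java.util.Arrays.copyOf(previousSibling, cap);
    }
}
```

Compared with packing four fields into bit-ranges of shared integers, separate columns cost more memory per node but need no mask-and-shift on every access.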
On the one hand, it is a more compact model. On the other hand, you may give up a lot of the power of your compiler and debugger to help you analyse your application: you can no longer just expand an object to see what a node contains, and you can't count on datatypes to help you distinguish between DTM Handles (the integers the application uses), DTM IDs (the integers DTM uses internally), and other integers. In the "do as I say, not as I did" department, I would strongly recommend you adopt a naming convention to keep those value types from getting tangled.

I know, that's a lot of "it depends" -- but that's the best answer I can give you; the choice of data structure really does depend on what your needs are. Hope it helps, anyway.

Good luck...

______________________________________
Joe Kesselman / IBM Research
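[The naming convention recommended above might look like the sketch below. Everything here is invented for illustration -- the prefixes, the bit split, and the method names are not Xalan's actual scheme -- but it shows the idea: both values are plain ints the compiler cannot tell apart, so consistent names are the only guard against mixing them up.]

```java
// Both a handle and an internal identity are bare ints; the compiler
// can't catch a mix-up, so the variable names carry the type information.
// The dtmHandle/nodeIdentity prefixes and the 16-bit split are invented.
final class HandleNaming {
    static final int ID_BITS = 16; // illustrative split, not Xalan's actual layout

    // Pack a DTM index and a node identity into one application-visible handle.
    static int makeHandle(int dtmIndex, int nodeIdentity) {
        return (dtmIndex << ID_BITS) | nodeIdentity;
    }

    // Names say which kind of int goes in and which comes out.
    static int nodeIdentityOf(int dtmHandle) { return dtmHandle & ((1 << ID_BITS) - 1); }
    static int dtmIndexOf(int dtmHandle)     { return dtmHandle >>> ID_BITS; }
}
```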
