RE: Using DTMDocumentImpl

Joseph Kesselman Tue, 01 Oct 2002 13:29:45 -0700

>If you think the DTM would be far more efficient

Depends, of course, on which DOM implementation you're comparing it to.


The "shoehorned" DTMDocumentImpl was intended to pack a node's core data 
(not counting strings) into just four integers (plus some amortized 
overhead).  That's certainly more compact than most straightforward DOM 
implementations, which generally use an object per node -- even a 
completely empty Object consumes several times that space, last I checked, 
and then you have to add the member fields. 

On the other hand, there are definite downsides. Part of that compression 
is achieved by de-optimizing certain operations -- there's no link to the 
previous sibling; to find it we look at the parent's first-child and then 
scan next-siblings until just before our starting node. And the tests and 
mask-and-shift operations needed to extract the bitfields from those 
integers also consume some cycles. We tried to avoid de-optimizing the 
operations most important to XSLT, but others are on their own.
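
To make the trade-off concrete, here is a minimal sketch of the general 
idea: four ints per node, a bitfield unpacked by mask-and-shift, and a 
previous-sibling lookup that has to scan. The class name, the field 
layout and the 8-bit type field are all my assumptions for illustration; 
this is not the actual DTMDocumentImpl layout.

```java
import java.util.Arrays;

// Hypothetical packed node table: NOT the real DTMDocumentImpl layout.
// Each node occupies four consecutive ints:
//   word 0: (nameId << 8) | nodeType   (bitfield, unpacked by mask-and-shift)
//   word 1: parent index
//   word 2: first-child index
//   word 3: next-sibling index
final class PackedNodeTable {
    static final int NULL = -1;
    private static final int SLOTS = 4;
    private static final int TYPE_BITS = 8;            // assumed field width
    private static final int TYPE_MASK = (1 << TYPE_BITS) - 1;

    private int[] data;
    private int count;

    PackedNodeTable(int initialNodes) {
        data = new int[Math.max(1, initialNodes) * SLOTS];
    }

    // Appends a node as the last child of parent (or as a root if parent is NULL).
    // nameId must fit in 24 bits, since it shares word 0 with the type.
    int addNode(int type, int nameId, int parent) {
        if ((count + 1) * SLOTS > data.length) {
            data = Arrays.copyOf(data, data.length * 2);
        }
        int node = count++;
        int base = node * SLOTS;
        data[base] = (nameId << TYPE_BITS) | (type & TYPE_MASK);
        data[base + 1] = parent;
        data[base + 2] = NULL;
        data[base + 3] = NULL;
        if (parent != NULL) {
            int first = firstChild(parent);
            if (first == NULL) {
                data[parent * SLOTS + 2] = node;
            } else {
                int last = first;
                while (nextSibling(last) != NULL) {
                    last = nextSibling(last);
                }
                data[last * SLOTS + 3] = node;
            }
        }
        return node;
    }

    int type(int node)        { return data[node * SLOTS] & TYPE_MASK; }
    int nameId(int node)      { return data[node * SLOTS] >>> TYPE_BITS; }
    int parent(int node)      { return data[node * SLOTS + 1]; }
    int firstChild(int node)  { return data[node * SLOTS + 2]; }
    int nextSibling(int node) { return data[node * SLOTS + 3]; }

    // No stored back-link: start at the parent's first child and walk
    // next-siblings until we reach the node just before the starting one.
    int previousSibling(int node) {
        int p = parent(node);
        if (p == NULL) {
            return NULL;
        }
        int prev = NULL;
        for (int cur = firstChild(p); cur != NULL && cur != node; cur = nextSibling(cur)) {
            prev = cur;
        }
        return prev;
    }
}
```

Note that previousSibling() costs time proportional to the number of 
earlier siblings, which is exactly the kind of de-optimization described 
above.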

As I say, I have no idea what the current status of DTMDocumentImpl is; I 
don't think we've actually tried running it in a Very Long Time. Getting 
it running at all may be the first step...


Alternatively, there's the current DTM code -- DTMDefaultBase and the 
SAX2DTM and DOM2DTM classes derived from it. This isn't as compact, but on 
the other hand it isn't as slow. Rather than keeping a single table of 
four-integer chunks and extracting subfields via shift-and-mask, it uses a 
separate table for each "column" of data... and it adds a few columns such 
as previous-sibling. DTMDefaultBase also contains a lot of support 
specifically for Xalan's needs.  I'd call this "more efficient" rather 
than "far more efficient" -- probably a factor of 2 rather than a factor 
of 3-4. (Note that this only refers to node size; as mentioned earlier, 
strings aren't compacted... but we do try to share single instances when a 
string is used repeatedly, and our FastStringBuffer is used to avoid the 
overhead of an object per string.)
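
For comparison, a column-per-field layout might look roughly like the 
sketch below. Again, the class and field names are hypothetical and this 
is not the DTMDefaultBase code; it only illustrates trading a little more 
space (one extra int per node for previous-sibling) for plain array reads 
with no masking or scanning.

```java
import java.util.Arrays;

// Hypothetical column-oriented node table: NOT the real DTMDefaultBase.
// Each field lives in its own int array, indexed by node number.
final class ColumnNodeTable {
    static final int NULL = -1;

    private int[] type = new int[4];
    private int[] nameId = new int[4];
    private int[] parent = new int[4];
    private int[] firstChild = new int[4];
    private int[] nextSibling = new int[4];
    private int[] prevSibling = new int[4];   // the extra column
    private int count;

    // Appends a node as the last child of parentNode (or as a root if NULL).
    int addNode(int nodeType, int name, int parentNode) {
        if (count == type.length) {
            grow();
        }
        int node = count++;
        type[node] = nodeType;
        nameId[node] = name;
        parent[node] = parentNode;
        firstChild[node] = NULL;
        nextSibling[node] = NULL;
        prevSibling[node] = NULL;
        if (parentNode != NULL) {
            int last = firstChild[parentNode];
            if (last == NULL) {
                firstChild[parentNode] = node;
            } else {
                while (nextSibling[last] != NULL) {
                    last = nextSibling[last];
                }
                nextSibling[last] = node;
                prevSibling[node] = last;
            }
        }
        return node;
    }

    // Doubles every column together so they always stay the same length.
    private void grow() {
        int n = type.length * 2;
        type = Arrays.copyOf(type, n);
        nameId = Arrays.copyOf(nameId, n);
        parent = Arrays.copyOf(parent, n);
        firstChild = Arrays.copyOf(firstChild, n);
        nextSibling = Arrays.copyOf(nextSibling, n);
        prevSibling = Arrays.copyOf(prevSibling, n);
    }

    // Every accessor is a single array read.
    int type(int node)            { return type[node]; }
    int nameId(int node)          { return nameId[node]; }
    int parent(int node)          { return parent[node]; }
    int firstChild(int node)      { return firstChild[node]; }
    int nextSibling(int node)     { return nextSibling[node]; }
    int previousSibling(int node) { return prevSibling[node]; }
}
```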

DTMDefaultBase will probably handle larger documents than DTMDocumentImpl, 
if that matters to you... at least, it will do so when teamed up with a 
DTMManager which understands the overflow-addressing scheme, such as 
DTMManagerDefault.


NOTE: DTM has been biased toward the XPath view of the document rather 
than the DOM view. The current DTM in particular tends to elide details 
which XPath doesn't care about. If you need something that captures all 
the details of a DOM, such as Entity Reference Nodes or the Document Type 
tree or exactly how text and <![CDATA[]]> have been mixed within a single 
element, DTM as it stands will probably not meet your needs. 

Similarly, DTM is really designed to be an immutable model. As noted 
above, changing a single string value is probably possible but may have to 
account for some interesting interactions. Changing the structure is not 
something either DTMDefaultBase or DTMDocumentImpl is currently able to 
handle, though there's a minor step in that direction in the RTF pruning 
code. DOM2DTM2 hopes to be more flexible in that regard... between 
stylesheet/XPath passes, not during them.


Experience with DTM has been mixed. On the one hand, it is a more compact 
model. On the other hand, you may give up a lot of the power of your 
compiler and debugger to help you analyse your application; you can no 
longer just expand an object to see what a node contains, and you can't 
count on datatypes to help you distinguish between DTM Handles (the 
integers the application uses), DTM IDs (the integers DTM uses), and other 
integers. In the "do as I say, not as I did" department, I would strongly 
recommend you adopt a naming convention to help keep those value types 
from getting tangled. 
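
As one illustration of such a naming convention: give every int a suffix 
that says which kind of integer it is, and keep all conversions in one 
place. Everything in the sketch below is hypothetical, including the 
class name, the "Handle"/"Id" suffixes and the 16-bit split between a DTM 
number and a node ID; check the real DTM and DTMManager code for the 
actual encoding.

```java
// Hypothetical naming convention for int-typed node references.
//   ...Handle : the integer the application passes around
//   ...Id     : the integer a single DTM uses internally as a table index
// The 16/16 bit split below is an assumption made for this sketch only.
final class HandleNaming {
    static final int NODE_ID_BITS = 16;
    static final int NODE_ID_MASK = (1 << NODE_ID_BITS) - 1;
    static final int MAX_DTM_NUMBER = (1 << (32 - NODE_ID_BITS)) - 1;

    private HandleNaming() {}

    // Combines a DTM number and a node ID into one application-level handle.
    static int makeNodeHandle(int dtmNumber, int nodeId) {
        if (dtmNumber < 0 || dtmNumber > MAX_DTM_NUMBER) {
            throw new IllegalArgumentException("dtmNumber out of range: " + dtmNumber);
        }
        if (nodeId < 0 || nodeId > NODE_ID_MASK) {
            throw new IllegalArgumentException("nodeId out of range: " + nodeId);
        }
        return (dtmNumber << NODE_ID_BITS) | nodeId;
    }

    // Extracts the internal node ID (low bits) from a handle.
    static int nodeIdOf(int nodeHandle) {
        return nodeHandle & NODE_ID_MASK;
    }

    // Extracts the DTM number (high bits) from a handle.
    static int dtmNumberOf(int nodeHandle) {
        return nodeHandle >>> NODE_ID_BITS;
    }
}
```

With names like nodeHandle and nodeId on every variable and parameter, a 
line such as table.parent(nodeHandle) at least looks wrong on inspection, 
even though the compiler will happily accept it.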


I know, that's a lot of "it depends" -- but that's the best answer I can 
give you; the choice of data structure really does depend on what your 
needs are. Hope it helps, anyway. Good luck...

______________________________________
Joe Kesselman  / IBM Research
