preliminary thoughts on DTD grammar caching

neilg Thu, 21 Mar 2002 14:10:47 -0800

Hi folks,

To motivate these thoughts, here's the use-case I'm imagining.  It seems to
me most likely that what people will want to cache is external DTDs;
probably large DTDs that XML documents to be validated should simply
reference and conform to. I don't think we should entertain the idea of
doing anything with the internal subset, or with parameter entities whose
contents lie outside the document (unless the parameter entity decl also
lies outside the document of course).


The first question is whether, given the current infrastructure, it's
possible to create grammar objects that correspond to external DTDs.  I
think the answer is yes, since if a document simply refers to an external
DTD and has no internal subset, then the grammar produced is just what we'd
want.

Assuming we can do this, we should decide whether to read the external
subset if there is a grammar for it.  I think the answer is that we should
not--although we will want a feature to control this...  The whole point of
grammar caching is to reduce disk access and reduce the number of method
calls; so it would seem to me counterproductive to step through the
external subset if a grammar for it is known.  On the other hand, a
validating processor is supposed to read all external decls; there's no
provision in the specs for a processor to ignore this if it somehow already
"knows" what it's doing.  So in the spirit of being 100% conformant, we'd
probably want to read external decls by default even if it might not make
much sense in this application.  I'd love to hear perspectives on this one!
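
To make the trade-off concrete, here is a rough sketch in Java of how such
a feature might gate the read.  Everything in it is made up for
illustration: ExternalSubsetPolicySketch, its Grammar stand-in and the
alwaysReadExternalSubset flag are assumptions, not existing classes or
features in the parser.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; these are not existing parser classes or features.
final class ExternalSubsetPolicySketch {

    // Stand-in for a compiled DTD grammar, identified by its system id.
    static final class Grammar {
        final String systemId;
        Grammar(String systemId) { this.systemId = systemId; }
    }

    private final Map<String, Grammar> cache = new HashMap<>();

    // Assumed feature: when true, the external subset is read even on a cache hit.
    private final boolean alwaysReadExternalSubset;

    // Counts simulated reads of the external subset (disk access plus scanning).
    int externalReads = 0;

    ExternalSubsetPolicySketch(boolean alwaysReadExternalSubset) {
        this.alwaysReadExternalSubset = alwaysReadExternalSubset;
    }

    Grammar grammarFor(String systemId) {
        Grammar cached = cache.get(systemId);
        if (cached != null && !alwaysReadExternalSubset) {
            return cached;              // cache hit: skip the external subset
        }
        externalReads++;                // stands in for stepping through the DTD
        if (cached == null) {
            cached = new Grammar(systemId);
            cache.put(systemId, cached);
        }
        return cached;                  // strict mode still validates against the cached grammar
    }
}
```

With the flag off, a second document referring to the same DTD costs no
external read at all; with it on, we pay for the read every time, which is
the conformant default I describe above.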

What should be done with internal-subset declarations in the document is
another tough question.  My sense is that, in general, they shouldn't be
ignored (even were it possible in the current framework to ignore them) and
that their presence should be reflected in the grammar that the document is
validated against.  But this implies modifying a cached grammar, which is a
very problematic idea...

I can see two ways around this.  First, we could state up front that our
DTD implementation will modify cached external subsets, and that if the
grammar pool wants to preserve a pristine grammar then it must provide a
clone to the DTD validator.  This implies a rather considerable loss in
flexibility and also means we'll have to implement a clone() method on the
grammar class; cloning a grammar will also be a significant performance hit
(though better than rebuilding from scratch of course).
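
A minimal sketch of the clone-on-retrieval idea, in Java.  The names
(CloningPoolSketch, Grammar, copy()) are hypothetical, and the grammar is
reduced to a map of declarations; a real grammar would need a much deeper
copy than this.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; not the actual grammar pool or grammar class.
final class CloningPoolSketch {

    // Stand-in for a DTD grammar: declaration name -> declaration text.
    static final class Grammar {
        final Map<String, String> decls = new HashMap<>();

        // Plays the role of the clone() method the grammar class would need.
        Grammar copy() {
            Grammar g = new Grammar();
            g.decls.putAll(decls);
            return g;
        }
    }

    private final Map<String, Grammar> pristine = new HashMap<>();

    // Store a private copy so later changes by the caller cannot leak in.
    void put(String systemId, Grammar grammar) {
        pristine.put(systemId, grammar.copy());
    }

    // Hand the validator a copy it is free to modify; null if nothing is cached.
    Grammar retrieve(String systemId) {
        Grammar g = pristine.get(systemId);
        return (g == null) ? null : g.copy();
    }
}
```

The validator can then fold internal-subset declarations into whatever
retrieve() returns without touching the pooled copy, at the cost of one
copy per document.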

Alternatively, we could state that, if a grammar comes from a cache, our
DTD implementation will not modify it (i.e., internal decls will have no
effect, although we could still send them off to the handlers).  This is
similarly inflexible, and means that when DTD grammar caching is employed,
we are no longer an XML 1.0-compliant parser.
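
Roughly, this read-only behaviour might look like the Java sketch below.
Again the names are invented for illustration (ReadOnlyCachedGrammarSketch,
the fromCache flag, internalDecl()), and the list of strings merely stands
in for the real handler callbacks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; not the actual DTD validator.
final class ReadOnlyCachedGrammarSketch {

    // Stand-in for a DTD grammar that remembers whether it came from a cache.
    static final class Grammar {
        final boolean fromCache;
        final Map<String, String> decls = new HashMap<>();
        Grammar(boolean fromCache) { this.fromCache = fromCache; }
    }

    // Stands in for the handlers that still get told about every declaration.
    final List<String> reportedToHandlers = new ArrayList<>();

    // Called for each declaration found in the internal subset.
    void internalDecl(Grammar grammar, String name, String text) {
        reportedToHandlers.add(name);       // handlers always see the declaration
        if (!grammar.fromCache) {
            grammar.decls.put(name, text);  // only uncached grammars are updated
        }
    }
}
```

The cached grammar stays exactly as it was, so it can be shared freely, but
the document is then validated as though its internal subset were empty.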

I guess a feature could be used to select the behaviour.  I'm just afraid
of the number of times we'll have to check the status of this feature when
grammar caching's enabled...  There are a lot of cases involved, and this
might not be trivial performance-wise (or implementation-wise either).

Lots of questions here; thoughts, comments, answers etc. greatly
appreciated!

Cheers,
Neil
Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  [EMAIL PROTECTED]


