Re: [digester2] performance of ns-aware parsing

Reid Pinchback Sat, 05 Feb 2005 21:02:50 -0800

--- Simon Kitching <[EMAIL PROTECTED]> wrote:

> On Thu, 2005-02-03 at 07:52 -0800, Reid Pinchback wrote: 
> > Even for Sax the performance difference between (a) and (b) is roughly 
> > a factor of 2 across all parsers when processing small (typical 
> > message-sized) 
> > docs that don't use NS. 
> 
> I would *really* love to see some actual measurements on this if you can
> find some. You seem to be quoting from some study you have done or read
> - it would be great to have this. [See comments on Piccolo below]


Take another look at the Piccolo data, and compare the two SOAP examples
to the random no-NS data.  The difference between the two SOAP examples
isn't material because both use NS, so in a sense you have a couple of
samples of NS data and, in the random case, one sample of no-NS data.
I agree, though, that it would be better to create tests that are better
understood before deciding what the difference really is.


> >  Mucking with (d) is supposed to result in significant
> > wins when you tune the grammar handling to your app, but I haven't tried it 
> > myself and I've never seen timing differences quoted.  
> > 
> 
> I don't quite understand what (d) means, but is it actually relevant?
> Again, we are talking about *namespaces* not validation.

Yes... and every entity (Element and Attribute) is jammed through a
resolution process first.  Remember XML attributes with default values?
Guess where those values are identified and handed to the parser - during
the resolution process.  Namespaces just add more data to shuffle
around during the resolution process.


> What I'm trying to achieve is to avoid having actions or patterns deal
> with element-names containing prefixes, eg stating that an element's
> name is "foo:item". This is just broken; the item's name is really the
> tuple (some-namespace, item).
> 
> Grammars/schemas can optionally be bound to namespaces, but namespaces
> themselves are a lower layer that can be used without any of these
> things. I'm talking here about requiring the parser to convert
> <foo:item> into (namespace, item) but do not intend to imply that any
> kind of schema should be loaded for the specified namespace. 

That sounds sensible.

> The XMLReader.setNamespaceAware(true) method does exactly this; enables
> mapping of prefixes -> namespaces, but does not enable processing of
> either DTDs or schemas.

I don't think it has any impact at all on DTD processing.
DTDs, if declared, are always processed unless you install an entity
resolver that excises that activity.
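
To make that concrete, here is a minimal sketch of the combination being
discussed: namespace-aware SAX parsing plus an entity resolver that hands
back an empty source so the external DTD is never fetched.  Note that in
JAXP the setNamespaceAware(true) switch lives on SAXParserFactory rather
than on XMLReader.  The class name NsAwareNoDtd and the helper
elementNames are made up for illustration; they are not Digester code,
and whether a given parser consults the resolver for every DTD reference
is something I'd want to verify per parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
public class NsAwareNoDtd {

    // Parses xml and returns each element name as "{namespace-uri}localName".
    public static List<String> elementNames(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);  // map prefixes to namespace URIs
        factory.setValidating(false);     // no DTD validation requested
        SAXParser parser = factory.newSAXParser();

        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Return an empty source instead of loading the external entity.
                return new InputSource(new StringReader(""));
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // The element is identified by (namespace, local name),
                // not by the prefixed qName.
                names.add("{" + uri + "}" + localName);
            }
        };

        // SAXParser.parse registers the handler as the entity resolver too.
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }
}
```

With this, an element written as <foo:item> comes back keyed by its
namespace URI and local name, regardless of which prefix the document
happened to use.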

> >  I agree
> > that old parsers providing (c) aren't particularly interesting, but
> > if you spend any time tracing through the guts of the parsing, particularly
> > when you see how DTDs are loaded for entity resolution, you begin to see 
> > (d) as having potential.  Throwing (b) away may result in less code in
> > Digester2, but it may be worth doing some timing tests to see if that 
> > code reduction is consequence-free.
> 
> What does loading DTDs have to do with namespaces?

As you said, the XML spec doesn't require that the namespaces mean
anything, and hence it is possible that a parser won't try to resolve
and validate against multiple DTDs.  But I haven't ever traced through
the code in a situation where there were multiple namespaces to
resolve against, so I don't know if there is a relationship there or not.
In general, if a parser thinks it needs a DTD in order to understand
a document, it tends to grab it.  I don't know if there are situations
where it tries to interpret namespace declarations as public ids for DTDs.
If that happens, then those DTDs would also be loaded by the parser,
and namespaces would have to be matched to the appropriate collections
of contexts during entity resolution.


> > > I still find it hard to believe that leaving out namespace support makes
> > > a performance difference. The parser needs to keep a map of
> > >    prefix->(stack of namespace)
> > > and that's about it. 

I stopped using belief as a measurement of code a long time
ago.  Usually only works when I wrote all the code.  :-)
I'll cook up an experiment and see what I can come up with
in the way of timing information.
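
Roughly the shape of experiment I have in mind, as a sketch only: parse
the same document repeatedly with namespace awareness on and off and
compare elapsed time.  The class name NsTiming and the method timeParse
are invented for this example.  A real measurement would need warm-up
runs, several parsers, and representative documents, so no numbers should
be read into this as written.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical timing harness, for illustration only.
public class NsTiming {

    // Returns the elapsed nanoseconds for parsing xml 'iterations' times.
    public static long timeParse(String xml, boolean nsAware, int iterations)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(nsAware);
        SAXParser parser = factory.newSAXParser();
        DefaultHandler handler = new DefaultHandler(); // ignores all events

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            parser.parse(new InputSource(new StringReader(xml)), handler);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><item a=\"1\">text</item><item a=\"2\"/></root>";
        // Warm up both configurations before taking a reading.
        timeParse(xml, false, 2000);
        timeParse(xml, true, 2000);
        System.out.println("ns off: " + timeParse(xml, false, 20000) + " ns");
        System.out.println("ns on:  " + timeParse(xml, true, 20000) + " ns");
    }
}
```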


> Sorry, what per-entity operations, and what temporary object creations?

The Jade/Javolution author wrote a fair bit about that; I'll see
if I can find his pages.  I couldn't find the details at the
Javolution site, but when Jade was separate he indicated that the
String operations required to satisfy the SAX API semantics
dragged down performance heavily.

> >   Zapthink comments on XML parsing challenges,
> >   
> > http://searchwebservices.techtarget.com/originalContent/0,289142,sid26_gci858888,00.html
> 
> No occurrence of the word "namespace" anywhere in the article.

For this and other similar concepts, it helps to start associating
namespaces with other aspects of parsing internals.  Elements and 
attributes have to be "matched up" to their definitions - the 
resolution process.  Namespaces are an aspect of the match up, just 
more information to shuffle around and perform string compares against.
Take a look at all the elements and attributes in a document (e.g. 10K),
calculate all the callbacks invoked, and any activity that adds a
per-callback load has potential for impacting performance.  That is
why Jade put effort into eliminating String creations, because those
were proportional to the number of entities parsed.  Folks who try
to speed up parsers seem to follow 1 of 2 approaches:
  1. eliminate per-entity costs (same idea as factoring ops out of loops)
  2. avoid per-entity costs (e.g. pull parsers and deferred DOM parsers)
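
To get a feel for how many times a per-callback cost gets paid, a small
counting handler is enough.  This is only a sketch; the class name
CallbackCounter and its count helper are made up for the example.  Note
that a parser is free to split character data across several
characters() calls, so that particular count can vary between parsers.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that tallies SAX callbacks, for illustration only.
public class CallbackCounter extends DefaultHandler {
    public int startElements;
    public int endElements;
    public int attributes;
    public int charCalls;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        startElements++;
        attributes += atts.getLength(); // attributes reported on this element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        endElements++;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        charCalls++; // may fire more than once per text node
    }

    // Parses xml with a namespace-aware parser and returns the tallies.
    public static CallbackCounter count(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        CallbackCounter counter = new CallbackCounter();
        factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)), counter);
        return counter;
    }
}
```

Anything the parser does once per startElement, such as building a String
for each name, is multiplied by those totals.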