Re: [digester2] performance of ns-aware parsing

2005-02-06 Thread Simon Kitching
On Sun, 2005-02-06 at 13:02 -0800, Reid Pinchback wrote:
 --- Simon Kitching [EMAIL PROTECTED] wrote:
   I stopped using belief as a measurement of code a long time
   ago.  Usually only works when I wrote all the code.  :-)
   I'll cook up an experiment and see what I can come up with
   in the way of timing information.
  
  That would be excellent. I look forward to seeing the results..
 
 Actually, an experiment implies a question to be answered, and
 while this has been an interesting back-and-forth, not sure
 we really have a question to answer.  This whole thing began
 with me simply asking a question about something you'd
 put in your readme file on the upcoming work.  Practically
 I don't see you not expecting a namespace-aware parser, the
 question is really more one of the user of Digester2 deciding
 if they are using namespace features.  While we could do
 timing tests to help people understand what the impact may
 or may not be of using NS in the documents they parse, it
 obviously has nothing to do with whether or not you are
 going to expect a parser to handle NS if the docs contain NS.
 That will be the developer's problem, not yours, yes?

Hi Reid,


I don't quite understand the above.

You mean these are the questions?
* should people avoid creating xml documents that use namespaces
  if they care about the performance of later parsing the doc?
* Is there a significant performance benefit in parsing 
  non-namespaced xml with a non-namespace-aware parser?
* Is there a significant performance benefit in parsing
  namespace-using-xml with a non-namespace-aware parser
  (yecch!).

The first is an interesting question, and is partially related to the
third one in that it gives people an *option* (though not a good one
IMHO) to parse the document fast. But mostly I agree this is the
developer's problem, not digester's. If we can give a hint somewhere in
our docs about parser performance with/without ns, though, I'm sure
people would appreciate it.

For either of the last two, the answer is relevant to digester; if the
answer to either is yes, then I would support allowing a
non-namespace-aware parser to be used with digester. By support, I mean
writing code that allows instantiation of ns-aware or non-ns-aware
parser, code that looks for localname/qname, support in the RuleManager
classes for matching such elements, and unit tests to test it all.
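
[Editor's note] A minimal sketch of what "code that looks for
localname/qname" could look like with plain JAXP/SAX. The class name
ElementNameCollector and its collect method are hypothetical and are not
part of Digester; the sketch only assumes the standard SAXParserFactory
and DefaultHandler classes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not Digester code: records each element name,
// using (namespace, localName) when available and the raw qName otherwise.
class ElementNameCollector {
    static List<String> collect(String xml, boolean namespaceAware) {
        List<String> names = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    // A non-namespace-aware parser may report an empty
                    // localName, so fall back to the qualified name.
                    boolean hasLocal = localName != null && !localName.isEmpty();
                    names.add(hasLocal ? "{" + uri + "}" + localName : qName);
                }
            });
        } catch (Exception e) {
            // Wrapped only to keep the sketch's signature simple.
            throw new IllegalStateException(e);
        }
        return names;
    }
}
```

With the JDK's bundled parser, collecting the names of
<foo:item xmlns:foo="urn:example"/> should give {urn:example}item in
namespace-aware mode and foo:item otherwise, which is the distinction a
rule matcher would have to cope with if both modes were supported.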

Currently, I'm not hugely motivated to test either of the last two
scenarios, as I *believe* the answer to both is no, but if someone else
does I'll look at the results with interest.

Is this what you meant?

Regards,

Simon


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [digester2] performance of ns-aware parsing

2005-02-06 Thread Reid Pinchback

--- Simon Kitching [EMAIL PROTECTED] wrote:
  I stopped using belief as a measurement of code a long time
  ago.  Usually only works when I wrote all the code.  :-)
  I'll cook up an experiment and see what I can come up with
  in the way of timing information.
 
 That would be excellent. I look forward to seeing the results..

Actually, an experiment implies a question to be answered, and
while this has been an interesting back-and-forth, not sure
we really have a question to answer.  This whole thing began
with me simply asking a question about something you'd
put in your readme file on the upcoming work.  Practically
I don't see you not expecting a namespace-aware parser, the
question is really more one of the user of Digester2 deciding
if they are using namespace features.  While we could do
timing tests to help people understand what the impact may
or may not be of using NS in the documents they parse, it
obviously has nothing to do with whether or not you are
going to expect a parser to handle NS if the docs contain NS.
That will be the developer's problem, not yours, yes?









Re: [digester2] performance of ns-aware parsing

2005-02-05 Thread Simon Kitching
On Thu, 2005-02-03 at 07:52 -0800, Reid Pinchback wrote: 
 --- Simon Kitching [EMAIL PROTECTED] wrote:
 
  On Wed, 2005-02-02 at 20:45 -0800, Reid Pinchback wrote:
  Of course if someone can demonstrate that non-namespace-aware parsers
  *are* still useful then I'll change my mind.
 
 Just to clarify, since I was being sloppy before (I gotta
 stop typing in shorthand) there is an important distinction:
 
 a) having NS-aware parser, always using NS-aware API methods
 b) having NS-aware parser, selectively using NS-aware API methods
 c) having non-NS-aware parser (and obviously never using NS-aware API methods)
 d) having NS-aware parser where the developer fixes a grammar that
ignores any NS distinctions
 


 Even for SAX the performance difference between (a) and (b) is roughly
 a factor of 2 across all parsers when processing small (typical
 message-sized) docs that don't use NS.

I would *really* love to see some actual measurements on this if you can
find some. You seem to be quoting from some study you have done or read
- it would be great to have this. [See comments on Piccolo below]


  Mucking with (d) is supposed to result in significant
 wins when you tune the grammar handling to your app, but I haven't tried it 
 myself and I've never seen timing differences quoted.  
 

I don't quite understand what (d) means, but is it actually relevant?
Again, we are talking about *namespaces* not validation.

The w3c namespaces spec clearly makes a distinction between namespaces
and whether or not the namespace URI means anything:

<quote source="http://www.w3c.org/TR/xml-names11/">
Note also that the Namespaces specification says nothing about what
might (or might not) happen if one were to attempt to dereference a
URI/IRI used to identify a namespace.
</quote>

What I'm trying to achieve is to avoid having actions or patterns deal
with element-names containing prefixes, eg stating that an element's
name is foo:item. This is just broken; the item's name is really the
tuple (some-namespace, item).

Grammars/schemas can optionally be bound to namespaces, but namespaces
themselves are a lower layer that can be used without any of these
things. I'm talking here about requiring the parser to convert
foo:item into (namespace, item) but do not intend to imply that any
kind of schema should be loaded for the specified namespace. 

The XMLReader.setNamespaceAware(true) method does exactly this; enables
mapping of prefixes -> namespaces, but does not enable processing of
either DTDs or schemas.
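
[Editor's note] A small sketch of the behaviour described above:
namespace awareness turned on, validation left off, and the element
reported as a (namespace, local-name) tuple. In standard JAXP the
setNamespaceAware switch lives on SAXParserFactory (the XMLReader
equivalent is the http://xml.org/sax/features/namespaces feature). The
class and method names below are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration: maps prefix:name to (namespace, name)
// without enabling any validation.
class NamespaceOnlyParse {
    static String firstElementAsTuple(String xml) {
        final String[] result = new String[1];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);  // prefix -> namespace mapping only
            // Validation is a separate switch; it is deliberately not set here.
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (result[0] == null) {
                        result[0] = "(" + uri + ", " + localName + ")";
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return result[0];
    }
}
```

Two documents that bind different prefixes to the same URI yield the same
tuple, which is why patterns written against the tuple do not need to
know about prefixes at all.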


 I'm not trying to advocate any approach except to notice that, since your 
 README mentioned requiring a namespace-aware parser, it sounded like 
 there was a potential for options (b), (c), and (d) to become unintentionally
 closed to developers in Digester2 when they weren't in Digester1. 

Well, I did intend to close options (b) and (c) as I didn't believe
there was any reason at all to support them. Some real measurements
showing the kind of performance you quote would definitely change my
mind.

  I agree
 that old parsers providing (c) aren't particularly interesting, but
 if you spend any time tracing through the guts of the parsing, particularly
 when you see how DTDs are loaded for entity resolution, you begin to see 
 (d) as having potential.  Throwing (b) away may result in less code in
 Digester2, but it may be worth doing some timing tests to see if that 
 code reduction is consequence-free.

What does loading DTDs have to do with namespaces?


  I still find it hard to believe that leaving out namespace support makes
  a performance difference. The parser needs to keep a map of
 prefix->(stack of namespace)
  and that's about it. 
 
 Actually the XML spec distinguishes between the default namespace
 and all other namespaces, so parsers can reasonably make the same
 distinction and try to avoid a bunch of per-entity operations and 
 temporary object creations in the case where there is no namespace.

Sorry, what per-entity operations, and what temporary object creations?
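
[Editor's note] For reference, the "map of prefix->(stack of namespace)"
mentioned earlier in this exchange can be sketched as below. This is a
hypothetical illustration only; real parsers use their own, more heavily
optimised structures (SAX itself ships a helper,
org.xml.sax.helpers.NamespaceSupport, for this kind of bookkeeping).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of prefix -> stack-of-namespace-URI bookkeeping.
class PrefixMap {
    private final Map<String, Deque<String>> bindings = new HashMap<>();

    // Called for each xmlns:prefix="uri" declaration on an opening element.
    void declare(String prefix, String uri) {
        bindings.computeIfAbsent(prefix, p -> new ArrayDeque<>()).push(uri);
    }

    // Called when the element that declared the prefix is closed.
    void undeclare(String prefix) {
        Deque<String> stack = bindings.get(prefix);
        if (stack != null && !stack.isEmpty()) {
            stack.pop();
        }
    }

    // Returns the innermost binding for the prefix, or null if unbound.
    String resolve(String prefix) {
        Deque<String> stack = bindings.get(prefix);
        return (stack == null || stack.isEmpty()) ? null : stack.peek();
    }
}
```

The stack is what lets an inner element rebind a prefix and have the
outer binding come back into force once that element closes.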

 Look at the piccolo stats published on Sourceforge.  Compare Soap, 
 Soap+NS, and random XML-no NS timings and it suggests that NS 
 ain't free.
 
 Useful links:
 
   Jade (now part of Javolution) http://javolution.org/api/index.html,
   look at the javolution.xml package (trades String for CharSequence
   to increase performance, but keeps NS)

Hmm.. I've added a reference to javolution to the wiki. 

However I couldn't find any info on the performance of namespaceAware vs
nonNamespaceAware...

 
   Piccolo you probably already have the link for, but for anybody
   else interested: http://piccolo.sourceforge.net

Piccolo does have a page where they state their performance tests for
SOAP - namespaces off is about 12% faster than SOAP - namespaces on.
But there is no further info on what these phrases mean.

The piccolo site provides a download for SAXBench benchmarking tool,
but (a) I never managed to get this working, and (b) it 

Re: [digester2] performance of ns-aware parsing

2005-02-05 Thread Reid Pinchback

--- Simon Kitching [EMAIL PROTECTED] wrote:

 On Thu, 2005-02-03 at 07:52 -0800, Reid Pinchback wrote: 
  Even for SAX the performance difference between (a) and (b) is roughly
  a factor of 2 across all parsers when processing small (typical
  message-sized) docs that don't use NS.
 
 I would *really* love to see some actual measurements on this if you can
 find some. You seem to be quoting from some study you have done or read
 - it would be great to have this. [See comments on Piccolo below]

Take another look at the Piccolo data, and compare the 2 Soap examples
to the random no-NS data.  The difference between the two Soap examples
isn't material because both use NS, so in a sense you have a couple of
different samples of NS data, and in the random case you have another
sample, but I agree it would be better to create tests that were better
understood in order to decide what the difference was.


   Mucking with (d) is supposed to result in significant
  wins when you tune the grammar handling to your app, but I haven't tried it 
  myself and I've never seen timing differences quoted.  
  
 
 I don't quite understand what (d) means, but is it actually relevant?
 Again, we are talking about *namespaces* not validation.

Yes... and every entity (Element and Attribute) is jammed through a
resolution process first.  Remember XML attributes with default values?
Guess where those values are identified and handed to the parser - during
the resolution process.  Namespaces just add more data to shuffle
around during the resolution process.


 What I'm trying to achieve is to avoid having actions or patterns deal
 with element-names containing prefixes, eg stating that an element's
 name is foo:item. This is just broken; the item's name is really the
 tuple (some-namespace, item).
 
 Grammars/schemas can optionally be bound to namespaces, but namespaces
 themselves are a lower layer that can be used without any of these
 things. I'm talking here about requiring the parser to convert
 foo:item into (namespace, item) but do not intend to imply that any
 kind of schema should be loaded for the specified namespace. 

That sounds sensible.

 The XMLReader.setNamespaceAware(true) method does exactly this; enables
 mapping of prefixes -> namespaces, but does not enable processing of
 either DTDs or schemas.

I don't think it actually has any impact at all on DTD processing.
DTDs, if declared, are always processed unless you install an entity 
resolver that excises that activity out.

   I agree
  that old parsers providing (c) aren't particularly interesting, but
  if you spend any time tracing through the guts of the parsing, particularly
  when you see how DTDs are loaded for entity resolution, you begin to see 
  (d) as having potential.  Throwing (b) away may result in less code in
  Digester2, but it may be worth doing some timing tests to see if that 
  code reduction is consequence-free.
 
 What does loading DTDs have to do with namespaces?

As you said, the XML spec doesn't require that the namespaces mean
anything, and hence it is possible that a parser won't try to resolve
and validate against multiple DTDs, but I haven't ever traced through
the code in a situation where there were multiple namespaces to
resolve against, so I don't know if there is a relationship there or not.
In general, if a parser thinks it needs a DTD in order to understand
a document, it tends to grab it.  I don't know if there are situations
where it tries to interpret namespace declarations as public ids for DTDs.
If that happens, then those DTDs would also be loaded by the parser
and namespaces would have to be matched to the appropriate collections
of contexts during entity resolution.


   I still find it hard to believe that leaving out namespace support makes
   a performance difference. The parser needs to keep a map of
  prefix->(stack of namespace)
   and that's about it. 

I stopped using belief as a measurement of code a long time
ago.  Usually only works when I wrote all the code.  :-)
I'll cook up an experiment and see what I can come up with
in the way of timing information.
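
[Editor's note] One possible shape for such an experiment, as a rough
sketch only. It is not the experiment referred to above, and the class
name, sample document and iteration counts are all assumptions. A naive
timing loop like this is sensitive to JIT warm-up and garbage collection,
so any numbers it prints should be treated as indicative at best; a
proper harness such as JMH would be more trustworthy.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical micro-benchmark: same document, same parser implementation,
// namespace awareness switched on or off.
class NsTiming {
    // Parses xml the given number of times and returns elapsed nanoseconds.
    static long time(String xml, boolean namespaceAware, int iterations) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            SAXParser parser = factory.newSAXParser();
            DefaultHandler handler = new DefaultHandler(); // ignores all events
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                parser.parse(new InputSource(new StringReader(xml)), handler);
            }
            return System.nanoTime() - start;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Small, message-sized document that does not use namespaces.
        String xml = "<order><item id=\"1\">widget</item>"
                   + "<item id=\"2\">gadget</item></order>";
        time(xml, true, 2000);   // warm-up runs, results discarded
        time(xml, false, 2000);
        System.out.println("ns-aware:     " + time(xml, true, 20000) + " ns");
        System.out.println("not ns-aware: " + time(xml, false, 20000) + " ns");
    }
}
```

Repeating the run with a document that does use namespaces would cover
the other cases raised in this thread.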


 Sorry, what per-entity operations, and what temporary object creations?

The Jade/Javolution author wrote a fair bit about that, I'll see
if I can find his pages.  I couldn't find the details at the
Javolution site; when Jade was separate he indicated that the
String operations required to satisfy the SAX API semantics 
dragged down performance heavily.

Zapthink comments on XML parsing challenges,

  http://searchwebservices.techtarget.com/originalContent/0,289142,sid26_gci85,00.html
 
 No occurrence of the word namespace anywhere in the article.

For this and other similar concepts, it helps to start associating
namespaces with other aspects of parsing internals.  Elements and 
attributes have to be matched up to their definitions - the 
resolution process.  Namespaces are an aspect of the match up, just 
more information to 

Re: [digester2] performance of ns-aware parsing

2005-02-05 Thread Simon Kitching
On Sat, 2005-02-05 at 21:02 -0800, Reid Pinchback wrote:
 --- Simon Kitching [EMAIL PROTECTED] wrote:
   Mucking with (d) is supposed to result in significant wins when you
   tune the grammar handling to your app, but I haven't tried it myself
   and I've never seen timing differences quoted.
   
  
  I don't quite understand what (d) means, but is it actually relevant?
  Again, we are talking about *namespaces* not validation.
 
 Yes... and every entity (Element and Attribute) is jammed through a
 resolution process first.  Remember XML attributes with default values?
 Guess where those values are identified and handed to the parser - during
 the resolution process.  Namespaces just add more data to shuffle
 around during the resolution process.

Well, in a document that doesn't use namespaces, the penalty is zero.

In a document that uses namespaces, there are a few xmlns:... attributes
floating around. But these have to be handled by the DTD processor
regardless of whether namespace processing is enabled or not, yes?

I don't see where namespaces add any extra data for a DTD processor to
deal with during the infoset augmentation stage.


 
  What I'm trying to achieve is to avoid having actions or patterns deal
  with element-names containing prefixes, eg stating that an element's
  name is foo:item. This is just broken; the item's name is really the
  tuple (some-namespace, item).
  
  Grammars/schemas can optionally be bound to namespaces, but namespaces
  themselves are a lower layer that can be used without any of these
  things. I'm talking here about requiring the parser to convert
  foo:item into (namespace, item) but do not intend to imply that any
  kind of schema should be loaded for the specified namespace. 
 
 That sounds sensible.
 
  The XMLReader.setNamespaceAware(true) method does exactly this; enables
  mapping of prefixes -> namespaces, but does not enable processing of
  either DTDs or schemas.
 
 I don't think it actually has any impact at all on DTD processing.
 DTDs, if declared, are always processed unless you install an entity 
 resolver that excises that activity out.

You are right; DTDs get processed in the same manner regardless of
whether the parser is namespace-aware or not. What I meant was
namespaceAware does not affect the parser's handling of DTDs or schemas
(though it is a prerequisite for schema validation).
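
[Editor's note] The "install an entity resolver" approach mentioned in
the quoted text can be sketched as follows. The resolver hands the parser
an empty stand-in instead of letting it fetch the external DTD. The class
and method names are hypothetical, and the DTD URL is a placeholder that
is never actually contacted.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration: parse a document that declares an external
// DTD without loading it, and report which system ids were requested.
class SkipDtd {
    static List<String> parseSkippingDtd(String xml) {
        List<String> requested = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public InputSource resolveEntity(String publicId, String systemId) {
                        requested.add(systemId);
                        // Empty replacement: nothing is fetched.
                        return new InputSource(new StringReader(""));
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return requested;
    }
}
```

Note that skipping the DTD also skips anything it would have contributed
to the document, such as default attribute values and entity definitions.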

 
   I agree that old parsers providing (c) aren't particularly interesting,
   but if you spend any time tracing through the guts of the parsing,
   particularly when you see how DTDs are loaded for entity resolution,
   you begin to see (d) as having potential.  Throwing (b) away may result
   in less code in Digester2, but it may be worth doing some timing tests
   to see if that code reduction is consequence-free.
  
  What does loading DTDs have to do with namespaces?
 
 As you said, the XML spec doesn't require that the namespaces mean
 anything, and hence it is possible that a parser won't try to resolve
 and validate against multiple DTDs, but I haven't ever traced through
 the code in a situation where there were multiple namespaces to
 resolve against, so I don't know if there is relationship there or not.
 In general, if a parser thinks it needs a DTD in order to understand
 a document, it tends to grab it.  

I presume you're using DTD as a general term covering both traditional
DTDs (which are not namespace-aware) and w3c schemas?

An xml parser does need to read a DTD regardless of whether validation
is enabled or not, for the reasons you pointed out: default attributes,
entity definitions etc.

But w3c xml schemas deliberately don't have any functionality that
affects the infoset of the document. So if you're not validating you can
completely ignore any xml schema - and parsers do. To double-check, I
tested this today, and verified the entity resolver isn't called to
resolve xsi:schemaLocation references unless validation is enabled.
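
[Editor's note] A check along these lines can be sketched as below. It is
not the test described above, only an assumption about how such a check
might be written: a non-validating, namespace-aware parse of a document
carrying xsi:schemaLocation, with a resolver that counts how often it is
consulted. The class name and the schema URL are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical probe: counts entity-resolver calls during a
// non-validating parse.
class SchemaLocationProbe {
    static int resolverCallsWithoutValidation(String xml) {
        final int[] calls = {0};
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(false); // the default, stated for emphasis
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public InputSource resolveEntity(String publicId, String systemId) {
                        calls[0]++;
                        return new InputSource(new StringReader(""));
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return calls[0];
    }
}
```

A count of zero for a document with no DOCTYPE is consistent with the
statement that the schema reference is ignored when validation is off.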

 I don't know if there are situations
 where it tries to interpret namespace declarations as public ids for DTDs.

No, xml parsers never dereference namespace-uris to load either DTDs or
schemas. The only way to reference a schema from an xml document is via
  xsi:schemaLocation="namespace url"

I think some XML editing programs do try to load schemas based upon the
namespace URI (eg jEdit, XMLSpy) but this is quite different (and
probably against the xml standard).


   I still find it hard to believe that leaving out namespace support makes
   a performance difference. The parser needs to keep a map of
   prefix->(stack of namespace)
   and that's about it.
 
 I stopped using belief as a measurement of code a long time
 ago.  Usually only works when I wrote all the code.  :-)
 I'll cook up an experiment and see what I can come up with
 in the way of timing information.

That would be excellent. I look forward to seeing the results..


Regards,

Simon