Re: [digester2] performance of ns-aware parsing
On Sun, 2005-02-06 at 13:02 -0800, Reid Pinchback wrote:

--- Simon Kitching [EMAIL PROTECTED] wrote:

I stopped using belief as a measurement of code a long time ago. Usually only works when I wrote all the code. :-) I'll cook up an experiment and see what I can come up with in the way of timing information.

That would be excellent. I look forward to seeing the results..

Actually, an experiment implies a question to be answered, and while this has been an interesting back-and-forth, not sure we really have a question to answer. This whole thing began with me simply asking a question about something you'd put in your readme file on the upcoming work. Practically I don't see you not expecting a namespace-aware parser, the question is really more one of the user of Digester2 deciding if they are using namespace features. While we could do timing tests to help people understand what the impact may or may not be of using NS in the documents they parse, it obviously has nothing to do with whether or not you are going to expect a parser to handle NS if the docs contain NS. That will be the developer's problem, not yours, yes?

Hi Reid,

I don't quite understand the above. You mean these are the questions?

* should people avoid creating xml documents that use namespaces if they care about the performance of later parsing the doc?
* Is there a significant performance benefit in parsing non-namespaced xml with a non-namespace-aware parser?
* Is there a significant performance benefit in parsing namespace-using-xml with a non-namespace-aware parser (yecch!).

The first is an interesting question, and is partially related to the third one in that it gives people an *option* (though not a good one IMHO) to parse the document fast. But mostly I agree this is the developer's problem, not digester's. If we can give a hint somewhere in our docs about parser performance with/without ns, though, I'm sure people would appreciate it.
For either of the last two, the answer is relevant to digester; if the answer to either is yes, then I would support allowing a non-namespace-aware parser to be used with digester. By support, I mean writing code that allows instantiation of ns-aware or non-ns-aware parser, code that looks for localname/qname, support in the RuleManager classes for matching such elements, and unit tests to test it all.

Currently, I'm not hugely motivated to test either of the last two scenarios, as I *believe* the answer to both is no, but if someone else does I'll look at the results with interest.

Is this what you meant?

Regards,

Simon
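As an illustration of what "instantiation of ns-aware or non-ns-aware parser" and "code that looks for localname/qname" could involve, here is a minimal sketch using plain JAXP/SAX. It is not Digester2 code; the class name `NsModeDemo` and the method `rootName` are hypothetical. Note that in JAXP the `setNamespaceAware` switch lives on `SAXParserFactory`; on a raw `XMLReader` the equivalent is the `http://xml.org/sax/features/namespaces` feature.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Digester or Digester2.
public class NsModeDemo {

    /** Parses the document and returns "{namespace-uri}name" for its root element. */
    public static String rootName(String xml, boolean nsAware) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(nsAware); // the one switch under discussion
        SAXParser parser = factory.newSAXParser();

        final String[] result = new String[1];
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (result[0] != null) {
                    return; // only record the root element
                }
                // Without namespace processing the local name may be empty,
                // so fall back to the raw qualified name.
                String name = (localName == null || localName.isEmpty())
                        ? qName : localName;
                result[0] = "{" + (uri == null ? "" : uri) + "}" + name;
            }
        });
        return result[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<foo:item xmlns:foo='urn:example'/>";
        System.out.println("ns-aware:     " + rootName(xml, true));
        System.out.println("non-ns-aware: " + rootName(xml, false));
    }
}
```

With a namespace-aware parser the handler sees the element as the pair (urn:example, item); with a non-aware parser only the prefixed string foo:item is available, which is the situation the rule-matching code would have to cope with if both modes were supported.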
Re: [digester2] performance of ns-aware parsing
--- Simon Kitching [EMAIL PROTECTED] wrote:

I stopped using belief as a measurement of code a long time ago. Usually only works when I wrote all the code. :-) I'll cook up an experiment and see what I can come up with in the way of timing information.

That would be excellent. I look forward to seeing the results..

Actually, an experiment implies a question to be answered, and while this has been an interesting back-and-forth, not sure we really have a question to answer. This whole thing began with me simply asking a question about something you'd put in your readme file on the upcoming work. Practically I don't see you not expecting a namespace-aware parser, the question is really more one of the user of Digester2 deciding if they are using namespace features. While we could do timing tests to help people understand what the impact may or may not be of using NS in the documents they parse, it obviously has nothing to do with whether or not you are going to expect a parser to handle NS if the docs contain NS. That will be the developer's problem, not yours, yes?
Re: [digester2] performance of ns-aware parsing
On Thu, 2005-02-03 at 07:52 -0800, Reid Pinchback wrote:

--- Simon Kitching [EMAIL PROTECTED] wrote:

On Wed, 2005-02-02 at 20:45 -0800, Reid Pinchback wrote:

Of course if someone can demonstrate that non-namespace-aware parsers *are* still useful then I'll change my mind.

Just to clarify, since I was being sloppy before (I gotta stop typing in shorthand) there is an important distinction:

a) having NS-aware parser, always using NS-aware API methods
b) having NS-aware parser, selectively using NS-aware API methods
c) having non-NS-aware parser (and obviously never using NS-aware API methods)
d) having NS-aware parser where the developer fixes a grammar that ignores any NS distinctions

Even for Sax the performance difference between (a) and (b) is roughly a factor of 2 across all parsers when processing small (typical message-sized) docs that don't use NS.

I would *really* love to see some actual measurements on this if you can find some. You seem to be quoting from some study you have done or read - it would be great to have this. [See comments on Piccolo below]

Mucking with (d) is supposed to result in significant wins when you tune the grammar handling to your app, but I haven't tried it myself and I've never seen timing differences quoted.

I don't quite understand what (d) means, but is it actually relevant? Again, we are talking about *namespaces* not validation. The w3c namespaces spec clearly makes a distinction between namespaces and whether or not the namespace URI means anything:

<quote source="http://www.w3c.org/TR/xml-names11/">
Note also that the Namespaces specification says nothing about what might (or might not) happen if one were to attempt to dereference a URI/IRI used to identify a namespace.
</quote>

What I'm trying to achieve is to avoid having actions or patterns deal with element-names containing prefixes, eg stating that an element's name is foo:item. This is just broken; the item's name is really the tuple (some-namespace, item).
Grammars/schemas can optionally be bound to namespaces, but namespaces themselves are a lower layer that can be used without any of these things. I'm talking here about requiring the parser to convert foo:item into (namespace, item) but do not intend to imply that any kind of schema should be loaded for the specified namespace. The XMLReader.setNamespaceAware(true) method does exactly this; enables mapping of prefixes -> namespaces, but does not enable processing of either DTDs or schemas.

I'm not trying to advocate any approach except to notice that, since your README mentioned requiring a namespace-aware parser, it sounded like there was a potential for options (b), (c), and (d) to become unintentionally closed to developers in Digester2 when they weren't in Digester1.

Well, I did intend to close options (b) and (c) as I didn't believe there was any reason at all to support them. Some real measurements showing the kind of performance you quote would definitely change my mind.

I agree that old parsers providing (c) aren't particularly interesting, but if you spend any time tracing through the guts of the parsing, particularly when you see how DTDs are loaded for entity resolution, you begin to see (d) as having potential. Throwing (b) away may result in less code in Digester2, but it may be worth doing some timing tests to see if that code reduction is consequence-free.

What does loading DTDs have to do with namespaces?

I still find it hard to believe that leaving out namespace support makes a performance difference. The parser needs to keep a map of prefix->(stack of namespace) and that's about it.

Actually the XML spec distinguishes between the default namespace and all other namespaces, so parsers can reasonably make the same distinction and try to avoid a bunch of per-entity operations and temporary object creations in the case where there is no namespace.

Sorry, what per-entity operations, and what temporary object creations?
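To make the "map of prefix to (stack of namespace)" idea concrete, here is a minimal sketch of that bookkeeping. It is an illustration only, not how any particular parser implements it; the class `PrefixMap` and its methods are hypothetical names, and it handles element names only (unprefixed attributes do not pick up the default namespace, which this sketch ignores).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of prefix -> (stack of namespace URIs) bookkeeping.
public class PrefixMap {

    // The empty-string key stands for the default namespace (xmlns="...").
    private final Map<String, Deque<String>> bindings = new HashMap<>();

    /** Called when an xmlns:prefix="uri" declaration comes into scope. */
    public void startPrefixMapping(String prefix, String uri) {
        bindings.computeIfAbsent(prefix, p -> new ArrayDeque<>()).push(uri);
    }

    /** Called when the declaring element ends; restores any outer binding. */
    public void endPrefixMapping(String prefix) {
        Deque<String> stack = bindings.get(prefix);
        if (stack != null && !stack.isEmpty()) {
            stack.pop();
        }
    }

    /** Converts an element name such as "foo:item" into {namespaceUri, localName}. */
    public String[] resolve(String qName) {
        int colon = qName.indexOf(':');
        String prefix = colon < 0 ? "" : qName.substring(0, colon);
        String local = colon < 0 ? qName : qName.substring(colon + 1);
        Deque<String> stack = bindings.get(prefix);
        String uri = (stack == null || stack.isEmpty()) ? "" : stack.peek();
        return new String[] { uri, local };
    }
}
```

Each element name costs one `indexOf`, up to two `substring` calls and a map lookup in this naive version, which gives a rough feel for where per-element string work and temporary objects could come from. Whether that is measurable against the rest of the parse is exactly what the thread leaves open.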
Look at the piccolo stats published on Sourceforge. Compare Soap, Soap+NS, and random XML-no NS timings and it suggests that NS ain't free.

Useful links: Jade (now part of Javolution) http://javolution.org/api/index.html, look at the javolution.xml package (trades String for CharSequence to increase performance, but keeps NS)

Hmm.. I've added a reference to javolution to the wiki. However I couldn't find any info on the performance of namespaceAware vs nonNamespaceAware...

Piccolo you probably already have the link for, but for anybody else interested: http://piccolo.sourceforge.net

Piccolo does have a page where they state their performance tests for "SOAP - namespaces off" is about 12% faster than "SOAP - namespaces on". But there is no further info on what these phrases mean. The piccolo site provides a download for SAXBench benchmarking tool, but (a) I never managed to get this working, and (b) it
Re: [digester2] performance of ns-aware parsing
--- Simon Kitching [EMAIL PROTECTED] wrote:

On Thu, 2005-02-03 at 07:52 -0800, Reid Pinchback wrote:

Even for Sax the performance difference between (a) and (b) is roughly a factor of 2 across all parsers when processing small (typical message-sized) docs that don't use NS.

I would *really* love to see some actual measurements on this if you can find some. You seem to be quoting from some study you have done or read - it would be great to have this. [See comments on Piccolo below]

Take another look at the Piccolo data, and compare the 2 Soap examples to the random no-NS data. The difference between the two Soap examples isn't material because both use NS, so in a sense you have a couple of different samples of NS data, and in the random case you have another sample, but I agree it would be better to create tests that were better understood in order to decide what the difference was.

Mucking with (d) is supposed to result in significant wins when you tune the grammar handling to your app, but I haven't tried it myself and I've never seen timing differences quoted.

I don't quite understand what (d) means, but is it actually relevant? Again, we are talking about *namespaces* not validation.

Yes... and every entity (Element and Attribute) is jammed through a resolution process first. Remember XML attributes with default values? Guess where those values are identified and handed to the parser - during the resolution process. Namespaces just add more data to shuffle around during the resolution process.

What I'm trying to achieve is to avoid having actions or patterns deal with element-names containing prefixes, eg stating that an element's name is foo:item. This is just broken; the item's name is really the tuple (some-namespace, item). Grammars/schemas can optionally be bound to namespaces, but namespaces themselves are a lower layer that can be used without any of these things.
I'm talking here about requiring the parser to convert foo:item into (namespace, item) but do not intend to imply that any kind of schema should be loaded for the specified namespace.

That sounds sensible.

The XMLReader.setNamespaceAware(true) method does exactly this; enables mapping of prefixes -> namespaces, but does not enable processing of either DTDs or schemas.

I don't think it actually has any impact at all on DTD processing. DTDs, if declared, are always processed unless you install an entity resolver that excises that activity out.

I agree that old parsers providing (c) aren't particularly interesting, but if you spend any time tracing through the guts of the parsing, particularly when you see how DTDs are loaded for entity resolution, you begin to see (d) as having potential. Throwing (b) away may result in less code in Digester2, but it may be worth doing some timing tests to see if that code reduction is consequence-free.

What does loading DTDs have to do with namespaces?

As you said, the XML spec doesn't require that the namespaces mean anything, and hence it is possible that a parser won't try to resolve and validate against multiple DTDs, but I haven't ever traced through the code in a situation where there were multiple namespaces to resolve against, so I don't know if there is a relationship there or not. In general, if a parser thinks it needs a DTD in order to understand a document, it tends to grab it. I don't know if there are situations where it tries to interpret namespace declarations as public ids for DTDs. If that happens, then those DTDs would also be loaded by the parser and namespaces would have to be matched to the appropriate collections of contexts during entity resolution.

I still find it hard to believe that leaving out namespace support makes a performance difference. The parser needs to keep a map of prefix->(stack of namespace) and that's about it.

I stopped using belief as a measurement of code a long time ago.
Usually only works when I wrote all the code. :-) I'll cook up an experiment and see what I can come up with in the way of timing information.

Sorry, what per-entity operations, and what temporary object creations?

The Jade/Javolution author wrote a fair bit about that, I'll see if I can find his pages. I couldn't find the details at the Javolution site; when Jade was separate he indicated that the String operations required to satisfy the SAX API semantics dragged down performance heavily.

Zapthink comments on XML parsing challenges, http://searchwebservices.techtarget.com/originalContent/0,289142,sid26_gci85,00.html

No occurrence of the word namespace anywhere in the article.

For this and other similar concepts, it helps to start associating namespaces with other aspects of parsing internals. Elements and attributes have to be matched up to their definitions - the resolution process. Namespaces are an aspect of the match up, just more information to
Re: [digester2] performance of ns-aware parsing
On Sat, 2005-02-05 at 21:02 -0800, Reid Pinchback wrote:

--- Simon Kitching [EMAIL PROTECTED] wrote:

Mucking with (d) is supposed to result in significant wins when you tune the grammar handling to your app, but I haven't tried it myself and I've never seen timing differences quoted.

I don't quite understand what (d) means, but is it actually relevant? Again, we are talking about *namespaces* not validation.

Yes... and every entity (Element and Attribute) is jammed through a resolution process first. Remember XML attributes with default values? Guess where those values are identified and handed to the parser - during the resolution process. Namespaces just add more data to shuffle around during the resolution process.

Well, in a document that doesn't use namespaces, the penalty is zero. In a document that uses namespaces, there are a few xmlns:... attributes floating around. But these have to be handled by the DTD processor regardless of whether namespace processing is enabled or not, yes? I don't see where namespaces add any extra data for a DTD processor to deal with during the infoset augmentation stage.

What I'm trying to achieve is to avoid having actions or patterns deal with element-names containing prefixes, eg stating that an element's name is foo:item. This is just broken; the item's name is really the tuple (some-namespace, item). Grammars/schemas can optionally be bound to namespaces, but namespaces themselves are a lower layer that can be used without any of these things. I'm talking here about requiring the parser to convert foo:item into (namespace, item) but do not intend to imply that any kind of schema should be loaded for the specified namespace.

That sounds sensible.

The XMLReader.setNamespaceAware(true) method does exactly this; enables mapping of prefixes -> namespaces, but does not enable processing of either DTDs or schemas.

I don't think it actually has any impact at all on DTD processing.
DTDs, if declared, are always processed unless you install an entity resolver that excises that activity out.

You are right; DTDs get processed in the same manner regardless of whether the parser is namespace-aware or not. What I meant was namespaceAware does not affect the parser's handling of DTDs or schemas (though it is a prerequisite for schema validation).

I agree that old parsers providing (c) aren't particularly interesting, but if you spend any time tracing through the guts of the parsing, particularly when you see how DTDs are loaded for entity resolution, you begin to see (d) as having potential. Throwing (b) away may result in less code in Digester2, but it may be worth doing some timing tests to see if that code reduction is consequence-free.

What does loading DTDs have to do with namespaces?

As you said, the XML spec doesn't require that the namespaces mean anything, and hence it is possible that a parser won't try to resolve and validate against multiple DTDs, but I haven't ever traced through the code in a situation where there were multiple namespaces to resolve against, so I don't know if there is a relationship there or not. In general, if a parser thinks it needs a DTD in order to understand a document, it tends to grab it.

I presume you're using DTD as a general term covering both traditional DTDs (which are not namespace-aware) and w3c schemas? An xml parser does need to read a DTD regardless of whether validation is enabled or not, for the reasons you pointed out: default attributes, entity definitions etc. But w3c xml schemas deliberately don't have any functionality that affects the infoset of the document. So if you're not validating you can completely ignore any xml schema - and parsers do. To double-check, I tested this today, and verified the entity resolver isn't called to resolve xsi:schemaLocation references unless validation is enabled.
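A check of this kind can be sketched as below. This is not the test Simon ran (the thread does not show it); it is a minimal reconstruction under the assumption that the default JAXP parser is used with validation left off. The class name `SchemaLocationCheck`, the sample document, and the `example.invalid` URL are all made up for illustration.

```java
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical check; counts how often the parser consults the entity resolver.
public class SchemaLocationCheck {

    public static int countResolverCalls(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false); // the case being checked: no validation
        XMLReader reader = factory.newSAXParser().getXMLReader();

        AtomicInteger calls = new AtomicInteger();
        reader.setEntityResolver((publicId, systemId) -> {
            calls.incrementAndGet();
            // Return an empty source so nothing is ever fetched from the network.
            return new InputSource(new StringReader(""));
        });
        reader.setContentHandler(new DefaultHandler());
        reader.parse(new InputSource(new StringReader(xml)));
        return calls.get();
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<root xmlns='urn:example'"
          + " xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'"
          + " xsi:schemaLocation='urn:example http://example.invalid/example.xsd'/>";
        System.out.println("entity resolver calls: " + countResolverCalls(xml));
    }
}
```

If the count stays at zero for a non-validating parse, that matches the observation reported above. Checking the validating case would additionally require enabling schema validation, and how to do that is parser-specific.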
I don't know if there are situations where it tries to interpret namespace declarations as public ids for DTDs.

No, xml parsers never dereference namespace-uris to load either DTDs or schemas. The only way to reference a schema from an xml document is via xsi:schemaLocation="namespace url". I think some XML editing programs do try to load schemas based upon the namespace URI (eg jEdit, XMLSpy) but this is quite different (and probably against the xml standard).

I still find it hard to believe that leaving out namespace support makes a performance difference. The parser needs to keep a map of prefix->(stack of namespace) and that's about it.

I stopped using belief as a measurement of code a long time ago. Usually only works when I wrote all the code. :-) I'll cook up an experiment and see what I can come up with in the way of timing information.

That would be excellent. I look forward to seeing the results..

Regards,

Simon
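For anyone wanting to run the kind of timing experiment discussed here, a very rough starting point might look like the sketch below. It is an assumption-laden illustration, not a rigorous benchmark: the class `NsTiming`, the sample document, and the iteration counts are invented, and a serious measurement would need a proper harness, representative documents, and several parsers.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical micro-timing sketch; results are indicative at best.
public class NsTiming {

    /** Returns elapsed nanoseconds for parsing the same document repeatedly. */
    public static long timeParseNanos(String xml, boolean nsAware, int iterations)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(nsAware);
        SAXParser parser = factory.newSAXParser();
        DefaultHandler handler = new DefaultHandler(); // ignores all events

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            parser.parse(new InputSource(new StringReader(xml)), handler);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        // Small message-sized document that does not use namespaces.
        StringBuilder sb = new StringBuilder("<msg>");
        for (int i = 0; i < 50; i++) {
            sb.append("<item id='").append(i).append("'>value</item>");
        }
        String xml = sb.append("</msg>").toString();

        // Warm up both modes before measuring, to reduce JIT effects.
        timeParseNanos(xml, true, 2000);
        timeParseNanos(xml, false, 2000);

        long aware = timeParseNanos(xml, true, 20000);
        long unaware = timeParseNanos(xml, false, 20000);
        System.out.println("ns-aware:     " + aware / 1_000_000 + " ms");
        System.out.println("non-ns-aware: " + unaware / 1_000_000 + " ms");
    }
}
```

This only covers the first comparison raised in the thread (a non-namespaced document, parser in either mode); feeding it a namespace-using document would address the other one.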