Re: [digester2] performance of ns-aware parsing
--- Simon Kitching [EMAIL PROTECTED] wrote: I stopped using belief as a measurement of code a long time ago. Usually only works when I wrote all the code. :-) I'll cook up an experiment and see what I can come up with in the way of timing information. That would be excellent. I look forward to seeing the results.. Actually, an experiment implies a question to be answered, and while this has been an interesting back-and-forth, not sure we really have a question to answer. This whole thing began with me simply asking a question about something you'd put in your readme file on the upcoming work. Practically I don't see you not expecting a namespace-aware parser, the question is really more one of the user of Digester2 deciding if they are using namespace features. While we could do timing tests to help people understand what the impact may or may not be of using NS in the documents they parse, it obviously has nothing to do with whether or not you are going to expect a parser to handle NS if the docs contain NS. That will be the developer's problem, not yours, yes? __ Do you Yahoo!? Yahoo! Mail - You care about security. So do we. http://promotions.yahoo.com/new_mail - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [digester2] performance of ns-aware parsing
--- Simon Kitching [EMAIL PROTECTED] wrote: On Thu, 2005-02-03 at 07:52 -0800, Reid Pinchback wrote: Even for SAX the performance difference between (a) and (b) is roughly a factor of 2 across all parsers when processing small (typical message-sized) docs that don't use NS. I would *really* love to see some actual measurements on this if you can find some. You seem to be quoting from some study you have done or read - it would be great to have this. [See comments on Piccolo below] Take another look at the Piccolo data, and compare the 2 Soap examples to the random no-NS data. The difference between the two Soap examples isn't material because both use NS, so in a sense you have a couple of different samples of NS data, and in the random case you have another sample, but I agree it would be better to create tests that were better understood in order to decide what the difference was. Mucking with (d) is supposed to result in significant wins when you tune the grammar handling to your app, but I haven't tried it myself and I've never seen timing differences quoted. I don't quite understand what (d) means, but is it actually relevant? Again, we are talking about *namespaces* not validation. Yes... and every entity (Element and Attribute) is jammed through a resolution process first. Remember XML attributes with default values? Guess where those values are identified and handed to the parser - during the resolution process. Namespaces just add more data to shuffle around during the resolution process. What I'm trying to achieve is to avoid having actions or patterns deal with element-names containing prefixes, e.g. stating that an element's name is foo:item. This is just broken; the item's name is really the tuple (some-namespace, item). Grammars/schemas can optionally be bound to namespaces, but namespaces themselves are a lower layer that can be used without any of these things. 
I'm talking here about requiring the parser to convert foo:item into (namespace, item) but do not intend to imply that any kind of schema should be loaded for the specified namespace. That sounds sensible. The SAXParserFactory.setNamespaceAware(true) method does exactly this; it enables mapping of prefixes -> namespaces, but does not enable processing of either DTDs or schemas. I don't think it actually has any impact at all on DTD processing. DTDs, if declared, are always processed unless you install an entity resolver that excises that activity. I agree that old parsers providing (c) aren't particularly interesting, but if you spend any time tracing through the guts of the parsing, particularly when you see how DTDs are loaded for entity resolution, you begin to see (d) as having potential. Throwing (b) away may result in less code in Digester2, but it may be worth doing some timing tests to see if that code reduction is consequence-free. What does loading DTDs have to do with namespaces? As you said, the XML spec doesn't require that the namespaces mean anything, and hence it is possible that a parser won't try to resolve and validate against multiple DTDs, but I haven't ever traced through the code in a situation where there were multiple namespaces to resolve against, so I don't know if there is a relationship there or not. In general, if a parser thinks it needs a DTD in order to understand a document, it tends to grab it. I don't know if there are situations where it tries to interpret namespace declarations as public ids for DTDs. If that happens, then those DTDs would also be loaded by the parser and namespaces would have to be matched to the appropriate collections of contexts during entity resolution. I still find it hard to believe that leaving out namespace support makes a performance difference. The parser needs to keep a map of prefix -> (stack of namespace) and that's about it. I stopped using belief as a measurement of code a long time ago. 
Usually only works when I wrote all the code. :-) I'll cook up an experiment and see what I can come up with in the way of timing information. Sorry, what per-entity operations, and what temporary object creations? The Jade/Javolution author wrote a fair bit about that, I'll see if I can find his pages. I couldn't find the details at the Javolution site; when Jade was separate he indicated that the String operations required to satisfy the SAX API semantics dragged down performance heavily. Zapthink comments on XML parsing challenges, http://searchwebservices.techtarget.com/originalContent/0,289142,sid26_gci85,00.html No occurrence of the word namespace anywhere in the article. For this and other similar concepts, it helps to start associating namespaces with other aspects of parsing internals. Elements and attributes have to be matched up to their definitions - the resolution process. Namespaces are an aspect of the match up, just more information
Re: [digester] initial code for Digester2.0
--- Simon Kitching [EMAIL PROTECTED] wrote: On Wed, 2005-02-02 at 20:45 -0800, Reid Pinchback wrote: Of course if someone can demonstrate that non-namespace-aware parsers *are* still useful then I'll change my mind. Just to clarify, since I was being sloppy before (I gotta stop typing in shorthand) there is an important distinction:
a) having NS-aware parser, always using NS-aware API methods
b) having NS-aware parser, selectively using NS-aware API methods
c) having non-NS-aware parser (and obviously never using NS-aware API methods)
d) having NS-aware parser where the developer fixes a grammar that ignores any NS distinctions
Even for SAX the performance difference between (a) and (b) is roughly a factor of 2 across all parsers when processing small (typical message-sized) docs that don't use NS. Mucking with (d) is supposed to result in significant wins when you tune the grammar handling to your app, but I haven't tried it myself and I've never seen timing differences quoted. I'm not trying to advocate any approach except to notice that, since your README mentioned requiring a namespace-aware parser, it sounded like there was a potential for options (b), (c), and (d) to become unintentionally closed to developers in Digester2 when they weren't in Digester1. I agree that old parsers providing (c) aren't particularly interesting, but if you spend any time tracing through the guts of the parsing, particularly when you see how DTDs are loaded for entity resolution, you begin to see (d) as having potential. Throwing (b) away may result in less code in Digester2, but it may be worth doing some timing tests to see if that code reduction is consequence-free. I still find it hard to believe that leaving out namespace support makes a performance difference. The parser needs to keep a map of prefix -> (stack of namespace) and that's about it. 
Actually the XML spec distinguishes between the default namespace and all other namespaces, so parsers can reasonably make the same distinction and try to avoid a bunch of per-entity operations and temporary object creations in the case where there is no namespace. Look at the Piccolo stats published on Sourceforge. Compare Soap, Soap+NS, and random XML-no NS timings and it suggests that NS ain't free. Useful links:
- Jade (now part of Javolution): http://javolution.org/api/index.html, look at the javolution.xml package (trades String for CharSequence to increase performance, but keeps NS)
- Piccolo, which you probably already have the link for, but for anybody else interested: http://piccolo.sourceforge.net
- Zapthink comments on XML parsing challenges: http://searchwebservices.techtarget.com/originalContent/0,289142,sid26_gci85,00.html
- Developerworks articles on XML performance: http://www-106.ibm.com/developerworks/xml/library/x-perfap1.html
- Sun articles on XML performance: http://java.sun.com/developer/technicalArticles/xml/JavaTechandXML_part3/
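To make the difference between options (a) and (c) concrete, here is a minimal sketch using only the JAXP/SAX classes that ship with the JDK. It parses the same one-element document twice, once with namespace processing on and once with it off, and records what startElement is handed. The class name NsDemo, the sample document, and the urn:example:things URI are made up for illustration; nothing here is Digester code, and it says nothing about timing.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class NsDemo {
    // Hypothetical sample document; the namespace URI is invented for this sketch.
    static final String XML = "<foo:item xmlns:foo='urn:example:things'/>";

    /** The names the parser reported for the root element. */
    static final class Seen {
        String uri;
        String localName;
        String qName;
    }

    /** Parses XML with namespace processing on or off; validation is left off. */
    static Seen parse(boolean namespaceAware) {
        final Seen seen = new Seen();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(XML)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    seen.uri = uri;
                    seen.localName = localName;
                    seen.qName = qName;
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return seen;
    }

    public static void main(String[] args) {
        Seen aware = parse(true);
        System.out.println("NS-aware tuple: (" + aware.uri + ", " + aware.localName + ")");
        Seen plain = parse(false);
        System.out.println("non-NS-aware name: " + plain.qName);
    }
}
```

With namespace awareness on, the handler gets the (namespace, local name) tuple; with it off, only the raw prefixed name is dependable, which is the foo:item form Simon wants rules to stop matching against.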
Re: [digester] initial code for Digester2.0
One section of the release notes says: The Digester now *always* uses a namespace-aware xml parser. I was wondering why this is. There are a lot of XML parsers out there, and some of them have done things like trade namespace awareness for performance. If somebody has an application where namespaces aren't an issue, why should they be limited to only using a namespace-aware parser? Not something that seems like an important issue if you are just using a Digester to process some kind of app config file, but it is an issue if processing streams of XML data is fundamentally what the app is about.
Re: [digester] initial code for Digester2.0
--- Oliver Zeigermann [EMAIL PROTECTED] wrote: On Wed, 02 Feb 2005 18:28:04 +1300, Simon Kitching [EMAIL PROTECTED] wrote: My major concern is that if we are going to warn people not to implement the Action interface, then what really is the point of providing it in the first place? As I said above, I just cannot think of any situation where a class would want to be an Action *and* extend some other class. I am +1 for using an interface and the default (why abstract?) implementation like with Swing or SAX. I don't get why we would ever warn people not to implement the interface, beyond including JavaDoc that clarified what the behaviour contract is for the various methods. Part of a developer's job is to exercise judgement about what they are or are not going to do in their implementation. If the existing Action implementations and base class provide what a developer needs 99% of the time, they won't bother implementing the interface, but when they encounter that 1% scenario, it's nice not to hit a brick wall. Here is a concrete example of why you could want to implement the interface and extend another class; I've actually had situations with the existing Digester where I'd wished I could do that. The one that I can recall now was an instrumentation issue. Doing debugging and performance tuning of a suite of rules can be tedious because, currently, the only options are either to watch a spew of logging messages or single-step your way through all the callbacks in a debugger (PAIN). If the major coupling points in the Digester had been abstracted by interfaces, it would have been easier to insert instrumentation proxies or EasyMock'd test implementations of classes at key points.
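As a rough illustration of the instrumentation-proxy idea, the sketch below wraps any implementation of a callback interface in a java.lang.reflect.Proxy that counts calls before delegating. The ElementAction interface and its begin/end methods are hypothetical stand-ins, not the actual Digester2 Action contract, and CallCountingProxy is a name invented here. The point is only that this trick needs an interface at the coupling point; it cannot be applied to an abstract base class.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for an Action-style callback interface. */
interface ElementAction {
    void begin(String elementName);
    void end(String elementName);
}

class CallCountingProxy {
    /** Returns a proxy that counts calls per method name, then delegates to target. */
    static ElementAction instrument(ElementAction target, Map<String, Integer> callCounts) {
        return (ElementAction) Proxy.newProxyInstance(
            ElementAction.class.getClassLoader(),
            new Class<?>[] { ElementAction.class },
            (proxy, method, args) -> {
                callCounts.merge(method.getName(), 1, Integer::sum);
                try {
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    // Rethrow whatever the real implementation threw.
                    throw e.getCause();
                }
            });
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        ElementAction real = new ElementAction() {
            public void begin(String elementName) { /* real work would go here */ }
            public void end(String elementName) { /* real work would go here */ }
        };
        ElementAction wrapped = instrument(real, counts);
        wrapped.begin("item");
        wrapped.end("item");
        System.out.println(counts);
    }
}
```

The same handler could record System.nanoTime() deltas instead of counts if timing rather than call frequency is what you are after.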
Re: [digester] initial code for Digester2.0
--- Simon Kitching [EMAIL PROTECTED] wrote: Supporting namespaces in an xml parser seems very simple to me. I think it much more likely that only antique and unmaintained parsers fail to support namespaces. And people who are determined to use antique and unmaintained parsers can just stick with digester 1.x as far as I am concerned. I'm not pushing for digester to remove non-namespace-aware support - just digester2! Wow, that is an unexpectedly harsh reaction. My reason for asking was simple, and I believe not unreasonable. You were the one asking for feedback on your proposal. Using the namespace-based API of an XML parser is known to reduce throughput substantially; this is covered in a host of Java xml mag articles, available from google searches, and in one or two of the Java performance tuning books still in distribution. XML performance tuning is a tough area, and people continually struggle with it. I don't recall the SAX-only stats, but I know that for DOM parsers you can shoot for an order-of-magnitude increase in XML processing bandwidth through a change in parser and not using NS. Antiqueness of parsers isn't the issue. I think it helps to keep in mind that NS was intended as a way of creating name-resolution scopes that allow the merging of document structures from different origins that otherwise could experience element and attribute name clashes. When somebody has an application that doesn't require that kind of merging, and they aren't using a namespace-dependent XML technology like SOAP or XML Schema, then using NS features of an NS parser can be a burden without corresponding benefit. Under the hood, that parser has to do a lot of work to continually manage the NS resolution of the node names. It has no way of knowing that the work is pointless - you've told it to assume that there is a point when you use the NS features.
Re: [digester] initial code for Digester2.0
--- Simon Kitching [EMAIL PROTECTED] wrote: Does this mean you prefer Action to Rule? I certainly expect to hear from people who want to keep the current names... I'm not wedded to Rule but I do have a concern about Action. I suspect it could make Struts code rather confusing.
Re: [digester] initial code for Digester2.0
--- Simon Kitching [EMAIL PROTECTED] wrote: Ok, we'll see what the general consensus is. I happen to personally like prefixes rather than suffixes, but will go with the majority opinion. I vote for prefixes. That sounds reasonable. However I do dislike having mutual dependencies between java packages; a DAG (directed acyclic graph) is good for a number of reasons. I strongly agree. Cyclic package dependencies seem unimportant when you only have a few classes, but as the amount of code grows, you quickly find that testing and refactoring become much more difficult than they had to be.
Re: [digester] initial code for Digester2.0
Sure thing. Just to make it easier to envision, let's get packages out of the equation. Just think about cyclic dependencies between two classes in the same package. That is enough to show the problem; packages just add complexity because the dependencies can be much harder to detect visually (usually you would use something like JDepend to spot them) and harder to unwind. Refactoring is harder simply because you have to do a larger number of smaller steps. Doesn't mean impossible, more steps just mean more work, more time, more money. Tricky enough when only two classes are involved, harder as the number of classes involved in the cycle increases. Get enough classes involved, and you start to hear statements like "it will be easier to throw that away and start over again than it will be to fix it".

    class A {
        int a;
        int fooA(int arg) {
            // 1a. do stuff with {B.fooB,a,arg}
            // 2a. do other stuff with result and {a}
        }
    }

    class B {
        int b1, b2;
        int fooB(int arg) {
            // 1b. do stuff with {A.fooA,b1,arg}
            // 2b. do other stuff with result and b2
            // 3b. do stuff with {A.fooA,b2,arg}
        }
    }

Refactoring remains possible, but tricky because you have both compile-time code dependencies and run-time state dependencies. You are faced with things like factoring out small fragments of code into helper classes, and maybe introducing an interface to at least eliminate the compile-time dependency between A and B, even if the run-time dependency remains. Often the solution ends up something like:

a) make interface I
b) create class C implements I and migrate some of A and B state into C
c) modify A and B to share I

It works, it just takes time... and often you are doing it before even trying to tackle whatever bug or feature enhancement you were faced with in the first place.
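One possible shape for the end state of steps a) to c) is sketched below. The names I, C, A and B follow the outline above; the fields moved into C and the arithmetic in each method are placeholders invented for the sketch, since the original fooA/fooB bodies are only described in comments. What matters is the dependency direction: A and B both point at I, and neither mentions the other.

```java
/** Step a): the interface both classes depend on instead of on each other. */
interface I {
    int combine(int arg);
}

/** Step b): state that used to live in A (a) and B (b1) migrates here. */
final class C implements I {
    private final int a;
    private final int b1;

    C(int a, int b1) {
        this.a = a;
        this.b1 = b1;
    }

    // Placeholder for the logic that A.fooA and B.fooB previously called on each other for.
    public int combine(int arg) {
        return a + b1 + arg;
    }
}

/** Step c): A now depends only on I. */
final class A {
    private final I shared;

    A(I shared) {
        this.shared = shared;
    }

    int fooA(int arg) {
        return shared.combine(arg) * 2; // placeholder for "other stuff with result"
    }
}

/** Step c): B also depends only on I, and keeps the state nobody else needs (b2). */
final class B {
    private final I shared;
    private final int b2;

    B(I shared, int b2) {
        this.shared = shared;
        this.b2 = b2;
    }

    int fooB(int arg) {
        return shared.combine(arg) + b2; // placeholder for "other stuff with result and b2"
    }
}

class CycleDemo {
    public static void main(String[] args) {
        I shared = new C(1, 2);
        System.out.println(new A(shared).fooA(3));
        System.out.println(new B(shared, 10).fooB(3));
    }
}
```

A side benefit is that A and B can now each be unit-tested against a stub I, which was not possible while they called each other directly.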
Re: [digester] Are performance improvements wanted?
I won't repeat my previous comments re: JUnitPerf, but they apply here too. Just looked at the bench case stuff, looks decent, better for fast tests of small code fragments. Whether it is appropriate or not depends on what you are trying to achieve. If you want to be able to record measurements (e.g. in some historical performance file) and compare against that, the approach is fine. What I'm a bit more concerned about right now is to, at more-or-less-the-same-time, compare the timings of two pieces of code in the same environment. I'd like the test to know if I've achieved an improvement or not. On the issue of platform-specific differences, I agree, that is tough. The problem with posting numbers is that systems vary so much it's hard to draw conclusions. If somebody claimed to have similar hardware and O/S to you, and their numbers are the same, higher, or lower than yours, what does it tell you? Unfortunately, the data is from an experiment that is too uncontrolled to help a developer decide if a proposed code change is likely to be faster across multiple platforms. If you are inclined to muse in the direction of random impractical thoughts, you could envision a small reference set of Java code fragments. Measure Digester performance in terms of the reference set. That performance number should be platform independent, while the actual results on any given platform would be finally determined by the raw performance of the reference set. That is essentially the technique used in a variety of numerical modeling, estimation, or optimization approaches. Definitely pie-in-the-sky category solution. Maybe put it on the Wiki for, oh, Digester 27.0. :-) --- Phil Steitz [EMAIL PROTECTED] wrote: The approach used in o.a.c.beanutils.BeanUtilsBenchCase -- creating a separate microbenchmarks test case with timing included -- could probably also be applied to [digester] and other commons components. I have no clue how one would go about eliminating platform-specific differences.
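For what it's worth, the "measure in terms of a reference set" idea can be sketched in a few lines. The harness below times a reference workload and a candidate workload in the same JVM, takes the median of several runs of each, and reports candidate time divided by reference time. RelativeBench, the choice of median, the warm-up count, and the toy workloads in main are all assumptions made for the sketch; it is not JUnitPerf or BeanUtilsBenchCase, and a serious harness would need far more care about JIT and GC effects.

```java
import java.util.Arrays;

class RelativeBench {
    // Written to by the sample workloads so the JIT cannot discard their results.
    static volatile long sink;

    /** Median wall-clock nanoseconds over the given number of runs (never below 1). */
    static long medianNanos(Runnable task, int runs) {
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return Math.max(1L, samples[runs / 2]);
    }

    /** Candidate time expressed as a multiple of the reference time on this machine. */
    static double relativeScore(Runnable reference, Runnable candidate, int warmup, int runs) {
        // Warm both workloads up together so neither gets a JIT head start.
        for (int i = 0; i < warmup; i++) {
            reference.run();
            candidate.run();
        }
        long referenceNanos = medianNanos(reference, runs);
        long candidateNanos = medianNanos(candidate, runs);
        return (double) candidateNanos / referenceNanos;
    }

    public static void main(String[] args) {
        // Toy workloads standing in for "reference fragment" and "code under test".
        Runnable reference = () -> {
            long total = 0;
            for (int i = 0; i < 100_000; i++) total += i;
            sink = total;
        };
        Runnable candidate = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) sb.append(i);
            sink = sb.length();
        };
        System.out.println("candidate / reference = " + relativeScore(reference, candidate, 20, 51));
    }
}
```

The same two-workload shape also covers the "did my change help?" case: pass the old code as the reference and the new code as the candidate, and a score reliably below 1.0 suggests an improvement on that machine.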
Re: [digester] Are performance improvements wanted?
--- Simon Kitching [EMAIL PROTECTED] wrote: You should be warned, though, that the logging area is particularly tricky. Yup, I figured that could be the case. Before I even proposed this I'd already decided that I'd just float each change as a proposal, and just grin and bear it if there was something that made the change unwise. While you strive to create performance fixes that don't change behaviour at all, sometimes you run into cases where that isn't true. When that happens, folks have to decide if the change would be to something that mattered, or not. From what I remember, there is a requirement that frameworks which use digester (eg j2ee app servers) must be able to direct logging output to different destinations depending on which app the framework is running the digester on behalf of. ... I was not able to find a better way to organise logging while satisfying the original requirements. I'm not saying there *isn't* a way to improve digester logging, just that it is probably necessary to read that email thread first to be sure the improvements still satisfy the requirements as described by Craig. Ok, I'll see if I can find anything archived about that. At a guess I bet it's something like the following: - getLogger returns a reference to a logger - Digester instances currently each have their own reference - if you use that reference to change the logger behaviour for your Digester, do you change only your own logging, or everybody else's logging via the Digester/Digester.sax categories, and would sharing a static logger change that? Can't say I've traced this kind of thing through log4j, but I'd have expected that changing the logger changed everybody's logging via the same category against the same repository. Could be I'm wrong. Normally I'd expect that if multiple clients needed different control of logging for the same category, they'd need to have their own repositories. In any case, I'm not overly worried about winning on this particular change. 
It's the kind of thing that matters more during development than during execution - it's a measurable drag on running unit tests that instantiate Digester instances in loops, but not such a big deal in real-life Digester usage. Not an issue for now, but for the future I'm particularly intrigued by some of the Wiki comments for Digester 2.0, and how it might be time to split out various areas of functionality. I think at that point you might have a chance to allow for some very serious performance improvements in areas that wouldn't be possible today without changing the API in undesirable ways. I think a lot of the circular dependencies between classes and packages that exist in Digester today are the initial sniff test of interesting opportunities with a different approach. Reid
[digester] Wiki todo 2.1.7, yes Digester can do Ant properties
FYI, I've verified that yes, the Digester substitution facilities in 1.6 can be used to do the same kind of variable substitution that Ant has. Just wanted to send in a note so nobody wastes time tackling the same problem. Once Simon has finished merging the 1.6 source into the head, I'll post the change. At that point somebody with Wiki godliness should probably mark the issue closed. Nothing earth-shattering was needed to do it. VariableExpansionTestCase was a large part of the way there; it just needed to be taken a little bit further. No changes to functional code are needed; it just required a combination of the substitutor framework, CallMethodRule, CallParamRule, and an appropriate initial object shoved on the Digester stack.
[digester] Are performance improvements wanted?
I just finished a project where I had to do a fair bit of performance tuning work over the last year. I was looking through the current digester source, and even without torquing the code weirdly or changing class APIs I've seen places that could probably be made faster. 1) Would folks be interested in digester performance fixes? No point in my wasting time on them if, for example, some major re-write is underway. 2) What would be the preferred way of submitting them? I was thinking of submitting a tweaked class as an enhancement request with an attached patch and maybe a unit test that measured both the old and new code. People could use the test to try the changes on other platforms (I'd only be testing on some Win32 sdk versions, but the fixes I have in mind should either help or at least do no harm on other platforms). How much of a gain people would see in real use of course would depend on what they were doing; I'm expecting these fixes to matter more in situations where digesters would run frequently (e.g. SOAP) and developers have, where feasible, already dealt with the obvious (factoring out rule+parser factory+parser instantiations). Thanks Reid