Re: DERBY-688: Some review comments and feedback
Hi Bryan,

Thanks so much for reviewing the changes and for verifying the patches. I appreciate your time with this.

> 3) Who will run these tests, and when? If all the execution code is optional, how do we ensure that it doesn't get broken?

I still have a couple more patches to post to complete the XML work (phases 4, 5, ...). One of those patches (probably the final one) will enable the xmlSuite to be run as part of derbyall. Due to the dependencies on Xalan and on a JAXP implementation, the suite will not run against all JVMs: to start, I'll just have it running against JVMs that have the required classes included in them.

This was originally going to be Sun and IBM 1.4, but as you discovered this past weekend, the Xalan that's embedded with Sun jdk 1.4.2 is not a recent enough version to pass. The tests cannot run against Sun 1.5 because the Xalan classes have (I believe) been renamed in Sun jdk 1.5 and thus will not be available to Derby (so far as I understand it, the user would have to download an external version of Xalan, as with Sun jdk 1.4.2).

So at this point, I think the xmlSuite will only be run for IBM 1.4 and IBM 1.5, since those jvms include Xalan 2.5 or greater and also include a JAXP parser. Anyone running derbyall with either of those jvms would run the XML tests automatically.

That said, the nightly tests that are reported here:

http://people.apache.org/~fuzzylogic/derby_test_results/

show results against ibm1.4.2. So while XML failures will not show up in Ole's report, we should at least be able to see them at the above-indicated URL.

> 4) Can you further explain the BY VALUE vs. BY REF behaviors? What do these clauses mean, why is BY REF better, at what point would we want to re-introduce BY VALUE, how does this manifest itself in the code?

BY VALUE vs BY REF manifests itself in the code mainly when dealing with variable bindings. SQL/XML[2006] defines a syntax by which a value can be bound into a query expression.
For example:

    select xmlserialize(
        xmlquery('$ci/my/stuff' passing by ref xcol as ci empty on empty)
        as clob)
    from xt_1

In this query xcol is bound into the variable $ci and then the query is executed. A key way in which BY REF and BY VALUE come into play, then, is when comparison operations between more than one XML value are part of the query. Take the following query:

    select xmlserialize(
        xmlquery('$ci/[EMAIL PROTECTED] = $c2/[EMAIL PROTECTED]'
            passing by ref xcol as ci, xcol as c2 empty on empty)
        as clob)
    from xt_1

If ci and c2 are passed BY REF then the result of this query would be true; if either was passed BY VALUE, the result would be false. This is what I tried to capture in the comments in sqlgrammar.jj, where I have:

    * [I]f the same XML value is passed BY REF into two different XML arguments
    * for a single operator, then every node in the first XML argument must have an
    * identical node in the second XML argument, and the ids for both nodes must be
    * the same.

I admit that the comment there could use some more explanation, but hopefully I can do that as a follow-up patch, instead of regenerating the patches from square one...?

A while back I was prototyping some code to support XML binding and I found that it was both easier and more efficient to support BY REF, so that's the experience on which I've based the decision to use BY REF. That said, though, it turns out that Xalan does not support variable binding (or if it does, I haven't figured it out yet), so the difference between BY REF and BY VALUE is just syntactic right now. I've chosen BY REF because that was easiest for me to implement when I did the prototyping for variable binding, and I think that's the way to go in future.
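As a rough illustration of the node-identity idea behind BY REF and BY VALUE, here is a minimal sketch that uses the JDK's DOM API rather than anything in Derby. The class and method names are hypothetical, and modelling BY VALUE as a deep copy is a simplifying assumption for the sake of the example, not a description of what Derby or Xalan actually does.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not Derby code.
public class ByRefVsByValue {

    /** Parses a small XML string into a DOM document. */
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** BY REF analogy: both variables refer to the very same node. */
    static boolean sameNodeWhenPassedByRef(Document xcol) {
        Node ci = xcol.getDocumentElement();
        Node c2 = xcol.getDocumentElement();
        return ci.isSameNode(c2);
    }

    /**
     * BY VALUE analogy (assumption: a value pass behaves like a deep copy).
     * The copy has equal content but a different node identity.
     */
    static boolean sameNodeWhenPassedByValue(Document xcol) {
        Node ci = xcol.getDocumentElement();
        Node c2 = xcol.getDocumentElement().cloneNode(true);
        return ci.isSameNode(c2);
    }

    public static void main(String[] args) throws Exception {
        Document xcol = parse("<my><stuff id=\"1\"/></my>");
        System.out.println("by ref, same node:   " + sameNodeWhenPassedByRef(xcol));
        System.out.println("by value, same node: " + sameNodeWhenPassedByValue(xcol));
    }
}
```

The point of the sketch is only that two bindings of the same value share node ids when passed by reference, whereas a copied value carries equal content under different ids.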
If at any point someone wants to fry the BY VALUE fish, then s/he should certainly feel free to do so :)

> 5) If/when you re-generate the patches, please use relative path names for the files in the patches so that we don't get strings like c:/private/derby_src/java in the file names.

Yes, will do. Sorry.

> most of your examples and tests show the use of extremely tiny XML documents; they can fit into literal strings and are at most a few hundred bytes long. But in practice, XML documents are often ridiculously gigantic things which are hundreds of thousands of bytes long, and people try not to manipulate them in memory, but rather read them from files and write them to files, streaming them through parsers and into in-memory DOM trees only as needed. How does this work in Derby?

For 10.2 I am only working to add XML support to Derby in the SQL layer. I do not plan to address XML-specific JDBC processing. I'm planning to include in the documentation something to this effect:

begin doc

There is no JDBC-side support for the XML datatype in Derby. This means it is not possible to bind directly into an XML value nor to retrieve an
DERBY-688: Some review comments and feedback
Hi Army,

thanks for posting the patches and for continuing the work on the XML support. I think this is going to be a great feature! Here's my feedback; hope it's useful.

thanks,

bryan

1) The patches read well. The comments are fantastic! The effort is greatly appreciated.

2) The patches all applied cleanly for me, once I locally edited the absolute file names in the patches. After each individual patch in the sequence, I had no problems re-building derby. So really no build problems to mention.

3) Who will run these tests, and when? If all the execution code is optional, how do we ensure that it doesn't get broken?

4) Can you further explain the BY VALUE vs. BY REF behaviors? What do these clauses mean, why is BY REF better, at what point would we want to re-introduce BY VALUE, how does this manifest itself in the code?

5) If/when you re-generate the patches, please use relative path names for the files in the patches so that we don't get strings like c:/private/derby_src/java in the file names.

6) This is kind of a user-level question, and shows my ignorance about how XML support is supposed to fit into Derby: most of your examples and tests show the use of extremely tiny XML documents; they can fit into literal strings and are at most a few hundred bytes long. But in practice, XML documents are often ridiculously gigantic things which are hundreds of thousands of bytes long, and people try not to manipulate them in memory, but rather read them from files and write them to files, streaming them through parsers and into in-memory DOM trees only as needed. How does this work in Derby? Some questions that occur:

a) If I have a large XML document in a file, how do I get that into my XML column in my database? Is it like a CLOB/BLOB where I work with some sort of a special stream class?

b) The mirror-image question is how do I fetch a large XML document from my table and stream it to my file on my client efficiently?
c) Internally, does the store use CLOB/BLOB techniques for XML storage? Does it store them in separate files?

d) How does DRDA transmit XML over the net? Is it externalized data?

Obviously, these questions are motivated by some of the work that Tomohito Nakayama and others have been doing recently with BLOB/CLOB efficiency, for example DERBY-326 and DERBY-550.

7) Another user-level question: in your test programs, your XML documents tend to be quite simple. They don't have the sorts of things that real-life XML documents have, like:

a) <?xml ... ?> headers, with varying encodings and the like
b) multiple namespaces with various namespace prefixes
c) strange sections of escaped CDATA
d) DTD declarations with external DTDs
e) named external entities

etc. Presumably, since all of this is handled by the parser, it just works. However, I'm a little confused about how the parsing happens in a client-server scenario: is the XMLPARSE processing performed on the client side? Or on the server side? I think this only becomes relevant when the user must do something to ensure that the XML parser and the XPATH/XQUERY engines are configured properly; they need to know which side (client/server) of their environment needs to be so configured.

8) We need to make sure that the documentation clearly specifies which versions of the add-on XML software (parsers, XPATH, etc.) are required, and we need to do our best to make the error messages clear and specific when a bad version is used. For example, XALAN 2.4 is bundled with the Sun 1.4 JDK but it is probably far too old to be used successfully. Yet how to install a newer version as an endorsed standard, and how to recognize the error messages when the wrong version is being used, is pretty subtle right now.

9) When I run lang/xmlBinding.java, I see the following diff.
This diff occurs in all three configurations I tried (embedded, DerbyNet, and DerbyNetClient):

    -bash-2.05b$ java org.apache.derbyTesting.functionTests.harness.RunTest lang/xmlBinding.java
    *** Start: xmlBinding jdk1.4.2_11 2006-08-05 17:28:52 ***
    9 del
    Inserted roughly 40k of data.
    10 del
    Inserted roughly 40k of data.
    10a9,10
    Inserted roughly 39k of data.
    Inserted roughly 37k of data.
    21 del
    1, [ roughly 40k ]
    22 del
    2, [ roughly 40k ]
    22a21,22
    1, [ roughly 39k ]
    2, [ roughly 37k ]
    Test Failed.
    *** End: xmlBinding jdk1.4.2_11 2006-08-05 17:28:59 ***