Great article, Dennis. I agree with your recommendation that schema designers should try to avoid using XSD data types that don't match up well with Java and .NET datatypes (and vice versa). But I have another recommendation: if the situation allows it (i.e., if the complexity of the process doesn't require complex object graphs and algorithms) just don't use a compiled OO programming language and avoid the whole XML/object impedance mismatch. Use XQuery or a scripting language instead.
I know. It's a radical idea.

Anne

On 1/19/09, Michael Poulin <[email protected]> wrote:
> Dennis asks a few questions, but let me answer only some of them now.
>
> In my experience, XMLBeans outperformed JAXB several times over at the beginning and still has better performance (now, depending on the schema complexity). I use validation all the time - this is the purpose of having Schema as a controller of data quality. XMLBeans does not allow you to generate invalid XML because it keeps the XML in parallel with the Java code, validating each and every data modification within Java. Yes, this takes additional time, but you know about this up front and can mitigate the problem by the right design.
>
> I have a couple of cases where XMLBeans saved my ... when it caught wrong data transformations in the code and in the Web Service messages. I think Dennis and I just use the same things for different purposes.
>
> I look at XML Schema (and XMLBeans/JAXB) much more widely than for data binding. In particular, if you have a Java String and a VARCHAR in the database for the same data (string), how do you control the length of that Java String? So, if I can use the XML Schema mechanism as a universal data quality controller - for Web communication and databases - I put its importance and priority above the data in the code, and this requires making the code usable from the application standpoint by design. The worst example of binding I know of is the open-source Digester, which allows converting XML into Java with no controls whatsoever; this results in garbage in Java data and code.
>
> XML Schema in my use is the most important element of data formats in Java code (the exception, obviously, is legacy code); it goes through the toughest verifications and reviews itself. That is, "just go ahead and use whatever schema type they fancy" has never happened to me.
> Because of the mentioned role of Schema, I always consider it in parallel with the application code, not one after another; i.e., I try to keep a balance between simplicity of the code, quality of data, and performance-oriented (compensating) design.
>
> - Michael
>
> ________________________________
> From: Dennis Sosnoski <[email protected]>
> To: [email protected]
> Sent: Monday, January 19, 2009 5:16:39 AM
> Subject: Re: [service-orientated-architecture] Dennis on Schema for Web Services – Part I: Basic Datatypes
>
> Michael Poulin wrote:
>> Dennis wrote: "But the data binding step needs to deal with mismatches between schema data types and structures and programming language data types and structures, and these mismatches can create problems for applications..." and he talks about JAXB 2.0 but does not even mention XMLBeans, strange.
>
> XMLBeans is not really a data binding tool, instead implementing a data binding facade over an XML store. This does mean that XMLBeans offers considerably lower performance than normal data binding tools, and programmers generally find the XMLBeans generated APIs more difficult to use. But the issues discussed in this first article all apply equally well to XMLBeans as to JAXB 2.0 - XMLBeans just uses its own GDate class rather than an XMLGregorianCalendar, and also has its own variants for other schema types (which in many cases just add another layer of wrapper around existing Java types).
>
>>
>> My approach to the problem mentioned above is the exact opposite - I believe that XML Schema, as the data quality controlling mechanism, has to dictate the format of the data in the programming language. I do not know if C# allows such control, but Java certainly does. I have used it since the first announcement of XMLBeans by BEA, when JAXB walked 'in short bridges under the table'.
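
A minimal sketch of the "schema as data quality controller" idea from the messages above, using only the JDK's javax.xml.validation API rather than XMLBeans or JAXB. The schema, the <name> element, the 10-character limit, and the class and method names are all hypothetical, chosen only to illustrate how a length limit declared in a schema (the VARCHAR question raised earlier) can be enforced on incoming XML:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class SchemaLengthCheck {
    // Hypothetical schema: one <name> element restricted to at most 10 characters.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='name'>"
        + "<xs:simpleType><xs:restriction base='xs:string'>"
        + "<xs:maxLength value='10'/>"
        + "</xs:restriction></xs:simpleType>"
        + "</xs:element></xs:schema>";

    // Returns true if the document validates against XSD, false if it violates it.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            try {
                // With no ErrorHandler set, validity errors are thrown as SAXException.
                validator.validate(new StreamSource(new StringReader(xml)));
                return true;
            } catch (SAXException invalid) {
                return false;
            }
        } catch (SAXException | IOException e) {
            // The schema itself failed to compile, or reading the input failed.
            throw new IllegalStateException("schema or I/O problem", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<name>Anne</name>"));
        System.out.println(isValid("<name>a-name-longer-than-ten-chars</name>"));
    }
}
```

This validates on demand, which is the cost Dennis mentions for turning validation on; it does not reproduce the XMLBeans behaviour of checking every modification.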
>
> That's certainly a valid approach, but the result is that you end up with code which is not necessarily very usable from the application standpoint.
>
>>
>> I still think that XMLBeans is better than JAXB, at least due to its full support of XML Schema, but I have not checked it out recently.
>
> I could make points in favor of each. Performance and ease of use aside, I think XMLBeans is far too lenient about letting you generate invalid XML - if you don't set a required value, XMLBeans happily spits out XML without that element or attribute present. Of course you can turn on validation so that it checks the output, but that has a very substantial impact on performance. I believe frameworks which automatically report an error when you try to marshal XML with missing required components are safer.
>
>>
>> So, instead of screwing up XML Schema to satisfy clumsy Java code, I enforce quality of transmitted data onto the receiver and allow it to deal with its own data quality problems. I do not mean such irresponsible constructs in the Schema as 'any'; it does not control data quality. However, with a few exceptions, I use XML Schema to generate Java objects and use the latter on the sender and receiver sides of the Web Service communication (actually, I use only document/literal style).
>
> So you'd tell people to just go ahead and use whatever schema type they fancy (e.g., nonPositiveInteger) without regard to the usefulness of the type or the impact this has on generated code? Seems silly to me if you know in advance that the values can be handled with a double/float/int/long representation, but to each their own.
>
> Out of curiosity, how do you deal with the lack of any distinction between completely- vs. incompletely-specified schema date/time values mentioned in the article?
>
> - Dennis
>
>>
>> - Michael
>>
>> ------------------------------------------------------------
>> *From:* Gervas Douglas <gervas.douglas@gmail.com>
>> *To:* service-orientated-architecture@yahoogroups.com
>> *Sent:* Sunday, January 18, 2009 9:01:54 PM
>> *Subject:* [service-orientated-architecture] Dennis on Schema for Web Services – Part I: Basic Datatypes
>>
>> *You can view the following article at:
>>
>> http://www.infoq.com/articles/schema-for-ws-part1;jsessionid=A4FA64435750D836AA32113976421FFE
>>
>> Gervas*
>>
>> <<XML message exchange is the basis of most varieties of web services, including both SOAP and REST approaches. The use of XML creates some drawbacks, including potential issues with performance, but it also provides a level of abstraction which allows for loose coupling between the parties involved in an exchange. In order for that loose coupling to really work, though, you need to be able to define the structure of XML documents being exchanged in a way which allows verification of correct documents. The W3C's XML Schema definition language (which will be referred to as just "schema" for the rest of this article) is the approach most widely used for these message structure definitions.
>>
>> Most web service applications don't work with XML documents directly, instead going through a data binding conversion layer within a web service toolkit. This is convenient for application developers, since it means they can work directly with data structures in their programming language of choice. But the data binding step needs to deal with mismatches between schema data types and structures and programming language data types and structures, and these mismatches can create problems for applications.
>> If you want your web services to provide consistent, cross-platform compatibility (which is generally the whole point of using web services in the first place), you need to design your schema definitions to avoid potential problem areas - or at least be aware of the risks involved in using problematic schema features.
>>
>> In this series of articles we're going to look at various types of problems that arise from the mismatch between schema and web service data bindings. For this first article we'll start at the most basic level, looking at simple data types and the problems they create.
>>
>>
>> Representing Numbers
>>
>> Numeric values are about as basic as you can get when it comes to business data. Given the importance of numbers, you might think that this would be an area where schema worked smoothly and consistently. And in an abstract sense, it really does - but when schema gets applied by web services toolkits you can still run into a multitude of problems.
>>
>> Part of the issue is the sheer variety of built-in schema numeric datatypes. Figure 1 shows the portions of the schema datatype tree involved in this area. To understand it, think in terms of specialization - the further you move down one of the branches of the upside-down tree, the more specialized the data that is represented by a type. At the top layer, directly under the generic anySimpleType, are the three basic numeric types float, decimal, and double. float and double are terminal types, matching the IEEE standard for floating point numbers, and as such provide excellent interoperability across web services platforms: Every major programming language supports 32-bit floating point numbers matching the float specification and 64-bit floating point numbers matching the double schema specification, so web services toolkits can just map these directly to the native language types.
>> There may be minor differences between the programming language text representations of special values (not-a-number, positive and negative infinity, and positive and negative zero) and those used by schema, but the toolkits can easily handle translation.
>>
>> /*Figure 1. Schema numeric types*/
>>
>> It's when you go down the *decimal* branch of the tree that you start running into problems. *decimal* itself is defined as a string of any number of decimal digits, with an optional leading sign and optional decimal point. *integer*, the direct descendant of *decimal*, matches a subset of the values corresponding to *decimal* in that it allows any number of decimal digits, with an optional leading sign, but does not allow a decimal point. The descendants of *integer* further restrict the allowed values, in the case of *nonPositiveInteger* and *nonNegativeInteger* by prohibiting values respectively greater than or less than zero, and in the case of *long* by limiting the range of values to a 64-bit 2s-complement equivalent. *int*, *short*, and *byte* further restrict the range, to 32-bit, 16-bit, and 8-bit 2s-complement respectively, while the *unsigned* variations match unsigned values of the same number of bits.
>>
>> All major programming languages support values matching the *long*, *int*, and *short* schema types along the main branch of the tree, but the other variations create potential problems. Java, for instance, doesn't include primitive types corresponding to *unsignedLong* or *unsignedInt*. Java web services frameworks generally work around this lack of language support by using special classes rather than primitives for these types, but this makes the web service interface somewhat awkward and can create performance issues (since primitives are generally much faster than object types when used in calculations).
>>
>>
>> Even the *decimal* and *integer* types present problems.
>> Most Java toolkits handle these using the standard /java.math.BigDecimal and java.math.BigInteger/ classes, which suffer from poor performance but support values of unlimited size. .NET instead uses a fixed-size 128-bit representation, which limits the possible value range (as allowed by the schema specification) but provides relatively good performance.
>>
>> The schema numeric types are confusing and inconsistent (why a *nonPositiveInteger* type, but no *nonPositiveDecimal* type, for instance?), and generally just represent syntactic sugar in any case (since the ranges can instead be implemented using simpleType restriction). For these reasons it's best to avoid using most of these types in your schema definitions, especially those intended for use with web services. Use specific sized types (*double* and *float* for real numbers, and *long* and *int* for integers) where possible, since these translate consistently to programming language primitive types. If you need to work with values beyond the range or precision possible with these sized types, understand that *decimal* and *integer* will not necessarily give you what you want due to implementation differences, and instead consider using a string and handling the conversion of the value in application code.
>>
>>
>> The Issues of Time
>>
>> Time-related values are another common source of problems in working with schema. Nine separate time-related datatypes are defined by schema, all based on a particular version of the Western Gregorian calendar. Unlike the numeric types, the time-related types aren't in any direct form of specialization relationship - instead, they're all considered as derived directly from the generic *anySimpleType*.
>>
>> The most widely-used time datatypes are *dateTime*, *date*, and *time*. These three datatypes share a common representation format, with *dateTime* as the general case.
>> Here's a sample *dateTime* value, for the current time as I write this article: "2008-09-08T15:38:53". A *date* value uses the same representation as a *dateTime*, but strips off the 'T' and the hour-minute-second values that follow (leaving "2008-09-08", in this case); a *time* value, conversely, strips off everything up to and including the 'T', keeping only the hour-minute-second values ("15:38:53").
>>
>> Seems pretty simple so far, right? Where it gets confusing is in the actual interpretation of one of these values. Dates and times vary depending on where you're located, with the variation normally expressed in terms of time zones. For instance, as I write this article in New Zealand I'm 12 hours ahead of Universal time and 19 hours ahead of the Pacific Daylight Time currently in effect for the West coast of the U.S. At the same instant I wrote my sample *dateTime* value here as "2008-09-08T15:38:53", the time in Seattle was "2008-09-07T20:38:53".
>>
>> For many applications you need to specify date/times in a manner which permits relating one value to another. Schema supports this requirement by allowing date/time values to use an appended time zone indication. This time zone indication can either take the form of the letter 'Z', used to indicate a Universal time (UTC) date/time value, or an offset from Universal time in hours and minutes. So any of these *dateTime* values (and many more variations) could all be used to indicate the same instant: "2008-09-08T15:38:53+12:00", "2008-09-07T20:38:53-07:00", or "2008-09-08T03:38:53Z".
>>
>> But schema doesn't /require/ that you specify a time zone indication, and without such an indication a date/time value can only be interpreted as being accurate for some arbitrary location which could be anywhere in the world.
>> For some applications that may be just what you want - a person's birth date, for instance, is usually treated as a particular date without reference to location, and people likewise celebrate the Gregorian New Year as it occurs locally around the world - but for other applications it creates major issues. Consider the case of a conference call, for instance, where all the parties involved need to coordinate the time of the event to their local clocks.
>>
>> Unfortunately, schema does not allow you to distinguish between the cases where a fully-specified date/time is needed and those where a zoneless value is allowed or even expected (at least not in a way which web services toolkits can interpret - you could do this by using *simpleType* restriction patterns, but patterns are generally ignored by the toolkits). So the ambiguity of schema on this point means that toolkits need to handle values both with and without time zone indications.
>>
>> The need to handle both types of values creates some major headaches in terms of interpretation, especially since programming languages generally implement date/time handling based on absolute time values. There's just no way to correctly convert a schema value which is missing a time zone indication to an absolute time. Of course, that doesn't stop toolkits from doing something with such values, anyway. In most cases they convert the value as supplied by assuming it's given in terms of the local time zone, and that's often what you want - but when it's not, the resulting problems can be very difficult to isolate.
>>
>> Problems due to time zones are especially messy for the *date* type. Most often, people treat dates as a fixed slot on the calendar. When you sign a legal document, for instance, you'll generally fill in the date of your signature.
>> If you agree to a new project, there'll usually be a scheduled completion date (fanciful as these scheduled dates may sometimes be). And if you're asked to show your driver's license for proof of age when making a purchase, the clerk will look at your birth date and compare it with an age cutoff. In all these cases the date is treated as having day resolution, and differences between timezones are normally ignored. But the schema date type uses an associated time zone indication, just like the dateTime and time types. This use of a time zone indication creates a disconnect between the schema date type and the common form of a date. Generally this gets handled by converting dates to the 00:00 (midnight, as the start of the day) time representing the start of that day in whatever timezone was specified. But if you then print out that date value using the local timezone, you may find it's different from what was originally specified in the document.
>>
>> If schema defined separate types for date/time values with time zone specifications and those without it'd be easy for applications to pick which type they wanted to use. Without this ability, it's difficult for toolkits to work around a basically flawed representation of date/time values in schema. Java's JAXB 2.0 takes what is probably the most comprehensive approach to the problem, handling all the schema date/time types with a special class (|javax.xml.datatype.XMLGregorianCalendar|) which corresponds directly to schema representations. This approach preserves all the nuances of schema representations of values, but at the cost of passing the interpretation issues on to developers. Other toolkits generally just use defaults, such as assuming the local timezone.
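
As a rough illustration of what "passing the interpretation issues on to developers" can look like, here is a small sketch that uses the JDK's javax.xml.datatype classes directly, without any JAXB binding. The class name ZonedCheck and its helper methods are made up for this example; the sample values are the ones quoted in the article. It only shows how application code might detect a missing time zone indication instead of silently assuming the local zone:

```java
import javax.xml.datatype.DatatypeConfigurationException;
import javax.xml.datatype.DatatypeConstants;
import javax.xml.datatype.DatatypeFactory;
import javax.xml.datatype.XMLGregorianCalendar;

class ZonedCheck {
    // Parse a schema date/time lexical value such as "2008-09-08T15:38:53Z".
    static XMLGregorianCalendar parse(String lexical) {
        try {
            return DatatypeFactory.newInstance().newXMLGregorianCalendar(lexical);
        } catch (DatatypeConfigurationException e) {
            throw new IllegalStateException("no DatatypeFactory available", e);
        }
    }

    // True only when the value carries 'Z' or an explicit offset.
    static boolean hasTimeZone(String lexical) {
        return parse(lexical).getTimezone() != DatatypeConstants.FIELD_UNDEFINED;
    }

    // True when two zoned values denote the same instant.
    static boolean sameInstant(String a, String b) {
        return parse(a).compare(parse(b)) == DatatypeConstants.EQUAL;
    }

    public static void main(String[] args) {
        System.out.println(hasTimeZone("2008-09-08T15:38:53"));
        System.out.println(hasTimeZone("2008-09-08T15:38:53+12:00"));
        System.out.println(sameInstant("2008-09-08T15:38:53+12:00",
                                       "2008-09-08T03:38:53Z"));
    }
}
```

A service that needs fully-specified values could reject input for which hasTimeZone returns false; whether that is appropriate depends on the application.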
>>
>> Given the nasty issues lurking in this area, the best general approach is probably to only use the schema date/time types for values which should be fully-specified with time zone indications, and to make sure that any documents you generate do include time zone indications. Most web services toolkits will generate the time zone indications for you on output automatically, so this last part is easy. Requiring that your input documents also use time zone indications can be more difficult, especially since documents may be going through several stages of processing. If you want to be certain you don't run into problems caused by mistaken conversion assumptions your best solution is probably to use a string type in the schema representation, so that your web service toolkit will pass the value on to your application code without trying to interpret the value.
>>
>> If you need zoneless date/time values (as for the birth date example), your best approach may again be to use a *string* type in the schema representation. That's not very satisfying from the standpoint of providing an accurate representation of the data in the schema, but avoids the issues with web services toolkits interpreting unzoned values as being in the local timezone.
>>
>>
>> References
>>
>> Data structures used internally by applications often contain multiple linkages between components, including cross-references and indirect associations. XML, on the other hand, is inherently tree-structured. It's very easy to represent one-to-many relationships in XML through containment, but any other type of relationship is problematic. Even one-to-many relationships can be inefficient. Consider the case of a document listing a customer's order history, for instance. Each order will have associated billing and shipping addresses, but these addresses are often going to be repeated from one order to the next.
>> If you just embed the addresses inside the information for each order, you'll end up with a lot of redundant information in your documents.
>>
>> References can be used to get around the limitations of XML's tree structure. The idea of a reference is that you define something once in an XML document, including a unique identifier. Any time other data needs to make use of that definition, you create a reference using the unique identifier.
>>
>> Schema directly supports two forms of references. The first, using the ID type, defines element identifiers which can be linked from anywhere in the document by using the IDREF or IDREFS types. The nice part of ID/IDREF links is that they're simple - identifiers are just names, and any type of element can define an ID value in the schema. The downside of ID/IDREF links is that they use a global context, so there's no way to say that the value used for a particular IDREF must be defined on a particular element type, and the names used as ID values must be unique within a document (even across types of elements). Some web service toolkits support using ID/IDREF links to represent references within data structures (including JAX-WS/JAXB 2.0, and Apache Axis2 when used with JiBX data binding); other toolkits (such as .NET, and Axis2 used with ADB) do not, instead treating IDREF values as simple text strings.
>>
>> The second type of references supported by schema are key/keyref links. While ID/IDREF links are defined using datatypes, key/keyref links are instead part of the structure of a schema definition. This difference allows key/keyref links to be much more expressive than ID/IDREF links, including defining contexts within which key values are unique.
>> But because key/keyref links are designed more for purposes of document validation than for structuring, they are complex and not generally used by data binding frameworks which convert XML data to and from data structures.
>>
>> So if you want to embed linkages within your XML documents and have them handled by web services toolkits, your only hope is the ID/IDREF approach. Some toolkits will support these links directly; others will just treat the identifier values as strings, but you can write application code to cross-reference the identifier and reference values and build your own links.
>>
>>
>> Conclusion
>>
>> In this article we've looked at some of the problems that arise when using the most common schema datatypes in web services. There are many other specialized schema datatypes beyond those mentioned in this article (a total of 42!), and some of these present other issues. As a general principle, the best approach to take in your web service schema definitions is to avoid the use of overly-specialized types (except for the numeric types that match common programming language types), and use a string type when you want full control over the interpretation of values.
>>
>> It's worth pointing out that although some of the issues discussed in this article could be handled better by data binding frameworks, a lot of the problems lie with schema itself. In particular, the date/time family of types are at best cumbersome to work with and at worst invite errors through the lack of distinction between zoned and unzoned value types. It's possible to pass the confusion on to the user, as JAXB does with the XMLGregorianCalendar type, but that's not really a solution.>>
