Re: [service-orientated-architecture] Dennis on Schema for Web Services – Part I: Basic Datatypes

Dennis Sosnoski Mon, 19 Jan 2009 14:20:18 -0800

Keeping the document as XML is a natural approach for people coming from 
an XML-centric background, Anne. I think it's a great approach where 
you're only using selected portions of the document data, and as I 
mentioned in my response to Michael, XMLBeans is very good for this type
of processing. Keeping the document as XML also works very well when a 
processing step is just transforming documents, so that XSLT can be 
used. But if you're using essentially all the document data in your 
processing, this approach is going to be both very slow and very awkward 
for developers who are coming at the problem from a programming language 
background.


Incidentally, I disagree with the idea that there's an XML/object 
impedance mismatch. Unlike RDBs, XML has a simple structure which is 
very easy to translate into equivalent objects. There's definitely an 
XSD/programming language impedance mismatch, though, which is the point 
of the InfoQ series.

  - Dennis


Anne Thomas Manes wrote:
> Great article, Dennis. I agree with your recommendation that schema
> designers should try to avoid using XSD data types that don't match up
> well with Java and .NET datatypes (and vice versa). But I have another
> recommendation: if the situation allows it (i.e., if the complexity of
> the process doesn't require complex object graphs and algorithms) just
> don't use a compiled OO programming language and avoid the whole
> XML/object impedance mismatch. Use XQuery or a scripting language
> instead.
>
> I know. It's a radical idea.
>
> Anne
>
> On 1/19/09, Michael Poulin <[email protected]> wrote:
>   
>> Dennis asks a few questions, but let me answer only some of them now.
>>
>> In my experience, XMLBeans outperformed JAXB several times over at the
>> beginning and still has better performance (now, depending on the schema
>> complexity). I use validation all the time - this is the purpose of having
>> Schema as a controller of data quality. XMLBeans does not allow you to
>> generate invalid XML because it keeps the XML in parallel with the Java
>> code, validating any and every data modification within Java. Yes, this
>> takes additional time, but you know about this up front and can mitigate
>> the problem by the right design.
>>
>> I have a couple of cases where XMLBeans saved my ... when it caught wrong
>> data transformation in the code and in the Web Service messages. I think
>> Dennis and I just use the same things for different purposes.
>>
>> I look at XML Schema (and XMLBeans/JAXB) much more widely than just for
>> data binding. In particular, if you have a Java String and a VARCHAR in
>> the database for the same data (string), how do you control the length of
>> that Java String? So, if I can use the XML Schema mechanism as a universal
>> data quality controller - for Web communication and databases - I put its
>> importance and priority above the data in the code, and require making
>> this code usable from the application standpoint by design. The worst
>> example of binding I know is the open-source Digester, which allows
>> converting XML into Java with no controls whatsoever; this results in
>> garbage in Java data and code.
>>
>> XML Schema in my use is the most important element of data formats in Java
>> code (the exception, obviously, is legacy code); it goes through the
>> toughest verifications and reviews itself. That is, "just go ahead and use
>> whatever schema type they fancy" has never happened to me. Because of this
>> role of Schema, I always consider it in parallel with application
>> code, not one after another, i.e. I am trying to keep a balance between
>> simplicity of the code, quality of data and performance-oriented
>> (compensating) design.
>>
>> - Michael
>>
>>
>>
>>
>> ________________________________
>> From: Dennis Sosnoski <[email protected]>
>> To: [email protected]
>> Sent: Monday, January 19, 2009 5:16:39 AM
>> Subject: Re: [service-orientated-architecture] Dennis on Schema for Web
>> Services – Part I: Basic Datatypes
>>
>>
>> Michael Poulin wrote:
>>     
>>> Dennis wrote: "But the data binding step needs to deal with mismatches
>>> between schema data types and structures and programming language data
>>> types and structures, and these mismatches can create problems for
>>> applications. .." and he talks about JAXB 2.0 but does not even mention
>>> XMLBeans, strange.
>>>       
>> XMLBeans is not really a data binding tool, instead implementing a data
>> binding facade over an XML store. This does mean that XMLBeans offers
>> considerably lower performance than normal data binding tools, and
>> programmers generally find the XMLBeans generated APIs more difficult to
>> use. But the issues discussed in this first article all apply equally
>> well to XMLBeans as to JAXB 2.0 - XMLBeans just uses its own GDate class
>> rather than an XMLGregorianCalendar, and also has its own variants for
>> other schema types (which in many cases just add another layer of
>> wrapper around existing Java types).
>>
>>     
>>> My approach to the problem mentioned above is quite the opposite - I
>>> believe that XML Schema, as the data quality controlling mechanism, has
>>> to dictate the format of the data in the programming language. I do
>>> not know if C# allows such control but Java certainly does. I have used
>>> it since the first announcement of XMLBeans by BEA, when JAXB walked 'in
>>> short bridges under the table'.
>>>       
>> That's certainly a valid approach, but the result is that you end up
>> with code which is not necessarily very usable from the application
>> standpoint.
>>
>>     
>>> I still think that XMLBeans are better than JAXB, at least, due to
>>> full support of the XML Schema, but I did not check it out recently.
>>>       
>> I could make points in favor of each. Performance and ease of use aside,
>> I think XMLBeans is far too lenient about letting you generate invalid
>> XML - if you don't set a required value, XMLBeans happily spits out XML
>> without that element or attribute present. Of course you can turn on
>> validation so that it checks the output, but that has a very substantial
>> impact on performance. I believe frameworks which automatically report
>> an error when you try to marshal XML with missing required components
>> are safer.
>>
>>     
>>> So, instead of screwing XML Schema to satisfy clumsy Java code, I do
>>> enforce quality of transmitted data onto the receiver and allow it to
>>> deal with its own data quality problems. I do not mean such
>>> irresponsible constructs in the Schema like 'any'; it does not control
>>> data quality. However, with a few exceptions, I use XML Schema to
>>> generate Java objects and to use the latter on the sender and receiver
>>> sides of the Web Service communication (actually, I use only
>>> document/literal style).
>>>       
>> So you'd tell people to just go ahead and use whatever schema type they
>> fancy (e.g., nonPositiveInteger) without regard to the usefulness of the
>> type or the impact this has on generated code? Seems silly to me if you
>> know in advance that the values can be handled with a
>> double/float/int/long representation, but to each their own.
>>
>> Out of curiosity, how do you deal with the lack of any distinction
>> between completely- vs. incompletely-specified schema date/time values
>> mentioned in the article?
>>
>> - Dennis
>>
>>     
>>> - Michael
>>>
>>> ------------------------------------------------------------------
>>> *From:* Gervas Douglas <gervas.douglas@gmail.com>
>>> *To:* service-orientated-architecture@yahoogroups.com
>>> *Sent:* Sunday, January 18, 2009 9:01:54 PM
>>> *Subject:* [service-orientated-architecture] Dennis on Schema for Web
>>> Services – Part I: Basic Datatypes
>>>
>>> *You can view the following article at:
>>>
>>> http://www.infoq.com/articles/schema-for-ws-part1;jsessionid=A4FA64435750D836AA32113976421FFE
>>>
>>> Gervas*
>>>
>>> <<XML message exchange is the basis of most varieties of web services,
>>> including both SOAP and REST approaches. The use of XML creates some
>>> drawbacks, including potential issues with performance, but it also
>>> provides a level of abstraction which allows for loose coupling
>>> between the parties involved in an exchange. In order for that loose
>>> coupling to really work, though, you need to be able to define the
>>> structure of XML documents being exchanged in a way which allows
>>> verification of correct documents. The W3C's XML Schema definition
>>> language (which will be referred to as just "schema" for the rest of
>>> this article) is the approach most widely used for these message
>>> structure definitions.
>>>
>>> Most web service applications don't work with XML documents directly,
>>> instead going through a data binding conversion layer within a web
>>> service toolkit. This is convenient for application developers, since
>>> it means they can work directly with data structures in their
>>> programming language of choice. But the data binding step needs to
>>> deal with mismatches between schema data types and structures and
>>> programming language data types and structures, and these mismatches
>>> can create problems for applications. If you want your web services to
>>> provide consistent, cross-platform compatibility (which is generally
>>> the whole point of using web services in the first place), you need to
>>> design your schema definitions to avoid potential problem areas - or
>>> at least be aware of the risks involved in using problematic schema
>>> features.
>>>
>>> In this series of articles we're going to look at various types of
>>> problems that arise from the mismatch between schema and web service
>>> data bindings. For this first article we'll start at the most basic
>>> level, looking at simple data types and the problems they create.
>>>
>>>
>>>     Representing Numbers
>>>
>>> Numeric values are about as basic as you can get when it comes to
>>> business data. Given the importance of numbers, you might think that
>>> this would be an area where schema worked smoothly and consistently.
>>> And in an abstract sense, it really does - but when schema gets
>>> applied by web services toolkits you can still run into a multitude of
>>> problems.
>>>
>>> Part of the issue is the sheer variety of built-in schema numeric
>>> datatypes. Figure 1 shows the portions of the schema datatype tree
>>> involved in this area. To understand it, think in terms of
>>> specialization - the further you move down one of the branches of the
>>> upside-down tree, the more specialized the data that is represented by
>>> a type. At the top layer, directly under the generic anySimpleType,
>>> are the three basic numeric types float, decimal, and double. float
>>> and double are terminal types, matching the IEEE standard for floating
>>> point numbers, and as such provide excellent interoperability across
>>> web services platforms: Every major programming language supports
>>> 32-bit floating point numbers matching the float specification and
>>> 64-bit floating point numbers matching the double schema
>>> specification, so web services toolkits can just map these directly to
>>> the native language types. There may be minor differences between the
>>> programming language text representations of special values
>>> (not-a-number, positive and negative infinity, and positive and
>>> negative zero) and those used by schema, but the toolkits can easily
>>> handle translation.
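
To make the special-value point concrete, here is a minimal Java sketch of the kind of translation a toolkit has to do for xs:double. Schema writes the infinities as "INF" and "-INF", while Java's Double.parseDouble expects "Infinity" and "-Infinity". The class name XsdDouble is hypothetical, and the sketch does not validate its input (Double.parseDouble accepts some forms, such as hex floats, that schema does not allow).

```java
/** Sketch: translate xs:double lexical forms to and from Java doubles. */
final class XsdDouble {
    private XsdDouble() {}

    /** Parses an xs:double lexical value, mapping the schema spellings of special values. */
    static double parse(String lexical) {
        String s = lexical.trim();
        switch (s) {
            case "INF":
                return Double.POSITIVE_INFINITY;
            case "-INF":
                return Double.NEGATIVE_INFINITY;
            case "NaN":
                return Double.NaN;
            default:
                // Ordinary values such as "1.5E3" share the same syntax in both worlds.
                return Double.parseDouble(s);
        }
    }

    /** Formats a Java double using the schema spellings of special values. */
    static String format(double value) {
        if (Double.isNaN(value)) {
            return "NaN";
        }
        if (value == Double.POSITIVE_INFINITY) {
            return "INF";
        }
        if (value == Double.NEGATIVE_INFINITY) {
            return "-INF";
        }
        return Double.toString(value);
    }

    public static void main(String[] args) {
        System.out.println(parse("-INF"));                    // prints -Infinity
        System.out.println(format(Double.POSITIVE_INFINITY)); // prints INF
    }
}
```

The mapping is mechanical, which is why these two types cause so little trouble in practice.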
>>>
>>> /*Figure 1. Schema numeric types*/
>>>
>>> It's when you go down the *decimal* branch of the tree that you start
>>> running into problems. decimal itself is defined as a string of any
>>> number of *decimal* digits, with an optional leading sign and optional
>>> decimal point. *integer*, the direct descendant of *decimal*, matches
>>> a subset of the values corresponding to *decimal* in that it allows
>>> any number of decimal digits, with an optional leading sign, but does
>>> not allow a decimal point. The descendants of *integer* further
>>> restrict the allowed values, in the case of *nonPositiveInteger* and
>>> *nonNegativeInteger* by prohibiting values respectively greater than
>>> or less than zero, and in the case of *long* by limiting the range of
>>> values to a 64-bit 2s-complement equivalent. *int, short*, and *byte*
>>> further restrict the range, to 32-bit, 16-bit, and 8-bit 2s-complement
>>> respectively, while the *unsigned* variations match unsigned values of
>>> the same number of bits.
>>>
>>> All major programming languages support values matching the *long,
>>> int*, and *short* schema types along the main branch of the tree, but
>>> the other variations create potential problems. Java, for instance,
>>> doesn't include primitive types corresponding to *unsignedLong* or
>>> *unsignedInt*. Java web services frameworks generally work around this
>>> lack of language support by using special classes rather than
>>> primitives for these types, but this makes the web service interface
>>> somewhat awkward and can create performance issues (since primitives
>>> are generally much faster than object types when used in calculations).
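
As an illustration of why a workaround is needed at all, the sketch below (plain Java, with a hypothetical class name UnsignedValues) widens xs:unsignedInt to a long and falls back to java.math.BigInteger for xs:unsignedLong, since neither range fits the signed primitive of the same width. Real toolkits use their own wrapper classes; this is only an assumption about one reasonable way to hold the values.

```java
import java.math.BigInteger;

/** Sketch: holding unsigned schema values in Java, which has no unsigned primitives. */
final class UnsignedValues {
    /** Largest xs:unsignedInt value, 2^32 - 1 = 4294967295. */
    static final long UNSIGNED_INT_MAX = 0xFFFF_FFFFL;

    private UnsignedValues() {}

    /** Parses an xs:unsignedInt into a long, because the top half of the range overflows int. */
    static long parseUnsignedInt(String lexical) {
        long value = Long.parseLong(lexical.trim());
        if (value < 0 || value > UNSIGNED_INT_MAX) {
            throw new NumberFormatException("not an xs:unsignedInt: " + lexical);
        }
        return value;
    }

    /** Parses an xs:unsignedLong into a BigInteger, because the top half overflows long. */
    static BigInteger parseUnsignedLong(String lexical) {
        BigInteger value = new BigInteger(lexical.trim());
        if (value.signum() < 0 || value.bitLength() > 64) {
            throw new NumberFormatException("not an xs:unsignedLong: " + lexical);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parseUnsignedInt("4294967295") > Integer.MAX_VALUE); // prints true
        System.out.println(parseUnsignedLong("18446744073709551615"));
    }
}
```

Either way the caller ends up with a type that is wider or slower than the schema author probably had in mind.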
>>>
>>>
>>>
>>> Even the *decimal* and *integer* types present problems. Most Java
>>> toolkits handle these using the standard /java.math.BigDecimal and
>>> java.math.BigInteger/ classes, which suffer from poor performance but
>>> support values of unlimited size. .Net instead uses a fixed-size
>>> 128-bit representation, which limits the possible value range (as
>>> allowed by the schema specification) but provides relatively good
>>> performance.
>>>
>>> The schema numeric types are confusing and inconsistent (why a
>>> *nonPositiveInteger* type, but no *nonPositiveDecimal* type, for
>>> instance?), and generally just represent syntactic sugar in any case
>>> (since the ranges can instead be implemented using simpleType
>>> restriction). For these reasons it's best to avoid using most of these
>>> types in your schema definitions, especially those intended for use
>>> with web services. Use specific sized types (*double* and *float* for
>>> real numbers, and *long* and *int* for integers) where possible, since
>>> these translate consistently to programming language primitive types.
>>> If you need to work with values beyond the range or precision possible
>>> with these sized types, understand that *decimal* and *integer* will
>>> not necessarily give you what you want due to implementation
>>> differences, and instead consider using a string and handling the
>>> conversion of the value in application code.
>>>
>>>
>>>       The Issues of Time
>>>
>>> Time-related values are another common source of problems in working
>>> with schema. Nine separate time-related datatypes are defined by
>>> schema, all based on a particular version of the Western Gregorian
>>> calendar. Unlike the numeric types, the time-related types aren't in
>>> any direct form of specialization relationship - instead, they're all
>>> considered as derived directly from the generic *anySimpleType*.
>>>
>>> The most widely-used time datatypes are *dateTime, date*, and *time*.
>>> These three datatypes share a common representation format, with
>>> *dateTime* as the general case. Here's a sample *dateTime* value, for
>>> the current time as I write this article: "2008-09-08T15:38:53". A
>>> *date* value uses the same representation as a *dateTime*, but strips
>>> off the 'T' and the hour-minute-second values that follow (leaving
>>> "2008-09-08", in this case); a *time* value, conversely, strips off
>>> everything up to and including the 'T', keeping only the
>>> hour-minute-second values ("15:38:53").
>>>
>>> Seems pretty simple so far, right? Where it gets confusing is in the
>>> actual interpretation of one of these values. Dates and times vary
>>> depending on where you're located, with the variation normally
>>> expressed in terms of time zones. For instance, as I write this
>>> article in New Zealand I'm 12 hours ahead of Universal time and 19
>>> hours ahead of the Pacific Daylight Time currently in effect for the
>>> West coast of the U.S. At the same instant I wrote my sample
>>> *dateTime* value here as "2008-09-08T15:38:53", the time in Seattle
>>> was "2008-09-07T20:38:53".
>>>
>>> For many applications you need to specify date/times in a manner which
>>> permits relating one value to another. Schema supports this
>>> requirement by allowing date/time values to use an appended time zone
>>> indication. This time zone indication can either take the form of the
>>> letter 'Z', used to indicate a Universal time (UTC) date/time value,
>>> or an offset from Universal time in hours and minutes. So any of these
>>> *dateTime* values (and many more variations) could be used to
>>> indicate the same instant: "2008-09-08T15:38:53+12:00",
>>> "2008-09-07T20:38:53-07:00", or "2008-09-08T03:38:53Z".
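
A quick way to check that zoned values really name the same instant is to parse them and compare. The sketch below assumes Java 8 or later (java.time) and a hypothetical class name SameInstant; it uses -07:00 for the Seattle value, which is the offset of Pacific Daylight Time and matches the 19-hour difference from New Zealand described above.

```java
import java.time.Instant;
import java.time.OffsetDateTime;

/** Sketch: zoned xs:dateTime values written with different offsets can name one instant. */
final class SameInstant {
    private SameInstant() {}

    /** Converts a zoned xs:dateTime lexical value to an absolute instant. */
    static Instant toInstant(String zonedDateTime) {
        return OffsetDateTime.parse(zonedDateTime).toInstant();
    }

    public static void main(String[] args) {
        Instant newZealand = toInstant("2008-09-08T15:38:53+12:00");
        Instant seattle = toInstant("2008-09-07T20:38:53-07:00");
        Instant utc = toInstant("2008-09-08T03:38:53Z");
        System.out.println(newZealand.equals(seattle) && newZealand.equals(utc)); // prints true
    }
}
```

OffsetDateTime.parse rejects a value with no offset at all, which is exactly the case the schema types leave open.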
>>>
>>> But schema doesn't /require/ that you specify a time zone indication,
>>> and without such an indication a date/time value can only be
>>> interpreted as being accurate for some arbitrary location which could
>>> be anywhere in the world. For some applications that may be just what
>>> you want - a person's birth date, for instance, is usually treated as
>>> a particular date without reference to location, and people likewise
>>> celebrate the Gregorian New Year as it occurs locally around the world
>>> - but for other applications it creates major issues. Consider the
>>> case of a conference call, for instance, where all the parties
>>> involved need to coordinate the time of the event to their local clocks.
>>>
>>> Unfortunately, schema does not allow you to distinguish between the
>>> cases where a fully-specified date/time is needed and those where a
>>> zoneless value is allowed or even expected (at least not in a way
>>> which web services toolkits can interpret - you could do this by using
>>> *simpleType* restriction patterns, but patterns are generally ignored
>>> by the toolkits). So the ambiguity of schema on this point means that
>>> toolkits need to handle values both with and without time zone
>>> indications.
>>>
>>> The need to handle both types of values creates some major headaches
>>> in terms of interpretation, especially since programming languages
>>> generally implement date/time handling based on absolute time values.
>>> There's just no way to correctly convert a schema value which is
>>> missing a time zone indication to an absolute time. Of course, that
>>> doesn't stop toolkits from doing something with such values, anyway.
>>> In most cases they convert the value as supplied by assuming it's
>>> given in terms of the local time zone, and that's often what you want
>>> - but when it's not, the resulting problems can be very difficult to
>>> isolate.
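
The ambiguity is easy to demonstrate. In this sketch (Java 8 or later, hypothetical class name ZonelessDateTime, fixed offsets standing in for "the local time zone" of two machines) the same zoneless text is converted the way a toolkit typically would, and the two results are 19 hours apart.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

/** Sketch: a zoneless xs:dateTime only becomes an instant once an offset is assumed. */
final class ZonelessDateTime {
    private ZonelessDateTime() {}

    /** Interprets a zoneless value as if it were given in the assumed offset. */
    static Instant interpret(String zoneless, ZoneOffset assumed) {
        return LocalDateTime.parse(zoneless).toInstant(assumed);
    }

    public static void main(String[] args) {
        String value = "2008-09-08T15:38:53"; // no time zone indication
        Instant onNewZealandServer = interpret(value, ZoneOffset.ofHours(12));
        Instant onSeattleServer = interpret(value, ZoneOffset.ofHours(-7));
        // Same text, two different instants.
        System.out.println(Duration.between(onNewZealandServer, onSeattleServer).toHours()); // prints 19
    }
}
```

Neither result is wrong in itself; the document simply never said which one was meant.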
>>>
>>> Problems due to time zones are especially messy for the *date* type.
>>> Most often, people treat dates as a fixed slot on the calendar. When
>>> you sign a legal document, for instance, you'll generally fill in the
>>> date of your signature. If you agree to a new project, there'll
>>> usually be a scheduled completion date (fanciful as these scheduled
>>> dates may sometimes be). And if you're asked to show your driver's
>>> license for proof of age when making a purchase, the clerk will look
>>> at your birth date and compare it with an age cutoff. In all these
>>> cases the date is treated as having day resolution, and differences
>>> between timezones are normally ignored. But the schema date type uses
>>> an associated time zone indication, just like the dateTime and time
>>> types. This use of a time zone indication creates a disconnect between
>>> the schema date type and the common form of a date. Generally this
>>> gets handled by converting dates to the 00:00 (midnight, as the start
>>> of the day) time representing the start of that day in whatever
>>> timezone was specified. But if you then print out that date value
>>> using the local timezone, you may find it's different from what was
>>> originally specified in the document.
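
Here is a small sketch of that date shift, assuming Java 8 or later and a hypothetical class name DateShift. It follows the conversion described above (midnight at the start of the day in the specified zone) and then asks what calendar date a reader at a different offset would see.

```java
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;

/** Sketch: a zoned xs:date can print as a different day once converted to an instant. */
final class DateShift {
    private DateShift() {}

    /** Returns the calendar date a viewer at the given offset sees for a zoned xs:date. */
    static LocalDate dateSeenAt(String zonedDate, ZoneOffset viewer) {
        TemporalAccessor parsed = DateTimeFormatter.ISO_OFFSET_DATE.parse(zonedDate);
        // Midnight at the start of the day, in the zone given in the document.
        OffsetDateTime startOfDay =
                LocalDate.from(parsed).atStartOfDay().atOffset(ZoneOffset.from(parsed));
        // The same instant, re-expressed at the viewer's offset.
        return startOfDay.withOffsetSameInstant(viewer).toLocalDate();
    }

    public static void main(String[] args) {
        System.out.println(dateSeenAt("2008-09-08+12:00", ZoneOffset.ofHours(-7))); // prints 2008-09-07
    }
}
```

A date written in New Zealand comes out a day early on the U.S. West coast, even though nobody changed the value.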
>>>
>>> If schema defined separate types for date/time values with time zone
>>> specifications and those without it'd be easy for applications to pick
>>> which type they wanted to use. Without this ability, it's difficult
>>> for toolkits to work around a basically flawed representation of
>>> date/time values in schema. Java's JAXB 2.0 takes what is probably the
>>> most comprehensive approach to the problem, handling all the schema
>>> date/time types with a special class
>>> (|javax.xml.datatype.XMLGregorianCalendar|) which corresponds directly
>>> to schema representations. This approach preserves all the nuances of
>>> schema representations of values, but at the cost of passing the
>>> interpretation issues on to developers. Other toolkits generally just
>>> use defaults, such as assuming the local timezone.
>>>
>>> Given the nasty issues lurking in this area, the best general approach
>>> is probably to only use the schema date/time types for values which
>>> should be fully-specified with time zone indications, and to make sure
>>> that any documents you generate do include time zone indications. Most
>>> web services toolkits will generate the time zone indications for you
>>> on output automatically, so this last part is easy. Requiring that
>>> your input documents also use time zone indications can be more
>>> difficult, especially since documents may be going through several
>>> stages of processing. If you want to be certain you don't run into
>>> problems caused by mistaken conversion assumptions your best solution
>>> is probably to use a string type in the schema representation, so that
>>> your web service toolkit will pass the value on to your application
>>> code without trying to interpret the value.
>>>
>>> If you need zoneless date/time values (as for the birth date example),
>>> your best approach may again be to use a *string* type in the schema
>>> representation. That's not very satisfying from the standpoint of
>>> providing an accurate representation of the data in the schema, but
>>> avoids the issues with web services toolkits interpreting unzoned
>>> values as being in the local timezone.
>>>
>>>
>>>     References
>>>
>>> Data structures used internally by applications often contain multiple
>>> linkages between components, including cross-references and indirect
>>> associations. XML, on the other hand, is inherently tree-structured.
>>> It's very easy to represent one-to-many relationships in XML through
>>> containment, but any other type of relationship is problematic. Even
>>> one-to-many relationships can be inefficient. Consider the case of a
>>> document listing a customer's order history, for instance. Each order
>>> will have associated billing and shipping addresses, but these
>>> addresses are often going to be repeated from one order to the next.
>>> If you just embed the addresses inside the information for each order,
>>> you'll end up with a lot of redundant information in your documents.
>>>
>>> References can be used to get around the limitations of XML's tree
>>> structure. The idea of a reference is that you define something once
>>> in an XML document, including a unique identifier. Any time other data
>>> needs to make use of that definition, you create a reference using the
>>> unique identifier.
>>>
>>> Schema directly supports two forms of references. The first, using the
>>> ID type, defines element identifiers which can be linked from anywhere
>>> in the document by using the IDREF or IDREFS types. The nice part of
>>> ID/IDREF links is that they're simple - identifiers are just names,
>>> and any type of element can define an ID value in the schema. The
>>> downside of ID/IDREF links is that they use a global context, so
>>> there's no way to say that the value used for a particular IDREF must
>>> be defined on a particular element type, and the names used as ID
>>> values must be unique within a document (even across types of
>>> elements). Some web service toolkits support using ID/IDREF links to
>>> represent references within data structures (including JAX-WS/JAXB
>>> 2.0, and Apache Axis2 when used with JiBX data binding); other
>>> toolkits (such as .Net, and Axis2 used with ADB) do not, instead
>>> treating IDREF values as simple text strings.
>>>
>>> The second form of reference supported by schema is the key/keyref link.
>>> While ID/IDREF links are defined using datatypes, key/keyref links are
>>> instead part of the structure of a schema definition. This difference
>>> allows key/keyref links to be much more expressive than ID/IDREF
>>> links, including defining contexts within which key values are unique.
>>> But because key/keyref links are designed more for purposes of
>>> document validation than for structuring, they are complex and not
>>> generally used by data binding frameworks which convert XML data to
>>> and from data structures.
>>>
>>> So if you want to embed linkages within your XML documents and have
>>> them handled by web services toolkits, your only hope is the ID/IDREF
>>> approach. Some toolkits will support these links directly; others will
>>> just treat the identifier values as strings, but you can write
>>> application code to cross-reference the identifier and reference
>>> values and build your own links.
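
For the do-it-yourself case, the cross-referencing code can be quite small. The sketch below uses the JDK's DOM parser and treats the identifiers as plain strings, as a toolkit without ID/IDREF support would. The element and attribute names (history, address, order, id, shipTo, city, number) and the class name ReferenceResolver are all hypothetical, loosely following the order-history example above.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Sketch: resolve ID/IDREF-style links by hand when identifiers arrive as plain strings. */
final class ReferenceResolver {
    private ReferenceResolver() {}

    /** Maps each order number to the city of the address its shipTo attribute refers to. */
    static Map<String, String> shippingCities(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }

        // Pass 1: index every address by its identifier, rejecting duplicates.
        Map<String, Element> addressesById = new LinkedHashMap<>();
        NodeList addresses = doc.getElementsByTagName("address");
        for (int i = 0; i < addresses.getLength(); i++) {
            Element address = (Element) addresses.item(i);
            String id = address.getAttribute("id");
            if (addressesById.put(id, address) != null) {
                throw new IllegalArgumentException("duplicate id: " + id);
            }
        }

        // Pass 2: follow each order's reference to the shared address.
        Map<String, String> cities = new LinkedHashMap<>();
        NodeList orders = doc.getElementsByTagName("order");
        for (int i = 0; i < orders.getLength(); i++) {
            Element order = (Element) orders.item(i);
            String ref = order.getAttribute("shipTo");
            Element target = addressesById.get(ref);
            if (target == null) {
                throw new IllegalArgumentException("dangling reference: " + ref);
            }
            cities.put(order.getAttribute("number"), target.getAttribute("city"));
        }
        return cities;
    }

    public static void main(String[] args) {
        String xml = "<history>"
                + "<address id=\"a1\" city=\"Seattle\"/>"
                + "<order number=\"1\" shipTo=\"a1\"/>"
                + "<order number=\"2\" shipTo=\"a1\"/>"
                + "</history>";
        System.out.println(shippingCities(xml)); // prints {1=Seattle, 2=Seattle}
    }
}
```

Both orders end up pointing at the one address definition, which is the redundancy saving that references are meant to provide.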
>>>
>>>
>>>     Conclusion
>>>
>>> In this article we've looked at some of the problems that arise when
>>> using the most common schema datatypes in web services. There are many
>>> other specialized schema datatypes beyond those mentioned in this
>>> article (a total of 42!), and some of these present other issues. As a
>>> general principle, the best approach to take in your web service
>>> schema definitions is to avoid the use of overly-specialized types
>>> (except for the numeric types that match common programming language
>>> types), and use a string type when you want full control over the
>>> interpretation of values.
>>>
>>> It's worth pointing out that although some of the issues discussed in
>>> this article could be handled better by data binding frameworks, a lot
>>> of the problems lie with schema itself. In particular, the date/time
>>> family of types are at best cumbersome to work with and at worst
>>> invite errors through the lack of distinction between zoned and
>>> unzoned value types. It's possible to pass the confusion on to the
>>> user, as JAXB does with the XMLGregorianCalendar type, but that's not
>>> really a solution.>>
>>>
>>>
>>>
>>>       
>>
>>
>>     
>
> ------------------------------------
>
> Yahoo! Groups Links
>
>
>
>
>   

