Re: [service-orientated-architecture] Dennis on Schema for Web Services – Part I: Basic Datatypes

Anne Thomas Manes Mon, 19 Jan 2009 08:26:33 -0800

Great article, Dennis. I agree with your recommendation that schema
designers should try to avoid using XSD data types that don't match up
well with Java and .NET datatypes (and vice versa). But I have another
recommendation: if the situation allows it (i.e., if the complexity of
the process doesn't require complex object graphs and algorithms) just
don't use a compiled OO programming language and avoid the whole
XML/object impedance mismatch. Use XQuery or a scripting language
instead.


I know. It's a radical idea.

Anne

On 1/19/09, Michael Poulin <[email protected]> wrote:
> Dennis asks a few questions, but let me answer only some of them now.
>
> In my experience, XMLBeans outperformed JAXB by several times at the
> beginning and still performs better (now, depending on the schema
> complexity). I use validation all the time - this is the purpose of having
> Schema as a controller of data quality. XMLBeans does not allow you to
> generate invalid XML because it keeps the XML in parallel with the Java
> code, validating any and every data modification within Java. Yes, this
> takes additional time, but you know about it up front and can mitigate the
> problem with the right design.
>
> I have a couple of cases where XMLBeans saved my ... when it caught wrong
> data transformations in the code and in the Web Service messages. I think
> Dennis and I just use the same things for different purposes.
>
> I look at XML Schema (and XMLBeans/JAXB) much more widely than just for
> data binding. In particular, if you have a Java String and a VARCHAR in the
> database for the same data (string), how do you control the length of that
> Java String? So, if I can use the XML Schema mechanism as a universal data
> quality controller - for Web communication and databases - I put its
> importance and priority above the data in the code, and require making this
> code usable from the application standpoint by design. The worst example of
> binding I know of is the open-source Digester, which allows converting XML
> into Java with no controls whatsoever; this results in garbage in Java data
> and code.
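A minimal sketch of the length-control point above, assuming a hypothetical customerName element capped at 10 characters to mirror a VARCHAR(10) column. The element name, the limit, and the LengthCheck class are illustrative only, and the code uses the JDK's standard javax.xml.validation API rather than XMLBeans or JAXB:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class LengthCheck {
    // Hypothetical schema: a string element limited to 10 characters.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='customerName'>"
      + "<xs:simpleType><xs:restriction base='xs:string'>"
      + "<xs:maxLength value='10'/>"
      + "</xs:restriction></xs:simpleType>"
      + "</xs:element></xs:schema>";

    /** Compiles the schema once; a broken schema is a programming error here. */
    static Schema compile() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema did not compile", e);
        }
    }

    /** Returns true if the document satisfies the schema, false if it violates it. */
    static boolean isValid(String xml) {
        try {
            compile().newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // schema violation or malformed XML
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<customerName>Dennis</customerName>"));
        System.out.println(isValid("<customerName>MuchTooLongForTheColumn</customerName>"));
    }
}
```

The same length limit then lives in one place, the schema, instead of being re-checked by hand wherever the Java String is set.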
>
> XML Schema in my use is the most important element of data formats in Java
> code (the exception, obviously, is legacy code); it goes through the
> toughest verifications and reviews itself. That is, "just go ahead and use
> whatever schema type they fancy" has never happened to me. Because of the
> role of Schema mentioned above, I always consider it in parallel with the
> application code, not one after the other, i.e. I try to keep a balance
> between simplicity of the code, quality of data and performance-oriented
> (compensating) design.
>
> - Michael
>
>
>
>
> ________________________________
> From: Dennis Sosnoski <[email protected]>
> To: [email protected]
> Sent: Monday, January 19, 2009 5:16:39 AM
> Subject: Re: [service-orientated-architecture] Dennis on Schema for Web
> Services – Part I: Basic Datatypes
>
>
> Michael Poulin wrote:
>> Dennis wrote: "But the data binding step needs to deal with mismatches
>> between schema data types and structures and programming language data
>> types and structures, and these mismatches can create problems for
>> applications..." and he talks about JAXB 2.0 but does not even mention
>> XMLBeans, strange.
>
> XMLBeans is not really a data binding tool, instead implementing a data
> binding facade over an XML store. This does mean that XMLBeans offers
> considerably lower performance than normal data binding tools, and
> programmers generally find the XMLBeans generated APIs more difficult to
> use. But the issues discussed in this first article all apply equally
> well to XMLBeans as to JAXB 2.0 - XMLBeans just uses its own GDate class
> rather than an XMLGregorianCalendar, and also has its own variants for
> other schema types (which in many cases just add another layer of
> wrapper around existing Java types).
>
>>
>> My approach to the problem mentioned above is just the opposite - I
>> believe that XML Schema, as the data quality controlling mechanism, has
>> to dictate the format of the data in the programming language. I do
>> not know if C# allows such control but Java certainly does. I have used
>> it since the first announcement of XMLBeans by BEA, when JAXB walked 'in
>> short bridges under the table'.
>
> That's certainly a valid approach, but the result is that you end up
> with code which is not necessarily very usable from the application
> standpoint.
>
>>
>> I still think that XMLBeans is better than JAXB, at least due to its
>> full support of XML Schema, but I have not checked it out recently.
>
> I could make points in favor of each. Performance and ease of use aside,
> I think XMLBeans is far too lenient about letting you generate invalid
> XML - if you don't set a required value, XMLBeans happily spits out XML
> without that element or attribute present. Of course you can turn on
> validation so that it checks the output, but that has a very substantial
> impact on performance. I believe frameworks which automatically report
> an error when you try to marshal XML with missing required components
> are safer.
>
>>
>> So, instead of screwing XML Schema to satisfy clumsy Java code, I
>> enforce quality of transmitted data onto the receiver and allow it to
>> deal with its own data quality problems. I do not mean such
>> irresponsible constructs in the Schema as 'any'; it does not control
>> data quality. However, with a few exceptions, I use XML Schema to
>> generate Java objects and use the latter on the sender and receiver
>> sides of the Web Service communication (actually, I use only
>> document/literal style).
>
> So you'd tell people to just go ahead and use whatever schema type they
> fancy (e.g., nonPositiveInteger) without regard to the usefulness of the
> type or the impact this has on generated code? Seems silly to me if you
> know in advance that the values can be handled with a
> double/float/int/long representation, but to each their own.
>
> Out of curiosity, how do you deal with the lack of any distinction
> between completely- vs. incompletely-specified schema date/time values
> mentioned in the article?
>
> - Dennis
>
>>
>> - Michael
>>
>> ------------ --------- --------- --------- --------- --------- -
>> *From:* Gervas Douglas <gervas.douglas@gmail.com>
>> *To:* service-orientated-architecture@yahoogroups.com
>> *Sent:* Sunday, January 18, 2009 9:01:54 PM
>> *Subject:* [service-orientated-architecture] Dennis on Schema for Web
>> Services – Part I: Basic Datatypes
>>
>> *You can view the following article at:
>>
>> http://www.infoq.com/articles/schema-for-ws-part1;jsessionid=A4FA64435750D836AA32113976421FFE
>>
>> Gervas*
>>
>> <<XML message exchange is the basis of most varieties of web services,
>> including both SOAP and REST approaches. The use of XML creates some
>> drawbacks, including potential issues with performance, but it also
>> provides a level of abstraction which allows for loose coupling
>> between the parties involved in an exchange. In order for that loose
>> coupling to really work, though, you need to be able to define the
>> structure of XML documents being exchanged in a way which allows
>> verification of correct documents. The W3C's XML Schema definition
>> language (which will be referred to as just "schema" for the rest of
>> this article) is the approach most widely used for these message
>> structure definitions.
>>
>> Most web service applications don't work with XML documents directly,
>> instead going through a data binding conversion layer within a web
>> service toolkit. This is convenient for application developers, since
>> it means they can work directly with data structures in their
>> programming language of choice. But the data binding step needs to
>> deal with mismatches between schema data types and structures and
>> programming language data types and structures, and these mismatches
>> can create problems for applications. If you want your web services to
>> provide consistent, cross-platform compatibility (which is generally
>> the whole point of using web services in the first place), you need to
>> design your schema definitions to avoid potential problem areas - or
>> at least be aware of the risks involved in using problematic schema
>> features.
>>
>> In this series of articles we're going to look at various types of
>> problems that arise from the mismatch between schema and web service
>> data bindings. For this first article we'll start at the most basic
>> level, looking at simple data types and the problems they create.
>>
>>
>>     Representing Numbers
>>
>> Numeric values are about as basic as you can get when it comes to
>> business data. Given the importance of numbers, you might think that
>> this would be an area where schema worked smoothly and consistently.
>> And in an abstract sense, it really does - but when schema gets
>> applied by web services toolkits you can still run into a multitude of
>> problems.
>>
>> Part of the issue is the sheer variety of built-in schema numeric
>> datatypes. Figure 1 shows the portions of the schema datatype tree
>> involved in this area. To understand it, think in terms of
>> specialization - the further you move down one of the branches of the
>> upside-down tree, the more specialized the data that is represented by
>> a type. At the top layer, directly under the generic anySimpleType,
>> are the three basic numeric types float, decimal, and double. float
>> and double are terminal types, matching the IEEE standard for floating
>> point numbers, and as such provide excellent interoperability across
>> web services platforms: Every major programming language supports
>> 32-bit floating point numbers matching the float specification and
>> 64-bit floating point numbers matching the double schema
>> specification, so web services toolkits can just map these directly to
>> the native language types. There may be minor differences between the
>> programming language text representations of special values
>> (not-a-number, positive and negative infinity, and positive and
>> negative zero) and those used by schema, but the toolkits can easily
>> handle translation.
>>
>> /*Figure 1. Schema numeric types*/
>>
>> It's when you go down the *decimal* branch of the tree that you start
>> running into problems. decimal itself is defined as a string of any
>> number of *decimal* digits, with an optional leading sign and optional
>> decimal point. *integer*, the direct descendant of *decimal*, matches
>> a subset of the values corresponding to *decimal* in that it allows
>> any number of decimal digits, with an optional leading sign, but does
>> not allow a decimal point. The descendants of *integer* further
>> restrict the allowed values, in the case of *nonPositiveInteger* and
>> *nonNegativeInteger* by prohibiting values respectively greater than
>> or less than zero, and in the case of *long* by limiting the range of
>> values to a 64-bit 2s-complement equivalent. *int, short*, and *byte*
>> further restrict the range, to 32-bit, 16-bit, and 8-bit 2s-complement
>> respectively, while the *unsigned* variations match unsigned values of
>> the same number of bits.
>>
>> All major programming languages support values matching the *long,
>> int*, and *short* schema types along the main branch of the tree, but
>> the other variations create potential problems. Java, for instance,
>> doesn't include primitive types corresponding to *unsignedLong* or
>> *unsignedInt*. Java web services frameworks generally work around this
>> lack of language support by using special classes rather than
>> primitives for these types, but this makes the web service interface
>> somewhat awkward and can create performance issues (since primitives
>> are generally much faster than object types when used in calculations).
>>
>>
>>
>> Even the *decimal* and *integer* types present problems. Most Java
>> toolkits handle these using the standard /java.math.BigDecimal and
>> java.math.BigInteger/ classes, which suffer from poor performance but
>> support values of unlimited size. .Net instead uses a fixed-size
>> 128-bit representation, which limits the possible value range (as
>> allowed by the schema specification) but provides relatively good
>> performance.
>>
>> The schema numeric types are confusing and inconsistent (why a
>> *nonPositiveInteger* type, but no *nonPositiveDecimal* type, for
>> instance?), and generally just represent syntactic sugar in any case
>> (since the ranges can instead be implemented using simpleType
>> restriction). For these reasons it's best to avoid using most of these
>> types in your schema definitions, especially those intended for use
>> with web services. Use specific sized types (*double* and *float* for
>> real numbers, and *long* and *int* for integers) where possible, since
>> these translate consistently to programming language primitive types.
>> If you need to work with values beyond the range or precision possible
>> with these sized types, understand that *decimal* and *integer* will
>> not necessarily give you what you want due to implementation
>> differences, and instead consider using a string and handling the
>> conversion of the value in application code.
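A minimal sketch of the mapping problem described above, assuming the application receives the lexical (string) form of the value and converts it itself. The SchemaNumbers class and its method names are hypothetical; only java.math and java.lang calls are used:

```java
import java.math.BigDecimal;
import java.math.BigInteger;

class SchemaNumbers {
    // Upper bound of xs:unsignedInt (2^32 - 1), which does not fit in a Java int.
    static final long UNSIGNED_INT_MAX = 0xFFFFFFFFL;

    /** Parses an xs:unsignedInt lexical value into a long, rejecting out-of-range input. */
    static long parseUnsignedInt(String text) {
        long value = Long.parseLong(text.trim());
        if (value < 0 || value > UNSIGNED_INT_MAX) {
            throw new NumberFormatException("not an unsignedInt: " + text);
        }
        return value;
    }

    /** Returns true if an xs:integer lexical value fits in the xs:long (Java long) range. */
    static boolean fitsInLong(String text) {
        BigInteger value = new BigInteger(text.trim());
        // bitLength() excludes the sign bit, so 63 bits is the limit for a signed 64-bit long.
        return value.bitLength() <= 63;
    }

    public static void main(String[] args) {
        System.out.println(parseUnsignedInt("4294967295"));
        System.out.println(fitsInLong("9223372036854775807"));
        System.out.println(fitsInLong("9223372036854775808"));
        // Arbitrary-precision decimal arithmetic on a value carried as a string.
        System.out.println(new BigDecimal("0.10").add(new BigDecimal("0.20")));
    }
}
```

Widening unsignedInt to a long keeps the value in a primitive, while the BigInteger check shows where an unbounded integer stops fitting a sized type.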
>>
>>
>>       The Issues of Time
>>
>> Time-related values are another common source of problems in working
>> with schema. Nine separate time-related datatypes are defined by
>> schema, all based on a particular version of the Western Gregorian
>> calendar. Unlike the numeric types, the time-related types aren't in
>> any direct form of specialization relationship - instead, they're all
>> considered as derived directly from the generic *anySimpleType*.
>>
>> The most widely-used time datatypes are *dateTime, date*, and *time*.
>> These three datatypes share a common representation format, with
>> *dateTime* as the general case. Here's a sample *dateTime* value, for
>> the current time as I write this article: "2008-09-08T15:38:53". A
>> *date* value uses the same representation as a *dateTime*, but strips
>> off the 'T' and the hour-minute-second values that follow (leaving
>> "2008-09-08", in this case); a *time* value, conversely, strips off
>> everything up to and including the 'T', keeping only the
>> hour-minute-second values ("15:38:53").
>>
>> Seems pretty simple so far, right? Where it gets confusing is in the
>> actual interpretation of one of these values. Dates and times vary
>> depending on where you're located, with the variation normally
>> expressed in terms of time zones. For instance, as I write this
>> article in New Zealand I'm 12 hours ahead of Universal time and 19
>> hours ahead of the Pacific Daylight Time currently in effect for the
>> West coast of the U.S. At the same instant I wrote my sample
>> *dateTime* value here as "2008-09-08T15:38:53", the time in Seattle
>> was "2008-09-07T20:38:53".
>>
>> For many applications you need to specify date/times in a manner which
>> permits relating one value to another. Schema supports this
>> requirement by allowing date/time values to use an appended time zone
>> indication. This time zone indication can either take the form of the
>> letter 'Z', used to indicate a Universal time (UTC) value, or an
>> offset from Universal time in hours and minutes. So any of these
>> *dateTime* values (and many more variations) could be used to
>> indicate the same instant: "2008-09-08T15:38:53+12:00",
>> "2008-09-07T20:38:53-07:00", or "2008-09-08T03:38:53Z".
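A short sketch of the equivalence described above, using java.time (which postdates this thread, so it is an assumption about the reader's platform, not something the article relies on). The Seattle value uses -07:00, the offset of the Pacific Daylight Time mentioned earlier, 19 hours behind +12:00. The SameInstant class is hypothetical:

```java
import java.time.Instant;
import java.time.OffsetDateTime;

class SameInstant {
    /** Converts a zoned dateTime lexical value to an absolute instant. */
    static Instant toInstant(String dateTime) {
        // OffsetDateTime.parse accepts both "Z" and "+hh:mm" style offsets.
        return OffsetDateTime.parse(dateTime).toInstant();
    }

    public static void main(String[] args) {
        Instant newZealand = toInstant("2008-09-08T15:38:53+12:00");
        Instant seattle = toInstant("2008-09-07T20:38:53-07:00");
        Instant utc = toInstant("2008-09-08T03:38:53Z");
        System.out.println(newZealand.equals(seattle) && seattle.equals(utc)); // prints true
        System.out.println(utc); // prints 2008-09-08T03:38:53Z
    }
}
```

Note that java.time's ISO parser does not cover every lexical form schema allows, so this is a check of the arithmetic, not a full dateTime parser.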
>>
>> But schema doesn't /require/ that you specify a time zone indication,
>> and without such an indication a date/time value can only be
>> interpreted as being accurate for some arbitrary location which could
>> be anywhere in the world. For some applications that may be just what
>> you want - a person's birth date, for instance, is usually treated as
>> a particular date without reference to location, and people likewise
>> celebrate the Gregorian New Year as it occurs locally around the world
>> - but for other applications it creates major issues. Consider the
>> case of a conference call, for instance, where all the parties
>> involved need to coordinate the time of the event to their local clocks.
>>
>> Unfortunately, schema does not allow you to distinguish between the
>> cases where a fully-specified date/time is needed and those where a
>> zoneless value is allowed or even expected (at least not in a way
>> which web services toolkits can interpret - you could do this by using
>> *simpleType* restriction patterns, but patterns are generally ignored
>> by the toolkits). So the ambiguity of schema on this point means that
>> toolkits need to handle values both with and without time zone
>> indications.
>>
>> The need to handle both types of values creates some major headaches
>> in terms of interpretation, especially since programming languages
>> generally implement date/time handling based on absolute time values.
>> There's just no way to correctly convert a schema value which is
>> missing a time zone indication to an absolute time. Of course, that
>> doesn't stop toolkits from doing something with such values, anyway.
>> In most cases they convert the value as supplied by assuming it's
>> given in terms of the local time zone, and that's often what you want
>> - but when it's not, the resulting problems can be very difficult to
>> isolate.
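To illustrate why the local-zone assumption matters, here is a hedged sketch using java.time. The zone IDs Pacific/Auckland and America/Los_Angeles are assumptions standing in for the New Zealand and Seattle locations used above, and the ZonelessDateTime class is hypothetical:

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

class ZonelessDateTime {
    /** Interprets a zoneless dateTime value as if it were local time in the assumed zone. */
    static Instant interpret(String zoneless, ZoneId assumedZone) {
        return LocalDateTime.parse(zoneless).atZone(assumedZone).toInstant();
    }

    public static void main(String[] args) {
        String value = "2008-09-08T15:38:53"; // no time zone indication
        Instant asAuckland = interpret(value, ZoneId.of("Pacific/Auckland"));
        Instant asSeattle = interpret(value, ZoneId.of("America/Los_Angeles"));
        // The same text yields two different absolute times, depending on where it is read.
        System.out.println(Duration.between(asAuckland, asSeattle).toHours());
    }
}
```

A receiver that silently applies its own zone will therefore disagree with the sender by the full offset difference between the two machines.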
>>
>> Problems due to time zones are especially messy for the *date* type.
>> Most often, people treat dates as a fixed slot on the calendar. When
>> you sign a legal document, for instance, you'll generally fill in the
>> date of your signature. If you agree to a new project, there'll
>> usually be a scheduled completion date (fanciful as these scheduled
>> dates may sometimes be). And if you're asked to show your driver's
>> license for proof of age when making a purchase, the clerk will look
>> at your birth date and compare it with an age cutoff. In all these
>> cases the date is treated as having day resolution, and differences
>> between timezones are normally ignored. But the schema date type uses
>> an associated time zone indication, just like the dateTime and time
>> types. This use of a time zone indication creates a disconnect between
>> the schema date type and the common form of a date. Generally this
>> gets handled by converting dates to the 00:00 (midnight, as the start
>> of the day) time representing the start of that day in whatever
>> timezone was specified. But if you then print out that date value
>> using the local timezone, you may find it's different from what was
>> originally specified in the document.
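The date shift described above can be reproduced with a small sketch, assuming java.time and a hypothetical DateShift helper. It takes a date with its document offset, converts it to midnight at that offset, and reads the calendar date back in another zone:

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;

class DateShift {
    /** Returns the calendar date seen locally for midnight of the given date at the document's offset. */
    static LocalDate dateSeenIn(LocalDate date, ZoneOffset documentOffset, ZoneId localZone) {
        return date.atStartOfDay()               // 00:00 on the document's date
                .atOffset(documentOffset)        // pinned to the document's time zone
                .atZoneSameInstant(localZone)    // same instant, viewed locally
                .toLocalDate();
    }

    public static void main(String[] args) {
        LocalDate written = LocalDate.of(2008, 9, 8);
        LocalDate seen = dateSeenIn(written, ZoneOffset.ofHours(12),
                ZoneId.of("America/Los_Angeles"));
        System.out.println(written + " -> " + seen);
    }
}
```

Here a date written as 2008-09-08 at +12:00 comes back as the previous day when printed in the assumed America/Los_Angeles zone.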
>>
>> If schema defined separate types for date/time values with time zone
>> specifications and those without it'd be easy for applications to pick
>> which type they wanted to use. Without this ability, it's difficult
>> for toolkits to work around a basically flawed representation of
>> date/time values in schema. Java's JAXB 2.0 takes what is probably the
>> most comprehensive approach to the problem, handling all the schema
>> date/time types with a special class
>> (|javax.xml.datatype.XMLGregorianCalendar|) which corresponds directly
>> to schema representations. This approach preserves all the nuances of
>> schema representations of values, but at the cost of passing the
>> interpretation issues on to developers. Other toolkits generally just
>> use defaults, such as assuming the local timezone.
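As a sketch of what "passing the interpretation issues on to developers" looks like in practice, the following uses the JDK's javax.xml.datatype API directly, without a JAXB runtime. The ZoneCheck class and hasTimeZone method are hypothetical names:

```java
import javax.xml.datatype.DatatypeConfigurationException;
import javax.xml.datatype.DatatypeConstants;
import javax.xml.datatype.DatatypeFactory;
import javax.xml.datatype.XMLGregorianCalendar;

class ZoneCheck {
    /** Returns true if the schema date/time lexical value carries a time zone indication. */
    static boolean hasTimeZone(String lexical) {
        try {
            XMLGregorianCalendar value =
                    DatatypeFactory.newInstance().newXMLGregorianCalendar(lexical);
            // getTimezone() reports FIELD_UNDEFINED when the value was written without a zone.
            return value.getTimezone() != DatatypeConstants.FIELD_UNDEFINED;
        } catch (DatatypeConfigurationException e) {
            throw new IllegalStateException("no DatatypeFactory available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasTimeZone("2008-09-08T03:38:53Z"));
        System.out.println(hasTimeZone("2008-09-08T15:38:53"));
    }
}
```

Application code can use a check like this to reject or specially handle zoneless input instead of letting a default zone be applied.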
>>
>> Given the nasty issues lurking in this area, the best general approach
>> is probably to only use the schema date/time types for values which
>> should be fully-specified with time zone indications, and to make sure
>> that any documents you generate do include time zone indications. Most
>> web services toolkits will generate the time zone indications for you
>> on output automatically, so this last part is easy. Requiring that
>> your input documents also use time zone indications can be more
>> difficult, especially since documents may be going through several
>> stages of processing. If you want to be certain you don't run into
>> problems caused by mistaken conversion assumptions your best solution
>> is probably to use a string type in the schema representation, so that
>> your web service toolkit will pass the value on to your application
>> code without trying to interpret the value.
>>
>> If you need zoneless date/time values (as for the birth date example),
>> your best approach may again be to use a *string* type in the schema
>> representation. That's not very satisfying from the standpoint of
>> providing an accurate representation of the data in the schema, but
>> avoids the issues with web services toolkits interpreting unzoned
>> values as being in the local timezone.
>>
>>
>>     References
>>
>> Data structures used internally by applications often contain multiple
>> linkages between components, including cross-references and indirect
>> associations. XML, on the other hand, is inherently tree-structured.
>> It's very easy to represent one-to-many relationships in XML through
>> containment, but any other type of relationship is problematic. Even
>> one-to-many relationships can be inefficient. Consider the case of a
>> document listing a customer's order history, for instance. Each order
>> will have associated billing and shipping addresses, but these
>> addresses are often going to be repeated from one order to the next.
>> If you just embed the addresses inside the information for each order,
>> you'll end up with a lot of redundant information in your documents.
>>
>> References can be used to get around the limitations of XML's tree
>> structure. The idea of a reference is that you define something once
>> in an XML document, including a unique identifier. Any time other data
>> needs to make use of that definition, you create a reference using the
>> unique identifier.
>>
>> Schema directly supports two forms of references. The first, using the
>> ID type, defines element identifiers which can be linked from anywhere
>> in the document by using the IDREF or IDREFS types. The nice part of
>> ID/IDREF links is that they're simple - identifiers are just names,
>> and any type of element can define an ID value in the schema. The
>> downside of ID/IDREF links is that they use a global context, so
>> there's no way to say that the value used for a particular IDREF must
>> be defined on a particular element type, and the names used as ID
>> values must be unique within a document (even across types of
>> elements). Some web service toolkits support using ID/IDREF links to
>> represent references within data structures (including JAX-WS/JAXB
>> 2.0, and Apache Axis2 when used with JiBX data binding); other
>> toolkits (such as .Net, and Axis2 used with ADB) do not, instead
>> treating IDREF values as simple text strings.
>>
>> The second type of reference supported by schema is the key/keyref link.
>> While ID/IDREF links are defined using datatypes, key/keyref links are
>> instead part of the structure of a schema definition. This difference
>> allows key/keyref links to be much more expressive than ID/IDREF
>> links, including defining contexts within which key values are unique.
>> But because key/keyref links are designed more for purposes of
>> document validation than for structuring, they are complex and not
>> generally used by data binding frameworks which convert XML data to
>> and from data structures.
>>
>> So if you want to embed linkages within your XML documents and have
>> them handled by web services toolkits, your only hope is the ID/IDREF
>> approach. Some toolkits will support these links directly; others will
>> just treat the identifier values as strings, but you can write
>> application code to cross-reference the identifier and reference
>> values and build your own links.
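For toolkits that hand IDREF values over as plain strings, the cross-referencing step might look like the sketch below. The Address and Order classes, their fields, and the IdRefLinker name are all hypothetical, loosely following the order-history example above:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IdRefLinker {
    static final class Address {
        final String id;      // value of the ID attribute
        final String city;
        Address(String id, String city) { this.id = id; this.city = city; }
    }

    static final class Order {
        final String number;
        final String shipToRef; // IDREF value, delivered as a plain string
        Address shipTo;          // resolved link, filled in by link()
        Order(String number, String shipToRef) { this.number = number; this.shipToRef = shipToRef; }
    }

    /** Resolves each order's IDREF string to the Address object that declares that ID. */
    static void link(List<Address> addresses, List<Order> orders) {
        Map<String, Address> byId = new HashMap<>();
        for (Address address : addresses) {
            // ID values must be unique within a document.
            if (byId.put(address.id, address) != null) {
                throw new IllegalArgumentException("duplicate ID: " + address.id);
            }
        }
        for (Order order : orders) {
            Address target = byId.get(order.shipToRef);
            if (target == null) {
                throw new IllegalArgumentException("dangling IDREF: " + order.shipToRef);
            }
            order.shipTo = target;
        }
    }

    public static void main(String[] args) {
        Address home = new Address("addr1", "Auckland");
        Order first = new Order("1001", "addr1");
        Order second = new Order("1002", "addr1");
        link(Arrays.asList(home), Arrays.asList(first, second));
        System.out.println(first.shipTo == second.shipTo); // both orders share one Address object
    }
}
```

Because ID values share one global context, the lookup map is keyed by identifier alone; checking that the target is the expected element type is left to the application.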
>>
>>
>>     Conclusion
>>
>> In this article we've looked at some of the problems that arise when
>> using the most common schema datatypes in web services. There are many
>> other specialized schema datatypes beyond those mentioned in this
>> article (a total of 42!), and some of these present other issues. As a
>> general principle, the best approach to take in your web service
>> schema definitions is to avoid the use of overly-specialized types
>> (except for the numeric types that match common programming language
>> types), and use a string type when you want full control over the
>> interpretation of values.
>>
>> It's worth pointing out that although some of the issues discussed in
>> this article could be handled better by data binding frameworks, a lot
>> of the problems lie with schema itself. In particular, the date/time
>> family of types are at best cumbersome to work with and at worst
>> invite errors through the lack of distinction between zoned and
>> unzoned value types. It's possible to pass the confusion on to the
>> user, as JAXB does with the XMLGregorianCalendar type, but that's not
>> really a solution.>>
>>
>>
>>
>
>
>
>
