The Truth About XML was: openEHR Subversion => Github move progress

Thomas Beale Fri, 29 Mar 2013 16:19:19 +0000

On 29/03/2013 14:15, Tim Cook wrote:
> Hi Tom,
>
> I have amended the Subject Line since the thread has diverged a bit.
>
> [comments inline]
>
> On Thu, Mar 28, 2013 at 9:55 AM, Thomas Beale
> <thomas.beale at oceaninformatics.com>  wrote:
>> one of the problems with LinkEHR (which does have many good features) is
>> that it is driven off XSD. In principle, XSD is already a deformation of any
>> but the most trivial object model, due to its non-OO semantics. As time goes
>> on, it is clear that the XSD expression of data models like openEHR, 13606
>> etc will be more and more heavily optimised for XML data. This guarantees
>> such XSDs will be a further deformation of the original object model - the
>> view that programmers use.
> I agree with you that you cannot represent an object model, fully, in
> XML Schema language.
> However, you seem to promote the idea that object oriented modelling
> is the only information modelling approach[1].
> This is a critical failure. There are many ways to engineer software
> using many different modelling approaches.
> So abstract information modelling, as you have noted, does not
> necessarily fit all possible software modelling approaches and it is
> unrealistic to think that it does. In designing the openEHR model you
> chose to use object oriented modelling. The openEHR reference
> implementation uses a rather obscure, though quite pure,
> implementation language, Eiffel. I think that history has shown that
> this has caused some issues in development in other object oriented
> languages.


Hi Tim,

I don't see any problem here. The extant open 'reference implementation' 
of openEHR has been in Java for years now, and secondarily in Ruby 
(openEHR.jp <http://openehr.jp/>) and C# (codeplex.com 
<http://openehr.codeplex.com/>). The original Eiffel prototype was from 
nearly 10 years ago and was simply how I prototyped things from the GEHR 
project, while other OO languages matured.

I am not sure that we have suffered any critical failure - can you point 
it out?


>> So now if you build archetypes based on the XSD,
>> you are not defining models of object data that software can use (apart from
>> the low layer that deals with XML data conversion). I am unclear how any
>> tool based on XSD can be used for modelling object data (and that's nearly
>> all domain data in the world today, due to the use of object-oriented
>> programming languages).
> I think that if you look, you will find that "nearly all of the domain
> data in the world" exists in SQL models, not object oriented models.
> So this is a rather biased statement designed to fit your message.
> Not a representation of reality.

ok, so I'll clarify what I meant a bit: most domain (i.e. industry 
vertical) applications are being written in object languages these days 
- Java, Python, C#, C++, Ruby, etc.  The software developer's view of 
the data is normally via the 'class' construct of those languages. You 
are right of course that the vast majority of the data physically 
resides in some RDBMS or other. However, the table view isn't the 
primary 'model' of the data for, I would guess, the majority of software 
development projects these days. There are of course major exceptions - 
systems written totally or mainly in SQL stored procedures or whatever, 
but new developments don't tend to go this route. In terms of sheer 
amount of data, these latter systems are probably still in the majority, 
since tax databases, military systems, legacy bank systems etc. are 
written this way, but in terms of numbers of software projects, I am 
pretty sure the balance is heavily in the other direction.

> That said, the abstract concept of multi-level modelling, where there
> is the separation of a generic reference model from the domain concept
> models is very crucial. Another crucial factor is implementability; as
> promoted by the openEHR Foundation mantra, "implementation,
> implementation, implementation".
>
> The last and possibly most crucial issue relates to implementability,
> which is the availability of a talent pool and tooling. In order to
> attract more than a handful of users to a technology there needs to
> exist some level of talent as well as robust and commonly available
> tools.
>
> The two previous paragraphs are the reasons that the Multi-Level
> Healthcare Information Modelling (MLHIM) project exists.

well, since the primary openEHR projects are in Java, Ruby, C#, PHP, 
etc, I don't see where the disconnect between the projects and the 
talent pool is. I think if you look at the 'who is using it' pages 
<http://www.openehr.org/who_is_using_openehr/>, and also the openEHR 
Github projects <https://github.com/openEHR>, you won't find much that 
doesn't connect to the mainstream.

> MLHIM is modeled from the ground up around the W3C XML Schema Language 1.1 
> [2] .
> The reason for this is that the family of XML technologies are the
> most ubiquitous tools throughout the global information processing
> domain today. There is a significant number of open source and
> proprietary tools from parser/validators to various levels of editors,
> readily available. While serious XML development is not taught in all
> university computer science programs, every student does get
> introduced to XML in some manner.
>
> The relationship of XML with emerging knowledge modelling tools like
> Protégé in languages such as OWL[3] and vocabularies expressed in
> RDF/XML[4] is an obvious advantage. There is an enormous skills pool
> available for using XML data with REST APIs and in translating XML to
> JSON for over-the-wire communications. There are thousands of websites
> with information on how to do these things. It is irrelevant which
> programming language you choose to use; Java, Eiffel, Ruby, Lua,
> Python, etc. there are XML binding tools and access to XML validators.
> There are tried and true methods of storing XML data in SQL databases,
> XML databases and NoSQL databases. XQuery and XPath are very robust
> and well known. Another big advantage is having the ability to do data
> validation using commonly available tools in a complete path; from the
> instance data to concept model to the reference model to the W3C XML
> Schema specification to the W3C XML specification,

<NB: in the below I am talking about the industry standard XSD 1.0, not 
the 9-month old XML Schema 1.1 spec>

well I don't really have anything to add to any of that. For the moment, 
industry (including openEHR, which has published XSDs for all its models 
for years now) is still using XML, although one has to wonder how long that 
will go on 
<http://www.drdobbs.com/web-development/after-xml-json-then-what/240151851?cid=DDJ_nl_upd_2013-03-27_h&elq=d97916a977fc47dcbdbbe30eff6d55de>.

But XML schema as an /information modelling/ language has been of no 
serious use, primarily because its inheritance model is utterly broken. 
There are two competing notions of specialisation - restriction and 
extension. Restriction is not a tool you can use in object-land because 
the semantics are additive down the inheritance hierarchy, but you can 
of course try to use it for constraint modelling, although it is 
generally too weak for anything serious, and most projects I have seen 
going this route eventually give in and build tools to interpolate 
Schematron statements to do the job properly. Now you have two 
languages, plus you are mixing object (additive) and constraint 
(subtractive) modelling.
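
To make the additive / subtractive distinction concrete, here is a minimal 
Python sketch. The class names, fields and limits are hypothetical 
illustrations, not taken from the openEHR Reference Model. The subclass can 
only add to what it inherits, whereas the 'restriction' has to be written as 
a separate constraint over instances:

```python
from dataclasses import dataclass, fields


@dataclass
class Quantity:
    """Hypothetical base type: a magnitude with units."""
    magnitude: float
    units: str


@dataclass
class ReferenceRangedQuantity(Quantity):
    """Extension: the subclass ADDS fields; nothing inherited is removed."""
    normal_low: float = 0.0
    normal_high: float = 0.0


def field_names(cls):
    """Return the dataclass field names of cls, inherited ones first."""
    return [f.name for f in fields(cls)]


def is_valid_body_temperature(q: Quantity) -> bool:
    """Restriction written as a constraint on instances, not as a subtype.

    The units code and limits are made-up illustration values.
    """
    return q.units == "Cel" and 25.0 <= q.magnitude <= 45.0


# Every inherited field survives in the subclass (additive semantics).
print(set(field_names(Quantity)) <= set(field_names(ReferenceRangedQuantity)))
# The narrowing lives outside the class hierarchy (subtractive semantics).
print(is_valid_body_temperature(Quantity(37.2, "Cel")))
print(is_valid_body_temperature(Quantity(98.6, "[degF]")))
```

In openEHR terms the second half is roughly the job that archetypes do over 
the reference model; the function here is only a stand-in for that idea.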

Add to this the fact that the inheritance rules for XML attributes and 
Elements are different, and you have a modelling disaster area.

James Clark, designer of Relax NG, sees inheritance in XML as a design 
flaw (from http://www.thaiopensource.com/relaxng/design.html#section:15 ):
             ... The support for inheritance in W3C XML Schema is 
probably the major contributor to the considerable complexity of W3C XML 
Schema Part 1. Yet, the inheritance mechanisms in W3C XML Schema do not 
allow W3C XML Schema to express any constraints that cannot be expressed 
in RELAX NG. Although W3C XML Schema has a very complex type system with 
two type hierarchies, one for elements (called substitution groups) and 
one for complex types, it supports only single inheritance. However, 
modern object-oriented languages, such as Java and C#, support multiple 
inheritance (at least for interfaces). Thus, in general the inheritance 
structure of a class hierarchy cannot be represented in a schema. 
Inheritance has proven to be very useful in modeling languages such as 
UML. However, I would argue that trying to make an XML schema language 
also be a modeling language is not a good idea. An XML schema language 
has to be concerned with syntactic details, such as whether to use 
elements or attributes, which are irrelevant to the conceptual model. 
Instead, I believe it is better to use a standard modeling language such 
as UML, which provides full multiple inheritance, to do conceptual 
modeling, and then generate schemas and class definitions from the model 
[5]....

Difficulties in using type restriction (i.e. subtyping) in XSD seem to be 
well known - see 
<http://www.xml.com/pub/a/2002/11/20/schemas.html?page=4#restriction>. 
Not to mention the inability to deal with generic types of any kind, 
e.g. Interval<Date>, necessitating the creation of numerous fake types.
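
As a rough illustration of the generic type point, this is what a single 
parameterised definition looks like in an object language (Python here; the 
class and variable names are invented for the example). Without generics, a 
schema would instead need one separately declared type per point type, 
something like an IntervalOfDate, an IntervalOfReal and so on:

```python
from dataclasses import dataclass
from datetime import date
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass
class Interval(Generic[T]):
    """One generic definition covers every ordered point type."""
    lower: T
    upper: T

    def contains(self, value: T) -> bool:
        """True if value lies within the closed interval [lower, upper]."""
        return self.lower <= value <= self.upper


# The same class is reused for dates and for real numbers.
admission = Interval[date](date(2013, 3, 1), date(2013, 3, 29))
dose_mg = Interval[float](2.5, 10.0)

print(admission.contains(date(2013, 3, 15)))
print(dose_mg.contains(12.0))
```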

And of course, the underlying inefficiency and messiness of the data are 
serious problems as well. Google and Facebook (and I think Amazon) don't 
move data around internally in XML for this reason.

None of this is to say that XML or XML-schema can't be 'used' - I don't 
know of any product or project in openEHR space that doesn't use it 
somewhere, and of course it's completely ubiquitous in the general IT 
world. What I am saying here is that the minute you try to express your 
information model primarily in XSD, you are in a world of pain.

My lessons from projects using XSD are:

  * XSDs are good for one thing: describing the contents of XML
    documents. That's it.
      o but what we need are models that can describe data, software,
        documents, documentation, interfaces, etc
  * get imported data out of XML as soon as possible, and into a
    tractable computational formalism
  * treat XSDs as interface specifications, to be generated from the
    underlying primary information models, not as any kind of primary
    expression in their own right
  * Define XSDs with as little inheritance as possible, avoid subtyping,
    i.e. define types as standalone, regardless of the duplication.
  * Maximise the space optimisation of the data, no matter what it
    takes. It usually requires all kinds of tricks, heavy use of XML
    attributes, structure flattening from the object model and so on. If
    you don't do this, any XML data storage will cost twice what it
    should and web services using XML will be horribly slow.
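
Two of the lessons above (leave XML at the boundary, and flatten the data 
into attributes) can be sketched together. This is only an illustration 
using Python's standard xml.etree.ElementTree module; the BloodPressure 
class, the element names and the values are assumptions made up for the 
example, not any openEHR message format:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class BloodPressure:
    """Hypothetical in-memory model; the XML below is only a wire format."""
    systolic: float
    diastolic: float
    units: str


def to_element_xml(bp: BloodPressure) -> str:
    """Naive mapping: one child element per field."""
    root = ET.Element("blood_pressure")
    for name in ("systolic", "diastolic", "units"):
        ET.SubElement(root, name).text = str(getattr(bp, name))
    return ET.tostring(root, encoding="unicode")


def to_attribute_xml(bp: BloodPressure) -> str:
    """Flattened mapping: every field becomes an XML attribute."""
    root = ET.Element(
        "blood_pressure",
        systolic=str(bp.systolic),
        diastolic=str(bp.diastolic),
        units=bp.units,
    )
    return ET.tostring(root, encoding="unicode")


def from_attribute_xml(text: str) -> BloodPressure:
    """Parse once at the boundary and hand back a typed object."""
    node = ET.fromstring(text)
    return BloodPressure(
        systolic=float(node.get("systolic")),
        diastolic=float(node.get("diastolic")),
        units=node.get("units"),
    )


bp = BloodPressure(120.0, 80.0, "mm[Hg]")
verbose, compact = to_element_xml(bp), to_attribute_xml(bp)
print(len(verbose), len(compact))          # character counts of the two forms
print(from_attribute_xml(compact) == bp)   # round trip back to the object
```

On a toy record like this the attribute form is noticeably shorter; how much 
is saved on real data will depend on the structure of the model.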

I know there are all kinds of tricks to mitigate these problems, I've 
seen a lot of them. The fact that there is a mini-tech sector around XSD 
problem mitigation / optimisation testifies to the difficulty of this 
technology.

XML Schema 1.1 introduces useful things that may reduce some of the 
above problems (good overview here 
<https://blogs.oracle.com/rammenon/entry/xml_schema_11_what_you_need_to>), 
however as far as I can tell, its inheritance model is not much better 
than XSD 1.0 (although you can now inherit attributes properly, so 
that's good).

> With XML Schema Language 1.1, we have the ability to build complex
> structures using substitution groups and do very intricate data
> analysis and validation, across models, using XPath in assert
> statements. All without having to resort to RelaxNG or Schematron.
> There are also tools and experience in using XML Schemas to
> automatically generate generic XForms for presentation and data entry.
> So maybe we had to make concessions in deciding to use XML technology
> in MLHIM.  However, I cannot think of anything that is missing at this
> point.


well I guess the main thing is seamlessness between your information 
model and your programming model view. I am not saying it's the only 
way, but the approach in openEHR was oriented towards making sure that 
expressions of the information model, including all its semantics, are 
as close as possible to the software developer's programming model. If 
we had done the primary specifications in XML, there would always be a 
significant disconnect between the models and the software (actually, 
the specs would have been nearly impossible to write). Not to mention, 
life would be hard for working with all the other data formats now in 
use, including JSON, and various binary formats.

An approach that has emerged in industrial openEHR systems in the last 
few years is to /generate/ message XSDs from templates - 1 XSD per 
template, and write a generic XML <=> canonical data conversion gateway. 
This means we can do all modelling in powerful formalisms like UML 2, 
EMF/Ecore (for the information models) and all constraint modelling in 
ADL / AOM 1.5, and treat XML as one possible data transport.
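
The 'generate the XSD from the model' idea can be sketched in a few lines. 
This is not the actual openEHR template-to-XSD tooling, just a toy under 
stated assumptions: the 'template' is a flat, hypothetical Python dataclass, 
and the type mapping table is my own invention for the example:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)  # serialise the schema namespace as 'xs:'

# Assumed mapping from Python field types to XSD built-in simple types.
XSD_TYPES = {str: "xs:string", int: "xs:integer", float: "xs:double", bool: "xs:boolean"}


@dataclass
class BloodPressureMessage:
    """Hypothetical flat 'template'; real openEHR templates are far richer."""
    systolic: float
    diastolic: float
    units: str


def generate_xsd(model) -> str:
    """Emit one standalone element declaration (no inheritance) for a dataclass."""
    schema = ET.Element(f"{{{XS}}}schema")
    element = ET.SubElement(schema, f"{{{XS}}}element", name=model.__name__)
    complex_type = ET.SubElement(element, f"{{{XS}}}complexType")
    sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")
    for f in fields(model):
        ET.SubElement(sequence, f"{{{XS}}}element", name=f.name, type=XSD_TYPES[f.type])
    return ET.tostring(schema, encoding="unicode")


xsd = generate_xsd(BloodPressureMessage)
print(xsd)
```

The point is the direction of travel: the class is the primary expression, 
and the schema is a derived artefact that can be regenerated at any time.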

From what I can see, the major direction in information modelling for 
the future will be Eclipse Modelling Framework, using Ecore-based 
models. This is where I think the computational expression of openEHR's 
Reference Model will move to. The OHT Model Driven Health Tools (MDHT) 
project is already showing the way on this, at the same time adopting 
ADL 1.5 concepts for constraint modelling.

I have no experience with XSD 1.1, and I think it will be years before 
mainstream industry catches up with it. But it may be that it does what 
is needed.


>
> I want to close by saying that I am grateful for the work done in the
> openEHR community. In my more than ten years of involvement with five
> years on the ARB, I learned a lot!  I learned how to do things right
> as well as what can go wrong.  MLHIM represents those lessons learned.
>
>
>

we'll obviously differ on our analysis of what is the best modelling 
formalism. The above are the conclusions I have come to over the years. 
Others may have other, better ideas, and it may be that an XSD 1.1 
modelling effort in openEHR could make sense.

I think the key thing would have been to ensure that the archetypes 
could be shared across openEHR and MLHIM. Archetypes are pretty widely 
used these days, and there are many projects now creating them. I don't 
know if this is still possible; if not, it presents clinicians with the 
dilemma: model in ADL/AOM, or model in MLHIM? Replicated models aren't 
fun to maintain...

- thomas