Hey folks,

Following on from the discussion about ADL path / XPath, XSDs, and so
forth, here's something semi-random that I've been thinking about for a
while and thought I might share. I figured now might be a good time for
that, with the work ongoing on openEHR 2.0.

Status
------

ADL is a rich schema language which allows for constructs that you
cannot properly express in XML schema. The openEHR reference model is a
rich model which incorporates many common data types.

This is (supposed to be?) great for modellers, leads to some very fun
exercises in tool-building for openEHR implementation experts, and is
not great at all for joe programmer, who becomes very dependent on bob
programmer [1].

One way or another, joe programmer is converting (with some help from
bob programmer's magic along the way) the smart and flexible data
structures into the stupid and rigid data structures that are typical
for business programming and the tools and languages he has available
to him.

Is this a required situation due to the inherent complexity of handling
medical information? That seems to be the general gist of the argument
that led to GEHR and openEHR, but I'm really not so sure I buy it. I'm
also not convinced at all that long-term data archival and
interoperability goals are really met all that well by any format based
on complex base data structures.

Radical simplification
----------------------

By reducing the set of base types and base primitives in openEHR, it
_should_ be possible to produce a revised architecture that keeps
_most_ of the core values of openEHR, like two-level modeling on top of
a generic architecture, makes implementation much easier, and makes use
by joe programmer much easier still. I imagine most of the loss is in
model conciseness.

I know discussions have been had here and elsewhere about "dumbing down
to what you can conceivably get done with XML tools", but I'd actually
like to take the thought experiment a bit further. In this case, the
thoughts are about ADL and associated data format(s).

Lowest common denominator
-------------------------

Try and imagine openEHR as it is today, and then:

* enforce (just) unicode characters, and UTF-8 encoding
* reduce the available collection primitives to just Map and List,
  losing Set and Interval (probably Set and Interval live in the
  reference model, but conceptually they're built out of other types)
* remove Cluster, Element and ItemStructure from the reference model,
  forcing use of nested maps and lists only, throughout
* reduce primitives to string, true, false, null (ugh), int (32-bit),
  bigint (64-bit), float (32-bit), double (64-bit)
* keep the concept of datetime as a text/value subtype

Why? So any openEHR data can be unambiguously serialized and
deserialized to JSON, that crappy lowest common denominator format that
joe and all his friends and all their tools can work with out of the
box.

Maybe we can still be smart enough to have something non-crappy...

...next, imagine that the map indices are now at-codes, meaning
at-codes are required to be unique within maps. At-codes are also made
mandatory for all model elements that are not primitives, though simple
types can have reference-model-provided details.

Finally, imagine that at-codes are not at-codes, but codes matching

    [a-zA-Z][a-zA-Z0-9_-]{1,36}

Why? So your canonical JSON representation can be logical, even
beautiful. Where are we now?

    {
      "person_name": {
        "first_name": [
          {"first_name_part": {"value": "Jan"}},
          {"first_name_part": {"value": "Peter"}}
        ],
        "last_name": {"value": "Balkenende"}
      }
    }

Err, what?
----------

Wait, is that really openEHR data? Am I serious? Yeah, I think so. But
at what cost? What did we just lose?
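
To make the rules above a bit more concrete, here is a minimal Python sketch of a checker for the "maps and lists only" idea. It is an illustration under stated assumptions, not part of any openEHR tooling: the names `check_simple` and `to_json` are made up, the 64-bit integer bound is just one reading of "bigint (64-bit)", and datetime handling (the text/value subtype) is left out entirely. Only the code pattern and the sample data are taken from the text.

```python
import json
import re

# Key pattern quoted from the proposal; all function names are hypothetical.
CODE = re.compile(r"[a-zA-Z][a-zA-Z0-9_-]{1,36}")
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def check_simple(node, path="$"):
    """Raise ValueError unless node uses only maps, lists and primitives."""
    if isinstance(node, dict):
        for key, value in node.items():
            if not isinstance(key, str) or not CODE.fullmatch(key):
                raise ValueError(f"invalid code {key!r} at {path}")
            check_simple(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            check_simple(item, f"{path}[{index}]")
    elif node is None or isinstance(node, (str, bool, float)):
        return
    elif isinstance(node, int):
        # Assumption: "bigint (64-bit)" is the widest integer allowed.
        if not INT64_MIN <= node <= INT64_MAX:
            raise ValueError(f"integer out of 64-bit range at {path}")
    else:
        raise ValueError(f"unsupported type {type(node).__name__} at {path}")


def to_json(node):
    """Validate first, then serialize; NaN and infinity are rejected."""
    check_simple(node)
    return json.dumps(node, ensure_ascii=False, allow_nan=False)


person = {
    "person_name": {
        "first_name": [
            {"first_name_part": {"value": "Jan"}},
            {"first_name_part": {"value": "Peter"}},
        ],
        "last_name": {"value": "Balkenende"},
    }
}

# Round trip: what comes back from JSON equals what went in.
round_tripped = json.loads(to_json(person))
```

Because the structure is nothing but dicts, lists and primitives, the round trip needs no custom encoder or decoder, which is the point of the exercise.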

I imagine ADL would become quite a bit simpler/more regular

    archetype (adl_version=2.0)
        openEHR-DEMOGRAPHIC-MAP.person_details.v1

    definition
        MAP[details] optional matches {
            ENTRY[identities] optional matches {
                LIST[identity_list] matches {
                    MAP[person_name] optional matches {
                        -- you can have at most one first name
                        -- consisting of 1 or more parts
                        ENTRY[first_name] optional matches {
                            LIST[first_name_parts] {
                                DV_TEXT[first_name_part] occurrences {1..*} matches {
                                    value matches {*}
                                }
                            }
                        }
                        -- you must have exactly one last name
                        ENTRY[last_name] matches {
                            DV_TEXT[last_name_value] matches {
                                value matches {*}
                            }
                        }
                    }
                }
            }
        }

which could in turn much more easily translate to XSD [2]

    <xs:element name="person_name" type="PERSON_NAME">
    <xs:complexType name="PERSON_NAME">
      <xs:complexContent>
        <xs:restriction base="openehr:MAP">
          ...
    <xs:element name="first_name_part" type="FIRST_NAME_PART">
    <xs:complexType name="FIRST_NAME_PART">
      <xs:complexContent>
        <xs:restriction base="openehr:DV_TEXT">

so that we could have pretty xml

    <person xmlns:o="http://openehr.org/xsd/v2"
        xmlns="http://openehr.org/ckm/xsd/openEHR-DEMOGRAPHIC-CLUSTER.person_name.v1">
      <details>
        <identities>
          <!-- list is hidden xs:sequence -->
          <person_name>
            <first_name>
              <first_name_part>
                <!-- first_name_part is-a DV_TEXT mitigates
                     <value><value>...</value></value> -->
                <o:value>Jan</o:value>
              </first_name_part>
              <first_name_part>
                <o:value>Peter</o:value>
              </first_name_part>
            </first_name>
            <last_name>
              <last_name_value>
                <o:value>Balkenende</o:value>
              </last_name_value>
            </last_name>
          </person_name>
        </identities>
      </details>
    </person>

which would also be much more efficient to store and index, and lead to
more intuitive xpath

    //first_name_part[1]/value/text() (: Jan :)
    //first_name_part[2]/value/text() (: Peter :)
    //last_name_value/value/text()    (: Balkenende :)

Reflections
-----------

Basically, this involves sacrificing some modeling language power for
increased implementation feasibility and tool interoperability.
Most importantly, it will reduce the cognitive dissonance that openEHR's
two-level modelling introduces for joe programmer.

Now, modeling language power is of use, isn't it? Surely the great
power of ADL is not to be tampered with, conciseness of expression must
not be lost?

Well, hmm. Obviously, significantly reducing what clinical models are
possible to express is not an option at all. But, on the other hand,
you can convert pretty much any data structure into a tree (and we can
keep Link around for approximating graph constructs), and convert any
tree into a combination of maps and lists (just look at DOM...), and so
it seems that this would then really place a burden mostly on the
evolution of more powerful modeling tools, to allow expressing rich
concepts using the more limited underlying modeling language without
getting stuck [3].

I think this may actually be reasonable: surveying the archetypes in
the CKM, they are predominantly using pretty simple composition and
polymorphism, and simple binary relations. The archetypes can get
pretty big and they can have some interesting constraints, but they're
not that complex structurally.

However, if it turns out this proposed loss of modeling power isn't
acceptable, I think the ORM (object role modelling) community has shown
a successful way to have the best of several worlds: we could keep ADL
(and its supporting tooling) around, mostly as it is, but then
introduce a standardized translation from it to an intermediary,
dumber, schema form (like ORM can be machine-translated to ER or UML),
say Flattened ADL, or FADL, which is then the basis for data formats
and system connectivity.

But, and this is perhaps the sour pill to swallow, ORM has _also_ shown
that the seductiveness of a really powerful modeling language [4] is a
great way to forever remain a relatively small and obscure community
[5] while the majority of IT is off making big bucks by building
theoretically (even provably) inferior systems.
In a landscape still full of HL7v2, 80s-style SQL, and 90s-style data
entry forms, perhaps the strategy with the most chance of long-term
success for openEHR is to actually let go of the shiniest tools.

Not meant as a call to action, just some food for thought :-)

cheers,

Leo

[1] of course this leads to a reasonable business model for jane
    manager, bob's boss, who gets to sell bob's shiny things to
    joe...but only so long as joe doesn't revolt...and the marketing is
    hard...

[2] as long as there are archetype slots, any purely XSD-based
    validation is not going to work unless you annotate instance data
    with the schema, but I can't imagine we'd want to consider giving
    up slots...

[3] so this turns out great for jane anyway, since she prefers the
    smart&rich customers...?

[4] if you're not familiar with ORM or other fact-based modelling,
    basically, the approach has been around forever (much longer than
    OO), and it kicks UML's _ass_. GEHR and openEHR would no doubt have
    looked even prettier if they'd been expressed using ORM instead of
    UML.

[5] and a frustrated community at times too...hmm, I guess you might
    say ORM is to UML as Lisp is to Java...