On 14-02-16 00:04, Birger Haarbrandt wrote:

Hi Bert,

I'm not arguing that you can represent most data in XML. I'm just concerned that mangling high-volume or specialized data, for example sensor data, genome data and geo-spatial data, into a document format might not work too well. Also, when the ER diagram of non-openEHR data is fairly complex, producing a meaningful XSD and XML documents might not be that quick and easy (at least I don't know of an industry-strength tool that can help with this task. However, I may be wrong about this and I'd be happy to learn).


I agree, long runs of data are not well represented in XML; it has too much overhead. (Although there are other solutions for that which integrate easily with XML, but that aside.)

So treat XML as an intermediate representation: it is good for software to handle and it can represent objects very well, so it fits the object-oriented paradigm, which openEHR also follows. XML has good support for validation, it is widely understood, and almost every development environment has standard support for it.

There are two kinds of related, mature industry support I am looking for: a good, well-defined query language, and, as an extension of it, a validation environment. XQuery and Schematron are excellent technologies which fit the two-level modelling (openEHR) paradigm very well, because they are path-based.
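To illustrate what "path-based" buys you, here is a minimal sketch in Python, with the standard library's limited XPath support standing in for full XQuery. The element names are invented for the example; the point is that a path expression selects data by its position in the tree without knowing the rest of the document's shape:

```python
import xml.etree.ElementTree as ET

xml = """<composition>
  <observation><name>blood pressure</name><systolic>120</systolic></observation>
  <observation><name>heart rate</name><rate>72</rate></observation>
</composition>"""

root = ET.fromstring(xml)
# The path expression matches only elements at this position in the tree;
# the heart-rate observation is simply skipped.
values = [e.text for e in root.findall("./observation/systolic")]
print(values)  # ['120']
```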

JSON is also very good, and it is leaner; especially when sender and receiver have deep knowledge of the data (which is the case in openEHR), JSON is better. But industry support for JSON is, as far as I know, not as good as it is for XML. On the other hand, it is easy to migrate from XML to JSON and vice versa, even without structural data loss; see for example
http://www.utilities-online.info/xmltojson/
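As a rough illustration of such a migration, here is a simplified sketch using only the Python standard library. It ignores attributes and mixed content (tools like the one linked above handle more cases); repeated child tags are promoted to lists:

```python
import json
import xml.etree.ElementTree as ET

def xml_to_dict(elem):
    """Recursively convert an XML element into a plain dict.
    Leaves become strings; repeated child tags become lists."""
    children = list(elem)
    if not children:
        return elem.text
    result = {}
    for child in children:
        value = xml_to_dict(child)
        if child.tag in result:
            # Promote repeated tags to a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

xml = "<observation><systolic>120</systolic><diastolic>80</diastolic></observation>"
doc = {ET.fromstring(xml).tag: xml_to_dict(ET.fromstring(xml))}
print(json.dumps(doc))
# {"observation": {"systolic": "120", "diastolic": "80"}}
```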

I don't believe that XML databases actually store XML. Oracle, for example, breaks it up into a relational structure, though I don't know the internals of the others well. The worst solution for storing XML, however, would be to really store it as XML. In the solution I presented in my email it is not XML in which I want to store data, but path-value combinations (in fact, in detail it differs somewhat; this is the base idea. The elaborated idea is ten times as efficient.)
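A minimal sketch of that base idea: flattening an XML tree into path-value pairs. This is simplified (it ignores attributes and does not number repeated siblings, which a real design would need), and the element names are invented:

```python
import xml.etree.ElementTree as ET

def to_path_values(elem, prefix=""):
    """Flatten an XML tree into (path, value) pairs."""
    path = f"{prefix}/{elem.tag}"
    pairs = []
    text = (elem.text or "").strip()
    if text:
        # Only leaf text becomes a stored value; structure lives in the path.
        pairs.append((path, text))
    for child in elem:
        pairs.extend(to_path_values(child, path))
    return pairs

xml = "<composition><events><systolic>120</systolic></events></composition>"
rows = to_path_values(ET.fromstring(xml))
print(rows)
# [('/composition/events/systolic', '120')]
```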

Because, regarding storage, there are other criteria than for validating and communicating data. In storage, speed and efficiency are very important, and so is a very good and fast implementation of AQL (or XQuery). And when data are retrieved, they can be represented in JSON or XML, or whatever one likes; even support for native American smoke signals is possible. These are, again, just representations.


Regarding performance, we did some tests on SQL Server 2012 last year. As I have only experience with this particular database, it might well be that my critique does not apply to Oracle or Marklogic!


I am not very impressed by these database tests; there are so many side factors which are not taken into account: the JDBC drivers, for example, the communication protocols used, the indexes, the code of the supporting software layers, the quality of the query engine, the operating system, the file system, the network-card driver, and so on.
You are testing completely different stacks of technologies.
It is like testing a chain and then concluding that the last link is no good because the chain breaks somewhere in the middle.

But there is indeed a problem with the old database technologies: they are built for data manipulation. There are good reasons for that: a bank does not want to process your complete history every day, but wants to know your current savings and mortgage position, so it modifies your current data constantly. Codd normalization is also designed for efficiency and integrity in the context of data manipulation.

When you use a database out of the box, you will see features which are needed for constant manipulation. But you don't need them, because medical data are immutable. This is very important.

Just a minute ago I compared a simple SQL query with an XQuery on our data repository. I simply wanted to get all validated blood pressure values and their corresponding datetimes from a pediatric ICU. Using the plain relational representation of the data (we automatically map data from compositions to tables), it takes under 1 second to get all 329,273 rows. With a full index on the blood pressure fragment of the composition (this is needed to get the internal tabular representation of the data) and a secondary index on the paths, querying the same rows still takes 30 seconds (without the indexes it would be 2 minutes, no surprise). Additionally, the size of the data increases from 10 MB to 270 MB.

I can assure you that my database storage requires only a few indexes, and very fast ones, because the data are immutable.
The disadvantage of my solution is that it is not out of the box.
The most important job is to make the query engine work with the data storage, but there are now new ways to work with grammars, and I don't think this is very difficult.

The W3C has a lot of information on XQuery grammars:
https://www.w3.org/TR/xquery-xpath-parsing/
https://www.w3.org/TR/xquery-30/

When this is done, a database configuration designed for speed can be used on any RDB engine to implement this data-processing method.
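A toy illustration of that idea, assuming sqlite3 as a stand-in RDB engine: an append-only path/value table with a single path index, which is all immutable data needs. Table and column names are invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pv (comp_id INTEGER, path TEXT, value TEXT)")
# Append-only data: no update machinery, just a few fast indexes.
con.execute("CREATE INDEX idx_pv_path ON pv(path)")
rows = [
    (1, "/composition/observation/systolic", "120"),
    (1, "/composition/observation/diastolic", "80"),
]
con.executemany("INSERT INTO pv VALUES (?, ?, ?)", rows)
hits = con.execute(
    "SELECT value FROM pv WHERE path = ?",
    ("/composition/observation/systolic",),
).fetchall()
print(hits)  # [('120',)]
```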

But I see that we are indeed approaching the problem along different tracks. You test out-of-the-box solutions, as many people do. And I think that out of the box, nothing is good enough, because the vendors were not thinking of openEHR but of a million other customer requirements when designing their databases. And however good, well designed, professional and well maintained they are, they will not remove the characteristics which stand in your way.

This is the reality we face in our system; therefore, I consider XQuery and XML not an option for us to do analysis in this database layer. As said, this might not apply to a better implementation of XML by other vendors, but I'd love to see some real-world numbers.

Just some thoughts and experiences; I'm not a dedicated database expert, so I would not be sad if I'm proven wrong :)


Embrace the good news ;-)

Bert
_______________________________________________
openEHR-technical mailing list
openEHR-technical@lists.openehr.org
http://lists.openehr.org/mailman/listinfo/openehr-technical_lists.openehr.org
