Dear Erik and all,

(This email might appear a bit long but it actually makes just two points: a) a data synthesizer tool, b) the availability of realistic subject data.)
A) Data Synthesizer Tool

I absolutely agree on the "data synthesizer" tool. It is something I would like to do as a test case for parsing an archetype's definition node and generating a representative object, because in this case each and every node defined in the spec would have to be handled. It is not that much of a time-consuming task if you already have the RM builder. The AM provides everything that is needed (for example, bounds for primitive types and cardinality / multiplicity for other data structures; see http://postimage.org/image/mcytss26f/), so instead of just creating an object from the RM and attaching it to a hierarchy (perhaps just by calling its constructor), some values would have to be generated and attached to its fields as well (a rough sketch of what I have in mind is further down in this email). Once the RM object is constructed it can be serialized to anything, XML included, and there goes a first "test base". From this perspective, it is absolutely essential that the XSDs are valid (to ensure a valid structure) and also (Seref has got a very good point here) that the archetypes are valid, to ensure valid content.

B) Availability of Realistic Subject Data

As far as clinically realistic datasets are concerned, I would like to suggest the following. The Alzheimer's Disease Neuroimaging Initiative (ADNI) in the US is a long-term project that collects, longitudinally, various clinical parameters from subjects at various stages of the disease (http://adni.loni.ucla.edu/). At the moment the dataset contains about 800 subjects. Each subject has 4-5 sessions associated with them (usually at 6-month intervals), and for each session a number of parameters are collected, such as MMSE scores, ADAS-Cog scores, received medication, lab tests and others, as well as imaging biomarkers (mostly MRI). A basic "demographics" section is also available for each subject. (To put it in the context of a visualisation, the story that these data reveal is the progression of AD in a subject / population of subjects, which is very interesting.)

The data are made available as CSV files (about 12 MB just for the numerical data) and an application must be made to ADNI to obtain them. As redistribution of the data is prohibited (http://adni.loni.ucla.edu/wp-content/uploads/how_to_apply/ADNI_DSP_Policy.pdf), we would be working towards a tool that accepts a set of ADNI CSV files and transforms them into a local openEHR-enabled repository. The task here would be to create some archetypes / templates that reflect the structure of the data shared by ADNI and then scan the CSVs and populate the repository. The CSV files are not in the best of conditions: the structure has changed from version to version, certain fields (such as dates) may appear in a number of different formats, the terminology is not exactly standardised, and so on. For us (ctmnd.org) to work on these files we have created an SQL database and a set of scripts that sanitize and import the CSVs (a second small sketch of that kind of step is also further down). I would be interested in turning this database into an openEHR-enabled repository (whether a set of XML files or a "proper" openEHR database) because it could be used for a number of things, especially for testing AQL. If you think that this could be of help, let me know how we can progress with it. Obviously the tool can be made available to everybody, who can then apply to download the ADNI data locally.
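To make the synthesizer idea in A) a bit more concrete, here is a very rough sketch of walking a constraint tree and generating a representative instance. The constraint classes below are made-up, simplified stand-ins (not the real AOM classes), and a plain dict stands in for the RM object; in a real implementation the recursion would run over the parsed archetype's complex/primitive constraint nodes and call the RM builder instead.

import random
import string

class CPrimitive:
    """Stand-in for a primitive constraint (numeric bounds, free string, boolean)."""
    def __init__(self, rm_type, lower=None, upper=None):
        self.rm_type = rm_type      # e.g. "Integer", "Real", "String", "Boolean"
        self.lower = lower          # lower bound from the archetype, if any
        self.upper = upper          # upper bound from the archetype, if any

class CComplex:
    """Stand-in for a complex object constraint with named attributes."""
    def __init__(self, rm_type, attributes):
        self.rm_type = rm_type          # e.g. "ELEMENT", "DV_COUNT"
        self.attributes = attributes    # dict: attribute name -> (constraint, (lower, upper))

def synthesize(node):
    """Generate a representative value/object that respects the node's constraints."""
    if isinstance(node, CPrimitive):
        if node.rm_type == "Integer":
            return random.randint(node.lower or 0, node.upper or 100)
        if node.rm_type == "Real":
            return random.uniform(node.lower or 0.0, node.upper or 100.0)
        if node.rm_type == "Boolean":
            return random.choice([True, False])
        return "".join(random.choices(string.ascii_lowercase, k=8))
    if isinstance(node, CComplex):
        # In the real thing this would construct the RM object via the RM builder;
        # here a plain dict stands in for it.
        obj = {"_rm_type": node.rm_type}
        for name, (child, (lower, upper)) in node.attributes.items():
            count = max(lower, 1) if upper != 0 else 0      # honour the multiplicity
            values = [synthesize(child) for _ in range(count)]
            obj[name] = values[0] if upper == 1 else values
        return obj
    raise TypeError("unknown constraint node")

# Tiny example: an ELEMENT whose value is an integer magnitude between 0 and 30
# (think of an MMSE score). Serialising the result (to XML, say) would give the
# first "test base" mentioned above.
mmse = CComplex("ELEMENT", {
    "value": (CComplex("DV_COUNT", {
        "magnitude": (CPrimitive("Integer", 0, 30), (1, 1)),
    }), (1, 1)),
})
print(synthesize(mmse))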
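And for B), this is roughly the kind of cleanup our import scripts already do over the ADNI CSVs, again only as a simplified sketch: the file and column names are illustrative rather than the exact ADNI layout, and in the actual tool the target would be openEHR compositions rather than a flat SQLite table.

import csv
import sqlite3
from datetime import datetime

# Dates appear in several formats across file versions, so try each in turn.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%d-%b-%Y")

def parse_date(raw):
    raw = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None     # leave unparseable dates as NULL and review them later

def load_assessments(csv_path, db_path="adni_local.db"):
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS assessment (
                       subject_id TEXT, exam_date TEXT, mmse INTEGER)""")
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            exam_date = parse_date(row.get("EXAMDATE", ""))      # column names only illustrative
            mmse_raw = row.get("MMSCORE", "").strip()
            mmse = int(mmse_raw) if mmse_raw.isdigit() else None
            con.execute("INSERT INTO assessment VALUES (?, ?, ?)",
                        (row.get("RID"), exam_date, mmse))
    con.commit()
    con.close()

# load_assessments("MMSE.csv")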
I am not so sure about the data themselves (even if they become totally anonymised), I will have to check, but in any case, going from "I have nothing" to "I have a database of multi-modal data from 800 subjects that is more realistic than test data" has got to be worth the trouble of converting the CSVs.

Looking forward to hearing from you,

Athanasios Anastasiou