Dear all,

I have published a first prototype of the "Medical Microdata Compendium", a 
collection of open medical and pharmacological datasets with markup conforming 
to the recently updated schema.org vocabulary and the microdata format. The long-term goal 
of this project is to provide structured medical and pharmacological 
information to search engines to enable better decision making by doctors and 
patients. The far more humble short-term goal is to research how microdata can 
be used for retrieving and querying biomedical information, and to come up with 
interesting demonstrations and use-cases.

The data can be viewed here:

http://samwald.info/medical_microdata/

At the moment this is a flat list of web pages, with each page describing a 
formulated pharmaceutical or a substance. The data were derived from the 
DailyMed and DrugBank datasets from the LODD collection. 

Example of a DrugBank resource:
http://samwald.info/medical_microdata/drugbank_resource_drugs_DB00175.html

Example of a DailyMed resource:
http://samwald.info/medical_microdata/dailymed_resource_drugs_3580.html

You can extract the structured data from these pages with a variety of tools. 
For example, you can use the Sindice inspector:
http://inspector.sindice.com/inspect?url=http%3A%2F%2Fsamwald.info%2Fmedical_microdata%2Fdrugbank_resource_drugs_DB00175.html
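
As a rough illustration of what such extraction tools do, here is a minimal Python sketch that pulls microdata items out of an HTML string using only the standard library. It is a simplified reading of the microdata model (no itemref or itemid support, limited tolerance for unclosed tags), not a full implementation. The names MicrodataParser, extract_microdata and SAMPLE_HTML are my own, and the sample markup is hypothetical: it is not copied from the compendium pages and only mimics a schema.org Drug item with a nested MedicalCode.

```python
import json
from html.parser import HTMLParser

# Elements without a closing tag; they are never pushed on the element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Simplified subset: elements whose property value comes from an attribute.
VALUE_ATTRS = {"meta": "content", "a": "href", "link": "href", "area": "href",
               "img": "src", "audio": "src", "video": "src", "source": "src",
               "time": "datetime"}


class MicrodataParser(HTMLParser):
    """Collects top-level microdata items as nested dictionaries."""

    def __init__(self):
        super().__init__()
        self.items = []   # top-level items found so far
        self._stack = []  # one frame per open (non-void) element

    def _current_item(self):
        # The nearest enclosing element that carries an itemscope attribute.
        for frame in reversed(self._stack):
            if frame["item"] is not None:
                return frame["item"]
        return None

    @staticmethod
    def _add(item, names, value):
        # itemprop may hold several space-separated property names.
        for name in names.split():
            item["properties"].setdefault(name, []).append(value)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        owner = self._current_item()
        names = a.get("itemprop")
        frame = {"tag": tag, "item": None, "owner": owner,
                 "names": names, "text": None}
        if "itemscope" in a:
            frame["item"] = {"type": a.get("itemtype"), "properties": {}}
            if names and owner is not None:
                self._add(owner, names, frame["item"])  # nested item
            else:
                self.items.append(frame["item"])        # top-level item
        elif names and owner is not None:
            attr = VALUE_ATTRS.get(tag)
            if attr is not None and attr in a:
                self._add(owner, names, a[attr] or "")
            elif tag not in VOID_TAGS:
                frame["text"] = []  # value is the element's text content
        if tag not in VOID_TAGS:
            self._stack.append(frame)

    def handle_data(self, data):
        # Text belongs to every open property element that encloses it.
        for frame in self._stack:
            if frame["text"] is not None:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if tag not in [f["tag"] for f in self._stack]:
            return  # stray end tag or void element
        while self._stack:
            frame = self._stack.pop()
            if frame["text"] is not None:
                value = "".join(frame["text"]).strip()
                self._add(frame["owner"], frame["names"], value)
            if frame["tag"] == tag:
                break


def extract_microdata(html):
    """Return the list of top-level microdata items found in an HTML string."""
    parser = MicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical markup for illustration only; not taken from the compendium.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Drug">
  <h1 itemprop="name">Example drug</h1>
  <a itemprop="url" href="http://example.org/drug/1">link</a>
  <div itemprop="code" itemscope itemtype="http://schema.org/MedicalCode">
    <span itemprop="codeValue">X0001</span>
    <meta itemprop="codingSystem" content="ExampleSystem">
  </div>
</div>
"""

if __name__ == "__main__":
    print(json.dumps(extract_microdata(SAMPLE_HTML), indent=2))
```

To try this on one of the pages listed above, the HTML could be fetched with urllib.request.urlopen and the decoded text passed to extract_microdata. I have not checked how closely the output would match what dedicated tools report for those pages.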

At the moment I am evaluating how well different search engines cope with the 
data. For example, the microdata can already be used by Google Custom Search 
Engines. Other 'semantic' search engines such as http://sindice.com/ or the 
medical search engine developed by the http://khresmoi.eu/ project should also 
be evaluated. 

If you are interested in joining the effort to evaluate how semantic markup can 
be used to improve medical information search and decision making, please send 
me an e-mail! I would like to see this work published as a journal paper, and 
could use some co-authors. I appreciate any feedback or ideas!

Regarding the Medical Microdata Compendium, there are several issues that still 
need to be taken care of:

1) The DailyMed resources are still riddled with character encoding issues -- 
this is a problem in the LODD data source and will be remedied by switching to 
a newer version of this dataset, Richard's 'Linked Structured Product Labels'.
2) Only a fraction of the properties of the source datasets have been mapped, 
namely those where a close fit between a property in the source dataset and 
schema.org could be found. This means that a lot of useful data is not 
captured. I will look into using the proposed schema.org extension mechanism to 
see if it could help to capture these additional properties and types.
3) More datasets need to be converted, such as ClinicalTrials.gov (and its 
linked data mirror http://linkedct.org/). This will also help to better 
demonstrate interlinking of different datasets (e.g., from disease to drug to 
ongoing clinical trials in the area).
4) The generation of http://schema.org/MedicalCode entities needs to be fixed. 
Also, we need to check how we can align with controlled vocabularies that 
already have URIs (e.g., BioPortal taxonomies).
5) General clean-up, code formatting, and improvement of the web design.

Cheers,
Matthias Samwald
