Medical Microdata Compendium (Open Biomedical Datasets with schema.org annotation) -- was: Re: New proposal: health & medical extensions to schema.org

Matthias Samwald Wed, 04 Jul 2012 14:23:51 -0700

Dear all,

I published a first prototype of the "Medical Microdata Compendium", a 
collection of open medical and pharmacological datasets with markup conforming 
to the recently updated schema.org and the microdata format. The long-term goal 
of this project is to provide structured medical and pharmacological 
information to search engines to enable better decision making by doctors and 
patients. The far more humble short-term goal is to research how microdata can 
be used for retrieving and querying biomedical information, and to come up with 
interesting demonstrations and use-cases.

The data can be viewed here:

http://samwald.info/medical_microdata/

At the moment this is a flat list of web pages, with each page describing a
formulated pharmaceutical or a substance. The data were derived from the
DailyMed and DrugBank datasets from the LODD collection.

Example of a DrugBank resource:
http://samwald.info/medical_microdata/drugbank_resource_drugs_DB00175.html

Example of a DailyMed resource:
http://samwald.info/medical_microdata/dailymed_resource_drugs_3580.html

You can extract the structured data from these pages with a variety of tools.
For example, you can use the Sindice inspector:
http://inspector.sindice.com/inspect?url=http%3A%2F%2Fsamwald.info%2Fmedical_microdata%2Fdrugbank_resource_drugs_DB00175.html
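
For readers who would rather see what such an extraction tool does than use
a hosted service, here is a minimal sketch in Python (standard library only).
It is not the code behind the Compendium or the Sindice inspector. It covers
only a subset of the microdata algorithm (itemscope, itemtype, itemprop, with
values taken from text, href, src or content) and ignores itemref and itemid.
The sample markup and all names in the code are made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"meta", "link", "img", "br", "hr", "input", "area", "base",
             "col", "embed", "source", "track", "wbr"}
# Elements whose itemprop value comes from an attribute instead of text.
VALUE_ATTR = {"a": "href", "link": "href", "area": "href",
              "img": "src", "meta": "content"}


class MicrodataParser(HTMLParser):
    """Collects items as {"type": ..., "properties": {name: [values]}}."""

    def __init__(self):
        super().__init__()
        self.items = []   # top-level items found so far
        self._stack = []  # one frame per open (non-void) element

    def _current_item(self):
        for frame in reversed(self._stack):
            if frame["item"] is not None:
                return frame["item"]
        return None

    @staticmethod
    def _add(owner, names, value):
        if owner is not None:
            for name in names:
                owner["properties"].setdefault(name, []).append(value)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        owner = self._current_item()
        names = (a.get("itemprop") or "").split()
        item, text = None, None
        if "itemscope" in a:
            item = {"type": a.get("itemtype"), "properties": {}}
            if names and owner is not None:
                self._add(owner, names, item)   # nested item
            else:
                self.items.append(item)         # top-level item
        elif names:
            if tag in VALUE_ATTR:
                self._add(owner, names, a.get(VALUE_ATTR[tag]))
            else:
                text = []                       # value is the text content
        if tag not in VOID_TAGS:
            self._stack.append({"tag": tag, "item": item, "owner": owner,
                                "names": names, "text": text})

    def handle_data(self, data):
        for frame in self._stack:
            if frame["text"] is not None:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        open_tags = [f["tag"] for f in self._stack]
        if tag not in open_tags:
            return  # stray closing tag, ignore
        # Close everything up to and including the matching element.
        last = len(open_tags) - 1 - open_tags[::-1].index(tag)
        while len(self._stack) > last:
            frame = self._stack.pop()
            if frame["text"] is not None:
                value = " ".join("".join(frame["text"]).split())
                self._add(frame["owner"], frame["names"], value)


def extract_microdata(html):
    parser = MicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Made-up sample markup, not copied from the Compendium pages.
SAMPLE = """
<div itemscope itemtype="http://schema.org/Drug">
  <h1 itemprop="name">Example Drug</h1>
  <a itemprop="url" href="http://example.org/drug">link</a>
  <div itemprop="code" itemscope itemtype="http://schema.org/MedicalCode">
    <span itemprop="codeValue">X123</span>
  </div>
</div>
"""

if __name__ == "__main__":
    print(extract_microdata(SAMPLE))
```

To try it on one of the pages above, you could download the HTML first (for
example with urllib.request) and pass the decoded text to extract_microdata().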

At the moment I am evaluating how different search engines can cope with the
data. For example, the microdata can already be used by Google Custom Search
Engines. Other 'semantic' search engines such as http://sindice.com/ or the
medical search engine developed by the http://khresmoi.eu/ project should also
be evaluated.

If you are interested in joining the effort to evaluate how semantic markup can
be used to improve medical information search and decision making, please send
me an e-mail! I would like to see this work published as a journal paper, and
could use some co-authors. I appreciate any feedback or ideas!

Regarding the Medical Microdata Compendium, there are several issues that still
need to be taken care of:

1) The DailyMed resources are still riddled with character encoding issues --
this is a problem with the LODD data source and will be remedied by switching
to a newer version of this dataset, Richard's 'Linked Structured Product Labels'.
2) Only a fraction of the properties of the source datasets have been mapped,
namely those where a close fit between a property in the source dataset and
schema.org could be found. This means that a lot of useful data is not
captured. I will look into using the proposed schema.org extension mechanism to
see if it could help to capture these additional properties and types.
3) More datasets need to be converted, such as ClinicalTrials.gov (and its
linked data mirror http://linkedct.org/). This will also help to better
demonstrate interlinking of different datasets (e.g., from disease to drug to
ongoing clinical trials in the area).
4) The generation of http://schema.org/MedicalCode entities needs to be fixed.
Also, we need to check how we can align with controlled vocabularies that
already have URIs (e.g., BioPortal taxonomies).
5) General clean-up, code formatting and improvement of web design
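
To make item 4 a bit more concrete, here is a rough sketch of what emitting
a MedicalCode entity as microdata could look like. It is not the generator
currently used for the Compendium. The function name, its parameters and the
placeholder values are assumptions made for illustration. codeValue and
codingSystem are the properties schema.org defines for MedicalCode, and the
optional url link is one possible way to point at an existing vocabulary URI.

```python
from html import escape


def medical_code_microdata(code_value, coding_system, prop="code", uri=None):
    """Return an HTML fragment describing one schema.org MedicalCode item.

    `prop` is the itemprop that attaches the code to its enclosing item;
    `uri` optionally links to an existing URI for the same concept.
    """
    parts = [
        f'<span itemprop="{escape(prop)}" itemscope '
        'itemtype="http://schema.org/MedicalCode">',
        f'<meta itemprop="codeValue" content="{escape(code_value)}">',
        f'<meta itemprop="codingSystem" content="{escape(coding_system)}">',
    ]
    if uri is not None:
        parts.append(f'<link itemprop="url" href="{escape(uri)}">')
    parts.append("</span>")
    return "".join(parts)


if __name__ == "__main__":
    # Placeholder values, not taken from any real coding system.
    print(medical_code_microdata("X123", "ExampleSystem",
                                 uri="http://example.org/vocab/X123"))
```

Escaping every value before it goes into an attribute keeps the markup
well-formed even when a source string contains quotes or ampersands.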

Cheers,
Matthias Samwald
