Author: rwesten
Date: Thu Mar 29 16:33:15 2012
New Revision: 1306969
URL: http://svn.apache.org/viewvc?rev=1306969&view=rev
Log:
added additional technical details about the demo
Modified:
incubator/stanbol/trunk/demos/ehealth/README.md
Modified: incubator/stanbol/trunk/demos/ehealth/README.md
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/demos/ehealth/README.md?rev=1306969&r1=1306968&r2=1306969&view=diff
==============================================================================
--- incubator/stanbol/trunk/demos/ehealth/README.md (original)
+++ incubator/stanbol/trunk/demos/ehealth/README.md Thu Mar 29 16:33:15 2012
@@ -25,7 +25,7 @@ This demo uses the following datasets:
* __[Dailymed](http://dailymed.nlm.nih.gov/dailymed/)__([RDF
version](http://www4.wiwiss.fu-berlin.de/dailymed/)): Published by the
National Library of Medicine, this dataset provides high quality information
about marketed drugs.
* __[SIDER](http://sideeffects.embl.de/)__([RDF
version](http://www4.wiwiss.fu-berlin.de/sider)): SIDER contains information on
marketed drugs and their adverse effects. The information is extracted from
public documents and package inserts.
-*
__[Diseasome](http://www.nd.edu/~networks/Publication%20Categories/03%20Journal%20Articles/Biology/HumanDisease_PNAS-V104-p8685(14My07).pdf)__([RDF
version](http://www4.wiwiss.fu-berlin.de/diseasome)): The human disease
network publishes a network of 4,300 disorders and disease genes linked by
known disorder-gene associations for exploring all known phenotype and disease
gene associations, indicating the common genetic origin of many diseases.
+*
__[Diseasome](http://www.nd.edu/~networks/Publication%20Categories/03%20Journal%20Articles/Biology/HumanDisease_PNAS-V104-p8685%2814My07%29.pdf)__([RDF
version](http://www4.wiwiss.fu-berlin.de/diseasome)): The human disease
network publishes a network of 4,300 disorders and disease genes linked by
known disorder-gene associations for exploring all known phenotype and disease
gene associations, indicating the common genetic origin of many diseases.
* __[DrugBank](http://www.drugbank.ca/)__([RDF
version](http://www4.wiwiss.fu-berlin.de/drugbank)): A repository of almost
5000 FDA-approved small molecule and biotech drugs. It contains detailed
information about drugs including chemical, pharmacological and pharmaceutical
data; along with comprehensive drug target data such as sequence, structure,
and pathway information.
Note that the RDF versions of the datasets used by this demo are hosted by
the [Freie Universität
Berlin](http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/)
@@ -65,10 +65,13 @@ After that the you will be able to
* use the datasets with the [Stanbol
Entityhub](http://localhost:8080/entityhub/site/ehealth/) (url:
http://{host}:{port}/{alias}/entityhub/site/ehealth)
* extract ehealth related terms by using the [Stanbol
Enhancer](http://localhost:8080/enhancer/chain/ehealth) (url:
http://{host}:{port}/{alias}/enhancer/chain/ehealth)
+---
-## Backround information about this demo
+__NOTE__: The remaining part of this document provides detailed information
about this demo and explains how to customize it further to specific needs.
Users who only want to use this demo do not need to read this part.
-### Indexing
+---
+
+## Indexing
The configuration used for indexing can be found at
@@ -81,7 +84,7 @@ It contains of the following parts:
* __fieldboost.properties__: configuration for the field boosts. TODO: link to
docs
* __ehealth/__: the SolrCore configuration used for indexing. This is used in
this example to customize how Solr indexes labels and IDs. See the following
section for details.
-#### Customizing the Solr Schema used for indexing
+### Customizing the Solr Schema used for indexing
The default SolrCore configuration used by the Apache Entityhub is contained
in the SolrYard module and can be found
[here](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/yard/solr/src/main/resources/solr/core/default.solrindex.zip).
This configuration will be used if no customized configuration is present in
"{indexing-root}/indexing/config/{name}" where {name} refers to the value of
the property "name" in the "indexing.properties".
@@ -131,4 +134,143 @@ Such field types are than applied to spe
<field name="@/drugbank:smilesStringIsomeric/" type="string" indexed="true"
stored="true" multiValued="true"/>
[...]
-Field
+The defined field names must include the prefixes used by the Apache Entityhub
to represent RDF types. In this case '@' refers to a plain literal without a
defined language and '/' is used as separator between the prefix, property and
postfix.
+
+### Customized Mappings for the used Datasets
+
+Field mappings are configured by the "mappings.txt" file in the
"{indexing-root}/indexing/config" directory.
+
+NOTE that for this demo the "mappings.txt" file is located at
"./src/main/indexing/config/mappings.txt" and copied by the "./indexing.sh"
script to the "./target/indexing/indexing/config" folder. Users who want to
modify the mappings should edit the mappings.txt file under "./src".
+
+This demo defines a lot of mappings, but many of them could be omitted
because they only validate data types. Some of those data type mappings are
shown below.
+
+ diseasome:geneId | d=xsd:anyURI
+ drugbank:creationDate | d=xsd:dateTime
+ drugbank:patientInformationInsert | d=xsd:anyURI
+
+Data type mappings are only needed if the dataset does not correctly specify
the XSD datatype for literal values. Typically this happens for numbers that
are stored as plain literals.
+
+More important are field mappings such as the following, which map the name
properties of the datasets to SKOS preferred labels.
+
+ drugbank:genericName > skos:prefLabel
+ diseasome:name > skos:prefLabel
+ dailymed:fullName > skos:prefLabel
+
+This specific set of mappings makes it possible to search for entities of the
three different datasets by using one and the same property. This is extremely
useful for finding those entities in text passed to the Enhancer, because one
only needs to configure a single KeywordExtractionEngine instance to cover them
all.
+
+A similar configuration is used for the various IDs specified for drugs. Those
are all mapped to the "skos:notation" field. This makes it easy to identify a
drug regardless of which ID is known by the user or mentioned in a text. Here
are those mappings.
+
+ drugbank:ahfsCode | d=xsd:string > skos:notation
+ drugbank:atcCode | d=xsd:string > skos:notation
+ drugbank:dpdDrugIdNumber | d=xsd:string > skos:notation
+ drugbank:pdbHomologyId | d=xsd:string > skos:notation
+ drugbank:inchiKey | d=xsd:string > skos:notation
+ drugbank:primaryAccessionNo | d=xsd:string > skos:notation
+ drugbank:secondaryAccessionNumber | d=xsd:string > skos:notation
+
+Note also the wildcard mappings for the namespaces used:
+
+ dailymed:*
+ drugbank:*
+ diseasome:*
+ sider:*
+
+These wildcard mappings ensure that all properties of those namespaces get
indexed. They also ensure that even if a mapping like
+
+ drugbank:genericName > skos:prefLabel
+
+is defined, the original property
+
+ drugbank:genericName
+
+will be present in the indexed dataset. Without those wildcard mappings one
would need to explicitly define both
+
+ drugbank:genericName > skos:prefLabel
+ drugbank:genericName
+
+to get the same result.
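
The effect of the wildcard rules can be illustrated with a small Python model. This is only a sketch of the behaviour described above, not the Entityhub implementation: the function `apply_mappings`, the rule representation as `(pattern, target)` pairs and the sample entity are all made up for illustration.

```python
from fnmatch import fnmatchcase


def apply_mappings(entity, mappings):
    """Apply (pattern, target) rules to an entity given as {field: [values]}.

    Simplified model (an assumption, not the real mapping language):
    a rule with target None copies every matching field under its own name,
    a rule with a target copies the matching values to the target field.
    Fields that match no rule are dropped.
    """
    indexed = {}
    for pattern, target in mappings:
        for field, values in entity.items():
            if fnmatchcase(field, pattern):
                indexed.setdefault(target or field, []).extend(values)
    return indexed


# Made-up sample entity
entity = {
    "drugbank:genericName": ["Aspirin"],
    "drugbank:atcCode": ["N02BA01"],
    "rdfs:comment": ["not covered by any rule"],
}

with_wildcard = apply_mappings(entity, [
    ("drugbank:genericName", "skos:prefLabel"),
    ("drugbank:*", None),
])
without_wildcard = apply_mappings(entity, [
    ("drugbank:genericName", "skos:prefLabel"),
])

print(sorted(with_wildcard))
print(sorted(without_wildcard))
```

In this model the wildcard rule keeps "drugbank:genericName" next to the mapped "skos:prefLabel", while the rule set without the wildcard only produces "skos:prefLabel".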
+
+### LDPath mappings
+
+While the default mapping language supports a lot of use cases for mapping,
converting and filtering properties, it is by far not as capable as
[LDPath](http://code.google.com/p/ldpath/). Because of that the indexing tool
also supports using LDPath to process entities by means of the
"LdpathProcessor".
+
+A typical configuration of this processor (in the "indexing.properties" file)
would look like
+
+
org.apache.stanbol.entityhub.indexing.core.processor.LdpathProcessor,ldpath:ldpath-mapping.txt,append:true;
+
+This configuration says that the LDPath program is read from a file with the
name "ldpath-mapping.txt" within the same directory and that the results of the
transformation are appended to the indexed entity. If append is deactivated,
the data of the processed entity are replaced by the results of the LDPath
statements.
+
+A typical use case for the LdpathProcessor is a type specific mapping such as
+
+ skos:prefLabel = .[rdf:type is diseasome:genes]/rdfs:label;
+
+This specifies that only for entities of the type "diseasome:genes" the
rdfs:label is mapped to skos:prefLabel.
+
+
+__NOTES__:
+
+* The LdpathProcessor only has access to the local properties of the currently
indexed entity. LDPath statements that refer to other information, such as
paths with a length > 1 or inverse properties, will not work.
+* Processors can be chained by defining multiple Processor instances in the
configuration and separating them with ';'. This makes it possible to use
multiple LdpathProcessor instances and/or to chain LdpathProcessor(s) with
others such as the "FiledMapperProcessor". Processors are executed in the order
in which they are defined in the configuration of the "entityProcessor"
property.
+* When using the FiledMapperProcessor on results of the LdpathProcessor make
sure that the fields defined in the LDPath statements are indexed by the
FiledMapperProcessor. Otherwise such values will NOT be indexed!
+
+
+### Indexing Datasets separately
+
+This demo indexes all four datasets in a single step. However, this is not
required: it is possible to index different datasets with different indexing
configurations into the same target. This section describes how this can be
achieved and why users might want to do this.
+
+This demo uses Solr as the target for the indexing process. In principle other
targets are possible, but Solr is currently the only available
IndexingDestination implementation. The SolrIndex used to store the data is
located at "{indexing-root}/indexing/destination/indexes/default/{name}". If
this directory does not already exist, it is initialized by the indexing tool
based on the SolrCore configuration in "{indexing-root}/indexing/config/{name}",
or on the default SolrCore configuration if none is present. However, if it
already exists, then this core is used and the data of the current indexing
process are added to the existing SolrCore.
+
+Because of that it is possible to subsequently add information from different
datasets to the same SolrIndex. However, users need to know that if the
different datasets contain the same entity (a resource with the same URI), the
information of the second dataset will replace that of the first. Nonetheless,
for the given demo this would allow separate configurations (e.g. mappings) to
be created for all four datasets while still ensuring that the indexed data are
contained in the same SolrIndex.
+
+This might be useful in situations where the same property (e.g. rdfs:label)
is used by the different datasets in different ways. In that case one could
create a mapping for dataset1 that maps rdfs:label > skos:prefLabel and a
mapping for dataset2 that maps rdfs:label > skos:altLabel.
+
+Workflows like that can easily be implemented with shell scripts or by setting
soft links in the file system.
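
The replacement behaviour described in this section can be modelled in a few lines of Python. This is a toy sketch and not the indexing tool itself: the dictionary stands in for the SolrIndex, and `index_dataset` as well as the example URIs and fields are hypothetical.

```python
def index_dataset(index, dataset):
    """Add one dataset ({uri: {field: [values]}}) to an existing index.

    Models the behaviour described above: an entity whose URI is already
    present is replaced as a whole, nothing is merged.
    """
    for uri, fields in dataset.items():
        index[uri] = dict(fields)
    return index


index = {}  # stands in for the shared SolrIndex

# first indexing run (hypothetical dataset1)
index_dataset(index, {
    "http://example.org/drug/1": {"skos:prefLabel": ["first"], "ex:onlyInFirst": ["x"]},
})

# second indexing run (hypothetical dataset2) into the same target
index_dataset(index, {
    "http://example.org/drug/1": {"skos:altLabel": ["second"]},
    "http://example.org/drug/2": {"skos:altLabel": ["other"]},
})

print(index["http://example.org/drug/1"])
print(sorted(index))
```

After the second run the entity "http://example.org/drug/1" only carries the data of the second dataset, which is why datasets that share URIs need some care when they are indexed separately.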
+
+### Entity Filters
+
+Often users will only be interested in specific Entities of a dataset (e.g.
only in Drugs but not in drug interactions, genes, side effects, ...). In such
cases Entity Filters can be used to specify what entities should be indexed and
what entities can be safely ignored.
+
+This can be achieved by using the "FieldValueFilter", which is actually a
special implementation of an EntityProcessor. It is included by default within
the "indexing.properties" configuration, but it is deactivated by the default
configuration within the "entityTypes.properties". Detailed information on how
to correctly configure this filter is provided within the
"entityTypes.properties" file. To give an example, the following configuration
would index only drugs (of all datasets), diseases and organizations. All other
entities such as sider:side_effects and dailymed:ingredients would be skipped.
+
+ field=rdf:type
+ values= drugbank:drugs; dailymed:drugs; sider:drugs; tcm:Medicine;
diseasome:diseases; dailymed:organization
+
+A FieldValueFilter instance supports only a single field with a list of
values, and entities are selected if they match at least one of the defined
values. Users who need to filter on several fields can configure multiple
instances. This is achieved by adding the "FieldValueFilter" multiple times as
entityProcessor in the "indexing.properties" file, each time with a different
config parameter. Here is an example of such a configuration
+
+
entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:filter1;org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:filter2;org.apache.stanbol.entityhub.indexing.core.processor.FiledMapperProcessor
+
+Make sure that the "{indexing-root}/indexing/config" directory contains both a
"filter1.properties" and a "filter2.properties" file with the corresponding
filter rules. Only Entities that pass both filters will be indexed.
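
The selection logic of chained filters can be sketched in Python as follows. This is only a model of the behaviour described above, not the FieldValueFilter code: the helper names, the second filter on the field "ex:status" and the sample entities are assumptions made for illustration.

```python
def parse_values(line):
    """Split a ';' separated value list as used in the filter configuration."""
    return {value.strip() for value in line.split(";") if value.strip()}


def passes(entity, field, allowed):
    """An entity passes one filter if any value of the field is allowed."""
    return any(value in allowed for value in entity.get(field, []))


def passes_all(entity, filters):
    """Chained filters: the entity must pass every single filter."""
    return all(passes(entity, field, allowed) for field, allowed in filters)


filters = [
    # models filter1.properties
    ("rdf:type", parse_values("drugbank:drugs; dailymed:drugs; sider:drugs")),
    # models filter2.properties (hypothetical field and value)
    ("ex:status", parse_values("approved")),
]

drug = {"rdf:type": ["drugbank:drugs"], "ex:status": ["approved"]}
side_effect = {"rdf:type": ["sider:side_effects"], "ex:status": ["approved"]}
unapproved = {"rdf:type": ["dailymed:drugs"], "ex:status": ["withdrawn"]}

for name, entity in [("drug", drug), ("side_effect", side_effect), ("unapproved", unapproved)]:
    print(name, passes_all(entity, filters))
```

Within one filter the values are alternatives (a match on any of them is enough), while across filters all conditions have to hold.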
+
+## Querying and traversing the ehealth dataset
+
+This section assumes that this demo is running on an Apache Stanbol server
(version 0.9.0-incubating or later). Readers who do not run their own server
or have not yet installed this demo are encouraged to do so. Alternatively you
can use the [Stanbol test server](http://dev.iks-project.eu:8081) hosted by the
IKS project. Note, however, that all links in this section point to
"http://localhost:8080", so you will need to adapt the commands accordingly.
+
+### Traversing owl:sameAs
+
+Sider, DrugBank and Dailymed are interlinked with each other, but each defines
its own set of properties. The following example shows how to collect
information about a drug by following the "owl:sameAs" relations defined
between Dailymed, Sider and DrugBank.
+
+ name = dailymed:name;
+ activeIngredient = dailymed:activeIngredient/rdfs:label;
+ indication = dailymed:indication;
+ dosage = dailymed:dosage;
+ adverseReaction = dailymed:adverseReaction;
+ warning = dailymed:boxedWarning;
+ contraindication = dailymed:contraindication;
+
+ sideEffect = (owl:sameAs)+/sider:sideEffect/rdfs:label;
+
+ genericName = (owl:sameAs)+/drugbank:genericName;
+ key = (owl:sameAs)+/drugbank:inchiKey;
+ indication = (owl:sameAs)+/drugbank:indication;
+ foodInteraction = (owl:sameAs)+/drugbank:foodInteraction;
+ toxicity = (owl:sameAs)+/drugbank:toxicity;
+ pharmacology = (owl:sameAs)+/drugbank:pharmacology;
+
+Here [LDPath](http://code.google.com/p/ldpath/) is used to collect the
interesting information. "(owl:sameAs)+" is used to build the transitive
closure over the "owl:sameAs" properties. This LDPath program assumes that the
context is an entity of the type "dailymed:drugs".
+
+LDPath statements like that can be used with the
+
+* [ehealth/ldpath](http://localhost:8080/entityhub/site/ehealth/ldpath)
endpoint to request the information for a single drug
+* [ehealth/find](http://localhost:8080/entityhub/site/ehealth/find) endpoint
to search for "dailymed:name" (make sure the language field is empty if you use
the UI)
+* [ehealth/query](http://localhost:8080/entityhub/site/ehealth/query) endpoint
to make any kind of field queries.
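
As a rough sketch, a request to the ehealth/ldpath endpoint could be prepared from Python as shown below. The form parameter names "context" and "ldpath", the "Accept" header and the placeholder context URI are assumptions that should be checked against the documentation page served by the endpoint itself; the code only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://localhost:8080/entityhub/site/ehealth/ldpath"


def build_ldpath_request(context_uri, program, endpoint=ENDPOINT):
    """Build a form-encoded POST request carrying an LDPath program.

    Assumption: the endpoint accepts the parameters "context" (URI of the
    entity the program is executed on) and "ldpath" (the program text).
    """
    body = urlencode({"context": context_uri, "ldpath": program}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={"Accept": "application/json"},
        method="POST",
    )


# two statements taken from the LDPath program shown above
program = (
    "name = dailymed:name;\n"
    "sideEffect = (owl:sameAs)+/sider:sideEffect/rdfs:label;"
)

# placeholder URI: replace it with the URI of a real dailymed:drugs entity
request = build_ldpath_request("http://example.org/placeholder-drug", program)
print(request.get_method(), request.full_url)

# To send the request against a running server:
#   from urllib.request import urlopen
#   print(urlopen(request).read().decode("utf-8"))
```

When using the IKS test server instead of a local installation only the value of `ENDPOINT` needs to change.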
+