README.md

rwesten Thu, 29 Mar 2012 09:33:41 -0700

Author: rwesten
Date: Thu Mar 29 16:33:15 2012
New Revision: 1306969

URL: http://svn.apache.org/viewvc?rev=1306969&view=rev
Log:
added additional technical details about the demo


Modified:
    incubator/stanbol/trunk/demos/ehealth/README.md

Modified: incubator/stanbol/trunk/demos/ehealth/README.md
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/trunk/demos/ehealth/README.md?rev=1306969&r1=1306968&r2=1306969&view=diff
==============================================================================
--- incubator/stanbol/trunk/demos/ehealth/README.md (original)
+++ incubator/stanbol/trunk/demos/ehealth/README.md Thu Mar 29 16:33:15 2012
@@ -25,7 +25,7 @@ This demo uses the following datasets:
 
 * __[Dailymed](http://dailymed.nlm.nih.gov/dailymed/)__([RDF 
version](http://www4.wiwiss.fu-berlin.de/dailymed/))): Published by the 
National Library of Medicine, this dataset provides high quality information 
about marketed drugs.
 * __[SIDER](http://sideeffects.embl.de/)__([RDF 
version](http://www4.wiwiss.fu-berlin.de/sider)): SIDER contains information on 
marketed drugs and their adverse effects. The information is extracted from 
public documents and package inserts.
-* 
__[Diseasome](http://www.nd.edu/~networks/Publication%20Categories/03%20Journal%20Articles/Biology/HumanDisease_PNAS-V104-p8685(14My07).pdf)__([RDF
 version](http://www4.wiwiss.fu-berlin.de/diseasome)): The human disease 
network publishes a network of 4,300 disorders and disease genes linked by 
known disorder-gene associations for exploring all known phenotype and disease 
gene associations, indicating the common genetic origin of many diseases.
+* 
__[Diseasome](http://www.nd.edu/~networks/Publication%20Categories/03%20Journal%20Articles/Biology/HumanDisease_PNAS-V104-p8685%2814My07%29.pdf)__([RDF
 version](http://www4.wiwiss.fu-berlin.de/diseasome)): The human disease 
network publishes a network of 4,300 disorders and disease genes linked by 
known disorder-gene associations for exploring all known phenotype and disease 
gene associations, indicating the common genetic origin of many diseases.
 * __[DrugBank](http://www.drugbank.ca/)__([RDF 
version](http://www4.wiwiss.fu-berlin.de/drugbank)): A repository of almost 
5000 FDA-approved small molecule and biotech drugs. It contains detailed 
information about drugs including chemical, pharmacological and pharmaceutical 
data; along with comprehensive drug target data such as sequence, structure, 
and pathway information.
 
 Note that the RDF versions of this dataset used by this dataset is hosted [by 
the [Freie UniversitÃ¤t 
Berlin](http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/)
@@ -65,10 +65,13 @@ After that the you will be able to 
 * use the datasets with the [Stanbol 
Entityub](http://localhost:8080/entityhub/site/ehealth/)(url: 
http:{host}:{port}/{alias}/entityhub/site/ehealth)
 * extract ehealth related terms by using the [Stanbol 
Enhancer](http://localhost:8080/enhancer/chain/ehealth) (url: 
http:{host}:{port}/{alias}/enhancer/chain/ehealth)
 
+---
 
-## Backround information about this demo
+__NOTE__: The remaining part of this document provides detailed information 
about this demo and provides information on how to customize it further to 
specific needs. Users that want only use this demo will not need to read this 
part.
 
-### Indexing
+---
+
+## Indexing
 
 The configuration used for indexing can be found at
 
@@ -81,7 +84,7 @@ It contains of the following parts:
 * __fieldboost.properties__: configuration for the field boosts. TODO: link to 
dock
 * __ehealth/__: the SolrCore configuration used for indexing. This is used in 
this example to customize how Solr indexes labels and ID. See the following 
section for details.
 
-#### Customizing the Solr Schema used for indexing
+### Customizing the Solr Schema used for indexing
 
 The default SolrCore configuration used by the Apache Entityhub is contained 
in the SolrYard module and can be found 
[here](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/yard/solr/src/main/resources/solr/core/default.solrindex.zip).
 This configuration will be used if no customized configuration is present in 
"{indexing-root}/indexing/config/{name}" where {name} refers to the value of 
the property "name" in the "indexing.properties".
 
@@ -131,4 +134,143 @@ Such field types are than applied to spe
    <field name="@/drugbank:smilesStringIsomeric/" type="string" indexed="true" 
stored="true" multiValued="true"/>
    [...] 
 
-Field
+The defined field names must include the prefixes used by the Apache Entityhub 
to represent RDF types. In this case '@' refers to a plain literal without a 
defined language and '/' is used as separator between the prefix, property and 
postfix.
+
+### Customized Mappings for the used Datasets
+
+Such mappings are configured by the "mappings.txt" file in the 
"{indexing-root}/indexing/config" directory. 
+
+NOTE that for this demo the "mapping.txt" file is located at 
"./src/main/indexing/conifg/mapping.txt" and copied by the "./indexing.sh" 
script to the "./target/indexing/indexing/config" folder. Users that want to 
modify the mappings should edit the mappings.txt file under "./src"!.
+
+While this demo defines a lot of mappings a lot of them could be omitted, 
because they do just validate data types. In the following some of those data 
types mappings are shown.
+
+    diseasome:geneId | d=xsd:anyURI
+    drugbank:creationDate | d=xsd:dateTime
+    drugbank:patientInformationInsert | d=xsd:anyURI
+
+Data type mappings are only needed if the dataset does not correctly specify 
the XSD datatype for literal values. Typically this happens for numbers that 
are stored as plain literals.
+
+Important are field mappings such as the following mappings for SKOS preferred 
labels.
+
+    drugbank:genericName > skos:prefLabel
+    diseasome:name > skos:prefLabel
+    dailymed:fullName > skos:prefLabel
+
+This specific set of mappings allow to search for entities of the three 
different datasets by using one and the same property. This is extremely useful 
for finding those entities form text parsed to the enhancer, because one needs 
only to configure a single KeywordExtractionEngine instance to cover them all.
+
+A similar configuration is used for the various IDs specified for drugs. Those 
are all mapped to the "skos:notation" field. This allows to easily identify 
them regardless of the ID known by the User or mentioned in an text. Here are 
those mappings.
+
+    drugbank:ahfsCode | d=xsd:string > skos:notation
+    drugbank:atcCode | d=xsd:string > skos:notation
+    drugbank:dpdDrugIdNumber | d=xsd:string > skos:notation
+    drugbank:pdbHomologyId | d=xsd:string > skos:notation
+    drugbank:inchiKey | d=xsd:string > skos:notation
+    drugbank:primaryAccessionNo | d=xsd:string > skos:notation
+    drugbank:secondaryAccessionNumber | d=xsd:string > skos:notation
+
+Note also the wildcard mappings for the used namespaces
+
+    dailymed:*
+    drugbank:*
+    diseasome:*
+    sider:*
+    
+that ensures that all properties of those namespaces get indexed. This also 
ensures that even if a mapping like 
+
+    drugbank:genericName > skos:prefLabel
+
+is defined also
+
+    drugbank:genericName
+
+will be present in the indexed dataset. Without those wildcard mappings one 
would need to explicitly define both
+
+    drugbank:genericName > skos:prefLabel
+    drugbank:genericName
+
+to get the same result.
+
+### LDPath mappings
+
+While the default mapping language supports a lot of use cases for mapping, 
converting and filtering of properties it is by far not as capable as 
[LDpath](http://code.google.com/p/ldpath/). Because of that the indexing tools 
has also support for using LDPath to process entities by using the 
"LdpathProcessor". 
+
+A typical configuration of this processor (in the "indexing.properties" file) 
would look like
+
+    
org.apache.stanbol.entityhub.indexing.core.processor.LdpathProcessor,ldpath:ldpath-mapping.txt,append:true;
+
+This configuration says that the LDPath program is read from a file with the 
name "ldpath-mapping.txt" within the same directory and that the results of the 
transformation are appended to the indexed entity. If append is deactivated 
that the data of the parsed entity will be replaced by the results of the 
LDPath statement.
+
+A typical usage example of the LdpathProcessor processor are type specific 
mappings such as
+
+    skos:prefLabel = .[rdf:type is diseasome:genes]/rdfs:label;
+
+This specifies that only for entities of the type "diseasome:genes" the 
rdfs:label is mapped to skos:prefLabel. 
+
+
+__NOTEs__:
+
+* The LdpathProcessor has only access to the local properties of the currently 
indexed entity. LDPath statements that refer other information such as paths 
with a lengths > 1 or inverse properties will not work
+* Processors can be chained by defining multiple Processor instances in the 
configuration and separating them with ';'. This allows to use multiple 
LdpathProcessor instances and/or to chain LdpathProcessor(s) with others such 
as the "FiledMapperProcessor". Processors are executed as defined within the 
configuration of the "entityProcessor" property. 
+* When using the FiledMapperProcessor on results of the LdpathProcessor make 
sure that the fields defined in the LDpath statements are indexed by the 
FiledMapperProcessor. Otherwise such values will NOT be indexed!
+
+
+### Indexing Datasets separately
+
+This demo indexes all four datasets in a single step. However this is not 
required. With a simple trick it is possible to index different datasets with 
different indexing configurations to the same target. This section describes 
how this could be achieved and why users might want to do this.
+
+This demo uses Solr as target for the indexing process. Theoretically there 
might be several possibility, but currently this is the only available 
IndexingDestination implementation. The SolrIdnex used to store the data is 
located at "{indexing-root}/indexing/destination/indexes/default/{name}. If 
this directory does not alread exist it is initialized by the indexing tool 
based on the SolrCore configuration in "{indexing-root}/indexing/config/{name}" 
or the default SolrCore configuration of not present. However if it already 
exists than this core is used and the data of the current indexing process are 
added to the existing SolrCore.
+
+Because of that is is possible to subsequently add information of different 
datasets to the same SolrIndex. However users need to know that if the 
different dataset contain the same entity (resource with the same URI) the 
information of the second dataset will replace those of the first. Nonetheless 
this would allow in the given demo to create separate configurations (e.g. 
mappings) for all four datasets while still ensuring the indexed data are 
contained in the same SolrIndex.
+
+This might be useful in situations where the same property (e.g. rdfs:label) 
is used by the different datasets in different ways. Because than one could 
create a mapping for dataset1 that maps rdfs:label > skos:prefLabel and for 
dataset2 an mapping that ensures that rdfs:label > skos:altLabel.
+
+Workflows like that can be easily implemented by shell scrips or by setting 
soft links in the file system.
+
+###  Entity Filters
+
+Often users will only be interested in specific Entities of a dataset (e.g. 
only in Drugs but not in drug interactions, genes, side effects â¦). In such 
cases Entity Filters can be used to specify what entities should be indexed and 
what entities can be safely ignored.
+
+This can be achieved by using the "FieldValueFilter" actually a special 
implementation of an EntityProcessor. It is included by default within the 
"indexing.properties" configuration, but it is deactivated by the default 
configuration within the "entityTypes.properties". Detailed information on how 
to correctly configure this filter are provided within the 
"entityTypes.properties" file. To give an example the following configuration 
would just index drugs (of all datasets), diseases and organizations. All other 
entities such as  sider:side_effects and dailymed:ingredients would be skipped.
+
+    field=rdf:type
+    values= drugbank:drugs; ailymed:drugs; sider:drugs; tcm:Medicine; 
diseasome:diseases; dailymed:organization
+
+FieldValueFilter supports only a single field/value combination and entities 
are selected if they do match at least a single of the defined values. Users 
that need to filter for several fields and/or multiple values can configure 
multiple instances. This is achieved by adding the "FieldValueFilter" multiple 
times as entityProcessor in the "indexing.properties" file but with different 
config parameters. Here is an example of such an configuration
+
+     
entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:filter1;org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:filter2;org.apache.stanbol.entityhub.indexing.core.processor.FiledMapperProcessor
+
+Make shure that the "{indexing-root}/indexing/config" contains both a 
"filter1.properties" and "filter2.properties" file with the according filter 
rules. Only Entities that pass both filters will be indexed.
+
+## Querying and traversing the ehealth dataset
+
+This section assumes that this demo is running on a Apache Stanbol server 
(version 0.9.0-incubating or later). Readers that do not run their own server 
or have not yet installed this demo are encouraged to do so. If you do not want 
to do that you can also use the [Stambol test 
server](http://dev.iks-project.eu:8081) hosted by the IKS project. However all 
the links used by this demo will point to "http://localhost:8080";. So you will 
need to edit the used commands.
+
+### Traversing owl:sameAs
+
+Sider, Drugbank and Dailymed are interlinked with each other but do define a 
lot of different sets of properties. The following example shows how to collect 
information about a drug based on following "owl:sameAs" relations defined 
in-between Dailymed, Sider and DrugBank. 
+
+    name = dailymed:name;
+    activeIngredient = dailymed:activeIngredient/rdfs:label;
+    indication = dailymed:indication;
+    dosage = dailymed:dosage;
+    adverseReaction = dailymed:adverseReaction;
+    warning = dailymed:boxedWarning;
+    contraindication = dailymed:contraindication;
+    
+    sideEffect = (owl:sameAs)+/sider:sideEffect/rdfs:label;
+          
+    genericName = (owl:sameAs)+/drugbank:genericName;
+    key = (owl:sameAs)+/drugbank:inchiKey;
+    indication = (owl:sameAs)+/drugbank:indication;
+    foodInteraction = (owl:sameAs)+/drugbank:foodInteraction;
+    toxicity = (owl:sameAs)+/drugbank:toxicity;
+    pharmacology = (owl:sameAs)+/drugbank:pharmacology;
+
+Here [LDpath](http://code.google.com/p/ldpath/) is used to collect the 
interesting information. "(owl:sameAs)+" is used to build the transitive 
closure over the "owl:sameAs" properties. This LDpath program ensures that the 
context is an entity if the type "dailymed:drugs".
+
+LDPath statements like that can be used with the
+
+* [ehealth/ldpath](http://localhost:8080/entityhub/site/ehealth/ldpath) 
endpoint to request the information for a single drug
+* [ehealth/find](http://localhost:8080/entityhub/site/ehealth/find) endpoint 
to search for "dailymed:name" (make sure the language field is empty if you use 
the UI)
+* [ehealth/query](http://localhost:8080/entityhub/site/ehealth/query) endpoint 
to make any kind of field queries.
+

svn commit: r1306969 - /incubator/stanbol/trunk/demos/ehealth/README.md

Reply via email to