Author: buildbot
Date: Mon Jun 2 07:54:08 2014
New Revision: 910868
Log:
Staging update by buildbot for stanbol
Modified:
websites/staging/stanbol/trunk/content/ (props changed)
websites/staging/stanbol/trunk/content/docs/trunk/customvocabulary.html
Propchange: websites/staging/stanbol/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon Jun 2 07:54:08 2014
@@ -1 +1 @@
-1599100
+1599105
Modified:
websites/staging/stanbol/trunk/content/docs/trunk/customvocabulary.html
==============================================================================
--- websites/staging/stanbol/trunk/content/docs/trunk/customvocabulary.html
(original)
+++ websites/staging/stanbol/trunk/content/docs/trunk/customvocabulary.html Mon
Jun 2 07:54:08 2014
@@ -94,8 +94,7 @@
<p>The aim of this usage scenario is to provide Apache Stanbol users with all
the required knowledge to customize Apache Stanbol for use in their specific
domain. This includes</p>
<ul>
<li>Two possibilities to manage custom Vocabularies<ol>
-<li>via the RESTful interface provided by a Managed Site or<br />
-</li>
+<li>via the RESTful interface provided by a Managed Site or </li>
<li>by using a ReferencedSite with a full local index</li>
</ol>
</li>
@@ -104,11 +103,11 @@
<li>Configuring the Stanbol Enhancer to make use of the indexed and imported
Vocabularies</li>
</ul>
<h2 id="overview">Overview</h2>
-<p>The following figure shows the typical Enhancement workflow that may start
with some preprocessing steps (e.g. the conversion of rich text formats to
plain text) followed by the Natural Language Processing phase. Next 'Semantic
Lifting' aims to connect the results of text processing and link it to the
application domain of the user. During Postprocessing those results may get
further refined.
-<p style="text-align: center;"> followed by the Natural Language Processing phase. Next 'Semantic
Lifting' aims to connect the results of text processing and link it to the
application domain of the user. During Postprocessing those results may get
further refined.</p>
+<p>The following figure shows the typical Enhancement workflow. It may start
with some preprocessing steps (e.g. the conversion of rich text formats to
plain text), followed by the Natural Language Processing phase. Next, 'Semantic
Lifting' aims to connect the results of text processing and link them to the
application domain of the user. During Postprocessing those results may get
further refined.</p>
+<p style="text-align: center;"></p>
+
<p>This usage scenario is all about the Semantic Lifting phase. This phase is
central to how well enhancement results match the requirements of the user's
application domain. Users that need to process health related documents will
need to provide vocabularies containing life science related entities,
otherwise the Stanbol Enhancer will not perform as expected on those
documents. Similarly, processing customer requests can only work if Stanbol
has access to the data managed by the CRM.</p>
-<p>This scenario aims to provide Stanbol users with all information necessary
to use Apache Stanbol in scenarios where domain specific vocabularies are
required.<br />
-</p>
+<p>This scenario aims to provide Stanbol users with all information necessary
to use Apache Stanbol in scenarios where domain specific vocabularies are
required. </p>
<h2 id="managing-custom-vocabularies-with-the-stanbol-entityhub">Managing
Custom Vocabularies with the Stanbol Entityhub</h2>
<p>By default the Stanbol Enhancer uses the Entityhub component for
linking Entities with mentions in the processed text. While users may extend
the Enhancer to allow the usage of other sources, this is outside the scope
of this scenario.</p>
<p>The Stanbol Entityhub provides two possibilities to manage vocabularies</p>
@@ -124,13 +123,13 @@
<li>the <em><a
href="components/entityhub/managedsite#configuration-of-the-yardsite">YardSite</a></em>
- the component that implements the ManagedSite interface.</li>
</ol>
<p>After completing those two steps an empty Managed Site should be ready for
use under</p>
-<div class="codehilite"><pre><span class="n">http:</span><span
class="sr">//</span><span class="p">{</span><span class="n">stanbol</span><span
class="o">-</span><span class="n">host</span><span class="p">}</span><span
class="sr">/entityhub/si</span><span class="n">tes</span><span
class="sr">/{managed-site-name}/</span>
+<div class="codehilite"><pre><span class="n">http</span><span
class="p">:</span><span class="o">//</span><span class="p">{</span><span
class="n">stanbol</span><span class="o">-</span><span
class="n">host</span><span class="p">}</span><span class="o">/</span><span
class="n">entityhub</span><span class="o">/</span><span
class="n">sites</span><span class="o">/</span><span class="p">{</span><span
class="n">managed</span><span class="o">-</span><span
class="n">site</span><span class="o">-</span><span class="n">name</span><span
class="p">}</span><span class="o">/</span>
</pre></div>
<p>and users can start to upload the Entities of the controlled vocabulary
using RESTful requests such as</p>
-<div class="codehilite"><pre><span class="n">curl</span> <span
class="o">-</span><span class="n">i</span> <span class="o">-</span><span
class="n">X</span> <span class="n">PUT</span> <span class="o">-</span><span
class="n">H</span> <span class="s">"Content-Type:
application/rdf+xml"</span> <span class="o">-</span><span
class="n">T</span> <span class="p">{</span><span class="n">rdf</span><span
class="o">-</span><span class="n">xml</span><span class="o">-</span><span
class="n">data</span><span class="p">}</span> <span class="o">\</span>
- <span
class="s">"http://{stanbol-host}/entityhub/site/{managed-site-name}/entity"</span>
+<div class="codehilite"><pre><span class="n">curl</span> <span
class="o">-</span><span class="nb">i</span> <span class="o">-</span><span
class="n">X</span> <span class="n">PUT</span> <span class="o">-</span><span
class="n">H</span> "<span class="n">Content</span><span
class="o">-</span><span class="n">Type</span><span class="p">:</span> <span
class="n">application</span><span class="o">/</span><span
class="n">rdf</span><span class="o">+</span><span class="n">xml</span>"
<span class="o">-</span><span class="n">T</span> <span class="p">{</span><span
class="n">rdf</span><span class="o">-</span><span class="n">xml</span><span
class="o">-</span><span class="n">data</span><span class="p">}</span> <span
class="o">\</span>
+ "<span class="n">http</span><span class="p">:</span><span
class="o">//</span><span class="p">{</span><span class="n">stanbol</span><span
class="o">-</span><span class="n">host</span><span class="p">}</span><span
class="o">/</span><span class="n">entityhub</span><span class="o">/</span><span
class="n">site</span><span class="o">/</span><span class="p">{</span><span
class="n">managed</span><span class="o">-</span><span
class="n">site</span><span class="o">-</span><span class="n">name</span><span
class="p">}</span><span class="o">/</span><span class="n">entity</span>"
</pre></div>
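<p>After an upload, the import can be spot-checked by dereferencing a single entity via the site's <code>entity</code> resource. The sketch below only composes the request URL; the host, site name, and entity URI are placeholders (assumptions), not values from this page.</p>

```shell
# Placeholders (assumptions): adjust host, site name and entity URI to your setup.
STANBOL_HOST="http://localhost:8080"
SITE="myvocab"
ENTITY_ID="http://example.org/myvocab/entity1"

# The Entityhub dereferences entities of a site via the 'entity' resource.
REQUEST_URL="$STANBOL_HOST/entityhub/site/$SITE/entity?id=$ENTITY_ID"
echo "$REQUEST_URL"

# Uncomment to run against a live Stanbol instance:
# curl -s -H "Accept: application/rdf+xml" "$REQUEST_URL"
```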
@@ -208,7 +207,7 @@ org.apache.stanbol.entityhub.indexing.ge
</ul>
<p>You find both files in the
<code>{indexing-working-dir}/indexing/dist/</code> folder.</p>
<p>After the installation your data will be available at</p>
-<div class="codehilite"><pre><span class="n">http:</span><span
class="sr">//</span><span class="p">{</span><span class="n">stanbol</span><span
class="o">-</span><span class="n">instance</span><span class="p">}</span><span
class="sr">/entityhub/si</span><span class="n">te</span><span
class="o">/</span><span class="p">{</span><span class="n">name</span><span
class="p">}</span>
+<div class="codehilite"><pre><span class="n">http</span><span
class="p">:</span><span class="o">//</span><span class="p">{</span><span
class="n">stanbol</span><span class="o">-</span><span
class="n">instance</span><span class="p">}</span><span class="o">/</span><span
class="n">entityhub</span><span class="o">/</span><span
class="n">site</span><span class="o">/</span><span class="p">{</span><span
class="n">name</span><span class="p">}</span>
</pre></div>
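<p>Once the index is installed, a quick label lookup confirms the site is active. A minimal sketch, assuming a local Stanbol instance and a site named <code>myvocab</code>; Entityhub sites expose a <code>find</code> resource that matches entities by label.</p>

```shell
# Placeholders (assumptions): adjust host and site name to your setup.
STANBOL_HOST="http://localhost:8080"
SITE="myvocab"

FIND_URL="$STANBOL_HOST/entityhub/site/$SITE/find"
echo "$FIND_URL"

# Uncomment to query a running instance for entities whose label starts with "Pari":
# curl -s -X POST -d "name=Pari*" -H "Accept: application/json" "$FIND_URL"
```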
@@ -242,14 +241,22 @@ For the configuration of this engine you
<li>"{name}Linking" - the <a
href="components/enhancer/engines/namedentitytaggingengine.html">Named Entity
Tagging Engine</a> for your vocabulary as configured above.</li>
</ul>
<p>Both the <a href="components/enhancer/chains/weightedchain.html">weighted
chain</a> and the <a href="components/enhancer/chains/listchain.html">list
chain</a> can be used for the configuration of such a chain.</p>
-<h3 id="configuring-named-entity-linking_1">Configuring Named Entity
Linking</h3>
-<p>First it is important to note the difference between <em>Named Entity
Linking</em> and <em>Entity Linking</em>. While <em>Named Entity Linking</em>
only considers <em>Named Entities</em> detected by NER (Named Entity
Recognition) <em>Entity Linking</em> does work on Words (Tokens). Because of
that is has much lower NLP requirements and can even operate for languages
where only word tokenization is supported. However extraction results AND
performance do greatly improve with POS (Part of Speech) tagging support. Also
Chunking (Noun Phrase detection), NER and Lemmatization results can be consumed
by Entity Linking to further improve extraction results. For details see the
documentation of the <a
href="components/enhancer/engines/entitylinking#linking-process">Entity Linking
Process</a>.</p>
+<h3 id="configuring-entity-linking">Configuring Entity Linking</h3>
+<p>First it is important to note the difference between <em>Named Entity
Linking</em> and <em>Entity Linking</em>. While <em>Named Entity Linking</em>
only considers <em>Named Entities</em> detected by NER (Named Entity
Recognition), <em>Entity Linking</em> works on Words (Tokens). As NER
support is only available for a limited number of languages, <em>Named Entity
Linking</em> is only an option for those languages. <em>Entity Linking</em>
only requires correct tokenization of the text, so it can be used for nearly
every language. However <em>NOTE</em> that POS (Part of Speech) tagging will
greatly improve quality and also speed, as it allows the engine to look up only
Nouns. Also Chunking (Noun Phrase detection), NER and Lemmatization results are
considered by Entity Linking to improve vocabulary lookups. For details see the
documentation of the <a
href="components/enhancer/engines/entitylinking#linking-process">Entity Linking
Process</a>.</p>
<p>The second big difference is that <em>Named Entity Linking</em> can only
support Entity types supported by the NER models (Persons, Organizations and
Places). <em>Entity Linking</em> does not have this restriction. This advantage
comes with the disadvantage that Entity lookups in the Controlled
Vocabulary are based only on Label similarities, while <em>Named Entity
Linking</em> can also use the type information provided by NER.</p>
-<p>To use <em>Entity Linking</em> with a custom Vocabulary Users need to
configure an instance of the <a
href="components/enhancer/engines/entityhublinking">Entityhub Linking
Engine</a>. While this Engine provides more than twenty configuration
parameters the following list provides an overview about the most important.
For detailed information please see the documentation of the Engine.</p>
+<p>To use <em>Entity Linking</em> with a custom Vocabulary users need to
configure an instance of the <a
href="components/enhancer/engines/entityhublinking">Entityhub Linking
Engine</a> or an <a href="components/enhancer/engines/lucenefstlinking">FST
Linking Engine</a>. While both of those Engines provide 20+ configuration
parameters, only very few of them are required for a working configuration.</p>
<ol>
-<li>The "Name" of the enhancement engine. It is recommended to use something
like "{name}Extraction" - where {name} is the name of the Entityhub Site</li>
-<li>The name of the "Managed- / Referenced Site" holding your vocabulary. Here
you have to configure the {name}</li>
-<li>The "Label Field" is the URI of the property in your vocabulary providing
the labels used for matching. You can only use a single field. If you want to
use values of several fields you have two options: (1) to adapt your indexing
configuration to copy the values of those fields to a single one (e.g. the
values of "skos:prefLabel" and "skos:altLabel" are copied to "rdfs:label" in
the default configuration of the Entityhub indexing tool (see
{indexing-working-dir}/indexing/config/mappings.txt) (2) to configure multiple
EntityubLinkingEngines - one for each label field. Option (1) is preferable as
long as you do not need to use different configurations for the different
labels.</li>
+<li>The "Name" of the enhancement engine. It is recommended to use something
like "{name}Extraction" or "{name}-linking" - where {name} is the name of the
Entityhub Site</li>
+<li>The link to the data source<ul>
+<li>in case of the Entityhub Linking Engine this is the name of the "Managed-
/ Referenced Site" holding your vocabulary - so if you followed this scenario
you need to configure the {name}</li>
+<li>in case of the FST linking engine this is the link to the SolrCore with
the index of your custom vocabulary. If you followed this scenario you need to
configure the {name} and set the field name encoding to "SolrYard".</li>
+</ul>
+</li>
+<li>The configuration of the field used for linking<ul>
+<li>in case of the Entityhub Linking Engine the "Label Field" needs to be set
to the URI of the property holding the labels. You can only use a single field.
If you want to use values of several fields you need to adapt your indexing
configuration to copy the values of those fields to a single one (e.g. by
adding <code>skos:prefLabel > rdfs:label</code> and <code>skos:altLabel >
rdfs:label</code> to the
<code>{indexing-working-dir}/indexing/config/mappings.txt</code> config).</li>
+<li>in case of the FST Linking engine you need to provide the <a
href="components/enhancer/engines/lucenefstlinking#fst-tagging-configuration">FST
Tagging Configuration</a>. If you store your labels in the
<code>rdfs:label</code> field and you want to support all languages present in
your vocabulary use <code>*;field=rdfs:label;generate=true</code>.
<em>NOTE</em> that <code>generate=true</code> is required to allow the engine
to (re)create FST models at runtime.</li>
+</ul>
+</li>
<li>The "Link ProperNouns only" option: If the custom Vocabulary contains
Proper Nouns (Named Entities) then this parameter should be activated. This
option causes the Entity Linking process to not make queries for common nouns,
thereby reducing the number of queries against the controlled vocabulary by
~70%. However this is not feasible if the vocabulary contains Entities that
are common nouns in the language. </li>
<li>The "Type Mappings" might be interesting for you if your vocabulary
contains custom types as those mappings can be used to map 'rdf:type's of
entities in your vocabulary to 'dc:type's used for 'fise:TextAnnotation's -
created by the Apache Stanbol Enhancer to annotate occurrences of extracted
entities in the parsed text. See the <a
href="components/enhancer/engines/keywordlinkingengine.html#type-mappings-syntax">type
mapping syntax</a> and the <a
href="enhancementusage.html#entity-tagging-with-disambiguation-support">usage
scenario for the Apache Stanbol Enhancement Structure</a> for details.</li>
</ol>
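<p>To make the list above concrete, a minimal Entityhub Linking Engine instance could be described by an OSGi configuration fragment like the sketch below. The property keys and values shown here are illustrative assumptions; verify the exact keys against the engine documentation for your Stanbol version.</p>

```
# Hypothetical OSGi config sketch (property keys are assumptions, verify
# against the Entityhub Linking Engine documentation)
enhancer.engine.name="myvocabExtraction"
# the Managed-/Referenced Site holding the custom vocabulary
org.apache.stanbol.enhancer.engines.entityhublinking.site="myvocab"
# the property providing the labels used for matching
enhancer.engines.linking.labelField="rdfs:label"
# the "Link ProperNouns only" option discussed above
enhancer.engines.linking.properNounsState="true"
```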
@@ -260,7 +267,7 @@ For the configuration of this engine you
<li>opennlp-token - <a
href="components/enhancer/engines/opennlptokenizer">OpenNLP based Word
tokenization</a>. Works for all languages where white spaces can be used to
tokenize.</li>
<li>opennlp-pos - <a href="components/enhancer/engines/opennlppos">OpenNLP
Part of Speech tagging</a></li>
<li>opennlp-chunker - The <a
href="components/enhancer/engines/opennlpchunker">OpenNLP chunker</a> provides
Noun Phrases</li>
-<li>"{name}Extraction - the <a
href="components/enhancer/engines/entityhublinking">Entityhub Linking
Engine</a> configured for the custom vocabulary.</li>
+<li>"{name}Extraction" - the <a
href="components/enhancer/engines/entityhublinking">Entityhub Linking
Engine</a> or <a
href="components/enhancer/engines/lucenefstlinking">FST
Linking Engine</a> configured for the custom vocabulary.</li>
</ul>
<p>Both the <a href="components/enhancer/chains/weightedchain.html">weighted
chain</a> and the <a href="components/enhancer/chains/listchain.html">list
chain</a> can be used for the configuration of such a chain.</p>
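<p>As an illustration, a List Chain wiring the engines above together might be configured with a fragment like this sketch. The property keys and the engine-name values are assumptions for illustration; check the List Chain documentation for the exact configuration keys.</p>

```
# Hypothetical List Chain config sketch (keys and names are assumptions)
stanbol.enhancer.chain.name="myvocab-chain"
stanbol.enhancer.chain.list.enginelist=["opennlp-token","opennlp-pos","opennlp-chunker","myvocabExtraction"]
```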
<p>The documentation of the Stanbol NLP processing module provides <a
href="components/enhancer/nlp/#stanbol-enhancer-nlp-support">detailed
information</a> about integrated NLP frameworks and supported languages.</p>
@@ -271,7 +278,7 @@ For the configuration of this engine you
3) a "dbpedia-proper-noun-linking" chain showing <em>Named Entity Linking</em>
based on DBpedia</p>
<p><strong>Change the enhancement chain bound to "/enhancer"</strong></p>
<p>The enhancement chain bound to </p>
-<div class="codehilite"><pre><span class="n">http:</span><span
class="sr">//</span><span class="p">{</span><span class="n">stanbol</span><span
class="o">-</span><span class="n">host</span><span class="p">}</span><span
class="o">/</span><span class="n">enhancer</span>
+<div class="codehilite"><pre><span class="n">http</span><span
class="p">:</span><span class="o">//</span><span class="p">{</span><span
class="n">stanbol</span><span class="o">-</span><span
class="n">host</span><span class="p">}</span><span class="o">/</span><span
class="n">enhancer</span>
</pre></div>
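<p>Whichever chain is bound to that endpoint can be exercised with a plain-text POST. A minimal sketch; the host and the sample text are placeholders (assumptions).</p>

```shell
# Placeholder (assumption): adjust the host to your Stanbol instance.
STANBOL_HOST="http://localhost:8080"

ENHANCER_URL="$STANBOL_HOST/enhancer"
echo "$ENHANCER_URL"

# Uncomment to send text to the chain bound to /enhancer and
# receive the enhancement results:
# curl -s -X POST -H "Content-Type: text/plain" -H "Accept: application/rdf+xml" \
#      --data "Paris is the capital of France." "$ENHANCER_URL"
```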