Author: buildbot
Date: Mon Jun 2 07:54:08 2014
New Revision: 910868
Log:
Staging update by buildbot for stanbol
Modified:
websites/staging/stanbol/trunk/content/ (props changed)
websites/staging/stanbol/trunk/content/docs/trunk/customvocabulary.html
Propchange: websites/staging/stanbol/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon Jun 2 07:54:08 2014
@@ -1 +1 @@
-1599100
+1599105
Modified:
websites/staging/stanbol/trunk/content/docs/trunk/customvocabulary.html
==============================================================================
--- websites/staging/stanbol/trunk/content/docs/trunk/customvocabulary.html
(original)
+++ websites/staging/stanbol/trunk/content/docs/trunk/customvocabulary.html Mon
Jun 2 07:54:08 2014
@@ -94,8 +94,7 @@
<p>The aim of this usage scenario is to provide Apache Stanbol users with all
the required knowledge to customize Apache Stanbol for use in their specific
domain. This includes</p>
<ul>
<li>Two possibilities to manage custom Vocabularies<ol>
-<li>via the RESTful interface provided by a Managed Site or<br />
-</li>
+<li>via the RESTful interface provided by a Managed Site or </li>
<li>by using a ReferencedSite with a full local index</li>
</ol>
</li>
@@ -104,11 +103,11 @@
<li>Configuring the Stanbol Enhancer to make use of the indexed and imported
Vocabularies</li>
</ul>
<h2 id="overview">Overview</h2>
-<p>The following figure shows the typical Enhancement workflow that may start
with some preprocessing steps (e.g. the conversion of rich text formats to
plain text) followed by the Natural Language Processing phase. Next 'Semantic
Lifting' aims to connect the results of text processing and link it to the
application domain of the user. During Postprocessing those results may get
further refined.
-<p style="text-align: center;"> followed by the Natural Language Processing phase. Next 'Semantic
Lifting' aims to connect the results of text processing and link it to the
application domain of the user. During Postprocessing those results may get
further refined.</p>
+<p>The following figure shows the typical Enhancement workflow. It may start
with some preprocessing steps (e.g. the conversion of rich text formats to
plain text), followed by the Natural Language Processing phase. Next, 'Semantic
Lifting' aims to connect the results of text processing and link them to the
application domain of the user. During Postprocessing those results may get
further refined.</p>
+<p style="text-align: center;"></p>
+
<p>This usage scenario is all about the Semantic Lifting phase. This phase is
central to how well enhancement results match the requirements of the user's
application domain. Users that need to process health related documents will
need to provide vocabularies containing life science related entities,
otherwise the Stanbol Enhancer will not perform as expected on those
documents. Similarly, processing customer requests can only work if Stanbol
has access to the data managed by the CRM.</p>
-<p>This scenario aims to provide Stanbol users with all information necessary
to use Apache Stanbol in scenarios where domain specific vocabularies are
required.<br />
-</p>
+<p>This scenario aims to provide Stanbol users with all information necessary
to use Apache Stanbol in scenarios where domain specific vocabularies are
required. </p>
<h2 id="managing-custom-vocabularies-with-the-stanbol-entityhub">Managing
Custom Vocabularies with the Stanbol Entityhub</h2>
<p>By default the Stanbol Enhancer uses the Entityhub component for
linking Entities with mentions in the processed text. While users may extend
the Enhancer to allow the usage of other sources, this is outside the scope
of this scenario.</p>
<p>The Stanbol Entityhub provides two possibilities to manage vocabularies</p>
@@ -124,13 +123,13 @@
<li>the <em><a
href="components/entityhub/managedsite#configuration-of-the-yardsite">YardSite</a></em>
- the component that implements the ManagedSite interface.</li>
</ol>
<p>After completing those two steps an empty Managed Site should be ready for
use under</p>
-<div class="codehilite"><pre><span class="n">http:</span><span
class="sr">//</span><span class="p">{</span><span class="n">stanbol</span><span
class="o">-</span><span class="n">host</span><span class="p">}</span><span
class="sr">/entityhub/si</span><span class="n">tes</span><span
class="sr">/{managed-site-name}/</span>
+<div class="codehilite"><pre><span class="n">http</span><span
class="p">:</span><span class="o">//</span><span class="p">{</span><span
class="n">stanbol</span><span class="o">-</span><span
class="n">host</span><span class="p">}</span><span class="o">/</span><span
class="n">entityhub</span><span class="o">/</span><span
class="n">sites</span><span class="o">/</span><span class="p">{</span><span
class="n">managed</span><span class="o">-</span><span
class="n">site</span><span class="o">-</span><span class="n">name</span><span
class="p">}</span><span class="o">/</span>
</pre></div>
<p>and users can start to upload the Entities of the controlled vocabulary
using RESTful requests such as</p>
-<div class="codehilite"><pre><span class="n">curl</span> <span
class="o">-</span><span class="n">i</span> <span class="o">-</span><span
class="n">X</span> <span class="n">PUT</span> <span class="o">-</span><span
class="n">H</span> <span class="s">"Content-Type:
application/rdf+xml"</span> <span class="o">-</span><span
class="n">T</span> <span class="p">{</span><span class="n">rdf</span><span
class="o">-</span><span class="n">xml</span><span class="o">-</span><span
class="n">data</span><span class="p">}</span> <span class="o">\</span>
- <span
class="s">"http://{stanbol-host}/entityhub/site/{managed-site-name}/entity"</span>
+<div class="codehilite"><pre><span class="n">curl</span> <span
class="o">-</span><span class="nb">i</span> <span class="o">-</span><span
class="n">X</span> <span class="n">PUT</span> <span class="o">-</span><span
class="n">H</span> "<span class="n">Content</span><span
class="o">-</span><span class="n">Type</span><span class="p">:</span> <span
class="n">application</span><span class="o">/</span><span
class="n">rdf</span><span class="o">+</span><span class="n">xml</span>"
<span class="o">-</span><span class="n">T</span> <span class="p">{</span><span
class="n">rdf</span><span class="o">-</span><span class="n">xml</span><span
class="o">-</span><span class="n">data</span><span class="p">}</span> <span
class="o">\</span>
+ "<span class="n">http</span><span class="p">:</span><span
class="o">//</span><span class="p">{</span><span class="n">stanbol</span><span
class="o">-</span><span class="n">host</span><span class="p">}</span><span
class="o">/</span><span class="n">entityhub</span><span class="o">/</span><span
class="n">site</span><span class="o">/</span><span class="p">{</span><span
class="n">managed</span><span class="o">-</span><span
class="n">site</span><span class="o">-</span><span class="n">name</span><span
class="p">}</span><span class="o">/</span><span class="n">entity</span>"
</pre></div>
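<p>After an upload, the import can be spot-checked by dereferencing a single entity via the site's <code>entity</code> resource. The sketch below only composes the request URL; the host, site name, and entity URI are placeholders (assumptions), not values from this page.</p>

```shell
# Placeholders (assumptions): adjust host, site name and entity URI to your setup.
STANBOL_HOST="http://localhost:8080"
SITE="myvocab"
ENTITY_ID="http://example.org/myvocab/entity1"

# The Entityhub dereferences entities of a site via the 'entity' resource.
REQUEST_URL="$STANBOL_HOST/entityhub/site/$SITE/entity?id=$ENTITY_ID"
echo "$REQUEST_URL"

# Uncomment to run against a live Stanbol instance:
# curl -s -H "Accept: application/rdf+xml" "$REQUEST_URL"
```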
@@ -208,7 +207,7 @@ org.apache.stanbol.entityhub.indexing.ge
</ul>
<p>You find both files in the
<code>{indexing-working-dir}/indexing/dist/</code> folder.</p>
<p>After the installation your data will be available at</p>
-<div class="codehilite"><pre><span class="n">http:</span><span
class="sr">//</span><span class="p">{</span><span class="n">stanbol</span><span
class="o">-</span><span class="n">instance</span><span class="p">}</span><span
class="sr">/entityhub/si</span><span class="n">te</span><span
class="o">/</span><span class="p">{</span><span class="n">name</span><span
class="p">}</span>
+<div class="codehilite"><pre><span class="n">http</span><span
class="p">:</span><span class="o">//</span><span class="p">{</span><span
class="n">stanbol</span><span class="o">-</span><span
class="n">instance</span><span class="p">}</span><span class="o">/</span><span
class="n">entityhub</span><span class="o">/</span><span
class="n">site</span><span class="o">/</span><span class="p">{</span><span
class="n">name</span><span class="p">}</span>
</pre></div>
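<p>Once the index is installed, a quick label lookup confirms the site is active. A minimal sketch, assuming a local Stanbol instance and a site named <code>myvocab</code>; Entityhub sites expose a <code>find</code> resource that matches entities by label.</p>

```shell
# Placeholders (assumptions): adjust host and site name to your setup.
STANBOL_HOST="http://localhost:8080"
SITE="myvocab"

FIND_URL="$STANBOL_HOST/entityhub/site/$SITE/find"
echo "$FIND_URL"

# Uncomment to query a running instance for entities whose label starts with "Pari":
# curl -s -X POST -d "name=Pari*" -H "Accept: application/json" "$FIND_URL"
```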
@@ -242,14 +241,22 @@ For the configuration of this engine you
<li>"{name}Linking" - the <a
href="components/enhancer/engines/namedentitytaggingengine.html">Named Entity
Tagging Engine</a> for your vocabulary as configured above.</li>
</ul>
<p>Both the <a href="components/enhancer/chains/weightedchain.html">weighted
chain</a> and the <a href="components/enhancer/chains/listchain.html">list
chain</a> can be used for the configuration of such a chain.</p>
-<h3 id="configuring-named-entity-linking_1">Configuring Named Entity
Linking</h3>
-<p>First it is important to note the difference between <em>Named Entity
Linking</em> and <em>Entity Linking</em>. While <em>Named Entity Linking</em>
only considers <em>Named Entities</em> detected by NER (Named Entity
Recognition) <em>Entity Linking</em> does work on Words (Tokens). Because of
that is has much lower NLP requirements and can even operate for languages
where only word tokenization is supported. However extraction results AND
performance do greatly improve with POS (Part of Speech) tagging support. Also
Chunking (Noun Phrase detection), NER and Lemmatization results can be consumed
by Entity Linking to further improve extraction results. For details see the
documentation of the <a
href="components/enhancer/engines/entitylinking#linking-process">Entity Linking
Process</a>.</p>
+<h3 id="configuring-entity-linking">Configuring Entity Linking</h3>
+<p>First it is important to note the difference between <em>Named Entity
Linking</em> and <em>Entity Linking</em>. While <em>Named Entity Linking</em>
only considers <em>Named Entities</em> detected by NER (Named Entity
Recognition), <em>Entity Linking</em> works on Words (Tokens). As NER
support is only available for a limited number of languages, <em>Named Entity
Linking</em> is only an option for those languages. <em>Entity Linking</em>
only requires correct tokenization of the text, so it can be used for nearly
every language. However <em>NOTE</em> that POS (Part of Speech) tagging will
greatly improve quality and also speed, as it allows the engine to look up only
Nouns. Also Chunking (Noun Phrase detection), NER and Lemmatization results are
considered by Entity Linking to improve vocabulary lookups. For details see the
documentation of the <a
href="components/enhancer/engines/entitylinking#linking-process">Entity Linking
Process</a>.</p>
<p>The second big difference is that <em>Named Entity Linking</em> can only
support Entity types supported by the NER models (Persons, Organizations and
Places). <em>Entity Linking</em> does not have this restriction. This advantage
comes with the disadvantage that Entity lookups in the Controlled
Vocabulary are based only on Label similarities, while <em>Named Entity
Linking</em> can also use the type information provided by NER.</p>
-<p>To use <em>Entity Linking</em> with a custom Vocabulary Users need to
configure an instance of the <a
href="components/enhancer/engines/entityhublinking">Entityhub Linking
Engine</a>. While this Engine provides more than twenty configuration
parameters the following list provides an overview about the most important.
For detailed information please see the documentation of the Engine.</p>
+<p>To use <em>Entity Linking</em> with a custom Vocabulary users need to
configure an instance of the <a
href="components/enhancer/engines/entityhublinking">Entityhub Linking
Engine</a> or an <a href="components/enhancer/engines/lucenefstlinking">FST
Linking Engine</a>. While both of those Engines provide 20+ configuration
parameters, only very few of them are required for a working configuration.</p>
<ol>
-<li>The "Name" of the enhancement engine. It is recommended to use something
like "{name}Extraction" - where {name} is the name of the Entityhub Site</li>
-<li>The name of the "Managed- / Referenced Site" holding your vocabulary. Here
you have to configure the {name}</li>
-<li>The "Label Field" is the URI of the property in your vocabulary providing
the labels used for matching. You can only use a single field. If you want to
use values of several fields you have two options: (1) to adapt your indexing
configuration to copy the values of those fields to a single one (e.g. the
values of "skos:prefLabel" and "skos:altLabel" are copied to "rdfs:label" in
the default configuration of the Entityhub indexing tool (see
{indexing-working-dir}/indexing/config/mappings.txt) (2) to configure multiple
EntityubLinkingEngines - one for each label field. Option (1) is preferable as
long as you do not need to use different configurations for the different
labels.</li>
+<li>The "Name" of the enhancement engine. It is recommended to use something
like "{name}Extraction" or "{name}-linking" - where {name} is the name of the
Entityhub Site</li>
+<li>The link to the data source<ul>
+<li>in case of the Entityhub Linking Engine this is the name of the "Managed-
/ Referenced Site" holding your vocabulary - so if you followed this scenario
you need to configure the {name}</li>
+<li>in case of the FST linking engine this is the link to the SolrCore with
the index of your custom vocabulary. If you followed this scenario you need to
configure the {name} and set the field name encoding to "SolrYard".</li>
+</ul>
+</li>
+<li>The configuration of the field used for linking<ul>
+<li>in case of the Entityhub Linking Engine the "Label Field" needs to be set
to the URI of the property holding the labels. You can only use a single field.
If you want to use values of several fields you need to adapt your indexing
configuration to copy the values of those fields to a single one (e.g. by
adding <code>skos:prefLabel > rdfs:label</code> and <code>skos:altLabel >
rdfs:label</code> to the
<code>{indexing-working-dir}/indexing/config/mappings.txt</code> config).</li>
+<li>in case of the FST Linking engine you need to provide the <a
href="components/enhancer/engines/lucenefstlinking#fst-tagging-configuration">FST
Tagging Configuration</a>. If you store your labels in the
<code>rdfs:label</code> field and you want to support all languages present in
your vocabulary use <code>*;field=rdfs:label;generate=true</code>.
<em>NOTE</em> that <code>generate=true</code> is required to allow the engine
to (re)create FST models at runtime.</li>
+</ul>
+</li>
<li>The "Link ProperNouns only" option: If the custom Vocabulary contains
Proper Nouns (Named Entities) then this parameter should be activated. This
option causes the Entity Linking process to not make queries for common nouns,
thereby reducing the number of queries against the controlled vocabulary by
~70%. However this is not feasible if the vocabulary contains Entities that
are common nouns in the language. </li>
<li>The "Type Mappings" might be interesting for you if your vocabulary
contains custom types as those mappings can be used to map 'rdf:type's of
entities in your vocabulary to 'dc:type's used for 'fise:TextAnnotation's -
created by the Apache Stanbol Enhancer to annotate occurrences of extracted
entities in the parsed text. See the <a
href="components/enhancer/engines/keywordlinkingengine.html#type-mappings-syntax">type
mapping syntax</a> and the <a
href="enhancementusage.html#entity-tagging-with-disambiguation-support">usage
scenario for the Apache Stanbol Enhancement Structure</a> for details.</li>
</ol>
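<p>To make the list above concrete, a minimal Entityhub Linking Engine instance could be described by an OSGi configuration fragment like the sketch below. The property keys and values shown here are illustrative assumptions; verify the exact keys against the engine documentation for your Stanbol version.</p>

```
# Hypothetical OSGi config sketch (property keys are assumptions, verify
# against the Entityhub Linking Engine documentation)
enhancer.engine.name="myvocabExtraction"
# the Managed-/Referenced Site holding the custom vocabulary
org.apache.stanbol.enhancer.engines.entityhublinking.site="myvocab"
# the property providing the labels used for matching
enhancer.engines.linking.labelField="rdfs:label"
# the "Link ProperNouns only" option discussed above
enhancer.engines.linking.properNounsState="true"
```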
@@ -260,7 +267,7 @@ For the configuration of this engine you
<li>opennlp-token - <a
href="components/enhancer/engines/opennlptokenizer">OpenNLP based Word
tokenization</a>. Works for all languages where white spaces can be used to
tokenize.</li>
<li>opennlp-pos - <a href="components/enhancer/engines/opennlppos">OpenNLP
Part of Speech tagging</a></li>
<li>opennlp-chunker - The <a
href="components/enhancer/engines/opennlpchunker">OpenNLP chunker</a> provides
Noun Phrases</li>
-<li>"{name}Extraction - the <a
href="components/enhancer/engines/entityhublinking">Entityhub Linking
Engine</a> configured for the custom vocabulary.</li>
+<li>"{name}Extraction" - the <a
href="components/enhancer/engines/entityhublinking">Entityhub Linking
Engine</a> or <a
href="components/enhancer/engines/lucenefstlinking">FST
Linking Engine</a> configured for the custom vocabulary.</li>
</ul>
<p>Both the <a href="components/enhancer/chains/weightedchain.html">weighted
chain</a> and the <a href="components/enhancer/chains/listchain.html">list
chain</a> can be used for the configuration of such a chain.</p>
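<p>As an illustration, a List Chain wiring the engines above together might be configured with a fragment like this sketch. The property keys and the engine-name values are assumptions for illustration; check the List Chain documentation for the exact configuration keys.</p>

```
# Hypothetical List Chain config sketch (keys and names are assumptions)
stanbol.enhancer.chain.name="myvocab-chain"
stanbol.enhancer.chain.list.enginelist=["opennlp-token","opennlp-pos","opennlp-chunker","myvocabExtraction"]
```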
<p>The documentation of the Stanbol NLP processing module provides <a
href="components/enhancer/nlp/#stanbol-enhancer-nlp-support">detailed
information</a> about integrated NLP frameworks and supported languages.</p>
@@ -271,7 +278,7 @@ For the configuration of this engine you
3) a "dbpedia-proper-noun-linking" chain showing <em>Named Entity Linking</em>
based on DBpedia</p>
<p><strong>Change the enhancement chain bound to "/enhancer"</strong></p>
<p>The enhancement chain bound to </p>
-<div class="codehilite"><pre><span class="n">http:</span><span
class="sr">//</span><span class="p">{</span><span class="n">stanbol</span><span
class="o">-</span><span class="n">host</span><span class="p">}</span><span
class="o">/</span><span class="n">enhancer</span>
+<div class="codehilite"><pre><span class="n">http</span><span
class="p">:</span><span class="o">//</span><span class="p">{</span><span
class="n">stanbol</span><span class="o">-</span><span
class="n">host</span><span class="p">}</span><span class="o">/</span><span
class="n">enhancer</span>
</pre></div>
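<p>Whichever chain is bound to that endpoint can be exercised with a plain-text POST. A minimal sketch; the host and the sample text are placeholders (assumptions).</p>

```shell
# Placeholder (assumption): adjust the host to your Stanbol instance.
STANBOL_HOST="http://localhost:8080"

ENHANCER_URL="$STANBOL_HOST/enhancer"
echo "$ENHANCER_URL"

# Uncomment to send text to the chain bound to /enhancer and
# receive the enhancement results:
# curl -s -X POST -H "Content-Type: text/plain" -H "Accept: application/rdf+xml" \
#      --data "Paris is the capital of France." "$ENHANCER_URL"
```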