keywordlinkingengineconfig.png

buildbot Fri, 16 Mar 2012 01:02:02 -0700

Author: buildbot
Date: Fri Mar 16 08:01:29 2012
New Revision: 808822

Log:
Staging update by buildbot for stanbol


Added:
    
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png
   (with props)
Modified:
    websites/staging/stanbol/trunk/content/   (props changed)
    
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html

Propchange: websites/staging/stanbol/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Fri Mar 16 08:01:29 2012
@@ -1 +1 @@
-1297881
+1301363

Modified: 
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
==============================================================================
--- 
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
 (original)
+++ 
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
 Fri Mar 16 08:01:29 2012
@@ -46,7 +46,7 @@
 <ul>
 <li><a href="/stanbol/docs/trunk/downloads.html">Overview</a></li>
 </ul>
-<h1 id="the_asf">The ASF</h1>
+<h1 id="the-asf">The ASF</h1>
 <ul>
 <li><a href="http://www.apache.org">Apache Software Foundation</a></li>
 <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
@@ -57,9 +57,63 @@
   
   <div id="content">
     <h1 class="title">The Keyword Linking Engine: custom vocabularies and 
multiple languages</h1>
-    <p>The <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/";>KeywordLinkingEngine</a>
 is a re-implementation of the <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/taxonomylinking/";>TaxonomyLinkingEngine</a>
 which is more modular and therefore better suited for future improvements and 
extensions as requested by <a 
href="https://issues.apache.org/jira/browse/STANBOL-303";>STANBOL-303</a>. </p>
-<p>Currently the main advantage of using this engine is its ability to support 
multiple languages and provide enhancement results specific to custom 
vocabulary. </p>
-<h2 id="multiple_language_support">Multiple Language Support</h2>
+    <p>The KeywordLinkingEngine extracts occurrences of Entities that are part 
of a Controlled Vocabulary from content passed to the Stanbol Enhancer. To do 
this, words appearing within the text are compared with labels of entities. The 
Stanbol Entityhub is used to look up Entities based on their labels.</p>
+<p>This documentation first provides information about the configuration 
options of this engine. This section is mainly intended for users of this 
engine. The remaining part of this document is rather technical and intended to 
be read by developers that want to extend this engine or want to know the 
technical details.</p>
+<h2 id="configuration">Configuration</h2>
+<p>The KeywordLinkingEngine provides many configuration options. This section 
describes the different options based on the configuration dialog as shown by 
the Apache Felix Webconsole. </p>
+<p><img alt="KeywordLinkingEngine configuration" 
src="keywordlinkingengineconfig.png" title="The configuration dialog as shown 
by the Apache Felix web console" /></p>
+<p>The example in the screenshot shows a configuration that is used to extract 
Drugs based on various IDs (e.g. the ATC code and the nchi key) that are all 
stored as values of the skos:notation property. This example is used to 
emphasize newer features like case sensitive mapping, the keyword tokenizer and 
customized type mappings. Similar configurations would also be needed to 
extract product IDs, ISBN numbers or more generally concepts of a thesaurus 
based on their notation.</p>
+<h3 id="configuration-parameter">Configuration Parameter</h3>
+<ul>
+<li><strong>Name</strong>(stanbol.enhancer.engine.name): The name of the 
Enhancement Engine. This name is used to refer to an <a 
href="index.html">EnhancementEngine</a> in <a 
href="enhancementchain.html">EnhancementChain</a>s</li>
+<li><strong>Referenced 
Site</strong>(org.apache.stanbol.enhancer.engines.keywordextraction.referencedSiteId):
 The name of the ReferencedSite of the Stanbol Entityhub that holds the 
controlled vocabulary to be used for extracting Entities. "entityhub" or 
"local" can be used to extract Entities managed directly by the Entityhub.</li>
+<li><strong>Label 
Field</strong>(org.apache.stanbol.enhancer.engines.keywordextraction.nameField):
 The name of the property used to look up Entities. Only a single field is 
supported for performance reasons. Users that want to use values of several 
fields should collect such values through a corresponding configuration in the 
mappings.txt used during indexing. This <a 
href="../../customvocabulary.html">usage scenario</a> provides more information 
on this.</li>
+<li><strong>Case 
Sensitivity</strong>(org.apache.stanbol.enhancer.engines.keywordextraction.caseSensitive):
 This activates or deactivates case sensitive matching. It is important to 
understand that even with case sensitivity activated an Entity with a label 
such as "Anaconda" will be suggested for the mention of "anaconda" in the text. 
The main difference will be the confidence value of such a suggestion, as with 
case sensitivity activated the starting letters "A" and "a" are NOT considered 
to be matching. See the second, technical part for details about the matching 
process. Case Sensitivity is deactivated by default. It is recommended to 
activate it if controlled vocabularies contain abbreviations similar to 
commonly used words, e.g. CAN for Canada.</li>
+<li><strong>Type 
Field</strong>(org.apache.stanbol.enhancer.engines.keywordextraction.typeField):
 Values of this field are used as values of the "fise:entity-types" property of 
created "fise:EntityAnnotation"s. The default is "rdf:type".</li>
+<li><strong>Redirect 
Field</strong>(org.apache.stanbol.enhancer.engines.keywordextraction.redirectField)
 and <strong>Redirect 
Mode</strong>(org.apache.stanbol.enhancer.engines.keywordextraction.redirectMode):
 Redirects tell the KeywordLinkingEngine to follow a specific property in the 
knowledge base for matched entities. This feature makes it possible, for 
example, to follow redirects from "USA" to "United States" as defined in 
Wikipedia. See "Processing of Entity Suggestions" for details. Possible values 
for the Redirect Mode are "IGNORE" - deactivates this feature; "ADD_VALUES" - 
uses label and type information of redirected entities, but keeps the URI of 
the extracted entity; "FOLLOW" - follows the redirect</li>
+<li><strong>Min Token 
Length</strong>(org.apache.stanbol.enhancer.engines.keywordextraction.minSearchTokenLength):
 While the KeywordLinkingEngine preferably uses POS (part-of-speech) taggers to 
determine if a word should be matched with the controlled vocabulary, the 
minimum token length provides a fallback if (a) no POS tagger is available for 
the language of the parsed text or (b) the confidence of the POS tagger is 
lower than the threshold.</li>
+<li><strong>Keyword 
Tokenizer</strong>(org.apache.stanbol.enhancer.engines.keywordextraction.keywordTokenizer):
 This allows to use a special Tokenizer for matching keywords and alpha numeric 
IDs. Typical language specific Tokenizers tend to split such IDs in several 
tokens and therefore might prevent a correct matching. This Tokenizer should 
only be activated if the KeywordLinkingEngine is configured to match against 
IDs like ISBN numbers, Product IDs ... It should not be used to match against 
natural language labels. </li>
+<li><strong>Suggestions</strong>(org.apache.stanbol.enhancer.engines.keywordextraction.maxSuggestions):
 The maximum number of suggested Entities.</li>
+<li><strong>Languages</strong>(org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages)
 and <strong>Default Matching 
Language</strong>(org.apache.stanbol.enhancer.engines.keywordextraction.defaultMatchingLanguage):
 The first specifies the languages that should be processed by this engine. 
This is e.g. useful if the controlled vocabulary only contains labels for a 
specific language but does not formally specify this information (by setting 
the "xml:lang" property for labels). The default matching language can be used 
to work around the exact opposite case. As an example, in DBpedia labels get 
the language of the dataset they are extracted from (e.g. all data extracted 
from en.wikipedia.org will get "xml:lang=en"). The default matching language 
tells the KeywordLinkingEngine to use labels of that language for matching 
regardless of the language of the parsed content. In the case of DBpedia this 
makes it possible, for example, to match persons mentioned in an Italian text 
with the English labels extracted from en.wikipedia.org. Details about natural 
language processing features used by this engine are provided in the section 
"Multiple Language Support"</li>
+<li><strong>Type 
Mappings</strong>(org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings):
 The FISE enhancement structure (as used by the Stanbol Enhancer) distinguishes 
TextAnnotations and EntityAnnotations. The Keyword linking engine needs to 
create both types of Annotations: TextAnnotations selecting the words that 
match some Entities in the Controlled Vocabulary and EntityAnnotations that 
represent an Entity suggested for a TextAnnotation. The Type Mappings are used 
to determine the "dc:type" of the TextAnnotation based on the types of the 
suggested Entity. The default configuration comes with mappings for Persons, 
Organizations, Places and Concepts, but this field allows additional mappings 
to be defined. For details see the sections "Type Mappings Syntax" and 
"Processing of Entity Suggestions".</li>
+<li><strong>Dereference 
Entities</strong>(org.apache.stanbol.enhancer.engines.keywordextraction.dereference):
 If enabled this engine adds additional information about the suggested 
Entities to the Metadata of the enhanced content item.</li>
+<li><strong>Ranking</strong>(service.ranking): This property is used if two 
engines use the same <strong>Name</strong>. In such cases the one with the 
higher ranking will be used to enhance content items. Typically users will not 
need to change this.</li>
+</ul>
+<p>Additionally the following properties can be configured via a configuration 
file:</p>
+<ul>
+<li><strong>Minimum Found 
Tokens</strong>(org.apache.stanbol.enhancer.engines.keywordextraction.minFoundTokens):
 This tells the KeywordLinkingEngine how to deal with Entities that do not 
exactly match words in the text. A typical example is "George W. Bush" -&gt; 
"George Walker Bush". This parameter sets the minimum number of tokens that 
need to match. The default value is '2'. Note that this does not apply to exact 
matches. Setting this to a high value can be used to force a mode that will 
only consider entities where all tokens of the label match the mention in the 
text.</li>
+<li><strong>Minimum Pos Tag 
Probability</strong>(org.apache.stanbol.enhancer.engines.keywordextraction.minPosTagProbability):
 The minimum probability of a POS (part-of-speech) tag. Tags with a lower 
probability will be ignored. In such cases the configured value for the 
<strong>Min Token Length</strong> will apply. The value MUST BE in the range 
[0..1]</li>
+</ul>
+<h3 id="type-mappings-syntax">Type Mappings Syntax</h3>
+<p>The Type Mappings are used to determine the "dc:type" of the TextAnnotation 
based on the types of the suggested Entity. The field "Type Mappings" 
(property: org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings) 
can be used to customize such mappings.</p>
+<p>This field uses the following syntax</p>
+<div class="codehilite"><pre><span class="p">{</span><span 
class="n">uri</span><span class="p">}</span>
+<span class="p">{</span><span class="n">source</span><span class="p">}</span> 
<span class="o">&gt;</span> <span class="p">{</span><span 
class="n">target</span><span class="p">}</span>
+<span class="p">{</span><span class="n">source1</span><span 
class="p">};</span> <span class="p">{</span><span class="n">source2</span><span 
class="p">};</span> <span class="o">...</span> <span class="p">{</span><span 
class="n">sourceN</span><span class="p">}</span> <span class="o">&gt;</span> 
<span class="p">{</span><span class="n">target</span><span class="p">}</span>
+</pre></div>
+
+
+<p>The first variant is a shorthand for {uri} &gt; {uri} and therefore 
specifies that the {uri} should be used as 'dc:type' for TextAnnotations if the 
matched entity is of type {uri}. The second variant matches a {source} URI to a 
{target}. Variant three shows the possibility to match multiple URIs to the 
same target in a single configuration line.</p>
+<p>Both 'ns:localName' and fully qualified URIs are supported. For supported 
namespaces see the <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/generic/servicesapi/src/main/java/org/apache/stanbol/entityhub/servicesapi/defaults/NamespaceEnum.java">NamespaceEnum</a>.
 Information about accepted (INFO) and ignored (WARN) type mappings is 
available in the logs.</p>
+<p>Some Examples of additional Mappings for the e-health domain:</p>
+<div class="codehilite"><pre><span class="err">drugbank:drugs;</span> <span 
class="err">dbp-ont:Drug;</span> <span class="err">dailymed:drugs;</span> <span 
class="err">sider:drugs;</span> <span class="err">tcm:Medicine</span> <span 
class="err">&gt;</span> <span class="err">drugbank:drugs</span>
+<span class="err">diseasome:diseases;</span> <span 
class="err">linkedct:condition;</span> <span class="err">tcm:Disease</span> 
<span class="err">&gt;</span> <span class="err">diseasome:diseases</span> 
+<span class="err">sider:side_effects</span>
+<span class="err">dailymed:ingredients</span>
+<span class="err">dailymed:organization</span> <span class="err">&gt;</span> 
<span class="err">dbp-ont:Organisation</span>
+</pre></div>
+
+
+<p>The first two lines map some well known Classes that represent drugs and 
diseases to 'drugbank:drugs' and 'diseasome:diseases'. The third and fourth 
lines define 1:1 mappings for side effects and ingredients, and the last line 
adds 'dailymed:organization' as an additional mapping to DBpedia Ontology 
Organisation.</p>
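The three syntax variants described above can be illustrated with a small parser. This is only a sketch in Python, not the engine's own code: the function names `parse_type_mappings` and `dc_types` are hypothetical, namespace prefixes are kept as written instead of being resolved via NamespaceEnum, and it assumes that ';' and '>' never occur inside a URI.

```python
def parse_type_mappings(lines):
    """Parse type mapping lines into a dict of source type -> target type."""
    mappings = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if ">" in line:
            # "{source1}; {source2}; ... > {target}"
            left, target = line.split(">", 1)
            sources = [s.strip() for s in left.split(";") if s.strip()]
            target = target.strip()
        else:
            # "{uri}" is shorthand for "{uri} > {uri}"
            sources, target = [line], line
        if not sources or not target:
            raise ValueError(f"invalid type mapping: {raw!r}")
        for source in sources:
            mappings[source] = target
    return mappings


def dc_types(entity_types, mappings):
    """Return the dc:type values for a suggested Entity with the given types."""
    return {mappings[t] for t in entity_types if t in mappings}
```

With the e-health example lines as input, an Entity of type 'tcm:Disease' would be assigned 'diseasome:diseases' as dc:type, while a type that has no mapping contributes nothing.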
+<p>The following mappings are predefined by the KeywordLinkingEngine.</p>
+<div class="codehilite"><pre><span class="n">dbp</span><span 
class="o">-</span><span class="n">ont:Person</span><span class="p">;</span> 
<span class="n">foaf:Person</span><span class="p">;</span> <span 
class="n">schema:Person</span> <span class="o">&gt;</span> <span 
class="n">dbp</span><span class="o">-</span><span class="n">ont:Person</span>
+<span class="n">dbp</span><span class="o">-</span><span 
class="n">ont:Organisation</span><span class="p">;</span> <span 
class="n">dbp</span><span class="o">-</span><span 
class="n">ont:Newspaper</span><span class="p">;</span> <span 
class="n">schema:Organization</span> <span class="o">&gt;</span> <span 
class="n">dbp</span><span class="o">-</span><span 
class="n">ont:Organisation</span>
+<span class="n">dbp</span><span class="o">-</span><span 
class="n">ont:Place</span><span class="p">;</span> <span 
class="n">schema:Place</span><span class="p">;</span> <span 
class="n">gml:_Feature</span> <span class="o">&gt;</span> <span 
class="n">dbp</span><span class="o">-</span><span class="n">ont:Place</span>
+<span class="n">skos:Concept</span>
+</pre></div>
+
+
+<h2 id="multiple-language-support">Multiple Language Support</h2>
 <p>The KeywordLinkingEngine supports the extraction of keywords in multiple 
languages. However, the performance and to some extent also the quality of the 
enhancements depend on how well a language is supported by the used NLP 
framework (currently OpenNLP).
 The following list provides a short overview of the different language 
specific components/configurations:</p>
 <ul>
@@ -67,15 +121,14 @@ The following list provides a short over
 <li><strong>Multi-lingual labels of the controlled vocabulary:</strong> 
Entities are matched based on labels of the current language and labels without 
any defined language. e.g. English labels will not be matched against German 
language texts. Therefore it is important to have a controlled vocabulary that 
includes labels in the language of the texts you want to enhance.</li>
 <li><strong>Natural Language Processing support:</strong> The 
KeywordLinkingEngine is able to use <a 
href="http://opennlp.sourceforge.net/api/opennlp/tools/sentdetect/SentenceDetector.html";>Sentence
 Detectors</a>, <a 
href="http://opennlp.sourceforge.net/api/opennlp/tools/postag/POSTagger.html";>POS
 (Part of Speech) taggers</a> and <a 
href="http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html";>Chunkers</a>.
 If such components are available for a language then they are used to optimize 
the enhancement process.</li>
 <li><strong>Sentence detector:</strong> If a sentence detector is present the 
memory footprint of the engines improves, because Tokens, POS tags and Chunks 
are only kept for the currently active sentence. If no sentence detector is 
available the entire content is treated as a single sentence.</li>
-<li><strong>Tokenizer:</strong> A (word) <a 
href="http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/Tokenizer.html";>tokenizer</a>
 is required for the enhancement process. If no specific tokenizer is available 
for a given language, then the <a 
href="http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/SimpleTokenizer.html";>OpenNLP
 SimpleTokenizer</a> is used as default. How well this tokenizer works will 
depend on the language.</li>
+<li><strong>Tokenizer:</strong> A (word) <a 
href="http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/Tokenizer.html";>tokenizer</a>
 is required for the enhancement process. If no specific tokenizer is available 
for a given language, then the <a 
href="http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/SimpleTokenizer.html";>OpenNLP
 SimpleTokenizer</a> is used as default. The parameter <strong>Keyword 
Tokenizer</strong> can be used to force the usage of a special Tokenizer that 
is optimized for matching keywords. This Tokenizer ensures that alpha-numeric 
IDs are not split into several tokens, so that such IDs are matched correctly. 
If this option is enabled then any language specific Tokenizer will be ignored 
in favor of the KeywordTokenizer.</li>
 <li><strong>POS tagger:</strong> POS (Part-of-Speech) taggers annotate tokens 
with their type. Because the KeywordLinkingEngine is only interested in Nouns, 
Foreign Words and Numbers, the presence of such a tagger makes it possible to 
skip a lot of the tokens and to improve performance. However, POS taggers use 
different sets of tags for different languages. Because of that it is not 
enough that a POS tagger is available for a language; there MUST also BE a 
configuration of the POS tags representing Nouns.</li>
 <li><strong>Chunker:</strong> There are two types of Chunkers. First the <a 
href="http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html";>Chunkers</a>
 as provided by OpenNLP (based on statistical models) and second a <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTypeChunker.java";>POS
 tag based Chunker</a> provided by the openNLP bundle of Stanbol. Currently the 
availability of a Chunker does not have a big influence on the performance nor 
the quality of the Enhancements.</li>
-<li><strong>Configuration:</strong> The set of languages to be annotated can 
be configured for the KeywordLinkingEngine. An empty configuration indicates 
that texts in any language should be processed. By using this configuration it 
is possible to configure different KeywordLinkingEngine instances for different 
languages (e.g. with different configurations)</li>
 </ul>
-<h2 id="keyword_extraction_and_linking_workflow">Keyword extraction and 
linking workflow</h2>
+<h2 id="keyword-extraction-and-linking-workflow">Keyword extraction and 
linking workflow</h2>
 <p>Basically the text is parsed from the beginning to the end and words are 
looked up in the configured controlled vocabulary.</p>
-<h3 id="text_processing">Text Processing</h3>
-<p>The <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/AnalysedContent.java";>AnalysedContent</a>
 Interface is used to access natural language text that was already processed 
by an NLP framework. Currently there is only a single implementation based on 
the commons.opennlp <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/TextAnalyzer.java";>TextAnalyzer</a>
 utility. In general this part is still very focused on OpenNLP. Making it also 
usable together with other NLP frameworks would probably need some 
re-factoring.</p>
+<h3 id="text-processing">Text Processing</h3>
+<p>The <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/AnalysedContent.java";>AnalysedContent</a>
 Interface is used to access natural language text that was already processed 
by a NLP framework. Currently there is only a single implementation based on 
the commons.opennlp <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/TextAnalyzer.java";>TextAnalyzer</a>
 utility. In general this part is still very focused on OpenNLP. Making it also 
usable together with other NLP frameworks would probably need some 
re-factoring.</p>
 <p>The current state of the processing is represented by the <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/impl/ProcessingState.java";>ProcessingState</a>.
 Based on the capabilities of the NLP framework for the current language it 
provides the following set of information:</p>
 <ul>
 <li><strong>AnalysedSentence:</strong> If a sentence detector is present, then 
this represents the current sentence of the text. If not, then the whole text is 
represented as a single sentence. The AnalysedSentence also provides access to 
POS tags and Chunks (if available)</li>
@@ -84,7 +137,7 @@ The following list provides a short over
 <li><strong>TokenIndex:</strong> The index of the currently active token 
relative to the AnalysedSentence.</li>
 </ul>
 <p>The ProcessingState provides means to navigate to the next token. If chunks 
are present tokens that are outside of chunks are ignored.</p>
-<h3 id="entity_lookup">Entity Lookup</h3>
+<h3 id="entity-lookup">Entity Lookup</h3>
 <p>An "OR" query with [1..MAX_SEARCH_TOKENS] tokens is used to look up entities 
via the <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntitySearcher.java";>EntitySearcher</a>
 interface. If the actual implementation cuts off results, then it must be 
ensured that Entities that match both tokens are ranked first.
 Currently there are two implementations of this interface: (1) for the 
Entityhub (<a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntityhubSearcher.java";>EntityhubSearcher</a>)
 and (2) for ReferencedSites (<a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/ReferencedSiteSearcher.java";>ReferencedSiteSearcher</a>).
 There is also an <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/test/java/org/apache/stanbol/enhancer/engines/keywordextraction/impl/TestSearcherImpl.java";>Implementation</a>
 that holds entities in-memory, however currently this is only used for unit 
tests.</p>
 <p>Queries do use the configured <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java";>EntityLinkerConfig</a>.getNameField()
 and the language of labels is restricted to the current language or labels 
that do not define any language.</p>
@@ -94,7 +147,7 @@ Currently there are two implementations 
 <li>If this method returns NULL or no POS tags are available, then all Tokens 
longer than <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java";>EntityLinkerConfig</a>.getMinSearchTokenLength()
 (default=3) are considered as processable.</li>
 </ul>
 <p>Typically the next MAX_SEARCH_TOKENS processable tokens are used for a 
lookup. However the current Chunk/Sentence is never left in the search for 
processable tokens.</p>
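The decision whether a token is processable, including the fallback to the minimum token length, can be sketched as follows. This is an illustration in Python rather than the engine's implementation. The function name, the Penn Treebank style tag set and the probability threshold of 0.8 are assumptions made for the example; only the default minimum token length of 3 is taken from the text, and "longer than" is read here as strictly greater.

```python
# Hypothetical tag set for Nouns, Foreign Words and Numbers (Penn Treebank style).
INTERESTING_TAGS = frozenset({"NN", "NNS", "NNP", "NNPS", "FW", "CD"})


def is_processable(token, pos_tag=None, pos_prob=0.0,
                   interesting_tags=INTERESTING_TAGS,
                   min_pos_tag_probability=0.8,  # assumed value, range [0..1]
                   min_search_token_length=3):   # default stated in the text
    """Decide whether a token should be looked up in the controlled vocabulary."""
    if pos_tag is not None and pos_prob >= min_pos_tag_probability:
        # A sufficiently confident POS tag decides on its own.
        return pos_tag in interesting_tags
    # No POS tag, or its probability is too low: fall back to the token length.
    return len(token) > min_search_token_length
```

A confidently tagged adverb is skipped regardless of its length, while an untagged token such as "city" is processed because it is longer than three characters.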
-<h3 id="matching_of_found_entities">Matching of found Entities:</h3>
+<h3 id="matching-of-found-entities">Matching of found Entities:</h3>
 <p>All labels (values of the <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java";>EntityLinkerConfig</a>.getNameField()
 field) in the language of the content or without any defined language are 
candidates for matches.</p>
 <p>For each label that fulfills the above criteria the following steps are 
processed. The best result is used as the result of the whole matching 
process:</p>
 <ul>
@@ -109,15 +162,15 @@ Currently there are two implementations 
 <li>at least <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getMinFoundTokens()
 (default=2) tokens of a label match the text. This ensures that "<a 
href="http://en.wikipedia.org/wiki/Rupert_Murdoch">Rupert Murdoch</a>" is not 
suggested for "<a href="http://en.wikipedia.org/wiki/Rupert">Rupert</a>" but on 
the other hand "Barack Hussein Obama" is suggested for "Barack Obama". Setting 
"minFoundTokens" to values less than two will usually cause a lot of false 
positives, but would also come up with a suggestion for "Barack Obama" if the 
content contains the word "Obama".</li>
 </ul>
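The effect of the minimum found tokens setting on the examples above can be reproduced with a deliberately simplified check. This Python sketch is not the EntityLinker algorithm: the function name `label_matches` is hypothetical, tokens are split on whitespace, and comparison is case insensitive and ignores token order, all of which are assumptions of the example.

```python
def label_matches(label, mention_tokens, min_found_tokens=2):
    """Return True if the label is accepted for the tokens found in the text."""
    label_tokens = label.lower().split()
    mention = [t.lower() for t in mention_tokens]
    if label_tokens == mention:
        return True  # exact match: the minimum found tokens rule does not apply
    # Count the label tokens that also occur in the text.
    found = sum(1 for t in label_tokens if t in mention)
    return found >= min_found_tokens
```

With the default of 2, "Rupert Murdoch" is rejected for the text "Rupert", while "Barack Hussein Obama" is accepted for "Barack Obama". Lowering the value to 1 would accept "Barack Obama" for a text that only contains "Obama".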
 <p>The described matching process is currently directly part of the <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinker.java";>EntityLinker</a>.
 To support different matching strategies this would need to be externalized 
into its own "EntityLabelMatcher" interface.</p>
-<h3 id="processing_of_entity_suggestions">Processing of Entity Suggestions</h3>
+<h3 id="processing-of-entity-suggestions">Processing of Entity Suggestions</h3>
 <p>In case there are one or more <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/Suggestion.java";>Suggestion</a>s
 of Entities for the current position within the text a <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/LinkedEntity.java";>LinkedEntity</a>
 instance is created.</p>
 <p>LinkedEntity is an object model representing the Stanbol Enhancement 
Structure. After the processing of the parsed content is completed, the 
LinkedEntities are "serialized" as RDF triples to the metadata of the 
ContentItem.</p>
 <p>TextAnnotations as defined in the <a 
href="http://wiki.iks-project.eu/index.php/EnhancementStructure";>Stanbol 
Enhancement Structure</a> do use the <a 
href="http://www.dublincore.org/documents/dcmi-terms/#terms-type";>dc:type</a> 
property to provide the general type of the extracted Entity. However suggested 
Entities might have very specific types. Therefore the <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java";>EntityLinkerConfig</a>
 provides the possibility to map the specific types of the Entity to types used 
for the dc:type property of TextAnnotations. The <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java";>EntityLinkerConfig</a>.DEFAULT_ENTITY_TYPE_MAPPINGS
 contains some predefined mappings.
-<em>Note that the field used to retrieve the types of an suggested Entity can 
be configured by the EntityLinkerConfig. The default value for the type field 
is "rdf:type".</em></p>
+<em>Note that the field used to retrieve the types of a suggested Entity can 
be configured by the EntityLinkerConfig. The default value for the type field 
is "rdf:type".</em></p>
 <p>In some cases suggested entities might redirect to others. In the case of 
Wikipedia/DBpedia this is often used to link from acronyms like <a 
href="http://en.wikipedia.org/w/index.php?title=IMF&amp;redirect=no";>IMF</a> to 
the real entity <a 
href="http://en.wikipedia.org/wiki/International_Monetary_Fund";>International 
Monetary Fund</a>. Some Thesauri also define labels as separate Entities with a 
URI, and users might want to use the URI of the Concept rather than that of the 
label.
 To support such use cases the KeywordLinkingEngine has support for redirects. 
Users can first configure the redirect mode (ignore, copy values, follow) and 
secondly the field used to search for redirects (default=rdfs:seeAlso).
 If the redirect mode is not "ignore", the Entities referenced by the 
configured redirect field are retrieved for each suggestion. In case of the 
"copy values" mode the values of the name and type fields are copied. In case 
of the "follow" mode the suggested entity is replaced with the first redirected 
entity.</p>
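The three redirect modes can be summarized in a short sketch. It is written in Python for illustration and does not reflect the engine's classes: entities are modelled as plain dicts with a "uri" key and lists of values per field, `lookup` is a hypothetical function that returns the entity for a URI (or None), and "rdfs:label" as name field is an assumption. The defaults "rdfs:seeAlso" and "rdf:type" are the ones named in the text.

```python
def apply_redirect(entity, mode, lookup, redirect_field="rdfs:seeAlso",
                   name_field="rdfs:label", type_field="rdf:type"):
    """Process the redirects of a suggested entity according to the mode."""
    if mode == "IGNORE":
        return entity
    if mode not in ("ADD_VALUES", "FOLLOW"):
        raise ValueError(f"unknown redirect mode: {mode!r}")
    # Retrieve the entities referenced by the redirect field.
    targets = [lookup(uri) for uri in entity.get(redirect_field, [])]
    targets = [t for t in targets if t is not None]
    if not targets:
        return entity
    if mode == "FOLLOW":
        # Replace the suggestion with the first redirected entity.
        return targets[0]
    # ADD_VALUES: keep the URI, copy name and type values of the targets.
    merged = dict(entity)
    for field in (name_field, type_field):
        values = list(entity.get(field, []))
        for target in targets:
            for value in target.get(field, []):
                if value not in values:
                    values.append(value)
        merged[field] = values
    return merged
```

For the IMF example, "FOLLOW" would return the entity for the International Monetary Fund, whereas "ADD_VALUES" would keep the URI of the IMF entity and add the label and types of the redirect target to it.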
-<h3 id="confidence_for_suggestions">Confidence for Suggestions</h3>
+<h3 id="confidence-for-suggestions">Confidence for Suggestions</h3>
 <p>The confidence for suggestions is calculated based on the following 
algorithm:</p>
 <p>Input Parameters</p>
 <ul>
@@ -137,7 +190,7 @@ confidence = (match/max_matched)^2 * (ma
 <li>"New York City" matched against the text "New York Rangers" - assuming 
that "New York Rangers" is the best match - results in a confidence of (2/3)^2 
* (2/2) * (2/3) ≈ 0.3. Note that the best match "New York Rangers" has 
max_matched=3 and gets a confidence of 1.</li>
 </ul>
 <p>The calculation of the confidence is currently a direct part of the <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinker.java";>EntityLinker</a>.
 To support different matching strategies this would need to be externalized 
into its own interface.</p>
-<h2 id="future_plans_for_the_taxonomylinkingengine">Future Plans for the 
TaxonomyLinkingEngine</h2>
+<h2 id="future-plans-for-the-taxonomylinkingengine">Future Plans for the 
TaxonomyLinkingEngine</h2>
 <p>The TaxonomyLinkingEngine is still available and fully functional. However 
it is marked as deprecated and not included in any of the launchers. Current 
users are encouraged to switch over to the KeywordLinkingEngine. </p>
 <p>In the future it is planned to repurpose the TaxonomyLinkingEngine as a 
special version of the KeywordLinkingEngine with a specialized configuration 
and feature set targeted for (hierarchical) Taxonomies. </p>
 <p>This will include: </p>

Added: 
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png
==============================================================================
Binary file - no diff available.

Propchange: 
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png
------------------------------------------------------------------------------
    svn:mime-type = image/png

svn commit: r808822 - in /websites/staging/stanbol/trunk/content: ./ stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png
