Author: buildbot
Date: Thu Sep 22 15:43:29 2011
New Revision: 796111
Log:
Staging update by buildbot
Modified:
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
==============================================================================
---
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
(original)
+++
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
Thu Sep 22 15:43:29 2011
@@ -56,9 +56,17 @@ The following list provides a short over
<li><strong>Multi-lingual labels of the controlled vocabulary:</strong>
Entities are matched based on labels of the current language and labels without
any defined language. For example, English labels will not be matched against German
language texts. Therefore it is important to have a controlled vocabulary that
includes labels in the language of the texts you want to enhance.</li>
<li>
<p><strong>Natural Language Processing support:</strong> The
KeywordLinkingEngine is able to use <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/sentdetect/SentenceDetector.html">Sentence
Detectors</a>, <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/postag/POSTagger.html">POS
(Part of Speech) taggers</a> and <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html">Chunkers</a>.
If such components are available for a language then they are used to optimize
the enhancement process.</p>
+</li>
+<li>
<p><strong>Sentence detector:</strong> If a sentence detector is present, the
memory footprint of the engine is reduced, because tokens, POS tags and chunks
are only kept for the currently active sentence. If no sentence detector is
available, the entire content is treated as a single sentence.</p>
+</li>
+<li>
<p><strong>Tokenizer:</strong> A (word) <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/Tokenizer.html">tokenizer</a>
is required for the enhancement process. If no specific tokenizer is available
for a given language, then the <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/SimpleTokenizer.html">OpenNLP
SimpleTokenizer</a> is used as the default. How well this tokenizer works
depends on the language.</p>
+</li>
+<li>
<p><strong>POS tagger:</strong> POS (Part-of-Speech) taggers annotate tokens
with their type. Because the KeywordLinkingEngine is only interested in
nouns, foreign words and numbers, the presence of such a tagger allows it to skip
many of the tokens and thereby improve performance. However, POS taggers use
different sets of tags for different languages. It is therefore not
enough that a POS tagger is available for a language; there MUST also be a
configuration of the POS tags representing nouns.</p>
+</li>
+<li>
<p><strong>Chunker:</strong> There are two types of Chunkers: first, the <a
href="http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html">Chunkers</a>
provided by OpenNLP (based on statistical models), and second, a <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTypeChunker.java">POS
tag based Chunker</a> provided by the OpenNLP bundle of Stanbol. Currently, the
availability of a Chunker has little influence on either the performance or
the quality of the enhancements.</p>
</li>
<li>