Author: buildbot
Date: Thu Jul 12 06:21:34 2012
New Revision: 825542
Log:
Staging update by buildbot for stanbol
Modified:
websites/staging/stanbol/trunk/content/ (props changed)
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png
Propchange: websites/staging/stanbol/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Jul 12 06:21:34 2012
@@ -1 +1 @@
-1360368
+1360538
Modified:
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
==============================================================================
---
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
(original)
+++
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
Thu Jul 12 06:21:34 2012
@@ -97,6 +97,7 @@
<li><strong>Type Field</strong>
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.typeField)</em>:
Values of this field are used as values of the "fise:entity-types" property of
created "<a
href="../enhancementstructure.html#fiseentityannotation">fise:EntityAnnotation</a>"s.
The default is "rdf:type".</li>
<li><strong>Redirect Field</strong>
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.redirectField)</em>
and <strong>Redirect Mode</strong>
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.redirectMode)</em>:
Redirects tell the KeywordLinkingEngine to follow a specific property
in the knowledge base for matched entities. This feature allows, for example,
following redirects from "USA" to "United States" as defined in Wikipedia. See
"Processing of Entity Suggestions" for details. Possible values for the
Redirect Mode are "IGNORE" - deactivates this feature; "ADD_VALUES" - uses
the label and type information of redirected entities, but keeps the URI of the
extracted entity; "FOLLOW" - follows the redirect</li>
<li><strong>Min Token Length</strong>
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.minSearchTokenLength)</em>:
While the KeywordLinkingEngine preferably uses POS (part-of-speech) taggers to
determine if a word should be matched with the controlled vocabulary, the minimum
token length provides a fallback if (a) no POS tagger is available for the
language of the parsed text or (b) the confidence of the POS tagger is lower
than the threshold.</li>
+<li><strong>Minimum Token Match Factor</strong>
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.minTokenMatchFactor)</em>:
If a Token of the text is compared with a Token of an Entity Label, the
similarity of the two is expressed in the range [0..1]. The minimum token
match factor specifies the minimum similarity two Tokens need so that they are
considered to match. Lower similarity scores are not considered a match. This
parameter is important because it allows, for example, inflected forms of words
to match. However, it may also result in false positives for similar words.
Users should note that the similarity score is also used for calculating the
confidence, so similarity scores below 1 that still reach the configured minimum
token match factor will reduce the confidence of suggested Entities.</li>
<li><strong>Keyword Tokenizer</strong>
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.keywordTokenizer)</em>:
This allows the use of a special Tokenizer for matching keywords and alphanumeric
IDs. Typical language-specific Tokenizers tend to split such IDs into several
tokens and therefore might prevent correct matching. This Tokenizer should
only be activated if the KeywordLinkingEngine is configured to match against
IDs such as ISBN numbers or product IDs. It should not be used to match against
natural language labels. </li>
<li><strong>Suggestions</strong>
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.maxSuggestions)</em>:
The maximum number of suggested Entities.</li>
<li><strong>Languages</strong>
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages)</em>
and <strong>Default Matching Language</strong>
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.defaultMatchingLanguage)</em>:
The first allows specifying the languages that should be processed by this engine.
This is useful, for example, if the controlled vocabulary only contains labels in a
specific language but does not formally specify this information (by setting
the "xml:lang" property for labels). The default matching language can be used
to work around the exact opposite case. As an example, in DBpedia labels get
the language of the dataset they are extracted from (e.g. all data extracted
from en.wikipedia.org will get "xml:lang=en"). The default matching language
tells the KeywordLinkingEngine to use labels of that language for
matching regardless of the language of the parsed content. In the case of
DBpedia this allows, for example, matching persons
mentioned in an Italian text with the English labels extracted from
en.wikipedia.org. Details about natural language processing features used by
this engine are provided in the section "Multiple Language Support".</li>
@@ -161,9 +162,20 @@ The following list provides a short over
<li><strong>Token:</strong> The currently processed word part of the chunk and
the sentence.</li>
<li><strong>TokenIndex:</strong> The index of the currently active token
relative to the AnalysedSentence.</li>
</ul>
-<p>The ProcessingState provides means to navigate to the next token. If chunks
are present tokens that are outside of chunks are ignored.</p>
+<p>Processing is done based on Tokens (words). The ProcessingState provides
means to navigate to the next token. If Chunks are present, tokens that are
outside of Chunks are ignored. Only 'processable' tokens are considered to
look up entities (see the next section for details). Whether a Token is
processable is determined as follows:</p>
+<ul>
+<li>Only Tokens within a Chunk are considered. If no Chunks are available, all
Tokens are considered.</li>
+<li>If POS tags are available AND POS tags considered as NOUNS are configured
(see <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTagsCollectionEnum.java">PosTagsCollectionEnum</a>)
then POS tags are considered for deciding if a Token is processable<ul>
+<li>The minimum POS tag probability (minPosTagProb) is <code>0.667</code></li>
+<li>Tokens with a POS tag representing a NOUN and a probability >=
minPosTagProb are marked as processable</li>
+<li>Tokens with a POS tag NOT representing a NOUN and a probability >=
minPosTagProb/2 are marked as NOT processable</li>
+</ul>
+</li>
+<li>If POS tags are NOT available or the NOUN POS tags configuration is
missing the minimum token length
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.minSearchTokenLength)</em>
is used as fallback. This means that all Tokens with a length equal to or
greater than this value are marked as processable.</li>
+</ul>
+<p>This algorithm was introduced by <a
href="https://issues.apache.org/jira/browse/STANBOL-685">STANBOL-658</a></p>
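<p>The rules above can be sketched as follows. This is a minimal illustration
in Python, not the engine's Java code: the <code>Token</code> class, the
function name and the default of 3 for the minimum token length are
hypothetical. The fallback to the token length for low-confidence POS tags
follows the description of the Min Token Length parameter.</p>

```python
from dataclasses import dataclass
from typing import Optional

MIN_POS_TAG_PROB = 0.667  # minimum POS tag probability stated in the text


@dataclass
class Token:
    """Hypothetical stand-in for a token of the analysed sentence."""
    text: str
    in_chunk: bool = True
    pos_is_noun: Optional[bool] = None  # None means no POS tag is available
    pos_prob: float = 0.0


def is_processable(token, chunks_present, noun_tags_configured,
                   min_search_token_length=3):
    """Decide whether a token is used for entity lookup.

    min_search_token_length=3 is an assumed value for illustration only.
    """
    # Tokens outside of chunks are ignored when chunks are available.
    if chunks_present and not token.in_chunk:
        return False
    # POS tags are only used if they exist and NOUN tags are configured.
    if token.pos_is_noun is not None and noun_tags_configured:
        if token.pos_is_noun and token.pos_prob >= MIN_POS_TAG_PROB:
            return True
        if not token.pos_is_noun and token.pos_prob >= MIN_POS_TAG_PROB / 2:
            return False
        # Otherwise the tagger confidence is too low: use the length fallback.
    return len(token.text) >= min_search_token_length
```

<p>For instance, a confidently tagged noun is processable regardless of its
length, while an untagged two-letter word such as "of" is rejected by the
length fallback under the assumed minimum of 3.</p>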
<h3 id="entity-lookup">Entity Lookup</h3>
-<p>A "OR" query with [1..MAX_SEARCH_TOKENS] tokens is used to lookup entities
via the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntitySearcher.java">EntitySearcher</a>
interface. If the actual implementation cut off results, than it must be
ensured that Entities that match both tokens are ranked first.
+<p>An "OR" query with [1..MAX_SEARCH_TOKENS] processable tokens is used to
look up entities via the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntitySearcher.java">EntitySearcher</a>
interface. If the actual implementation cuts off results, then it must be
ensured that Entities that match both tokens are ranked first.
Currently there are two implementations of this interface: (1) for the
Entityhub (<a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntityhubSearcher.java">EntityhubSearcher</a>)
and (2) for ReferencedSites (<a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/ReferencedSiteSearcher.java">ReferencedSiteSearcher</a>).
There is also an <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/test/java/org/apache/stanbol/enhancer/engines/keywordextraction/impl/TestSearcherImpl.java">Implementation</a>
that holds entities in-memory, however currently this is only used for unit
tests.</p>
<p>Queries do use the configured <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getNameField()
and the language of labels is restricted to the current language or labels
that do not define any language.</p>
<p>Only "processable" tokens are used to look up entities. Whether a token is
processable is determined as follows:</p>
@@ -176,15 +188,20 @@ Currently there are two implementations
<p>All labels (values of the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getNameField()
field) in the language of the content or without any defined language are
candidates for matches.</p>
<p>For each label that fulfills the above criteria the following steps are
processed. The best result is used as the result of the whole matching
process:</p>
<ul>
-<li>All tokens (of the text) following the current position are searched
within the label.</li>
-<li>As of now, tokens MUST appear in the correct order within a label (e.g.
"Murdoch Rupert" will NOT match "Rupert Murdoch")</li>
-<li>On the first processable token of the text that is not present within the
label matching is canceled. (see the definition of processable token above)</li>
-<li>On the second non-processable token not found in the label the matching is
also canceled (e.g. "University of Michigan" will match "University
Michigan")</li>
+<li>Tokens (of the text) following the current position are searched within
the label. This also includes non-processable Tokens. <ul>
+<li>Processable Tokens MUST match with Tokens in the Label. At most
<a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getMaxNotFound()
non-processable Tokens are allowed to remain unmatched.</li>
+<li>Token order is important. Tokens in the Entity Label are allowed to be
skipped (e.g. the text 'Barack Obama' will match the label 'Barack Hussein
Obama' because Hussein is allowed to be skipped. The other way around it would
be no match because processable Tokens in the Text are not allowed to be
skipped)</li>
+</ul>
+</li>
+<li>If the first Token of the Label is not matched, preceding Tokens of the
Text are matched against the Label. This is done to ensure that Entities that
use adjectives in their labels (e.g. "great improvement", "Gute
Deutschkenntnisse") are matched. In addition, this also helps to match named
entities (e.g. person names), as the first token of such a mention is sometimes
erroneously classified as an adjective by POS taggers.</li>
+<li>Tokens that appear in the wrong order (e.g. the text 'Obama, Barack' with
the label 'Barack Obama') are matched with a factor of <code>0.7</code>.
Currently only exact matches are considered.</li>
</ul>
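<p>The in-order matching of text Tokens against a single label can be sketched
as follows. This is a simplified Python illustration, not the EntityLinker
code: it compares tokens for exact equality only, ignores the matching of
preceding Tokens and the wrong-order factor, and the function name and the
default of 1 for the allowed number of unmatched non-processable Tokens are
hypothetical.</p>

```python
def count_label_matches(text_tokens, label_tokens, max_not_found=1):
    """Count processable text tokens found, in order, within one label.

    text_tokens: list of (word, processable) pairs starting at the current
    position. label_tokens: list of words of the label.
    max_not_found=1 is an assumed value for illustration only.
    """
    label_pos = 0            # next label position that may still be matched
    matched_processable = 0
    not_found = 0            # unmatched non-processable text tokens so far
    for word, processable in text_tokens:
        try:
            # Searching from label_pos onwards lets label tokens be skipped
            # while keeping the token order.
            found_at = label_tokens.index(word, label_pos)
        except ValueError:
            found_at = -1
        if found_at >= 0:
            label_pos = found_at + 1
            if processable:
                matched_processable += 1
        elif processable:
            break  # a processable text token must not be skipped
        else:
            not_found += 1
            if not_found > max_not_found:
                break
    return matched_processable
```

<p>Under these assumptions the text "Barack Obama" yields two matches against
the label "Barack Hussein Obama", while the text "Barack Hussein Obama" stops
after one match against the label "Barack Obama".</p>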
+<p>Whether two tokens match is determined by dividing the length of the longest
matching part, counted from the beginning of the Token, by the maximum length of
the two tokens. For example, 'German' would match with 'Germany' with
<code>6/7=0.86</code>. The result of this comparison is the token similarity.
If this similarity is greater than or equal to the configured minimum token
match factor
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.minTokenMatchFactor)</em>
then those tokens are considered to match. The token similarity is also used
for calculating the confidence.<br />
+</p>
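<p>The token similarity described above can be sketched in a few lines of
Python. The function names are hypothetical, the comparison is assumed to be
case-sensitive, and the threshold of 0.7 in the usage note is only an example
value. Counting characters, 'German' (6) against 'Germany' (7) shares a prefix
of length 6, so this sketch yields 6/7.</p>

```python
def token_similarity(text_token, label_token):
    """Length of the common prefix divided by the length of the longer token."""
    longest = max(len(text_token), len(label_token))
    if longest == 0:
        return 0.0
    prefix = 0
    for a, b in zip(text_token, label_token):
        if a != b:
            break
        prefix += 1
    return prefix / longest


def tokens_match(text_token, label_token, min_token_match_factor):
    """Tokens match if their similarity reaches the configured minimum."""
    return token_similarity(text_token, label_token) >= min_token_match_factor
```

<p>With an example minimum token match factor of 0.7, 'German' and 'Germany'
are considered to match, while 'German' and 'Gerund' (common prefix 'Ger',
similarity 3/6) are not.</p>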
<p>Entities are <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/Suggestion.java">Suggested</a>
if:</p>
<ul>
-<li>a label does match exactly with the text following the current position it
the entity is suggested. (e.g. <a
href="http://en.wikipedia.org/wiki/Passerine">Passerine</a>)</li>
-<li>a label matches at least <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getMinFoundTokens()
(default=2) are matching with the text. This ensures that "<a
href="http://en.wikipedia.org/wiki/Rupert_Murdoch">Rupert Murdoch</a>" is not
suggested for "<a href="http://en.wikipedia.org/wiki/Rupert">Rupert</a>" but on
the other hand "Barack Hussein Obama" is suggested for "Barack Obama". Setting
"minFoundToken" to values less than two will usually cause a lot of false
positives, but would also come up with a suggestion for "Barack Obama" if the
content contains the word "Obama".</li>
+<li>a label matches exactly at the current position in the text. This is the
case if all tokens of the Label match with the Tokens of the text. Note that
tokens are considered to match if the similarity is greater than or equal to the
minimum token match factor.</li>
+<li>partial matches are considered if at least <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getMinFoundTokens()
(default=2) processable tokens match. Non-processable tokens are not
considered for this. This ensures that "<a
href="http://en.wikipedia.org/wiki/Rupert_Murdoch">Rupert Murdoch</a>" is not
suggested for "<a href="http://en.wikipedia.org/wiki/Rupert">Rupert</a>" but on
the other hand "Barack Hussein Obama" is suggested for "Barack Obama".</li>
</ul>
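<p>The two suggestion criteria can be sketched as a single predicate. This is
a Python illustration with hypothetical names, not the Suggestion class. It
reads the threshold as "at least" minFoundTokens, because with the default of
2 the 'Barack Obama' example only holds under that reading.</p>

```python
def should_suggest(label_token_count, matched_label_tokens,
                   matched_processable_tokens, min_found_tokens=2):
    """Return True if an entity label qualifies as a suggestion.

    min_found_tokens=2 mirrors the documented default of getMinFoundTokens().
    """
    # Exact match: every token of the label matched a token of the text.
    exact = matched_label_tokens == label_token_count
    # Partial match: enough processable tokens matched; non-processable
    # tokens are not counted.
    partial = matched_processable_tokens >= min_found_tokens
    return exact or partial
```

<p>With the default, the label "Rupert Murdoch" (two tokens, one matched) is
not suggested for the text "Rupert", whereas "Barack Hussein Obama" (three
tokens, two processable matches) is suggested for "Barack Obama".</p>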
<p>The described matching process is currently directly part of the <a
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinker.java">EntityLinker</a>.
To support different matching strategies this would need to be externalized
into its own "EntityLabelMatcher" interface.</p>
<h3 id="processing-of-entity-suggestions">Processing of Entity Suggestions</h3>
Modified:
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png
==============================================================================
Binary files - no diff available.