Author: rwesten
Date: Thu Jul 12 06:21:24 2012
New Revision: 1360538
URL: http://svn.apache.org/viewvc?rev=1360538&view=rev
Log:
updated the documentation of the KeywordLinkingEngine to reflect changes
introduced by STANBOL-685 and STANBOL-686
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext?rev=1360538&r1=1360537&r2=1360538&view=diff
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext
(original)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext
Thu Jul 12 06:21:24 2012
@@ -21,6 +21,7 @@ The example in the scene shows an config
* __Type Field__
_(org.apache.stanbol.enhancer.engines.keywordextraction.typeField)_: Values of
this field are used as values of the "fise:entity-types" property of created
"[fise:EntityAnnotation](../enhancementstructure.html#fiseentityannotation)"s.
The default is "rdf:type".
* __Redirect Field__
_(org.apache.stanbol.enhancer.engines.keywordextraction.redirectField)_ and
__Redirect Mode__
_(org.apache.stanbol.enhancer.engines.keywordextraction.redirectMode)_:
Redirects tell the KeywordLinkingEngine to follow a specific property
in the knowledge base for matched entities. This feature allows, for example,
following redirects from "USA" to "United States" as defined in Wikipedia. See
"Processing of Entity Suggestions" for details. Possible values for the
Redirect Mode are "IGNORE" - deactivates this feature; "ADD_VALUES" - uses
label and type information of redirected entities, but keeps the URI of the
extracted entity; "FOLLOW" - follows the redirect
* __Min Token Length__
_(org.apache.stanbol.enhancer.engines.keywordextraction.minSearchTokenLength)_:
While the KeywordLinkingEngine preferably uses POS (part-of-speech) taggers to
determine if a word should be matched with the controlled vocabulary, the minimum
token length provides a fallback if (a) no POS tagger is available for the
language of the parsed text or (b) the confidence of the POS tagger is lower
than the threshold.
+* __Minimum Token Match Factor__
_(org.apache.stanbol.enhancer.engines.keywordextraction.minTokenMatchFactor)_:
If a Token of the text is compared with a Token of an Entity Label, the
similarity of those two is expressed in the range [0..1]. The minimum token
match factor specifies the minimum similarity of two Tokens so that they are
considered to match. Lower similarity scores are not considered a match. This
parameter is important as it e.g. allows inflected forms of words to match.
However, it may also result in false positives for similar words. Users should
note that the similarity score is also used for calculating the confidence, so
similarity scores < 1 but higher than the configured minimum token match factor
will reduce the confidence of suggested Entities.
* __Keyword Tokenizer__
_(org.apache.stanbol.enhancer.engines.keywordextraction.keywordTokenizer)_:
This allows the use of a special Tokenizer for matching keywords and alphanumeric
IDs. Typical language-specific Tokenizers tend to split such IDs into several
tokens and therefore might prevent a correct matching. This Tokenizer should
only be activated if the KeywordLinkingEngine is configured to match against
IDs like ISBN numbers, Product IDs ... It should not be used to match against
natural language labels.
* __Suggestions__
_(org.apache.stanbol.enhancer.engines.keywordextraction.maxSuggestions)_: The
maximum number of suggested Entities.
* __Languages__
_(org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages)_
and __Default Matching Language__
_(org.apache.stanbol.enhancer.engines.keywordextraction.defaultMatchingLanguage)_:
The first specifies the languages that should be processed by this engine.
This is e.g. useful if the controlled vocabulary only contains labels in a
specific language but does not formally specify this information (by setting
the "xml:lang" property for labels). The default matching language can be used
to work around the exact opposite case. As an example, in DBpedia labels get
the language of the dataset they are extracted from (e.g. all data extracted
from en.wikipedia.org will get "xml:lang=en"). The default matching language
tells the KeywordLinkingEngine to use labels of that language for
matching regardless of the language of the parsed content. In the case of
DBpedia this allows e.g. to match persons mentioned in an Italian text with the
English labels extracted from en.wikipedia.org. Details about natural language
processing features used by this engine are provided in the section "Multiple
Language Support".
@@ -93,11 +94,20 @@ The current state of the processing is r
* __Token:__ The currently processed word part of the chunk and the sentence.
* __TokenIndex:__ The index of the currently active token relative to the
AnalysedSentence.
-The ProcessingState provides means to navigate to the next token. If chunks
are present tokens that are outside of chunks are ignored.
+Processing is done based on Tokens (words). The ProcessingState provides means
to navigate to the next token. If Chunks are present, tokens that are outside of
chunks are ignored. Only 'processable' tokens are used to look up entities
(see the next section for details). Whether a Token is processable is determined
as follows:
+
+* Only Tokens within a Chunk are considered. If no Chunks are available, all
Tokens are considered.
+* If POS tags are available AND POS tags considered as NOUNS are configured
(see
[PosTagsCollectionEnum](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTagsCollectionEnum.java)),
then POS tags are used to decide if a Token is processable
+ * The minimum POS tag probability is <code>0.667</code>
+ * Tokens with a POS tag representing a NOUN and a probability >=
minPosTagProb are marked as processable
+ * Tokens with a POS tag NOT representing a NOUN and a probability >=
minPosTagProb/2 are marked as NOT processable
+* If POS tags are NOT available or the NOUN POS tags configuration is missing,
the minimum token length
_(org.apache.stanbol.enhancer.engines.keywordextraction.minSearchTokenLength)_
is used as fallback. This means that all Tokens of this length or longer
are marked as processable.
+
+This algorithm was introduced by
[STANBOL-685](https://issues.apache.org/jira/browse/STANBOL-685).
### Entity Lookup ###
-A "OR" query with [1..MAX_SEARCH_TOKENS] tokens is used to lookup entities via
the
[EntitySearcher](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntitySearcher.java)
interface. If the actual implementation cut off results, than it must be
ensured that Entities that match both tokens are ranked first.
+An "OR" query with [1..MAX_SEARCH_TOKENS] processable tokens is used to look up
entities via the
[EntitySearcher](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntitySearcher.java)
interface. If the actual implementation cuts off results, then it must be
ensured that Entities that match both tokens are ranked first.
Currently there are two implementations of this interface: (1) for the
Entityhub
([EntityhubSearcher](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntityhubSearcher.java))
and (2) for ReferencedSites
([ReferencedSiteSearcher](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/ReferencedSiteSearcher.java)).
There is also an
[Implementation](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/test/java/org/apache/stanbol/enhancer/engines/keywordextraction/impl/TestSearcherImpl.java)
that holds entities in-memory, however currently this is only used for unit
tests.
Queries do use the configured
[EntityLinkerConfig](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java).getNameField()
and the language of labels is restricted to the current language or labels
that do not define any language.
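
The rules for deciding whether a Token is processable, as listed earlier in this change, can be illustrated with a small Python sketch. It is not the engine's implementation (the engine is written in Java). The `Token` class, the `is_processable` function and the example tag set `{"NN", "NNP"}` are hypothetical names chosen for illustration. Falling back to the minimum token length when the POS tag probability is too low for a decision follows the description of the Min Token Length option.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Minimum POS tag probability as stated in the documentation.
MIN_POS_TAG_PROB = 0.667


@dataclass
class Token:
    """Hypothetical token holder used only for this sketch."""
    text: str
    pos_tag: Optional[str] = None   # None means no POS tag is available
    pos_prob: float = 0.0           # probability reported by the POS tagger
    in_chunk: bool = True           # whether the token lies within a Chunk


def is_processable(token: Token, noun_tags: Set[str],
                   min_token_length: int,
                   chunks_available: bool = True) -> bool:
    # Tokens outside of Chunks are ignored when Chunks are present.
    if chunks_available and not token.in_chunk:
        return False
    # POS tags are only used if they are available AND noun tags are configured.
    if token.pos_tag is not None and noun_tags:
        is_noun = token.pos_tag in noun_tags
        if is_noun and token.pos_prob >= MIN_POS_TAG_PROB:
            return True
        if not is_noun and token.pos_prob >= MIN_POS_TAG_PROB / 2:
            return False
        # Tagger confidence too low for a decision: use the length fallback.
    # Fallback: all tokens of the minimum length or longer are processable.
    return len(token.text) >= min_token_length


nouns = {"NN", "NNP"}  # assumed example tags, not the engine's configuration
print(is_processable(Token("Paris", "NNP", 0.9), nouns, 3))
print(is_processable(Token("of", "IN", 0.5), nouns, 3))
```

In this sketch a confident noun tag marks the token as processable, a reasonably confident non-noun tag marks it as not processable, and everything else is decided by the token length.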
@@ -115,15 +125,18 @@ All labels (values of the [EntityLinkerC
For each label that fulfills the above criteria the following steps are
processed. The best result is used as the result of the whole matching process:
-* All tokens (of the text) following the current position are searched within
the label.
-* As of now, tokens MUST appear in the correct order within a label (e.g.
"Murdoch Rupert" will NOT match "Rupert Murdoch")
-* On the first processable token of the text that is not present within the
label matching is canceled. (see the definition of processable token above)
-* On the second non-processable token not found in the label the matching is
also canceled (e.g. "University of Michigan" will match "University Michigan")
+* Tokens (of the text) following the current position are searched within the
label. This also includes non-processable Tokens.
+ * Processable Tokens MUST match with Tokens in the Label. A maximum number
of
[EntityLinkerConfig](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java).getMaxNotFound()
non-processable Tokens may not match.
+ * Token order is important. Tokens in the Entity Label are allowed to be
skipped (e.g. the text 'Barack Obama' will match the label 'Barack Hussein
Obama' because Hussein is allowed to be skipped). The other way around it would
be no match because processable Tokens in the Text are not allowed to be skipped.
+* If the first Token of the Label is not matched, preceding Tokens of the Text
are matched against the Label. This is done to ensure that Entities that use
adjectives in their labels (e.g. "great improvement", "Gute Deutschkenntnisse")
are matched. In addition this also helps to match named entities (e.g. person
names) as the first token of those mentions is sometimes erroneously
classified as an adjective by POS taggers.
+* Tokens that appear in the wrong order (e.g. the text 'Obama, Barack' with
the label 'Barack Obama') are matched with a factor of <code>0.7</code>.
Currently only exact matches are considered.
+
+Whether two tokens match is calculated by dividing the length of the longest
matching part from the beginning of the Tokens by the maximum length of the two
tokens, e.g. 'German' would match 'Germany' with <code>6/7=0.86</code>. The
result of this comparison is the token similarity. If this similarity is greater
than or equal to the configured minimum token match factor
_(org.apache.stanbol.enhancer.engines.keywordextraction.minTokenMatchFactor)_
then those tokens are considered to match. The token similarity is also used
for calculating the confidence.
Entities are
[Suggested](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/Suggestion.java)
if:
-* a label does match exactly with the text following the current position it
the entity is suggested. (e.g.
[Passerine](http://en.wikipedia.org/wiki/Passerine))
-* a label matches at least
[EntityLinkerConfig](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java).getMinFoundTokens()
(default=2) are matching with the text. This ensures that "[Rupert
Murdoch](http://en.wikipedia.org/wiki/Rupert_Murdoch)" is not suggested for
"[Rupert](http://en.wikipedia.org/wiki/Rupert)" but on the other hand "Barack
Hussein Obama" is suggested for "Barack Obama". Setting "minFoundToken" to
values less than two will usually cause a lot of false positives, but would
also come up with a suggestion for "Barack Obama" if the content contains the
word "Obama".
+* a label matches exactly at the current position in the text. This is the
case if all tokens of the Label match the Tokens of the text. Note that tokens
are considered to match if the similarity is greater than or equal to the
minimum token match factor.
+* partial matches are considered if at least
[EntityLinkerConfig](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java).getMinFoundTokens()
(default=2) processable tokens match. Non-processable tokens are not
considered for this. This ensures that "[Rupert
Murdoch](http://en.wikipedia.org/wiki/Rupert_Murdoch)" is not suggested for
"[Rupert](http://en.wikipedia.org/wiki/Rupert)" but on the other hand "Barack
Hussein Obama" is suggested for "Barack Obama".
The described matching process is currently directly part of the
[EntityLinker](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinker.java).
To support different matching strategies this would need to be externalized
into a separate "EntityLabelMatcher" interface.
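
The token similarity and the suggestion criteria described in this change can be sketched in Python as follows. This is a simplified illustration, not the EntityLinker code. All function names are hypothetical, and the minimum token match factor of 0.8 in the usage lines is an assumed value. The sketch treats every text token as processable, compares tokens case-sensitively, and leaves out non-processable tokens, the matching of preceding tokens and the wrong-order factor.

```python
def token_similarity(text_token: str, label_token: str) -> float:
    """Length of the common prefix divided by the length of the longer token."""
    longest = max(len(text_token), len(label_token))
    if longest == 0:
        return 0.0
    common = 0
    for a, b in zip(text_token, label_token):
        if a != b:
            break
        common += 1
    return common / longest


def match_label(text_tokens, label_tokens, min_factor):
    """Count text tokens matched in order; label tokens may be skipped."""
    matched = 0
    label_pos = 0
    for token in text_tokens:
        found = None
        for i in range(label_pos, len(label_tokens)):
            if token_similarity(token, label_tokens[i]) >= min_factor:
                found = i
                break
        if found is None:
            # Assumption of this sketch: matching stops at the first
            # (processable) text token that is not found in the label.
            break
        matched += 1
        label_pos = found + 1
    return matched


def is_suggested(text_tokens, label_tokens, min_factor, min_found_tokens=2):
    matched = match_label(text_tokens, label_tokens, min_factor)
    # Exact match: every token of the label was matched by a text token.
    exact = len(label_tokens) > 0 and matched == len(label_tokens)
    # Partial match: enough text tokens were matched.
    return exact or matched >= min_found_tokens


# 'German' and 'Germany' share a prefix of 6 characters; the longer token has 7.
print(round(token_similarity("German", "Germany"), 2))
print(is_suggested(["Barack", "Obama"], ["Barack", "Hussein", "Obama"], 0.8))
print(is_suggested(["Rupert"], ["Rupert", "Murdoch"], 0.8))
```

With the assumed values, "Barack Hussein Obama" is suggested for the text "Barack Obama" while "Rupert Murdoch" is not suggested for "Rupert", which mirrors the examples given in the documentation.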
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png?rev=1360538&r1=1360537&r2=1360538&view=diff
==============================================================================
Binary files - no diff available.