svn commit: r1360538 - in /incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines: keywordlinkingengine.mdtext keywordlinkingengineconfig.png

rwesten Wed, 11 Jul 2012 23:21:52 -0700

Author: rwesten
Date: Thu Jul 12 06:21:24 2012
New Revision: 1360538

URL: http://svn.apache.org/viewvc?rev=1360538&view=rev
Log:
updated the documentation of the KeywordLinkingEngine to reflect changes 
introduced by STANBOL-685 and STANBOL-686


Modified:
    
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext
    
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png

Modified: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext?rev=1360538&r1=1360537&r2=1360538&view=diff
==============================================================================
--- 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext
 (original)
+++ 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext
 Thu Jul 12 06:21:24 2012
@@ -21,6 +21,7 @@ The example in the scene shows an config
 * __Type Field__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.typeField)_: Values of 
this field are used as values of the "fise:entity-types" property of created 
"[fise:EntityAnnotation](../enhancementstructure.html#fiseentityannotation)"s. 
The default is "rdf:type".
 * __Redirect Field__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.redirectField)_ and 
__Redirect Mode__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.redirectMode)_: 
Redirects tell the KeywordLinkingEngine to follow a specific property 
in the knowledge base for matched entities. This feature allows, for example, following 
redirects from "USA" to "United States" as defined in Wikipedia. See 
"Processing of Entity Suggestions" for details. Possible values for the 
Redirect Mode are "IGNORE" - deactivates this feature; "ADD_VALUES" - uses 
the label and type information of redirected entities, but keeps the URI of the 
extracted entity; "FOLLOW" - follows the redirect.
 * __Min Token Length__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.minSearchTokenLength)_: 
The KeywordLinkingEngine preferably uses POS (part-of-speech) taggers to 
determine whether a word should be matched with the controlled vocabulary. The minimum 
token length provides a fallback if (a) no POS tagger is available for the 
language of the parsed text or (b) the confidence of the POS tagger is lower 
than the threshold.
+* __Minimum Token Match Factor__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.minTokenMatchFactor)_: 
If a Token of the text is compared with a Token of an Entity Label, the 
similarity of the two is expressed in the range [0..1]. The minimum token 
match factor specifies the minimum similarity two Tokens need so that they are 
considered to match. Lower similarity scores are not considered a match. This 
parameter is important because it allows, for example, inflected forms of words to match. 
However, it may also result in false positives for similar words. Users should 
note that the similarity score is also used for calculating the confidence, so 
similarity scores < 1 but higher than the configured minimum token match factor 
will reduce the confidence of suggested Entities.
 * __Keyword Tokenizer__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.keywordTokenizer)_: 
This enables a special Tokenizer for matching keywords and alphanumeric 
IDs. Typical language-specific Tokenizers tend to split such IDs into several 
tokens and therefore might prevent correct matching. This Tokenizer should 
only be activated if the KeywordLinkingEngine is configured to match against 
IDs like ISBN numbers, Product IDs ... It should not be used to match against 
natural language labels. 
 * __Suggestions__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.maxSuggestions)_: The 
maximum number of suggested Entities.
 * __Languages__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages)_ 
and __Default Matching Language__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.defaultMatchingLanguage)_:
 The first specifies the languages that should be processed by this engine. 
This is e.g. useful if the controlled vocabulary only contains labels in a 
specific language but does not formally specify this information (by setting 
the "xml:lang" property for labels). The default matching language can be used 
to work around the exact opposite case. As an example, in DBpedia labels get 
the language of the dataset they are extracted from (e.g. all data extracted 
from en.wikipedia.org will get "xml:lang=en"). The default matching language 
tells the KeywordLinkingEngine to use labels of that language for 
matching regardless of the language of the parsed content. In the case of 
DBpedia this allows e.g. matching persons mentioned in an Italian text with the 
English labels extracted from en.wikipedia.org. Details about natural language 
processing features used by this engine are provided in the section "Multiple 
Language Support".
@@ -93,11 +94,20 @@ The current state of the processing is r
 * __Token:__ The currently processed word part of the chunk and the sentence.
 * __TokenIndex:__ The index of the currently active token relative to the 
AnalysedSentence.
 
-The ProcessingState provides means to navigate to the next token. If chunks 
are present tokens that are outside of chunks are ignored.
+Processing is done based on Tokens (words). The ProcessingState provides means 
to navigate to the next token. If Chunks are present, tokens outside of 
chunks are ignored. Only 'processable' tokens are used to look up entities 
(see the next section for details). Whether a Token is processable is determined as 
follows:
+
+* Only Tokens within a Chunk are considered. If no Chunks are available, all 
Tokens are considered.
+* If POS tags are available AND POS tags considered as NOUNS are configured 
(see 
[PosTagsCollectionEnum](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTagsCollectionEnum.java)),
 then POS tags are used to decide whether a Token is processable:
+    * The minimum POS tag probability is <code>0.667</code>
+    * Tokens with a POS tag representing a NOUN and a probability >= 
minPosTagProb are marked as processable
+    * Tokens with a POS tag NOT representing a NOUN and a probability >= 
minPosTagProb/2 are marked as NOT processable
+* If POS tags are NOT available or the NOUN POS tags configuration is missing, 
the minimum token length 
_(org.apache.stanbol.enhancer.engines.keywordextraction.minSearchTokenLength)_ 
is used as fallback. This means that all Tokens with a length equal to or greater than this 
value are marked as processable.
+
+This algorithm was introduced by 
[STANBOL-685](https://issues.apache.org/jira/browse/STANBOL-685)
 
 ### Entity Lookup ###
 
-A "OR" query with [1..MAX_SEARCH_TOKENS] tokens is used to lookup entities via 
the 
[EntitySearcher](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntitySearcher.java)
 interface. If the actual implementation cut off results, than it must be 
ensured that Entities that match both tokens are ranked first.
+An "OR" query with [1..MAX_SEARCH_TOKENS] processable tokens is used to look up 
entities via the 
[EntitySearcher](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntitySearcher.java)
 interface. If the actual implementation cuts off results, then it must 
ensure that Entities that match both tokens are ranked first.
 Currently there are two implementations of this interface: (1) for the 
Entityhub 
([EntityhubSearcher](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntityhubSearcher.java))
 and (2) for ReferencedSites 
([ReferencedSiteSearcher](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/ReferencedSiteSearcher.java)).
 There is also an 
[Implementation](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/test/java/org/apache/stanbol/enhancer/engines/keywordextraction/impl/TestSearcherImpl.java)
 that holds entities in-memory, however currently this is only used for unit 
tests.
 
 Queries do use the configured 
[EntityLinkerConfig](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java).getNameField()
 and the language of labels is restricted to the current language or labels 
that do not define any language.
@@ -115,15 +125,18 @@ All labels (values of the [EntityLinkerC
 
 For each label that fulfills the above criteria the following steps are 
processed. The best result is used as the result of the whole matching process:
 
-* All tokens (of the text) following the current position are searched within 
the label.
-* As of now, tokens MUST appear in the correct order within a label (e.g. 
"Murdoch Rupert" will NOT match "Rupert Murdoch")
-* On the first processable token of the text that is not present within the 
label matching is canceled. (see the definition of processable token above)
-* On the second non-processable token not found in the label the matching is 
also canceled (e.g. "University of Michigan" will match "University Michigan")
+* Tokens (of the text) following the current position are searched within the 
label. This also includes non-processable Tokens. 
+    * Processable Tokens MUST match with Tokens in the Label. At most 
[EntityLinkerConfig](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java).getMaxNotFound()
 non-processable Tokens are allowed not to match.
+    * Token order is important. Tokens in the Entity Label are allowed to be 
skipped (e.g. the text 'Barack Obama' will match the label 'Barack Hussein 
Obama' because Hussein is allowed to be skipped. The other way around it would 
be no match because processable Tokens in the Text are not allowed to be skipped).
+* If the first Token of the Label is not matched, preceding Tokens of the Text 
are matched against the Label. This is done to ensure that Entities that use 
adjectives in their labels (e.g. "great improvement", "Gute Deutschkenntnisse") 
are matched. In addition, this also helps to match named entities (e.g. person 
names), as the first token of such mentions is sometimes erroneously 
classified as an adjective by POS taggers.
+* Tokens that appear in the wrong order (e.g. the text 'Obama, Barack' with 
the label 'Barack Obama') are matched with a factor of <code>0.7</code>. 
Currently only exact matches are considered.
+
+Whether two tokens match is determined by dividing the length of the longest matching part from 
the beginning of the Token by the maximum length of the two tokens, e.g. 'German' 
would match 'Germany' with <code>6/7=0.86</code>. The result of this 
comparison is the token similarity. If this similarity is greater than or equal to 
the configured minimum token match factor 
_(org.apache.stanbol.enhancer.engines.keywordextraction.minTokenMatchFactor)_, 
then those tokens are considered to match. The token similarity is also used 
for calculating the confidence.  
 
 Entities are 
[Suggested](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/Suggestion.java)
 if:
 
-* a label does match exactly with the text following the current position it 
the entity is suggested. (e.g. 
[Passerine](http://en.wikipedia.org/wiki/Passerine))
-* a label matches at least 
[EntityLinkerConfig](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java).getMinFoundTokens()
 (default=2) are matching with the text. This ensures that "[Rupert 
Murdoch](http://en.wikipedia.org/wiki/Rupert_Murdoch)" is not suggested for 
"[Rupert](http://en.wikipedia.org/wiki/Rupert)" but on the other hand "Barack 
Hussein Obama" is suggested for "Barack Obama". Setting "minFoundToken" to 
values less than two will usually cause a lot of false positives, but would 
also come up with a suggestion for "Barack Obama" if the content contains the 
word "Obama".
+* a label matches exactly at the current position in the text. This is the case if 
all tokens of the Label match with the Tokens of the text. Note that tokens are 
considered to match if the similarity is greater than or equal to the minimum token 
match factor.
+* partial matches are considered if at least 
[EntityLinkerConfig](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java).getMinFoundTokens()
 (default=2) processable tokens match. Non-processable tokens are not 
considered for this. This ensures that "[Rupert 
Murdoch](http://en.wikipedia.org/wiki/Rupert_Murdoch)" is not suggested for 
"[Rupert](http://en.wikipedia.org/wiki/Rupert)" but on the other hand "Barack 
Hussein Obama" is suggested for "Barack Obama".
 
 The described matching process is currently directly part of the 
[EntityLinker](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinker.java).
 To support different matching strategies this would need to be externalized 
into its own "EntityLabelMatcher" interface.
 

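The rules for deciding whether a Token is processable, as described in the diff above, can be sketched roughly as follows. This is only an illustration in Python, not the engine's Java code: the function name `is_processable` and all of its parameters are hypothetical, and falling back to the token length when the POS tag probability is below both thresholds is an assumption based on the "Min Token Length" description.

```python
from typing import Optional

# Minimum POS tag probability named in the documentation above.
MIN_POS_TAG_PROB = 0.667


def is_processable(token: str,
                   in_chunk: bool,
                   chunks_available: bool,
                   is_noun: Optional[bool],
                   pos_prob: float,
                   min_search_token_length: int) -> bool:
    """Decide whether a token should be used to look up entities.

    is_noun is None when no POS tag (or no NOUN tag configuration)
    is available for the token. All names here are hypothetical.
    """
    # Tokens outside of chunks are ignored whenever chunks are present.
    if chunks_available and not in_chunk:
        return False
    if is_noun is not None:
        # Confident noun: processable.
        if is_noun and pos_prob >= MIN_POS_TAG_PROB:
            return True
        # Reasonably confident non-noun: not processable.
        if not is_noun and pos_prob >= MIN_POS_TAG_PROB / 2:
            return False
    # Fallback on the minimum token length (assumption: this is also
    # used when the POS tag probability is below the thresholds).
    return len(token) >= min_search_token_length
```

A call such as `is_processable("Obama", True, True, True, 0.9, 3)` takes the confident-noun branch, while a token without POS information is judged by its length alone.
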
Modified: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png?rev=1360538&r1=1360537&r2=1360538&view=diff
==============================================================================
Binary files - no diff available.
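
The token similarity described in the documentation change (length of the longest common prefix divided by the length of the longer token) could look roughly like the sketch below. The function names `token_similarity` and `tokens_match` are hypothetical, and comparing tokens case-sensitively without any normalization is an assumption; the engine itself may treat case differently.

```python
def token_similarity(text_token: str, label_token: str) -> float:
    """Length of the common prefix divided by the length of the longer token."""
    longest = max(len(text_token), len(label_token))
    if longest == 0:
        # Two empty tokens: treated as no similarity (assumption).
        return 0.0
    prefix = 0
    for a, b in zip(text_token, label_token):
        if a != b:
            break
        prefix += 1
    return prefix / longest


def tokens_match(text_token: str, label_token: str,
                 min_token_match_factor: float) -> bool:
    """Tokens match if their similarity reaches the configured minimum."""
    return token_similarity(text_token, label_token) >= min_token_match_factor
```

For 'German' and 'Germany' the shared prefix has six characters and the longer token has seven, so the similarity is 6/7. With a hypothetical minimum token match factor of 0.7 the two tokens would match, and the similarity below 1 would lower the confidence of the suggestion.
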
