Author: buildbot
Date: Thu Jul 12 06:21:34 2012
New Revision: 825542

Log:
Staging update by buildbot for stanbol

Modified:
    websites/staging/stanbol/trunk/content/   (props changed)
    
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
    
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png

Propchange: websites/staging/stanbol/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Jul 12 06:21:34 2012
@@ -1 +1 @@
-1360368
+1360538

Modified: 
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
==============================================================================
--- 
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
 (original)
+++ 
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
 Thu Jul 12 06:21:34 2012
@@ -97,6 +97,7 @@
 <li><strong>Type Field</strong> 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.typeField)</em>: 
Values of this field are used as values of the "fise:entity-types" property of 
created "<a 
href="../enhancementstructure.html#fiseentityannotation">fise:EntityAnnotation</a>"s.
 The default is "rdf:type".</li>
 <li><strong>Redirect Field</strong> 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.redirectField)</em> 
and <strong>Redirect Mode</strong> 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.redirectMode)</em>: 
Redirects tell the KeywordLinkingEngine to follow a specific property 
in the knowledge base for matched entities. This feature allows, e.g., following 
redirects from "USA" to "United States" as defined in Wikipedia. See 
"Processing of Entity Suggestions" for details. Possible values for the 
Redirect Mode are "IGNORE" - deactivates this feature; "ADD_VALUES" - uses 
label and type information of redirected entities, but keeps the URI of the 
extracted entity; "FOLLOW" - follows the redirect.</li>
 <li><strong>Min Token Length</strong> 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.minSearchTokenLength)</em>:
 While the KeywordLinkingEngine preferably uses POS (part-of-speech) taggers to 
determine whether a word should be matched with the controlled vocabulary, the minimum 
token length provides a fallback if (a) no POS tagger is available for the 
language of the parsed text or (b) the confidence of the POS tagger is lower 
than the threshold.</li>
+<li><strong>Minimum Token Match Factor</strong> 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.minTokenMatchFactor)</em>:
 When a Token of the text is compared with a Token of an Entity Label, the 
similarity of the two is expressed in the range [0..1]. The minimum token 
match factor specifies the minimum similarity two Tokens need so that they are 
considered to match. Lower similarity scores are not considered a match. This 
parameter is important as it e.g. allows inflected forms of words to match. 
However, it may also result in false positives for similar words. Users should 
note that the similarity score is also used for calculating the confidence, so 
similarity scores &lt; 1 but higher than the configured minimum token match 
factor will reduce the confidence of suggested Entities.</li>
 <li><strong>Keyword Tokenizer</strong> 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.keywordTokenizer)</em>:
 This allows the use of a special Tokenizer for matching keywords and alphanumeric 
IDs. Typical language-specific Tokenizers tend to split such IDs into several 
tokens and therefore might prevent correct matching. This Tokenizer should 
only be activated if the KeywordLinkingEngine is configured to match against 
IDs such as ISBN numbers or product IDs. It should not be used to match against 
natural language labels. </li>
 <li><strong>Suggestions</strong> 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.maxSuggestions)</em>:
 The maximum number of suggested Entities.</li>
 <li><strong>Languages</strong> 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages)</em>
 and <strong>Default Matching Language</strong> 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.defaultMatchingLanguage)</em>:
 The first allows specifying the languages that should be processed by this engine. 
This is e.g. useful if the controlled vocabulary only contains labels in a 
specific language but does not formally specify this information (by setting 
the "xml:lang" property for labels). The default matching language can be used 
to work around the exact opposite case. As an example, in DBpedia labels get 
the language of the dataset they are extracted from (e.g. all data extracted 
from en.wikipedia.org will get "xml:lang=en"). The default matching language 
tells the KeywordLinkingEngine to use labels of that language for 
matching regardless of the language of the parsed content. In the case of 
DBpedia this allows e.g. matching persons 
mentioned in an Italian text with the English labels extracted from 
en.wikipedia.org. Details about natural language processing features used by 
this engine are provided in the section "Multiple Language Support".</li>
@@ -161,9 +162,20 @@ The following list provides a short over
 <li><strong>Token:</strong> The currently processed word part of the chunk and 
the sentence.</li>
 <li><strong>TokenIndex:</strong> The index of the currently active token 
relative to the AnalysedSentence.</li>
 </ul>
-<p>The ProcessingState provides means to navigate to the next token. If chunks 
are present tokens that are outside of chunks are ignored.</p>
+<p>Processing is done based on Tokens (words). The ProcessingState provides 
means to navigate to the next token. If Chunks are present, tokens that are 
outside of Chunks are ignored. Only 'processable' tokens are used to 
look up entities (see the next section for details). Whether a Token is processable 
is determined as follows:</p>
+<ul>
+<li>Only Tokens within a Chunk are considered. If no Chunks are available, all 
Tokens are considered.</li>
+<li>If POS tags are available AND POS tags considered as NOUNS are configured 
(see <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTagsCollectionEnum.java">PosTagsCollectionEnum</a>),
 then POS tags are used for deciding whether a Token is processable:<ul>
+<li>The minimum POS tag probability is <code>0.667</code></li>
+<li>Tokens with a POS tag representing a NOUN and a probability &gt;= 
minPosTagProb are marked as processable</li>
+<li>Tokens with a POS tag NOT representing a NOUN and a probability &gt;= 
minPosTagProb/2 are marked as NOT processable</li>
+</ul>
+</li>
+<li>If POS tags are NOT available or the NOUN POS tags configuration is 
missing, the minimum token length 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.minSearchTokenLength)</em>
 is used as a fallback. This means that all Tokens of this length or longer 
are marked as processable.</li>
+</ul>
+<p>This algorithm was introduced by <a 
href="https://issues.apache.org/jira/browse/STANBOL-685">STANBOL-658</a></p>
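The decision rules listed above can be sketched as a small function. This is only an illustration, not the engine's actual code: the function name, its parameters and the default minimum token length of 3 are assumptions made for the example. Only the `0.667` probability and the two thresholds come from the text above. Falling back to the token length when the POS tagger's confidence is too low follows the description of the "Min Token Length" option.

```python
from typing import Optional

# Minimum POS tag probability as stated in the documentation above.
MIN_POS_TAG_PROB = 0.667


def is_processable(token: str,
                   in_chunk: bool,
                   chunks_available: bool,
                   pos_is_noun: Optional[bool],
                   pos_prob: float = 0.0,
                   min_token_length: int = 3,  # assumed value, not from the source
                   min_pos_prob: float = MIN_POS_TAG_PROB) -> bool:
    """Hypothetical sketch of the 'processable token' decision.

    pos_is_noun is None when no POS tags are available or no NOUN
    tag configuration exists for the language.
    """
    # Tokens outside of chunks are ignored if chunks are present.
    if chunks_available and not in_chunk:
        return False
    if pos_is_noun is not None:
        # Confident noun: processable.
        if pos_is_noun and pos_prob >= min_pos_prob:
            return True
        # Reasonably confident non-noun: not processable.
        if not pos_is_noun and pos_prob >= min_pos_prob / 2:
            return False
    # No usable POS information: fall back to the minimum token length.
    return len(token) >= min_token_length
```

With these assumptions, a noun tagged with probability 0.9 inside a chunk is processable, while a non-noun such as "of" tagged with the same probability is not.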
 <h3 id="entity-lookup">Entity Lookup</h3>
-<p>A "OR" query with [1..MAX_SEARCH_TOKENS] tokens is used to lookup entities 
via the <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntitySearcher.java">EntitySearcher</a>
 interface. If the actual implementation cut off results, than it must be 
ensured that Entities that match both tokens are ranked first.
+<p>An "OR" query with [1..MAX_SEARCH_TOKENS] processable tokens is used to 
look up entities via the <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntitySearcher.java">EntitySearcher</a>
 interface. If the actual implementation cuts off results, then it must 
ensure that Entities that match both tokens are ranked first.
 Currently there are two implementations of this interface: (1) for the 
Entityhub (<a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntityhubSearcher.java">EntityhubSearcher</a>)
 and (2) for ReferencedSites (<a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/ReferencedSiteSearcher.java">ReferencedSiteSearcher</a>).
 There is also an <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/test/java/org/apache/stanbol/enhancer/engines/keywordextraction/impl/TestSearcherImpl.java">Implementation</a>
 that holds entities in-memory, however currently this is only used for unit 
tests.</p>
 <p>Queries do use the configured <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getNameField()
 and the language of labels is restricted to the current language or labels 
that do not define any language.</p>
 <p>Only "processable" tokens are used to look up entities. Whether a token is 
processable is determined as follows:</p>
@@ -176,15 +188,20 @@ Currently there are two implementations 
 <p>All labels (values of the <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getNameField()
 field) in the language of the content or without any defined language are 
candidates for matches.</p>
 <p>For each label that fulfills the above criteria the following steps are 
processed. The best result is used as the result of the whole matching 
process:</p>
 <ul>
-<li>All tokens (of the text) following the current position are searched 
within the label.</li>
-<li>As of now, tokens MUST appear in the correct order within a label (e.g. 
"Murdoch Rupert" will NOT match "Rupert Murdoch")</li>
-<li>On the first processable token of the text that is not present within the 
label matching is canceled. (see the definition of processable token above)</li>
-<li>On the second non-processable token not found in the label the matching is 
also canceled (e.g. "University of Michigan" will match "University 
Michigan")</li>
+<li>Tokens (of the text) following the current position are searched within 
the label. This also includes non-processable Tokens. <ul>
+<li>Processable Tokens MUST match Tokens in the Label. At most 
<a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getMaxNotFound()
 non-processable Tokens may fail to match.</li>
+<li>Token order is important. Tokens in the Entity Label are allowed to be 
skipped (e.g. the text 'Barack Obama' will match the label 'Barack Hussein 
Obama' because 'Hussein' is allowed to be skipped; the other way around there would 
be no match because processable Tokens in the Text are not allowed to be 
skipped).</li>
+</ul>
+</li>
+<li>If the first Token of the Label is not matched, preceding Tokens of the 
Text are matched against the Label. This is done to ensure that Entities that 
use adjectives in their labels (e.g. "great improvement", "Gute 
Deutschkenntnisse") are matched. In addition, this also helps to match named 
entities (e.g. person names), as the first token of such mentions is sometimes 
erroneously classified as an adjective by POS taggers.</li>
+<li>Tokens that appear in the wrong order (e.g. the text 'Obama, Barack' with 
the label 'Barack Obama') are matched with a factor of <code>0.7</code>. 
Currently only exact matches are considered.</li>
 </ul>
+<p>Whether two tokens match is calculated by dividing the length of the longest matching part, 
counted from the beginning of the Tokens, by the maximum length of the two tokens. E.g. 
'German' would match 'Germany' with <code>6/7=0.86</code>. The result of 
this comparison is the token similarity. If this similarity is greater than or equal 
to the configured minimum token match factor 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.minTokenMatchFactor)</em>
 then those tokens are considered to match. The token similarity is also used 
for calculating the confidence.<br />
+</p>
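As an illustration of the token similarity described above, the following sketch computes the length of the common prefix divided by the length of the longer token. The function names are hypothetical and the comparison is case-sensitive here, which is an assumption; the text does not say how the engine handles case.

```python
def token_similarity(text_token: str, label_token: str) -> float:
    """Common-prefix length divided by the length of the longer token."""
    if not text_token or not label_token:
        return 0.0
    prefix = 0
    # Count matching characters from the beginning of both tokens.
    for a, b in zip(text_token, label_token):
        if a != b:
            break
        prefix += 1
    return prefix / max(len(text_token), len(label_token))


def tokens_match(text_token: str, label_token: str,
                 min_token_match_factor: float) -> bool:
    """Tokens match if the similarity reaches the configured minimum."""
    return token_similarity(text_token, label_token) >= min_token_match_factor
```

For 'German' and 'Germany' the shared prefix has 6 characters and the longer token has 7, so the two tokens match with a factor of 0.8 but not with a factor of 0.9.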
 <p>Entities are <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/Suggestion.java">Suggested</a>
 if:</p>
 <ul>
-<li>a label does match exactly with the text following the current position it 
the entity is suggested. (e.g. <a 
href="http://en.wikipedia.org/wiki/Passerine">Passerine</a>)</li>
-<li>a label matches at least <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getMinFoundTokens()
 (default=2) are matching with the text. This ensures that "<a 
href="http://en.wikipedia.org/wiki/Rupert_Murdoch">Rupert Murdoch</a>" is not 
suggested for "<a href="http://en.wikipedia.org/wiki/Rupert">Rupert</a>" but on 
the other hand "Barack Hussein Obama" is suggested for "Barack Obama". Setting 
"minFoundToken" to values less than two will usually cause a lot of false 
positives, but would also come up with a suggestion for "Barack Obama" if the 
content contains the word "Obama".</li>
+<li>a label matches exactly at the current position in the text, that is, 
all tokens of the Label match Tokens of the text. Note that tokens 
are considered to match if their similarity is greater than or equal to the minimum 
token match factor.</li>
+<li>partial matches are considered if at least <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getMinFoundTokens()
 (default=2) processable tokens match. Non-processable tokens are not 
counted for this. This ensures that "<a 
href="http://en.wikipedia.org/wiki/Rupert_Murdoch">Rupert Murdoch</a>" is not 
suggested for "<a href="http://en.wikipedia.org/wiki/Rupert">Rupert</a>" but on 
the other hand "Barack Hussein Obama" is suggested for "Barack Obama".</li>
 </ul>
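The two suggestion rules can be summarized in a short sketch. The function and its parameters are hypothetical and only illustrate the rules; the threshold is read as "at least" minFoundTokens, because the 'Barack Obama' example (two matching processable tokens with the default of 2) requires that reading.

```python
def should_suggest(label_token_count: int,
                   matched_label_tokens: int,
                   matched_processable_tokens: int,
                   min_found_tokens: int = 2) -> bool:
    """Hypothetical sketch: decide whether an entity label is suggested."""
    # Exact match: every token of the label matched a token of the text.
    exact = matched_label_tokens == label_token_count
    # Partial match: enough processable tokens of the text matched.
    partial = matched_processable_tokens >= min_found_tokens
    return exact or partial
```

Under these assumptions, the text "Rupert" does not suggest the label "Rupert Murdoch" (1 of 2 label tokens, 1 processable token), while the text "Barack Obama" does suggest "Barack Hussein Obama" (2 processable tokens matched).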
 <p>The described matching process is currently a direct part of the <a 
href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinker.java">EntityLinker</a>.
 To support different matching strategies this would need to be externalized 
into a separate "EntityLabelMatcher" interface.</p>
 <h3 id="processing-of-entity-suggestions">Processing of Entity Suggestions</h3>

Modified: 
websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png
==============================================================================
Binary files - no diff available.

