entitylinking.mdtext

rwesten Sun, 09 Jun 2013 22:30:42 -0700

Author: rwesten
Date: Mon Jun 10 05:29:06 2013
New Revision: 1491336

URL: http://svn.apache.org/r1491336
Log:
STANBOL-1100: changed all mentions of the 'prop' property to 'prob'. 
STANBOL-1070: Added Documentation for the LinkingStateAware extension point. 
Also fixed some wrong property names


Modified:
    
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext

Modified: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext?rev=1491336&r1=1491335&r2=1491336&view=diff
==============================================================================
--- 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
 (original)
+++ 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
 Mon Jun 10 05:29:06 2013
@@ -20,7 +20,7 @@ The Linking Process consists of three ma
 
 ### Token Types
 
-The KeywordLinkingEngine operates based on tokens (words). Those tokens are 
divided in the following Categories
+The EntityLinkingEngine operates based on tokens (words). Those tokens are 
divided in the following Categories
 
 * __Linkable Tokens__: These are words that are linked with the Vocabulary. 
This means that the engine will issue queries in the controlled vocabulary for 
those tokens
 * __Matchable Tokens__: Matchable tokens are used to refine queries. For the 
matching of entity labels with the text those words are treated in the same way 
as linkable words. So the main difference is that matchable words alone will 
not cause the engine to query for Entities in the Controlled Vocabulary.
@@ -38,7 +38,7 @@ In addition to the token type the engine
 
 ### Consumed NLP Processing Results:
 
-The KeywordLinkingEngine consumes NLP processing results from the AnalyzedText 
ContentPart of the processed ContentItem. The following list describes the 
consumed information and their usage in the linking process: 
+The EntityLinkingEngine consumes NLP processing results from the AnalyzedText 
ContentPart of the processed ContentItem. The following list describes the 
consumed information and their usage in the linking process: 
 
 1. __Language__ _(required)_: The Language of the Text is acquired from the 
Metadata of the ContentItem. It is required to search for labels in the correct 
language and also to correctly apply language specific configurations of the 
engine.
 2. __Sentences__ _(optional)_: Sentence annotations are used as segments for 
the matching process. In addition, for the first word of a Sentence the _Upper 
Case_ feature is NOT set. If no Sentence Annotations are present, 
the whole text is treated as a single Sentence.
@@ -128,7 +128,7 @@ This specifies that all Languages other 
 
 Values MUST BE passed as Array or Vector. This is done by using the 
["elem1","elem2",...] syntax as defined by OSGI ".config" files. The following 
example shows the two above examples combined to a single configuration.
 
-    
org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages=["!fr","!it","de","en","*"]
+    enhancer.engines.linking.processedLanguages=["!fr","!it","de","en","*"]
 
 
 __2. Language specific Parameter Configuration__
@@ -141,7 +141,7 @@ In addition to specifying the processed 
 
 The first line sets the parameter for {language}. The 2nd and 3rd line show 
that either the wildcard language '*' or the empty language '' can be used to 
configure parameters that are used as defaults for all languages. 
 
-The following param-names are supported by the KeywordLinkingEngine
+The following param-names are supported by the EntityLinkingEngine
 
 __Phrase level Parameters:__
 
@@ -162,20 +162,20 @@ NOTE: that tokens are linked if any of "
 
 __Examples:__
 
-The default configuration for the KeywordLinkingEngine uses the following 
setting
+The default configuration for the EntityLinkingEngine uses the following 
setting
 
-    *;lmmtip;uc=LINK;prop=0.75;pprob=0.75
+    *;lmmtip;uc=LINK;prob=0.75;pprob=0.75
     de;uc=MATCH
     es;lc=Noun
     nl;lc=Noun
 
 The first line enables _Link Multiple Matchable Tokens in Phrases_ and linking 
of upper case tokens for all languages. In addition it sets the minimum 
probabilities for Pos- and Phrase annotations to 0.75 (which is also the 
default). The following three lines provide additional language specific 
defaults. For German the upper case mode is reset to MATCH as in German all 
Nouns use upper case. For Spanish and Dutch linking for the LexicalCategory Noun 
is enabled. This is because the OpenNLP POS tagger for those languages does not 
support ProperNouns and therefore the Engine would not link any tokens if 
_Link ProperNouns only_ is enabled. The same configuration in the OSGI 
'.config' file syntax looks as follows
 
-    
org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages=["*;lmmtip;uc\=LINK;prop\=0.75;pprob\=0.75","de;uc\=MATCH","es;lc\=Noun","nl;lc\=Noun"]
+    
enhancer.engines.linking.processedLanguages=["*;lmmtip;uc\=LINK;prob\=0.75;pprob\=0.75","de;uc\=MATCH","es;lc\=Noun","nl;lc\=Noun"]
 
 The 2nd example shows how to define default settings without using the 
wildcard '*' that would enable processing of all languages. The following 
example shows a configuration that only enables English and ignores text in 
all other languages.
 
-    ;lmmtip;uc=LINK;prop=0.75;pprob=0.75
+    ;lmmtip;uc=LINK;prob=0.75;pprob=0.75
     en
     de;uc=MATCH
 
@@ -187,7 +187,7 @@ This configuration allows to configure t
 * __Label Field__ _(enhancer.engines.linking.labelField)_: The name of the 
field/property used to link (search and match) Entities. Only a single field is 
supported for performance reasons.
 * __Case Sensitivity__ _(enhancer.engines.linking.caseSensitive)_: Boolean 
switch that allows to activate/deactivate case sensitive matching. It is 
important to understand that even with case sensitivity activated an Entity 
with the label such as "Anaconda" will be suggested for the mention of 
"anaconda" in the text. The main difference will be the confidence value of 
such a suggestion as with case sensitivity activated the starting letters "A" 
and "a" are NOT considered to be matching. See the second technical part for 
details about the matching process. Case Sensitivity is deactivated by default. 
It is recommended to be activated if controlled vocabularies contain 
abbreviations similar to commonly used words e.g. CAN for Canada.
 * __Type Field__ _(enhancer.engines.linking.typeField)_: Values of this field 
are used as values of the "fise:entity-types" property of created 
"[fise:EntityAnnotation](../enhancementstructure.html#fiseentityannotation)"s. 
The default is "rdf:type". _NOTE_ that in contrast to the 
[NamedEntityLinking](namedentityextractionengine) the types are not used for 
the linking process. They are only used while writing the 
'fise:EntityAnnotation's and to determine the 'dc:type' values of 
'fise:TextAnnotation's.
-* __Type Mappings__ _(enhancer.engines.linking.typeMappings)_: The FISE 
enhancement structure (as used by the Stanbol Enhancer) distinguishes 
[TextAnnotation](../enhancementstructure.html#fisetextannotation) and 
[EntityAnnotation](../enhancementstructure.html#fiseentityannotation)s. The 
Keyword linking engine needs to create both types of Annotations: 
TextAnnotations selecting the words that match some Entities in the Controlled 
Vocabulary and EntityAnnotations that represent an Entity suggested for a 
TextAnnotation. The Type Mappings are used to determine the "dc:type" of the 
TextAnnotation based on the types of the suggested Entity. The default 
configuration comes with mappings for Persons, Organizations, Places and 
Concepts but this fields allows to define additional mappings. For details 
about the syntax see the sub-section "Type Mapping Syntax" below.
+* __Type Mappings__ _(enhancer.engines.linking.typeMappings)_: The FISE 
enhancement structure (as used by the Stanbol Enhancer) distinguishes 
[TextAnnotation](../enhancementstructure.html#fisetextannotation) and 
[EntityAnnotation](../enhancementstructure.html#fiseentityannotation)s. The 
EntityLinkingEngine needs to create both types of Annotations: TextAnnotations 
selecting the words that match some Entities in the Controlled Vocabulary and 
EntityAnnotations that represent an Entity suggested for a TextAnnotation. The 
Type Mappings are used to determine the "dc:type" of the TextAnnotation based 
on the types of the suggested Entity. The default configuration comes with 
mappings for Persons, Organizations, Places and Concepts but this field allows 
defining additional mappings. For details about the syntax see the sub-section 
"Type Mapping Syntax" below.
 * __Redirect Field__ _(enhancer.engines.linking.redirectField)_ and __Redirect 
Mode__ _(enhancer.engines.linking.redirectMode)_: Redirects allow the engine to 
follow links to other entities defined in the vocabulary linked against. This is 
useful in cases where matched Entities are not equal to the Entities that 
users want to suggest. A good example is [DBpedia](http://dbpedia.org) where 
the Entity 'dbpedia:USA' defines only the label "USA" and a redirect to the 
Entity 'dbpedia:United_States' with all the information. The _Redirect Mode_ 
defines how redirects are handled: with "IGNORE" redirects are ignored; 
"ADD_VALUES" causes 
information of the redirected entity ('dbpedia:United_States') to be added to 
the matched one ('dbpedia:USA'); "FOLLOW" will suggest the redirected Entity 
('dbpedia:United_States') instead of the matched one ('dbpedia:USA'). The 
_Redirect Field_ defines the field/property used for redirects.
 * __Suggestions__ _(enhancer.engines.linking.suggestions)_: The maximum number 
of suggestions. The default value for this is '3'. If the engine is used in 
combination with a post-processing engine (e.g. disambiguation) users 
might want to increase this value.
 
@@ -220,7 +220,7 @@ The parameters below are used to configu
 
 #### Type Mappings Syntax
 
-The Type Mappings are used to determine the "dc:type" of the 
[TextAnnotation](../enhancementstructure.html#fisetextannotation) based on the 
types of the suggested Entity. The field "Type Mappings" (property: 
_org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings_) can be 
used to customize such mappings.
+The Type Mappings are used to determine the "dc:type" of the 
[TextAnnotation](../enhancementstructure.html#fisetextannotation) based on the 
types of the suggested Entity. The field "Type Mappings" (property: 
_enhancer.engines.linking.typeMappings_) can be used to customize such mappings.
 
 This field uses the following syntax
 
@@ -242,7 +242,7 @@ Some Examples of additional Mappings for
 
 The first two lines map some well known Classes that represent drugs and 
diseases to 'drugbank:drugs' and 'diseasome:diseases'. The third and fourth 
line define 1:1 mappings for side effects and ingredients and the last line 
adds 'dailymed:organization' as an additional mapping to DBpedia Ontology 
Organisation.
 
-The following mappings are predefined by the KeywordLinkingEngine.
+The following mappings are predefined by the EntityLinkingEngine.
 
     dbp-ont:Person; foaf:Person; schema:Person > dbp-ont:Person
     dbp-ont:Organisation; dbp-ont:Newspaper; schema:Organization > 
dbp-ont:Organisation
@@ -251,7 +251,7 @@ The following mappings are predefined by
 
 ## Extension Points
 
-This section describes Interfaces that are used as Extension Points by the 
KeywordLinkingEngine
+This section describes Interfaces that are used as Extension Points by the 
EntityLinkingEngine
 
 ### EntitySearcher
 
@@ -273,11 +273,11 @@ This method is used for searching entiti
 
 The [EntityhubLinkingEngine](entityhublinking) includes EntitySearcher 
implementations based on the FieldQuery search interface implemented by the 
Stanbol Entityhub.
 
-Currently the StanbolEntityhub based implementations are instantiated based on 
the value of the 
_'org.apache.stanbol.enhancer.engines.keywordextraction.referencedSiteId'_. 
Users that want to use a different implementation of this Interface to be used 
for linking will need to extend the KeywordLinkingEngine and override the 
#activateEntitySearcher(ComponentContext context, Dictionary<String,Object> 
configuration) and #deactivateEntitySearcher(). Those methods are called during 
activation/deactivation of the KeywordLinkingEngine and are expected to 
set/unset the #entitySearcher field.
+Currently the StanbolEntityhub based implementations are instantiated based on 
the value of the _'enhancer.engines.linking.entityhub.siteId'_. Users that want 
to use a different implementation of this Interface to be used for linking will 
need to extend the EntityLinkingEngine and override the 
#activateEntitySearcher(ComponentContext context, Dictionary<String,Object> 
configuration) and #deactivateEntitySearcher(). Those methods are called during 
activation/deactivation of the EntityLinkingEngine and are expected to 
set/unset the #entitySearcher field.
 
 ### LabelTokenizer
 
-The LabelTokenizer interface is used to tokenize labels of Entity suggestions 
as returned by the [EntitySearcer](#entitysearcher). As the matching process of 
the KeywordLinkingEngine is based on Tokens (words) multi-word labels (e.g. 
Univerity of Munich) need to be tokenized before they can be matched against 
the current context in the Text.
+The LabelTokenizer interface is used to tokenize labels of Entity suggestions 
as returned by the [EntitySearcher](#entitysearcher). As the matching process of 
the EntityLinkingEngine is based on Tokens (words), multi-word labels (e.g. 
University of Munich) need to be tokenized before they can be matched against 
the current context in the Text.
 
 The _LabelTokenizer_ interface defines only the single _tokenize(String label, 
String language)::String[]_ method that gets the label and the language as 
parameters and returns the tokens as a String array. If the tokenizer is not 
able to tokenize the label (e.g. because it does not support the language) it 
MUST return NULL. In this case the NamedEntityLinking engine will try to match 
the label as a single token.
 
@@ -324,3 +324,70 @@ This _LabelTokenizer_ supports the confi
 
 Internally the OpenNLP service is used to load tokenizer models for languages. 
That means that tokenizer models are loaded via the DataFileProvider 
infrastructure. For users this means that custom tokenizer models are loaded 
from the Stanbol 
Datafiles directory ({stanbol-working-dir}/stanbol/datafiles).
 
+### LinkingStateAware
+
+Added with [STANBOL-1070](https://issues.apache.org/jira/browse/STANBOL-1070), 
this interface allows receiving callbacks about the processing state of the 
entity linking process. It defines methods for start/end section as 
well as start/end token. Both the start and the end methods are passed the active 
Span as parameter. An instance of this interface can be passed to the 
constructor of the EntityLinker implementation.
+
+The typical usage of this extension point is as follows:
+
+    :::java
+    @Reference 
+    protected LabelTokenizer labelTokenizer; 
+
+    private TextProcessingConfig textProcessingConfig;
+    private EntityLinkerConfig linkerConfig;
+
+    private EntitySearcher entitySearcher;
+
+    @Activate
+    @SuppressWarnings("unchecked")
+    protected void activate(ComponentContext ctx) throws 
ConfigurationException {
+        super.activate(ctx);
+        Dictionary<String,Object> properties = ctx.getProperties();
+        //extract TextProcessing and EntityLinking config from the provided 
properties
+        textProcessingConfig = TextProcessingConfig.createInstance(properties);
+        linkerConfig = 
EntityLinkerConfig.createInstance(properties,prefixService);
+
+        //create/init the entitySearcher
+        entitySearcher = new MyEntitySearcher();
+
+        //parse additional properties
+    }
+    
+    public void computeEnhancements(ContentItem ci) throws EngineException {
+        AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
+        String language = NlpEngineHelper.getLanguage(this, ci, true);
+        
+        //create an instance of your LinkingStateAware implementation
+        LinkingStateAware linkingStateAware; //= new YourImpl(..);
+
+        //create one EntityLinker instance per enhancement request
+        EntityLinker entityLinker = new EntityLinker(at,language, 
+            languageConfig, entitySearcher, linkerConfig, 
+            labelTokenizer, linkingStateAware);
+
+        //during processing we will receive callbacks to the 
+        //linkingStateAware instance
+        try {
+            entityLinker.process();
+        } catch (EntitySearcherException e) {
+            log.error("Unable to link Entities with "+entityLinker,e);
+            throw new EngineException(this, ci, "Unable to link Entities with 
"+entityLinker, e);
+        }
+    }
+        
+Note that it is also possible to use a single EntityLinker/LinkingStateAware 
pair to process multiple ContentItems. However, in this case received callbacks 
need to be filtered based on the AnalysedText being the context of the Span 
instances passed to the callback methods.
+
+    :::java
+    @Override
+    public void startToken(Token token) {
+        //process based on the context
+        AnalysedText at = token.getContext();
+        // ...
+    }
+
+In addition, such a usage would require the LinkingStateAware implementation to 
be thread-safe.
+ 
+
+
+
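As a rough illustration of the LinkingStateAware section added in this revision, the following self-contained sketch shows a thread-safe listener that keeps one counter per context, so a single instance could in principle serve several ContentItems. The `Span`, `Section` and `Token` classes and the `LinkingStateAware` interface below are simplified stand-ins written for this example (a plain String replaces the AnalysedText context). The method names follow the start/end section and start/end token description above and are assumptions, not verified Stanbol signatures.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for a Stanbol Span: it only carries its context and covered text.
class Span {
    private final String context; // hypothetical replacement for AnalysedText
    private final String text;

    Span(String context, String text) {
        this.context = context;
        this.text = text;
    }

    String getContext() { return context; }
    String getSpan() { return text; }
}

class Section extends Span {
    Section(String context, String text) { super(context, text); }
}

class Token extends Span {
    Token(String context, String text) { super(context, text); }
}

// Assumed shape of the callback interface (start/end section, start/end token).
interface LinkingStateAware {
    void startSection(Section section);
    void endSection(Section section);
    void startToken(Token token);
    void endToken(Token token);
}

// Counts started tokens per context; safe to share between threads.
class TokenCountingListener implements LinkingStateAware {
    private final ConcurrentMap<String, AtomicInteger> tokensPerContext =
            new ConcurrentHashMap<>();

    @Override public void startSection(Section section) { /* not needed here */ }
    @Override public void endSection(Section section) { /* not needed here */ }

    @Override
    public void startToken(Token token) {
        // filter by context: each context gets its own atomic counter
        tokensPerContext
            .computeIfAbsent(token.getContext(), key -> new AtomicInteger())
            .incrementAndGet();
    }

    @Override public void endToken(Token token) { /* not needed here */ }

    int tokenCount(String context) {
        AtomicInteger count = tokensPerContext.get(context);
        return count == null ? 0 : count.get();
    }
}

class LinkingStateDemo {
    public static void main(String[] args) {
        TokenCountingListener listener = new TokenCountingListener();
        listener.startToken(new Token("doc-1", "University"));
        listener.startToken(new Token("doc-2", "Munich"));
        listener.startToken(new Token("doc-1", "Munich"));
        System.out.println("doc-1 tokens: " + listener.tokenCount("doc-1"));
        System.out.println("doc-2 tokens: " + listener.tokenCount("doc-2"));
    }
}
```

Keying the state by context rather than holding it in plain fields is what lets one listener be reused across requests; an implementation created once per enhancement request would not need the concurrent map.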

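The conversion from the line-based language configuration to the OSGI '.config' array syntax shown in the diff can be sketched as follows. This is a minimal illustration that only mirrors the escaping visible in the example (each '=' inside a value becomes '\='); the class and method names are hypothetical, and other escaping rules of the '.config' format are not covered.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Stanbol.
class ProcessedLanguagesConfig {
    static final String PROPERTY = "enhancer.engines.linking.processedLanguages";

    // Builds one '.config' line: property=["entry1","entry2",...]
    static String toConfigLine(List<String> entries) {
        return entries.stream()
            // escape '=' inside each value, then wrap the value in quotes
            .map(entry -> "\"" + entry.replace("=", "\\=") + "\"")
            .collect(Collectors.joining(",", PROPERTY + "=[", "]"));
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList(
            "*;lmmtip;uc=LINK;prob=0.75;pprob=0.75",
            "de;uc=MATCH",
            "es;lc=Noun",
            "nl;lc=Noun");
        System.out.println(toConfigLine(entries));
    }
}
```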