Author: rwesten
Date: Mon Jun 10 05:29:06 2013
New Revision: 1491336
URL: http://svn.apache.org/r1491336
Log:
STANBOL-1100: changed all mentions of the 'prop' property to 'prob'.
STANBOL-1070: Added Documentation for the LinkingStateAware extension point.
Also fixed some wrong property names
Modified:
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
Modified:
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
URL:
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext?rev=1491336&r1=1491335&r2=1491336&view=diff
==============================================================================
---
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
(original)
+++
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
Mon Jun 10 05:29:06 2013
@@ -20,7 +20,7 @@ The Linking Process consists of three ma
### Token Types
-The KeywordLinkingEngine operates based on tokens (words). Those tokens are
divided in the following Categories
+The EntityLinkingEngine operates based on tokens (words). Those tokens are
divided into the following categories:
* __Linkable Tokens__: These are words that are linked with the Vocabulary.
This means that the engine will issue queries against the controlled vocabulary
for those tokens.
* __Matchable Tokens__: Matchable tokens are used to refine queries. For the
matching of entity labels with the text those words are treated in the same way
as linkable words. The main difference is that matchable words alone will
not cause the engine to query for Entities in the Controlled Vocabulary.
@@ -38,7 +38,7 @@ In addition to the token type the engine
### Consumed NLP Processing Results:
-The KeywordLinkingEngine consumes NLP processing results from the AnalyzedText
ContentPart of the processed ContentItem. The following list describes the
consumed information and their usage in the linking process:
+The EntityLinkingEngine consumes NLP processing results from the AnalyzedText
ContentPart of the processed ContentItem. The following list describes the
consumed information and their usage in the linking process:
1. __Language__ _(required)_: The Language of the Text is acquired from the
Metadata of the ContentItem. It is required to search for labels in the correct
language and also to correctly apply language specific configurations of the
engine.
2. __Sentences__ _(optional)_: Sentence annotations are used as segments for
the matching process. In addition, for the first word of a Sentence the _Upper
Case_ feature is NOT set. In the case that no Sentence Annotations are present
the whole text is treated as a single Sentence.
@@ -128,7 +128,7 @@ This specifies that all Languages other
Values MUST BE passed as Array or Vector. This is done by using the
["elem1","elem2",...] syntax as defined by OSGI ".config" files. The following
example shows the two above examples combined into a single configuration.
-
org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages=["!fr","!it","de","en","*"]
+ enhancer.engines.linking.processedLanguages=["!fr","!it","de","en","*"]
__2. Language specific Parameter Configuration__
@@ -141,7 +141,7 @@ In addition to specifying the processed
The first line sets the parameter for {language}. The 2nd and 3rd line show
that either the wildcard language '*' or the empty language '' can be used to
configure parameters that are used as defaults for all languages.
-The following param-names are supported by the KeywordLinkingEngine
+The following param-names are supported by the EntityLinkingEngine
__Phrase level Parameters:__
@@ -162,20 +162,20 @@ NOTE: that tokens are linked if any of "
__Examples:__
-The default configuration for the KeywordLinkingEngine uses the following
setting
+The default configuration for the EntityLinkingEngine uses the following
setting
- *;lmmtip;uc=LINK;prop=0.75;pprob=0.75
+ *;lmmtip;uc=LINK;prob=0.75;pprob=0.75
de;uc=MATCH
es;lc=Noun
nl;lc=Noun
The first line enables _Link Multiple Matchable Tokens in Phrases_ and linking
of upper case tokens for all languages. In addition it sets the minimum
probabilities for Pos- and Phrase annotations to 0.75 (which is also the
default). The following three lines provide additional language specific
defaults. For German the upper case mode is reset to MATCH as in German all
Nouns use upper case. For Spanish and Dutch linking for the LexicalCategory Noun
is enabled. This is because the OpenNLP POS tagger for those languages does not
support ProperNouns and therefore the Engine would not link any tokens if
_Link ProperNouns only_ is enabled. The same configuration in the OSGI
'.config' file syntax would look as follows:
-
org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages=["*;lmmtip;uc\=LINK;prop\=0.75;pprob\=0.75","de;uc\=MATCH","es;lc\=Noun","nl;lc\=Noun"]
+
enhancer.engines.linking.processedLanguages=["*;lmmtip;uc\=LINK;prob\=0.75;pprob\=0.75","de;uc\=MATCH","es;lc\=Noun","nl;lc\=Noun"]
The 2nd example shows how to define default settings without using the
wildcard '*' that would enable processing of all languages. The following
example shows a configuration that only enables English and ignores text in
all other languages.
- ;lmmtip;uc=LINK;prop=0.75;pprob=0.75
+ ;lmmtip;uc=LINK;prob=0.75;pprob=0.75
en
de;uc=MATCH
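
How one such entry breaks down into a language and its parameters can be sketched in Java. The helper below is a hypothetical illustration and not Stanbol's own parser: the class name `LanguageConfigLine`, storing a value-less parameter such as `lmmtip` as a flag with the value `true`, and reading a leading `!` as an exclusion are all assumptions made for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Stanbol): splits one processedLanguages
// entry such as "*;lmmtip;uc=LINK;prob=0.75;pprob=0.75" into its parts.
final class LanguageConfigLine {
    final String language;            // "*" or "" denote the defaults
    final boolean excluded;           // assumption: a leading '!' excludes the language
    final Map<String, String> params; // parameter name -> raw (unvalidated) value

    private LanguageConfigLine(String language, boolean excluded,
                               Map<String, String> params) {
        this.language = language;
        this.excluded = excluded;
        this.params = params;
    }

    static LanguageConfigLine parse(String entry) {
        String[] parts = entry.trim().split(";");
        // ";".split(";") yields an empty array, so guard the first element
        String lang = parts.length > 0 ? parts[0].trim() : "";
        boolean excluded = lang.startsWith("!");
        if (excluded) {
            lang = lang.substring(1);
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            if (part.isEmpty()) {
                continue; // tolerate ";;"
            }
            int eq = part.indexOf('=');
            if (eq < 0) {
                params.put(part, "true"); // assumption: bare names are boolean flags
            } else {
                params.put(part.substring(0, eq).trim(),
                           part.substring(eq + 1).trim());
            }
        }
        return new LanguageConfigLine(lang, excluded, params);
    }

    public static void main(String[] args) {
        String[] entries = {
            "*;lmmtip;uc=LINK;prob=0.75;pprob=0.75", "de;uc=MATCH", "es;lc=Noun"
        };
        for (String entry : entries) {
            LanguageConfigLine line = parse(entry);
            System.out.println("'" + line.language + "' -> " + line.params);
        }
    }
}
```

The sketch keeps all values as raw strings; validating parameter names and converting probabilities to numbers is left to the caller.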
@@ -187,7 +187,7 @@ This configuration allows to configure t
* __Label Field__ _(enhancer.engines.linking.labelField)_: The name of the
field/property used to link (search and match) Entities. Only a single field is
supported for performance reasons.
* __Case Sensitivity__ _(enhancer.engines.linking.caseSensitive)_: Boolean
switch that allows to activate/deactivate case sensitive matching. It is
important to understand that even with case sensitivity activated an Entity
with the label such as "Anaconda" will be suggested for the mention of
"anaconda" in the text. The main difference will be the confidence value of
such a suggestion as with case sensitivity activated the starting letters "A"
and "a" are NOT considered to be matching. See the second technical part for
details about the matching process. Case Sensitivity is deactivated by default.
It is recommended to be activated if controlled vocabularies contain
abbreviations similar to commonly used words e.g. CAN for Canada.
* __Type Field__ _(enhancer.engines.linking.typeField)_: Values of this field
are used as values of the "fise:entity-types" property of created
"[fise:EntityAnnotation](../enhancementstructure.html#fiseentityannotation)"s.
The default is "rdf:type". _NOTE_ that in contrast to the
[NamedEntityLinking](namedentityextractionengine) the types are not used for
the linking process. They are only used while writing the
'fise:EntityAnnotation's and to determine the 'dc:type' values of
'fise:TextAnnotation's.
-* __Type Mappings__ _(enhancer.engines.linking.typeMappings)_: The FISE
enhancement structure (as used by the Stanbol Enhancer) distinguishes
[TextAnnotation](../enhancementstructure.html#fisetextannotation) and
[EntityAnnotation](../enhancementstructure.html#fiseentityannotation)s. The
Keyword linking engine needs to create both types of Annotations:
TextAnnotations selecting the words that match some Entities in the Controlled
Vocabulary and EntityAnnotations that represent an Entity suggested for a
TextAnnotation. The Type Mappings are used to determine the "dc:type" of the
TextAnnotation based on the types of the suggested Entity. The default
configuration comes with mappings for Persons, Organizations, Places and
Concepts but this fields allows to define additional mappings. For details
about the syntax see the sub-section "Type Mapping Syntax" below.
+* __Type Mappings__ _(enhancer.engines.linking.typeMappings)_: The FISE
enhancement structure (as used by the Stanbol Enhancer) distinguishes
[TextAnnotation](../enhancementstructure.html#fisetextannotation) and
[EntityAnnotation](../enhancementstructure.html#fiseentityannotation)s. The
EntityLinkingEngine needs to create both types of Annotations: TextAnnotations
selecting the words that match some Entities in the Controlled Vocabulary and
EntityAnnotations that represent an Entity suggested for a TextAnnotation. The
Type Mappings are used to determine the "dc:type" of the TextAnnotation based
on the types of the suggested Entity. The default configuration comes with
mappings for Persons, Organizations, Places and Concepts but this field allows
additional mappings to be defined. For details about the syntax see the
sub-section "Type Mappings Syntax" below.
* __Redirect Field__ _(enhancer.engines.linking.redirectField)_ and __Redirect
Mode__ _(enhancer.engines.linking.redirectMode)_: Redirects allow to follow
links to other entities defined in the vocabulary linked against. This is
useful in cases where matched Entities are not equal to the Entities that
users want to suggest. A good example is [DBpedia](http://dbpedia.org) where
the Entity 'dbpedia:USA' defines only the label "USA" and a redirect to the
Entity 'dbpedia:United_States' with all the information. The _Redirect Mode_
can now be used to define if redirects should be "IGNORE"; "ADD_VALUES" causes
information of the redirected entity ('dbpedia:United_States') to be added to
the matched one ('dbpedia:USA'); "FOLLOW" will suggest the redirected Entity
('dbpedia:United_States') instead of the matched one ('dbpedia:USA'). The
_Redirect Field_ defines the field/property used for redirects.
* __Suggestions__ _(enhancer.engines.linking.suggestions)_: The maximum number
of suggestions. The default value for this is '3'. If the engine is used in
combination with a post processing engine (e.g. disambiguation) then users
might want to increase this value.
@@ -220,7 +220,7 @@ The parameters below are used to configu
#### Type Mappings Syntax
-The Type Mappings are used to determine the "dc:type" of the
[TextAnnotation](../enhancementstructure.html#fisetextannotation) based on the
types of the suggested Entity. The field "Type Mappings" (property:
_org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings_) can be
used to customize such mappings.
+The Type Mappings are used to determine the "dc:type" of the
[TextAnnotation](../enhancementstructure.html#fisetextannotation) based on the
types of the suggested Entity. The field "Type Mappings" (property:
_enhancer.engines.linking.typeMappings_) can be used to customize such mappings.
This field uses the following syntax
@@ -242,7 +242,7 @@ Some Examples of additional Mappings for
The first two lines map some well known Classes that represent drugs and
diseases to 'drugbank:drugs' and 'diseasome:diseases'. The third and fourth
line define 1:1 mappings for side effects and ingredients and the last line
adds 'dailymed:organization' as an additional mapping to DBpedia Ontology
Organisation.
-The following mappings are predefined by the KeywordLinkingEngine.
+The following mappings are predefined by the EntityLinkingEngine.
dbp-ont:Person; foaf:Person; schema:Person > dbp-ont:Person
dbp-ont:Organisation; dbp-ont:Newspaper; schema:Organization >
dbp-ont:Organisation
@@ -251,7 +251,7 @@ The following mappings are predefined by
## Extension Points
-This section describes Interfaces that are used as Extension Points by the
KeywordLinkingEngine
+This section describes Interfaces that are used as Extension Points by the
EntityLinkingEngine
### EntitySearcher
@@ -273,11 +273,11 @@ This method is used for searching entiti
The [EntityhubLinkingEngine](entityhublinking) includes EntitySearcher
implementations based on the FieldQuery search interface implemented by the
Stanbol Entityhub.
-Currently the StanbolEntityhub based implementations are instantiated based on
the value of the
_'org.apache.stanbol.enhancer.engines.keywordextraction.referencedSiteId'_.
Users that want to use a different implementation of this Interface to be used
for linking will need to extend the KeywordLinkingEngine and override the
#activateEntitySearcher(ComponentContext context, Dictionary<String,Object>
configuration) and #deactivateEntitySearcher(). Those methods are called during
activation/deactivation of the KeywordLinkingEngine and are expected to
set/unset the #entitySearcher field.
+Currently the StanbolEntityhub based implementations are instantiated based on
the value of the _'enhancer.engines.linking.entityhub.siteId'_. Users that want
to use a different implementation of this Interface to be used for linking will
need to extend the EntityLinkingEngine and override the
#activateEntitySearcher(ComponentContext context, Dictionary<String,Object>
configuration) and #deactivateEntitySearcher(). Those methods are called during
activation/deactivation of the EntityLinkingEngine and are expected to
set/unset the #entitySearcher field.
### LabelTokenizer
-The LabelTokenizer interface is used to tokenize labels of Entity suggestions
as returned by the [EntitySearcer](#entitysearcher). As the matching process of
the KeywordLinkingEngine is based on Tokens (words) multi-word labels (e.g.
Univerity of Munich) need to be tokenized before they can be matched against
the current context in the Text.
+The LabelTokenizer interface is used to tokenize labels of Entity suggestions
as returned by the [EntitySearcher](#entitysearcher). As the matching process of
the EntityLinkingEngine is based on Tokens (words) multi-word labels (e.g.
University of Munich) need to be tokenized before they can be matched against
the current context in the Text.
The _LabelTokenizer_ interface defines only the single _tokenize(String label,
String language)::String[]_ method that gets the label and the language as
parameter and returns the tokens as a String array. If the tokenizer was not
able to tokenize the label (e.g. because it does not support the language) it
MUST return NULL. In this case the NamedEntityLinking engine will try to match
the label as a single token.
@@ -324,3 +324,70 @@ This _LabelTokenizer_ supports the confi
Internally the OpenNLP service is used to load tokenizer models for languages.
That means that tokenizer models are loaded via the DataFileProvider
infrastructure. For users this means that custom tokenizer models are loaded
from the Stanbol Datafiles directory ({stanbol-working-dir}/stanbol/datafiles).
+### LinkingStateAware
+
+Added with [STANBOL-1070](https://issues.apache.org/jira/browse/STANBOL-1070)
this interface allows receiving callbacks about the processing state of the
entity linking process. This interface defines methods for start/end section as
well as start/end token. Both the start and the end methods are passed the
active Span as parameter. An instance of this interface can be passed to the
constructor of the EntityLinker implementation.
+
+The typical usage of this extension point is as follows:
+
+ :::java
+ @Reference
+ protected LabelTokenizer labelTokenizer;
+
+ private TextProcessingConfig textProcessingConfig;
+ private EntityLinkerConfig linkerConfig;
+
+ private EntitySearcher entitySearcher;
+
+ @Activate
+ @SuppressWarnings("unchecked")
+ protected void activate(ComponentContext ctx) throws
ConfigurationException {
+ super.activate(ctx);
+ Dictionary<String,Object> properties = ctx.getProperties();
+ //extract TextProcessing and EntityLinking config from the provided
properties
+ textProcessingConfig = TextProcessingConfig.createInstance(properties);
+ linkerConfig =
EntityLinkerConfig.createInstance(properties,prefixService);
+
+ //create/init the entitySearcher
+ entitySearcher = new MyEntitySearcher();
+
+ //parse additional properties
+ }
+
+ public void computeEnhancements(ContentItem ci) throws EngineException {
+ AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
+ String language = NlpEngineHelper.getLanguage(this, ci, true);
+
+ //create an instance of your LinkingStateAware implementation
+ LinkingStateAware linkingStateAware; //= new YourImpl(..);
+
+ //create one EntityLinker instance per enhancement request
+ EntityLinker entityLinker = new EntityLinker(at,language,
+ languageConfig, entitySearcher, linkerConfig,
+ labelTokenizer, linkingStateAware);
+
+ //during processing we will receive callbacks to the
+ //linkingStateAware instance
+ try {
+ entityLinker.process();
+ } catch (EntitySearcherException e) {
+ log.error("Unable to link Entities with "+entityLinker,e);
+ throw new EngineException(this, ci, "Unable to link Entities with
"+entityLinker, e);
+ }
+ }
+
+Note that it is also possible to use a single EntityLinker/LinkingStateAware
pair to process multiple ContentItems. However in this case received callbacks
need to be filtered based on the AnalysedText that is the context of the Span
instances passed to the callback methods.
+
+ :::java
+ @Override
+ public void startToken(Token token) {
+ //process based on the context
+ AnalysedText at = token.getContext();
+ // ...
+ }
+
+In addition such a usage would require the LinkingStateAware implementation to
be thread safe.
+
+
+
+
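
One way to meet both requirements of a shared callback instance (filtering by context and thread safety) is to key all state by the context object in a concurrent map. The sketch below does not use the Stanbol API: `DemoSpan` and `DemoLinkingStateAware` are hypothetical stand-ins that mirror only the `getContext()` accessor and the `startToken` callback shown in the documentation change, and `PerContextTokenCounter` is an illustrative name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a Span: only the context accessor matters here.
interface DemoSpan {
    Object getContext(); // in Stanbol this would be the AnalysedText
}

// Hypothetical stand-in for the callback interface (one callback shown).
interface DemoLinkingStateAware {
    void startToken(DemoSpan token);
}

// Counts tokens separately for every context, so one instance can be shared
// by several concurrently running linking processes.
final class PerContextTokenCounter implements DemoLinkingStateAware {

    // ConcurrentHashMap + AtomicInteger: no explicit locking required.
    // Note: contexts must be non-null and are compared via equals/hashCode.
    private final Map<Object, AtomicInteger> counts = new ConcurrentHashMap<>();

    @Override
    public void startToken(DemoSpan token) {
        // filter by context: each context only ever touches its own counter
        counts.computeIfAbsent(token.getContext(), k -> new AtomicInteger())
              .incrementAndGet();
    }

    int tokensSeen(Object context) {
        AtomicInteger n = counts.get(context);
        return n == null ? 0 : n.get();
    }

    // Call once a context is fully processed to avoid leaking its state.
    void release(Object context) {
        counts.remove(context);
    }
}
```

Because the per-context state lives in the map rather than in instance fields, callbacks that arrive interleaved from different requests cannot overwrite each other; the price is that the caller has to invoke `release` when a request completes.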