Author: rwesten
Date: Fri Mar 16 08:25:56 2012
New Revision: 1301374

URL: http://svn.apache.org/viewvc?rev=1301374&view=rev
Log:
Formatting improvements, Updated links to the deprecated TaxonomyLinkingEngine

Modified:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext

Modified: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext
URL: http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext?rev=1301374&r1=1301373&r2=1301374&view=diff
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext (original)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext Fri Mar 16 08:25:56 2012
@@ -14,28 +14,28 @@ The example in the scene shows an config
 
 ### Configuration Parameter
 
-* __Name__(stanbol.enhancer.engine.name): The name of the Enhancement Engine. 
This name is used to refer an [EnhancementEngine](index.html) in 
[EnhancementChain](enhancementchain.html)s
-* __Referenced 
Site__(org.apache.stanbol.enhancer.engines.keywordextraction.referencedSiteId): 
The name of the ReferencedSite of the Stanbol Entityhub that holds the 
controlled vocabulary to be used for extracting Entities. "entityhub" or 
"local" can be used to extract Entities managed directly by the Entityhub.
-* __Label 
Field__(org.apache.stanbol.enhancer.engines.keywordextraction.nameField): The 
name of the property used to lookup Entities. Only a single field is supported 
for performance reasons. Users that want to use values of several fields should 
collect such values by an according configuration in the mappings.txt used 
during indexing. This [usage scenario](../../customvocabulary.html) provides 
more information on this.
-* __Case 
Sensitivity__(org.apache.stanbol.enhancer.engines.keywordextraction.caseSensitive):
 This allows to activate/deactivate case sensitive matching. It is important to 
understand that even with case sensitivity activated an Entity with the label 
such as "Anaconda" will be suggested for the mention of "anaconda" in the text. 
The main difference will be the confidence value of such a suggestion as with 
case sensitivity activated the starting letters "A" and "a" are NOT considered 
to be matching. See the second technical part for details about the matching 
process. Case Sensitivity is deactivated by default. It is recommended to be 
activated if controlled vocabularies contain abbreviations similar to commonly 
used words e.g. CAN for Canada.
-* __Type 
Field__(org.apache.stanbol.enhancer.engines.keywordextraction.typeField): 
Values of this field are used as values of the "fise:entity-types" property of 
created "fise:EntityAnnotation"s. The default is "rdf:type".
-* __Redirect 
Field__(org.apache.stanbol.enhancer.engines.keywordextraction.redirectField) 
and __Redirect 
Mode__(org.apache.stanbol.enhancer.engines.keywordextraction.redirectMode): 
Redirects allow to tell the KeywordLinkingEngine to follow a specific property 
in the knowledge base for matched entities. This feature e.g. allows to follow 
redirects from "USA" to "United States" as defined in Wikipedia. See 
"Processing of Entity Suggestions" for details. Possible valued for the 
Redirect Mode are "IGNORE" - deactivates this feature; "ADD_VALUES" - uses 
label, type informations of redirected entities, but keeps the URI of the 
extracted entity; "FOLLOW" - follows the redirect
-* __Min Token 
Length__(org.apache.stanbol.enhancer.engines.keywordextraction.minSearchTokenLength):
 While the KeywordLinkingEngine preferable uses POS (part-of-speach) taggers to 
determine if a word should matched with the controlled vocabulary the minimum 
token length provides a fall back if (a) no POS tagger is available for the 
language of the parsed text or (b) if the confidence of the POS tagger is lower 
than the threshold.
-* __Keyword 
Tokenizer__(org.apache.stanbol.enhancer.engines.keywordextraction.keywordTokenizer):
 This allows to use a special Tokenizer for matching keywords and alpha numeric 
IDs. Typical language specific Tokenizers tend to split such IDs in several 
tokens and therefore might prevent a correct matching. This Tokenizer should 
only be activated if the KeywordLinkingEngine is configured to match against 
IDs like ISBN numbers, Product IDs ... It should not be used to match against 
natural language labels. 
-* 
__Suggestions__(org.apache.stanbol.enhancer.engines.keywordextraction.maxSuggestions):
 The maximum number of suggested Entities.
-* 
__Languages__(org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages)
 and __Default Matching 
Language__(org.apache.stanbol.enhancer.engines.keywordextraction.defaultMatchingLanguage):
 The first allows to specify languages that should be processed by this engine. 
This is e.g. useful if the controlled vocabulary only contains labels in for a 
specific language but does not formally specify this information (by setting 
the "xml:lang" property for labels). The default matching language can be used 
to work around the exact opposite case. As an example in DBpedia labels do get 
the language of the dataset they are extracted from (e.g. all data extracted 
from en.wikipedia.org will get "xml:lang=en"). The default matching language 
allows to tell the KeywordLinkingEngine to use labels of that language for 
matching regardless of the language of the parsed content. In the case of 
DBpedia this allows e.g. to match persons mentioned in an Italian text with the 
english labels extracted from en.wikipedia.org. Details about natural language 
processing features used by this engine are provided in the section "Multiple 
Language Support"
-* __Type 
Mappings__(org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings): 
The FISE enhancement structure (as used by the Stanbol Enhancer) distinguishes 
TextAnnotations and EntityAnnotations. The Keyword linking engine needs to 
create both types of Annotations: TextAnnotations selecting the words that 
match some Entities in the Controlled Vocabulary and EntityAnnotations that 
represent an Entity suggested for a TextAnnotation. The Type Mappings are used 
to determine the "dc:type" of the TextAnnotation based on the types of the 
suggested Entity. The default configuration comes with mappings for Persons, 
Organizations, Places and Concepts but this fields allows to define additional 
mappings. For details see the section "Type Mapping Syntax" and "Processing of 
Entity Suggestions".
-* __Dereference 
Entities__(org.apache.stanbol.enhancer.engines.keywordextraction.dereference): 
If enabled this engine adds additional information about the suggested Entities 
to the Metadata of the enhanced content item.
-* __Ranking__(service.ranking): This property is used of two engines do use 
the same __Name__. In such cases the one with the higher ranking will be used 
to enhance content items. Typically users will not need to change this.
+* __Name__ _(stanbol.enhancer.engine.name)_: The name of the Enhancement 
Engine. This name is used to refer an [EnhancementEngine](index.html) in 
[EnhancementChain](enhancementchain.html)s
+* __Referenced Site__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.referencedSiteId)_: The 
name of the ReferencedSite of the Stanbol Entityhub that holds the controlled 
vocabulary to be used for extracting Entities. "entityhub" or "local" can be 
used to extract Entities managed directly by the Entityhub.
+* __Label Field__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.nameField)_: The name 
of the property used to lookup Entities. Only a single field is supported for 
performance reasons. Users that want to use values of several fields should 
collect such values by an according configuration in the mappings.txt used 
during indexing. This [usage scenario](../../customvocabulary.html) provides 
more information on this.
+* __Case Sensitivity__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.caseSensitive)_: This 
allows to activate/deactivate case sensitive matching. It is important to 
understand that even with case sensitivity activated an Entity with the label 
such as "Anaconda" will be suggested for the mention of "anaconda" in the text. 
The main difference will be the confidence value of such a suggestion as with 
case sensitivity activated the starting letters "A" and "a" are NOT considered 
to be matching. See the second technical part for details about the matching 
process. Case Sensitivity is deactivated by default. It is recommended to be 
activated if controlled vocabularies contain abbreviations similar to commonly 
used words e.g. CAN for Canada.
+* __Type Field__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.typeField)_: Values of 
this field are used as values of the "fise:entity-types" property of created 
"fise:EntityAnnotation"s. The default is "rdf:type".
+* __Redirect Field__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.redirectField)_ and 
__Redirect Mode__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.redirectMode)_: 
Redirects allow to tell the KeywordLinkingEngine to follow a specific property 
in the knowledge base for matched entities. This feature e.g. allows to follow 
redirects from "USA" to "United States" as defined in Wikipedia. See 
"Processing of Entity Suggestions" for details. Possible valued for the 
Redirect Mode are "IGNORE" - deactivates this feature; "ADD_VALUES" - uses 
label, type informations of redirected entities, but keeps the URI of the 
extracted entity; "FOLLOW" - follows the redirect
+* __Min Token Length__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.minSearchTokenLength)_: 
While the KeywordLinkingEngine preferable uses POS (part-of-speach) taggers to 
determine if a word should matched with the controlled vocabulary the minimum 
token length provides a fall back if (a) no POS tagger is available for the 
language of the parsed text or (b) if the confidence of the POS tagger is lower 
than the threshold.
+* __Keyword Tokenizer__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.keywordTokenizer)_: 
This allows to use a special Tokenizer for matching keywords and alpha numeric 
IDs. Typical language specific Tokenizers tend to split such IDs in several 
tokens and therefore might prevent a correct matching. This Tokenizer should 
only be activated if the KeywordLinkingEngine is configured to match against 
IDs like ISBN numbers, Product IDs ... It should not be used to match against 
natural language labels. 
+* __Suggestions__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.maxSuggestions)_: The 
maximum number of suggested Entities.
+* __Languages__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages)_ 
and __Default Matching Language__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.defaultMatchingLanguage)_:
 The first allows to specify languages that should be processed by this engine. 
This is e.g. useful if the controlled vocabulary only contains labels in for a 
specific language but does not formally specify this information (by setting 
the "xml:lang" property for labels). The default matching language can be used 
to work around the exact opposite case. As an example in DBpedia labels do get 
the language of the dataset they are extracted from (e.g. all data extracted 
from en.wikipedia.org will get "xml:lang=en"). The default matching language 
allows to tell the KeywordLinkingEngine to use labels of that language for 
matching regardless of the language of the parsed content. In the case of 
DBpedia this allows e.g. to match persons mentioned in an Italian text with the 
english labels extracted from en.wikipedia.org. Details about natural language 
processing features used by this engine are provided in the section "Multiple 
Language Support"
+* __Type Mappings__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings)_: The 
FISE enhancement structure (as used by the Stanbol Enhancer) distinguishes 
TextAnnotations and EntityAnnotations. The Keyword linking engine needs to 
create both types of Annotations: TextAnnotations selecting the words that 
match some Entities in the Controlled Vocabulary and EntityAnnotations that 
represent an Entity suggested for a TextAnnotation. The Type Mappings are used 
to determine the "dc:type" of the TextAnnotation based on the types of the 
suggested Entity. The default configuration comes with mappings for Persons, 
Organizations, Places and Concepts but this fields allows to define additional 
mappings. For details see the section "Type Mapping Syntax" and "Processing of 
Entity Suggestions".
+* __Dereference Entities__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.dereference)_: If 
enabled this engine adds additional information about the suggested Entities to 
the Metadata of the enhanced content item.
+* __Ranking__ _(service.ranking)_: This property is used of two engines do use 
the same __Name__. In such cases the one with the higher ranking will be used 
to enhance content items. Typically users will not need to change this.
 
 Additionally the following properties can be configured via a configuration 
file:
 
-* __Minimum Found 
Tokens__(org.apache.stanbol.enhancer.engines.keywordextraction.minFoundTokens): 
This allows to tell the KeywordLinking Engine how to deal with Entities that do 
not exactly match words in the text. Typical Examples are "George W. Bush" -> 
"George Walker Bush". This parameter allows the minimum number of tokens that 
need to match. The default value is '2'. Note that this does not apply for 
exact matches. Setting this to a high value can be used to force a mode that 
will only consider entities where all tokens of the label match the mention in 
the text.
-* __Minimum Pos Tag 
Probability__(org.apache.stanbol.enhancer.engines.keywordextraction.minPosTagProbability):
 The minimum probability of a POS (part-of-speech) tag. Tags with a lower 
probability will be ignored. In such cases the configured value for the __Min 
Token Length__ will apply. The value MUST BE in the range [0..1]
+* __Minimum Found Tokens__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.minFoundTokens)_: This 
allows to tell the KeywordLinking Engine how to deal with Entities that do not 
exactly match words in the text. Typical Examples are "George W. Bush" -> 
"George Walker Bush". This parameter allows the minimum number of tokens that 
need to match. The default value is '2'. Note that this does not apply for 
exact matches. Setting this to a high value can be used to force a mode that 
will only consider entities where all tokens of the label match the mention in 
the text.
+* __Minimum Pos Tag Probability__ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.minPosTagProbability)_: 
The minimum probability of a POS (part-of-speech) tag. Tags with a lower 
probability will be ignored. In such cases the configured value for the __Min 
Token Length__ will apply. The value MUST BE in the range [0..1]
 
 ### Type Mappings Syntax
 
-The Type Mappings are used to determine the "dc:type" of the TextAnnotation 
based on the types of the suggested Entity. The field "Type Mappings" 
(property: org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings) 
can be used to customize such mappings.
+The Type Mappings are used to determine the "dc:type" of the TextAnnotation 
based on the types of the suggested Entity. The field "Type Mappings" 
(property: 
_org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings_) can be 
used to customize such mappings.
 
 This field uses the following syntax
 
@@ -151,6 +151,8 @@ Input Parameters
 * span: number of tokens selected by the current suggestion e.g. "Barack 
Hussein Obama" -> 2
 * label_tokens: number of tokens of the matched label of the current entity 
(label_token) e.g. "Barack Hussein Obama" -> 3
 
+The confidence is calculated as follows: 
+
     :::java
     confidence = (match/max_matched)^2 * (matched/span) * (matched/label_tokens)
 
@@ -162,16 +164,17 @@ Some Examples:
 
 The calculation of the confidence is currently direct part of the 
[EntityLinker](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinker.java).
 To support different matching strategies this would need to be externalized 
into an own interface.
 
-## Future Plans for the TaxonomyLinkingEngine ##
-
-The TaxonomyLinkingEngine is still available and fully functional. However it 
is marked as deprecated and not included in any of the launchers. Current users 
are encouraged to switch over to the KeywordLinkingEngine. 
+## Notes about the TaxonomyLinkingEngine ##
 
-In the future it is planed to repurpose the TaxonomyLinkingEngine as a special 
version of the KeywordLinkingEngine with a specialized configuration and 
feature set targeted for (hierarchical) Taxonomies. 
+The KeywordLinkingEngine is a re-implementation of the TaxonomyLinkingEngine 
which is more modular and therefore better suited for future improvements and 
extensions as requested by 
[STANBOL-303](https://issues.apache.org/jira/browse/STANBOL-303). As of 
[STANBOL-506](https://issues.apache.org/jira/browse/STANBOL-506) this engine is 
now deprecated and will be deleted from the SVN.
 
-This will include: 
+<!--
+However there would be now the possibility to implement a new version of an 
TaxonomyLinkingEngine with special support for hierarchical taxonomies. Such an 
engine would feature:
 
-* default configuration specific for SKOS
-* support for term hierarchies - adding suggestions for parent concepts
+* default configuration optimized for SKOS
+* support for term hierarchies - adding suggestions for parent concepts. 
Optionally by using a transitive closure over the hierarchy.
+* support for SKOS matching relations
 * support for restricting enhancements to a specific Taxonomy 
(skos:ConceptScheme) - this would allow to index several taxonomies in the same 
ReferencedSite but still use only a specific one for the enhancements.
 
-
+One Idea would be to allow users to use 
[LDPath](http://code.google.com/p/ldpath/) to configure post processing rules 
applied to extracted concepts of the Taxonomy.
+-->
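
For readers of this diff, the confidence formula quoted in the second hunk can be sketched in Java as below. This is only an illustration and not the code of the EntityLinker: the class and method names are hypothetical, `match` in the formula is assumed to mean the same value as `matched`, and `max_matched` is taken as a plain parameter because its definition lies outside the lines shown here. The sample values (2 matched tokens, a span of 2, a label of 3 tokens) follow the "Barack Hussein Obama" example in the hunk, with `max_matched = 2` as an additional assumption.

```java
// Hypothetical sketch of the documented confidence formula; not Stanbol code.
final class ConfidenceSketch {

    /**
     * confidence = (matched/max_matched)^2 * (matched/span) * (matched/label_tokens)
     *
     * @param matched     number of tokens of the label that matched the text
     * @param maxMatched  assumed: the highest number of matched tokens among all suggestions
     * @param span        number of tokens selected by the current suggestion
     * @param labelTokens number of tokens of the matched label
     */
    static double confidence(int matched, int maxMatched, int span, int labelTokens) {
        if (matched < 0 || maxMatched <= 0 || span <= 0 || labelTokens <= 0) {
            throw new IllegalArgumentException("token counts must be positive");
        }
        // Use floating point division; integer division would truncate to 0 or 1.
        double relative = (double) matched / maxMatched;
        return relative * relative
                * ((double) matched / span)
                * ((double) matched / labelTokens);
    }

    public static void main(String[] args) {
        // Assumed example: 2 tokens matched, span of 2 tokens, label with 3 tokens.
        System.out.println(confidence(2, 2, 2, 3)); // roughly 0.67
    }
}
```

Under these assumptions a suggestion that matches every token of both the span and the label scores 1.0, and each unmatched token in the span or the label lowers the score proportionally.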

