Author: rwesten
Date: Fri Mar 16 08:01:01 2012
New Revision: 1301363
URL: http://svn.apache.org/viewvc?rev=1301363&view=rev
Log:
Added configuration section for the KeywordLinkingEngine
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png
(with props)
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext?rev=1301363&r1=1301362&r2=1301363&view=diff
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext
(original)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext
Fri Mar 16 08:01:01 2012
@@ -1,22 +1,82 @@
Title: The Keyword Linking Engine: custom vocabularies and multiple languages
-The
[KeywordLinkingEngine](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/)
is a re-implementation of the
[TaxonomyLinkingEngine](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/taxonomylinking/)
which is more modular and therefore better suited for future improvements and
extensions as requested by
[STANBOL-303](https://issues.apache.org/jira/browse/STANBOL-303).
+The KeywordLinkingEngine extracts occurrences of Entities that are part of a
Controlled Vocabulary from content passed to the Stanbol Enhancer. To do this,
words appearing within the text are compared with the labels of entities. The
Stanbol Entityhub is used to look up Entities based on their labels.
+
+This documentation first provides information about the configuration options
of this engine. This section is mainly intended for users of this engine. The
remaining part of this document is rather technical and intended to be read by
developers that want to extend this engine or want to know the technical
details.
+
+## Configuration
+
+The KeywordLinkingEngine provides many configuration options. This section
describes the different options based on the configuration dialog as shown by
the Apache Felix Webconsole.
+
+
+
+The example in the screenshot shows a configuration that is used to extract
Drugs based on various IDs (e.g. the ATC code and the InChI key) that are all
stored as values of the skos:notation property. This example is used to
emphasize newer features like case sensitive matching, the keyword tokenizer
and customized type mappings. Similar configurations would also be needed to
extract product IDs, ISBN numbers or, more generally, concepts of a thesaurus
based on their notation.
+
+### Configuration Parameters
+
+* __Name__(stanbol.enhancer.engine.name): The name of the Enhancement Engine.
This name is used to refer to an [EnhancementEngine](index.html) in
[EnhancementChain](enhancementchain.html)s
+* __Referenced
Site__(org.apache.stanbol.enhancer.engines.keywordextraction.referencedSiteId):
The name of the ReferencedSite of the Stanbol Entityhub that holds the
controlled vocabulary to be used for extracting Entities. "entityhub" or
"local" can be used to extract Entities managed directly by the Entityhub.
+* __Label
Field__(org.apache.stanbol.enhancer.engines.keywordextraction.nameField): The
name of the property used to look up Entities. Only a single field is supported
for performance reasons. Users that want to use values of several fields should
collect such values with a corresponding configuration in the mappings.txt used
during indexing. This [usage scenario](../../customvocabulary.html) provides
more information on this.
+* __Case
Sensitivity__(org.apache.stanbol.enhancer.engines.keywordextraction.caseSensitive):
Activates or deactivates case sensitive matching. It is important to
understand that even with case sensitivity activated an Entity with a label
such as "Anaconda" will be suggested for the mention of "anaconda" in the text.
The main difference is the confidence value of such a suggestion, because with
case sensitivity activated the starting letters "A" and "a" are NOT considered
to be matching. See the second, technical part for details about the matching
process. Case Sensitivity is deactivated by default. Activating it is
recommended if the controlled vocabulary contains abbreviations similar to
commonly used words, e.g. CAN for Canada.
+* __Type
Field__(org.apache.stanbol.enhancer.engines.keywordextraction.typeField):
Values of this field are used as values of the "fise:entity-types" property of
created "fise:EntityAnnotation"s. The default is "rdf:type".
+* __Redirect
Field__(org.apache.stanbol.enhancer.engines.keywordextraction.redirectField)
and __Redirect
Mode__(org.apache.stanbol.enhancer.engines.keywordextraction.redirectMode):
Redirects tell the KeywordLinkingEngine to follow a specific property in the
knowledge base for matched entities. This feature allows, for example, to
follow redirects from "USA" to "United States" as defined in Wikipedia. See
"Processing of Entity Suggestions" for details. Possible values for the
Redirect Mode are "IGNORE" - deactivates this feature; "ADD_VALUES" - uses
label and type information of redirected entities, but keeps the URI of the
extracted entity; "FOLLOW" - follows the redirect.
+* __Min Token
Length__(org.apache.stanbol.enhancer.engines.keywordextraction.minSearchTokenLength):
While the KeywordLinkingEngine preferably uses POS (part-of-speech) taggers to
determine if a word should be matched with the controlled vocabulary, the
minimum token length provides a fallback if (a) no POS tagger is available for
the language of the parsed text or (b) the confidence of the POS tagger is
lower than the threshold.
+* __Keyword
Tokenizer__(org.apache.stanbol.enhancer.engines.keywordextraction.keywordTokenizer):
This allows to use a special Tokenizer for matching keywords and alpha numeric
IDs. Typical language specific Tokenizers tend to split such IDs in several
tokens and therefore might prevent a correct matching. This Tokenizer should
only be activated if the KeywordLinkingEngine is configured to match against
IDs like ISBN numbers, Product IDs ... It should not be used to match against
natural language labels.
+*
__Suggestions__(org.apache.stanbol.enhancer.engines.keywordextraction.maxSuggestions):
The maximum number of suggested Entities.
+*
__Languages__(org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages)
and __Default Matching
Language__(org.apache.stanbol.enhancer.engines.keywordextraction.defaultMatchingLanguage):
The first allows to specify the languages that should be processed by this
engine. This is e.g. useful if the controlled vocabulary only contains labels
for a specific language but does not formally specify this information (by
setting the "xml:lang" property for labels). The default matching language can
be used to work around the exact opposite case. As an example, in DBpedia
labels get the language of the dataset they are extracted from (e.g. all data
extracted from en.wikipedia.org will get "xml:lang=en"). The default matching
language tells the KeywordLinkingEngine to use labels of that language for
matching regardless of the language of the parsed content. In the case of
DBpedia this allows e.g. to match persons mentioned in an Italian text with the
English labels extracted from en.wikipedia.org. Details about natural language
processing features used by this engine are provided in the section "Multiple
Language Support".
+* __Type
Mappings__(org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings):
The FISE enhancement structure (as used by the Stanbol Enhancer) distinguishes
TextAnnotations and EntityAnnotations. The KeywordLinkingEngine needs to
create both types of Annotations: TextAnnotations selecting the words that
match some Entities in the Controlled Vocabulary and EntityAnnotations that
represent an Entity suggested for a TextAnnotation. The Type Mappings are used
to determine the "dc:type" of the TextAnnotation based on the types of the
suggested Entity. The default configuration comes with mappings for Persons,
Organizations, Places and Concepts, but this field allows to define additional
mappings. For details see the sections "Type Mappings Syntax" and "Processing
of Entity Suggestions".
+* __Dereference
Entities__(org.apache.stanbol.enhancer.engines.keywordextraction.dereference):
If enabled this engine adds additional information about the suggested Entities
to the Metadata of the enhanced content item.
+* __Ranking__(service.ranking): This property is used if two engines use
the same __Name__. In such cases the one with the higher ranking will be used
to enhance content items. Typically users will not need to change this.
+
+Additionally the following properties can be configured via a configuration
file:
+
+* __Minimum Found
Tokens__(org.apache.stanbol.enhancer.engines.keywordextraction.minFoundTokens):
This tells the KeywordLinkingEngine how to deal with Entities that do not
exactly match words in the text. A typical example is "George W. Bush" ->
"George Walker Bush". This parameter sets the minimum number of tokens that
need to match. The default value is '2'. Note that this does not apply to
exact matches. Setting this to a high value can be used to force a mode that
will only consider entities where all tokens of the label match the mention in
the text.
+* __Minimum Pos Tag
Probability__(org.apache.stanbol.enhancer.engines.keywordextraction.minPosTagProbability):
The minimum probability of a POS (part-of-speech) tag. Tags with a lower
probability will be ignored. In such cases the configured value for the __Min
Token Length__ will apply. The value MUST BE in the range [0..1]
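
The effect of __Minimum Found Tokens__ can be sketched as follows. This is a
simplified illustration, not the matching algorithm of the engine: the
function name `label_matches` is hypothetical, tokens are compared case
insensitively, and token order is ignored.

```python
from typing import List


def label_matches(label_tokens: List[str],
                  mention_tokens: List[str],
                  min_found_tokens: int = 2) -> bool:
    """Return True if a label is accepted for a mention in the text."""
    label = [t.lower() for t in label_tokens]
    mention = [t.lower() for t in mention_tokens]
    if label == mention:
        # Exact matches are always accepted; the minimum does not apply.
        return True
    # Count the label tokens that also occur in the mention.
    found = sum(1 for token in label if token in mention)
    return found >= min_found_tokens
```

For the mention "George W. Bush" the label "George Walker Bush" shares two
tokens and is accepted with the default of 2, but rejected once the minimum
is raised to 3.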
+
+### Type Mappings Syntax
+
+The Type Mappings are used to determine the "dc:type" of the TextAnnotation
based on the types of the suggested Entity. The field "Type Mappings"
(property: org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings)
can be used to customize such mappings.
+
+This field uses the following syntax
+
+ {uri}
+ {source} > {target}
+ {source1}; {source2}; ... {sourceN} > {target}
+
+The first variant is a shorthand for {uri} > {uri} and therefore specifies
that the {uri} should be used as 'dc:type' for TextAnnotations if the matched
entity is of type {uri}. The second variant matches a {source} URI to a
{target}. Variant three shows the possibility to match multiple URIs to the
same target in a single configuration line.
+
+Both 'ns:localName' and fully qualified URIs are supported. For supported
namespaces see the
[NamespaceEnum](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/generic/servicesapi/src/main/java/org/apache/stanbol/entityhub/servicesapi/defaults/NamespaceEnum.java).
Information about accepted (INFO) and ignored (WARN) type mappings is
available in the logs.
+
+Some Examples of additional Mappings for the e-health domain:
+
+ drugbank:drugs; dbp-ont:Drug; dailymed:drugs; sider:drugs; tcm:Medicine >
drugbank:drugs
+ diseasome:diseases; linkedct:condition; tcm:Disease > diseasome:diseases
+ sider:side_effects
+ dailymed:ingredients
+ dailymed:organization > dbp-ont:Organisation
+
+The first two lines map some well known Classes that represent drugs and
diseases to 'drugbank:drugs' and 'diseasome:diseases'. The third and fourth
lines define 1:1 mappings for side effects and ingredients and the last line
adds 'dailymed:organization' as an additional mapping to DBpedia Ontology
Organisation.
+
+The following mappings are predefined by the KeywordLinkingEngine.
+
+ dbp-ont:Person; foaf:Person; schema:Person > dbp-ont:Person
+ dbp-ont:Organisation; dbp-ont:Newspaper; schema:Organization >
dbp-ont:Organisation
+ dbp-ont:Place; schema:Place; gml:_Feature > dbp-ont:Place
+ skos:Concept
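
The three variants of the Type Mappings syntax can be read with a few lines
of code. The following is a minimal sketch, not the parser used by the
engine: the function name `parse_type_mappings` is hypothetical, namespace
prefixes are not expanded, and a line without '>' that lists several URIs is
assumed to map each URI to itself.

```python
def parse_type_mappings(lines):
    """Parse 'source1; source2 > target' lines into a source -> target dict."""
    mappings = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        sources_part, arrow, target_part = line.partition(">")
        sources = [s.strip() for s in sources_part.split(";") if s.strip()]
        target = target_part.strip()
        if arrow and not target:
            raise ValueError("missing target in type mapping: %r" % raw)
        for source in sources:
            # '{uri}' without '>' is a shorthand for '{uri} > {uri}'.
            mappings[source] = target if arrow else source
    return mappings


# The predefined mappings listed above, one configuration line per entry.
PREDEFINED = [
    "dbp-ont:Person; foaf:Person; schema:Person > dbp-ont:Person",
    "dbp-ont:Organisation; dbp-ont:Newspaper; schema:Organization"
    " > dbp-ont:Organisation",
    "dbp-ont:Place; schema:Place; gml:_Feature > dbp-ont:Place",
    "skos:Concept",
]

predefined_mappings = parse_type_mappings(PREDEFINED)
```

Looking up the type of a suggested Entity in the resulting dictionary gives
the value that would be used as 'dc:type' of the TextAnnotation.
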
-Currently the main advantage of using this engine is its ability to support
multiple languages and provide enhancement results specific to custom
vocabulary.
## Multiple Language Support
The KeywordLinkingEngine supports the extraction of keywords in multiple
languages. However, the performance and to some extend also the quality of the
enhancements depend on how well a language is supported by the used NLP
framework (currently OpenNLP).
The following list provides a short overview about the different language
specific component/configurations:
-* **Language detection:** The KeywordLinkingEngine depends on the correct
detection of the language by the LanguageIdentificationEngine. If no language
is detected or this information is missing then "English" is assumed as default.
-* **Multi-lingual labels of the controlled vocabulary:** Entities are matched
based on labels of the current language and labels without any defined
language. e.g. English labels will not be matched against German language
texts. Therefore it is important to have a controlled vocabulary that includes
labels in the language of the texts you want to enhance.
-* **Natural Language Processing support:** The KeywordLinkingEngine is able to
use [Sentence
Detectors](http://opennlp.sourceforge.net/api/opennlp/tools/sentdetect/SentenceDetector.html),
[POS (Part of Speech)
taggers](http://opennlp.sourceforge.net/api/opennlp/tools/postag/POSTagger.html)
and
[Chunkers](http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html).
If such components are available for a language then they are used to optimize
the enhancement process.
-* **Sentence detector:** If a sentence detector is present the memory
footprint of the engines improves, because Tokens, POS tags and Chunks are only
kept for the currently active sentence. If no sentence detector is available
the entire content is treated as a single sentence.
-* **Tokenizer:** A (word)
[tokenizer](http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/Tokenizer.html)
is required for the enhancement process. If no specific tokenizer is available
for a given language, then the [OpenNLP
SimpleTokenizer](http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/SimpleTokenizer.html)
is used as default. How well this tokenizer works will depend on the language.
-* **POS tagger:** POS (Part-of-Speech) taggers annotate tokens with their
type. Because of the KeywordLinkingEngine is only interested in Nouns, Foreign
Words and Numbers, the presence of such a tagger allows to skip a lot of the
tokens and to improve performance. However POS taggers use different sets of
tags for different languages. Because of that it is not enough that a POS
tagger is available for a language there MUST BE also a configuration of the
POS tags representing Nouns.
-* **Chunker:** There are two types of Chunkers. First the
[Chunkers](http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html)
as provided by OpenNLP (based on statistical models) and second a [POS tag
based
Chunker](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTypeChunker.java)
provided by the openNLP bundle of Stanbol. Currently the availability of a
Chunker does not have a big influence on the performance nor the quality of the
Enhancements.
-* **Configuration:** The set of languages to be annotated can be configured
for the KeywordLinkingEngine. An empty configuration indicates that texts in
any language should be processed. By using this configuration it is possible to
configure different KeywordLinkingEngine instances for different languages
(e.g. with different configurations)
+* __Language detection:__ The KeywordLinkingEngine depends on the correct
detection of the language by the LanguageIdentificationEngine. If no language
is detected or this information is missing then "English" is assumed as default.
+* __Multi-lingual labels of the controlled vocabulary:__ Entities are matched
based on labels of the current language and labels without any defined
language. e.g. English labels will not be matched against German language
texts. Therefore it is important to have a controlled vocabulary that includes
labels in the language of the texts you want to enhance.
+* __Natural Language Processing support:__ The KeywordLinkingEngine is able to
use [Sentence
Detectors](http://opennlp.sourceforge.net/api/opennlp/tools/sentdetect/SentenceDetector.html),
[POS (Part of Speech)
taggers](http://opennlp.sourceforge.net/api/opennlp/tools/postag/POSTagger.html)
and
[Chunkers](http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html).
If such components are available for a language then they are used to optimize
the enhancement process.
+* __Sentence detector:__ If a sentence detector is present the memory
footprint of the engines improves, because Tokens, POS tags and Chunks are only
kept for the currently active sentence. If no sentence detector is available
the entire content is treated as a single sentence.
+* __Tokenizer:__ A (word)
[tokenizer](http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/Tokenizer.html)
is required for the enhancement process. If no specific tokenizer is available
for a given language, then the [OpenNLP
SimpleTokenizer](http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/SimpleTokenizer.html)
is used as default. The parameter __Keyword Tokenizer__ can be used to force
the usage of a special Tokenizer that is optimized for matching keywords. This
Tokenizer ensures that alpha-numeric IDs are not split into several tokens, so
that such IDs can be matched correctly. If this option is enabled then any
language specific Tokenizer will be ignored in favor of the KeywordTokenizer.
+* __POS tagger:__ POS (Part-of-Speech) taggers annotate tokens with their
type. Because the KeywordLinkingEngine is only interested in Nouns, Foreign
Words and Numbers, the presence of such a tagger allows to skip a lot of the
tokens and to improve performance. However POS taggers use different sets of
tags for different languages. Because of that it is not enough that a POS
tagger is available for a language; there MUST also be a configuration of the
POS tags representing Nouns.
+* __Chunker:__ There are two types of Chunkers. First the
[Chunkers](http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html)
as provided by OpenNLP (based on statistical models) and second a [POS tag
based
Chunker](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTypeChunker.java)
provided by the openNLP bundle of Stanbol. Currently the availability of a
Chunker does not have a big influence on the performance nor the quality of the
Enhancements.
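
The selection of labels by language can be sketched as follows. This is only
an illustration under assumptions: the function name `candidate_labels` is
hypothetical, labels are given as (text, language) pairs with `None` for
labels without a language, and the default matching language is assumed to
be used in addition to the language of the text.

```python
from typing import List, Optional, Tuple


def candidate_labels(labels: List[Tuple[str, Optional[str]]],
                     text_language: str,
                     default_matching_language: Optional[str] = None
                     ) -> List[str]:
    """Select the labels of an Entity that are used for matching."""
    # Labels in the language of the text and labels without a language.
    accepted = {text_language, None}
    if default_matching_language is not None:
        # Assumption: these labels are considered for any text language.
        accepted.add(default_matching_language)
    return [text for text, language in labels if language in accepted]


rome = [("Rome", "en"), ("Roma", "it"), ("Rom", "de"), ("ROM", None)]
```

For an Italian text only "Roma" and "ROM" are candidates. With the default
matching language set to "en" the English label "Rome" is considered too.
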
## Keyword extraction and linking workflow ##
@@ -24,14 +84,14 @@ Basically the text is parsed from the be
### Text Processing ###
-The
[AnalysedContent](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/AnalysedContent.java)
Interface is used to access natural language text that was already processed
by an NLP framework. Currently there is only a single implementation based on
the commons.opennlp
[TextAnalyzer](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/TextAnalyzer.java)
utility. In general this part is still very focused on OpenNLP. Making it also
usable together with other NLP frameworks would probably need some re-factoring.
+The
[AnalysedContent](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/AnalysedContent.java)
Interface is used to access natural language text that was already processed
by a NLP framework. Currently there is only a single implementation based on
the commons.opennlp
[TextAnalyzer](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/TextAnalyzer.java)
utility. In general this part is still very focused on OpenNLP. Making it also
usable together with other NLP frameworks would probably need some re-factoring.
The current state of the processing is represented by the
[ProcessingState](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/impl/ProcessingState.java).
Based on the capabilities of the NLP framework for the current language it
provides the following set of information:
-* **AnalysedSentence:** If a sentence detector is present, than this represent
the current sentence of the text. If not, then the whole text is represented as
a single sentence. The AnalysedSentence also provides access to POS tags and
Chunks (if available)
-* **Chunk:** If a chunker is present, then this represents the current chunk.
Otherwise this will be null.
-* **Token:** The currently processed word part of the chunk and the sentence.
-* **TokenIndex:** The index of the currently active token relative to the
AnalysedSentence.
+* __AnalysedSentence:__ If a sentence detector is present, then this
represents the current sentence of the text. If not, then the whole text is
represented as a single sentence. The AnalysedSentence also provides access to
POS tags and Chunks (if available)
+* __Chunk:__ If a chunker is present, then this represents the current chunk.
Otherwise this will be null.
+* __Token:__ The currently processed word part of the chunk and the sentence.
+* __TokenIndex:__ The index of the currently active token relative to the
AnalysedSentence.
The ProcessingState provides means to navigate to the next token. If chunks
are present tokens that are outside of chunks are ignored.
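
The navigation rule can be sketched with a small generator. This is not the
ProcessingState class itself: the function name `iter_tokens` is hypothetical
and chunks are assumed to be given as (start, end) token index pairs with an
exclusive end.

```python
from typing import Iterator, List, Optional, Tuple


def iter_tokens(sentence_tokens: List[str],
                chunks: Optional[List[Tuple[int, int]]] = None
                ) -> Iterator[Tuple[int, str]]:
    """Yield (token index, token) pairs of one sentence in text order."""
    for index, token in enumerate(sentence_tokens):
        if chunks is not None:
            # With a chunker, tokens outside of all chunks are ignored.
            inside = any(start <= index < end for start, end in chunks)
            if not inside:
                continue
        yield index, token


sentence = ["The", "IMF", "met", "in", "Paris"]
```

Without chunks all five tokens are visited. With the chunks (0, 2) and (4, 5)
only "The", "IMF" and "Paris" are processed.
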
@@ -74,7 +134,7 @@ In case there are one or more [Suggestio
LinkedEntity is an object model representing the Stanbol Enhancement
Structure. After the processing of the parsed content is completed, the
LinkedEntities are "serialized" as RDF triples to the metadata of the
ContentItem.
TextAnnotations as defined in the [Stanbol Enhancement
Structure](http://wiki.iks-project.eu/index.php/EnhancementStructure) do use
the [dc:type](http://www.dublincore.org/documents/dcmi-terms/#terms-type)
property to provide the general type of the extracted Entity. However suggested
Entities might have very specific types. Therefore the
[EntityLinkerConfig](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java)
provides the possibility to map the specific types of the Entity to types used
for the dc:type property of TextAnnotations. The
[EntityLinkerConfig](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java).DEFAULT_ENTITY_TYPE_MAPPINGS
contains some predefined mappings.
-*Note that the field used to retrieve the types of an suggested Entity can be
configured by the EntityLinkerConfig. The default value for the type field is
"rdf:type".*
+*Note that the field used to retrieve the types of a suggested Entity can be
configured by the EntityLinkerConfig. The default value for the type field is
"rdf:type".*
In some cases suggested entities might redirect to others. In the case of
Wikipedia/DBpedia this is often used to link from acronyms like
[IMF](http://en.wikipedia.org/w/index.php?title=IMF&redirect=no) to the real
entity [International Monetary
Fund](http://en.wikipedia.org/wiki/International_Monetary_Fund). But also some
Thesauri define labels as Entities of their own with a URI, and users might
want to use the URI of the Concept rather than that of the label.
To support such use cases the KeywordLinkingEngine has support for redirects.
Users can first configure the redirect mode (ignore, copy values, follow) and
secondly the field used to search for redirects (default=rdfs:seeAlso).
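
The three redirect modes can be compared in a small sketch. It is not the
code of the engine: entities are modelled as plain dictionaries, and the
names `RedirectMode`, `process_redirect` and the keys "uri", "labels" and
"types" are assumptions made for this example.

```python
from enum import Enum


class RedirectMode(Enum):
    IGNORE = "IGNORE"          # do not process redirects
    ADD_VALUES = "ADD_VALUES"  # copy values, keep the URI of the entity
    FOLLOW = "FOLLOW"          # use the redirected entity instead


def _merge(first, second):
    """Append the values of second that are not yet part of first."""
    return first + [value for value in second if value not in first]


def process_redirect(entity, entities_by_uri, mode,
                     redirect_field="rdfs:seeAlso"):
    """Apply the configured redirect mode to a suggested entity."""
    target = entities_by_uri.get(entity.get(redirect_field))
    if mode is RedirectMode.IGNORE or target is None:
        return entity
    if mode is RedirectMode.FOLLOW:
        return target
    # ADD_VALUES: labels and types of the target, URI of the original.
    merged = dict(entity)
    merged["labels"] = _merge(entity["labels"], target["labels"])
    merged["types"] = _merge(entity["types"], target["types"])
    return merged


fund = {
    "uri": "http://dbpedia.org/resource/International_Monetary_Fund",
    "labels": ["International Monetary Fund"],
    "types": ["dbp-ont:Organisation"],
}
imf = {
    "uri": "http://dbpedia.org/resource/IMF",
    "labels": ["IMF"],
    "types": [],
    "rdfs:seeAlso": fund["uri"],
}
knowledge_base = {fund["uri"]: fund}
```

In this example FOLLOW returns the entity of the International Monetary
Fund, while ADD_VALUES keeps the URI of "IMF" and adds the label and type of
the redirect target.
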
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png?rev=1301363&view=auto
==============================================================================
Binary file - no diff available.
Propchange:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream