[
https://issues.apache.org/jira/browse/STANBOL-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rupert Westenthaler resolved STANBOL-245.
-----------------------------------------
Resolution: Fixed
The current version is still usable. However, note that this engine was
replaced by the KeywordLinkingEngine (see STANBOL-303) and is now deprecated
(see STANBOL-506).
Existing users should use the KeywordLinkingEngine instead. See the
documentation at
http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
> Taxonomy Engine
> ---------------
>
> Key: STANBOL-245
> URL: https://issues.apache.org/jira/browse/STANBOL-245
> Project: Stanbol
> Issue Type: New Feature
> Components: Enhancer
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler
>
> The goal of this Engine is to find Terms defined in a Taxonomy within parsed
> content. Named Entity Recognition (NER) engines (e.g. opennlp-ner) cannot be
> used for that because Taxonomies typically also contain Entities of types
> that cannot be detected by NER.
> Taxonomies will be stored within a ReferencedSite of the Entityhub. Terms of
> the Taxonomy will be Entities of the ReferencedSite.
> For processing of the parsed content (Text) this engine can use the following
> natural language processing components:
> * OpenNLP tokenizer (SimpleTokenizer with the possibility to add Language
> specific one)
> * Sentence Detector (optional): If present, the parsed content is
> analyzed sentence by sentence.
> * POS tagger (optional): Part of Speech analyzers tag each token with the
> type of the word. If present, it allows this engine to look up only words of
> specific types (e.g. nouns). If not present, this engine will look up every
> word in the parsed content.
> * Chunker (optional): Detects phrases within the parsed content. If
> not present, the Engine will try to build chunks based on the POS tags of
> words (e.g. two nouns in a row or nouns connected with a preposition). If
> no POS tags are available either, lookup results for the current token could
> be compared with the surrounding tokens.
> NOTE: all components other than the Tokenizer are optional. The main
> reason for their usage is to reduce the number of lookups and therefore to
> increase performance.
> The Engine will produce TextAnnotations as well as EntityAnnotations.
> TextAnnotations will only be created if a Term in the Taxonomy was
> found. EntityAnnotations are used to represent suggested Terms within the
> Taxonomy.
> NOTE:
> Even though this Engine will be able to use any ReferencedSite of the Stanbol
> Entityhub, it is intended to be used with Taxonomy-like data. If used in
> combination with general purpose datasets such as dbpedia or freebase it will
> be of only limited use, because such datasets define entities for many
> commonly used words. This Engine will create Enhancements whenever such words
> are present within parsed content. It might still be possible to successfully
> use this Engine for such datasets, but Users will need to filter results.
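
The lookup strategy described in the quoted issue can be illustrated with a
minimal Python sketch. It is not the Stanbol implementation (the engine itself
is a Java component using OpenNLP and the Entityhub). The in-memory TAXONOMY
dictionary, the regex tokenizer, the MAX_PHRASE_TOKENS bound, the term URNs and
the function names are all assumptions made for illustration. The sketch only
shows why optional POS tags reduce the number of lookups: without tags every
token starts a lookup, with tags only tokens of the configured types do.

```python
import re
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical in-memory taxonomy: lower-cased label -> term id.
# In the issue, Terms are Entities of an Entityhub ReferencedSite instead.
TAXONOMY: Dict[str, str] = {
    "heart attack": "urn:example:term/heart-attack",
    "aspirin": "urn:example:term/aspirin",
}

MAX_PHRASE_TOKENS = 3  # assumed upper bound on the length of a label, in tokens


def tokenize(text: str) -> List[str]:
    """Stand-in for a simple tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def find_terms(
    tokens: Sequence[str],
    pos_tags: Optional[Sequence[str]] = None,
    lookup_tags: Tuple[str, ...] = ("NN", "NNS", "NNP"),
) -> Tuple[List[Tuple[int, int, str]], int]:
    """Return ((start, end, term_id) matches, number of lookups performed).

    If pos_tags is given, a lookup starts only at tokens whose tag is in
    lookup_tags; otherwise every token starts a lookup. At each start
    position the longest matching phrase wins.
    """
    matches: List[Tuple[int, int, str]] = []
    lookups = 0
    i = 0
    while i < len(tokens):
        if pos_tags is not None and pos_tags[i] not in lookup_tags:
            i += 1
            continue
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_TOKENS, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            lookups += 1
            term_id = TAXONOMY.get(phrase)
            if term_id is not None:
                matches.append((i, i + n, term_id))
                i += n  # continue after the matched phrase
                matched = True
                break
        if not matched:
            i += 1
    return matches, lookups
```

Calling find_terms(tokenize(text)) looks up every token, while passing a list
of POS tags of the same length as the token list restricts lookups to nouns.
Both calls should return the same matches for taxonomy-like data, with fewer
lookups in the tagged case.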