[ 
https://issues.apache.org/jira/browse/STANBOL-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rupert Westenthaler resolved STANBOL-245.
-----------------------------------------

    Resolution: Fixed

The current version is still usable. Note, however, that this engine was 
replaced by the KeywordLinkingEngine (see STANBOL-303) and is now deprecated 
(see STANBOL-506).

Existing users should use the KeywordLinkingEngine instead. See the 
documentation at 
http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
                
> Taxonomy Engine
> ---------------
>
>                 Key: STANBOL-245
>                 URL: https://issues.apache.org/jira/browse/STANBOL-245
>             Project: Stanbol
>          Issue Type: New Feature
>          Components: Enhancer
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> The goal of this Engine is to find Terms defined in a Taxonomy within parsed 
> content. Named Entity Recognition engines (e.g. opennlp-ner) cannot be 
> used for that because Taxonomies typically also contain Entities of types 
> that cannot be detected by NER.
> Taxonomies will be stored within a ReferencedSite of the Entityhub. Terms of 
> the Taxonomy will be Entities of the Referenced Site.
> For processing the parsed content (text), this engine can use the following 
> natural language processing components:
> * OpenNLP tokenizer (SimpleTokenizer, with the possibility to add 
> language-specific ones)
> * Sentence Detector (optional): If present, then the parsed content is 
> analyzed sentence by sentence.
> * POS tagger (optional): Part-of-speech taggers tag each token with the 
> type of the word. If present, it allows this engine to look up only words of 
> specific types (e.g. nouns). If not present, this engine will look up every 
> word in the parsed content.
> * Chunker (optional): Allows the engine to detect phrases within the parsed 
> content. If not present, the Engine will try to build chunks based on the POS 
> tags of words (e.g. two nouns in a row, or nouns connected by a 
> preposition). If no POS tags are available either, results for the current 
> token could be compared with surrounding tokens.
> NOTE: all components other than the Tokenizer are optional. The main 
> reason for using them is to reduce the number of lookups and thereby 
> increase performance.
> The Engine will produce TextAnnotations as well as EntityAnnotations. 
> TextAnnotations will only be created if a Term in the Taxonomy was 
> found. EntityAnnotations are used to represent suggested Terms within the 
> Taxonomy.
> NOTE:
> Even though this Engine will be able to use any ReferencedSite of the Stanbol 
> Entityhub, it is intended to be used with Taxonomy-like data. If used in 
> combination with general-purpose datasets such as dbpedia or freebase, it will 
> be of only limited use because such datasets define entities for many 
> commonly used words. This Engine will create Enhancements if such words are 
> present within parsed content. It might still be possible to use this Engine 
> successfully with such datasets, but Users will need to filter results.
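
The lookup strategy described in the issue can be illustrated with a small, 
self-contained Python sketch. This is not the Stanbol implementation: the 
in-memory dictionary stands in for a ReferencedSite, the regex tokenizer and 
sentence splitter stand in for the OpenNLP components, and all names 
(find_terms, toy_pos_tagger, the tax: identifiers) are hypothetical. It only 
shows why the optional POS tagger reduces the number of lookups.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# A tagger maps a list of tokens to a list of POS tags of the same length.
Tagger = Callable[[List[str]], List[str]]


def tokenize(text: str) -> List[str]:
    """Very simple word tokenizer (stand-in for an OpenNLP tokenizer)."""
    return re.findall(r"\w+", text)


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter (stand-in for a sentence detector)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def find_terms(
    text: str,
    taxonomy: Dict[str, str],
    pos_tagger: Optional[Tagger] = None,
    by_sentence: bool = False,
    max_len: int = 2,
) -> Tuple[List[Tuple[str, str]], int]:
    """Return (matches, number_of_lookups).

    Without a POS tagger every token starts a lookup; with one, only
    tokens tagged as nouns (tags starting with "NN") do.
    """
    units = split_sentences(text) if by_sentence else [text]
    matches: List[Tuple[str, str]] = []
    lookups = 0
    for unit in units:
        tokens = tokenize(unit)
        if pos_tagger is not None:
            candidate = [tag.startswith("NN") for tag in pos_tagger(tokens)]
        else:
            candidate = [True] * len(tokens)
        i = 0
        while i < len(tokens):
            if not candidate[i]:
                i += 1
                continue
            step = 1
            # Try the longest span first, then shorter ones.
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                surface = " ".join(tokens[i:i + n])
                lookups += 1
                term_id = taxonomy.get(surface.lower())
                if term_id is not None:
                    matches.append((surface, term_id))
                    step = n  # skip the tokens covered by the match
                    break
            i += step
    return matches, lookups


# Hypothetical taxonomy: lower-cased label -> term identifier.
TAXONOMY = {"heart valve": "tax:HeartValve", "surgery": "tax:Surgery"}

# Toy tagger that only knows a handful of nouns (assumption for the demo).
NOUNS = {"heart", "valve", "surgery"}


def toy_pos_tagger(tokens: List[str]) -> List[str]:
    return ["NN" if t.lower() in NOUNS else "X" for t in tokens]


TEXT = "The heart valve was replaced. Surgery went well."

if __name__ == "__main__":
    print(find_terms(TEXT, TAXONOMY))
    print(find_terms(TEXT, TAXONOMY, pos_tagger=toy_pos_tagger))
```

Both calls find the same terms; the second one simply issues fewer lookups 
because only noun tokens are used as starting points. With a general-purpose 
dataset in place of the small taxonomy, the first variant would also match 
many common words, which is the filtering problem mentioned in the NOTE above.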

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
