[jira] [Updated] (STANBOL-245) Taxonomy Engine

Rupert Westenthaler (JIRA) Thu, 30 Jun 2011 05:56:57 -0700

     [ 
https://issues.apache.org/jira/browse/STANBOL-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rupert Westenthaler updated STANBOL-245:
----------------------------------------

    Description: 
The goal of this Engine is to find Terms defined in a Taxonomy within parsed 
content. Named Entity Recognition engines (e.g. opennlp-ner) cannot be used 
for that because Taxonomies typically also contain Entities of types that 
cannot be detected by NER.

Taxonomies will be stored within a ReferencedSite of the Entityhub. Terms of 
the Taxonomy will be Entities of the ReferencedSite.

For processing the parsed content (Text), this engine can use the following 
natural language processing components:

* OpenNLP tokenizer (SimpleTokenizer, with the possibility to add 
language-specific ones)
* Sentence Detector (optional): If present, the parsed content is analyzed 
sentence by sentence.
* POS tagger (optional): Part-of-speech analyzers tag each token with the type 
of the word. If present, it allows this engine to look up only words of 
specific types (e.g. nouns). If not present, this engine will look up every 
word in the parsed content.
* Chunker (optional): Allows phrases to be detected within the parsed content. 
If not present, the Engine will try to build chunks based on the POS tags of 
words (e.g. two nouns in a row, or nouns connected by a preposition). If no POS 
tags are available either, results for the current token could be compared 
with those of surrounding tokens.

NOTE: All components other than the Tokenizer are optional. The main reason 
for using them is to reduce the number of lookups and thereby increase 
performance.
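
The steps listed above can be sketched roughly as follows. This is only an 
illustration under stated assumptions, not the Engine's implementation: the 
taxonomy is a plain in-memory dictionary instead of a ReferencedSite, the 
tokenizer is a regular expression instead of the OpenNLP SimpleTokenizer, and 
the names TAXONOMY, toy_tagger and find_terms are hypothetical. Chunking is 
approximated by trying the longest run of tokens first.

```python
import re

# Hypothetical stand-in for a taxonomy stored in a ReferencedSite:
# lower-cased label -> term identifier (identifiers are made up).
TAXONOMY = {
    "heart attack": "urn:example:term/heart-attack",
    "aspirin": "urn:example:term/aspirin",
}


def tokenize(text):
    """Split text into runs of word characters (stand-in for a real tokenizer)."""
    return re.findall(r"\w+", text)


def toy_tagger(tokens):
    """Toy POS tagger (assumption): tags a fixed set of words as NOUN."""
    nouns = {"patient", "aspirin", "heart", "attack"}
    return ["NOUN" if t.lower() in nouns else "OTHER" for t in tokens]


def find_terms(text, pos_tagger=None, max_len=3):
    """Return (matches, lookups) for the given text.

    matches is a list of (matched label, term id) pairs; lookups counts how
    many taxonomy lookups were needed. If pos_tagger is given, a lookup is
    only started at tokens tagged NOUN; otherwise at every token.
    """
    tokens = tokenize(text)
    tags = pos_tagger(tokens) if pos_tagger else None
    matches, lookups, i = [], 0, 0
    while i < len(tokens):
        if tags is not None and tags[i] != "NOUN":
            i += 1
            continue
        matched_len = 0
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            lookups += 1
            if candidate in TAXONOMY:
                matches.append((candidate, TAXONOMY[candidate]))
                matched_len = n
                break
        i += matched_len if matched_len else 1
    return matches, lookups
```

For the sentence "The patient took aspirin after a heart attack", both variants 
find the same two terms, but the variant with the toy tagger needs 7 lookups 
instead of 19, which is the effect the NOTE above describes.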

The Engine will produce TextAnnotations as well as EntityAnnotations. 
TextAnnotations will only be created if a Term of the Taxonomy was found. 
EntityAnnotations are used to represent suggested Terms within the Taxonomy.
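
The relationship between the two kinds of annotations might look like the 
following sketch. The class and field names are illustrative assumptions, not 
the Stanbol enhancement structure itself, and the matching is a naive 
case-insensitive search for the first occurrence of each label (which assumes 
that lower-casing does not change the length of the text).

```python
from dataclasses import dataclass


@dataclass
class TextAnnotation:
    """Marks the span of text in which a Term was found."""
    start: int
    end: int
    selected_text: str


@dataclass
class EntityAnnotation:
    """Suggests one Term of the Taxonomy for a TextAnnotation."""
    relation: TextAnnotation
    entity_reference: str
    entity_label: str


def annotate(text, taxonomy):
    """taxonomy maps a label to the list of term ids carrying that label."""
    text_annotations, entity_annotations = [], []
    lowered = text.lower()
    for label, term_ids in taxonomy.items():
        start = lowered.find(label.lower())
        if start < 0:
            continue  # Term not found: no TextAnnotation is created.
        end = start + len(label)
        ta = TextAnnotation(start, end, text[start:end])
        text_annotations.append(ta)
        # One EntityAnnotation per suggested Term, all linked to the same span.
        for term_id in term_ids:
            entity_annotations.append(EntityAnnotation(ta, term_id, label))
    return text_annotations, entity_annotations
```

A label shared by several Terms yields one TextAnnotation and several 
EntityAnnotations that all refer to it.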

NOTE:
Even though this Engine will be able to use any ReferencedSite of the Stanbol 
Entityhub, it is intended to be used with Taxonomy-like data. If used in 
combination with general-purpose datasets such as dbpedia or freebase, it will 
be of only limited use because such datasets define entities for many commonly 
used words, and this Engine will create Enhancements whenever such words are 
present within the parsed content. It might still be possible to use this 
Engine successfully with such datasets, but Users will need to filter the 
results.
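
The issue does not say how such filtering should be done. One possible 
approach, shown here purely as an assumption, is to drop suggestions whose 
matched text is a common or very short word. The stop list, the minimum length 
and the function name are hypothetical.

```python
# Hypothetical stop list of commonly used words.
COMMON_WORDS = {"the", "a", "of", "time", "people", "way"}


def filter_suggestions(suggestions, common_words=COMMON_WORDS, min_length=4):
    """Keep (matched text, entity id) pairs whose text is not a common or short word."""
    kept = []
    for matched_text, entity_id in suggestions:
        word = matched_text.lower()
        if word in common_words or len(word) < min_length:
            continue
        kept.append((matched_text, entity_id))
    return kept
```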

  was:
The goal of this Engine is to find Terms defined in a Taxonomy within parsed 
content. Named Entity Recognition (e.g. the opennlp-ner) engines can not be 
used for that because Taxonomies typically also contain Entities of types that 
can not be detected by NER.

Taxonomies will be stored within a ReferencedSite of the Entityhub. Terms of 
the Taxonomy will be Entities of the Referenced Site

For processing of the parsed content (Text) this engine can use the following 
natural language processing component.

* OpenNLP tokenizer (SimpleTokenizer with the possibility to add Language 
specific onece)
* Sentence Detector (optional): If present than the parsed content is analyzed  
sentence by sentence
* POS tagger (optional): Part of Speech analyzers tag each token with the type 
of the Word. If present it allows this engine to look up only words with a 
specific types (e.g. nouns). If not present this engine will lookup every word 
in the parsed content.
* Chunker (optional): Allows to detect phrases within the parsed content. If 
not present the Engine will try to build chunks based on the POS tags of words 
(e.g. two nouns in a row or nouns connected with a preposision). If also no POS 
tags are available results for the current could be compared with surrounding 
tokens.

NOTE: all that components other than the Tokenizer are optional. The main 
reason for there usage is to reduce the number of lookups and therefore to 
increase the performance.

The Engine will produce TextAnnotations as well as EntityAnnotations. 
TextAnnotations will only be created in case an Term in the Taxonomy was found. 
EntityAnnotations are used to represent suggested Terms within the Taxonomy.

NOTE:
Even that this Engine will be able to use any ReferencedSite of the Stanbol 
Entityhub it is intended to be used with Taxonomy like data. If used in 
combination with general purpose datasets such as dbpedia or freebase it will 
be only of limited use because such datasets define entities for many commonly 
used words. This Engine will create Enhancements if such words are present 
within parsed content. It might still be possible to successfully use this 
Engine for such datasets, but Users will need to filter results.


> Taxonomy Engine
> ---------------
>
>                 Key: STANBOL-245
>                 URL: https://issues.apache.org/jira/browse/STANBOL-245
>             Project: Stanbol
>          Issue Type: New Feature
>          Components: Enhancer
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
