[
https://issues.apache.org/jira/browse/STANBOL-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rupert Westenthaler updated STANBOL-245:
----------------------------------------
Description:
The goal of this Engine is to find Terms defined in a Taxonomy within parsed
content. Named Entity Recognition engines (e.g. opennlp-ner) cannot be used
for that because Taxonomies typically also contain Entities of types that
cannot be detected by NER.
Taxonomies will be stored within a ReferencedSite of the Entityhub. Terms of
the Taxonomy will be Entities of the ReferencedSite.
For processing the parsed content (text) this Engine can use the following
natural language processing components:
* OpenNLP tokenizer (SimpleTokenizer, with the possibility to add
language-specific ones)
* Sentence Detector (optional): If present, the parsed content is analyzed
sentence by sentence.
* POS tagger (optional): Part-of-speech analyzers tag each token with the type
of the word. If present, this allows the Engine to look up only words of
specific types (e.g. nouns). If not present, the Engine will look up every
word in the parsed content.
* Chunker (optional): Detects phrases within the parsed content. If not
present, the Engine will try to build chunks based on the POS tags of words
(e.g. two nouns in a row, or nouns connected by a preposition). If no POS
tags are available either, results for the current token could be compared
with those for surrounding tokens.
NOTE: All components other than the Tokenizer are optional. The main reason
for using them is to reduce the number of lookups and thereby improve
performance.
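A minimal sketch of how the optional POS tags can cut down the number of
lookups. Everything here is an assumption made for illustration, not the
Engine's implementation: the in-memory TAXONOMY dict stands in for a
ReferencedSite, the regex tokenizer stands in for the OpenNLP tokenizer, the
tags are Penn Treebank style, and the max_len fallback (trying every short
token sequence when no POS tags exist) is a hypothetical simplification of
"comparing with surrounding tokens".

```python
import re

# Hypothetical stand-in for Terms stored in a ReferencedSite: label -> entity id.
TAXONOMY = {
    "heart attack": "urn:example:term/heart-attack",
    "aspirin": "urn:example:term/aspirin",
}

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # assumed Penn Treebank noun tags
PREPOSITION_TAG = "IN"                    # assumed Penn Treebank preposition tag


def tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def candidate_spans(tokens, pos_tags=None, max_len=3):
    """Return (start, end) token spans that should be looked up.

    Without POS tags, every sequence of up to max_len tokens is a candidate.
    With POS tags, only nouns are candidates; nouns in a row, or nouns joined
    by a single preposition, are merged into one chunk.
    """
    n = len(tokens)
    if pos_tags is None:
        return [(i, i + k) for i in range(n)
                for k in range(1, max_len + 1) if i + k <= n]
    spans = []
    i = 0
    while i < n:
        if pos_tags[i] not in NOUN_TAGS:
            i += 1
            continue
        end = i + 1
        while end < n:
            if pos_tags[end] in NOUN_TAGS:
                end += 1
            elif (pos_tags[end] == PREPOSITION_TAG and end + 1 < n
                  and pos_tags[end + 1] in NOUN_TAGS):
                end += 2
            else:
                break
        spans.append((i, end))
        i = end
    return spans


def find_terms(tokens, pos_tags=None):
    """Look up each candidate span; return (matches, number_of_lookups)."""
    matches = []
    lookups = 0
    for start, end in candidate_spans(tokens, pos_tags):
        label = " ".join(tokens[start:end]).lower()
        lookups += 1
        if label in TAXONOMY:
            matches.append((label, TAXONOMY[label]))
    return matches, lookups
```

Calling find_terms(tokens) and find_terms(tokens, tags) on the same sentence
should return the same matches, with the second call issuing far fewer
lookups because only noun chunks are sent to the taxonomy.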
The Engine will produce TextAnnotations as well as EntityAnnotations.
TextAnnotations will only be created if a Term of the Taxonomy was found.
EntityAnnotations are used to represent suggested Terms within the Taxonomy.
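The relation between the two annotation types could be sketched as follows.
This is an illustration only: the class and field names are hypothetical and
do not reproduce the Stanbol enhancement structure, and the whole-word,
case-insensitive regex match stands in for the real lookup. It shows one
TextAnnotation per mention found in the text, and one EntityAnnotation per
suggested Term pointing back to that mention.

```python
import re
from dataclasses import dataclass


@dataclass
class TextAnnotation:
    selected_text: str  # the text span that matched a Term label
    start: int          # character offset of the first character
    end: int            # character offset after the last character


@dataclass
class EntityAnnotation:
    entity_id: str            # id of the suggested Term (hypothetical URIs)
    label: str                # label of the Term that matched
    relation: TextAnnotation  # the mention this suggestion belongs to


def annotate(text, taxonomy):
    """taxonomy maps a label to the ids of all Terms carrying that label.

    A TextAnnotation is created only when a label is actually found, so
    labels that do not occur in the text produce no annotations at all.
    """
    text_annotations = []
    entity_annotations = []
    for label, entity_ids in taxonomy.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        for match in re.finditer(pattern, text, re.IGNORECASE):
            mention = TextAnnotation(match.group(0), match.start(), match.end())
            text_annotations.append(mention)
            for entity_id in entity_ids:
                entity_annotations.append(
                    EntityAnnotation(entity_id, label, mention))
    return text_annotations, entity_annotations
```

A label shared by several Terms yields a single TextAnnotation with several
EntityAnnotations attached, which is how ambiguous suggestions would be
represented in this sketch.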
NOTE:
Even though this Engine will be able to use any ReferencedSite of the Stanbol
Entityhub, it is intended to be used with Taxonomy-like data. If used in
combination with general-purpose datasets such as dbpedia or freebase, it will
be of only limited use because such datasets define entities for many commonly
used words. This Engine will create Enhancements whenever such words are
present within the parsed content. It might still be possible to use this
Engine successfully with such datasets, but users will need to filter the
results.
> Taxonomy Engine
> ---------------
>
> Key: STANBOL-245
> URL: https://issues.apache.org/jira/browse/STANBOL-245
> Project: Stanbol
> Issue Type: New Feature
> Components: Enhancer
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler
>
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira