enhancer: enhancementstructure.mdtext enhancementstructure.png

rwesten Tue, 29 May 2012 23:43:37 -0700

Author: rwesten
Date: Wed May 30 06:43:05 2012
New Revision: 1344114

URL: http://svn.apache.org/viewvc?rev=1344114&view=rev
Log:
first version for the Documentation of the Stanbol EnhancementStructure


Added:
    
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancementstructure.mdtext
    
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancementstructure.png
   (with props)

Added: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancementstructure.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancementstructure.mdtext?rev=1344114&view=auto
==============================================================================
--- 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancementstructure.mdtext
 (added)
+++ 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancementstructure.mdtext
 Wed May 30 06:43:05 2012
@@ -0,0 +1,126 @@
+This document specifies the structure the Stanbol Enhancer uses to encode 
features extracted from the parsed [ContentItem](contentitem.html). The 
Enhancement Structure is based on [RDF](http://www.w3.org/TR/rdf-primer/) 
technology and defined as an [OWL](http://www.w3.org/2004/OWL/) ontology. 
+
+Its two main purposes are to facilitate the:
+
+1. Interoperability between EnhancementEngines: The design of the Stanbol 
Enhancer is based on the processing of a [ContentItem](contentitem.html) by 
multiple [EnhancementEngine](engines)s in an [EnhancementChain](chains). 
Together with the ContentItem API the EnhancementStructure is the key enabler 
for the cooperation of the different engines. It ensures that enhancements 
created by one engine can be consumed by the following engines (e.g. the first 
engine detects the language of the parsed text; the second consumes the 
language to select the correct NER (named entity recognition) model and creates 
enhancements describing Named Entities contained in the text; the third engine 
consumes those Named Entity annotations and creates suggestions for Entities 
that are part of a controlled vocabulary).
+2. Consumption of extracted Features: The knowledge structure standardized by 
this Ontology aims to allow users to consume/process the features extracted 
from the parsed content. This includes things like:
+    * list all suggested Entities (accept/reject Tags)
+    * list all suggested Topics (content classification)
+    * group Entity suggestions based on detected "Named Entities" 
(disambiguation support)
+    * show the occurrence of detected Entities within the analyzed text 
(similar to spell checker UIs)
+
+    The last section of this document provides a more detailed look at those 
usage scenarios.
+
+This document first specifies the Enhancement Structure, consisting of the 
Ontology as well as additional rules that [EnhancementEngine](engines)s need to 
consider when writing Enhancements. The second part of this document focuses 
on the consumption of Enhancements by users of the Stanbol Enhancer.
+
+## Overview
+
+The Stanbol Enhancement Structure is a central part of the [Stanbol 
Enhancer](index.html) architecture as it represents the binding element between 
the [ContentItem](contentitem.html) and the 
[EnhancementEngine](engines)s that analyze it, as configured by an 
[EnhancementChain](chains). Together with the 
[ContentParts](contentitem.html#content-parts) it represents 
the state that is constantly updated during the enhancement process.
+
+The following graphic provides an overview on how the EnhancementStructure is 
used by the Stanbol Enhancer to formally represent the enhancement results.
+
+![EnhancementStructure Overview](enhancementstructure.png "Overview of the 
Stanbol Enhancement Structure showing 'Bob Marley' recognized as Person within 
the parsed Text with two suggested Entities 'Bob Marley' the musician and 'Bob 
Marley' the comedian")
+
+The above figure shows three Enhancements: one TextAnnotation created by the 
NER (Named Entity Recognition) engine and two EntityAnnotations that 
suggest/link Entities as defined by [DBpedia.org](http://dbpedia.org).
+
+The bold relations within the figure are central as they show how the 
EnhancementStructure is used to formally specify that the mention "Bob Marley" 
within the analyzed text is believed to represent the Entity 
[dbpedia:Bob_Marley](http://dbpedia.org/resource/Bob_Marley). However, it is 
also stated that there is an ambiguity with another person, 
[dbpedia:Bob_Marley_(comedian)](http://dbpedia.org/resource/Bob_Marley_(comedian)).
+
+The dashed relations are also important as they are used to formally describe 
the extraction context: which EnhancementEngine has extracted a feature from 
which ContentItem. If even more contextual information is needed, users can 
combine this information with the [ExecutionMetadata](executionmetadata.html) 
collected during the enhancement process.
+
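The relations shown in the figure can be consumed with very little code. The following Python sketch is illustrative only: it models the two EntityAnnotations as plain dictionaries keyed by the property names defined later in this document, and groups them by the TextAnnotation they point to via dc:relation. The `urn:` identifiers, the `id` key, the confidence values and the function name `group_suggestions` are assumptions made for the example and do not come from the Stanbol API.

```python
from collections import defaultdict

# Hypothetical data mirroring the figure: two suggestions for one mention.
entity_annotations = [
    {
        "id": "urn:example:entity-annotation-2",
        "fise:entity-reference": "http://dbpedia.org/resource/Bob_Marley_(comedian)",
        "dc:relation": ["urn:example:text-annotation-1"],
        "fise:confidence": 0.4,  # assumed value
    },
    {
        "id": "urn:example:entity-annotation-1",
        "fise:entity-reference": "http://dbpedia.org/resource/Bob_Marley",
        "dc:relation": ["urn:example:text-annotation-1"],
        "fise:confidence": 0.8,  # assumed value
    },
]


def group_suggestions(annotations):
    """Group EntityAnnotations by the TextAnnotation they refer to."""
    groups = defaultdict(list)
    for annotation in annotations:
        for text_annotation in annotation["dc:relation"]:
            groups[text_annotation].append(annotation)
    # Confidence is only used for ordering; it is not a ratio scale.
    for suggestions in groups.values():
        suggestions.sort(key=lambda a: a.get("fise:confidence", 0.0), reverse=True)
    return dict(groups)


grouped = group_suggestions(entity_annotations)
```

Each key of `grouped` is one mention in the text and each value is its list of candidate Entities, best candidate first, which is the shape a disambiguation UI typically needs.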
+## General Information
+
+__Used Namespaces__
+
+The Enhancement Structure uses or references the following namespaces:
+
+* __fise__ (_http://fise.iks-project.eu/ontology/_): This is the main 
namespace of the currently used Enhancement Structure. All custom concepts and 
properties are defined using this namespace. (*)
+* __enhancer__ (_http://stanbol.apache.org/ontology/enhancer/enhancer#_): This 
is the main namespace of the Stanbol Enhancer defining concepts such as 
ContentItem, EnhancementEngine, EnhancementChain, ...
+* __entityhub__ (_http://stanbol.apache.org/ontology/entityhub/entityhub#_): 
This is the main namespace of the Stanbol Entityhub component. 
+* __dc__ (_http://purl.org/dc/terms/_): The Dublin Core terms standard is also 
heavily used by the Stanbol Enhancement Structure, especially to encode 
metadata, but also to encode relations between extracted information 
(fise:Enhancements).
+* __dbpedia-ont__ (_http://dbpedia.org/ontology/_): Concepts of this Ontology 
are used to describe the types of "Named Entities" detected in parsed content.
+* __skos__ (_http://www.w3.org/2004/02/skos/core#_): The SKOS standard is 
preferably used to describe entries of thesauri or, more generally, any type of 
controlled vocabulary.
+* __rdf__ (_http://www.w3.org/1999/02/22-rdf-syntax-ns#_)
+* in addition [EnhancementEngine](engines)s are free to add/use properties of 
any additional Ontology (e.g. when adding the rdf:type's of suggested Entities).
+
+_(*) Historical side note: FISE was the name of the Stanbol Enhancer before 
its [incubation to Apache](TODO: add link). The Enhancement Structure does 
still use the original namespace for compatibility reasons._
+
+__About Expressiveness:__
+
+All Stanbol Ontologies are encoded using OWL but restrict themselves to basic 
features. Users need to be aware that not all rules defined in this 
documentation are formally expressed within the Ontology. However, all the 
stated rules are validated by the 
[EnhancementStructureHelper](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/generic/test/src/main/java/org/apache/stanbol/enhancer/test/helper/EnhancementStructureHelper.java)
 unit test utility, which is part of the "org.apache.stanbol.enhancer.test" 
module. This ensures that EnhancementEngine implementations that validate their 
enhancements using this utility comply with this specification.
+
+__About Reasoning:__
+
+Apache Stanbol assumes that users have no reasoning support. Because of 
that, EnhancementEngines are required to materialize information that would 
otherwise only be available by reasoning (e.g. it is required that they add both 
"fise:TextAnnotation" and "fise:Enhancement" as "rdf:type"s when writing a 
TextAnnotation).
+
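As a minimal sketch of what materializing the types means in practice, the following Python snippet produces both rdf:type statements for a new enhancement. Triples are modelled as plain tuples of URI strings; the subject URI and the function name `materialized_types` are assumptions made for illustration and are not part of any Stanbol API.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FISE = "http://fise.iks-project.eu/ontology/"


def materialized_types(subject, specific_type):
    """Return rdf:type triples for the specific type AND fise:Enhancement."""
    types = [FISE + specific_type, FISE + "Enhancement"]
    # dict.fromkeys drops the duplicate if specific_type is "Enhancement".
    return [(subject, RDF_TYPE, t) for t in dict.fromkeys(types)]


# Hypothetical subject URI for a newly written TextAnnotation.
triples = materialized_types("urn:example:enhancement-1", "TextAnnotation")
```

A consumer without a reasoner can then select all enhancements with a single query on fise:Enhancement, regardless of their specific annotation type.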
+## Core Concepts
+
+The main concept of the Stanbol Enhancement Structure is the 
"fise:Enhancement". It is used as base concept for all annotation types and 
defines the generic properties every enhancement MUST provide (e.g. creator, 
creation date, extracted-from, confidence). On top of the "fise:Enhancement" 
three specific annotation types are defined:
+
+* TextAnnotation: To describe features with their occurrence within the parsed 
Text
+* EntityAnnotation: To suggest (linked) Entities with features detected within 
the content
+* TopicAnnotation: To classify (link) the parsed content along topics
+
+### fise:Enhancement
+
+Every feature extracted by an [EnhancementEngine](engines) that is expressed 
using the Stanbol Enhancement Structure needs to be represented as an RDF 
resource with the "rdf:type" "fise:Enhancement".
+
+Enhancements use [Dublin Core 
terms](http://dublincore.org/documents/dcmi-terms/) to provide metadata about 
their creation:
+
+* __dc:creator__ _(required, single)_: The [EnhancementEngine](engines) that 
created the Enhancement. Currently the fully qualified name of the Java class 
implementing the engine is used as a String value. In a future version this will 
change to the relative URL of the EnhancementEngine (e.g. 
"/enhancer/engine/{engine-name}").
+* __dc:created__ _(required, single)_: The UTC date/time when the enhancement 
was created by the EnhancementEngine.
+* __dc:contributor__ _(optional, multiple)_: Additional 
[EnhancementEngine](engines) that contributed to the Enhancement.
+* __dc:modified__ _(optional, single)_: The last change to a given enhancement.
+
+The following properties provide information about the enhancement:
+
+* __fise:extracted-from__ _(required, single)_: The URI of the 
"enhancer:ContentItem" the feature was extracted from. EnhancementEngines need 
to use the UriRef returned by ContentItem#getUri() as value.
+* __fise:confidence__ _(optional, single, range: 0 <= confidence <= 1)_: The 
confidence of the enhancement as a floating point number. NOTE that while this 
uses a floating point number as value, users should not treat the values as 
lying on a ratio scale, meaning that an enhancement with a confidence of 0.4 is 
NOT half as good as one with 0.8!
+* __dc:relation__ _(optional, multiple)_: Specifies that the current 
fise:Enhancement has a relation to another fise:Enhancement. Values need to be 
resources of the "rdf:type" "fise:Enhancement".
+* __dc:requires__ _(optional, multiple)_: Specifies that the current 
fise:Enhancement depends on another fise:Enhancement. This is a stronger 
version of "dc:relation" and should indicate that if one of the required 
enhancements is declined/removed this also affects this one. Values need to be 
resources of the "rdf:type" "fise:Enhancement". NOTE also that Dublin Core 
terms defines dc:requires as a sub-property of dc:relation.
+
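The cardinality rules listed above can be checked mechanically. The sketch below is not the EnhancementStructureHelper and does not reproduce its behavior; it is a small Python illustration that assumes an enhancement is given as a dictionary mapping prefixed property names to lists of values. The function name `check_enhancement`, the engine class name and the URIs are hypothetical.

```python
# (required, single) per property, transcribed from the lists above.
CARDINALITY = {
    "dc:creator": (True, True),
    "dc:created": (True, True),
    "dc:contributor": (False, False),
    "dc:modified": (False, True),
    "fise:extracted-from": (True, True),
    "fise:confidence": (False, True),
    "dc:relation": (False, False),
    "dc:requires": (False, False),
}


def check_enhancement(props):
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    for name, (required, single) in CARDINALITY.items():
        values = props.get(name, [])
        if required and not values:
            problems.append(f"{name} is required")
        if single and len(values) > 1:
            problems.append(f"{name} allows only a single value")
    for confidence in props.get("fise:confidence", []):
        if not 0 <= confidence <= 1:
            problems.append("fise:confidence must be within [0, 1]")
    return problems


# Hypothetical enhancement written by an engine.
enhancement = {
    "dc:creator": ["org.example.MyEngine"],
    "dc:created": ["2012-05-30T06:43:05Z"],
    "fise:extracted-from": ["urn:example:content-item-1"],
    "fise:confidence": [0.8],
}
problems = check_enhancement(enhancement)
```

The check only covers presence, cardinality and the confidence range; it does not verify that dc:relation or dc:requires values are themselves of type fise:Enhancement, which would require access to the whole graph.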
+### fise:TextAnnotation
+
+TextAnnotations are used to select portions of the parsed textual content by 
using the following properties:
+
+* __fise:start__ _(optional, single)_: The start character position within the 
plain text version of the parsed content. Note that the plain text version can 
be retrieved by using the [multi-part content item 
support](enhancerrest.html#multi-part-contentitem-support) of the Stanbol 
Enhancer RESTful API.
+* __fise:end__ _(required if fise:start is present, single)_: The end 
character position. This MUST only be present if "fise:start" is also defined.
+* __fise:selected-text__ _(optional, single)_: The text selected by the 
TextAnnotation. This MUST be the same as the text from index "fise:start" to 
"fise:end" within the plain text version of the parsed content.
+* __fise:selection-context__ _(required if fise:selected-text is present, 
single)_: The selection context such as the current sentence or a fixed number 
of characters/word before and after the selected text. This MUST be present if 
"fise:selected-text" is defined.
+* __dc:type__ _(optional, single)_: The nature of the selected part of the text 
(e.g. dbpedia-ont:Person, Organization, dbpedia-ont:Place for Named Entities; 
dc:LinguisticSystem for language annotations; skos:Concept for abstract things 
incl. categorizations). Note that dc:type values are just recommendations. 
Users are free to use types other than the recommended ones. As an example the 
[KeywordLinkingEngine](engines/keywordlinkingengine.html) allows users to 
configure dc:type mappings.
+
+As hinted by the description of the above properties their usage depends on 
the size of the selected part of the text.
+
+* Selection of the whole document: This is the default and MUST be assumed if 
none of the start/end/selected-text/selection-context properties is present.
+* Selection of a part (e.g. chapter, sentence): The preferred way is to define 
start/end positions. selected-text and selection-context are inefficient for 
bigger sections as they would duplicate those sections of the content within 
the RDF graph as literals.
+* Selection of words, word-phrases: In this case it is highly recommended to 
define start/end as well as selected-text/selection-context. Especially the 
selected-text and selection-context are important to calculate the exact 
position of an enhancement in non-plain-text content (e.g. HTML fragments).
+
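The dependencies between the four selection properties can be summarized in a short Python sketch. It is an illustration under stated assumptions, not Stanbol code: the function names are hypothetical, and "fise:end" is treated as an exclusive index (as in a Python slice), which the text above does not state explicitly.

```python
def selects_whole_document(start=None, end=None, selected_text=None,
                           selection_context=None):
    """True if none of the selection properties is present."""
    return all(v is None for v in (start, end, selected_text, selection_context))


def check_selection(plain_text, start=None, end=None, selected_text=None,
                    selection_context=None):
    """Return violations of the TextAnnotation selection rules."""
    problems = []
    # fise:end is required with fise:start and must not appear without it.
    if (start is None) != (end is None):
        problems.append("fise:start and fise:end must be used together")
    if selected_text is not None and selection_context is None:
        problems.append("fise:selection-context is required with fise:selected-text")
    if None not in (start, end, selected_text):
        # Assumption: end is exclusive, so the span is plain_text[start:end].
        if plain_text[start:end] != selected_text:
            problems.append("fise:selected-text does not match the start/end span")
    return problems


text = "Bob Marley was a musician."
ok = check_selection(text, start=0, end=10, selected_text="Bob Marley",
                     selection_context=text)
```

Here `ok` is an empty list because characters 0 to 10 of the plain text are exactly "Bob Marley" and a selection context accompanies the selected text.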
+NOTE: In a future version TextAnnotations might switch to a model that uses:
+
+* fise:selection-prefix: some words/characters before the selected section.
+* fise:selection-head: the first few words/characters of the selected section 
within the text. Alternative to fise:selected-text in case bigger sections of 
the parsed content need to be selected.
+* fise:selection-tail: the last few words/characters of a selected section. To 
be used together with fise:selection-head.
+* fise:selection-suffix: some words/characters after the selected section.
+
+### fise:EntityAnnotation
+
+EntityAnnotations are used to suggest/link entities recognized within the 
Text. While fise:TextAnnotations are used for representing the recognition(s) 
(occurrence(s) within the content) the EntityAnnotation provides information 
about the referenced Entity.
+
+* __fise:entity-reference__ _(required, single)_: The URI of the referenced 
entity. In case several URIs are defined as equal (e.g. by "owl:sameAs") 
EnhancementEngines need to choose one of the URIs and include the corresponding 
"owl:sameAs" statements in the enhancement results.
+* __fise:entity-label__ _(required, single)_: The label of the linked entity. 
While entities may define multiple labels (e.g. for different languages, 
alternate/preferred, ...) EnhancementEngines are required to only include a 
single - the best fitting - label.
+* __fise:entity-type__ _(optional, multiple)_: The types of the linked entity. 
Usually this is the list of rdf:types. However there might be situations where 
other Resources are used as types. 
+* __dc:relation__ _(required, multiple)_: The dc:relation property is required 
for entity annotations. Typical values are the "fise:TextAnnotation"s that this 
EntityAnnotation is a suggestion for.
+* __entityhub:site__ _(optional, single)_: The name of the Entityhub 
ReferencedSite managing the suggested Entity. If this property is present 
users can dereference the suggested Entity with a GET request to 
"{stanbol}/entityhub/site/{site-name}/entity?id={entity}" where {site-name} is 
the value of this property and {entity} is the value of the 
"fise:entity-reference" property. 
+    NOTE: the values "local" and "entityhub" need to be treated separately. In 
those cases the GET request needs to use 
"{stanbol}/entityhub/entity?id={entity}".
+
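The rule for choosing the dereferencing URL can be sketched as a small Python helper. The function name `entity_lookup_url`, the base URL "http://localhost:8080" and the site name "dbpedia" are assumptions for the example. Percent-encoding the entity URI in the query string is also an assumption of this sketch, since the text above does not say how the "id" parameter is to be encoded.

```python
from urllib.parse import quote


def entity_lookup_url(stanbol_base, entity_uri, site=None):
    """Build the GET URL for a suggested Entity, or None if no site is known."""
    if site is None:
        # entityhub:site is optional; without it no lookup URL is defined here.
        return None
    base = stanbol_base.rstrip("/")
    entity = quote(entity_uri, safe="")  # assumption: id is percent-encoded
    if site in ("local", "entityhub"):
        # Special values: the Entity is managed by the Entityhub itself.
        return f"{base}/entityhub/entity?id={entity}"
    return f"{base}/entityhub/site/{site}/entity?id={entity}"


url = entity_lookup_url(
    "http://localhost:8080",                    # hypothetical Stanbol base URL
    "http://dbpedia.org/resource/Bob_Marley",   # fise:entity-reference value
    site="dbpedia",                             # hypothetical entityhub:site value
)
```

The same helper applies to TopicAnnotations, which use the same "entityhub:site" and "fise:entity-reference" properties.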
+### fise:TopicAnnotation
+
+TopicAnnotations are used to categorize/classify the parsed content along some 
categorization system. This is done by suggesting/linking Topics of that 
categorization system for the parsed content, or possibly for parts of it. A 
"fise:TextAnnotation" is used to select the part of the content where the 
linked topics apply.
+
+* __fise:entity-reference__ _(required, single)_: The URI of the topic.
+* __fise:entity-label__ _(required, single)_: The human readable label of the 
topic. While topics may define multiple labels (e.g. for different languages) 
EnhancementEngines are required to only include a single - the best fitting - 
label.
+* __fise:entity-type__ _(optional, multiple)_: It is best practice to use 
[SKOS](http://www.w3.org/2004/02/skos/) for modeling hierarchical 
classification systems. If this recommendation is followed then the value of 
fise:entity-type will be "skos:Concept". However, users are free to also use 
different types with "fise:TopicAnnotation"s. 
+* __dc:relation__ _(required, multiple)_: The dc:relation property is required 
for topic annotations. It refers to the fise:TextAnnotation specifying the part 
of the text this topic is applied to.
+* __entityhub:site__ _(optional, single)_: The name of the Entityhub 
ReferencedSite managing the suggested Entity. If this property is present 
users can dereference the suggested Entity with a GET request to 
"{stanbol}/entityhub/site/{site-name}/entity?id={entity}" where {site-name} is 
the value of this property and {entity} is the value of the 
"fise:entity-reference" property. 
+    NOTE: the values "local" and "entityhub" need to be treated separately. In 
those cases the GET request needs to use 
"{stanbol}/entityhub/entity?id={entity}".
+

Added: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancementstructure.png
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancementstructure.png?rev=1344114&view=auto
==============================================================================
Binary file - no diff available.

Propchange: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancementstructure.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream
