Author: rwesten
Date: Mon Nov 28 05:41:23 2011
New Revision: 1206981
URL: http://svn.apache.org/viewvc?rev=1206981&view=rev
Log:
first version of a proposal for the Stanbol Enhancement Structure based on the
[Annotation-Ontology](http://code.google.com/p/annotation-ontology/)
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/ses_annotationontology.mdtext
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/ses_annotationontology.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/ses_annotationontology.mdtext?rev=1206981&view=auto
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/ses_annotationontology.mdtext
(added)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/ses_annotationontology.mdtext
Mon Nov 28 05:41:23 2011
@@ -0,0 +1,157 @@
+Title: The Stanbol Enhancement Structure (PROPOSAL)
+
+Please NOTE: This is a proposal for the future version of the Enhancement
Structure used by the Stanbol Enhancer.
+
+**NOTES:**
+
+* This **DOES NOT** describe the Enhancement Structure used by the current
version of the Stanbol Enhancer!
+* There is also an [older Proposal](stanbolenhancementstructure.html) that
might still contain some information that is not yet covered in this version.
+
+## Background
+
+This proposal aims to define the "Stanbol Enhancement Structure", which is
intended to be used by future versions of the Stanbol Enhancer to encode
Knowledge extracted from analyzed Documents.
+
+Currently the Stanbol Enhancer still uses the [FISE Enhancement
Structure](http://wiki.iks-project.eu/index.php/EnhancementStructure) that
dates back to before Stanbol entered incubation at Apache. This proposal
suggests basing the "Stanbol Enhancement Structure" on the existing
[Annotation-Ontology](http://code.google.com/p/annotation-ontology/wiki/Homepage).
+
+The following two sections provide a short overview of the currently used
FISE Enhancement Structure and of the Annotation-Ontology, because this
information is critical to understanding the suggestions made in the later
parts of this document.
+
+### FISE Enhancement Structure
+
+The FISE Enhancement Structure defines three main Concepts:
+
+1. **FISE Enhancement**: Defines Metadata about the creation process, type of
the Enhancement as well as relations to other Enhancements.
+2. **FISE Text Annotation**: Defines a selection within enhanced plain Text.
Annotations about other content types are not defined.
+3. **FISE Entity Annotation**: Defines an annotation about an Entity.
+
+Each Annotation created by an Enhancement Engine MUST have the FISE
Enhancement type as well as one of FISE Text Annotation or FISE Entity
Annotation.
+
+The typical use is as follows:
+
+* A Text Annotation is used to define the annotated part of the document. Text
Annotations use the dc:type property to define the type of the extracted
entity (e.g. as provided by Named Entity Recognition).
+* An Entity Annotation is used to suggest Entities for a Text Annotation.
+* Properties of the Enhancement are used to link the Text Annotation with the
suggested Entity Annotations.
+* Enhancement Engines may also add knowledge about suggested entities
(dereferencing of entities).
+
+Annotations like Keywords, Categories ... were discussed but never formally
defined for the FISE Enhancement Structure.
+
+### Annotation-Ontology
+
+This Proposal describes how Stanbol can use the
[Annotation-Ontology](http://code.google.com/p/annotation-ontology/wiki/Homepage)
for encoding Enhancements.
+
+From the Annotation-Ontology homepage:
+
+> Annotation Ontology (AO) is a vocabulary designed to extensively reuse
existing domain ontologies (entities annotations or semantic tags) and to
provide several other kind of annotations - comments, textual annotation
(classic tags), notes, examples, erratum... - on potentially any kind of
document (text, images, audio...) and document fragments.
+
+The following Figure gives an overview of the Annotation-Ontology. It shows
a simple tagging-like annotation of a whole document.
+
+> 
+
+> Image Credit: Annotation-Ontology
[Link](http://annotation-ontology.googlecode.com/svn/trunk/images/Document%20Annotation%20-%20AO%20Annotation%20Ontology%20-%20by%20Paolo%20Ciccarese.png)
+
+## Stanbol Enhancement Structure
+
+The following sections describe how the Stanbol Enhancement Structure can
utilize the Annotation-Ontology to encode knowledge extracted from analyzed
Content Items.
+
+### ContentItems
+
+Within the FISE Enhancement Structure the enhanced ContentItems were only
referenced by the **fise:extracted-from** property. There was no specification
on how to further define properties of the ContentItem. The Annotation-Ontology
defines a much richer vocabulary for that.
+
+First and most important, the Annotation-Ontology distinguishes between the:
+
+* **Annotated Document**: This is the Document that is annotated
+* **Source Document**: This is the Document version that was used for the
annotation process.
+
+> 
+
+> Image Credit: Annotation Ontology
[Link](http://annotation-ontology.googlecode.com/svn/trunk/images/Source%20Document%202%20-%20AO%20Annotation%20Ontology%20-%20by%20Paolo%20Ciccarese.png)
+
+As an example: If a Web-Crawler crawls a site on the Web and stores a local
copy for indexing, then the **Annotated Document** would use the URL of the
document on the Web. The **Source Document** would be the ID of the locally
cached version used for the enhancement process.
+
+#### Content Adapter and Source Documents:
+
+The Content Adapter pattern was suggested as a way to convert parsed
documents to different Content Formats, such as extracting the Plain Text of
parsed HTML or PDF documents.
+
+The possibility to distinguish between the *Annotated Document* and the
*Source Document* nicely supports this: while Enhancement Engines can
state that an Annotation is about the *Annotated Document*, they can still state
the exact *Source Document* that was used for processing. This makes it
possible, for example, to state clearly that the indexes of a text selection
are based on the plain text version of the *Annotated Document*.
+
+### Content Selectors
+
+The FISE Enhancement Structure defined a single "Content Selector", the *FISE
Text Annotation*. The Annotation-Ontology uses a much richer structure that
can also be extended to define specific selections for different content
types.
+
+With the Annotation-Ontology each Selector can link to both the *Annotated
Document* and the *Source Document*. The following figure shows an example of
an Image Selection:
+
+> 
+
+> Image Credits: Annotation-Ontology
[Link](http://annotation-ontology.googlecode.com/svn/trunk/images/Image%20InitEndCorner%20Selector%20-%20AO%20Annotation%20Ontology%20-%20by%20Paolo%20Ciccarese.png).
+
+#### Text Selectors
+
+The "PrefixPostfixSelector" as defined by the Annotation-Ontology differs
from the currently used FISE Text Annotation. It does not define character
indexes, and it uses a prefix and a postfix instead of the surrounding context.
+
+For backward compatibility, the suggestion is to adopt the
"PrefixPostfixSelector" but keep the start and end positions of the current
Text Annotation. The prefix/postfix model of the "PrefixPostfixSelector" is
definitely better than the context used by the FISE Text Annotation, because it
makes it possible to clearly identify the selected text even if it occurs
several times in a given context.
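+
+As a minimal sketch of why the prefix/postfix model disambiguates repeated
text, the following Python snippet builds a selector that keeps the start and
end offsets alongside prefix and postfix. The `TextSelector` class, its field
names and the helper functions are hypothetical; they are not part of the
Annotation-Ontology or of Stanbol.
+
```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TextSelector:
    """Hypothetical stand-in for a PrefixPostfixSelector that keeps offsets."""
    prefix: str   # text immediately before the selection
    exact: str    # the selected text itself
    postfix: str  # text immediately after the selection
    start: int    # character offset of the selection start
    end: int      # character offset just past the selection end


def make_selector(text: str, start: int, end: int, window: int = 10) -> TextSelector:
    """Build a selector for text[start:end] with up to `window` chars of context."""
    return TextSelector(
        prefix=text[max(0, start - window):start],
        exact=text[start:end],
        postfix=text[end:end + window],
        start=start,
        end=end,
    )


def find_matches(text: str, selector: TextSelector) -> List[int]:
    """Return every offset at which prefix + exact + postfix places the selection."""
    needle = selector.prefix + selector.exact + selector.postfix
    matches = []
    pos = text.find(needle)
    while pos != -1:
        matches.append(pos + len(selector.prefix))
        pos = text.find(needle, pos + 1)
    return matches


text = "Paris Hilton visited Paris last week."
second = text.index("Paris", 1)  # offset of the second occurrence
selector = make_selector(text, second, second + len("Paris"))
print(find_matches(text, selector))
```
+
+In the sample text the word "Paris" occurs twice, yet the combination of
prefix, selected text and postfix matches only the second occurrence, at the
same offset that is stored in `start`.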
+
+#### Multi Media Selectors and the Media Fragments Standard
+
+The [Media Fragments Working
Group](http://www.w3.org/2008/WebVideo/Fragments/) of the W3C is currently
working on a Recommendation on how to encode Fragments of Resources within
so-called [Media Fragments
URIs](http://www.w3.org/2008/WebVideo/Fragments/WD-media-fragments-spec/).
+
+This specification defines how to encode the
[Temporal](http://www.w3.org/2008/WebVideo/Fragments/WD-media-fragments-spec/#naming-time),
[Spatial](http://www.w3.org/2008/WebVideo/Fragments/WD-media-fragments-spec/#naming-space),
[Track](http://www.w3.org/2008/WebVideo/Fragments/WD-media-fragments-spec/#naming-track)
and
[ID](http://www.w3.org/2008/WebVideo/Fragments/WD-media-fragments-spec/#naming-id)
dimensions within Document URIs but also defines processing rules (e.g. for
Browsers) and the semantics.
+
+The proposal here is to use this specification for encoding selections within
multi media files within the Annotation-Ontology. This will most likely require
the definition of a MediaFragmentSelector as an extension.
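+
+To illustrate what such a selector could carry, here is a small Python sketch
that assembles Media Fragments URIs for the temporal, spatial and track
dimensions. It assumes the `t=`, `xywh=` and `track=` fragment syntax of the
Media Fragments specification; the `media_fragment_uri` helper and the example
URL are hypothetical.
+
```python
from typing import Optional, Tuple
from urllib.parse import quote


def media_fragment_uri(
    base: str,
    time: Optional[Tuple[float, float]] = None,
    region: Optional[Tuple[int, int, int, int]] = None,
    track: Optional[str] = None,
) -> str:
    """Append Media Fragments dimensions to `base` (hypothetical helper)."""
    parts = []
    if time is not None:
        start, end = time
        # Temporal dimension: start and end in seconds, e.g. t=10,20
        parts.append(f"t={start:g},{end:g}")
    if region is not None:
        x, y, width, height = region
        # Spatial dimension: pixel rectangle, e.g. xywh=160,120,320,240
        parts.append(f"xywh={x},{y},{width},{height}")
    if track is not None:
        # Track dimension: the track name is percent-encoded
        parts.append(f"track={quote(track, safe='')}")
    if not parts:
        return base
    # Several dimensions are joined with "&" inside one fragment
    return base + "#" + "&".join(parts)


video = "http://example.org/video.ogv"  # made-up URL
clip = media_fragment_uri(video, time=(10, 20))
crop = media_fragment_uri(video, time=(10, 20), region=(160, 120, 320, 240))
print(clip)
print(crop)
```
+
+A MediaFragmentSelector could then reference such a URI, or store the
individual dimensions as separate properties; which of the two fits better is
left open here.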
+
+### Annotations
+
+The FISE Enhancement Structure uses properties of both FISE Enhancements and
FISE TextAnnotation/EntityAnnotation to describe what the Annotation-Ontology
defines as Annotations. On the other hand, some properties of the FISE
TextAnnotation belong to the Selectors within the Annotation-Ontology.
Because of that, the switch to the Annotation-Ontology will not only mean a
change in the vocabulary used, but will also bring some structural changes.
+
+Annotations as defined by the Annotation-Ontology are structured as follows:
+
+* An Annotation is represented by a Resource (called Annotation-Resource in
the remaining document) with the rdf:type ao:Annotation. Special types of
Annotations can be introduced by subclasses of ao:Annotation.
+* The Annotation-Resource may be linked to a Selector with the **ao:context**
property. If no such link is present the Annotation-Resource is about the whole
Document. It is also possible to link multiple Selectors with an annotation.
+* Each Annotation-Resource MUST be linked to the *Annotated Document* by using
the **ao:annotatesResource** property. The *Source Document* can be referenced
by using the **ao:onSourceDocument** property. It is also possible to link
multiple Documents with an annotation.
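+
+The structure described above can be sketched with plain Python tuples
standing in for RDF triples. Only `ao:Annotation`, `ao:context`,
`ao:annotatesResource` and `ao:onSourceDocument` are taken from the text; the
helper functions and all URIs are made up for illustration, and prefixed names
are kept as plain strings instead of being expanded to full namespace URIs.
+
```python
from typing import Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings


def build_annotation(
    annotation: str,
    annotated_doc: str,
    source_doc: Optional[str] = None,
    selectors: Iterable[str] = (),
) -> Set[Triple]:
    """Return the triples for one Annotation-Resource."""
    triples = {
        (annotation, "rdf:type", "ao:Annotation"),
        # Mandatory link to the Annotated Document
        (annotation, "ao:annotatesResource", annotated_doc),
    }
    if source_doc is not None:
        # Optional link to the document version that was actually processed
        triples.add((annotation, "ao:onSourceDocument", source_doc))
    for selector in selectors:
        # Zero, one or several Selectors
        triples.add((annotation, "ao:context", selector))
    return triples


def objects(graph: Set[Triple], subject: str, predicate: str) -> List[str]:
    """Return all objects of (subject, predicate, ?) in sorted order."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def is_about_whole_document(graph: Set[Triple], annotation: str) -> bool:
    """An annotation without ao:context applies to the whole document."""
    return not objects(graph, annotation, "ao:context")


# All URIs below are made up for illustration.
web_page = "http://example.org/page.html"
cached_text = "urn:example:cache:page.txt"

graph = build_annotation("urn:example:annotation:1", web_page, cached_text)
graph |= build_annotation(
    "urn:example:annotation:2", web_page, cached_text,
    selectors=["urn:example:selector:1"],
)
print(is_about_whole_document(graph, "urn:example:annotation:1"))
print(is_about_whole_document(graph, "urn:example:annotation:2"))
```
+
+The first annotation has no Selector and therefore describes the whole
document; the second one is restricted to the part of the document identified
by its Selector.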
+
+The following sub-sections provide an overview of how Text Annotations,
Entity Annotations and Category Annotations as used by Stanbol can be expressed
using the Annotation-Ontology.
+
+#### Text Annotations
+
+Text Annotations are Annotations as typically created by NER (Named Entity
Recognition) engines. Such Annotations select a part of a Text and assign a
type (Person, Organization, Place ...) to it.
+
+The text selection can be expressed by using a "PrefixPostfixSelector". The
type and the confidence of the detected named entity need to be properties of
the Annotation class.
+
+#### Entity Annotations
+
+Entity Annotations are similar to "Qualifier" annotations as defined by the
Annotation-Ontology. The *ao:hasTopic* relation is used to link the annotation
with the related topic.
+
+#### Category Annotations
+
+Category Annotations are typically about the whole or a specific section of
a Document. Normal Selectors can be used for defining the categorized Section.
If no Selector is present the categorization applies to the whole document. The
"Qualifier" annotation could also be used as a base class for categorizations.
+
+### Annotation Sets
+
+Within the Annotation-Ontology, Annotation Sets can be used to group several
Annotations together. Although the FISE Enhancement Structure does not
explicitly define a similar construct, the Stanbol Enhancer uses relations
between FISE Enhancements for a similar purpose. Therefore the suggestion is
to use this feature of the Annotation-Ontology to express sets of possible
Categories or sets of Entity suggestions.
+
+The following figure shows an Example of an Annotation Set with a single
Annotation:
+
+> 
+
+> Image Credits: Annotation-Ontology
[Link](http://annotation-ontology.googlecode.com/svn/trunk/images/Annotation%20Set%20-%20AO%20Annotation%20Ontology%20-%20by%20Paolo%20Ciccarese.png)
+
+This suggests the use of Annotation Sets to formally describe situations where
the Stanbol Enhancer needs to group several Annotations in order to let users
select from a predefined set of options. Assigning a unique ID (the URI of the
AnnotationSet instance) to such a collection of Annotations also gives the
consumer the possibility to provide explicit feedback to the Stanbol Enhancer
(e.g. by accepting/rejecting Annotations that are part of the AnnotationSet,
adding an additional Annotation to a set, ...).
+
+Note that a single Annotation might be part of several Annotation Sets. As an
Example take a Text Annotation for which two sets of Entity suggestions are
generated.
+
+The suggestion is to create subclasses for common types of Annotation Sets
used by the Stanbol Enhancer.
+
+#### Entity Suggestions
+
+With the FISE Enhancement Structure this is expressed by a
*fise:TextAnnotation* that is linked to several *fise:EntityAnnotation*s by the
*dc:relation* property.
+
+The same can be expressed with the Annotation-Ontology as follows:
+
+* An Annotation Set that links to the following Annotations (by the *ao:item*
property):
+    * A TextAnnotation including the PrefixPostfixSelector defining the
actual position of the selected text within the document
+    * One EntityAnnotation (extends ao:Qualifier) per suggested Entity.
+* In addition the Annotation Set also includes metadata such as the Engine
that created the suggestions
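+
+A minimal sketch of such a set, again using plain Python tuples in place of
RDF triples. `ao:AnnotationSet`, `ao:item`, `ao:context` and `ao:hasTopic` come
from the text above; the `ex:` class names, the use of `dc:creator` for the
Engine and all URIs are assumptions made only for this example.
+
```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings

# All URIs below are made up for illustration.
suggestion_set = "urn:example:annotation-set:1"
text_annotation = "urn:example:text-annotation:1"
suggestions = {
    "urn:example:entity-annotation:1": "http://example.org/entity/Paris",
    "urn:example:entity-annotation:2": "http://example.org/entity/Paris_Texas",
}

graph: Set[Triple] = {
    (suggestion_set, "rdf:type", "ao:AnnotationSet"),
    # Assumption: dc:creator records the Engine that created the suggestions
    (suggestion_set, "dc:creator", "urn:example:engine:entity-linking"),
    (suggestion_set, "ao:item", text_annotation),
    (text_annotation, "rdf:type", "ex:TextAnnotation"),
    (text_annotation, "ao:context", "urn:example:selector:1"),
}
for entity_annotation, entity in suggestions.items():
    graph.add((suggestion_set, "ao:item", entity_annotation))
    graph.add((entity_annotation, "rdf:type", "ex:EntityAnnotation"))
    graph.add((entity_annotation, "ao:hasTopic", entity))


def suggested_entities(graph: Set[Triple], annotation_set: str) -> List[str]:
    """Return the topics of all Annotations that are items of the given set."""
    items = {o for s, p, o in graph if s == annotation_set and p == "ao:item"}
    return sorted(o for s, p, o in graph if s in items and p == "ao:hasTopic")


print(suggested_entities(graph, suggestion_set))
```
+
+Because the set has its own URI, a consumer could refer to it when accepting
or rejecting one of the listed suggestions.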
+
+#### Category Suggestions
+
+Typically a categorization provides more than a single Category. Grouping
such suggestions within an AnnotationSet gives Users the possibility to
accept/reject one or more of them. In addition it would make it possible
to distinguish sets of categorizations calculated based on disjoint sets of
categories (e.g. a categorization based on a UserProfile versus a
categorization based on general topics or a spatial categorization).
+
+