Author: rwesten
Date: Thu Jun 14 05:46:11 2012
New Revision: 1350093
URL: http://svn.apache.org/viewvc?rev=1350093&view=rev
Log:
first version of EntityTagging
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancementstructure.mdtext
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancementstructure.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancementstructure.mdtext?rev=1350093&r1=1350092&r2=1350093&view=diff
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancementstructure.mdtext
(original)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancementstructure.mdtext
Thu Jun 14 05:46:11 2012
@@ -137,7 +137,55 @@ TopicAnnotation are used to categorize/c
## Entity Tagging
-TODO: Work in progress
+Entity Tagging is about suggesting Entities instead of Strings to users for
tagging their Documents. The difference is easy to explain. Let's assume a
Blogger uses the tag "Bob Marley" to tag a blog entry. Tagging is all about
structuring content, so by tagging it with "Bob Marley" he can now easily find
all Documents that use that tag. However, most likely he would also want to
create a category of Documents about Reggae music, and most likely he would
want Documents tagged with "Bob Marley" to be part of that group.
+
+But while the knowledge that "Bob Marley" is related to "Reggae music" might
be obvious to the Blogger, it cannot be known by the blogging tool he uses. So
typically the only way to achieve this is for the Blogger to tag the document
with both tags.
+
+Entity Tagging tries to work around that by linking Documents with Entities
defined by a knowledge base. The fact that Bob Marley is related to Reggae
music is nothing novel. [DBpedia](http://dbpedia.org), the Wikipedia database,
knows that and a lot more about the Entity
[dbpedia:Bob_Marley](http://dbpedia.org/resource/Bob_Marley). So if the blogger
tags his Document with "dbpedia:Bob_Marley" he does not only tag it with "Bob
Marley" but also with all the other contextual information provided by DBpedia,
including the fact that Bob_Marley was a Reggae artist.
+
+But this does not only work with famous people, big cities, etc. Nowadays the
Web [links data](http://linked-data.org) of many different domains. However,
this is not only about the Web: it works even better if you can also use
Entities relevant to yourself and/or your working environment (Products, CRM
information, …).
+
+### Suggest Entities with the Stanbol Enhancer
+
+Requesting the Stanbol Enhancer to analyze a text requires sending a POST
request to the [RESTful API](enhancerrest.html) of the Stanbol Enhancer.
+
+    curl -X POST -H "Accept: application/rdf+xml" -H "Content-type: text/plain" \
+        --data "The Stanbol enhancer can detect famous cities such as Paris and people such as Bob Marley." \
+        http://{host}:{port}/enhancer
+
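+The same request can also be prepared from a script. The following is a
minimal sketch using only the Python standard library; the function name
`build_enhancer_request` and the base URL `http://localhost:8080` are
assumptions made for illustration and are not part of Stanbol. The request is
only built here, not sent.
+
```python
import urllib.request


def build_enhancer_request(base_url, text):
    """Build (but do not send) a POST request for the Stanbol Enhancer.

    base_url is the assumed http://{host}:{port} of a Stanbol instance.
    """
    return urllib.request.Request(
        base_url.rstrip("/") + "/enhancer",
        data=text.encode("utf-8"),
        headers={
            # Serialization of the returned enhancement results
            "Accept": "application/rdf+xml",
            # The content to analyze is sent as plain text
            "Content-type": "text/plain",
        },
        method="POST",
    )


request = build_enhancer_request(
    "http://localhost:8080",  # assumed host and port
    "The Stanbol enhancer can detect famous cities such as "
    "Paris and people such as Bob Marley.",
)

# Sending the request needs a running Stanbol instance:
# with urllib.request.urlopen(request) as response:
#     rdf_xml = response.read().decode("utf-8")
```
+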
+In response you will receive the enhancement results formatted as an RDF graph
in the serialization specified by the "Accept" header ('application/rdf+xml' in
the example request above). This RDF graph contains the information about the
Entities extracted from the submitted content.
+
+The following Figure shows how extracted entities are described in the
enhancement results.
+
+
+In principle there are two Resources that are of interest for the Entity
tagging use case:
+
+1. EntityAnnotations: Resources with the 'rdf:type' 'fise:EntityAnnotation'
represent the entity suggestions made by the Stanbol Enhancer. These resources
provide the label, the type and, most importantly, the URI of the extracted
Entity. In addition, the value of 'fise:confidence' [0..1] indicates how
certain the Stanbol Enhancer is about this Entity.
+2. Entities: This refers to all resources with an incoming
'fise:entity-reference' relation (such as 'dbpedia:Bob_Marley' in the above
example). Enhancement Engines can be configured to "dereference" suggested
entities, meaning to use the URI of the entity to retrieve additional
information. In this case additional information about suggested Entities will
be available in the enhancement results. If this is not the case, users will
need to dereference suggested entities themselves.
+
+The following steps are typically needed to acquire the information needed to
implement an entity tagging user interface:
+
+1. Iterate over all suggested Entities: These are all resources matching
"{entity-annotation} rdf:type fise:EntityAnnotation"
+2. Basic information: This is available directly via the {entity-annotation}
to ensure its availability even if the {entity} itself is not included
(dereferenced) in the enhancement results.
+ * URI of the suggested Entity: {entity-annotation} fise:entity-reference
{entity}
+ * Label: The value of fise:entity-label is typically the label by which the
Entity was recognized in the analyzed content. Additional labels are typically
available via the {entity}
+ * Types: The values of the fise:entity-type property of the
{entity-annotation} are the same as the rdf:type values of the {entity}.
+ * Confidence: The 'fise:confidence' value represents how confident the
Stanbol Enhancer is about this suggestion. Values are in the range [0..1],
where 0 means very uncertain and 1 represents high certainty.
+3. Dereferenced {entity}: Some EnhancementEngines can also add information
about suggested Entities to the enhancement results, in other words: they
dereference suggested entities. In this case additional information about the
{entity} can be retrieved directly from the enhancement results. Most
importantly, this information includes all available labels (in all languages)
of the Entity.
+4. Dereferencing suggested Entities: If the suggested Entity is available via
the Stanbol Entityhub, the {entity-annotation} has the 'entityhub:site'
property. The value of this property is the name of the ReferencedSite of the
Entityhub. To dereference the Entity, send a GET request to
"{stanbol-root-URL}/entityhub/site/{site-name}/entity?id={entity}". The
"Accept" header of the request needs to be set to the desired RDF
serialization (e.g. "application/rdf+json").
+
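+The steps above can be sketched in a few lines of Python. This is an
illustration under stated assumptions, not Stanbol client code: the enhancement
results are assumed to be already parsed into (subject, predicate, object)
tuples that use prefixed names such as 'fise:entity-reference', and the helper
names `collect_suggestions` and `dereference_url` as well as the sample data
are hypothetical.
+
```python
from urllib.parse import quote


def first(props, key, default=None):
    """Return the first value of a property, or a default if it is missing."""
    values = props.get(key)
    return values[0] if values else default


def collect_suggestions(triples, annotation_type="fise:EntityAnnotation"):
    """Steps 1 and 2: gather the basic information of each suggestion."""
    by_subject = {}
    for subject, predicate, obj in triples:
        by_subject.setdefault(subject, {}).setdefault(predicate, []).append(obj)

    suggestions = []
    for subject, props in by_subject.items():
        if annotation_type not in props.get("rdf:type", []):
            continue
        suggestions.append({
            "annotation": subject,
            "entity": first(props, "fise:entity-reference"),
            "label": first(props, "fise:entity-label"),
            "types": props.get("fise:entity-type", []),
            "confidence": float(first(props, "fise:confidence", 0.0)),
            "site": first(props, "entityhub:site"),
        })
    # Most confident suggestions first
    suggestions.sort(key=lambda s: s["confidence"], reverse=True)
    return suggestions


def dereference_url(stanbol_root, site, entity_uri):
    """Step 4: build the Entityhub URL used to dereference an Entity."""
    return "{}/entityhub/site/{}/entity?id={}".format(
        stanbol_root.rstrip("/"), quote(site, safe=""), quote(entity_uri, safe="")
    )


# Made-up sample data in the assumed tuple form
triples = [
    ("urn:enhancement-1", "rdf:type", "fise:EntityAnnotation"),
    ("urn:enhancement-1", "fise:entity-reference",
     "http://dbpedia.org/resource/Bob_Marley"),
    ("urn:enhancement-1", "fise:entity-label", "Bob Marley"),
    ("urn:enhancement-1", "fise:entity-type", "dbpedia-owl:Person"),
    ("urn:enhancement-1", "fise:confidence", "0.9"),
    ("urn:enhancement-1", "entityhub:site", "dbpedia"),
]

suggestions = collect_suggestions(triples)
url = dereference_url(
    "http://localhost:8080",  # assumed {stanbol-root-URL}
    suggestions[0]["site"],
    suggestions[0]["entity"],
)
```
+
+A GET request to the resulting URL, with the "Accept" header set as described
in step 4, would then retrieve the Entity if it is not already part of the
enhancement results.
+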
+### Content Categorizations
+
+'fise:TopicAnnotation' instances are used to formally represent categories
assigned to the submitted Content. The main difference between extracted
Entities and assigned Categories is that extracted Entities have one or more
explicit mentions within the text, while assigned Categories are suggested
based on the document as a whole; typically they are not explicitly mentioned
in the text.
+
+Typically an entity tagging UI will want to distinguish between Categories and
Entities because:
+
+* Categories are used to group Content (e.g. Blog posts about Work and private
things)
+* Entities are used to search/suggest Blog posts about specific topics (e.g. a
blog post about some feature implemented with "Apache Solr", or a nice event at
the "Sternbräu" in "Salzburg")
+
+The usage of 'fise:TopicAnnotation' is similar to EntityAnnotation. Both use
the exact same properties
('fise:entity-reference', 'fise:entity-label', 'fise:entity-type',
'fise:confidence', 'entityhub:site'). The only difference is that one needs to
iterate over '{topic-annotation} rdf:type fise:TopicAnnotation'. So typically
clients will want to use the exact same code to process {entity-annotation} and
{topic-annotation} instances.
+
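+As a small illustration of reusing the same code for both annotation types,
the sketch below processes Entities and Categories with one function that only
differs in the 'rdf:type' it looks for. As an assumption, the enhancement
results are represented as (subject, predicate, object) tuples with prefixed
names; the function name `annotations_of_type` and the sample data are made up
for this example.
+
```python
def annotations_of_type(triples, rdf_type):
    """Return (label, entity URI, confidence) for each annotation of rdf_type."""
    props = {}
    for subject, predicate, obj in triples:
        props.setdefault(subject, {}).setdefault(predicate, []).append(obj)

    result = []
    for values in props.values():
        if rdf_type in values.get("rdf:type", []):
            result.append((
                values.get("fise:entity-label", [None])[0],
                values.get("fise:entity-reference", [None])[0],
                float(values.get("fise:confidence", ["0"])[0]),
            ))
    # Most confident suggestions first
    return sorted(result, key=lambda item: item[2], reverse=True)


# Made-up sample data: one extracted Entity and one assigned Category
triples = [
    ("urn:a1", "rdf:type", "fise:EntityAnnotation"),
    ("urn:a1", "fise:entity-label", "Apache Solr"),
    ("urn:a1", "fise:entity-reference", "http://dbpedia.org/resource/Apache_Solr"),
    ("urn:a1", "fise:confidence", "0.8"),
    ("urn:t1", "rdf:type", "fise:TopicAnnotation"),
    ("urn:t1", "fise:entity-label", "Work"),
    ("urn:t1", "fise:entity-reference", "urn:example:category:work"),
    ("urn:t1", "fise:confidence", "0.6"),
]

entities = annotations_of_type(triples, "fise:EntityAnnotation")
categories = annotations_of_type(triples, "fise:TopicAnnotation")
```
+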
+The next section, "Entity Disambiguation", describes an improved version of
Entity Tagging that allows users to (1) accept/decline a spotted Entity and
then (2) select one of several suggested Entities.
## Entity Disambiguation