Author: rwesten
Date: Mon Mar  5 13:28:39 2012
New Revision: 1297047

URL: http://svn.apache.org/viewvc?rev=1297047&view=rev
Log:
Documentation for the TikaEngine

Added:
    
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.mdtext
Modified:
    
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/list.mdtext

Modified: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/list.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/list.mdtext?rev=1297047&r1=1297046&r2=1297047&view=diff
==============================================================================
--- 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/list.mdtext
 (original)
+++ 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/list.mdtext
 Mon Mar  5 13:28:39 2012
@@ -4,54 +4,57 @@ This provides an overview about all [Enh
 
 ## Preprocessing
 
-- __[Language Identification Engine](langidengine.html)__
-       - language detection for textual content utilizing [Apache 
Tika](http://tika.apache.org/)
+* __[Language Identification Engine](langidengine.html)__
+       * language detection for textual content utilizing [Apache 
Tika](http://tika.apache.org/)
        
-
-- __[Metaxa Engine](metaxaengine.html)__
-       - text extraction from various document formats
-       - extraction of metadata from document formats
-       -
+* __[Tika Engine](tikaengine.html)__ (based on [Apache 
Tika](http://tika.apache.org/))
+       * content type detection
+       * text extraction from various document formats
+       * extraction of metadata from document formats
+
+* __[Metaxa Engine](metaxaengine.html)__
+       * text extraction from various document formats
+       * extraction of metadata from document formats
        
 ## Natural Language Processing
 
-- __[Named Entity Extraction Enhancement 
Engine](namedentityextractionengine.html)__ 
-       - NLP processing using OpenNLP NER
-       - detects occurrences of persons, places and organizations only
+* __[Named Entity Extraction Enhancement 
Engine](namedentityextractionengine.html)__ 
+       * NLP processing using OpenNLP NER
+       * detects occurrences of persons, places and organizations only
        
        
-- __[KeywordLinkingEngine](keywordlinkingengine.html)__
-       - NLP processing using OpenNLP
-       - supports multiple languages
-       - detects occurrences of untyped entities as concepts, takes local 
taxonomies as linking target
+* __[KeywordLinkingEngine](keywordlinkingengine.html)__
+       * NLP processing using OpenNLP
+       * supports multiple languages
+       * detects occurrences of untyped entities as concepts, takes local 
taxonomies as linking target
 
        
-- _Taxonomy Linking Engine_ (deprecated, see KeywordLinkingEngine)
-       - NLP processing using OpenNLP POS
-       - detect occurrences of untyped entities as concepts, takes local 
taxonomies as linking target
+* _Taxonomy Linking Engine_ (deprecated, see KeywordLinkingEngine)
+       * NLP processing using OpenNLP POS
+       * detect occurrences of untyped entities as concepts, takes local 
taxonomies as linking target
        
 
 ## Linking Suggestions
 
-- __[Named Entity Tagging Engine](namedentitytaggingengine.html)__
-       - suggest links to several Linked Data Sources (e.g. DBpedia)
+* __[Named Entity Tagging Engine](namedentitytaggingengine.html)__
+       * suggest links to several Linked Data Sources (e.g. DBpedia)
 
-- __[Geonames Enhancement Engine](geonamesengine.html)__ 
-       - suggests links to geonames.org
-       - provides hierarchical links for locations
+* __[Geonames Enhancement Engine](geonamesengine.html)__ 
+       * suggests links to geonames.org
+       * provides hierarchical links for locations
 
-- __[OpenCalais Enhancement Engine](opencalaisengine.html)__
-       - integrates service from Open Calais. (Note: You need to provide a key 
in order to use this engine)
+* __[OpenCalais Enhancement Engine](opencalaisengine.html)__
+       * integrates service from Open Calais. (Note: You need to provide a key 
in order to use this engine)
 
-- __[Zemanta Enhancement Engine](zemantaengine.html)__
-       - integrates the Zemanta services. (Note: You need to provide a key in 
order to use this engine)
+* __[Zemanta Enhancement Engine](zemantaengine.html)__
+       * integrates the Zemanta services. (Note: You need to provide a key in 
order to use this engine)
 
 
 
 ## Postprocessing / Other
 
-- _CachingDereferencerEngine_ (deprecated, see dereferencing support of 
individual engines as well as  
[STANBOL-336](https://issues.apache.org/jira/browse/STANBOL-336))
-       - retrieves additional content for presenting the enhancement results.
+* _CachingDereferencerEngine_ (deprecated, see dereferencing support of 
individual engines as well as  
[STANBOL-336](https://issues.apache.org/jira/browse/STANBOL-336))
+       * retrieves additional content for presenting the enhancement results.
        
-- __[Refactor Engine](refactorengine.html)__
-               - transforms enhancements according to a target ontology, 
requires KRES launcher.
+* __[Refactor Engine](refactorengine.html)__
+       * transforms enhancements according to a target ontology, requires KRES 
launcher.

Added: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.mdtext?rev=1297047&view=auto
==============================================================================
--- 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.mdtext
 (added)
+++ 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.mdtext
 Mon Mar  5 13:28:39 2012
@@ -0,0 +1,67 @@
+Title: Tika Engine
+
+
+The Tika Engine is an Apache Stanbol Enhancement Engine based on Apache Tika. 
It has three main functions:
+
+1. To detect the content type of parsed content. This is only performed if no 
content type is parsed or the content type is set to 
"application/octet-stream". The detected content type is added to the metadata 
of the ContentItem.
+2. To extract the plain text (and XHTML) from parsed content and add it to the 
[ContentItem](../contentitem.html) as content parts with the type Blob.
+3. To extract metadata from the parsed content and add it to the metadata of 
the [ContentItem](../contentitem.html).
+
+
+## Supported Media Types
+
+As this engine uses Apache Tika the supported media types are the same as 
stated on the [Tika Homepage](http://tika.apache.org/1.0/formats.html).
+
+## Extracted Metadata
+
+Tika provides metadata as 'key:values' pairs. To use them efficiently within 
Stanbol they need to be converted to valid RDF and aligned with existing 
Ontologies.
+
+The TikaEngine supports alignments to several different Ontologies. Such 
alignment rules can be activated/deactivated within the configuration of the 
TikaEngine.
+
+Supported Ontologies:
+
+* [Ontology for Media Resources](http://www.w3.org/TR/mediaont-10/): This is 
the most complete mapping to a single Ontology. It includes mappings for all 
Dublin Core metadata, geo locations, some image-specific data and most of the 
audio and video related metadata.
+
+* [DC terms](http://dublincore.org/documents/dcmi-terms/): Provides good 
mappings for text documents (HTML, Office, OpenOffice, PDF ...)
+
+* [Nepomuk EXIF 
ontology](http://www.semanticdesktop.org/ontologies/2007/05/10/nexif/): 
Interesting for users that want to work with EXIF metadata extracted from 
images.
+
+* [Nepomuk Message 
Ontology](http://www.semanticdesktop.org/ontologies/2007/03/22/nmo/): Used for 
sender and receiver information of mail messages.
+
+* SKOS: Allows mapping of labels and notes to 
[SKOS](http://www.w3.org/2009/08/skos-reference/skos.html). This is deactivated 
by default.
+
+* RDFS: Allows mapping of labels and comments to "rdfs:label" and "rdfs:comment".
+
+### ContentType:
+
+The detected content type for the parsed contentItem is added by using the 
following two properties:
+
+* 'http://purl.org/dc/terms/format': Dublin Core terms 'format'
+* 'http://www.w3.org/ns/ma-ont#hasFormat': Media Resource Ontology 'hasFormat'
+
+Note that these properties will only be present if the related Ontology is 
activated in the TikaEngine configuration.
+
+
+## Sending Requests directly to the Tika Engine
+
+The Stanbol Enhancer allows sending enhancement requests directly to a specific 
EnhancementEngine. This feature can be used in combination with the Tika Engine 
to request
+
+1. the "text/plain" or "application/xhtml+xml" version of parsed content
+2. the extracted metadata as RDF aligned to the activated Ontologies
+
+The first example requests the plain text version of a PDF file with the name 
"test.pdf". Note that
+
+* the 'Accept' header is set to the content type of the requested content and
+* 'omitMetadata=true' tells the Enhancer not to return the RDF metadata.
+
+    :::bash
+    curl -v -X POST -H "Accept: text/plain" -T test.pdf \
+        "http://localhost:8080/enhancer/engine/tika?omitMetadata=true"
+
+The second example returns the RDF metadata extracted from the parsed 
"song.mp3":
+
+    :::bash
+    curl -v -X POST -H "Accept: application/rdf+xml" -T song.mp3 \
+        "http://localhost:8080/enhancer/engine/tika"
+
+
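
The new page mentions that the "application/xhtml+xml" version of parsed content can be requested as well, but it only shows examples for plain text and RDF. A minimal sketch of such a request follows, assuming the same endpoint and 'omitMetadata' parameter as in the examples above. The helper names `tika_endpoint` and `tika_xhtml` and the `STANBOL_URL` variable are hypothetical and not part of Stanbol.

```bash
# Base URL of the Stanbol instance. localhost:8080 is taken from the examples
# in the page; override STANBOL_URL for other installations (assumption).
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"

# Hypothetical helper: print the Tika Engine endpoint.
# $1 = "true" appends omitMetadata=true so that only the content is returned.
tika_endpoint() {
    if [ "${1:-}" = "true" ]; then
        printf '%s\n' "${STANBOL_URL}/enhancer/engine/tika?omitMetadata=true"
    else
        printf '%s\n' "${STANBOL_URL}/enhancer/engine/tika"
    fi
}

# Hypothetical helper: request the XHTML version of the file given as $1.
# Mirrors the plain text example, with a different 'Accept' header.
tika_xhtml() {
    curl -s -X POST -H "Accept: application/xhtml+xml" -T "$1" \
        "$(tika_endpoint true)"
}
```

With a running Stanbol instance, `tika_xhtml test.pdf > test.xhtml` would store the XHTML version of "test.pdf", provided the engine is reachable under the assumed URL.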

