enhancerrest.mdtext

rwesten Fri, 17 Feb 2012 02:30:11 -0800

Author: rwesten
Date: Fri Feb 17 10:29:45 2012
New Revision: 1245375

URL: http://svn.apache.org/viewvc?rev=1245375&view=rev
Log:
Documentation for the enhancer RESTful API


Added:
    
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancerrest.mdtext

Added: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancerrest.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancerrest.mdtext?rev=1245375&view=auto
==============================================================================
--- 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancerrest.mdtext
 (added)
+++ 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancerrest.mdtext
 Fri Feb 17 10:29:45 2012
@@ -0,0 +1,305 @@
+title: Stanbol Enhancer RESTful API
+
+
+<p>The RESTful service endpoint provided by the Stanbol Enhancer is a 
stateless interface that allows the caller to submit content and receive the 
resulting enhancements formatted as RDF in a single request, without storing 
anything on the server side. More advanced options also allow passing 
pre-existing metadata, sending and requesting alternate content versions, and 
requesting additional metadata created by the Enhancer or by specific 
Enhancement Engines.</p>
+
+The RESTful interface described here is available on several endpoints:
+
+* __'/enhancer':__ The main endpoint of the Stanbol Enhancer. Content sent 
to this endpoint is enhanced using the default enhancement chain.
+* __'/enhancer/chain/{chain-name}':__ The Stanbol Enhancer supports the 
configuration of multiple [Enhancement Chains](chains). Users can look up the 
active chains by sending requests to the '/enhancer/chain' endpoint.
+* __'/engines':__ Same as '/enhancer'. This endpoint ensures backward 
compatibility with older Stanbol versions.
+
+## Enhancement Request
+
+This section describes how to send content to the Stanbol Enhancer for 
analysis. Results are sent back in the form of a serialized RDF graph.
+
+The content to analyze should be sent in a POST request with the mimetype 
specified in
+the <code>Content-type</code> header. The response will hold the RDF 
enhancement serialized in the format specified in the <code>Accept</code> 
header:
+   
+    :::bash
+    curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
+         --data "John Smith was born in London." ${it.serviceUrl}
+
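+The same request can also be built from Java. The following is only a minimal 
sketch, assuming Java 11+ (<code>java.net.http</code>) and a Stanbol instance 
at <code>http://localhost:8080/enhancer</code> in place of 
<code>${it.serviceUrl}</code>. The class and method names are hypothetical and 
not part of Stanbol. The sketch only constructs the request; sending it 
requires a running server.
+

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class EnhancerRequestSketch {

    /** Builds (but does not send) a POST request equivalent to the curl call above. */
    static HttpRequest buildEnhanceRequest(String serviceUrl, String text) {
        return HttpRequest.newBuilder(URI.create(serviceUrl))
                // RDF serialization requested for the enhancement results
                .header("Accept", "text/turtle")
                // MIME type of the content that is sent for analysis
                .header("Content-Type", "text/plain")
                .POST(HttpRequest.BodyPublishers.ofString(text, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        // Assumed service URL; replace with the URL of your own installation
        HttpRequest request = buildEnhanceRequest(
                "http://localhost:8080/enhancer", "John Smith was born in London.");
        System.out.println(request.method() + " " + request.uri());
        // To send it against a running server:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

+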
+The list of MIME types accepted as input depends on the deployed engines. By 
default, most Enhancement Engines can only process plain text content. However, 
Enhancement Engines such as [Metaxa](engines/metaxaengine.html) can be used to 
create 'text/plain' versions of the submitted content. This makes it possible 
to also enhance content with MIME types such as HTML, PDF and MS Office 
documents (see the Metaxa documentation for details).
+ 
+The Stanbol Enhancer can serialize the response in the following RDF 
formats:
+
+    :::text
+    application/json (JSON-LD)
+    application/rdf+xml (RDF/XML)
+    application/rdf+json (RDF/JSON)
+    text/turtle (Turtle)
+    text/rdf+nt (N-TRIPLES)
+
+### Additional supported QueryParameters:
+
+* __uri={content-item-uri}:__ By default, the URI of the content item being 
enhanced is a local, non-dereferenceable URI automatically built from a hash 
digest of the binary content. This parameter allows the caller to provide the 
URI of the [ContentItem](contentitem.html) to be used in the enhancement RDF 
graph instead.
+* __executionmetadata=true/false:__ Enables the inclusion of [execution 
metadata](executionmetadata.html) in the enhancement metadata of the response. 
This data also includes the [execution plan](chains/executionplan.html) used to 
enhance the submitted content. This information is typically only useful to 
clients that want to know how the content was processed by the Enhancer. 
NOTE that the execution metadata can also be requested by using the multi-part 
content item API described below.
+
+The following example shows how to send an enhancement request with a
+custom content item URI that will include the execution metadata in the
+response.
+
+    :::bash
+    curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
+        --data "John Smith was born in London." \
+        
"${it.serviceUrl}?uri=urn:fise-example-content-item&executionmetadata=true"
+
+
+## Multi-part ContentItem support
+
+The multi-part ContentItem extensions to the RESTful API (introduced by 
[STANBOL-481](https://issues.apache.org/jira/browse/STANBOL-481)) are 
considered an advanced usage of the Stanbol Enhancer. 
+
+Users will want to use these extensions if they need to:
+
+* send multiple versions of the content: Most CMSs already support 
converting content to plain text. This API allows sending both the original 
AND multiple transcoded versions of the content to the Enhancer.
+* send pre-existing metadata: Typically, a CMS already holds some metadata 
about the content sent to the Stanbol Enhancer (e.g. user-provided tags, 
categories, …). The multi-part extensions allow sending such data in 
addition to the content. 
+* request transcoded versions of the submitted content: These API extensions 
allow including transcoded versions (e.g. the 'text/plain' version) of the 
submitted content in the response. They also allow requests that directly 
return transcoded content by omitting the extracted metadata.
+* request additional metadata that are normally not included within the 
metadata of the Enhancement response: This can be used to request the 
[execution metadata](executionmetadata.html) in a separate RDF graph, but it 
can also be used to request metadata of specific enhancement engines (TODO: 
add example).
+
+
+### QueryParameters 
+
+The following QueryParameters are defined by the multi-part content item 
extension:
+
+* __outputContentType=[mediaType]:__ Specifies the MIME types of the 
content included in the response of the Stanbol Enhancer. This parameter 
supports wildcards (e.g. '*' ... all, 'text/*' ... all text versions, 
'text/plain' ... only the plain text version). This parameter can be used 
multiple times.
+
+    Responses to requests with this parameter will be encoded as 
<code>multipart/form-data</code>. If the "Accept" header of the request is not 
compatible with <code>multipart/form-data</code>, the request is treated as a 
<code>400 BAD_REQUEST</code>. For details see the documentation of the 
[Multipart MIME format for 
ContentItems](contentitem.html#multipart_mime_serialization).
+
+* __omitParsed=[true/false]:__ Only makes sense in combination with the 
<code>outputContentType</code> parameter. It excludes all content that was 
included in the request from the response. A typical combination is 
<code>outputContentType=*/*&omitParsed=true</code>. The default value of this 
parameter is <code>false</code>.
+
+* __outputContentPart=[uri/'*']:__ Explicitly includes content parts with a 
specific URI in the response. Currently this only supports 
[ContentParts](contentitem.html#content_parts) that are stored as RDF graphs. 
+
+    Responses to requests with this parameter will be encoded as 
<code>multipart/form-data</code>. If the "Accept" header of the request is not 
compatible with <code>multipart/form-data</code>, the request is treated as a 
<code>400 BAD_REQUEST</code>. The selected content parts will be included as 
MIME parts in the returned [Multipart MIME formatted 
ContentItem](contentitem.html#multipart_mime_serialization). The URI of the 
part will be used as its name. Such parts will be added after the "metadata" 
and the "content" parts (if present).
+
+* __omitMetadata=[true/false]:__ Enables or disables the inclusion of the 
metadata in the response. The default is <code>false</code>.
+
+    Typically <code>omitMetadata=true</code> is used when users want to use 
the Stanbol Enhancer just to get one or more ContentParts as a response. Note 
that requests that use an <code>Accept: {mimeType}</code> header AND 
<code>omitMetadata=true</code> will directly return the content version of 
<code>{mimeType}</code> and NOT wrap the result as 
<code>multipart/form-data</code>. See also the example further down in this 
documentation.
+
+* __rdfFormat=[rdfMimeType]:__ For requests that result in 
<code>multipart/form-data</code> encoded responses, this specifies the RDF 
serialization format to use. Supported formats and defaults are the same as 
for normal Enhancer requests. 
+
+### Sending multiple ContentParts
+
+Requests to the Stanbol Enhancer with the <code>Content-Type: 
multipart/form-data</code> header are considered to contain a ContentItem 
serialized as MultiPart MIME. The exact specification of the [MultiPart MIME 
format for ContentItems](contentitem.html#multipart_mime_serialization) is 
provided by the documentation of the ContentItem.
+
+Combining <code>multipart/form-data</code> encoded requests with the 
QueryParameters described above allows using the [MultiPart MIME 
format for ContentItems](contentitem.html#multipart_mime_serialization) for 
both request and response.
+
+
+### Example usages of the multi-part content item RESTful API extensions
+
+The following examples show some typical usages of the multi-part content item 
RESTful API. Note that, for better readability, the values of the query 
parameters are not URL-encoded.
+
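+In real requests the parameter values should be URL-encoded. This matters 
especially for values such as <code>application/rdf+xml</code>, because an 
unencoded <code>+</code> in a query string may be decoded as a space. The 
following is a minimal sketch, assuming Java 10+ and using only the JDK. The 
class <code>EnhancerQuerySketch</code> and the service URL 
<code>http://localhost:8080/enhancer</code> are assumptions made for 
illustration and are not part of Stanbol.
+

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class EnhancerQuerySketch {

    // Encoded name=value pairs in insertion order
    private final List<String> pairs = new ArrayList<>();

    /** Adds one name=value pair; call repeatedly for multi-valued parameters. */
    EnhancerQuerySketch add(String name, String value) {
        pairs.add(URLEncoder.encode(name, StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(value, StandardCharsets.UTF_8));
        return this;
    }

    /** Appends the encoded query string to the given service URL. */
    String appendTo(String serviceUrl) {
        return pairs.isEmpty() ? serviceUrl : serviceUrl + "?" + String.join("&", pairs);
    }

    public static void main(String[] args) {
        String url = new EnhancerQuerySketch()
                .add("outputContentType", "text/plain")
                .add("omitParsed", "true")
                .add("rdfFormat", "application/rdf+xml")
                .appendTo("http://localhost:8080/enhancer"); // assumed service URL
        // prints http://localhost:8080/enhancer?outputContentType=text%2Fplain&omitParsed=true&rdfFormat=application%2Frdf%2Bxml
        System.out.println(url);
    }
}
```

+Because <code>add</code> can be called several times with the same name, 
the sketch also covers parameters that may be used multiple times, such as 
<code>outputContentType</code>.
+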
+__Example 1: Return metadata and transformed content versions__
+    
+    :::bash
+    curl -v -X POST -H "Accept: multipart/form-data" \
+        -H "Content-type: text/html; charset=UTF-8"  \
+        --data "<html><body><p>John Smith was born in London.</p></body></html>" \
+        "${it.serviceUrl}?outputContentType=*/*&omitParsed=true&rdfFormat=application/rdf+xml"
+
+This will result in a response with the header <code>Content-Type: 
multipart/form-data; charset=UTF-8; boundary=contentItem</code> that contains 
the metadata as well as the plain text version of the submitted HTML document 
as content.
+
+    :::text
+    --contentItem
+    Content-Disposition: form-data; name="metadata"; 
filename="urn:content-item-sha1-76e44d4b51c626bbed38ce88370be88702de9341"
+    Content-Type: application/rdf+xml; charset=UTF-8;
+    Content-Transfer-Encoding: 8bit
+
+    <rdf:RDF
+        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+    [..the metadata formatted as RDF+XML..]
+    </rdf:RDF>
+
+    --contentItem
+    Content-Disposition: form-data; name="content"
+    Content-Type: multipart/alternate; boundary=contentParts; charset=UTF-8
+    Content-Transfer-Encoding: 8bit
+
+    --contentParts
+    Content-Disposition: form-data; 
name="urn:metaxa:plain-text:2daba9dc-21f6-7ea1-70dd-a2b0d5c6cd08"
+    Content-Type: text/plain; charset=UTF-8
+    Content-Transfer-Encoding: 8bit
+
+    John Smith was born in London.
+    --contentParts--
+
+    --contentItem--
+
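+To quickly check which parts such a response contains, the names can be read 
from the <code>Content-Disposition</code> headers. The following is a minimal 
sketch using only the JDK. It is not a full MIME parser (a MIME library is the 
better choice for real processing), and the class name 
<code>ContentItemPartsSketch</code> is hypothetical.
+

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentItemPartsSketch {

    // Matches the name parameter of a form-data Content-Disposition header;
    // \s* also tolerates a line break between "form-data;" and "name=".
    private static final Pattern PART_NAME = Pattern.compile(
            "Content-Disposition:\\s*form-data;\\s*name=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /** Returns the names of all parts (including nested ones) in document order. */
    static List<String> partNames(String multipartBody) {
        List<String> names = new ArrayList<>();
        Matcher m = PART_NAME.matcher(multipartBody);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Headers taken from the example response above; bodies are omitted
        String body = String.join("\r\n",
                "--contentItem",
                "Content-Disposition: form-data; name=\"metadata\"",
                "",
                "--contentItem",
                "Content-Disposition: form-data; name=\"content\"",
                "",
                "--contentParts",
                "Content-Disposition: form-data; "
                        + "name=\"urn:metaxa:plain-text:2daba9dc-21f6-7ea1-70dd-a2b0d5c6cd08\"",
                "",
                "--contentParts--",
                "--contentItem--");
        // prints [metadata, content, urn:metaxa:plain-text:2daba9dc-21f6-7ea1-70dd-a2b0d5c6cd08]
        System.out.println(partNames(body));
    }
}
```

+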
+__Example 2: Directly return the plain text version of submitted content__
+
+This shows how the Apache Stanbol Enhancer can be used to transcode submitted 
content.
+
+    :::bash
+    curl -v -X POST -H "Accept: text/plain" \
+        -H "Content-type: text/html; charset=UTF-8" \
+        --data "<html><body><p>John Smith was born in London.</p></body></html>" \
+        "${it.serviceUrl}?omitMetadata=true"
+
+The response will use <code>Content-Type: text/plain</code> and contain the 
string
+
+    :::text
+    John Smith was born in London.
+
+To make this work, the requested [Enhancement Chain](chains) needs to 
include an engine (e.g. [Metaxa](engines/metaxaengine.html)) that supports 
transcoding the submitted content. Note that, because the metadata are omitted 
from responses to such requests, it is also recommended to configure/use a 
chain that does no further processing on the transcoded content. 
+
+__Example 3: Send multiple content versions__
+
+This example uses the "httpmime" module of Apache HttpComponents 
to create the Multipart MIME content sent to the Stanbol Enhancer.
+
+    :::xml
+    <dependency>
+        <groupId>org.apache.httpcomponents</groupId>
+        <artifactId>httpmime</artifactId>
+        <version>4.1.2</version>
+    </dependency>
+
+The created Multipart MIME content MUST follow the specifications as defined 
by the [MultiPart MIME format for 
ContentItems](contentitem.html#multipart_mime_serialization).
+
+    :::java
+    InputStream wordIn; //The MS Word version of the Content
+    InputStream plainIn; //The plain text version of the Content
+    HttpClient httpClient; //The client used to execute the request
+    Charset UTF8 = Charset.forName("UTF-8"); //charset used for the containers
+
+    //create the multipart/form-data container for the ContentItem
+    //MultipartEntity also implements HttpEntity
+    MultipartEntity contentItem = new MultipartEntity(null, null, UTF8);
+    //The multipart/alternate container for the contents
+    HttpMultipart content = new HttpMultipart("alternate", UTF8, "contentParts");
+
+    //now add the container for the content to the content item container
+    //NOTE: MultipartContentBody is assumed here to be a custom ContentBody
+    //that wraps the HttpMultipart; it is not shipped with "httpmime"
+    contentItem.addPart(
+        "content", //the name MUST BE "content"!
+        new MultipartContentBody(content));
+    
+    //now add the MS word content at the first location
+    //this will make it the "original" content
+    content.addBodyPart(new FormBodyPart(
+        "http://www.example.com/example.docx", //the id of the content part
+        new InputStreamBody(
+            wordIn, 
+            "application/vnd.openxmlformats-officedocument.wordprocessingml.document", 
+            "example.docx")));
+
+    //now add the alternate plain text version
+    content.addBodyPart(new FormBodyPart(
+        "http://www.example.com/example.docx", //the id of the content part
+        new StringBody( //use a StringBody to avoid binary encoding for text
+            IOUtils.toString(plainIn), //apache commons IO utility
+            "text/plain",
+            Charset.forName("UTF-8"))));
+    
+    //now we are ready to create and execute the POST request to the
+    //Stanbol Enhancer
+    HttpPost request = new HttpPost("http://localhost:8080/enhancer");
+    request.setEntity(contentItem);
+    request.setHeader("Accept","application/rdf+xml");
+    HttpResponse response = httpClient.execute(request);
+
+
+Note that for such requests [Metaxa](engines/metaxaengine.html) will still try 
to extract metadata from the submitted MS Word document, but all other engines 
will process the plain text version provided with the request.
+
+__Example 4: Send existing free text annotations__
+
+This example shows how the multi-part content item API can be used to send 
already existing tags for a piece of content to the Stanbol Enhancer. For this 
example it is important to understand that the submitted metadata need to 
conform to the Stanbol Enhancement Structure. Because of that, this example 
consists of two main steps:
+
+1. Convert user tags to TextAnnotations
+2. Send existing Metadata along with the Content to the Stanbol Enhancer
+
+Also note that the code snippets use utilities provided by the 
"org.apache.stanbol.enhancer.servicesapi" module. Clerezza is used as the RDF 
framework. Both dependencies are easily replaceable.
+
+First, let's have a look at the required information:
+
+    :::java
+    MGraph graph; //the RDF graph to store the metadata
+    UriRef ciUri; //the URI for the contentItem
+    String tag; // user provided tag
+    UriRef tagType; //the type of the Tag
+    
+Regarding the tag type: Stanbol natively supports the following types:
+
+* __Person__ (http://dbpedia.org/ontology/Person)
+* __Organization__ (http://dbpedia.org/ontology/Organisation): NOTE the 
British spelling
+* __Place__ (http://dbpedia.org/ontology/Place)
+
+The processing of submitted tags that use another type or no type at all 
depends on the [enhancement engines](engines) used and their configuration. The 
configuration of the [Named Entity Tagging 
Engines](engines/namedentitytaggingengine.html) in use is especially important 
in that respect.
+
+    :::java
+    Resource user; //the user that has created the tag (optional)
+    //in case of a name just use a literal
+    user = new PlainLiteralImpl("Rudolf Huber");
+    //in case users have assigned URIs
+    user = new UriRef("http://my.cms.org/users/rudof.huber");
+
+Now we can convert the information to TextAnnotations:
+
+    :::java
+    //first create a URI for the text annotation. Here we use a random URN
+    //If you can create a meaningful URI this would be better!
+    UriRef ta = new 
UriRef("urn:user-annotation:"+EnhancementEngineHelper.randomUUID());
+    //add the 'rdf:type's
+    graph.add(new TripleImpl(ta, RDF.type, 
TechnicalClasses.ENHANCER_TEXTANNOTATION));
+    graph.add(new TripleImpl(ta, RDF.type, 
TechnicalClasses.ENHANCER_ENHANCEMENT));
+       
+    //this TextAnnotation is about the ContentItem
+    graph.add(new TripleImpl(ta, Properties.ENHANCER_EXTRACTED_FROM, ciUri));
+    //if the Tag uses a type add it
+    if(tagType != null){
+        graph.add(new TripleImpl(ta, Properties.DC_TYPE, tagType));
+    }
+    //add the value of the tag
+    graph.add(new TripleImpl(ta, Properties.ENHANCER_SELECTED_TEXT, new 
PlainLiteralImpl(tag)));
+    //add the user
+    if(user != null){
+        graph.add(new TripleImpl(ta, Properties.DC_CREATOR,user));
+    }
+
+Now the 'graph' contains a valid TextAnnotation for the given user tag. This 
should be done for all tags of the current content.
+
+In the next step we need to serialize the RDF data. Again, Clerezza is used 
here as the API, but any RDF framework will provide similar functionality.
+
+    :::java
+    Serializer serializer; //the Clerezza Serializer
+    ByteArrayOutputStream out = new ByteArrayOutputStream();
+    //this tells the Serializer to create "application/rdf+xml"
+    serializer.serialize(out, graph, SupportedFormat.RDF_XML);
+    String rdfContent = new String(out.toByteArray(),UTF8);
+   
+Now we need to create the MultiPart MIME content item containing the metadata 
and the content:
+
+    :::java
+    String content; //the content we want to send to the Stanbol Enhancer
+
+    //the container for the ContentItem
+    MultipartEntity contentItem = new MultipartEntity(null, null ,UTF8);
+
+    //The Metadata MUST BE the first element
+    contentItem.addPart(
+        "metadata", //the name MUST BE "metadata" 
+        new StringBody(rdfContent,SupportedFormat.RDF_XML,UTF8){
+            @Override
+            public String getFilename() { //The filename MUST BE the
+                return ciUri.getUnicodeString(); //uri of the ContentItem
+            }
+        });
+
+Note that because the StringBody class provided by the "httpmime" framework 
does not set a filename, we need to override this method and return the URI of 
the content item. This is essential, because we need to ensure that the URI of 
the ContentItem is the same as the URI (variable 'ciUri') used when creating 
the TextAnnotations for the user tags.
+
+For the following code snippet, note that we can directly add the content to 
the content item container. A 'multipart/alternate' container is only required 
if we need to send multiple alternate content versions (as shown in 
'Example 3').
+ 
+    :::java
+    //Add the content as second mime part
+    contentItem.addPart(
+        "content", //the name MUST BE "content"
+        new StringBody(content,"text/plain",UTF8));
+
+    //now we are ready to create and execute the POST request to the
+    //Stanbol Enhancer
+    HttpPost request = new HttpPost("http://localhost:8080/enhancer");
+    request.setEntity(contentItem);
+    request.setHeader("Accept","application/rdf+xml");
+    HttpResponse response = httpClient.execute(request);
+
+The response of the Enhancer will now contain Entity suggestions for the free 
text user tags.
+

svn commit: r1245375 - /incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/enhancerrest.mdtext
