Author: rwesten
Date: Wed Apr 11 08:32:35 2012
New Revision: 1324631

URL: http://svn.apache.org/viewvc?rev=1324631&view=rev
Log:
added documentation of the ContentItemFactory, ContentSource, ContentReference 
and ContentSink interfaces (STANBOL-573)

Added:
    
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitemfactory.mdtext
Modified:
    
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitem.mdtext

Modified: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitem.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitem.mdtext?rev=1324631&r1=1324630&r2=1324631&view=diff
==============================================================================
--- 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitem.mdtext
 (original)
+++ 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitem.mdtext
 Wed Apr 11 08:32:35 2012
@@ -2,7 +2,7 @@ Title: Content Item
 
 <span style="float:right"> ![Content Item Overview](contentitemoverview.png 
"The ContentItem can contain several ContentParts and the Enhancement Metadata 
- an RDF Graph")</span> 
 
-The ContentItem is the object which represents the content to be enhanced by 
Apache Stanbol. It is created based on the data provided by the enhancement 
request and used throughout the enhancement process to store results. 
Therefore, after the enhancement process has finished, the ContentItem 
represents the result of the Apache Stanbol enhancement process.
+The ContentItem is the object which represents the content to be enhanced by 
Apache Stanbol. It is created based on the data provided by the enhancement 
request and used throughout the enhancement process to store results. 
Therefore, after the enhancement process has finished, the ContentItem 
represents the result of the Apache Stanbol enhancement process. ContentItem 
instances are created by using the 
[ContentItemFactory](contentitemfactory.html) service.
 
 The following section describes the interface of the ContentItem in detail:
 
@@ -94,6 +94,18 @@ However, whenever components need to ens
 
 While accessing content items within an [enhancement engine](engines) there is 
an exception to this rule. If an engine declares that it only supports the 
<code>SYNCHRONOUS</code> enhancement mode, then the [enhancement job 
manager](enhancementjobmanager.html) needs to take care that an engine has 
exclusive access to the _ContentItem_. In this case implementors of enhancement 
engines need not care about using read/write locks.
 
+### ContentItemFactory
+
+Since version 0.10.0 ContentItems and Blobs are created by using the 
[ContentItemFactory](contentitemfactory.html). ContentItemFactory 
implementations register themselves as OSGi services. The StanbolEnhancer 
uses the implementation with the highest "service.ranking" to create 
instances. Two implementations are available by default, an in-memory one 
and a file-based one; the in-memory implementation is used as default.
+
+Most users will not need to change the default ContentItem implementation. 
However, if the Enhancer is used to extract metadata from big media files, 
such as EXIF metadata from big images or ID3 tags from MP3 files, then 
changing the default from the InMemoryContentItemFactory to the 
FileContentItemFactory might considerably reduce the memory footprint. 
+
+With the introduction of the ContentItemFactory, all ContentItem 
implementation specific constructors for passing content were deprecated and 
replaced by the following three interfaces:
+
+1. __ContentSource__ allows passing content that is available as a stream, 
byte array or string.
+2. __ContentReference__ allows passing a reference to the content (e.g. a 
URL). The dereference() method of this interface is used by the 
ContentItemFactory to convert a ContentReference to a ContentSource.
+3. __ContentSink__ provides an OutputStream to an initially empty Blob 
that can later be used to stream the content. This is intended for 
EnhancementEngines that need to convert content from one format to another, 
because it avoids caching the converted content in-memory.
+
 ### Multipart MIME serialization
 
 <span style="float:right"> ![ContentItem Multipart MIME 
format](contentitemmultipartmime.png "This figure provides an overview how 
Content Items are serialized as MultiPart MIME")</span>

Added: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitemfactory.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitemfactory.mdtext?rev=1324631&view=auto
==============================================================================
--- 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitemfactory.mdtext
 (added)
+++ 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitemfactory.mdtext
 Wed Apr 11 08:32:35 2012
@@ -0,0 +1,133 @@
+Title: Content Item Factory
+
+The ContentItemFactory is used by the Stanbol Enhancer to create 
[ContentItem](contentitem.html) and Blob instances. ContentItemFactory 
implementations typically register themselves as OSGi services. The Stanbol 
Enhancer will use the factory implementation with the highest "service.ranking" 
to create ContentItems and Blobs for requests on the RESTful API. When using 
the Java API any ContentItem implementation can be used.
+
+### ContentItemFactory interface
+
+The interface of the ContentItemFactory defines the following methods to 
create ContentItems
+
+    :::java
+    + createContentItem(ContentSource source) : ContentItem
+    + createContentItem(String prefix, ContentSource source) : ContentItem
+    + createContentItem(UriRef id, ContentSource source) : ContentItem
+    + createContentItem(String prefix, ContentSource source, MGraph metadata) 
: ContentItem
+    + createContentItem(UriRef id, ContentSource source, MGraph metadata) : 
ContentItem
+    + createContentItem(ContentReference reference) : ContentItem
+    + createContentItem(ContentReference reference, MGraph metadata) : 
ContentItem
+
+The content for a created ContentItem can be passed by using either a 
ContentSource or a ContentReference. The Stanbol Enhancer servicesapi module 
provides implementations for creating ContentSources for Java streams, byte 
arrays and string objects as well as ContentReferences for URLs. For details 
see the sections below.
+
+The URI of the created ContentItem is determined as follows:
+
+* If no URI is passed, then it is calculated by using a default prefix plus a 
digest over the passed content. This ensures that if the same content is passed 
several times the created ContentItems will use the same id.
+* Methods that take a __prefix__ will also generate the URI by calculating a 
digest over the passed content. However, the passed prefix will be used instead 
of the default one.
+* If a __UriRef id__ is passed, then that URI is used as id for the content 
item.
+
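The id rules above can be illustrated with a minimal, self-contained sketch. It is not the Stanbol implementation: the class name `ContentIdSketch`, the default prefix and the choice of SHA-1 are assumptions made only for this example, since the text does not name the digest algorithm or the default prefix.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for illustration; not part of the Stanbol API. */
final class ContentIdSketch {

    /** Assumed default prefix, chosen only for this illustration. */
    static final String DEFAULT_PREFIX = "urn:content-item-";

    /** Builds an id from a prefix (or the default one) plus a hex digest over the content. */
    static String createId(String prefix, byte[] content) {
        String usedPrefix = (prefix == null) ? DEFAULT_PREFIX : prefix;
        try {
            // SHA-1 is an assumption; the source does not say which digest is used.
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder id = new StringBuilder(usedPrefix);
            for (byte b : md.digest(content)) {
                id.append(String.format("%02x", b));
            }
            return id.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] text = "Paris is the capital of France.".getBytes(StandardCharsets.UTF_8);
        // Same content, default prefix: repeated calls yield the same id.
        System.out.println(createId(null, text));
        System.out.println(createId(null, text));
        // Same content, caller-supplied prefix.
        System.out.println(createId("http://example.org/ci/", text));
    }
}
```

Because the id depends only on the prefix and the content bytes, passing the same content twice produces the same id, which is the property the first bullet describes.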
+The ContentItemFactory also allows passing pre-existing metadata. All RDF 
triples in the passed MGraph are guaranteed to be added to the metadata of the 
created ContentItems. Note that implementations are free to directly use the 
passed MGraph instance for the metadata or to create a new MGraph instance and 
copy all triples of the passed instance.
+
+The following methods of the ContentItemFactory can be used to create Blobs
+
+    :::java
+    + createBlob(ContentSource source) : Blob
+    + createBlob(ContentReference reference) : Blob
+    + createContentSink(String mediaType) : ContentSink
+
+The Blob interface is used by the Stanbol Enhancer to represent content. Blobs 
are added to ContentItems as [content parts](contentitem.html#content_parts). 
Blobs can be created from the same ContentSource and ContentReference 
interfaces that are supported for the creation of ContentItems. In addition, a 
ContentSink can be used. A ContentSink provides an OutputStream to an 
initially empty Blob that can later be used to stream the content. This is 
intended for EnhancementEngines that need to convert content from one 
format to another, because it avoids caching the converted content 
in-memory.
+
+### ContentItem implementations
+
+By default the Stanbol Enhancer provides two 
ContentItemFactory/ContentItem/Blob implementations. Users can control the 
implementation used by the Stanbol Enhancer by configuring the 
"service.ranking" property of the different ContentItemFactory implementations 
(e.g. via the configuration tab of the Apache Felix Web Console). The 
implementation with the highest "service.ranking" will be used by the Stanbol 
Enhancer to create ContentItems and Blobs. 
+
+#### In-memory ContentItem
+
+This implementation manages contents - Blobs - as byte arrays that are kept 
in-memory. While this ensures fast access to the passed content it also might 
cause problems if the Stanbol Enhancer is used to process big media files. 
Nonetheless this is currently used as default, because for typical usage 
scenarios content processed by the Stanbol Enhancer easily fits into memory.
+
+The ContentItemFactory of this implementation registers itself with a 
"service.ranking" of 100 and is therefore used as default by the Stanbol 
Enhancer.
+
+#### File-based ContentItem
+
+This implementation differs from the in-memory one in that it stores content - 
Blobs - in temporary files on the hard disk. All other information, such as the 
metadata or non-Blob content parts, is still kept in-memory. This 
implementation is intended for users that use the Stanbol Enhancer to 
process big media files such as TIFF images, MP3 files, rich text files 
including big graphics or even video files. 
+
+The ContentItemFactory of the file-based implementation is registered with 
a "service.ranking" of 50. To use it as default users need to ensure that the 
ranking of this implementation is higher than the one of the in-memory 
implementation.
+
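The selection rule can be sketched without OSGi. The class `RankingSketch` below is a hypothetical illustration of "highest ranking wins", not how the OSGi service registry is implemented; the rankings 100 and 50 are the ones stated above, and 200 is an arbitrary higher value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of ranking-based selection; not OSGi or Stanbol code. */
final class RankingSketch {

    /** Returns the name of the factory with the highest ranking. */
    static String selectDefault(Map<String, Integer> rankings) {
        return rankings.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalStateException("no factory registered"));
    }

    public static void main(String[] args) {
        Map<String, Integer> rankings = new LinkedHashMap<>();
        rankings.put("InMemoryContentItemFactory", 100); // ranking stated in the text
        rankings.put("FileContentItemFactory", 50);      // ranking stated in the text
        System.out.println(selectDefault(rankings));

        // Raising the file-based ranking above 100 makes it the default.
        rankings.put("FileContentItemFactory", 200);
        System.out.println(selectDefault(rankings));
    }
}
```

In a running Stanbol instance the ranking would be changed through the service configuration rather than in code; the sketch only shows the effect of the comparison.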
+### ContentSource
+
+This interface describes the source of content. It defines the following API
+
+    :::java
+    /** the content as stream */
+    + getStream() : InputStream
+    /** the content as byte array */
+    + getData() : byte[]
+    /** optionally the media type of the content */
+    + getMediaType() : String
+    /** optionally the file name of the content */
+    + getFileName() : String
+    /** optionally additional headers */
+    + getHeaders() : Map<String,List<String>> 
+
+The ContentSource interface defines methods for obtaining the wrapped content 
as InputStream and byte[]. This is mainly to avoid unnecessary copying of 
content. Implementors of ContentItems SHOULD prefer to call 
+
+* ContentSource#getData() if the ContentItem/Blob implementation will store 
the content as byte[] in-memory
+* ContentSource#getStream() if the content of a ContentSource is streamed to a 
file, database, CMS or any other target outside the JVM.
+
+The following implementations of this interface are provided by the Stanbol 
Enhancer servicesapi module
+
+* StreamSource: A ContentSource wrapping an InputStream. Multiple calls to 
#getStream() are not supported and will cause IllegalStateExceptions. Calls 
to #getData() will load the contents of the stream into an in-memory byte array.
+* ByteArraySource: A ContentSource implementation that uses a byte array to 
represent the content. All constructors take the byte array representing 
the content as parameter. Calls to #getData() MUST NOT copy the byte array to 
avoid duplications.
+* StringSource: A ContentSource implementation that directly accepts a 
String instance. The constructors convert the passed String to a byte array by 
using the passed Charset. UTF-8 is used as default. This implementation is 
based on the ByteArraySource.
+ 
+
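The single-use behaviour described for StreamSource can be sketched as follows. `OneShotStreamSource` is a hypothetical stand-in written for this example; it does not implement the Stanbol interface, and caching the result of `getData()` is an assumption of the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a stream-backed content source; not Stanbol code. */
final class OneShotStreamSource {
    private final InputStream in;
    private boolean consumed;
    private byte[] data; // filled lazily by getData()

    OneShotStreamSource(InputStream in) {
        this.in = in;
    }

    /** Hands out the wrapped stream exactly once. */
    synchronized InputStream getStream() {
        if (consumed) {
            throw new IllegalStateException("stream was already consumed");
        }
        consumed = true;
        return in;
    }

    /** Loads the whole stream into a byte array; the caching is an assumption of this sketch. */
    synchronized byte[] getData() {
        if (data == null) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            try (InputStream s = getStream()) {
                int n;
                while ((n = s.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            data = out.toByteArray();
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] bytes = "some content".getBytes(StandardCharsets.UTF_8);
        OneShotStreamSource source = new OneShotStreamSource(new ByteArrayInputStream(bytes));
        System.out.println(source.getData().length);
        try {
            source.getStream(); // the stream was consumed by getData()
        } catch (IllegalStateException e) {
            System.out.println("second access rejected: " + e.getMessage());
        }
    }
}
```

A byte-array-backed source has no such restriction, which is why an in-memory implementation would prefer `getData()` and a file-backed one `getStream()`.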
+### ContentReference
+
+This interface allows describing content that is not yet locally available. 
The Stanbol Enhancer will dereference the content automatically when 
needed.
+
+    :::java
+    /** the Reference to the content */
+    + getReference() : String
+    /** dereferences the content */
+    + dereference() : ContentSource
+
+When referenced content is dereferenced by the Stanbol Enhancer depends on 
many factors. At the earliest it may be dereferenced by the 
createBlob/createContentItem methods of a ContentItemFactory implementation. At 
the latest it will be dereferenced when the referenced content is first used by 
the Stanbol Enhancer (e.g. on a call to ContentItem#getStream() or 
ContentItem#getMimeType()).
+
+By default a ContentReference implementation for Java URLs is provided by the 
Stanbol Enhancer servicesapi module. This implementation replaces the 
WebContentItem that was used for obtaining content from URLs until Stanbol 
version 0.9.0-incubating. 
+
+
+### ContentSink
+
+EnhancementEngines that convert passed content (e.g. the 
[TikaEngine](engines/tikaengine.html)) are often capable of stream 
processing - meaning that they do not need to load the whole content 
in memory while analyzing it. To support this operation mode also within the 
StanbolEnhancer the ContentSink interface plays an important role, as it allows 
creating an - initially empty - Blob and then "streaming" the content to it 
while processing the content.
+
+The following method of the ContentItemFactory can be used to create a 
ContentSink
+
+    :::java
+    /** Creates a new ContentSink */
+    + createContentSink(String mediaType) : ContentSink;
+
+The ContentSink interface provides the OutputStream as well as the created Blob
+
+    :::java
+    /** Getter for the OutputStream */
+    + getOutputStream() : OutputStream;
+    /** Getter for the Blob */
+    + getBlob() : Blob;
+
+__Note:__ Users MUST NOT pass the Blob of a ContentSink to any other 
components until all the data are written to the OutputStream, because this may 
cause other components to read partial data when calling Blob#getStream(). 
This feature is intended to reduce the memory footprint and not to support 
concurrent writing and reading of data as supported by pipes.
+
+#### Intended Usage:
+
+This example shows a typical usage of a ContentSink within the 
processEnhancement(..) method of an EnhancementEngine that needs to transform 
some content.
+
+    :::java
+    ContentItem ci; //the content item to process
+    ContentSink plainTextSink = 
contentItemFactory.createContentSink("text/plain");
+    Writer writer = new 
OutputStreamWriter(plainTextSink.getOutputStream(),"UTF-8");
+    try {
+        // pass the writer to the framework that extracts the text
+    } finally {
+        IOUtils.closeQuietly(writer);
+    }
+    //now add the Blob to the ContentItem
+    UriRef textBlobUri; //create an UriRef for the Blob
+    ci.addPart(textBlobUri, plainTextSink.getBlob());
+    plainTextSink = null;
+
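For readers without a Stanbol setup, the write-then-read discipline of a sink can be tried with a self-contained sketch. `TempFileSinkSketch` is hypothetical and only mirrors the idea of a file-backed sink; it is not the FileContentItemFactory implementation, and its method names other than `getOutputStream()` are invented for this example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical file-backed sink; mirrors the ContentSink idea, not Stanbol code. */
final class TempFileSinkSketch {
    private final Path file;
    private final OutputStream out;

    private TempFileSinkSketch(Path file, OutputStream out) {
        this.file = file;
        this.out = out;
    }

    /** Creates a sink backed by an initially empty temporary file. */
    static TempFileSinkSketch create() {
        try {
            Path file = Files.createTempFile("sink-sketch", ".tmp");
            file.toFile().deleteOnExit();
            return new TempFileSinkSketch(file, Files.newOutputStream(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    OutputStream getOutputStream() {
        return out;
    }

    /** Reads the stored content; call only after the output stream has been closed. */
    byte[] readContent() {
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Streams text into the sink, closes the writer, then reads the content back. */
    static String roundTrip(String text) {
        TempFileSinkSketch sink = create();
        try (Writer writer = new OutputStreamWriter(sink.getOutputStream(), StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Only now, with all data written and the stream closed, is the content handed on.
        return new String(sink.readContent(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("extracted plain text"));
    }
}
```

The converted text never has to be held in memory as a whole while it is written, and the content is only read after the writer is closed, which corresponds to the rule in the note above.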

