Author: rwesten
Date: Wed Apr 11 08:32:35 2012
New Revision: 1324631
URL: http://svn.apache.org/viewvc?rev=1324631&view=rev
Log:
added documentation of the ContentItemFactory, ContentSource, ContentReference
and ContentSink interfaces (STANBOL-573)
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitemfactory.mdtext
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitem.mdtext
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitem.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitem.mdtext?rev=1324631&r1=1324630&r2=1324631&view=diff
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitem.mdtext
(original)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitem.mdtext
Wed Apr 11 08:32:35 2012
@@ -2,7 +2,7 @@ Title: Content Item
<span style="float:right"> </span>
-The ContentItem is the object which represents the content to be enhanced by
Apache Stanbol. It is created based on the data provided by the enhancement
request and used throughout the enhancement process to store results.
Therefore, after the enhancement process has finished, the ContentItem
represents the result of the Apache Stanbol enhancement process.
+The ContentItem is the object which represents the content to be enhanced by
Apache Stanbol. It is created based on the data provided by the enhancement
request and used throughout the enhancement process to store results.
Therefore, after the enhancement process has finished, the ContentItem
represents the result of the Apache Stanbol enhancement process. ContentItem
instances are created by using the
[ContentItemFactory](contentitemfactory.html) service.
The following section describes the interface of the ContentItem in detail:
@@ -94,6 +94,18 @@ However, whenever components need to ens
While accessing content items within an [enhancement engine](engines) there is
an exception to this rule. If an engine declares that it only supports the
<code>SYNCHRONOUS</code> enhancement mode, then the [enhancement job
manager](enhancementjobmanager.html) needs to take care that an engine has
exclusive access to the _ContentItem_. In this case implementors of enhancement
engines need not care about using read/write locks.
+### ContentItemFactory
+
+Since version 0.10.0 ContentItems and Blobs are created by using the
[ContentItemFactory](contentitemfactory.html). ContentItemFactory
implementations register themselves as OSGi services. The StanbolEnhancer uses
the implementation with the highest "service.ranking" to create instances. Two
implementations are available by default: an in-memory one and a file-based
one. The in-memory implementation is used as the default.
+
+Most users will not need to change the default ContentItem implementation.
However, if the Enhancer is used to extract metadata from big media files, such
as EXIF metadata from big images or ID3 tags from MP3 files, then changing the
default from the InMemoryContentItemFactory to the FileContentItemFactory might
considerably reduce the memory footprint.
+
+With the introduction of the ContentItemFactory, all ContentItem
implementation-specific constructors for passing content were deprecated and
replaced by the following three interfaces:
+
+1. __ContentSource__ allows passing content that is available as a stream,
byte array or string.
+2. __ContentReference__ allows passing a reference (e.g. a URL) to the
content. The dereference() method of this interface is used by the
ContentItemFactory to convert a ContentReference to a ContentSource.
+3. __ContentSink__ provides an OutputStream to an initially empty Blob, so the
content can be streamed to it later. This is intended for EnhancementEngines
that need to convert content from one format to another, because it avoids
caching the converted content in memory.
+
### Multipart MIME serialization
<span style="float:right"> </span>
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitemfactory.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitemfactory.mdtext?rev=1324631&view=auto
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitemfactory.mdtext
(added)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/contentitemfactory.mdtext
Wed Apr 11 08:32:35 2012
@@ -0,0 +1,133 @@
+Title: Content Item Factory
+
+The ContentItemFactory is used by the Stanbol Enhancer to create
[ContentItem](contentitem.html) and Blob instances. ContentItemFactory
implementations typically register themselves as OSGi services. The Stanbol
Enhancer will use the factory implementation with the highest "service.ranking"
to create ContentItems and Blobs for requests on the RESTful API. When using
the Java API any ContentItem implementation can be used.
+
+### ContentItemFactory interface
+
+The ContentItemFactory interface defines the following methods to create
ContentItems:
+
+ :::java
+ + createContentItem(ContentSource source) : ContentItem
+ + createContentItem(String prefix, ContentSource source) : ContentItem
+ + createContentItem(UriRef id, ContentSource source) : ContentItem
+ + createContentItem(String prefix, ContentSource source, MGraph metadata)
: ContentItem
+ + createContentItem(UriRef id, ContentSource source, MGraph metadata) :
ContentItem
+ + createContentItem(ContentReference reference) : ContentItem
+ + createContentItem(ContentReference reference, MGraph metadata) :
ContentItem
+
+The content for a created ContentItem can be passed by using either a
ContentSource or a ContentReference. The Stanbol Enhancer servicesapi module
provides ContentSource implementations for Java streams, byte arrays and
strings, as well as a ContentReference implementation for URLs. For details see
the sections below.
+
+The URI of the created ContentItem is determined as follows:
+
+* If no URI is passed, then it is calculated by using a default prefix plus a
digest over the passed content. This ensures that if the same content is passed
several times, the created ContentItems will use the same id.
+* Methods that take a __prefix__ will also generate the URI by calculating a
digest over the passed content. However, the passed prefix will be used instead
of the default one.
+* If a __UriRef id__ is passed, then that URI is used as the id of the content
item.
+
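The prefix-plus-digest rule above can be illustrated with a minimal, self-contained sketch. It is not the Stanbol implementation: the class name, the default prefix and the choice of SHA-1 are assumptions made only for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the Stanbol API.
class ContentIdSketch {

    // Assumed default prefix, chosen only for this illustration.
    static final String DEFAULT_PREFIX = "urn:example-content-item-";

    /** Builds an id from the prefix (or the default one) plus a hex digest of the content. */
    static String contentUri(String prefix, byte[] content) {
        try {
            // SHA-1 is an assumption; any stable digest gives the same property.
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder uri = new StringBuilder(prefix == null ? DEFAULT_PREFIX : prefix);
            for (byte b : md.digest(content)) {
                uri.append(String.format("%02x", b));
            }
            return uri.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "some text".getBytes(StandardCharsets.UTF_8);
        // The same content always maps to the same id.
        System.out.println(contentUri(null, content).equals(contentUri(null, content)));
    }
}
```

Because the id depends only on the prefix and the bytes, passing identical content twice yields the same id, which is the property described in the list above.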
+The ContentItemFactory also allows passing pre-existing metadata. All RDF
triples in the passed MGraph are guaranteed to be added to the metadata of the
created ContentItems. Note that implementations are free to directly use the
passed MGraph instance for the metadata or to create a new MGraph instance and
copy all triples of the passed instance.
+
+The following methods of the ContentItemFactory can be used to create Blobs:
+
+ :::java
+ + createBlob(ContentSource source) : Blob
+ + createBlob(ContentReference reference) : Blob
+ + createContentSink(String mediaType) : ContentSink
+
+The Blob interface is used by the Stanbol Enhancer to represent content. Blobs
are added to ContentItems as [content parts](contentitem.html#content_parts).
Blobs can be created from the same ContentSource and ContentReference
interfaces that are used to create ContentItems. In addition, a ContentSink can
be used. A ContentSink provides an OutputStream to an initially empty Blob, so
the content can be streamed to it later. This is intended for
EnhancementEngines that need to convert content from one format to another,
because it avoids caching the converted content in memory.
+
+### ContentItem implementations
+
+By default the Stanbol Enhancer provides two
ContentItemFactory/ContentItem/Blob implementations. Users can control the
implementation used by the Stanbol Enhancer by configuring the
"service.ranking" property of the different ContentItemFactory implementations
(e.g. via the configuration tab of the Apache Felix Web Console). The
implementation with the highest "service.ranking" will be used by the Stanbol
Enhancer to create ContentItems and Blobs.
+
+#### In-memory ContentItem
+
+This implementation manages contents - Blobs - as byte arrays that are kept
in-memory. While this ensures fast access to the passed content it also might
cause problems if the Stanbol Enhancer is used to process big media files.
Nonetheless this is currently used as default, because for typical usage
scenarios content processed by the Stanbol Enhancer easily fits into memory.
+
+The ContentItemFactory of this implementation registers itself with a
"service.ranking" of 100 and is therefore used as default by the Stanbol
Enhancer.
+
+#### File-based ContentItem
+
+This implementation differs from the in-memory one in that it stores content -
Blobs - in temporary files on the hard disk. All other information, such as the
metadata or non-Blob content parts, is still kept in-memory. This
implementation is intended for users that use the Stanbol Enhancer to
process big media files such as TIFF images, MP3 files, rich text files
including big graphics or even video files.
+
+The ContentItemFactory of the file-based implementation is registered with
a "service.ranking" of 50. To use it as default, users need to ensure that the
ranking of this implementation is higher than the one of the in-memory
implementation.
+
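The selection rule described above (the highest "service.ranking" wins) can be sketched as follows. This is a simplified stand-in and does not use the OSGi API: the RankedFactory class is hypothetical, only the ranking values 100 and 50 come from the text, and ties between equal rankings are not handled.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a registered ContentItemFactory service.
class RankedFactory {
    final String name;
    final int serviceRanking;

    RankedFactory(String name, int serviceRanking) {
        this.name = name;
        this.serviceRanking = serviceRanking;
    }

    /** Returns the candidate with the highest ranking. */
    static RankedFactory select(List<RankedFactory> candidates) {
        return candidates.stream()
                .max(Comparator.comparingInt((RankedFactory f) -> f.serviceRanking))
                .orElseThrow(() -> new IllegalStateException("no factory registered"));
    }

    public static void main(String[] args) {
        // Default rankings as documented above.
        List<RankedFactory> defaults = Arrays.asList(
                new RankedFactory("in-memory", 100),
                new RankedFactory("file-based", 50));
        System.out.println(select(defaults).name);
    }
}
```

With the documented defaults the in-memory factory is selected. Raising the ranking of the file-based factory above 100 makes it the one that is chosen.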
+### ContentSource
+
+This interface describes the source of content. It defines the following API:
+
+ :::java
+ /** the content as stream */
+ + getStream() : InputStream
+ /** the content as byte array */
+ + getData() : byte[]
+ /** optionally the media type of the content */
+ + getMediaType() : String
+ /** optionally the file name of the content */
+ + getFileName() : String
+ /** optionally additional headers */
+ + getHeaders() : Map<String,List<String>>
+
+The ContentSource interface defines methods for obtaining the wrapped content
as InputStream and byte[]. This is mainly to avoid unnecessary copying of
content. Implementors of ContentItems SHOULD prefer to call
+
+* ContentSource#getData() if the ContentItem/Blob implementation will store
the content as byte[] in-memory
+* ContentSource#getStream() if the content of a ContentSource is streamed to a
file, database, CMS or any other target outside the JVM.
+
+The following implementations of this interface are provided by the Stanbol
Enhancer servicesapi module:
+
+* StreamSource: A ContentSource wrapping an InputStream. Multiple calls to
#getStream() are not supported and will cause IllegalStateExceptions. Calls
to #getData() will load the contents of the stream into memory.
+* ByteArraySource: A ContentSource implementation that uses a byte array to
represent the content. All constructors take the byte array representing
the content as parameter. Calls to #getData() MUST NOT copy the byte array, to
avoid duplication.
+* StringSource: A ContentSource implementation that allows passing a String
instance directly. The constructors convert the passed String to a byte array
by using the passed Charset. UTF-8 is used as default. This implementation is
based on the ByteArraySource.
+
+
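The behaviour described for a stream-backed source can be sketched with a small stand-alone class. This is not the Stanbol StreamSource: the class name is hypothetical, and the sketch makes its own assumption that the wrapped stream may be consumed only once, either through getStream() or through the first getData() call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical illustration of a one-shot, stream-backed content source.
class OneShotStreamSource {
    private final InputStream in;
    private boolean consumed;
    private byte[] data;

    OneShotStreamSource(InputStream in) {
        this.in = in;
    }

    /** Hands out the wrapped stream once; later calls fail. */
    synchronized InputStream getStream() {
        if (consumed) {
            throw new IllegalStateException("stream was already consumed");
        }
        consumed = true;
        return in;
    }

    /** Loads the stream into memory on first use and returns the cached array afterwards. */
    synchronized byte[] getData() {
        if (data == null) {
            if (consumed) {
                throw new IllegalStateException("stream was already consumed");
            }
            consumed = true;
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            try {
                int read;
                while ((read = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, read);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            data = buffer.toByteArray();
        }
        return data;
    }

    public static void main(String[] args) {
        OneShotStreamSource source =
                new OneShotStreamSource(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        System.out.println(source.getData().length);
    }
}
```

A file-based Blob implementation would prefer getStream() and copy the stream to disk, while an in-memory one would call getData() and keep the returned array.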
+### ContentReference
+
+This interface describes content that is not yet locally available.
The Stanbol Enhancer will dereference the content automatically when
needed.
+
+ :::java
+ /** the Reference to the content */
+ + getReference() : String
+ /** dereferences the content */
+ + dereference() : ContentSource
+
+The point at which referenced content is dereferenced by the Stanbol Enhancer
depends on many factors. At the earliest, it may be dereferenced by the
createBlob/createContentItem methods of a ContentItemFactory implementation. At
the latest, it will be dereferenced when the referenced content is first used
by the Stanbol Enhancer (e.g. on a call to ContentItem#getStream() or
ContentItem#getMimeType()).
+
+By default a ContentReference implementation for Java URLs is provided by the
Stanbol Enhancer servicesapi module. This implementation replaces the
WebContentItem that was used for obtaining content from URLs until Stanbol
version 0.9.0-incubating.
+
+
+### ContentSink
+
+EnhancementEngines that convert passed content (e.g. the
[TikaEngine](engines/tikaengine.html)) are often capable of stream
processing, meaning that they do not need to load the whole content
into memory while analyzing it. To support this operation mode within the
StanbolEnhancer as well, the ContentSink interface plays an important role: it
allows creating an initially empty Blob and then "streaming" the content to it
while processing the content.
+
+The following method of the ContentItemFactory can be used to create a
ContentSink
+
+ :::java
+ /** Creates a new ContentSink */
+ + createContentSink(String mediaType) : ContentSink;
+
+The ContentSink interface provides the OutputStream as well as the created Blob
+
+ :::java
+ /** Getter for the OutputStream */
+ + getOutputStream() : OutputStream;
+ /** Getter for the Blob */
+ + getBlob() : Blob;
+
+__Note:__ Users MUST NOT pass the Blob of a ContentSink to any other
components until all the data are written to the OutputStream, because this may
cause other components to read partial data when calling Blob#getStream().
This feature is intended to reduce the memory footprint and not to support
concurrent writing and reading of data as supported by pipes.
+
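The write-first, read-later contract from the note above can be illustrated with a temporary-file-backed sink. This is only a sketch under stated assumptions: TempFileSink and readContent() are hypothetical names and do not reflect how Stanbol implements ContentSink or Blob.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sink that streams content to a temporary file instead of memory.
class TempFileSink {
    private Path file;
    private OutputStream out;

    TempFileSink() {
        try {
            file = Files.createTempFile("sink-sketch-", ".tmp");
            file.toFile().deleteOnExit();
            out = Files.newOutputStream(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** The stream a converter writes to. */
    OutputStream getOutputStream() {
        return out;
    }

    /** Reads the content back; call this only after the stream was closed. */
    byte[] readContent() {
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        TempFileSink sink = new TempFileSink();
        try (OutputStream os = sink.getOutputStream()) {
            os.write("converted text".getBytes(StandardCharsets.UTF_8));
        }
        // Only now, after closing the stream, is the content handed on.
        System.out.println(new String(sink.readContent(), StandardCharsets.UTF_8));
    }
}
```

The writer never holds the complete converted content in memory, and readers see the content only after the stream is closed, which mirrors the restriction stated in the note.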
+#### Intended Usage:
+
+This example shows a typical usage of a ContentSink within the
processEnhancement(..) method of an EnhancementEngine that needs to transform
some content.
+
+ :::java
+ ContentItem ci; //the content item to process
+ ContentSink plainTextSink =
contentItemFactory.createContentSink("text/plain");
+ Writer writer = new
OutputStreamWriter(plainTextSink.getOutputStream(),"UTF-8");
+ try {
+ // pass the writer to the framework that extracts the text
+ } finally {
+ IOUtils.closeQuietly(writer);
+ }
+ //now add the Blob to the ContentItem
+ UriRef textBlobUri; //create an UriRef for the Blob
+ ci.addPart(textBlobUri, plainTextSink.getBlob());
+ plainTextSink = null;
+