[jira] [Created] (STANBOL-577) Add Interfaces for parsing Content

Rupert Westenthaler (Created) (JIRA) Thu, 05 Apr 2012 00:33:51 -0700

Add Interfaces for parsing Content
----------------------------------

                 Key: STANBOL-577
                 URL: https://issues.apache.org/jira/browse/STANBOL-577
             Project: Stanbol
          Issue Type: Sub-task
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler



Currently, different types of ContentItem define their own constructors that 
fit their specific implementation. For example, the InMemoryBlob defines 
constructors that allow passing the content as a byte array. This makes 
complete sense for this implementation, because it allows passing the data 
directly if it is already loaded in memory. The WebContentItem, as another 
example, cannot support a constructor taking a byte array, because at the time 
of construction only the URL of (a reference to) the content is available. For 
a file-based ContentItem implementation, a constructor taking a byte array 
would also not be desirable, as the whole point of such an implementation is to 
avoid loading the whole content into memory.

However, with the introduction of a factory pattern to construct ContentItems, 
the interfaces used to pass content MUST be normalized, because they are part 
of the API of the ContentItemFactory interface. To solve this, the following 
two interfaces are added to the Stanbol Enhancer API:


First, the __ContentSource__ interface, intended to be used for already 
dereferenced content:

    /** the content as stream */
    + getStream() : InputStream
    /** the content as byte array */
    + getData() : byte[]
    /** optionally the media type of the content */
    + getMediaType() : String
    /** optionally the file name of the content */
    + getFileName() : String
    /** optionally additional headers */
    + getHeaders() : Map<String,List<String>>
        
With the following default implementations:

* StreamSource: A ContentSource wrapping an InputStream. Multiple calls to 
#getStream() will not be supported. Calls to #getData() will load the contents 
provided by the stream into memory.
* ByteArraySource: A ContentSource implementation that internally uses a byte 
array. To be used in cases where users need to pass content to the Stanbol 
Enhancer that is already loaded in memory. Calls to #getData() MUST NOT copy 
the internal byte array.
* StringSource: A ContentSource implementation that allows passing a String 
instance directly.
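
The interface and its three default implementations could be sketched roughly 
as follows. This is only an illustration of the description above, not the 
code committed to Stanbol: the AbstractSource helper, the constructor 
signatures, the use of IllegalStateException for a second #getStream() call, 
the unchecked exception wrapping, and the UTF-8 encoding and media type chosen 
for StringSource are all assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Mirrors the method listing above. */
interface ContentSource {
    InputStream getStream();
    byte[] getData();
    String getMediaType();                  // optional, may be null
    String getFileName();                   // optional, may be null
    Map<String, List<String>> getHeaders(); // optional, may be empty
}

/** Hypothetical helper holding the optional metadata. */
abstract class AbstractSource implements ContentSource {
    private final String mediaType;
    AbstractSource(String mediaType) { this.mediaType = mediaType; }
    public String getMediaType() { return mediaType; }
    public String getFileName() { return null; }
    public Map<String, List<String>> getHeaders() { return Collections.emptyMap(); }
}

class ByteArraySource extends AbstractSource {
    private final byte[] data;
    ByteArraySource(byte[] data, String mediaType) {
        super(mediaType);
        this.data = data;
    }
    public InputStream getStream() { return new ByteArrayInputStream(data); }
    public byte[] getData() { return data; } // returns the internal array, no copy
}

class StringSource extends ByteArraySource {
    StringSource(String text) {
        // Assumption: encode as UTF-8 and advertise a matching media type.
        super(text.getBytes(StandardCharsets.UTF_8), "text/plain; charset=UTF-8");
    }
}

class StreamSource extends AbstractSource {
    private final InputStream in;
    private boolean consumed;
    StreamSource(InputStream in, String mediaType) {
        super(mediaType);
        this.in = in;
    }
    public InputStream getStream() {
        // The wrapped stream can only be handed out once.
        if (consumed) throw new IllegalStateException("stream already consumed");
        consumed = true;
        return in;
    }
    public byte[] getData() {
        // Reads the whole stream into memory, then closes it.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try (InputStream s = getStream()) {
            int n;
            while ((n = s.read(buf)) != -1) out.write(buf, 0, n);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a file-backed Blob would copy from getStream() to its target, 
while an in-memory Blob would keep the array returned by getData().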

Note that ContentItem/Blob implementations that

* store the content in-memory should prefer to call ContentSource#getData() to 
retrieve the content from the ContentSource
* stream the content to a file/database/CMS need to use 
ContentSource#getStream() to avoid loading the whole content in-memory!

Second, the __ContentReference__ interface, intended to be used to create 
ContentItems/Blobs for content where only a reference is available:

    /** the Reference to the content */
    + getReference() : String
    /** dereferences the content */
    + dereference() : ContentSource
    
With the following default implementation:

* UrlReference: Allows any Java URL to be used to reference content. This is 
essentially a replacement for the current WebContentItem implementation.
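
A minimal sketch of how a URL-based reference might defer loading until 
dereference() is called is shown below. ContentSource is trimmed here to two 
methods so that the block compiles on its own; the constructor, the unchecked 
exception wrapping, and the use of URLConnection#getContentType() for the 
media type are assumptions, not details taken from this issue.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

/** Trimmed to the methods used in this sketch. */
interface ContentSource {
    InputStream getStream();
    String getMediaType();
}

interface ContentReference {
    /** the reference to the content */
    String getReference();
    /** dereferences the content */
    ContentSource dereference();
}

class UrlReference implements ContentReference {
    private final URL url;

    UrlReference(URL url) { this.url = url; }

    public String getReference() { return url.toString(); }

    public ContentSource dereference() {
        try {
            // The connection is opened only now, not at construction time.
            URLConnection con = url.openConnection();
            final InputStream in = con.getInputStream();
            final String mediaType = con.getContentType();
            return new ContentSource() {
                public InputStream getStream() { return in; }
                public String getMediaType() { return mediaType; }
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Constructing a UrlReference is cheap in this sketch; the cost of fetching the 
content is only paid by the code that calls dereference().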

Both interfaces and implementations will be part of the Stanbol Enhancer 
Services API module.

