Author: rwesten
Date: Mon Jan 9 10:11:21 2012
New Revision: 1229080
URL: http://svn.apache.org/viewvc?rev=1229080&view=rev
Log:
Specification for the Java API related to STANBOL-414 and STANBOL-46
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/STANBOL-414-specification.mdtext
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/STANBOL-414-specification.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/STANBOL-414-specification.mdtext?rev=1229080&view=auto
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/STANBOL-414-specification.mdtext
(added)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/STANBOL-414-specification.mdtext
Mon Jan 9 10:11:21 2012
@@ -0,0 +1,258 @@
+This document provides the specification of the Java API for the extensions
to the RESTful services of the Stanbol Enhancer as described in
[STANBOL-414](https://issues.apache.org/jira/browse/STANBOL-414).
+
+Enhancement Chains
+----
+
+A Chain represents a configuration that defines which engines are used, and
in what order, to process ContentItems. Chains are registered as OSGI services
and identified by the "stanbol.enhancer.chain.name" property.
+
+### Chain
+
+The Chain provides its configuration in the form of an RDF graph.
+
+ /** Getter for the execution plan */
+ + getExecutionPlan() : Graph
+ /** Getter for the name of the Engines referenced by this Chain */
+ + getEngines() : Set<String>
+
+The getEngines method may return the engine names in any order. It is mainly
intended for situations where only the engines used by a chain need to be known
(e.g. visualized) but the chain itself need not be executed.
+
+The returned Graph holding the execution plan MUST be read-only and final,
meaning that a change in the configuration of a Chain MUST NOT change a graph
already returned by an earlier call to the getExecutionPlan method.
+
+Because the configuration of a Chain might change at any time, a JobManager
MUST retrieve the Graph holding the execution plan before it starts the actual
processing of the ContentItem. This plan MUST be used for the whole enhancement
process. Later changes to the configuration MUST NOT be reflected in the
enhancement of a ContentItem.
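+
+One way to meet the snapshot requirement is for the Chain to build a new
immutable graph on every configuration change and to hand out only that
instance. The following is a minimal sketch under stated assumptions, not the
Stanbol implementation: the execution plan is modelled as a plain set of triple
strings instead of a Clerezza Graph, and the class SnapshotChain and its
updateConfiguration method are hypothetical names.
+
```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical chain that publishes its execution plan as immutable snapshots. */
class SnapshotChain {
    // Replaced as a whole on reconfiguration; never mutated in place.
    private volatile Set<String> executionPlan = Collections.emptySet();

    /** Builds a NEW read-only plan; previously returned plans stay untouched. */
    void updateConfiguration(Set<String> triples) {
        executionPlan = Collections.unmodifiableSet(new LinkedHashSet<String>(triples));
    }

    /** Callers keep using the instance they received, even after a reconfiguration. */
    Set<String> getExecutionPlan() {
        return executionPlan;
    }
}
```
+
+A JobManager that calls getExecutionPlan() once and keeps the returned
reference for the whole enhancement process is then unaffected by later
configuration changes.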
+
+### ChainManager
+
+The ChainManager is a service that tracks all Chains registered as services
in the OSGI Environment of the Stanbol Enhancer. It provides a simple API to
retrieve a chain based on its name.
+
+ /** Getter for the Chain for a given name */
+ + getChain(String name) : Chain
+ /** Getter for all Chains for a name sorted by service ranking */
+ + getChains(String name) : List<Chain>
+ /** Checks if there is a chain for the given name */
+ + isChain(String name) : boolean
+ /** Getter for the default chain */
+ + getDefault() : Chain
+
+The default Chain is used if no chain is specified in a request (e.g. when
calling the /engines endpoint). The default Chain is the chain with the highest
service ranking.
+
+ALTERNATIVE: The default Chain is the Chain with
"stanbol.enhancer.chain.name=default" and the highest service ranking. If no
Chain with the name "default" exists the Chain with the highest service ranking
is assumed to be the default chain.
+
+The default configuration of Stanbol MUST provide a Chain instance with
"stanbol.enhancer.chain.name=default" and a service ranking of
Integer.MIN_VALUE that includes all currently active enhancement engines.
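+
+The selection rule described as ALTERNATIVE above can be sketched as follows.
This is only an illustration: ChainRef is a hypothetical stand-in for a
registered Chain service together with its name and service ranking, and
DefaultChainSelector is not part of the specified API.
+
```java
import java.util.List;

/** Hypothetical stand-in for a registered Chain service and its properties. */
class ChainRef {
    final String name;   // value of "stanbol.enhancer.chain.name"
    final int ranking;   // OSGI service ranking

    ChainRef(String name, int ranking) {
        this.name = name;
        this.ranking = ranking;
    }
}

class DefaultChainSelector {
    /** Returns the default chain, or null if no chain is registered. */
    static ChainRef selectDefault(List<ChainRef> chains) {
        ChainRef bestNamed = null;  // highest-ranked chain named "default"
        ChainRef bestAny = null;    // highest-ranked chain overall
        for (ChainRef c : chains) {
            if (bestAny == null || c.ranking > bestAny.ranking) {
                bestAny = c;
            }
            if ("default".equals(c.name)
                    && (bestNamed == null || c.ranking > bestNamed.ranking)) {
                bestNamed = c;
            }
        }
        // Fall back to the overall highest ranking if no "default" chain exists.
        return bestNamed != null ? bestNamed : bestAny;
    }
}
```
+
+With this rule a built-in chain registered with Integer.MIN_VALUE is replaced
as soon as any other chain named "default" is registered with a higher ranking.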
+
+### ExecutionPlan
+
+The execution plan needs to be created by the chain based on its current
configuration. This plan is read-only and MUST NOT be changed if the
configuration of the Chain changes. This means that the Chain MUST create a new
Graph instance if the execution plan changes as a result of a change in the
configuration. It MUST NOT change any execution plan already passed to other
components by the getExecutionPlan() method.
+
+The RDFS schema used for the execution plan is defined as follows.
+
+ * Namespace: ep : http://stanbol.apache.org/ontology/enhancer/executionplan#
+ * __ep:ExecutionNode__ : Class used for all Nodes representing the execution
of an Enhancement Engine.
+ * __ep:engine__ (domain: ep:ExecutionNode; range: xsd:string): The property
used to link to the Enhancement Engine by the name of the engine.
+ * __ep:dependsOn__ (domain: ep:ExecutionNode; range: ep:ExecutionNode)
Defines that the execution of this node depends on the completion of the
referenced one.
+ * __ep:optional__ (domain: ep:ExecutionNode; range: xsd:boolean) Can be used
to specify that the execution of this EnhancementEngine is optional. If this
property is set to TRUE an engine will be marked as executed even if its
execution was not possible (e.g. because no engine with this name was
active) or the execution failed (e.g. because of an exception).
+
+#### Example:
+
+This example shows an ExecutionPlan with five nodes for the "langId", "ner",
"dbpediaLinking", "geonamesLinking" and "zemanta" engines. Note that these
names refer to actual EnhancementEngine services registered with the current
OSGI Environment.
+
+This example assumes that
+
+* "langId" is the singleton instance of LangIdEnhancementEngine
+* "ner" is the default instance of the NamedEntityExtractionEnhancementEngine
engine
+* "dbpediaLinking" is an instance of the NamedEntityTaggingEngine configured
to use the dbpedia.org ReferencedSite of the Entityhub
+* "geonamesLinking" is an instance of the NamedEntityTaggingEngine configured
to use the geonames.org ReferencedSite
+* "zemanta" is the singleton instance of the ZemantaEnhancementEngine
+
+The RDF graph of such a chain would look as follows:
+
+    urn:node1
+        rdf:type ep:ExecutionNode
+        ep:engine langId
+
+    urn:node2
+        rdf:type ep:ExecutionNode
+        ep:dependsOn urn:node1
+        ep:engine ner
+
+    urn:node3
+        rdf:type ep:ExecutionNode
+        ep:dependsOn urn:node2
+        ep:engine dbpediaLinking
+
+    urn:node4
+        rdf:type ep:ExecutionNode
+        ep:dependsOn urn:node2
+        ep:engine geonamesLinking
+
+    urn:node5
+        rdf:type ep:ExecutionNode
+        ep:engine zemanta
+        ep:optional "true"^^xsd:boolean
+
+This plan defines that the "langId" and the "zemanta" engines do not depend on
anything and can therefore be executed from the start (even in parallel if the
JobManager executing this chain supports it). The execution of the "ner"
engine depends on the extraction of the language, and the execution of the
entity linking to dbpedia and geonames depends on the "ner" engine. Note that
"dbpediaLinking" and "geonamesLinking" could also be processed in parallel.
+
+
+#### ExecutionPlan Utility:
+
+The Enhancer MUST also define a utility that provides the following method:
+
+ /** Getter for the list of executable ep:ExecutionNodes */
+ + getExecuteable(Graph executionPlan, Set<NonLiteral> completed) :
Collection<NonLiteral>
+
+This method takes an execution plan and the set of already executed nodes as
input and returns the ExecutionNodes that can be executed next. The existing
utility methods within the EnhancementEngineHelper can be used to retrieve
further information from the ep:ExecutionNode resources returned by this
method.
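+
+The selection logic itself is small: a node is executable if it has not been
completed and all nodes it depends on have been completed. The sketch below
illustrates this with plain Java collections. It is an assumption-laden
simplification: the plan is given as a map from node name to the names it
depends on rather than as an RDF Graph, and ExecutionPlanSketch and
getExecutable are hypothetical names.
+
```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class ExecutionPlanSketch {
    /**
     * Returns all nodes that are not yet completed and whose dependencies
     * are all contained in the completed set.
     */
    static Set<String> getExecutable(Map<String, Set<String>> dependsOn,
                                     Set<String> completed) {
        Set<String> executable = new LinkedHashSet<String>();
        for (Map.Entry<String, Set<String>> node : dependsOn.entrySet()) {
            if (!completed.contains(node.getKey())
                    && completed.containsAll(node.getValue())) {
                executable.add(node.getKey());
            }
        }
        return executable;
    }
}
```
+
+For a plan in which "ner" depends on "langId" and both linking engines depend
on "ner", the first call with an empty completed set yields "langId" and
"zemanta". A caller that marks a failed optional node as completed lets the
nodes depending on it proceed.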
+
+Typically, code using this utility will look like this (pseudo code):
+
+    Graph executionPlan = chain.getExecutionPlan();
+    Map<String, EnhancementEngine> engines =
+        enhancementEngineManager.getActiveEngines(chain);
+    // nodes whose execution has finished (or was skipped as optional)
+    Set<NonLiteral> completed = new HashSet<NonLiteral>();
+    Collection<NonLiteral> next;
+    while(!(next = ExecutionPlanUtils.getExecuteable(
+            executionPlan, completed)).isEmpty()){
+        for(NonLiteral node : next){
+            EnhancementEngine engine = engines.get(
+                EnhancementEngineHelper.getString(
+                    executionPlan, node, EX_ENGINE));
+            Boolean optional = EnhancementEngineHelper.get(
+                executionPlan, node, EX_OPTIONAL, Boolean.class,
+                literalFactory);
+            /* Execute the Engine */
+            completed.add(node);
+        }
+    }
+
+Chain implementations
+----
+
+### WeightedChain
+
+This Chain implementation takes a list of engine names as input and uses the
"org.apache.stanbol.enhancer.engine.order" metadata provided by such engines
to calculate the execution plan.
+
+Similar to the current WeightedJobManager implementation, Engines would
depend on each other based on decreasing order values. Engines with the same
order value could be executed in parallel.
+
+This implementation is targeted at easy configuration - just a list of the
engine names contained within a chain - but has limited possibilities to
control the execution order within a chain. However, it is expected that it
provides enough flexibility for most usage scenarios.
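+
+The derivation of dependencies from order values can be sketched as follows.
This is an illustration only: it assumes that engines with higher order values
run first (as "decreasing order values" suggests), that every engine depends on
all engines of the next higher order value, and it uses a plain map instead of
an RDF execution plan. The class WeightedPlanSketch is a hypothetical name.
+
```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class WeightedPlanSketch {
    /** Maps each engine name to the engine names it depends on. */
    static Map<String, Set<String>> plan(Map<String, Integer> orderByEngine) {
        // Group engines by order value, highest order value first.
        TreeMap<Integer, Set<String>> groups =
                new TreeMap<Integer, Set<String>>(Collections.reverseOrder());
        for (Map.Entry<String, Integer> e : orderByEngine.entrySet()) {
            Set<String> group = groups.get(e.getValue());
            if (group == null) {
                group = new LinkedHashSet<String>();
                groups.put(e.getValue(), group);
            }
            group.add(e.getKey());
        }
        // Every engine depends on all engines of the preceding group.
        Map<String, Set<String>> dependsOn = new LinkedHashMap<String, Set<String>>();
        Set<String> previous = Collections.emptySet();
        for (Set<String> group : groups.values()) {
            for (String engine : group) {
                dependsOn.put(engine, previous);
            }
            previous = group;
        }
        return dependsOn;
    }
}
```
+
+Engines sharing an order value end up with identical dependencies and no
dependency on each other, which is what allows them to run in parallel.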
+
+### GraphChain
+
+This Chain implementation is based on an execution graph passed as configuration.
+
+TODO: define how users can provide such serialized graphs.
+
+NOTE: We could also provide the possibility that the execution graph is passed
as an additional parameter to a specific request to the enhancer.
+
+### DefaultChain
+
+Implementation that keeps track of all currently active EnhancementEngines and
registers itself as a Chain service with
"stanbol.enhancer.chain.name=default" and a service ranking of
Integer.MIN_VALUE.
+
+This will provide a Chain returned by ChainManager.getDefault() that will
result in the same enhancement process as Stanbol used before the addition of
Chains.
+
+Note that users can change the default chain by either stopping this component
or adding another Chain with "stanbol.enhancer.chain.name=default" and a
higher service ranking.
+
+### SingleEngineChain
+
+This is basically an Adapter that allows a single EnhancementEngine to be
executed within a Chain. Chains of this type will not be registered as OSGI
services. Instances will be created on request for single EnhancementEngines
and directly passed to the EnhancementJobManager implementation.
+
+Note that pre-existing metadata might still be passed within a multipart
content item as defined by STANBOL-414.
+
+Enhancement Engines
+----
+
+This section gives an overview of the changes to the Java API for
EnhancementEngines and also defines the new EnhancementEngineManager service.
+
+### EnhancementEngine
+
+With the extension to the Stanbol Enhancer engines will provide additional
metadata.
+
+* __Name:__ Defined by the value of the property
"stanbol.enhancer.engine.name" it will be used to access Engines on the Stanbol
RESTful interface
+* __Service Ranking:__ The service ranking property defined by OSGI will be
used to decide which engine to use in case several active EnhancementEngines do
use the same name. In such cases only the Engine with the highest ranking will
be used to enhance ContentItems.
+* __Configuration:__ Each EnhancementEngine MAY provide an RDF graph with its
configuration. This graph will be returned on a GET request to the URL of the
EnhancementEngine. If no configuration is known for the engine, this MUST at
least return a single triple with the name of the engine.
+
+_TODO:_ To correctly construct this graph the Engine needs to know this URL.
This could e.g. be provided by some OSGI environment parameter set by the
JerseyApplication. As an alternative we could also pass this URI as a
parameter to the getEngineConfig method.
+
+These changes will result in the following adapted interface for Enhancement
Engines.
+
+ /** Getter for the value of the "stanbol.enhancer.engine.name" property */
+ + getName() : String
+ /** Getter for the service ranking of this engine **/
+ + getRanking() : int
+ /** The configuration of the Engine as RDF Graph or NULL. **/
+ + getEngineConfig() : Graph
+ + canEnhance(ContentItem ci) : int
+ + computeEnhancements(ContentItem ci)
+
+
+### EnhancementEngineManager
+
+New Utility that keeps track of all active EnhancementEngines and supports
lookup for Enhancement Engines based on the "stanbol.enhancer.engine.name"
property.
+
+ + getEngine(String name) : EnhancementEngine
+ + getEngines(String name) : List<EnhancementEngine>
+ + isEngine(String name) : boolean
+ + getActiveEngines(Chain chain) : Map<String,EnhancementEngine>
+
+Enhancement Process
+----
+
+This section describes changes to the Enhancement Process caused by the
addition of Chains. It also specifies what EnhancementEngines and
EnhancementJobManager implementations need to take care of to allow
asynchronous and parallel execution of multiple EnhancementEngines for the same
ContentItem.
+
+Note that work on the asynchronous enhancement process is covered by
[STANBOL-46](https://issues.apache.org/jira/browse/STANBOL-46).
+
+### EnhancementJobManager
+
+The interface of the EnhancementJobManager will change due to the addition of
Chains and will in future only contain a single method that enhances a
ContentItem by using the execution plan provided by the passed Chain.
+
+ + enhanceContent(ContentItem ci, Chain chain)
+
+
+Information about the Enhancement Engines is now available via
+
+* _Chain#getEngines():_ This returns the names of all Engines referenced by a
Chain
+* _EnhancementEngineManager#getActiveEngines(Chain chain):_ This returns the
currently active Engines based on the configuration of the chain.
+
+By combining the results of both methods it is easy to retrieve the list of
Engines used by a Chain and also to check if a Chain can be executed based on
the currently active EnhancementEngines.
+
+The getter for the active EnhancementEngines now also takes the Chain as a
parameter.
+
+The getter for the active enhancement engines is intended to be used to check
if all Engines referenced by a Chain (see Chain#getEngines() method) are
currently active.
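+
+Such a check amounts to a set difference between the names returned by
Chain#getEngines() and the keys of the map of active engines. The following is
a minimal sketch with plain collections. ChainCheckSketch and missingEngines
are hypothetical names and not part of the specified API.
+
```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class ChainCheckSketch {
    /** Returns the names of engines referenced by the chain that are not active. */
    static Set<String> missingEngines(Set<String> chainEngines,
                                      Map<String, ?> activeEngines) {
        // Copy first so the set passed by the caller is not modified.
        Set<String> missing = new LinkedHashSet<String>(chainEngines);
        missing.removeAll(activeEngines.keySet());
        return missing;
    }
}
```
+
+An empty result means that every referenced engine is active. Whether a
non-empty result prevents execution additionally depends on which of the
missing engines are marked as optional in the execution plan.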
+
+### ContentItem
+
+The interface of the ContentItem also needs to undergo a slight change to add
support for read/write locks on the MGraph holding the metadata. For details
see the following section about asynchronous execution.
+
+Because of that, the return type of the getMetadata method needs to be changed
from MGraph to LockableMGraph.
+
+ + getMetadata() : LockableMGraph
+
+### AsynchronousExecution
+
+The "EnhancementEngine#canEnhance(ContentItem ci) : int" method can indicate
whether an engine can or cannot enhance a ContentItem. In addition, this method
can also indicate to the EnhancementJobManager whether an Engine supports
asynchronous execution. This section specifies how the EnhancementJobManager
needs to use this information to support asynchronous and/or parallel execution
of multiple EnhancementEngines.
+
+As soon as EnhancementEngines are executed asynchronously, this might also
result in situations where multiple Engines need to access the ContentItem
concurrently. Therefore the access to the ContentItem - especially to the
metadata - MUST be synchronized. Implementors of EnhancementEngines MUST
especially be careful when using Iterators as returned by Clerezza's
TripleCollection, MGraph and also the GraphNode utility, because such Iterators
will throw ConcurrentModificationExceptions if the underlying graph is
modified during iteration.
+
+Because of that Engines that support EnhancementEngine#ENHANCE_ASYNC need to
use the
[ReadWriteLock](http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/locks/ReadWriteLock.html)
provided by the LockableMGraph returned by ContentItem#getMetadata(). The
following code snippets show how to use read and write locks with the metadata
graph.
+
+    LockableMGraph metadata = ci.getMetadata();
+    Lock readLock = metadata.getLock().readLock();
+    readLock.lock();
+    try {
+        Iterator<Triple> it = metadata.filter(…);
+        while(it.hasNext()){
+            Triple triple = it.next();
+            /** process the triple */
+        }
+    } finally {
+        readLock.unlock();
+    }
+
+    Lock writeLock = metadata.getLock().writeLock();
+    writeLock.lock();
+    try {
+        /** write new Enhancements to the Graph */
+    } finally {
+        writeLock.unlock();
+    }
+
+__IMPORTANT:__ Do not try to get a write lock while holding a read lock
because this may cause deadlocks. That is because read locks can be obtained
simultaneously by multiple threads while write locks are exclusive. So if two
threads holding a read lock also try to obtain a write lock, they will block
each other.
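+
+The safe pattern is to finish the read phase, release the read lock, and only
then request the write lock. The sketch below shows this with the JDK's
ReentrantReadWriteLock, which does not support upgrading a read lock to a write
lock. It is an illustration under stated assumptions: a plain list stands in
for the metadata graph, and LockOrderSketch and its methods are hypothetical
names.
+
```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockOrderSketch {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> metadata = new ArrayList<String>();

    /** Reads under the read lock, releases it, then writes under the write lock. */
    int enhance(String newTriple) {
        int seen;
        lock.readLock().lock();
        try {
            seen = metadata.size();   // read phase: collect what is needed
        } finally {
            lock.readLock().unlock(); // release BEFORE asking for the write lock
        }
        lock.writeLock().lock();
        try {
            metadata.add(newTriple);  // write phase
        } finally {
            lock.writeLock().unlock();
        }
        return seen;
    }

    /** Number of entries, read under the read lock. */
    int size() {
        lock.readLock().lock();
        try {
            return metadata.size();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```
+
+Note that other threads may modify the graph between the two phases, so an
engine should not assume that the data it read is still current when it writes.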
+
+EnhancementEngines that do NOT support EnhancementEngine#ENHANCE_ASYNC -
meaning that the canEnhance method only returns
EnhancementEngine#CANNOT_ENHANCE or EnhancementEngine#ENHANCE_SYNCHRONOUS - do
not need to obtain read and write locks. The EnhancementJobManager
implementation MUST ensure that they have exclusive access to the
Enhancement Graph. This can be done either by obtaining a write lock before
calling such enhancement engines or by ensuring that no other engines are
called in parallel.
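+
+The first option, obtaining the write lock on behalf of a synchronous engine,
could look roughly like this inside a job manager. This is a sketch, not the
specified API: SyncEngineRunner and runExclusive are hypothetical names, and
the engine call is represented by a Runnable.
+
```java
import java.util.concurrent.locks.ReadWriteLock;

class SyncEngineRunner {
    /**
     * Runs the call to a synchronous engine while holding the write lock,
     * so that no engine using read or write locks can access the graph
     * at the same time.
     */
    static void runExclusive(ReadWriteLock lock, Runnable engineCall) {
        lock.writeLock().lock();
        try {
            engineCall.run();
        } finally {
            lock.writeLock().unlock(); // released even if the engine fails
        }
    }
}
```
+
+This only protects against engines that themselves use the locks of the same
graph, which is why asynchronous engines are required to do so.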
+
+In cases where the EnhancementJobManager can execute multiple engines in
parallel it is good practice to first start the execution of Engines that
support EnhancementEngine#ENHANCE_ASYNC. This will allow such engines to obtain
a read lock to read the data necessary for their calculations before the
EnhancementJobManager needs to obtain an exclusive write lock for calling
EnhancementEngines that only support EnhancementEngine#ENHANCE_SYNCHRONOUS.
+
+