STANBOL-414-specification.mdtext

rwesten Mon, 09 Jan 2012 02:11:59 -0800

Author: rwesten
Date: Mon Jan  9 10:11:21 2012
New Revision: 1229080

URL: http://svn.apache.org/viewvc?rev=1229080&view=rev
Log:
Specification for the Java API related to STANBOL-414 and STANBOL-46


Added:
    
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/STANBOL-414-specification.mdtext

Added: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/STANBOL-414-specification.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/STANBOL-414-specification.mdtext?rev=1229080&view=auto
==============================================================================
--- 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/STANBOL-414-specification.mdtext
 (added)
+++ 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/STANBOL-414-specification.mdtext
 Mon Jan  9 10:11:21 2012
@@ -0,0 +1,258 @@
+This document provides the specification of the Java API for the extensions 
to the RESTful services of the Stanbol Enhancer as described in 
[STANBOL-414](https://issues.apache.org/jira/browse/STANBOL-414).
+
+Enhancement Chains
+----
+
+A Chain represents a configuration that defines which engines are used to 
process ContentItems and in what order. Chains are registered as OSGI services 
and identified by the "stanbol.enhancer.chain.name" property.
+
+### Chain
+
+The Chain provides its configuration in the form of an RDF graph.
+
+    /** Getter for the execution plan */
+    + getExecutionPlan() : Graph
+    /** Getter for the name of the Engines referenced by this Chain */
+    + getEngines() : Set<String>
+
+The getEngines method may return the engine names in any order. It is 
mainly intended for situations where only the engines used by a chain need to 
be known (e.g. visualized) but the actual chain does not need to be executed.
+
+The returned Graph holding the execution plan MUST BE read-only and final, 
meaning that a change in the configuration of a Chain MUST NOT change a graph 
already returned by an earlier call to the getExecutionPlan method.
+
+Because the configuration of a Chain might change at any time, a JobManager 
MUST retrieve the Graph holding the execution plan before it starts the actual 
processing of the ContentItem. This plan MUST BE used for the whole enhancement 
process. Later changes to the configuration MUST NOT be reflected in the 
enhancement of a ContentItem.
+
+### ChainManager
+
+The ChainManager is a service that tracks all Chains registered as a service 
in the OSGI Environment of the Stanbol Enhancer. It provides a simple API to 
retrieve a chain based on its name:
+
+    /** Getter for the Chain for a given name */
+    + getChain(String name) : Chain
+    /** Getter for all Chains for a name sorted by service ranking */
+    + getChains(String name) : List<Chain>
+    /** Checks if there is a chain for the given name */
+    + isChain(String name) : boolean
+    /** Getter for the default chain */
+    + getDefault() : Chain
+
+The default Chain is used if no chain is specified in a request (e.g. when 
calling the /engines endpoint). The default Chain is the chain with the highest 
service ranking.
+
+ALTERNATIVE: The default Chain is the Chain with 
"stanbol.enhancer.chain.name=default" and the highest service ranking. If no 
Chain with the name "default" exists the Chain with the highest service ranking 
is assumed to be the default chain.
+
+The default configuration of Stanbol MUST provide a Chain instance with the 
property "stanbol.enhancer.chain.name=default" and a service ranking of 
Integer.MIN_VALUE that includes all currently active enhancement engines.
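+
+The ALTERNATIVE selection rule above can be sketched as follows. This is only 
an illustration: ChainRef and DefaultChainSelector are hypothetical stand-ins 
(a real ChainManager would track OSGI service references), and only the name 
and the service ranking of each Chain are modelled.
+
```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a registered Chain: its name and service ranking. */
class ChainRef {
    final String name;
    final int ranking;

    ChainRef(String name, int ranking) {
        this.name = name;
        this.ranking = ranking;
    }
}

/** Sketch of the ALTERNATIVE rule for choosing the default Chain. */
class DefaultChainSelector {

    /** Returns the chain with the highest ranking, or null for an empty list. */
    static ChainRef highestRanked(List<ChainRef> chains) {
        ChainRef best = null;
        for (ChainRef chain : chains) {
            if (best == null || chain.ranking > best.ranking) {
                best = chain;
            }
        }
        return best;
    }

    /** Prefers chains named "default"; falls back to all chains if none exists. */
    static ChainRef selectDefault(List<ChainRef> chains) {
        List<ChainRef> named = new ArrayList<ChainRef>();
        for (ChainRef chain : chains) {
            if ("default".equals(chain.name)) {
                named.add(chain);
            }
        }
        return highestRanked(named.isEmpty() ? chains : named);
    }
}
```
+
+With this rule the Chain shipped with the default configuration (name 
"default", ranking Integer.MIN_VALUE) stays the default until another Chain 
named "default" with a higher ranking is registered.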
+
+### ExecutionPlan
+
+The execution plan needs to be created by the chain based on its current 
configuration. This plan is read-only and MUST NOT be changed if the 
configuration of the Chain changes. This means that the Chain MUST create a new 
Graph instance if the execution plan changes as a result of a change in the 
configuration. It MUST NOT change any execution plan passed to other components 
by the getExecutionPlan() method.
+
+The RDFS schema used for the execution plan is defined as follows.
+
+ * Namespace: ep : http://stanbol.apache.org/ontology/enhancer/executionplan#
+ * __ep:ExecutionNode__ : Class used for all Nodes representing the execution 
of an Enhancement Engine.
+ * __ep:engine__ (domain: ep:ExecutionNode; range: xsd:string): The property 
used to link to the Enhancement Engine by the name of the engine.
+ * __ep:dependsOn__ (domain: ep:ExecutionNode; range: ep:ExecutionNode): 
Defines that the execution of this node depends on the completion of the 
referenced one.
+ * __ep:optional__ (domain: ep:ExecutionNode; range: xsd:boolean): Can be used 
to specify that the execution of this EnhancementEngine is optional. If this 
property is set to TRUE, an engine will be marked as executed even if its 
execution was not possible (e.g. because no engine with this name was 
active) or the execution failed (e.g. because of an Exception).
+
+#### Example:
+
+This example shows an ExecutionPlan with five nodes for the "langId", "ner", 
"dbpediaLinking", "geonamesLinking" and "zemanta" engines. Note that these 
names refer to actual EnhancementEngine services registered with the current 
OSGI Environment.
+
+This example assumes that
+
+* "langId" is the singleton instance of LangIdEnhancementEngine
+* "ner" is the default instance of the NamedEntityExtractionEnhancementEngine 
engine
+* "dbpediaLinking" is an instance of the NamedEntityTaggingEngine configured 
to use the dbpedia.org ReferencedSite of the Entityhub
+* "geonamesLinking" is an instance of the NamedEntityTaggingEngine configured 
to use the geonames.org ReferencedSite
+* "zemanta" is the singleton instance of the ZemantaEnhancementEngine
+
+The RDF graph of such a chain would look as follows:
+
+    urn:node1
+        rdf:type ep:ExecutionNode
+        ep:engine langId
+
+    urn:node2
+        rdf:type ep:ExecutionNode
+        ep:dependsOn urn:node1
+        ep:engine ner
+
+    urn:node3
+        rdf:type ep:ExecutionNode
+        ep:dependsOn urn:node2
+        ep:engine dbpediaLinking
+
+    urn:node4
+        rdf:type ep:ExecutionNode
+        ep:dependsOn urn:node2
+        ep:engine geonamesLinking
+
+    urn:node5
+        rdf:type ep:ExecutionNode
+        ep:engine zemanta
+        ep:optional "true"^^xsd:boolean
+
+This plan defines that the "langId" and the "zemanta" engines do not depend on 
anything and can therefore be executed from the start (even in parallel if the 
JobManager executing this chain supports this). The execution of the "ner" 
engine depends on the extraction of the language, and the execution of the 
entity linking to dbpedia and geonames depends on the "ner" engine. Note that 
"dbpediaLinking" and "geonamesLinking" could also be processed in parallel.
+
+
+#### ExecutionPlan Utility:
+
+The Enhancer MUST also define a utility that provides the following method:
+    
+    /** Getter for the list of executable ep:ExecutionNodes */
+    + getExecuteable(Graph executionPlan, Set<NonLiteral> completed) : 
Collection<NonLiteral>
+
+This method takes an execution plan and the set of already executed nodes as 
input and returns the ExecutionNodes that can be executed next. The 
existing utility methods within the EnhancementEngineHelper can be used to 
retrieve further information from the ep:ExecutionNodes returned by this 
method.
+
+Typically, code using this utility will look like this (pseudo code):
+
+    Graph executionPlan = chain.getExecutionPlan();
+    Map<String, EnhancementEngine> engines =
+        enhancementEngineManager.getActiveEngines(chain);
+    // nodes of the execution plan that have already been processed
+    Set<NonLiteral> executed = new HashSet<NonLiteral>();
+    Collection<NonLiteral> next;
+    while(!(next = ExecutionPlanUtils.getExecuteable(
+            executionPlan, executed)).isEmpty()){
+        for(NonLiteral node : next){
+            // look up the engine by the ep:engine value of the node
+            EnhancementEngine engine = engines.get(
+                EnhancementEngineHelper.getString(
+                    executionPlan, node, EX_ENGINE));
+            // ep:optional value of the node
+            Boolean optional = EnhancementEngineHelper.get(
+                executionPlan, node, EX_OPTIONAL, Boolean.class,
+                literalFactory);
+            /* Execute the Engine */
+            executed.add(node);
+        }
+    }
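+
+The selection logic behind getExecuteable can be illustrated with a minimal, 
self-contained sketch. It does not use the Clerezza Graph API: as a simplifying 
assumption, the plan is modelled as a Map from an engine name to the names it 
depends on (its ep:dependsOn targets), and ExecutionPlanSketch is a 
hypothetical class name. The example plan follows the dependencies described 
for the five engines above (the linking engines wait for "ner").
+
```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of the "what can run next" selection on a simplified plan. */
class ExecutionPlanSketch {

    /**
     * Returns all nodes that are not yet completed and whose dependencies
     * are all contained in the completed set.
     */
    static Set<String> getExecuteable(Map<String, Set<String>> plan,
                                      Set<String> completed) {
        Set<String> executeable = new TreeSet<String>();
        for (Map.Entry<String, Set<String>> node : plan.entrySet()) {
            boolean pending = !completed.contains(node.getKey());
            if (pending && completed.containsAll(node.getValue())) {
                executeable.add(node.getKey());
            }
        }
        return executeable;
    }

    /** The five-node example: node name -> names it depends on. */
    static Map<String, Set<String>> examplePlan() {
        Map<String, Set<String>> plan = new HashMap<String, Set<String>>();
        plan.put("langId", setOf());
        plan.put("ner", setOf("langId"));
        plan.put("dbpediaLinking", setOf("ner"));
        plan.put("geonamesLinking", setOf("ner"));
        plan.put("zemanta", setOf());
        return plan;
    }

    private static Set<String> setOf(String... names) {
        return new HashSet<String>(Arrays.asList(names));
    }
}
```
+
+Starting with an empty completed set, this yields "langId" and "zemanta" 
first, then "ner", then both linking engines, and finally an empty set, which 
ends the loop shown in the pseudo code.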
+
+Chain implementations
+----
+
+### WeightedChain
+
+This Chain implementation takes a list of engine names as input and uses the 
"org.apache.stanbol.enhancer.engine.order" metadata provided by such engines 
to calculate the ExecutionGraph.
+
+Similar to the current WeightedJobManager implementation, Engines would 
depend on each other based on decreasing order values. Engines with the same 
order value could be executed in parallel.
+
+This implementation is targeted at easy configuration - just a list of the 
engine names contained within a chain - but has limited possibilities to 
control the execution order within a chain. However, it is expected to 
provide enough flexibility for most usage scenarios.
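+
+How dependencies could be derived from order values is sketched below. This 
is an illustration only, not the WeightedChain implementation: the class name 
WeightedChainSketch is hypothetical, the order values are passed in as a plain 
Map, and the sketch assumes that engines with higher order values run first 
and that each engine depends on all engines of the next higher order value.
+
```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: derive "depends on" relations from engine order values. */
class WeightedChainSketch {

    /**
     * Groups engines by order value (highest first). Every engine depends on
     * the engines of the preceding group; engines within one group share the
     * same dependencies and could therefore run in parallel.
     */
    static Map<String, List<String>> dependencies(Map<String, Integer> orders) {
        TreeMap<Integer, List<String>> groups = new TreeMap<Integer, List<String>>(
            Collections.<Integer>reverseOrder());
        for (Map.Entry<String, Integer> entry : orders.entrySet()) {
            List<String> group = groups.get(entry.getValue());
            if (group == null) {
                group = new ArrayList<String>();
                groups.put(entry.getValue(), group);
            }
            group.add(entry.getKey());
        }
        Map<String, List<String>> dependsOn = new LinkedHashMap<String, List<String>>();
        List<String> previous = Collections.emptyList();
        for (List<String> group : groups.values()) {
            Collections.sort(group); // stable output for this sketch
            for (String engine : group) {
                dependsOn.put(engine, previous);
            }
            previous = group;
        }
        return dependsOn;
    }
}
```
+
+For example, with the (made up) order values langId=100, ner=50, 
dbpediaLinking=0 and geonamesLinking=0, "ner" depends on "langId" and both 
linking engines depend on "ner".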
+
+### GraphChain
+
+This Chain implementation is based on an ExecutionGraph passed as configuration.
+
+TODO: define how users can provide such serialized graphs.
+
+NOTE: We could also provide the possibility that the execution graph is passed 
as an additional parameter to a specific request to the enhancer.
+
+### DefaultChain
+
+Implementation that keeps track of all currently active EnhancementEngines and 
registers itself as a Chain service with the property 
"stanbol.enhancer.chain.name=default" and a service ranking of 
Integer.MIN_VALUE.
+
+This will provide a Chain returned by ChainManager.getDefault() that will 
result in the same enhancement process as Stanbol used before the addition of 
Chains.
+
+Note that users can change the default chain by either stopping this component 
or adding another Chain with "stanbol.enhancer.chain.name=default" and a 
higher service ranking.
+
+### SingleEngineChain
+
+This is basically an Adapter that allows executing a single EnhancementEngine 
within a Chain. This type of Chain will not be registered as an OSGI service. 
Instances will be created on request for single EnhancementEngines and directly 
passed to the EnhancementJobManager implementation.
+
+Note that pre-existing metadata might still be passed within a multipart 
content item as defined by STANBOL-414.
+
+Enhancement Engines
+----
+
+This section gives an overview of the changes to the Java API for 
EnhancementEngines and also defines the new EnhancementEngineManager service.
+
+### EnhancementEngine
+
+With the extension to the Stanbol Enhancer, engines will provide additional 
metadata.
+
+* __Name:__ Defined by the value of the property 
"stanbol.enhancer.engine.name". It will be used to access Engines on the 
Stanbol RESTful interface.
+* __Service Ranking:__ The service ranking property defined by OSGI will be 
used to decide which engine to use in case several active EnhancementEngines 
use the same name. In such cases only the Engine with the highest ranking will 
be used to enhance ContentItems.
+* __Configuration:__ Each EnhancementEngine MAY provide an RDF graph with its 
configuration. This graph will be returned on a GET request on the URL of the 
EnhancementEngine. If no configuration is known for the engine, this MUST at 
least return a single triple with the name of the engine.
+
+_TODO:_ To correctly construct this graph the Engine needs to know this URL. 
This could e.g. be provided by some OSGI environment parameter set by the 
JerseyApplication. As an alternative we could also pass this URI as a 
parameter to the getEngineConfig method.
+
+These changes will result in the following adapted interface for Enhancement 
Engines.
+
+    /** Getter for the value of the "stanbol.enhancer.engine.name" property */
+    + getName() : String
+    /** Getter for the service ranking of this engine **/
+    + getRanking() : int
+    /** The configuration of the Engine as RDF Graph or NULL. **/
+    + getEngineConfig() : Graph
+    + canEnhance(ContentItem ci) : int
+    + computeEnhancements(ContentItem ci)
+
+
+### EnhancementEngineManager
+
+New Utility that keeps track of all active EnhancementEngines and supports 
lookup for Enhancement Engines based on the "stanbol.enhancer.engine.name" 
property.
+
+    + getEngine(String name) : EnhancementEngine
+    + getEngines(String name) : List<EnhancementEngine>
+    + isEngine(String name) : boolean
+    + getActiveEngines(Chain chain) : Map<String,EnhancementEngine>
+
+Enhancement Process
+----
+
+This section describes changes to the Enhancement Process caused by the 
addition of Chains. It also specifies what EnhancementEngines and 
EnhancementJobManager implementations need to take care of to allow 
asynchronous and parallel execution of multiple EnhancementEngines for the 
same ContentItem.
+
+Note that work on the asynchronous enhancement process is covered by 
[STANBOL-46](https://issues.apache.org/jira/browse/STANBOL-46).
+
+### EnhancementJobManager
+
+The interface of the EnhancementJobManager will change due to the addition of 
Chains. In the future it will only contain a single method that enhances a 
ContentItem by using the execution plan provided by the passed Chain.
+
+    + enhanceContent(ContentItem ci, Chain chain)
+    
+
+Information about the Enhancement Engines is now available via
+
+* _Chain#getEngines():_ This returns the names of all Engines referenced by a 
Chain
+* _EnhancementEngineManager#getActiveEngines(Chain chain):_ This returns the 
currently active Engines based on the configuration of the chain.
+
+By combining the results of both methods it is easy to retrieve the list of 
Engines used by a Chain and also to check if a Chain can be executed based on 
the currently active EnhancementEngines.
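+
+Such a check could look like the following sketch. It only assumes that the 
result of Chain#getEngines() is available as a Set of names and the result of 
getActiveEngines(Chain) as a Map keyed by engine name; ChainCheckSketch and 
missingEngines are hypothetical names used for illustration.
+
```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch: compare the engines referenced by a chain with the active ones. */
class ChainCheckSketch {

    /**
     * Returns the names referenced by the chain for which no active engine
     * is available. An empty result means all referenced engines are active.
     */
    static Set<String> missingEngines(Set<String> referenced, Map<String, ?> active) {
        Set<String> missing = new TreeSet<String>(referenced);
        missing.removeAll(active.keySet());
        return missing;
    }
}
```
+
+Note that a non-empty result does not necessarily prevent execution: nodes 
marked with ep:optional in the execution plan may refer to inactive engines.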
+
+The getter for the active EnhancementEngines now also takes the Chain as a 
parameter.
+
+The getter for the active enhancement engines is intended to be used to check 
if all Engines referenced by a Chain (see Chain#getEngines() method) are 
currently active.
+
+### ContentItem
+
+The interface of the ContentItem also needs to undergo a slight change to add 
the ability for read/write locks to the MGraph holding the metadata. For 
details, see the following section about asynchronous execution.
+
+Because of that, the type of the return value of the getMetadata method needs 
to be changed from MGraph to LockableMGraph
+
+    + getMetadata() : LockableMGraph
+
+### AsynchronousExecution
+
+The "EnhancementEngine#canEnhance(ContentItem ci) : int" method can indicate 
if an engine can or cannot enhance a ContentItem. In addition this method can 
also indicate to the EnhancementJobManager if an Engine supports 
asynchronous execution. This section specifies how the EnhancementJobManager 
needs to use this information to support asynchronous and/or parallel execution 
of multiple EnhancementEngines.
+
+As soon as EnhancementEngines are executed asynchronously this might also 
result in situations where multiple Engines need to access the ContentItem 
concurrently. Therefore the access to the ContentItem - especially to the 
metadata - MUST BE synchronized. Implementors of EnhancementEngines MUST 
especially be careful when using Iterators as returned by Clerezza's 
TripleCollection, MGraph and also the GraphNode utility, because such Iterators 
will throw ConcurrentModificationExceptions if the underlying graph is 
modified during iteration.
+
+Because of that Engines that support EnhancementEngine#ENHANCE_ASYNC need to 
use the 
[ReadWriteLock](http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/locks/ReadWriteLock.html)
 provided by the  LockableMGraph returned by ContentItem#getMetadata(). The 
following code snippets show how to use read and write locks with the metadata 
graph.
+
+    LockableMGraph metadata = ci.getMetadata();
+    // read access: several threads may hold the read lock at the same time
+    Lock readLock = metadata.getLock().readLock();
+    readLock.lock();
+    try {
+        Iterator<Triple> it = metadata.filter(…);
+        while(it.hasNext()){
+            /** process the triples */
+        }
+    } finally {
+        readLock.unlock();
+    }
+    
+    // write access: exclusive, acquired only after the read lock is released
+    Lock writeLock = metadata.getLock().writeLock();
+    writeLock.lock();
+    try {
+        /** write new Enhancements to the Graph */
+    } finally {
+        writeLock.unlock();
+    }
+
+__IMPORTANT:__ Do not try to get a write lock while holding a read lock 
because this may cause deadlocks. That is because read locks can be obtained 
simultaneously by multiple threads while write locks are exclusive. So if two 
threads holding a read lock also try to obtain a write lock, they will block 
each other.
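+
+The effect can be reproduced with the JDK alone. The sketch below assumes a 
java.util.concurrent.locks.ReentrantReadWriteLock as a stand-in for the lock 
of the metadata graph (whether LockableMGraph uses exactly this class is an 
assumption) and uses tryLock() instead of lock() so that the failing case 
returns false instead of blocking. LockUpgradeSketch is a hypothetical name.
+
```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch: a read lock cannot be upgraded to a write lock. */
class LockUpgradeSketch {

    /** Tries to take the write lock while the read lock is still held. */
    static boolean upgradeWhileReading(ReentrantReadWriteLock lock) {
        lock.readLock().lock();
        try {
            boolean acquired = lock.writeLock().tryLock();
            if (acquired) {
                lock.writeLock().unlock();
            }
            return acquired;
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Releases the read lock first and only then takes the write lock. */
    static boolean writeAfterReading(ReentrantReadWriteLock lock) {
        lock.readLock().lock();
        try {
            // read the required data here
        } finally {
            lock.readLock().unlock();
        }
        boolean acquired = lock.writeLock().tryLock();
        if (acquired) {
            // write here, then release the exclusive lock again
            lock.writeLock().unlock();
        }
        return acquired;
    }
}
```
+
+With an otherwise unused lock, upgradeWhileReading returns false and 
writeAfterReading returns true. Keep in mind that data read under the read 
lock may have changed by the time the write lock is obtained.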
+
+EnhancementEngines that do NOT support EnhancementEngine#ENHANCE_ASYNC - 
meaning that the canEnhance method only returns 
EnhancementEngine#CANNOT_ENHANCE or EnhancementEngine#ENHANCE_SYNCHRONOUS - do 
not need to obtain read and write locks. The EnhancementJobManager 
implementation MUST ensure that they have exclusive access to the 
Enhancement Graph. This can be done either by obtaining a write lock before 
calling such enhancement engines or by ensuring that no other engines are 
called in parallel.
+
+In cases where the EnhancementJobManager can execute multiple engines in 
parallel it is good practice to first start the execution of Engines that 
support EnhancementEngine#ENHANCE_ASYNC. This will allow such engines to obtain 
a read lock to read the data necessary for their calculations before the 
EnhancementJobManager needs to obtain an exclusive write lock for calling 
EnhancementEngines that only support EnhancementEngine#ENHANCE_SYNCHRONOUS.
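+
+The recommended start order can be sketched as a simple partition of the 
engines that are executable at the same time. The numeric values of the three 
constants are assumptions made for this sketch (the specification only names 
them), and StartOrderSketch is a hypothetical class name.
+
```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch: start asynchronous engines before synchronous ones. */
class StartOrderSketch {
    // Values assumed for this sketch only.
    static final int CANNOT_ENHANCE = 0;
    static final int ENHANCE_SYNCHRONOUS = 1;
    static final int ENHANCE_ASYNC = 2;

    /**
     * Takes the canEnhance result per engine name and returns the names in
     * start order: ENHANCE_ASYNC engines first, then ENHANCE_SYNCHRONOUS
     * engines. Engines reporting CANNOT_ENHANCE are left out.
     */
    static List<String> startOrder(Map<String, Integer> canEnhance) {
        List<String> async = new ArrayList<String>();
        List<String> sync = new ArrayList<String>();
        for (Map.Entry<String, Integer> entry : canEnhance.entrySet()) {
            int state = entry.getValue();
            if (state == ENHANCE_ASYNC) {
                async.add(entry.getKey());
            } else if (state == ENHANCE_SYNCHRONOUS) {
                sync.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<String>(async);
        order.addAll(sync);
        return order;
    }
}
```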
+
+
