Semantic Indexes
----------------
Key: STANBOL-499
URL: https://issues.apache.org/jira/browse/STANBOL-499
Project: Stanbol
Issue Type: Sub-task
Components: Content Hub
Reporter: Rupert Westenthaler
The SemanticIndex is the interface used by the ContentHub to semantically index
ContentItems (2nd level store). It is anticipated that a ContentHub will manage
multiple semantic indexes, possibly of different implementations.
Expected implementations of this interface include:
* The current Solr/LDPath based semantic index component
* The current Contenthub default index (also Solr based)
* A SPARQL based variant implemented by a Triple Store
The remainder of this specification defines the SemanticIndex interface as
well as the SemanticIndexManager.
SemanticIndex
--------------------
The Java interface for semantic indexes as used by the Apache Stanbol Contenthub.
### Identification
:::java
/** The name of the Index */
+ getName()
/** An optional free text description */
+ getDescription()
The name of the semantic index is intended to be used for simple lookups as
well as for relative paths within the RESTful interfaces. However, it MUST NOT
be considered unique. See section [Semantic Index
Management](#Semantic_Index_Management) for details on how name conflicts are
resolved.
### Indexing
First, the interface defines methods for adding documents to and removing them
from the semantic index:
:::java
/** Indexes the passed ContentItem */
+ index(ContentItem ci) : boolean
/** Deletes the ContentItem with the passed id */
+ remove(String ciUri)
/** Ensures that changes to the index are persisted */
+ persist(long revision)
/** Getter for the highest successfully persisted revision */
+ getRevision() : long
The boolean returned by the index method indicates whether the passed
ContentItem was actually included in the semantic index. A semantic index may
define filters on the content items to be included.
The persist method is intended to signal to the semantic index that indexing
has finished. This allows the semantic index to batch multiple calls to
index(..) and remove(..), which may improve performance when indexing multiple
ContentItems.
In addition, it is used to pass the highest revision of an indexed content item.
If no revision has yet been announced to a semantic index (that is, persist(..)
was never called), then getRevision() shall return a negative number.
The revision will be used by the ContentHub to re-synchronize the contents of a
semantic index with the enhanced ContentItems present in the
[Store](store.html) when the index becomes active. Usually the long value will
represent the time in milliseconds as returned by
<code>System.currentTimeMillis()</code>, but this is not a requirement. It is
only important that each change to the Store results in an increase of this
number.
All of the above methods may throw a SemanticIndexingException, which is a
subclass of ContenthubException.
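
The batching and revision rules above can be illustrated with a small in-memory sketch. It is not the Stanbol implementation: the class name `BatchingIndexSketch`, the use of plain strings instead of ContentItem, the empty-content filter and the `size()` helper are all assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory index; ContentItems are reduced to (uri, text) pairs. */
class BatchingIndexSketch {
    /** Persisted state: content item URI -> indexed text. */
    private final Map<String, String> committed = new LinkedHashMap<>();
    /** Changes collected since the last persist(..); a null value marks a removal. */
    private final Map<String, String> pending = new LinkedHashMap<>();
    /** Stays negative until persist(..) has been called at least once. */
    private long revision = -1L;

    /** Returns false if the item is rejected by the (assumed) filter. */
    boolean index(String ciUri, String text) {
        if (text == null || text.isEmpty()) {
            return false; // assumed filter: items without content are not indexed
        }
        pending.put(ciUri, text);
        return true;
    }

    /** Schedules the removal of the content item with the passed URI. */
    void remove(String ciUri) {
        pending.put(ciUri, null);
    }

    /** Applies the whole batch and records the announced revision. */
    void persist(long announcedRevision) {
        for (Map.Entry<String, String> change : pending.entrySet()) {
            if (change.getValue() == null) {
                committed.remove(change.getKey());
            } else {
                committed.put(change.getKey(), change.getValue());
            }
        }
        pending.clear();
        revision = Math.max(revision, announcedRevision);
    }

    /** Highest successfully persisted revision, negative if none. */
    long getRevision() {
        return revision;
    }

    /** Number of persisted items (helper for this sketch only). */
    int size() {
        return committed.size();
    }
}
```

In this sketch nothing becomes visible in the persisted state until persist(..) is called, which is one way an implementation could form batches over several index(..) and remove(..) calls.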
### Index State
Semantic indexes provide the following state information:
/** The state of the semantic index */
+ getState() : IndexState
The IndexState is a simple Java enum that defines the following states:
* <code>UNINIT</code>: The index was defined and the configuration is valid, but
the contents are not yet indexed and indexing has not yet started. (Intended to
be used as the default state after creation.)
* <code>INDEXING</code>: The (initial) indexing of content items is currently
in progress. This indicates that the index is currently NOT active.
* <code>ACTIVE</code>: The semantic index is available and in sync.
* <code>REINDEXING</code>: The (re)-indexing of content items is currently in
progress. This indicates that the configuration of the semantic index was
changed in a way that requires the whole semantic index to be rebuilt. This
state still requires the index to be active, meaning that searches can be
performed normally, but recent updates/changes to ContentItems might not be
reflected. It also indicates that the index will be replaced by a different
version (maybe with changed fields) in the near future.
Note that there are no states for INACTIVE and ERROR. This is because such
states are already covered by the normal OSGi component life cycle. All of the
above IndexStates require the SemanticIndex component to be active.
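
As a minimal sketch, the four states could be written as the following Java enum. The constant names come from the list above; the `isSearchable()` helper is hypothetical and only encodes the statement that searches are possible while the index is ACTIVE or REINDEXING.

```java
/** Sketch of the IndexState enum described above. */
enum IndexState {
    /** Defined and configured, but indexing has not started yet. */
    UNINIT,
    /** Initial indexing in progress; the index is not active. */
    INDEXING,
    /** Available and in sync. */
    ACTIVE,
    /** Being rebuilt; still searchable, but possibly out of date. */
    REINDEXING;

    /** Hypothetical helper (not part of the specification). */
    boolean isSearchable() {
        return this == ACTIVE || this == REINDEXING;
    }
}
```

A client could check `index.getState().isSearchable()` before sending queries to one of the search endpoints.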
### Index Inspection
The semantic index interface provides a very simple API to inspect the
configuration of the semantic index. This part of the interface is considered
optional. Implementations that cannot provide such information shall return
<code>null</code> from the methods below.
:::java
/** The names of all fields defined by this Index */
+ getFieldsNames() : List<String>
/** Getter for the field properties */
+ getFieldProperties(String name) : Map<String,Object>
Keys for well known properties shall be defined by the services API of the
ContentHub. This includes the following:
:::java
/** The xsd:dataType for the values of a field */
DATATYPE
Implementation-specific keys shall be defined by the implementations of the
semantic index interface. A possible key for an LDPath-based semantic index
implementation is:
:::java
/** The LDPath rule used for a field */
LDPATH
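
Because inspection is optional, callers have to cope with <code>null</code> results. The following sketch shows one way to do that. The interface `InspectableIndex`, the class `FieldInspector` and the string value chosen for the `DATATYPE` key are assumptions for illustration; only the two method names are taken from the listing above.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the inspection part of SemanticIndex. */
interface InspectableIndex {
    List<String> getFieldsNames();
    Map<String, Object> getFieldProperties(String name);
}

/** Null-safe listing of the fields of an index and their data types. */
class FieldInspector {
    /** Assumed value of the well-known DATATYPE key. */
    static final String DATATYPE = "DATATYPE";

    static String describe(InspectableIndex index) {
        List<String> names = index.getFieldsNames();
        if (names == null) {
            // the implementation does not support inspection
            return "(inspection not supported)";
        }
        StringBuilder description = new StringBuilder();
        for (String name : names) {
            Map<String, Object> properties = index.getFieldProperties(name);
            Object dataType = properties == null ? null : properties.get(DATATYPE);
            description.append(name)
                       .append(" : ")
                       .append(dataType == null ? "unknown" : dataType)
                       .append('\n');
        }
        return description.toString();
    }
}
```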
### Search
The semantic index does NOT define methods to search its contents, as the
intention is to directly use the search APIs of the technologies/frameworks
used to hold the semantic index, such as
* [Apache Solr](http://lucene.apache.org/solr) RESTful API
* SPARQL in case a TripleStore is used as Semantic index.
* Contenthub featured search interface
However, the semantic index should return the URI and the type of each endpoint:
:::java
/** Getter for all supported search endpoints */
getSearchEndpoints() : Map<String,String>
The keys of the returned map are the types of the search endpoints; the values
are the URLs of the corresponding RESTful service endpoints.
For example, these are the values for a semantic index with the name "default"
that supports SOLR and the Contenthub featured search:
:::text
"CONTENTHUB" : "http://localhost:8080/contenthub/search/featured"
"SOLR" : "http://localhost:8080/solr/contenthub/default"
Another example, for an index with the name "knowledgebase" that supports a
SPARQL endpoint:
:::text
"SPARQL" : "http://localhost:8080/sparql/contenthub/knowledgebase"
Semantic Index Management
-------------------------
Semantic indexes are registered as OSGi components implementing the
"SemanticIndex" interface as described above. All active semantic indexes are
managed by the SemanticIndexManager component as follows:
### Interface
The manager provides a Java API for looking up all active semantic indexes.
This includes indexes in the UNINIT, INDEXING, ACTIVE and REINDEXING states.
Lookup of a semantic index is supported based on name and search endpoint type.
:::java
+ getIndex(String name) : SemanticIndex
+ getIndexes(String name) : List<SemanticIndex>
+ getIndex(String endpointType) : SemanticIndex
+ getIndexes(String endpointType) : List<SemanticIndex>
+ getIndex(String name, String endpointType) : SemanticIndex
+ getIndexes(String name, String endpointType) : List<SemanticIndex>
A typical query would be for an index with the name "simple" that provides the
"SOLR" endpoint:
:::java
SemanticIndexManager indexManager;
SemanticIndex index = indexManager.getIndex("simple", EndpointType.SOLR);
String solrEndpoint = index.getSearchEndpoints().get(EndpointType.SOLR);
The methods returning a single index need to resolve cases with multiple
matches by returning the SemanticIndex service with
1. the highest "service.ranking" and
2. the lowest "service.id"
This keeps the behavior consistent with the typical rules for service
selection as defined by the OSGi specification.
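
The two selection rules can be expressed as a comparator. The sketch below does not use the OSGi API; `IndexCandidate` and `IndexSelection` are hypothetical classes that only carry the values of the "service.ranking" and "service.id" properties of each matching SemanticIndex service.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical holder for the service properties of one matching index. */
class IndexCandidate {
    final String name;
    final int serviceRanking; // value of "service.ranking"
    final long serviceId;     // value of "service.id"

    IndexCandidate(String name, int serviceRanking, long serviceId) {
        this.name = name;
        this.serviceRanking = serviceRanking;
        this.serviceId = serviceId;
    }
}

/** Resolves multiple matches to a single index. */
class IndexSelection {
    /** Highest ranking first; ties are broken by the lowest service id. */
    static final Comparator<IndexCandidate> PREFERRED_FIRST =
        Comparator.comparingInt((IndexCandidate c) -> c.serviceRanking)
                  .reversed()
                  .thenComparingLong(c -> c.serviceId);

    /** Returns the preferred candidate, or empty if there is no match. */
    static Optional<IndexCandidate> select(List<IndexCandidate> matches) {
        return matches.stream().min(PREFERRED_FIRST);
    }
}
```

Sorting the full list with the same comparator would give an ordering that could also be used for the getIndexes(..) variants, although the specification above does not require a particular order for them.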