[ 
https://issues.apache.org/jira/browse/STANBOL-499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rupert Westenthaler updated STANBOL-499:
----------------------------------------

    Description: 
The SemanticIndex is the interface used by the ContentHub to semantically index 
ContentItems (2nd level store). It is anticipated that a ContentHub will manage 
multiple semantic indexes, possibly of different implementations.

Expected implementations of this interface include:

* The current Solr/LDPath based semantic index component
* The current Contenthub default index (also Solr based)
* A SPARQL based variant implemented by a Triple Store

The remainder of this specification defines the SemanticIndex interface as 
well as the SemanticIndexManager.

SemanticIndex
--------------------

The Java interface for semantic indexes as used by the Apache Stanbol Contenthub

### Identification

    :::java
    /** The name of the index */
    + getName() : String
    /** An optional free text description */
    + getDescription() : String

The name of the semantic index is intended to be used for simple lookups as 
well as for relative paths within the RESTful interfaces. However, it MUST NOT 
be considered unique. See section [Semantic Index 
Management](#Semantic_Index_Management) for details on how name conflicts are 
resolved.

### Indexing

First, the interface defines methods for adding documents to the semantic 
index and removing them from it:

    :::java
    /** Indexes the passed ContentItem */
    + index(ContentItem ci) : boolean
    /** Deletes the ContentItem with the passed id */
    + remove(String ciUri)
    /** Ensures that changes to the index are persisted */
    + persist(long revision)
    /** Getter for the highest successfully persisted revision */
    + getRevision() : long

The boolean returned by the index method indicates whether the passed 
ContentItem was actually included in the semantic index. A semantic index may 
define filters on the content items to be included.

The persist method is intended to signal to the semantic index that indexing 
has finished. This allows the semantic index to form batches over multiple 
calls to index(..) and remove(..), which may improve performance when 
indexing multiple ContentItems.

In addition, it is used to pass the highest revision of an indexed content 
item. If no revision has been announced to a semantic index yet, that is, if 
persist(..) was never called, then getRevision() shall return a negative number.

The ContentHub uses the revision to re-synchronize the contents of a semantic 
index with the enhanced ContentItems present in the [Store](store.html) when 
the index becomes active. Usually the long value will represent the time in 
milliseconds, as returned by <code>System.currentTimeMillis()</code>, but this 
is not a requirement. It is only important that each change to the Store 
results in an increase of this number.

All of the above methods may throw a SemanticIndexingException, which is a 
subclass of ContenthubException.
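
The batching and revision rules above can be illustrated with a small in-memory sketch. It is not the Stanbol implementation: `ContentItem` is reduced to a hypothetical class holding a URI and some text, the exceptions are stubbed (and kept unchecked for brevity, which the specification does not prescribe), and the filter that skips items without text is invented for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class IndexingSketch {

    /** Hypothetical, minimal stand-in for the Contenthub ContentItem. */
    static final class ContentItem {
        final String uri;
        final String text;
        ContentItem(String uri, String text) { this.uri = uri; this.text = text; }
    }

    /** Stubbed exceptions; unchecked here only to keep the sketch short. */
    static class ContenthubException extends RuntimeException {
        ContenthubException(String message) { super(message); }
    }
    static class SemanticIndexingException extends ContenthubException {
        SemanticIndexingException(String message) { super(message); }
    }

    /** Collects index(..)/remove(..) calls and applies them on persist(..). */
    static final class InMemoryIndex {
        private final Map<String, String> persisted = new HashMap<>();
        private final Map<String, String> pendingAdds = new HashMap<>();
        private final Set<String> pendingRemoves = new HashSet<>();
        private long revision = -1L; // negative: persist(..) was never called

        boolean index(ContentItem ci) {
            if (ci == null || ci.uri == null) {
                throw new SemanticIndexingException("missing ContentItem or URI");
            }
            if (ci.text == null || ci.text.isEmpty()) {
                return false; // assumed filter: items without text are skipped
            }
            pendingRemoves.remove(ci.uri);
            pendingAdds.put(ci.uri, ci.text);
            return true;
        }

        void remove(String ciUri) {
            pendingAdds.remove(ciUri);
            pendingRemoves.add(ciUri);
        }

        /** Applies the pending batch and records the announced revision. */
        void persist(long newRevision) {
            persisted.keySet().removeAll(pendingRemoves);
            persisted.putAll(pendingAdds);
            pendingRemoves.clear();
            pendingAdds.clear();
            revision = Math.max(revision, newRevision);
        }

        long getRevision() { return revision; }

        boolean contains(String ciUri) { return persisted.containsKey(ciUri); }
    }

    public static void main(String[] args) {
        InMemoryIndex index = new InMemoryIndex();
        index.index(new ContentItem("urn:ci:1", "some text"));
        index.persist(System.currentTimeMillis());
        System.out.println("revision after persist: " + index.getRevision());
    }
}
```

Keeping the pending changes separate from the persisted state is one simple way to make persist(..) the only point at which a batch becomes visible.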

### Index State

Semantic indexes provide the following state information:

    :::java
    /** The state of the semantic index */
    + getState() : IndexState

The IndexState is a simple Java enum that defines the following states:

* <code>UNINIT</code>: The index is defined and its configuration is valid, but 
the contents are not yet indexed and indexing has not yet started. (Intended 
to be used as the default state after creation.)
* <code>INDEXING</code>: The (initial) indexing of content items is currently 
in progress. This indicates that the index is currently NOT active.
* <code>ACTIVE</code>: The semantic index is available and in sync.
* <code>REINDEXING</code>: The re-indexing of content items is currently in 
progress. This indicates that the configuration of the semantic index was 
changed in a way that requires rebuilding the whole semantic index. The index 
remains available in this state, meaning that searches can be performed 
normally, but recent updates/changes to ContentItems might not be reflected. 
This also indicates that the index will be replaced by a different version 
(maybe with changed fields) in the near future.

Note that there are no states for INACTIVE and ERROR. This is because such 
states are already covered by the normal OSGi component life cycle. All of the 
above IndexStates require the SemanticIndex component to be active.
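
As a rough sketch, the four states could be written as the following enum. The `isSearchable()` helper is a hypothetical addition for illustration and is not part of the specification; it only encodes the statement above that searches work normally in the ACTIVE and REINDEXING states.

```java
public class IndexStateSketch {

    enum IndexState {
        /** Defined and configured, but indexing has not started. */
        UNINIT(false),
        /** Initial indexing in progress; the index is not yet active. */
        INDEXING(false),
        /** Available and in sync. */
        ACTIVE(true),
        /** Being rebuilt; searches work but may miss recent changes. */
        REINDEXING(true);

        private final boolean searchable;

        IndexState(boolean searchable) { this.searchable = searchable; }

        /** Hypothetical helper: can searches be served in this state? */
        boolean isSearchable() { return searchable; }
    }

    public static void main(String[] args) {
        for (IndexState state : IndexState.values()) {
            System.out.println(state + " searchable=" + state.isSearchable());
        }
    }
}
```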

### Index Inspection


The semantic index interface provides a very simple API to inspect the 
configuration of the semantic index. This part of the interface is considered 
optional. Implementations that cannot provide such information shall return 
<code>null</code> from the methods below.

    :::java
    /** The names of all fields defined by this Index */
    + getFieldsNames() : List<String>
    /** Getter for the field properties */
    + getFieldProperties(String name) : Map<String,Object>

Keys for well known properties shall be defined by the services API of the 
ContentHub. This includes the following:

    :::java
    /** The xsd:dataType for the values of a field */
    DATATYPE

Implementation-specific keys shall be defined by the implementations of the 
semantic index interface. A possible key for an LDPath-based semantic index 
implementation is:

    :::java
    /** The LDPath rule used for a field */
    LDPATH
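
To make the inspection methods more concrete, here is a minimal sketch of how field properties might be exposed. The string values of the `DATATYPE` and `LDPATH` keys, the `title` field and its LDPath rule are all assumptions made up for the example; the specification only names the keys.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InspectionSketch {

    // Hypothetical key values; the specification only names the constants.
    static final String DATATYPE = "datatype";
    static final String LDPATH = "ldpath";

    private static final Map<String, Map<String, Object>> FIELDS = new LinkedHashMap<>();
    static {
        // Example field with an assumed LDPath rule and XSD datatype.
        Map<String, Object> title = new LinkedHashMap<>();
        title.put(DATATYPE, "http://www.w3.org/2001/XMLSchema#string");
        title.put(LDPATH, "dc:title :: xsd:string");
        FIELDS.put("title", Collections.unmodifiableMap(title));
    }

    /** The names of all fields defined by this sketch index. */
    static List<String> getFieldsNames() {
        return new ArrayList<>(FIELDS.keySet());
    }

    /** The properties of a field, or null if the field is unknown. */
    static Map<String, Object> getFieldProperties(String name) {
        return FIELDS.get(name);
    }

    public static void main(String[] args) {
        for (String field : getFieldsNames()) {
            System.out.println(field + " -> " + getFieldProperties(field));
        }
    }
}
```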


### Search

The semantic index does NOT define methods to search its contents, as the 
intention is to directly use the search APIs of the technologies/frameworks 
used to hold the semantic index, such as

* [Apache Solr](http://lucene.apache.org/solr) RESTful API
* SPARQL in case a TripleStore is used as Semantic index.
* Contenthub featured search interface

However the semantic index has two methods that can be used to get information 
about supported search interfaces.

    :::java
    /** Getter for all supported RESTful search endpoints */
    + getRESTSearchEndpoints() : Map<String,String>
    /** Getter for all supported search components */
    + getSearchEndpoints() : Map<Class,ServiceReference>

The method returning the RESTful search interfaces uses a key representing the 
type of the RESTful service and the URL of the endpoint as value. The method 
returning the components uses the Java interface (Class) as key and an OSGi 
ServiceReference to the actual component as value. The latter is intended for 
users who want to perform queries on the Contenthub through the Java API.

TODO: Define a set of properties that SemanticIndex implementations MUST add to 
search components so that users can also use the normal ServiceTracker and 
@Reference annotations to obtain search components!

For example, the values for the semantic index with the name "default", which 
supports SOLR and Contenthub featured search as RESTful search services, are

    :::text
    "CONTENTHUB" : "http://localhost:8080/contenthub/search/featured";
    "SOLR" : "http://localhost:8080/solr/contenthub/default";

In addition, the following search components are supported:

    :::text
    org.apache.stanbol.contenthub.servicesapi.search.featured.FeaturedSearch : {service-reference-instance}
    org.apache.stanbol.contenthub.servicesapi.search.solr.SolrSearch : {an-other-service-reference-instance}
    org.apache.solr.client.solrj.SolrServer : {an-service-reference-to-the-solr-server}



Another example is an index with the name "knowledgebase" that supports a 
SPARQL endpoint

    :::text
    "SPARQL" : "http://localhost:8080/sparql/contenthub/knowledgebase";

as RESTful service and

    :::text
    org.apache.clerezza.rdf.core.sparql.QueryEngine *)

as component to perform SPARQL queries.

*) NOTE that the QueryEngine interface is used here only as an example. A real 
implementation would need to wrap it in some other interface that does not 
need the TcManager and TripleCollection to execute a query. These two MUST be 
provided by the "knowledgebase" SemanticIndex.
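
The following sketch shows how a client might use the RESTful endpoint map from the "default" example above. The map contents are copied from that example; the `solrSelectUrl` helper and the use of Solr's standard `/select` request handler under the returned URL are assumptions for illustration, not something this specification defines.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class RestEndpointSketch {

    /** Stand-in for getRESTSearchEndpoints() of the "default" index. */
    static Map<String, String> getRESTSearchEndpoints() {
        Map<String, String> endpoints = new LinkedHashMap<>();
        endpoints.put("CONTENTHUB", "http://localhost:8080/contenthub/search/featured");
        endpoints.put("SOLR", "http://localhost:8080/solr/contenthub/default");
        return endpoints;
    }

    /**
     * Hypothetical helper: builds a Solr select URL for the given query,
     * or returns null if the index does not expose a SOLR endpoint.
     */
    static String solrSelectUrl(Map<String, String> endpoints, String query) {
        String base = endpoints.get("SOLR");
        if (base == null) {
            return null;
        }
        try {
            return base + "/select?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(solrSelectUrl(getRESTSearchEndpoints(), "title:stanbol"));
    }
}
```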

Semantic Index Management
-------------------------

Semantic indexes are registered as OSGi components implementing the 
"SemanticIndex" interface as described above. All active semantic indexes are 
managed by the SemanticIndexManager component as follows:


### Interface

The SemanticIndexManager provides a Java API for looking up all active semantic 
indexes. This includes indexes in the UNINIT, INDEXING, ACTIVE and REINDEXING 
states.

Lookup of semantic indexes is supported based on name and search endpoint type.

    :::java
    + getIndex(String name) : SemanticIndex
    + getIndexes(String name) : List<SemanticIndex>

    + getIndex(String endpointType) : SemanticIndex
    + getIndexes(String endpointType) : List<SemanticIndex>

    + getIndex(String name, String endpointType) : SemanticIndex
    + getIndexes(String name, String endpointType) : List<SemanticIndex>

A typical query would be for an index with the name "simple" with the "SOLR" 
endpoint.

    :::java
    SemanticIndexManager indexManager;
    SemanticIndex index = indexManager.getIndex("simple", EndpointType.SOLR);
    String solrEndpoint = index.getRESTSearchEndpoints().get(EndpointType.SOLR);

The methods returning a single index need to resolve cases with multiple 
matches by returning the SemanticIndex service

1. with the highest "service.ranking" and,
2. if rankings are equal, the lowest "service.id".

This keeps the behavior consistent with the typical rules for service 
selection as defined by the OSGi specification.
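
The selection rule can be sketched with a plain comparator. The `Registration` class below is a hypothetical stand-in for an OSGi service registration that carries only the index name, "service.ranking" and "service.id"; a real SemanticIndexManager would read these values from the ServiceReference properties instead.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class IndexSelectionSketch {

    /** Hypothetical stand-in for a registered SemanticIndex service. */
    static final class Registration {
        final String name;
        final int ranking;    // value of "service.ranking"
        final long serviceId; // value of "service.id"
        Registration(String name, int ranking, long serviceId) {
            this.name = name;
            this.ranking = ranking;
            this.serviceId = serviceId;
        }
    }

    /** Highest ranking first; ties are broken by the lowest service id. */
    static final Comparator<Registration> OSGI_ORDER =
            Comparator.comparingInt((Registration r) -> r.ranking).reversed()
                      .thenComparingLong(r -> r.serviceId);

    /** All registrations with the given name, best match first. */
    static List<Registration> getIndexes(List<Registration> all, String name) {
        List<Registration> matches = new ArrayList<>();
        for (Registration r : all) {
            if (r.name.equals(name)) {
                matches.add(r);
            }
        }
        matches.sort(OSGI_ORDER);
        return matches;
    }

    /** The best registration with the given name, or null if none matches. */
    static Registration getIndex(List<Registration> all, String name) {
        List<Registration> matches = getIndexes(all, name);
        return matches.isEmpty() ? null : matches.get(0);
    }

    public static void main(String[] args) {
        List<Registration> all = new ArrayList<>();
        all.add(new Registration("simple", 0, 17L));
        all.add(new Registration("simple", 10, 42L));
        all.add(new Registration("simple", 10, 23L));
        System.out.println("selected service.id: " + getIndex(all, "simple").serviceId);
    }
}
```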



  was:
The SemanticIndex is the Interface used by the ContentHub to semantically index 
ContentItems (2nd level store). It is anticipated that a ContentHub will manage 
multiple semantic indexes of possible different implementations.

Expected Implementations of this Interface include

* The current Solr/LDPath based semantic index component
* The current Contenthub default index (also Solr based)
* A SPARQL based variant implemented by a Triple Store

The remaining Specification includes the definition of the SemanticIndex 
interface as well as the SemanticIndexManager.

SemanticIndex
--------------------

The Java interface for semantic indexes as used by the Apache Stanbol Contenthub

### Identification

    :::java
    /** The name of the Index */
    + getName()
    /** An optional free text description */
    + getDescription()

The name of the semantic index is intended to be used for simple lookups as 
well as relative paths within the RESTful interfaces. However it MUST NOT be 
considered as unique. See section [Semantic Index 
Management](#Semantic_Index_Management) for details on how to resolve name 
conflicts.

### Indexing

First the interface defines methods for indexing/removing documents to the 
semantic index

    :::java
    /** Indexes the parsed ContentItem */
    + index(ContentItem ci) : boolean
    /** Deletes the ContentItme with the parsed di */
    + remove(String ciUri)
    /** Ensures that changes to the index are persisted */
    + persist(long revision)
    /** Getter for the highest successfully persisted revision */
    + getRevision() : long

The boolean returned by the index method allows to indicate if the parsed 
ContentItem was actually included to the Semantic Index. Seamtic index may 
define filters on the content items to be included in the semantic index.

The persist Method is intended to be used to indicate the Semantic Index that 
indexing has been finished. This allows the semantic index to form batches over 
multiple calls to index(..) and remove(..) what may improve performance when 
indexing multiple ContentItems.

In addition it is used to parse the highest revision of a indexed content item. 
If no revision was yet announced to a Semantic index - persist(..) was never 
called - than getRevision() shall return a negative number.

The revision will be used by the ContentHub to re-synchronize the contents of a 
semantic index enhanced ContentItems present in [Store](store.html) when it 
becomes active. Usually the long value will represent the time in milliseconds 
such as returned by <code>System.currentTimeMillis()</code> but this is no 
requirement. It is only important that after each change of the Store interface 
results in an increase of this number.

All above methods may throw an SemanticIndexingException. This is a sub class 
of ContenthubException.

### Index State

Semantic Indexes do provide the following state information
    
    /** The state of the semantic index */
    + getState() : IndexState

The IndexState is a simple Java enum that defines the following states:

* <code>UNINIT</code> : The index was defined, the configuration is ok, but the 
contents are not yet indexed and the indexing has not yet started. (Intended to 
be used as default state after creations)
* <code>INDEXING</code>: The (initial) indexing of content items is currently 
in progress. This indicates that the index is currently NOT active.
* <code>ACTIVE</code>: The semantic index is available and in sync
* <code>REINDEXING</code>: The (re)-indexing of content times is currently in 
progress. This indicates that the configuration of the semantic index was 
changed in a way that requires to rebuild the whole semantic index. This still 
requires the index to be active - meaning the searches can be performed 
normally - but recent updates/changes to ContentItems might not be reflected. 
This also indicates that the index will be replaced by a different version 
(maybe with changed fields) in the near future.

Note that there are no states for INACTIVE and ERROR. This is because such kind 
of states are already convert by the normal OSGI component live-cycle. All the 
above IndexStates require the SemanticIndex component to be active.

### Index Inspection


The semantic index interface provides a very simple API to inspect the 
configuration of the semantic index. This part of the Interface is considered 
to be optional. Implementations that can not provide such information shall 
return <code>null</code> to calls of the below methods.

    :::java
    /** The names of all fields defined by this Index */
    + getFieldsNames() : List<String>
    /** Getter for the field properties */
    + getFieldProperties(String name) : Map<String,Object>

Keys for well known properties shall be defined by the services API of the 
ContentHub. This includes the following:

    :::java
    /** The xsd:dataType for the values of a field */
    DATATYPE

Implementation specific keys shall be defined by the implementations of the 
semantic index interface. Here are possible keys for a LDPath based Semantic 
Index implementation

    :::java
    /** The LDPath rule used for a field */
    LDPATH


### Search

The semantic index does NOT define methods to search it's contents as the 
intension is to directly use the search APIs of the technologies/framewoks used 
to hold the semantic index such as

* [Apache Solr](http://lucene.apache.org/solr) RESTful API
* SPARQL in case a TripleStore is used as Semantic index.
* Contenthub featured search interface

However the semantic index should return the URI and the type of the endpoint

    :::java
    /** Getter for all supported search endpoints */
    getSearchEndpoints() : Map<String,String>

This method returns as keys the type of the search Endpoint and as value the 
URL of the RESTful service endpoint.

e.g. the valued for the semantic index with the name "default" supporting SOLR 
and Contenthub featured search.

    :::text
    "CONTENTHUB" : "http://localhost:8080/contenthub/search/featured";
    "SOLR" : "http://localhost:8080/solr/contenthub/default";

An other example for an index with the name "knowledgebase" that supports an 
SPARQL endpoint

    :::text
    "SPARQL" : "http://localhost:8080/sparql/contenthub/knowledgebase";

Semantic Index Management
-------------------------

Semantic Indexes are registered as OSGI component implementing the 
"SemanticIndex" interface as described above. All active semantic indexes are 
managed by the SemanticIndexManager component as follows:


### Interface

Provides an Java API that allows to lookup of all active semantic indexes. This 
includes indexes in the UNINT, INDEXING, ACTIVE and REINDEXING state.

Lookup of semantic index is supported based on name, and search endpoint type.

    :::java
    + getIndex(String name) : SemanticIndex
    + getIndexes(String name) : List<SemanticIndex>

    + getIndex(String endpointType) : SemanticIndex
    + getIndexes(String endpointType) : List<SemanticIndex>

    + getIndex(String name, String endpointType) : SemanticIndex
    + getIndexes(String name, String endpointType) : List<SemanticIndex>

A typical query would be for an index with the name "simple" with the "SOLR" 
endpoint.

    :::java
    SemanticIndexManager indexManager;
    SemanticIndex index = indexManager.getIndex("simple", EndpointType.SOLR)
    String solrEndpoint = index.getSearchEndpoints().get(EndpointType.SOLR);

The methods returning a single Index need to resolve cases with multiple 
matches by returning the SemanticIndex service

1. with the highest "service.ranking" and
2. the lowest "service.id

This ensures the behavior to be consistent with the typical rules for service 
selection as defined by the OSGI specification.



    
> Contenthub: Semantic Indexes
> ----------------------------
>
>                 Key: STANBOL-499
>                 URL: https://issues.apache.org/jira/browse/STANBOL-499
>             Project: Stanbol
>          Issue Type: Sub-task
>          Components: Content Hub
>            Reporter: Rupert Westenthaler
>
