[jira] [Commented] (STANBOL-187) Extendable indexing infrastructure for the Entityhub

Rupert Westenthaler (JIRA) Fri, 29 Apr 2011 03:37:49 -0700

    [ 
https://issues.apache.org/jira/browse/STANBOL-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026933#comment-13026933
 ]


Rupert Westenthaler commented on STANBOL-187:
---------------------------------------------

Some initial Documentation:

- - -

Indexing API:

(0) IndexingComponent:
The parent interface of most of the following interfaces. This is used to set 
the configuration, start the initialization and close the component as soon as 
the indexing has finished.

(1) Indexing Source

Source information is divided into two categories:
 - Entity Data: Provides the data for the entity (Representation)
 - Entity ID/Score: Provides the ID and score (e.g. pageRank) for the entity

There are two modes for indexing:
(a) Iterate over the data and look up/calculate the score
(b) Iterate over the entity ids/scores and look up the data

For (a) the following interfaces are used:
 - EntityDataIterable to iterate over the entity data
 - EntityScoreProvider to provide or calculate the score based on the entity 
data
This mode is optimal when the data is provided by a source that does not 
allow ID-based retrieval (e.g. a file). It is often also the preferred mode 
when one needs to index all entities.

For (b) the following interfaces are used:
 - EntityIterator: iterates over entity ids and scores
 - EntityDataProvider: used to look up the data for an entity based on its id
This mode is intended for indexing only a part of the entities provided by 
the source. The EntityIterator can be used to specify the entities to be 
indexed (e.g. based on a file providing the IDs of the entities to be 
indexed). This feature will be needed to resolve STANBOL-92, STANBOL-93 
and STANBOL-163.
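
The difference between the two modes can be sketched roughly as follows. This 
is only an illustration: the class, the Entity holder and both method 
signatures are assumptions made for this example and are not the actual 
Entityhub interfaces. Plain maps and functions stand in for 
EntityDataIterable/EntityScoreProvider and EntityIterator/EntityDataProvider.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class IndexingModesSketch {

    /** Hypothetical holder for an entity that is ready to be indexed. */
    public static class Entity {
        public final String id;
        public final Map<String, String> data;
        public final float score;

        Entity(String id, Map<String, String> data, float score) {
            this.id = id;
            this.data = data;
            this.score = score;
        }
    }

    /** Mode (a): iterate over all entity data, look up the score per entity. */
    public static List<Entity> indexByData(
            Map<String, Map<String, String>> dataSource,
            Function<String, Float> scoreProvider) {
        List<Entity> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : dataSource.entrySet()) {
            Float score = scoreProvider.apply(e.getKey());
            // Assumption: entities without a known score get 0.
            result.add(new Entity(e.getKey(), e.getValue(),
                    score == null ? 0f : score));
        }
        return result;
    }

    /** Mode (b): iterate over ids/scores, look up the data per id. */
    public static List<Entity> indexByIds(
            Map<String, Float> idsAndScores,
            Function<String, Map<String, String>> dataProvider) {
        List<Entity> result = new ArrayList<>();
        for (Map.Entry<String, Float> e : idsAndScores.entrySet()) {
            Map<String, String> data = dataProvider.apply(e.getKey());
            if (data != null) { // skip ids the source has no data for
                result.add(new Entity(e.getKey(), data, e.getValue()));
            }
        }
        return result;
    }
}
```

In mode (b) only the ids handed in are ever looked up, which is what makes it 
suitable for indexing a subset of a large source.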


(2) Score Normaliser

This interface makes it possible to process the score values provided for 
entities (e.g. to calculate the pageRank based on the number of incoming 
links). The Score Normaliser is an optional component. If one is present it 
is applied to the score provided by the Indexing Source.
The Score Normaliser interface supports chaining of different instances (e.g. 
first calculate the natural log of the number of incoming links and then 
normalise the returned values to the range [0..1]).
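
A minimal sketch of such a chain, assuming nothing about the real Score 
Normaliser interface: java.util.function.DoubleUnaryOperator is used as a 
stand-in, and the class and method names are made up for this example. Using 
ln(1 + n) instead of ln(n) is also an assumption, chosen so that entities with 
zero incoming links stay defined.

```java
import java.util.function.DoubleUnaryOperator;

public class ScoreChainSketch {

    /** First step: natural log of the raw score (here ln(1 + n)). */
    public static DoubleUnaryOperator naturalLog() {
        return links -> Math.log1p(Math.max(0.0, links));
    }

    /** Second step: scale into [0..1] relative to a known maximum value. */
    public static DoubleUnaryOperator range(double maxValue) {
        return score -> Math.max(0.0, Math.min(1.0, score / maxValue));
    }

    /** Chain both steps; maxLinks is the largest expected raw score. */
    public static DoubleUnaryOperator logThenRange(double maxLinks) {
        return naturalLog().andThen(range(Math.log1p(maxLinks)));
    }

    public static void main(String[] args) {
        DoubleUnaryOperator chain = logThenRange(1000);
        for (double links : new double[] {0, 10, 1000}) {
            System.out.println(links + " -> " + chain.applyAsDouble(links));
        }
    }
}
```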

(3) EntityProcessor

This interface takes a Representation (data of the entity) as input and returns 
a modified version. This is an optional component.
The intention is to provide an extension point for services like schema 
translation, filters (for fields, languages, ...). An EntityProcessor that 
uses the FieldMapping functionality of the Entityhub is included.
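
As an illustration of the "data in, modified data out" contract, here is a 
small field filter. The Processor interface and the use of a plain Map in 
place of a Representation are assumptions for this sketch, not the actual 
EntityProcessor API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FieldFilterSketch {

    /** Hypothetical stand-in for the EntityProcessor contract. */
    public interface Processor {
        Map<String, List<String>> process(Map<String, List<String>> representation);
    }

    /** Returns a processor that keeps only the listed fields. */
    public static Processor keepOnly(Set<String> allowedFields) {
        return representation -> {
            Map<String, List<String>> filtered = new LinkedHashMap<>();
            for (Map.Entry<String, List<String>> field : representation.entrySet()) {
                if (allowedFields.contains(field.getKey())) {
                    filtered.put(field.getKey(), field.getValue());
                }
            }
            // The input is left untouched; a modified copy is returned.
            return filtered;
        };
    }
}
```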

(4) IndexingDestination

This interface is used to get the Yard (storage component of the Entityhub) to 
store the processed entities. In addition it defines a method that is used by 
the indexer to tell the destination that the indexing has finished. 
Implementations need to support the creation of distribution files used to load 
the indexed data into the Entityhub.


Indexing Process:

The indexing process is defined by the Indexer interface and implemented by the 
IndexerImpl. Indexer instances are created by using the IndexerFactory.

The process defines the following states:
 - UNINITIALISED: All components are present and configured but not yet 
initialized
 - INITIALISING: During the initialization
 - INITIALISED: The initialization of the components has finished. Ready to 
start the indexing
 - INDEXING: During the indexing process
 - INDEXED: The indexing of the entities has finished
 - FINALISING: during the finalization phase (e.g. creating the distribution 
files)
 - FINISHED: The indexing has finished.

The Indexer interface provides the index() method, which performs the whole 
process with a single method call. It also defines methods to perform the 
individual steps of the indexing process:
 - initialiseIndexingSources(): UNINITIALISED > INITIALISED
 - indexAllEntities(): INITIALISED > INDEXED
 - finaliseIndexingTarget(): INDEXED > FINISHED

All these methods block until the target state is reached. The index() 
method can be called in any of the states UNINITIALISED, INITIALISED and 
INDEXED and blocks until the FINISHED state is reached.
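
The state transitions described above can be sketched as a small state 
machine. The state and method names are taken from the description; the class 
itself, the require() helper and the single-threaded behaviour are 
assumptions for this example and do not reflect how IndexerImpl is written.

```java
public class IndexerStatesSketch {

    public enum State {
        UNINITIALISED, INITIALISING, INITIALISED,
        INDEXING, INDEXED, FINALISING, FINISHED
    }

    private State state = State.UNINITIALISED;

    public State getState() {
        return state;
    }

    public void initialiseIndexingSources() {
        require(State.UNINITIALISED);
        state = State.INITIALISING;
        // ... initialise all components here (omitted in this sketch)
        state = State.INITIALISED;
    }

    public void indexAllEntities() {
        require(State.INITIALISED);
        state = State.INDEXING;
        // ... index the entities here (omitted in this sketch)
        state = State.INDEXED;
    }

    public void finaliseIndexingTarget() {
        require(State.INDEXED);
        state = State.FINALISING;
        // ... e.g. create the distribution files (omitted in this sketch)
        state = State.FINISHED;
    }

    /** Runs all steps that are still missing to reach FINISHED. */
    public void index() {
        if (state != State.UNINITIALISED && state != State.INITIALISED
                && state != State.INDEXED) {
            throw new IllegalStateException("index() not allowed in state " + state);
        }
        if (state == State.UNINITIALISED) initialiseIndexingSources();
        if (state == State.INITIALISED) indexAllEntities();
        if (state == State.INDEXED) finaliseIndexingTarget();
    }

    // Hypothetical helper: rejects calls made in the wrong state.
    private void require(State expected) {
        if (state != expected) {
            throw new IllegalStateException(
                    "expected " + expected + " but was " + state);
        }
    }
}
```

Calling initialiseIndexingSources() first and index() afterwards therefore 
only runs the two remaining steps.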

The indexing process uses the consumer/producer pattern where the
 - Indexing Source produces Indexed Entities
 - Entity Processor consumes Indexed Entities and produces Processed Entities
 - Indexing Destination consumes Processed Entities and produces Finished 
Entities
 - an internal component consumes Finished Entities and provides status updates 
every 10000 indexed entities
In addition, every component can produce Errors that are processed (currently 
only logged) by an Error Processor.
An interface for registering a custom component that handles errors will be 
added later.

Currently a single thread is used for each component, but the implementation 
would already support the usage of multiple threads (e.g. to process entities). 
However note that the different steps do run simultaneously. BlockingQueues are 
used to buffer some entities between the steps.
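
A rough sketch of this consumer/producer setup with one thread per step and 
BlockingQueues in between. Everything here (class name, queue sizes, the END 
marker used to signal that a step is done, upper-casing as the "processing") 
is an assumption for illustration and not taken from IndexerImpl.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {

    // Unique marker object; compared by identity so no real id can collide.
    private static final String END = new String("END");

    public static List<String> run(List<String> sourceIds) {
        BlockingQueue<String> indexed = new ArrayBlockingQueue<>(10);
        BlockingQueue<String> processed = new ArrayBlockingQueue<>(10);
        List<String> finished = Collections.synchronizedList(new ArrayList<>());

        // Indexing Source: produces indexed entities.
        Thread source = new Thread(() -> {
            try {
                for (String id : sourceIds) indexed.put(id);
                indexed.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Entity Processor: consumes indexed, produces processed entities.
        Thread processor = new Thread(() -> {
            try {
                for (String e = indexed.take(); e != END; e = indexed.take()) {
                    processed.put(e.toUpperCase(Locale.ROOT));
                }
                processed.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Indexing Destination: consumes processed entities.
        Thread destination = new Thread(() -> {
            try {
                for (String e = processed.take(); e != END; e = processed.take()) {
                    finished.add(e);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        source.start();
        processor.start();
        destination.start();
        try {
            source.join();
            processor.join();
            destination.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return new ArrayList<>(finished);
    }
}
```

Because the queues are bounded, a slow step automatically throttles the steps 
in front of it instead of letting entities pile up in memory.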

Configuration of the Indexing Process:

The configuration of the IndexingProcess is based on the following file 
structure

/indexing -> the root folder
/indexing/config -> the folder holding all the configuration
/indexing/config/indexing.properties -> the main configuration file
/indexing/resources/ -> provides the resources for the indexing process (e.g. 
the Files with the entity data, scores, schema definitions …)
/indexing/destination/ -> stores data created by the indexing process (e.g. the 
Solr Index with the indexed entities)
/indexing/dist/ -> contains the files needed to load the indexed data into the 
Entityhub

Some details on the "indexing.properties" file:

It uses the following syntax:
{key}={value1},{param1}:{paramValue1},{param2}:{paramValue2};{value2}…

keys:
 - Supported keys are defined in IndexingConstants
 - Full UTF-8 can be used for keys (java.util.Properties is NOT used for 
parsing)

value:
 - multiple values are separated by ';'
 - parameters can be added to values. The first parameter starts after the 
first ','

param:
 - multiple parameters are separated by ','
 - The first ':' is used to separate the parameter name from the parameter 
value.
 - A parameter need not have a value
 - the indexing configuration defines some parameters that can be used with 
every configuration. Other parameters are not processed but passed to the 
component associated with the current value. (see setConfiguration method in 
the IndexingComponent interface)

special parameter: 

The value of the "config" parameter is used to load additional properties 
from a config file in the "/indexing/config" directory.
E.g. the configuration

scoreNormalizer=org.apache.stanbol.entityhub.indexing.core.normaliser.RangeNormaliser,config:range

would load the configuration from the file "/indexing/config/range.properties" 
and pass it to the RangeNormaliser instance.
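
To make the syntax concrete, here is a small parser sketch for a single 
configuration line. It is not the parser used by the indexing core; the class, 
the ConfigValue holder and the configFile() helper are made up for this 
example. It assumes that a parameter without a ':' simply has no value (stored 
as null) and does no escaping of ';', ',' or ':'.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConfigLineSketch {

    /** Hypothetical holder: one value plus its parameters. */
    public static class ConfigValue {
        public final String value;
        public final Map<String, String> params = new LinkedHashMap<>();

        ConfigValue(String value) {
            this.value = value;
        }
    }

    /** Returns the key in front of the first '='. */
    public static String parseKey(String line) {
        int eq = line.indexOf('=');
        if (eq < 0) throw new IllegalArgumentException("missing '=': " + line);
        return line.substring(0, eq).trim();
    }

    /** Splits the right-hand side into values (';') and parameters (','). */
    public static List<ConfigValue> parseValues(String line) {
        int eq = line.indexOf('=');
        if (eq < 0) throw new IllegalArgumentException("missing '=': " + line);
        List<ConfigValue> values = new ArrayList<>();
        for (String part : line.substring(eq + 1).split(";")) {
            String[] tokens = part.split(",");
            ConfigValue value = new ConfigValue(tokens[0].trim());
            for (int i = 1; i < tokens.length; i++) {
                // Only the first ':' separates the name from the value.
                int colon = tokens[i].indexOf(':');
                if (colon < 0) {
                    value.params.put(tokens[i].trim(), null);
                } else {
                    value.params.put(tokens[i].substring(0, colon).trim(),
                            tokens[i].substring(colon + 1).trim());
                }
            }
            values.add(value);
        }
        return values;
    }

    /** Resolves the "config" parameter to the file described above. */
    public static String configFile(String configName) {
        return "/indexing/config/" + configName + ".properties";
    }
}
```

Applied to the scoreNormalizer line above, this yields one value (the 
RangeNormaliser class name) with the single parameter config=range, which 
configFile() maps to "/indexing/config/range.properties".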

NOTE: the IndexingConfig instance is also passed to the components by using 
the key "indexingConfig" (IndexingConfig.KEY_INDEXING_CONFIG)

The unit tests within the indexing core bundle are a good starting point for 
exploring how to use/configure the new indexing infrastructure. As soon as the 
current indexing utilities are moved to this new infrastructure they will 
provide even better examples.

best
Rupert Westenthaler

> Extendable indexing infrastructure for the Entityhub
> ----------------------------------------------------
>
>                 Key: STANBOL-187
>                 URL: https://issues.apache.org/jira/browse/STANBOL-187
>             Project: Stanbol
>          Issue Type: Improvement
>          Components: Entity Hub
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> Currently the Entityhub includes some utilities to create indexes for 
> dbPedia, geonames and dblp. There is also a generic RDF indexer that is 
> used by the dbPedia and dblp utilities; however, this implementation is also 
> not extendable and not really suitable for adding the features requested by 
> issues like STANBOL-92, STANBOL-93 and STANBOL-163.
> The goal is to create an infrastructure that provides an implementation of
>  - the indexing workflow
>  - configuration and initialization
> and defines interfaces that allow plugging in
>  - different Data Sources
>  - entity ranking implementations
>  - entity data mapper (e.g. filtering some fields, schema translations ...)
>  - indexing targets (the Yard that stores the indexed entities)
> The existing Indexing utilities need to be moved to use the new Infrastructure

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
