[
https://issues.apache.org/jira/browse/STANBOL-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026933#comment-13026933
]
Rupert Westenthaler commented on STANBOL-187:
---------------------------------------------
Some initial Documentation:
- - -
Indexing API:
(0) IndexingComponent:
The parent interface of most of the following interfaces. This is used to set
the configuration, start the initialization and close the component as soon as
the indexing has finished.
(1) Indexing Source
Source information is divided into two categories:
- Entity Data: Provides the data for the entity (Representation)
- Entity ID/Score: Provides the ID and score (e.g. pageRank) for the entity
There are two modes for indexing:
(a) Iterate over the data and look up/calculate the score
(b) Iterate over the entity IDs/scores and look up the data
for (a) the following interfaces are used:
- EntityDataIterable to iterate over the entity data
- EntityScoreProvider to provide or calculate the score based on the entity
data
This mode is optimal when the data is provided by a source that does not
allow ID-based retrieval (e.g. a file). It is often also the preferred mode
when one needs to index all entities.
for (b) the following interfaces are used:
- EntityIterator: iterator over entity id and score
- EntityDataProvider: used to lookup the data for the entity based on the id
This mode is intended for cases where only a part of the entities provided by
the source should be indexed. The EntityIterator can be used to specify the
entities to be indexed (e.g. based on a file providing the IDs of the entities
to be indexed). This feature will be needed to resolve STANBOL-92, STANBOL-93
and STANBOL-163.
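The two modes can be sketched roughly as follows. The interface names are taken from the description above, but every method signature, the Entity class standing in for a Representation, the plain Iterator standing in for EntityIterator and the class name IndexingModesSketch are assumptions made for illustration; this is not the actual Entityhub API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexingModesSketch {

    /** Hypothetical stand-in for a Representation: an id plus field values. */
    static final class Entity {
        final String id;
        final Map<String, Object> fields;
        Entity(String id, Map<String, Object> fields) {
            this.id = id;
            this.fields = fields;
        }
    }

    // Mode (a): iterate over the data, calculate (or look up) the score.
    interface EntityDataIterable extends Iterable<Entity> {}
    interface EntityScoreProvider { float score(Entity entity); }

    // Mode (b): iterate over id/score pairs, look up the data by id.
    // A plain Iterator of map entries stands in for EntityIterator here.
    interface EntityDataProvider { Entity getEntityData(String id); }

    static Map<String, Float> indexByData(EntityDataIterable source,
                                          EntityScoreProvider scores) {
        Map<String, Float> indexed = new LinkedHashMap<>();
        for (Entity entity : source) {
            indexed.put(entity.id, scores.score(entity));
        }
        return indexed;
    }

    static Map<String, Float> indexByIds(Iterator<Map.Entry<String, Float>> ids,
                                         EntityDataProvider data) {
        Map<String, Float> indexed = new LinkedHashMap<>();
        while (ids.hasNext()) {
            Map.Entry<String, Float> idAndScore = ids.next();
            Entity entity = data.getEntityData(idAndScore.getKey());
            if (entity != null) { // ids unknown to the source are skipped
                indexed.put(entity.id, idAndScore.getValue());
            }
        }
        return indexed;
    }

    /** Small in-memory source with two entities, used by both demos. */
    static List<Entity> sampleEntities() {
        return Arrays.asList(
            new Entity("urn:a", Collections.<String, Object>singletonMap("links", 10)),
            new Entity("urn:b", Collections.<String, Object>singletonMap("links", 3)));
    }

    public static void main(String[] args) {
        List<Entity> all = sampleEntities();
        // (a) index everything; the score is derived from the entity data
        System.out.println(indexByData(all::iterator,
            e -> ((Integer) e.fields.get("links")).floatValue()));
        // (b) index only the ids listed by the iterator
        List<Map.Entry<String, Float>> wanted =
            Collections.<Map.Entry<String, Float>>singletonList(
                new SimpleEntry<>("urn:b", 0.5f));
        System.out.println(indexByIds(wanted.iterator(),
            id -> all.stream().filter(e -> e.id.equals(id)).findFirst().orElse(null)));
    }
}
```

In mode (a) the source drives the loop, so every entity is visited once; in mode (b) the id list drives the loop, so only the listed entities are fetched.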
(2) Score Normaliser
This interface makes it possible to process score values provided for
entities (e.g. to calculate the pageRank based on the number of incoming links).
The Score Normaliser is an optional component. If one is present, it is applied
to the score provided by the Indexing Source.
The Score Normaliser interface supports chaining of different instances (e.g.
first calculating the natural log of the incoming links and then normalising the
returned values to the range [0..1]).
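A minimal sketch of such a chain, assuming a simplified one-method interface. The ScoreNormaliser signature, the factory methods naturalLog and range, the upper bound parameter and the handling of scores below 1 are all hypothetical and chosen only to illustrate the chaining idea; the real interface may look different.

```java
public class ScoreNormaliserSketch {

    /** Hypothetical, simplified normaliser interface. */
    interface ScoreNormaliser {
        float normalise(float score);
    }

    /** Natural log of the score; an optional previous normaliser runs first. */
    static ScoreNormaliser naturalLog(ScoreNormaliser previous) {
        return score -> {
            float in = previous == null ? score : previous.normalise(score);
            // Assumption: scores below 1 map to 0 instead of a negative log.
            return in < 1f ? 0f : (float) Math.log(in);
        };
    }

    /** Scales the score into [0..1] relative to an assumed upper bound. */
    static ScoreNormaliser range(float upperBound, ScoreNormaliser previous) {
        return score -> {
            float in = previous == null ? score : previous.normalise(score);
            return Math.max(0f, Math.min(1f, in / upperBound));
        };
    }

    public static void main(String[] args) {
        float maxIncomingLinks = 1000f; // assumed maximum in the data set
        // Chain: first the natural log, then the range normalisation.
        ScoreNormaliser chain =
            range((float) Math.log(maxIncomingLinks), naturalLog(null));
        System.out.println(chain.normalise(1f));
        System.out.println(chain.normalise(100f));
        System.out.println(chain.normalise(1000f));
    }
}
```

Each normaliser only knows the one that runs before it, so further steps can be added without changing existing ones.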
(3) EntityProcessor
This Interface takes a Representation (data of the entity) as input and returns
a modified version. This is an optional component.
The intention is to provide an extension point for services like schema
translation and filters (for fields, languages, ...). An EntityProcessor that
uses the FieldMapping functionality of the Entityhub is included.
(4) IndexingDestination
This interface is used to get the Yard (storage component of the Entityhub) to
store the processed entities. In addition it defines a method that is used by
the indexer to tell the destination that the indexing has finished.
Implementations need to support the creation of distribution files used to load
the indexed data into the Entityhub.
Indexing Process:
The indexing process is defined by the Indexer interface and implemented by the
IndexerImpl. Indexer instances are created by using the IndexerFactory.
The process defines the following states:
- UNINITIALISED: All components are present and configured but not yet
initialized
- INITIALISING: During the initialization
- INITIALISED: The initialization of the components has finished. Ready to
start the indexing
- INDEXING: During the indexing process
- INDEXED: The indexing of the entities has finished
- FINALISING: during the finalization phase (e.g. creating the distribution
files)
- FINISHED: The indexing has finished.
The Indexer interface provides the index() method, which performs the whole
process with a single method call. It also defines methods to perform the
individual steps of the indexing process:
- initialiseIndexingSources(): UNINITIALISED > INITIALISED
- indexAllEntities(): INITIALISED > INDEXED
- finaliseIndexingTarget(): INDEXED > FINISHED
All these methods block until the target state is reached. The index()
method can be called in any of the UNINITIALISED, INITIALISED and INDEXED
states and blocks until the FINISHED state is reached.
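The state transitions described above can be sketched as a small state machine. The state names and the four method names come from the text; the class IndexerStateSketch, the step() helper, the IllegalStateException on a wrong start state and the no-op behaviour of index() in other states are assumptions, not the behaviour of IndexerImpl. The sketch is single-threaded and does no real work.

```java
public class IndexerStateSketch {

    enum State {
        UNINITIALISED, INITIALISING, INITIALISED,
        INDEXING, INDEXED, FINALISING, FINISHED
    }

    private State state = State.UNINITIALISED;

    State getState() {
        return state;
    }

    void initialiseIndexingSources() {
        step(State.UNINITIALISED, State.INITIALISING, State.INITIALISED);
    }

    void indexAllEntities() {
        step(State.INITIALISED, State.INDEXING, State.INDEXED);
    }

    void finaliseIndexingTarget() {
        step(State.INDEXED, State.FINALISING, State.FINISHED);
    }

    /** Runs all remaining steps, starting from the current state. */
    void index() {
        if (state == State.UNINITIALISED) initialiseIndexingSources();
        if (state == State.INITIALISED) indexAllEntities();
        if (state == State.INDEXED) finaliseIndexingTarget();
        // Behaviour in other states is not specified above;
        // this sketch simply does nothing.
    }

    private void step(State from, State during, State to) {
        if (state != from) {
            throw new IllegalStateException(
                "expected " + from + " but was " + state);
        }
        state = during;
        // The real work of the step would run (and block) here.
        state = to;
    }

    public static void main(String[] args) {
        IndexerStateSketch indexer = new IndexerStateSketch();
        indexer.initialiseIndexingSources(); // UNINITIALISED > INITIALISED
        indexer.index();                     // runs the two remaining steps
        System.out.println(indexer.getState());
    }
}
```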
The indexing process uses the consumer/producer pattern where the
- Indexing Source produces Indexed Entities
- Entity Processor consumes Indexed Entities and produces Processed Entities
- Indexing Destination consumes Processed Entities and produces Finished
Entities
- an internal component consumes Finished Entities and provides status updates
every 10000 indexed entities
In addition, every component can produce errors that are processed (currently
only logged) by an Error Processor.
An interface for registering a custom component that handles errors will be
added later.
Currently a single thread is used for each component, but the implementation
would already support the usage of multiple threads (e.g. to process entities).
However note that the different steps do run simultaneously. BlockingQueues are
used to buffer some entities between the steps.
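The following is a minimal sketch of such a pipeline with one thread per step and BlockingQueues between them. It is not the IndexerImpl code: the class name IndexingPipelineSketch, the queue capacity of 4, the END marker used to shut the stages down and the use of plain strings as entities are all assumptions. Status updates and error handling are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class IndexingPipelineSketch {

    // End-of-stream marker, compared by identity so it cannot clash with data.
    private static final String END = new String("END");

    static List<String> run(List<String> entityIds) {
        BlockingQueue<String> indexed = new ArrayBlockingQueue<>(4);
        BlockingQueue<String> processed = new ArrayBlockingQueue<>(4);
        List<String> stored = Collections.synchronizedList(new ArrayList<>());

        // Indexing Source: produces indexed entities.
        Thread source = new Thread(() -> {
            try {
                for (String id : entityIds) {
                    indexed.put(id); // blocks while the buffer is full
                }
                indexed.put(END);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        });

        // Entity Processor: consumes indexed, produces processed entities.
        Thread processor = new Thread(() -> {
            try {
                for (String entity = indexed.take(); entity != END;
                        entity = indexed.take()) {
                    processed.put("processed:" + entity);
                }
                processed.put(END);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        });

        // Indexing Destination: consumes processed entities and stores them.
        Thread destination = new Thread(() -> {
            try {
                for (String entity = processed.take(); entity != END;
                        entity = processed.take()) {
                    stored.add(entity);
                }
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        });

        List<Thread> threads = Arrays.asList(source, processor, destination);
        threads.forEach(Thread::start);
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", ex);
        }
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("urn:a", "urn:b", "urn:c")));
    }
}
```

Because the queues are bounded, a slow destination slows down the source instead of letting entities pile up in memory.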
Configuration of the Indexing Process:
The configuration of the IndexingProcess is based on the following file
structure
/indexing -> the root folder
/indexing/config -> the folder holding all the configuration
/indexing/config/indexing.properties -> the main configuration file
/indexing/resources/ -> provides the resources for the indexing process (e.g.
the Files with the entity data, scores, schema definitions …)
/indexing/destination/ -> stores data created by the indexing process (e.g. the
Solr Index with the indexed entities)
/indexing/dist/ -> contains the files needed to load the indexed data into the
Entityhub
Some details to the "indexing.properties" File:
It uses the following syntax:
{key}={value1},{param1}:{paramValue1},{param2}:{paramValue2};{value2}…
keys:
- Supported keys are defined in IndexingConstants
- Full UTF-8 can be used for keys (java.util.Properties is NOT used for
parsing)
value:
- multiple values are separated by ';'
- parameters can be added to values. The first parameter starts after the
first ','
param:
- multiple parameters are separated by ','
- The first ':' is used to separate the parameter name from the parameter
value.
- A parameter need not have a value
- The indexing configuration defines some parameters that can be used with
every configuration. Other parameters are not processed but passed to the
component associated with the current value (see the setConfiguration method in
the IndexingComponent interface).
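To make the syntax concrete, here is a small sketch of a parser for the value part of a line (everything after '='). It is not the parser used by the indexing tool: the class IndexingPropertiesSketch, the ConfigValue holder and the parseValues method are hypothetical, and treating a parameter without ':' as a parameter with a null value is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexingPropertiesSketch {

    /** One value of a key together with its parameters (hypothetical holder). */
    static final class ConfigValue {
        final String value;
        final Map<String, String> params = new LinkedHashMap<>();
        ConfigValue(String value) {
            this.value = value;
        }
    }

    static List<ConfigValue> parseValues(String raw) {
        List<ConfigValue> result = new ArrayList<>();
        for (String part : raw.split(";")) {       // ';' separates values
            String[] tokens = part.split(",");     // ',' separates parameters
            if (part.trim().isEmpty() || tokens.length == 0) {
                continue;
            }
            ConfigValue configValue = new ConfigValue(tokens[0].trim());
            for (int i = 1; i < tokens.length; i++) {
                // Only the first ':' splits name and value.
                String[] nameAndValue = tokens[i].split(":", 2);
                String name = nameAndValue[0].trim();
                String value =
                    nameAndValue.length > 1 ? nameAndValue[1].trim() : null;
                configValue.params.put(name, value);
            }
            result.add(configValue);
        }
        return result;
    }

    public static void main(String[] args) {
        String raw = "org.apache.stanbol.entityhub.indexing.core.normaliser"
            + ".RangeNormaliser,config:range";
        for (ConfigValue configValue : parseValues(raw)) {
            System.out.println(configValue.value + " " + configValue.params);
        }
    }
}
```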
special parameter:
"config": the value of this parameter is used to load additional properties
from a config file in the "/indexing/config" directory.
e.g. the configuration
scoreNormalizer=org.apache.stanbol.entityhub.indexing.core.normaliser.RangeNormaliser,config:range
would load the configuration from the file "/indexing/config/range.properties"
and pass it to the RangeNormaliser instance.
NOTE: the IndexingConfig instance is also passed to the components by using
the key "indexingConfig" (IndexingConfig.KEY_INDEXING_CONFIG)
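As a small illustration of how the "config" parameter maps to a file, the sketch below resolves a parameter value against the root folder. The class ConfigParamSketch and the method resolveConfig are hypothetical; only the "/indexing/config" folder and the ".properties" suffix are taken from the example above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ConfigParamSketch {

    /** Maps e.g. "range" to {root}/config/range.properties. */
    static Path resolveConfig(Path indexingRoot, String configParam) {
        return indexingRoot.resolve("config").resolve(configParam + ".properties");
    }

    public static void main(String[] args) {
        System.out.println(resolveConfig(Paths.get("/indexing"), "range"));
    }
}
```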
The unit tests within the indexing core bundle are a good starting point for
exploring how to use/configure the new indexing infrastructure. As soon as the
current indexing utilities are moved to this new infrastructure they will
provide even better examples.
best
Rupert Westenthaler
> Extendable indexing infrastructure for the Entityhub
> ----------------------------------------------------
>
> Key: STANBOL-187
> URL: https://issues.apache.org/jira/browse/STANBOL-187
> Project: Stanbol
> Issue Type: Improvement
> Components: Entity Hub
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler
>
> Currently the Entityhub includes some utilities to create Indexes for
> dbPedia, geonames and dblp. There exists also an generic RDF indexer that is
> used by the dbPedia and dblp however also this implementation is not
> extendable and not really suitable to add features requested by issues like
> STANBOL-92, STANBOL-93 and STANBOL-163.
> The goal is to create an infrastructure that provides an implementation of
> - the indexing workflow
> - configuration and initialization
> and defines Interfaces that allows to plug in
> - different Data Sources
> - entity ranking implementations
> - entity data mapper (e.g. filtering some fields, schema translations ...)
> - indexing targets (the Yard that stores the indexed entities)
> The existing Indexing utilities need to be moved to use the new Infrastructure