Rupert Westenthaler created STANBOL-593:
-------------------------------------------
Summary: EntityIterator implementation based on Jena TDB that
allows to filter Entities based on Triple Filters
Key: STANBOL-593
URL: https://issues.apache.org/jira/browse/STANBOL-593
Project: Stanbol
Issue Type: New Feature
Components: Entity Hub
Reporter: Rupert Westenthaler
Assignee: Rupert Westenthaler
Fix For: 0.10.0-incubating
The FieldValueProcessor (EntityProcessor) can already filter Entities
based on Triple Filters. However, this requires iterating over all Entities,
which is very inefficient if one only wants to index a rather small fraction
of them.
To achieve better performance in such cases, one needs a component that uses
similar functionality to filter Entities within the Indexing Source. Such a
component is easy to implement on top of Jena TDB, as its low-level API
natively supports filtered iterators.
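The difference between filtering after a full scan and filtering inside the
source can be sketched roughly as follows. This is a minimal illustration in
plain Java, not the Jena TDB API: the class FilteredSubjectLookup and its
predicate -> object -> subjects map are hypothetical stand-ins for an index
that a triple store might provide.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an indexed triple store (NOT the Jena TDB API).
final class FilteredSubjectLookup {

    // predicate -> object -> subjects that have this (predicate, object) pair
    private final Map<String, Map<String, Set<String>>> index = new HashMap<>();

    // Register a single triple in the index.
    void add(String subject, String predicate, String object) {
        index.computeIfAbsent(predicate, k -> new HashMap<>())
             .computeIfAbsent(object, k -> new LinkedHashSet<>())
             .add(subject);
    }

    // Return only the subjects whose 'field' has one of the given values.
    // Subjects that do not match are never visited.
    Iterator<String> subjects(String field, List<String> values) {
        Map<String, Set<String>> byObject =
                index.getOrDefault(field, Collections.emptyMap());
        Set<String> result = new LinkedHashSet<>();
        for (String value : values) {
            result.addAll(byObject.getOrDefault(value, Collections.emptySet()));
        }
        return result.iterator();
    }

    public static void main(String[] args) {
        FilteredSubjectLookup store = new FilteredSubjectLookup();
        store.add("ex:alice", "rdf:type", "dbp-ont:Person");
        store.add("ex:paris", "rdf:type", "dbp-ont:Place");
        store.add("ex:stanbol", "rdf:type", "ex:Software");

        Iterator<String> it = store.subjects(
                "rdf:type", Arrays.asList("dbp-ont:Person", "dbp-ont:Place"));
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

The lookup cost here depends on the number of matching Entities rather than
on the total number of Entities in the source, which is the property the
proposed iterator is after.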
Indexing configurations would then use an EntityIterator/EntityDataProvider
combination as the source for the indexing. A corresponding configuration would
look like this:
entityIdIterator=org.apache.stanbol.entityhub.indexing.source.jenatdb.ResourceFilterIterator,config:entityTypes.properties
entityDataProvider=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,source:rdfdata
The entityTypes.properties file would require the following properties:
field=rdf:type
values=dbp-ont:Person;dbp-ont:Place;dbp-ont:Organisation
With this configuration the indexing process would only iterate over Persons,
Places and Organisations present within the IndexingSource.
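One way such a properties file could be read is sketched below. How the
ResourceFilterIterator actually parses its configuration is not stated in
this issue; the class EntityTypeFilterConfig is hypothetical, and treating
';' as the value separator is an assumption based only on the example above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Hypothetical holder for the 'field' and 'values' settings shown above.
final class EntityTypeFilterConfig {
    final String field;
    final List<String> values;

    private EntityTypeFilterConfig(String field, List<String> values) {
        this.field = field;
        this.values = Collections.unmodifiableList(values);
    }

    // Parse the content of a properties file with 'field' and 'values' keys.
    static EntityTypeFilterConfig parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String field = props.getProperty("field");
        String raw = props.getProperty("values");
        if (field == null || raw == null) {
            throw new IllegalArgumentException(
                    "both 'field' and 'values' must be configured");
        }
        // Assumption: multiple values are separated by ';'
        List<String> values = new ArrayList<>();
        for (String value : raw.split(";")) {
            String trimmed = value.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return new EntityTypeFilterConfig(field.trim(), values);
    }

    public static void main(String[] args) {
        EntityTypeFilterConfig config = parse(
                "field=rdf:type\n"
              + "values=dbp-ont:Person;dbp-ont:Place;dbp-ont:Organisation\n");
        System.out.println(config.field + " -> " + config.values);
    }
}
```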