Rupert Westenthaler created STANBOL-1016:
--------------------------------------------
Summary: Add RDF Triple Filter support to the Jena TDB Indexing
Source
Key: STANBOL-1016
URL: https://issues.apache.org/jira/browse/STANBOL-1016
Project: Stanbol
Issue Type: Sub-task
Components: Entityhub
Reporter: Rupert Westenthaler
Assignee: Rupert Westenthaler
The freebase.com dump has ~1,200,000,000 triples. Loading those triples into
Jena TDB takes ages if the RAM (available to the memory-mapped files) is not
large enough to hold the data. Once the imported data no longer fits into the
available RAM, the import speed decreases to ~7k triples/sec on an SSD. To
reach those 7k triples/sec the logs show 1.5k reads and 1k writes per second,
so import speeds on normal hard disks should be much slower.
As most of the triples contained in the Freebase dump are not relevant for
indexing, this issue will introduce a new feature to the Jena TDB Indexing
Source that allows triples to be filtered out at a very low level.
This filter will operate on the triples provided by the RIOT parser and define
a single method:
accept(Node subject, Node predicate, Node object) : boolean
In addition, the interface will extend IndexingComponent, which will allow it
to be configured via the configuration file of the
org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource
The parameter used to configure the filter will be called "import-filter" and
the value MUST be the class name of the implementation to use.
The configuration of the jenatdb.RdfIndexingSource will be passed to the import
filter's #setConfiguration(..) method. This means that users will need to add
the configuration properties for the import filter to the configuration of the
RdfIndexingSource.
To keep things simple the RdfImportFilter interface will be specific to the
Jena TDB Indexing Source.
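
A minimal sketch of how such a filter might look, not the final API. To keep
the example free of dependencies, the type parameter N stands in for Jena's
Node, and only the #setConfiguration(..) method of IndexingComponent is
modelled, with an assumed Map<String,Object> signature. The PredicateFilter
class and its "allowed-predicates" property are hypothetical and only
illustrate how filter properties could be read from the configuration of the
RdfIndexingSource.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: N stands in for Jena's Node so the example has no dependencies.
class ImportFilterSketch {

    /** Proposed filter shape; the real interface would also extend IndexingComponent. */
    interface RdfImportFilter<N> {
        /** Assumed signature: receives the configuration of the RdfIndexingSource. */
        void setConfiguration(Map<String, Object> config);

        /** Returns true if the triple should be imported into TDB. */
        boolean accept(N subject, N predicate, N object);
    }

    /** Hypothetical implementation that keeps only triples with configured predicates. */
    static class PredicateFilter implements RdfImportFilter<String> {
        // Hypothetical property name, not defined by this issue.
        static final String PARAM = "allowed-predicates";
        private final Set<String> allowed = new HashSet<>();

        @Override
        public void setConfiguration(Map<String, Object> config) {
            allowed.clear();
            Object value = config.get(PARAM);
            if (value == null) {
                return; // nothing configured: every triple is rejected
            }
            // Predicates are separated by ';' in this sketch.
            for (String predicate : value.toString().split(";")) {
                if (!predicate.trim().isEmpty()) {
                    allowed.add(predicate.trim());
                }
            }
        }

        @Override
        public boolean accept(String subject, String predicate, String object) {
            return allowed.contains(predicate);
        }
    }

    public static void main(String[] args) {
        // The filter's own properties live next to "import-filter" in the source config.
        Map<String, Object> config = new HashMap<>();
        config.put("import-filter", PredicateFilter.class.getName());
        config.put(PredicateFilter.PARAM, "rdfs:label;rdf:type");

        PredicateFilter filter = new PredicateFilter();
        filter.setConfiguration(config);

        System.out.println(filter.accept("ex:m1", "rdfs:label", "\"Vienna\"")); // true
        System.out.println(filter.accept("ex:m1", "ex:other", "ex:m2"));        // false
    }
}
```

Because the filter runs before a triple is written to TDB, rejected triples
never cause any disk I/O, which is where the import currently spends its time.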