Rupert Westenthaler created STANBOL-1016:
--------------------------------------------

             Summary: Add RDF Triple Filter support to the Jena TDB Indexing Source
                 Key: STANBOL-1016
                 URL: https://issues.apache.org/jira/browse/STANBOL-1016
             Project: Stanbol
          Issue Type: Sub-task
          Components: Entityhub
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler


The freebase.com dump has ~1,200,000,000 triples. Loading those triples into 
Jena TDB takes ages unless the RAM available to the memory-mapped files is 
large enough to hold the data. Once the number of imported triples exceeds the 
available RAM, the import speed decreases to ~7k triples/sec on an SSD. To 
reach those 7k triples/sec the logs show 1.5k reads and 1k writes per second, 
so import speeds on normal hard disks should be much slower.
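
To put the numbers above into perspective, here is a back-of-the-envelope 
estimate. It assumes (pessimistically, and only for illustration) that the 
whole import runs at the degraded rate of ~7k triples/sec; the class and method 
names are made up for this sketch. Under that assumption the import takes 
roughly two days.

```java
// Rough estimate only: assumes a constant import rate for the whole dump.
class ImportTimeEstimate {

    // Hours needed to import `triples` at a constant `triplesPerSecond`.
    static double hours(long triples, long triplesPerSecond) {
        return (double) triples / triplesPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        long triples = 1_200_000_000L; // ~size of the freebase dump
        long rate = 7_000L;            // degraded rate observed on an SSD
        System.out.println("estimated hours: " + hours(triples, rate));
    }
}
```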

As most of the triples contained in the freebase dump are not relevant for 
indexing, this issue will introduce a new feature to the Jena TDB Indexing 
Source that allows triples to be filtered out at a very low level.

This filter will operate on the triples provided by the RIOT parser and define 
a single method:

    accept(Node subject, Node predicate, Node object) : boolean
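
A minimal sketch of how such a filter might look. To keep it compilable 
without Jena on the classpath, plain strings stand in for the Jena Node 
arguments, and the type names TripleFilter and PredicateAllowListFilter are 
hypothetical, not the names that will end up in Stanbol.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the proposed filter interface.
// Assumption: Strings replace the Jena Node arguments of the real method.
interface TripleFilter {
    // Return true to import the triple, false to skip it.
    boolean accept(String subject, String predicate, String object);
}

// Example implementation: keep only triples whose predicate is allow-listed.
final class PredicateAllowListFilter implements TripleFilter {
    private final Set<String> allowed;

    PredicateAllowListFilter(Set<String> allowed) {
        this.allowed = new HashSet<>(allowed); // defensive copy
    }

    @Override
    public boolean accept(String subject, String predicate, String object) {
        return allowed.contains(predicate);
    }
}

class TripleFilterDemo {
    public static void main(String[] args) {
        TripleFilter filter = new PredicateAllowListFilter(new HashSet<>(
                Arrays.asList("http://www.w3.org/2000/01/rdf-schema#label")));
        // A label triple passes, anything else is dropped before it is stored.
        System.out.println(filter.accept("http://example.org/e1",
                "http://www.w3.org/2000/01/rdf-schema#label", "Some label"));
        System.out.println(filter.accept("http://example.org/e1",
                "http://example.org/irrelevant", "some value"));
    }
}
```

Because the decision is made per triple while parsing, rejected triples never 
reach the TDB store.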

In addition, the interface will extend IndexingComponent, which will allow it 
to be configured via the configuration file of the 

    org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource

The parameter used to configure the filter will be called "import-filter" and 
the value MUST be the class name of the implementation to use.

The configuration of the jenatdb.RdfIndexingSource will be passed to the Import 
Filter's #setConfiguration(..) method. This means that users will need to add 
the configuration properties for the Import Filter to the configuration of the 
RdfIndexingSource.
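
The configuration flow described above could be sketched roughly as follows. 
This is not the actual Stanbol code: ConfigurableFilter, FilterLoader, 
PrefixFilter and the "subject-prefix" property are hypothetical, and the 
setConfiguration(Map) signature is an assumption. Only the "import-filter" 
parameter name is taken from this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a filter that is also an indexing component.
interface ConfigurableFilter {
    void setConfiguration(Map<String, Object> config); // assumed signature
    boolean accept(String subject, String predicate, String object);
}

final class FilterLoader {
    static final String PARAM_IMPORT_FILTER = "import-filter";

    // Instantiates the class named by "import-filter" and hands it the
    // complete configuration of the indexing source.
    static ConfigurableFilter load(Map<String, Object> config) {
        Object className = config.get(PARAM_IMPORT_FILTER);
        if (className == null) {
            return null; // no filter configured: import everything
        }
        try {
            ConfigurableFilter filter = Class.forName(className.toString())
                    .asSubclass(ConfigurableFilter.class)
                    .getDeclaredConstructor()
                    .newInstance();
            filter.setConfiguration(config);
            return filter;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                    "Unable to create import filter " + className, e);
        }
    }
}

// Hypothetical filter reading its own property from the shared configuration.
final class PrefixFilter implements ConfigurableFilter {
    private String prefix = "";

    public PrefixFilter() {}

    @Override
    public void setConfiguration(Map<String, Object> config) {
        Object value = config.get("subject-prefix"); // made-up property name
        prefix = value == null ? "" : value.toString();
    }

    @Override
    public boolean accept(String subject, String predicate, String object) {
        return subject.startsWith(prefix);
    }
}

class FilterLoaderDemo {
    public static void main(String[] args) {
        Map<String, Object> config = new HashMap<>();
        config.put("import-filter", PrefixFilter.class.getName());
        config.put("subject-prefix", "http://example.org/keep/");
        ConfigurableFilter filter = FilterLoader.load(config);
        System.out.println(filter.accept("http://example.org/keep/a", "p", "o"));
    }
}
```

Since the filter receives the same map as the indexing source, filter-specific 
keys should be named so that they do not clash with the source's own 
parameters.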

To keep things simple the RdfImportFilter interface will be specific to the 
Jena TDB Indexing Source.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
