[
https://issues.apache.org/jira/browse/STANBOL-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018093#comment-14018093
]
A. Soroka commented on STANBOL-1125:
------------------------------------
Freebase is not the only source of data that would benefit from this kind of
tool. I've been indexing dumps of the U.S. Library of Congress' Linked Data
([#note1]) and found the resource requirements onerous, and I'm going to have
to set up some kind of special workflow to index the Virtual International
Authority File. ([#note2]) Many of the data sources in which I am interested
require little or no processing of the kind we would write in LDPath programs--
they are to be indexed virtually "plain". For those data sources that do
require some processing of simple kinds (e.g. translating predicates), it might
be possible to allow that as a step on a single Representation before storing
that Representation into a Yard, instead of as a transduction over the whole
store of RDF.
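As a rough illustration of that per-Representation step, here is a minimal Python sketch. It assumes a Representation can be stood in for by a plain dict from predicate URI to a list of values; `translate_predicates` and the mapping are hypothetical names for illustration and are not part of the Stanbol Entityhub API.

```python
from typing import Dict, List


def translate_predicates(
    representation: Dict[str, List[str]],
    mapping: Dict[str, str],
) -> Dict[str, List[str]]:
    """Return a copy of one representation with its predicates renamed.

    `representation` is a plain dict standing in for a single entity
    (assumption: this is not the Stanbol Representation interface).
    Predicates missing from `mapping` are kept unchanged, and values of
    predicates that map to the same target are merged in encounter order.
    """
    translated: Dict[str, List[str]] = {}
    for predicate, values in representation.items():
        target = mapping.get(predicate, predicate)
        # extend() copies the values, so the input lists are never mutated.
        translated.setdefault(target, []).extend(values)
    return translated
```

Because the function only ever sees one entity at a time, it needs no RDF store behind it and could run immediately before the entity is written out.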
I would be interested in working on this problem. It seems to me at first
glance that, with suitable restrictions on the inputs (perhaps N-Triples files
sorted by subject URI?), a very performant streaming solution could be developed
at minimal cost in memory.
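To make the idea concrete, here is a minimal Python sketch of such a streaming pass. It assumes the input is N-Triples with one well-formed triple per line, already sorted by subject, and it uses a deliberately simplified regular expression rather than a full N-Triples parser. `parse_triples` and `stream_entities` are hypothetical names; nothing here is existing Stanbol code.

```python
import itertools
import re
from typing import Dict, Iterable, Iterator, List, Tuple

# Simplified pattern: subject (IRI or blank node), predicate IRI, then the raw
# object text up to the terminating " ." (assumption: one triple per line).
_TRIPLE = re.compile(r'^(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.*?)\s*\.\s*$')


def parse_triples(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (subject, predicate, object) tuples, skipping blanks and comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        match = _TRIPLE.match(line)
        if match is None:
            raise ValueError('not a recognisable N-Triples line: %r' % line)
        yield match.group(1), match.group(2), match.group(3)


def stream_entities(
    lines: Iterable[str],
) -> Iterator[Tuple[str, Dict[str, List[str]]]]:
    """Group consecutive triples that share a subject into one entity.

    Only the triples of the current subject are held in memory, which is
    why the input must be sorted by subject beforehand.
    """
    triples = parse_triples(lines)
    for subject, group in itertools.groupby(triples, key=lambda t: t[0]):
        properties: Dict[str, List[str]] = {}
        for _, predicate, obj in group:
            properties.setdefault(predicate, []).append(obj)
        yield subject, properties
```

Passing an open file object as `lines` keeps the whole pass lazy, so each entity could be handed to the index as soon as its last triple has been read. If the input is not sorted, the same subject is simply emitted more than once, so the sorting restriction is what makes the approach correct.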
({anchor:note1}1) Controlled vocabularies for subject headings and name
authorities widely used in library metadata, available at:
http://id.loc.gov/download/
({anchor:note2}2) Also a name authority system, but federated from the linked
data of several different national libraries, available at:
http://viaf.org/viaf/data/
> Create a lightweight EntityHub Indexing Tool for Freebase
> ---------------------------------------------------------
>
> Key: STANBOL-1125
> URL: https://issues.apache.org/jira/browse/STANBOL-1125
> Project: Stanbol
> Issue Type: Improvement
> Components: Entityhub
> Reporter: Rafa Haro
>
> Due to the enormous size of the dumps, the current Freebase indexing tool in
> Stanbol can barely work on machines without several gigabytes of RAM and/or
> SSD disks. The JenaTDB importer has been identified as the bottleneck of the
> indexing process. Using an RDF database is mandatory in order to, for
> instance, use LDPath programs at indexing time.
> The idea is to develop a lightweight indexing tool that streams data from the
> dumps and pushes it directly to Solr. Despite losing some functionality, it
> would then be possible for any user to generate Freebase EntityHub indexes
> from any dump.