Hi, Stanbol folks!

I'm trying to index a largish (about 260M triples) dataset into a Solr-backed
EntityHub [1], but I'm not having much success. I'm hitting "out of heap"
(java.lang.OutOfMemoryError) failures in the load-to-Jena stage, even with a
4GB heap. The process doesn't get past roughly 33M triples.

I know that Jena can scale well beyond that size, so I'm wondering: has anyone
else indexed a dataset of this size with success? Is it possible that there's
a memory leak in the generic RDF indexer code?

I've considered breaking the dataset up, but it's full of blank nodes, which
makes splitting trickier (there's a rough sketch of what I mean below), and
I'm not at all confident that I could merge the resulting Solr indexes back
into a coherent EntityHub Solr core.
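
To make the splitting problem concrete, the only safe-ish split I've come up
with is sketched below: shard the N-Triples dump line by line, but route every
statement that mentions a blank node to one dedicated chunk, so statements
that share a bnode are never separated. The file names and chunk size are
placeholders, and the "_:" test is deliberately crude (it would also match
literals that happen to contain that string).

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.zip.GZIPInputStream;

    public class SplitNTriples {
        public static void main(String[] args) throws IOException {
            long linesPerChunk = 10_000_000L; // placeholder chunk size
            int chunk = 0;
            long written = 0;

            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                         new GZIPInputStream(Files.newInputStream(
                                 Paths.get("authoritiesnames.nt.madsrdf.gz"))),
                         StandardCharsets.UTF_8));
                 // every blank-node statement goes to this one chunk
                 BufferedWriter bnodes = Files.newBufferedWriter(
                         Paths.get("chunk-bnodes.nt"), StandardCharsets.UTF_8)) {

                BufferedWriter out = Files.newBufferedWriter(
                        Paths.get("chunk-" + chunk + ".nt"),
                        StandardCharsets.UTF_8);
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.contains("_:")) {      // crude blank-node test
                        bnodes.write(line);
                        bnodes.newLine();
                    } else {
                        out.write(line);
                        out.newLine();
                        if (++written >= linesPerChunk) { // rotate chunks
                            out.close();
                            chunk++;
                            written = 0;
                            out = Files.newBufferedWriter(
                                    Paths.get("chunk-" + chunk + ".nt"),
                                    StandardCharsets.UTF_8);
                        }
                    }
                }
                out.close();
            }
        }
    }

Even then, because the data is so bnode-heavy, I suspect the blank-node chunk
would end up nearly as large as the original dump, which is part of why I'm
not keen on this route.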

I'd be grateful for any advice on other routes to take (other than assembling
an even larger heap for the process, which is not a good long-term solution).
For example, is there a supported way to index into a Clerezza-backed
EntityHub? That would let me tackle loading into Jena TDB without using
Stanbol gear.
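
On that last point, I assume the TDB load itself could be done outside Stanbol
with Jena's own tooling (the tdbloader script, or something like the sketch
below using the TDB API). The dataset directory is a placeholder, the package
names are the Jena 2.x ones, and I pass the language explicitly because the
".madsrdf.gz" suffix won't be auto-detected as N-Triples.

    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class LoadAuthoritiesIntoTdb {
        public static void main(String[] args) {
            // Open (or create) an on-disk TDB dataset; triples are written
            // to the directory rather than held on the heap.
            Dataset dataset = TDBFactory.createDataset("/data/tdb-authorities");
            try {
                // RIOT streams the parse and ungzips the file transparently;
                // the language is given explicitly since the file name will
                // not be recognised as N-Triples.
                RDFDataMgr.read(dataset.getDefaultModel(),
                        "authoritiesnames.nt.madsrdf.gz",
                        Lang.NTRIPLES);
            } finally {
                dataset.close(); // syncs the TDB indexes to disk
            }
        }
    }

Whether a Clerezza-backed EntityHub could then be configured on top of that
store is exactly the part I don't know, hence the question.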

Thanks!

[1] http://id.loc.gov/static/data/authoritiesnames.nt.madsrdf.gz

---
A. Soroka
The University of Virginia Library
