Hi, Stanbol folks!

I'm trying to index a largeish dataset (about 260M triples) [1] into a Solr-backed EntityHub, but I'm not having much success: I'm hitting out-of-heap-space errors in the load-to-Jena stage, even with a 4GB heap, and the process doesn't make it past about 33M triples. I know that Jena can scale well beyond that size, so I'm wondering whether anyone else has had success with datasets of a similar size. Is it possible that there's a memory leak in the generic RDF indexer code?

I've considered breaking up the dataset, but it's full of blank nodes, which makes splitting trickier, and I'm not at all confident that I could successfully merge the resulting Solr indexes into a coherent EntityHub Solr core.

I'd be grateful for any advice or suggestions on other routes to take (other than assembling an even larger heap for the process, which isn't a good long-term solution). For example, is there a supported way to index into a Clerezza-backed EntityHub? That would let me tackle the problem of loading into Jena TDB without using Stanbol gear.

Thanks!

[1] http://id.loc.gov/static/data/authoritiesnames.nt.madsrdf.gz

---
A. Soroka
The University of Virginia Library
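
P.S. To make the last question a bit more concrete, below is the sort of thing I have in mind by "loading into Jena TDB without Stanbol gear" -- an untested sketch using plain Jena. The store location and class name are just placeholders of mine, and I'd gunzip the dump first; for 260M triples I'd presumably reach for Jena's bulk loader rather than a plain read, but the question is the same either way: could a TDB store built like this back a Clerezza EntityHub?

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.tdb.TDB;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class LoadAuthorities {
        public static void main(String[] args) {
            // Open (or create) an on-disk TDB store; the location is a placeholder.
            Dataset dataset = TDBFactory.createDataset("/data/tdb-authorities");
            Model model = dataset.getDefaultModel();
            // Stream the gunzipped N-Triples dump into the default graph.
            model.read("file:authoritiesnames.nt.madsrdf", "N-TRIPLE");
            // Flush to disk before closing.
            TDB.sync(dataset);
            dataset.close();
        }
    }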
I'm trying to index a largeish (about 260M triples) dataset into a Solr-backed EntityHub [1], but not having much success. I'm getting "out of heap" errors in the load-to-Jena stage, even with a 4GB heap. The process doesn't make it past about 33M triples. I know that Jena can scale way, way beyond that size, so I'm wondering if anyone else has tried datasets of similar size with success? Is it possible that there's a memory leak in the generic RDF indexer code? I've considered trying to break up the dataset, but it's full of blank nodes, which makes that trickier, and I'm not at all confident that I could successfully merge the resulting Solr indexes to make a coherent EntityHub Solr core. I'd be grateful for any advice or suggestions as to other routes to take (other than trying to assemble an even larger heap for the process, which is not a very good long-term solution). For example, is there a supported way to index into a Clerezza-backed EntityHub, which would let me tackle the problem of loading into Jena TDB without using Stanbol gear? Thanks! [1] http://id.loc.gov/static/data/authoritiesnames.nt.madsrdf.gz - --- A. Soroka The University of Virginia Library -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.19 (Darwin) Comment: GPGTools - http://gpgtools.org iQEcBAEBAgAGBQJR7thUAAoJEATpPYSyaoIk93oIAIfS17YRkbGcrDf0x/mlgE3P x/iziR0aT+MXUgeKU0jYE72vp1ixvmypkVnWkZqns5w4rKbd1OothnMHPPTbK6H9 EamRaAMylg3vXtdelw4ot9sr0Rd+3kIv63YMUne8VkU2/boXoEB+sDpm+QXlGJmF Fj1Tpq22PIGpi+haYjauYOx2kbOx33OHHZ62IWk5Fa85rTV80M5m/avBnOljnZKS E20HgXK5fjBCTPWyjyr8gl4Ur15eBPD/eetT/7jr+TLMG+SMIB/TdS2kyPNLGa7O w2yiuQeuxHyrVlmHQo6db9gEh2RvrZfhNgcC+EbbCEA6nT502Fa0URKzC+oi50w= =S2QM -----END PGP SIGNATURE-----