See: https://issues.apache.org/jira/browse/STANBOL-1143

Thanks! Is there any plan to move fully over to Git for version control?

---
A. Soroka
The University of Virginia Library

On Jul 25, 2013, at 12:30 AM, Rupert Westenthaler wrote:

> Hi
>
> On Wed, Jul 24, 2013 at 6:32 PM, aj...@virginia.edu <aj...@virginia.edu>
> wrote:
>> Okay, I've made those changes (and improved the help for Urify a bit) at:
>>
>> https://github.com/ajs6f/stanbol/tree/UrifyImprovements
>>
>> but because that's a fork of Stanbol's repo, I don't think I can issue a
>> pull request to y'all for it.
>
> Do you know how to create a patch? If so, it would be good if you could
> create one and attach it to a Jira issue.
>
>> On the main topic, as I understand your explanation, it _should_ be possible
>> to load a dataset with massive numbers of blank nodes into Jena without
>> swamping the heap, but it would require that Jena persist its store of blank
>> nodes to disk while the import is going on, which it doesn't do and which
>> would be horribly slow. Is that a correct understanding?
>
> At least this is my understanding. For details I would ask this same
> question on the Apache Jena mailing list.
>
> best
> Rupert
>
>> Thanks very much for your help!
>>
>> On Jul 24, 2013, at 9:43 AM, Rupert Westenthaler wrote:
>>
>>> On Wed, Jul 24, 2013 at 2:44 PM, aj...@virginia.edu <aj...@virginia.edu>
>>> wrote:
>>>> Thanks! This is exactly what I needed to hear. I will try out Urify pronto.
>>>>
>>>> Is there a convenient place I can add some documentation about these
>>>> issues with a pointer to Urify? Perhaps in the README for the generic RDF
>>>> indexer?
>>>
>>> I think the README would be a good place to add such information. The
>>> problem with importing datasets with a lot of Bnodes is neither
>>> Stanbol- nor Jena-specific. AFAIK all RDF frameworks are affected by
>>> that.
>>>
>>> best
>>> Rupert
>>>
>>>> On Jul 24, 2013, at 12:24 AM, Rupert Westenthaler wrote:
>>>>
>>>>> On Tue, Jul 23, 2013 at 9:24 PM, aj...@virginia.edu <aj...@virginia.edu>
>>>>> wrote:
>>>>>> Hi, Stanbol folks!
>>>>>>
>>>>>> I'm trying to index a largeish (about 260M triples) dataset into a
>>>>>> Solr-backed EntityHub [1], but not having much success. I'm getting "out
>>>>>> of heap" errors in the load-to-Jena stage, even with a 4GB heap. The
>>>>>> process doesn't make it past about 33M triples.
>>>>>>
>>>>>> I know that Jena can scale way, way beyond that size, so I'm wondering
>>>>>> if anyone else has tried datasets of similar size with success? Is it
>>>>>> possible that there's a memory leak in the generic RDF indexer code?
>>>>>>
>>>>>> I've considered trying to break up the dataset, but it's full of blank
>>>>>> nodes, which makes that trickier, and I'm not at all confident that I
>>>>>> could successfully merge the resulting Solr indexes to make a coherent
>>>>>> EntityHub Solr core.
>>>>>
>>>>> The blank nodes are the reason for the OOM errors, as Jena needs to
>>>>> keep all blank nodes in memory when parsing the RDF file. I had a
>>>>> similar problem when importing Musicbrainz with > 250 million Bnodes.
>>>>>
>>>>> Because of that I created a small utility that converts Bnodes to
>>>>> URNs. It is called Urify (org.apache.stanbol.entityhub.indexing.Urify)
>>>>> and is part of the entityhub.indexing.core module. I have always run
>>>>> it from Eclipse, but you should also be able to run it with java by
>>>>> putting one of the Entityhub Indexing Tool runnable jars on the
>>>>> classpath.
>>>>>
>>>>> The other possibility is to increase the heap memory so that all
>>>>> Bnodes fit into memory. However, NOTE that the Stanbol Entityhub
>>>>> also does not support Bnodes. Therefore those nodes would get
>>>>> ignored or, if enabled, be converted to URNs during the indexing
>>>>> step (see STANBOL-765 [1]).
>>>>>
>>>>> So my advice would be to use the Urify utility to transcode the RDF
>>>>> dump before importing the data into Jena TDB.
>>>>>
>>>>> best
>>>>> Rupert
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/STANBOL-765
>>>>>
>>>>>> I'd be grateful for any advice or suggestions as to other routes to take
>>>>>> (other than trying to assemble an even larger heap for the process,
>>>>>> which is not a very good long-term solution). For example, is there a
>>>>>> supported way to index into a Clerezza-backed EntityHub, which would let
>>>>>> me tackle the problem of loading into Jena TDB without using Stanbol
>>>>>> gear?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> [1] http://id.loc.gov/static/data/authoritiesnames.nt.madsrdf.gz
>>>>>
>>>>> --
>>>>> | Rupert Westenthaler rupert.westentha...@gmail.com
>>>>> | Bodenlehenstraße 11 ++43-699-11108907
>>>>> | A-5500 Bischofshofen
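
For readers who want to see what "converting Bnodes to URNs" amounts to, here is a minimal sketch of the idea in Python. It is NOT the Urify implementation and does not reflect its options; the function names and the `urn:bnode:` prefix are assumptions chosen for illustration. It assumes plain N-Triples input (one triple per line, no trailing comments on triple lines) and rewrites one line at a time, so memory use stays flat no matter how many blank nodes the dump contains.

```python
import io


def urify_term(term, prefix):
    """Turn a blank node label (_:x) into an IRI; leave other terms alone."""
    if term.startswith("_:"):
        return "<" + prefix + term[2:] + ">"
    return term


def urify_line(line, prefix="urn:bnode:"):
    """Rewrite blank nodes in the subject and object of one N-Triples line.

    Assumption: the line is a well-formed triple with no trailing comment.
    A malformed line (fewer than three terms) raises ValueError.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return line  # blank lines and comment lines pass through unchanged
    # Subject and predicate never contain whitespace; the object may (literals).
    subject, predicate, rest = stripped.split(None, 2)
    obj = rest.rstrip()
    if obj.endswith("."):
        obj = obj[:-1].rstrip()  # drop the statement terminator
    # Only whole terms are inspected, so "_:" inside a literal is not touched.
    return f"{urify_term(subject, prefix)} {predicate} {urify_term(obj, prefix)} .\n"


def urify_stream(src, dst, prefix="urn:bnode:"):
    """Stream lines from src to dst, keeping no per-node state in memory."""
    for line in src:
        dst.write(urify_line(line, prefix))


# Small in-memory demonstration.
sample = io.StringIO("_:b1 <http://example.org/p> _:b2 .\n")
out = io.StringIO()
urify_stream(sample, out)
# out.getvalue() == "<urn:bnode:b1> <http://example.org/p> <urn:bnode:b2> .\n"
```

For a gzipped dump, the same function can be fed from `gzip.open(path, "rt", encoding="utf-8")`. Because blank node labels are only unique within a single file, a different prefix per input file would be one way to avoid accidental collisions when several dumps are converted.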