Re: large dataset into EntityHub

aj...@virginia.edu Tue, 30 Jul 2013 09:34:38 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

See: https://issues.apache.org/jira/browse/STANBOL-1143


Thanks!

Is there any plan to move fully over to Git for VC?

- ---
A. Soroka
The University of Virginia Library

On Jul 25, 2013, at 12:30 AM, Rupert Westenthaler wrote:

> Hi
> 
> On Wed, Jul 24, 2013 at 6:32 PM, aj...@virginia.edu <aj...@virginia.edu> 
> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>> 
>> Okay, I've made those changes (and improved the help for Urify a bit) at:
>> 
>> https://github.com/ajs6f/stanbol/tree/UrifyImprovements
>> 
>> but because that's a fork of Stanbol's repo, I don't think I can issue a 
>> pull request to y'all for it.
>> 
> 
> Do you know how to create a patch? If so, it would be good if you could
> create one and attach it to a Jira issue.
> 
>> On the main topic, as I understand your explanation, it _should_ be possible 
>> to load a dataset with massive numbers of blank nodes into Jena without 
>> swamping the heap, but it would require that Jena persist its store of blank 
>> nodes to disk while the import is going on, which it doesn't do and which 
>> would be horribly slow. Is that a correct understanding?
>> 
> 
> At least this is my understanding. For details I would ask this same
> question on the Apache Jena mailing list.
> 
> best
> Rupert
> 
>> Thanks very much for your help!
>> 
>> - ---
>> A. Soroka
>> The University of Virginia Library
>> 
>> On Jul 24, 2013, at 9:43 AM, Rupert Westenthaler wrote:
>> 
>>> On Wed, Jul 24, 2013 at 2:44 PM, aj...@virginia.edu <aj...@virginia.edu> 
>>> wrote:
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA1
>>>> 
>>>> Thanks! This is exactly what I needed to hear. I will try out Urify pronto.
>>>> 
>>>> Is there a convenient place I can add some documentation about these 
>>>> issues with a pointer to Urify? Perhaps in the README for the generic RDF 
>>>> indexer?
>>>> 
>>> 
>>> I think the README would be a good place to add such information. The
>>> problem with importing datasets with a lot of Bnodes is neither
>>> Stanbol nor Jena specific. AFAIK all RDF frameworks are affected by
>>> that.
>>> 
>>> best
>>> Rupert
>>> 
>>>> - ---
>>>> A. Soroka
>>>> The University of Virginia Library
>>>> 
>>>> On Jul 24, 2013, at 12:24 AM, Rupert Westenthaler wrote:
>>>> 
>>>>> On Tue, Jul 23, 2013 at 9:24 PM, aj...@virginia.edu <aj...@virginia.edu> 
>>>>> wrote:
>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> Hash: SHA1
>>>>>> 
>>>>>> Hi, Stanbol folks!
>>>>>> 
>>>>>> I'm trying to index a largeish (about 260M triples) dataset into a 
>>>>>> Solr-backed EntityHub [1], but not having much success. I'm getting "out 
>>>>>> of heap" errors in the load-to-Jena stage, even with a 4GB heap. The 
>>>>>> process doesn't make it past about 33M triples.
>>>>>> 
>>>>>> I know that Jena can scale way, way beyond that size, so I'm wondering 
>>>>>> if anyone else has tried datasets of similar size with success? Is it 
>>>>>> possible that there's a memory leak in the generic RDF indexer code?
>>>>>> 
>>>>>> I've considered trying to break up the dataset, but it's full of blank 
>>>>>> nodes, which makes that trickier, and I'm not at all confident that I 
>>>>>> could successfully merge the resulting Solr indexes to make a coherent 
>>>>>> EntityHub Solr core.
>>>>> 
>>>>> The blank nodes are the reason for the OOM errors, as Jena needs to
>>>>> keep all blank nodes in memory when parsing the RDF file. I had a
>>>>> similar problem when importing Musicbrainz with > 250 million Bnodes.
>>>>> 
>>>>> Because of that I created a small utility that converts BNodes to
>>>>> URNs. It is called Urify (org.apache.stanbol.entityhub.indexing.Urify)
>>>>> and is part of the entityhub.indexing.core module. I have always run
>>>>> it from Eclipse, but you should also be able to run it with java by
>>>>> putting one of the Entityhub Indexing Tool runnable jars on the
>>>>> classpath.
>>>>> 
>>>>> The other possibility is to increase the heap memory so that all
>>>>> Bnodes fit into memory. However, NOTE that the Stanbol Entityhub
>>>>> also does not support Bnodes. Those nodes would therefore get
>>>>> ignored or, if enabled, be converted to URNs during the indexing
>>>>> step (see STANBOL-765 [1]).
>>>>> 
>>>>> So my advice would be to use the Urify utility to transcode the RDF
>>>>> dump before importing the data into Jena TDB.
>>>>> 
>>>>> best
>>>>> Rupert
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/STANBOL-765
>>>>> 
>>>>>> 
>>>>>> I'd be grateful for any advice or suggestions as to other routes to take 
>>>>>> (other than trying to assemble an even larger heap for the process, 
>>>>>> which is not a very good long-term solution). For example, is there a 
>>>>>> supported way to index into a Clerezza-backed EntityHub, which would let 
>>>>>> me tackle the problem of loading into Jena TDB without using Stanbol 
>>>>>> gear?
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> [1] http://id.loc.gov/static/data/authoritiesnames.nt.madsrdf.gz
>>>>>> 
>>>>>> - ---
>>>>>> A. Soroka
>>>>>> The University of Virginia Library
>>>>>> 
>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>> Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
>>>>>> Comment: GPGTools - http://gpgtools.org
>>>>>> 
>>>>>> iQEcBAEBAgAGBQJR7thUAAoJEATpPYSyaoIk93oIAIfS17YRkbGcrDf0x/mlgE3P
>>>>>> x/iziR0aT+MXUgeKU0jYE72vp1ixvmypkVnWkZqns5w4rKbd1OothnMHPPTbK6H9
>>>>>> EamRaAMylg3vXtdelw4ot9sr0Rd+3kIv63YMUne8VkU2/boXoEB+sDpm+QXlGJmF
>>>>>> Fj1Tpq22PIGpi+haYjauYOx2kbOx33OHHZ62IWk5Fa85rTV80M5m/avBnOljnZKS
>>>>>> E20HgXK5fjBCTPWyjyr8gl4Ur15eBPD/eetT/7jr+TLMG+SMIB/TdS2kyPNLGa7O
>>>>>> w2yiuQeuxHyrVlmHQo6db9gEh2RvrZfhNgcC+EbbCEA6nT502Fa0URKzC+oi50w=
>>>>>> =S2QM
>>>>>> -----END PGP SIGNATURE-----
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> | Rupert Westenthaler             rupert.westentha...@gmail.com
>>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>>> | A-5500 Bischofshofen
>>>> 
>>>> -----BEGIN PGP SIGNATURE-----
>>>> Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
>>>> Comment: GPGTools - http://gpgtools.org
>>>> 
>>>> iQEcBAEBAgAGBQJR78w5AAoJEATpPYSyaoIkPz0H+gMTW5sYaylcAAuzOTvFCdZm
>>>> csP+aqq/w0QwUBe/whSUZGU6Rl55zG+PT1I7ZViVC+tRtIBHyvrQ0t6OqU0Hnb+S
>>>> Z8cxx82DDmWDs2euXN0mVVM0/oWkjL6X46TL3bfNqqo5wqbaVeoRZEeFj4T1hnuP
>>>> nW1gPwo1Tgi2D4RlBnf1IadFTcTVoWJgiRW50zPnH3mGTgynfDLR3f0+7C8WoZOi
>>>> 06lX2700oChPS6s46As2ybKkZCIpw6bGKwMqKUtYH+58S38ZXppMRhC9XHZSqPL4
>>>> cmlMRAnqHxGAonPgtXrrHXyzhhvRGvKxCAz1H2MyAwueLCZD3KbpWu1f8hb9a0o=
>>>> =J3qL
>>>> -----END PGP SIGNATURE-----
>>> 
>>> 
>>> 
>>> --
>>> | Rupert Westenthaler             rupert.westentha...@gmail.com
>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>> | A-5500 Bischofshofen
>> 
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
>> Comment: GPGTools - http://gpgtools.org
>> 
>> iQEcBAEBAgAGBQJR8AGoAAoJEATpPYSyaoIk3Q4IAJ+ySsMM6sbaB5Rt9d5Fky8I
>> gSIQB2697hZJRYGYLXQl9RLqyC8UxRPCYS1u5RypthySto7GPKA22jwR8hCoCRlF
>> xAKbJmxWGpK0hoLyIc21oGhg0mF1Co2dDFPSD0L1z92/+iS6gXyDjYdgoZ3iQKcT
>> k5N0d/BmzQTAKXVCLYaBIxXodP4UtBu/XUO32gWg+ghSU8TKbfOCTzGncD5YzGVD
>> 5lPZWMfO1JunSPk1ZkJOsB0pWoSFVOKP5yfcfJ2ygT4xH3m3WrI8iwYiH7Iw+tnp
>> P07Rs99Mm4/doIx+Jzrcxeob2dOTBIZxIQ5Dh7MZkXoQQs1QtiNOEG3Prqa4Iwo=
>> =u9de
>> -----END PGP SIGNATURE-----
> 
> 
> 
> -- 
> | Rupert Westenthaler             rupert.westentha...@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen

-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJR9+qzAAoJEATpPYSyaoIk0oAH/1k4svsvZ0zB+vAaWjlB+Y3/
3d75j9Pe0h1Wk3pmWeZ0m8bJYnFg0SXIdPWtYMeT4Zthx4njNUihUrZfPx7lijd0
moUbRVPbk3vzJxiyUim480wN4vVMmPKHhz9AsxLeAnAkIeQ5EnNJvQRtJa10ypCY
PLj0OqjOdR6/PLsPQlxW/HOI2Mldk3JCCnyXxaRbLRMxvoQT7yjWHrCGU3PZ5gFH
TUBhNBhxpJd5ZZ9/EXNy88XbeSSA5W9+hNqqhutCDufRLn5HLBYBMybgFhvKh+nx
xIwl8mIfm4aytsFhO48g6luemjUUvvCivEkuKqSC9SgBchJAcv6KmLRZ/0A4thk=
=KECy
-----END PGP SIGNATURE-----
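
Note for readers of this thread: the idea behind Urify, as described in the quoted advice above (replace every blank node with a URN before the dump reaches Jena), can be sketched in a few lines. This is not the Urify source code, only a minimal Python illustration. It assumes the dump is in N-Triples and that blank node labels are plain ASCII. The names urify_line and urify_lines and the urn:bnode: prefix are made up for this example.

```python
import re

# One pattern, tried left to right at each position: an IRI, a quoted
# literal, or a blank node label. Only the blank node branch captures a group.
_TOKEN = re.compile(
    r'<[^>]*>'                 # IRI, copied through unchanged
    r'|"(?:[^"\\]|\\.)*"'      # literal (with escapes), copied through unchanged
    r'|_:([A-Za-z0-9_](?:[A-Za-z0-9_.\-]*[A-Za-z0-9_\-])?)'  # blank node label
)


def urify_line(line, prefix="urn:bnode:"):
    """Rewrite every blank node in one N-Triples line as <prefix + label>.

    The prefix is a hypothetical URN scheme chosen for this sketch.
    Assumption: labels are ASCII; a label never ends with '.'.
    """
    def swap(match):
        label = match.group(1)
        if label is None:
            # IRI or literal: leave exactly as it was.
            return match.group(0)
        return "<" + prefix + label + ">"

    return _TOKEN.sub(swap, line)


def urify_lines(lines, prefix="urn:bnode:"):
    """Lazily rewrite an iterable of lines, one line at a time."""
    for line in lines:
        yield urify_line(line, prefix)
```

Because each URN is derived from the label itself, no table of blank nodes has to be kept, so memory use stays flat however large the dump is. Blank node labels are only meaningful within one file, so if several dumps are combined, a different prefix per file avoids merging unrelated nodes that happen to share a label.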
