On 07/12/11 20:37, Andy Seaborne wrote:
On 07/12/11 16:07, Jean-Marc Vanel wrote:
Hi

Hi there,


I'm trying to load into TDB the MusicBrainz N-Triples dump from:
http://linkedbrainz.c4dmpresents.org/content/rdf-dump

The MusicBrainz dump in N-Triples is BIG!
% ls -l musicbrainz_ngs_dump.rdf.ttl
-rw-r--r-- 1 jmv jmv 25719386678 16 juin 14:58 musicbrainz_ngs_dump.rdf.ttl

The wc command needs 13 minutes to traverse it!
% time wc musicbrainz_ngs_dump.rdf.ttl
178995221 829703178 25719386678 musicbrainz_ngs_dump.rdf.ttl
wc musicbrainz_ngs_dump.rdf.ttl 710,46s user 20,37s system 94% cpu 12:50,43 total

Which means 179 million triples!

Large but not that large.


I got this stack trace:

Add: 36 000 000 triples (Batch: 204 / Avg: 4 740)
Elapsed: 7 594,09 seconds [2011/12/05 23:16:31 CET]
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

A possible cause is many large literals in the data.

(I don't know the data set)

It looks like it's because the data (which isn't Turtle or, as claimed, N3, but already N-Triples) has a huge number of bNodes in it.

The parser keeps a map of the bNode labels so that when a label is reused in the file it maps to the same bNode, but that requires state to be kept, and the file seems to have a massive number of bNodes.
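A rough way to check that, assuming plain N-Triples with one triple per line, alphanumeric bNode labels, and no literal containing the text "_:", is to count the distinct labels (the sort over a ~25 GB intermediate will take a while):

% grep -o '_:[A-Za-z0-9]*' musicbrainz_ngs_dump.rdf.ttl | sort -u | wc -l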

Set the heap large and hope.

I tried 4G and the parser ran.
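For anyone else hitting this, passing a larger heap to the command-line loader looks something like the following (assuming the TDB scripts honour JVM_ARGS; the tdb.tdbloader class name and $TDBROOT are what I'd expect from a TDB install, so adjust for your setup):

% JVM_ARGS="-Xmx4G" tdbloader --loc=DB musicbrainz_ngs_dump.rdf.ttl

or directly:

% java -Xmx4G -cp "$TDBROOT/lib/*" tdb.tdbloader --loc=DB musicbrainz_ngs_dump.rdf.ttl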

Internally, the parser has different bNode label policies, but these aren't exposed in a convenient way at the moment.

Another way is to convert the bNodes to URIs - the data is already full of synthetic URIs, so why it uses bNodes I don't know.
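Since the dump is N-Triples (one triple per line), something like the following sketch might do it, assuming alphanumeric labels, no literal containing the text "_:", and using a made-up urn:bnode: scheme for the minted URIs:

% sed -E 's/_:([A-Za-z0-9]+)/<urn:bnode:\1>/g' musicbrainz_ngs_dump.rdf.ttl > musicbrainz_uris.nt

That removes the need for the label map entirely, at the cost of one minted URI per bNode label.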

        Andy
