On 07/12/11 20:37, Andy Seaborne wrote:
On 07/12/11 16:07, Jean-Marc Vanel wrote:
Hi
Hi there,
I'm trying to load the musicbrainz N-Triples dump into TDB, from:
http://linkedbrainz.c4dmpresents.org/content/rdf-dump
The musicbrainz dump in N-Triples is BIG!
% ls -l musicbrainz_ngs_dump.rdf.ttl
-rw-r--r-- 1 jmv jmv 25719386678 16 June 14:58
musicbrainz_ngs_dump.rdf.ttl
The wc command needs 13 minutes to traverse it!
% time wc musicbrainz_ngs_dump.rdf.ttl
178995221 829703178 25719386678 musicbrainz_ngs_dump.rdf.ttl
wc musicbrainz_ngs_dump.rdf.ttl 710,46s user 20,37s system 94% cpu
12:50,43 total
Which means 179 million triples (N-Triples is one triple per line)!
Large but not that large.
I got this stack trace:
Add: 36 000 000 triples (Batch: 204 / Avg: 4 740)
Elapsed: 7 594,09 seconds [2011/12/05 23:16:31 CET]
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
A possible cause is many large literals in the data.
(I don't know the data set)
It looks like it's because the data (which isn't Turtle, nor N3 as claimed,
but already N-Triples) has a huge number of bNodes in it.
The parser keeps a map of the bNode labels so that when a label is reused
later in the file it yields the same bNode, but that requires state to be
kept, and this file seems to have a massive number of distinct labels.
Set the heap large and hope.
I tried 4G and the parser ran -
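For example, something along these lines - just a sketch, assuming the stock
tdbloader script (which picks up the JVM_ARGS environment variable) and a
placeholder database location:
% JVM_ARGS="-Xmx4G" tdbloader --loc=DB musicbrainz_ngs_dump.rdf.ttl
The exact variable name and script path may differ depending on how TDB was
installed.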
Internally, the parser has different bNode label policies, but these aren't
exposed in a convenient way currently.
Another way is to convert the bNodes to URIs - the data is already full of
synthetic URIs, so why it uses bNodes at all I don't know.
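Because N-Triples is line-based, a purely textual rewrite works, and a given
label maps to the same URI every time it appears. A rough sketch (it assumes
the bNode labels never occur inside literal values, and the URI prefix is only
a placeholder):
% sed -E 's|_:([A-Za-z0-9]+)|<http://example.org/bnode/\1>|g' musicbrainz_ngs_dump.rdf.ttl > musicbrainz-uris.nt
The output is then plain N-Triples with no bNodes, so the parser no longer has
to keep a label map at all.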
Andy