On 07/12/11 16:07, Jean-Marc Vanel wrote:
Hi
Hi there,
I'm trying to load into TDB the MusicBrainz N-Triples dump from:
http://linkedbrainz.c4dmpresents.org/content/rdf-dump
The MusicBrainz dump in N-Triples is BIG!!!
% ls -l musicbrainz_ngs_dump.rdf.ttl
-rw-r--r-- 1 jmv jmv 25719386678 16 juin 14:58 musicbrainz_ngs_dump.rdf.ttl
The wc command needs 13 minutes to traverse it!
% time wc musicbrainz_ngs_dump.rdf.ttl
178995221 829703178 25719386678 musicbrainz_ngs_dump.rdf.ttl
wc musicbrainz_ngs_dump.rdf.ttl 710,46s user 20,37s system 94% cpu
12:50,43 total
Which means 179 million triples!
Large but not that large.
I got this stack trace:
Add: 36 000 000 triples (Batch: 204 / Avg: 4 740)
Elapsed: 7 594,09 seconds [2011/12/05 23:16:31 CET]
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
A possible cause is many large literals in the data.
(I don't know the data set)
You're on a 64-bit machine, so you can set the heap size larger. The default
in the script is chosen to work on a 32-bit machine (where Java is limited to
~1.5G of heap).
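For example, if your copy of the tdbloader script honors the JVM_ARGS
environment variable (the distribution scripts usually do; if not, edit the
-Xmx value inside the script itself), something like this would raise the
heap. The 4G figure is only an example; pick a value that fits your RAM:

  JVM_ARGS="-Xmx4G" tdbloader --loc ~/tdb_data musicbrainz_ngs_dump.rdf.ttl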
...
tdbloader --loc ~/tdb_data musicbrainz_ngs_dump.rdf.ttl 4380,35s user
111,88s system 47% cpu 2:39:01,08 total
I think I already had 9 million triples from DBpedia in the
database (not sure) before loading mbz.
The loader is faster on an empty database.
I kept the original 1.2 GB heap size from the script.
This was with TDB-0.8.10.
% java -version
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
% uname -a
Linux oem-laptop 2.6.32-5-amd64 #1 SMP Mon Oct 3 03:59:20 UTC 2011
x86_64 GNU/Linux
( Debian )
Try tdbloader2 (which currently only works on Linux and only on empty
databases).
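A minimal sketch of that, assuming you load into a fresh, empty location (the
path here is just an example):

  mkdir ~/tdb_mbz
  tdbloader2 --loc ~/tdb_mbz musicbrainz_ngs_dump.rdf.ttl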
Laptops are slower than servers.
SSDs are faster than mag disks.
Is the current state of the database corrupted?
Most likely.
Of course I can reload with more memory, but I need to understand
better what TDB does while loading.
Apparently it populates a B+tree in memory while loading.
It just happens to hit the limit at that point. The point at which it hits the
heap limit isn't always what is actually using most of the memory.
Does it also happen in normal operation? I mean, for querying.
For loading this dataset, is it just a matter of splitting it before
loading, in several steps?
If so, the tool should do that itself.
No need to split input.
I usually suggest parsing to N-Triples first, through "riot --validate", to
check the data, because you don't want to get part way through a load and then
find it has bad data in it. If you do check, keep the N-Triples, as loading is
faster from N-Triples.
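Something like the following, assuming riot writes N-Triples/N-Quads to stdout
by default (check riot --help for your version; the output filename is just an
example):

  riot --validate musicbrainz_ngs_dump.rdf.ttl
  riot musicbrainz_ngs_dump.rdf.ttl > musicbrainz_ngs_dump.nt
  tdbloader --loc ~/tdb_data musicbrainz_ngs_dump.nt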
Is there any hope that this dataset works on TDB?
Yes.
Have I reached the limit?
The preset limits are of necessity a guess.
PS
There is an unfinished sentence in http://openjena.org/wiki/TDB/Architecture :
"The default storage of each indexes"
thanks