On 07/12/11 16:07, Jean-Marc Vanel wrote:
Hi

Hi there,


I'm trying to load into TDB the MusicBrainz N-Triples dump from:
http://linkedbrainz.c4dmpresents.org/content/rdf-dump

The musicbrainz dump in N-Triples is BIG!!!
  % ls -l  musicbrainz_ngs_dump.rdf.ttl
-rw-r--r-- 1 jmv jmv 25719386678 16 juin  14:58 musicbrainz_ngs_dump.rdf.ttl

The wc command needs 13 minutes to traverse it!
  % time wc musicbrainz_ngs_dump.rdf.ttl
   178995221   829703178 25719386678 musicbrainz_ngs_dump.rdf.ttl
wc musicbrainz_ngs_dump.rdf.ttl  710,46s user 20,37s system 94% cpu
12:50,43 total

Which means 179 million triples!

Large but not that large.


I got this stack trace:

Add: 36 000 000 triples (Batch: 204 / Avg: 4 740)
   Elapsed: 7 594,09 seconds [2011/12/05 23:16:31 CET]
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

A possible cause is many large literals in the data.

(I don't know the data set)

You're on a 64-bit machine, so you can set the heap size larger. The default in the script is sized for a 32-bit machine (where Java is limited to ~1.5G of heap).
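
For example, something along these lines before rerunning the loader (assuming your copy of the tdbloader script reads JVM_ARGS, as the stock script does; the 4G figure is only an illustration, size it to your RAM):

  % export JVM_ARGS="-Xmx4G"
  % tdbloader --loc ~/tdb_data musicbrainz_ngs_dump.rdf.ttl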

...

tdbloader --loc ~/tdb_data musicbrainz_ngs_dump.rdf.ttl  4380,35s user
111,88s system 47% cpu 2:39:01,08 total

I think I already had 9 million triples from DBpedia in the
database (not sure) before loading mbz.

The loader is faster on an empty database.

I kept the script's original heap size of 1.2GB.

This was with TDB 0.8.10.

% java -version
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
% uname -a
Linux oem-laptop 2.6.32-5-amd64 #1 SMP Mon Oct 3 03:59:20 UTC 2011
x86_64 GNU/Linux
(Debian)

Try tdbloader2 (which currently only works on Linux and only on empty databases).
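
Roughly like this (it needs a new, empty location; "~/tdb_data_fresh" below is just an illustrative path):

  % tdbloader2 --loc ~/tdb_data_fresh musicbrainz_ngs_dump.rdf.ttl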

Laptops are slower than servers.
SSDs are faster than mag disks.


Is the current state of the database corrupted?

Most likely.
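
If you start over, the simplest clean-up is to remove the files in the database directory before reloading (assuming ~/tdb_data holds nothing else you want to keep):

  % rm ~/tdb_data/*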


Of course I can reload with more memory, but I need to understand
better what TDB does while loading.
Apparently it populates a B+Tree in memory while loading.

It just happened to hit the limit at that point. The place where it hits the heap limit isn't necessarily where most of the memory is being used.

Does this also happen in normal operation, i.e. for querying?
For loading this dataset, is it just a matter of splitting the
input and loading in several steps?
Then the tool should do that itself.

No need to split the input.

I usually suggest parsing to N-Triples first, through "riot --validate", to check the data, because you don't want to get part way through a load and find it has bad data in it. If you do check, keep the N-Triples, as loading is faster from N-Triples.
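
As a rough sketch (riot's default output to stdout is N-Triples for triple data; "musicbrainz.nt" is just an illustrative filename):

  % riot --validate musicbrainz_ngs_dump.rdf.ttl
  % riot musicbrainz_ngs_dump.rdf.ttl > musicbrainz.nt
  % tdbloader --loc ~/tdb_data musicbrainz.nt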


Is there any hope that this dataset works on TDB?

Yes.

Have I reached the limit?

The preset limits are of necessity a guess.


PS
There is an unfinished sentence in http://openjena.org/wiki/TDB/Architecture :
"The default storage of each indexes"

thanks
