This is probably pushing Jena beyond its design limits, but I thought I'd
report on it anyway.

I needed to test some things with large data sets, so I tried to load the
data from http://basekb.com/

Extracting the tar.gz file creates a directory called baseKB containing
1024 gzipped N-Triples files.
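
For reference, the unpacking was just the usual tar invocation (the
archive name here is from memory, so it may not be exact):

  tar xzf basekb.tar.gz
  ls baseKB/*.nt.gz | wc -l    # confirms 1024 files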

On my first attempt, I grabbed a fresh copy of Fuseki 0.2.5 and started it
with TDB storage. I didn't want to load 1024 files individually from the
control panel, so I used zcat to dump everything into a single file, along
these lines (the output filename here is illustrative):
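
  zcat baseKB/*.nt.gz > all.nt

I then tried loading the combined file from the GUI. This failed in short
order with RIOT complaining of memory: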

13:24:31 WARN  Fuseki               :: [1] RC = 500 : Java heap space
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:234)
at java.lang.StringBuilder.toString(StringBuilder.java:405)
at org.openjena.riot.tokens.TokenizerText.readIRI(TokenizerText.java:476)
...etc...

I'm wondering whether RIOT really needed to run out of memory here.
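
I haven't tried it yet, but parsing the combined file offline with the
riot command-line tool might show whether it's the parser itself or the
HTTP upload path that exhausts the heap. Something like this, reusing the
all.nt name from the concatenation step above:

  riot --validate all.nt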

Anyway, I went back to the individual files, which meant using a non-GUI
approach. I wasn't sure which media type to use for N-Triples, but
N-Triples is a subset of Turtle, so I used text/turtle.

I threw away the DB directory and started again. This time I tried to load
the files with the following bash loop:

for i in *.nt.gz; do
  echo "Loading $i"
  # --upload-file - reads the request body from stdin;
  # -X POST forces POST (curl defaults to PUT with --upload-file)
  zcat "$i" | curl -X POST -H "Content-Type: text/turtle" \
    --upload-file - "http://localhost:3030/dataset/data?default"
done
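
(The explicit -X POST matters: --upload-file makes curl default to PUT,
and on this endpoint a PUT to ?default would replace the default graph on
each request instead of adding to it.)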

This started reasonably well. A number of warnings showed up on the server
side due to bad language tags and invalid IRIs, but it kept going.
However, on the 20th file I started seeing these:
Loading triples0000.nt.gz
Loading triples0001.nt.gz
Loading triples0002.nt.gz
Loading triples0003.nt.gz
Loading triples0004.nt.gz
Loading triples0005.nt.gz
Loading triples0006.nt.gz
Loading triples0007.nt.gz
Loading triples0008.nt.gz
Loading triples0009.nt.gz
Loading triples0010.nt.gz
Loading triples0011.nt.gz
Loading triples0012.nt.gz
Loading triples0013.nt.gz
Loading triples0014.nt.gz
Loading triples0015.nt.gz
Loading triples0016.nt.gz
Loading triples0017.nt.gz
Loading triples0018.nt.gz
Loading triples0019.nt.gz
Error 500: GC overhead limit exceeded


Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)
Loading triples0020.nt.gz
Error 500: GC overhead limit exceeded


Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)
Loading triples0021.nt.gz
Error 500: GC overhead limit exceeded


Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)

This kept going until triples0042.nt.gz, at which point the request hung
for hours.

Meanwhile, on the server, I was still seeing parser warnings, but also
messages like:
17:01:26 WARN  SPARQL_REST$HttpActionREST :: Transaction still active in
endWriter - no commit or abort seen (forced abort)
17:01:26 WARN  Fuseki               :: [33] RC = 500 : GC overhead limit
exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded

When I finally killed it (with Ctrl-C), I got several stack traces in the
stdout log. They appeared to indicate a bad state, so I've saved them and
put them up at: http://pastebin.com/yar5Pq85

While OOM conditions are notoriously hard to handle gracefully, I was
still surprised to see one hit this way, so I thought you might be
interested to see it.

Regards,
Paul Gearon
