This is probably pushing Jena beyond its design limits, but I thought I'd report on it anyway.
I needed to test some things with large data sets, so I tried to load the data from http://basekb.com/. Once extracted from the tar.gz file, it creates a directory called baseKB filled with 1024 gzipped nt files.

On my first attempt, I grabbed a fresh copy of Fuseki 0.2.5 and started it with TDB storage. I didn't want to load 1024 files individually from the control panel, so I used zcat to dump everything into one file and tried loading that from the GUI. This failed in short order with RIOT complaining of memory:

    13:24:31 WARN Fuseki :: [1] RC = 500 : Java heap space
    java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:2694)
        at java.lang.String.<init>(String.java:234)
        at java.lang.StringBuilder.toString(StringBuilder.java:405)
        at org.openjena.riot.tokens.TokenizerText.readIRI(TokenizerText.java:476)
    ...etc...

I'm wondering if RIOT really needed to run out of memory?

Anyway, I went back to the individual files. That meant using a non-GUI approach. I wasn't sure which media type to use for nt, but N-Triples is compatible with Turtle, so I used text/turtle. I threw away the DB directory and started again. This time I tried to load the files with the following bash:

    for i in *.nt.gz; do
      echo "Loading $i"
      zcat "$i" | curl -X POST -H "Content-Type: text/turtle" --upload-file - "http://localhost:3030/dataset/data?default"
    done

This started reasonably well. A number of warnings showed up on the server side, due to bad language tags and invalid IRIs, but it kept going.
However, on the 20th file I started seeing these:

    Loading triples0000.nt.gz
    Loading triples0001.nt.gz
    Loading triples0002.nt.gz
    Loading triples0003.nt.gz
    Loading triples0004.nt.gz
    Loading triples0005.nt.gz
    Loading triples0006.nt.gz
    Loading triples0007.nt.gz
    Loading triples0008.nt.gz
    Loading triples0009.nt.gz
    Loading triples0010.nt.gz
    Loading triples0011.nt.gz
    Loading triples0012.nt.gz
    Loading triples0013.nt.gz
    Loading triples0014.nt.gz
    Loading triples0015.nt.gz
    Loading triples0016.nt.gz
    Loading triples0017.nt.gz
    Loading triples0018.nt.gz
    Loading triples0019.nt.gz
    Error 500: GC overhead limit exceeded
    Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)
    Loading triples0020.nt.gz
    Error 500: GC overhead limit exceeded
    Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)
    Loading triples0021.nt.gz
    Error 500: GC overhead limit exceeded
    Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)

This kept going until triples0042.nt.gz, where it hung for hours. Meanwhile, on the server, I was still seeing parser warnings, but also messages like:

    17:01:26 WARN SPARQL_REST$HttpActionREST :: Transaction still active in endWriter - no commit or abort seen (forced abort)
    17:01:26 WARN Fuseki :: [33] RC = 500 : GC overhead limit exceeded
    java.lang.OutOfMemoryError: GC overhead limit exceeded

When I finally killed it (with Ctrl-C), I got several stack traces in the stdout log. They appeared to indicate a bad state, so I've saved them and put them up at: http://pastebin.com/yar5Pq85

While OOM is very hard to deal with, I'm still surprised to see it hit this way, so I thought you might be interested to see it.

Regards,
Paul Gearon
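
P.S. For anyone reproducing this: the loop above keeps posting files after the server has started returning 500s. Below is a minimal sketch of a variant that stops at the first failed upload. It is not what I ran. It relies on curl's --fail option, which makes curl exit non-zero on an HTTP error response. The names upload_one, load_all and UPLOAD_CMD are hypothetical, introduced only for illustration (UPLOAD_CMD lets the uploader be swapped out for a dry run), and the endpoint is the same one used in the loop above.

```shell
# Hypothetical helper: stream one gzipped file to the endpoint.
# --fail makes curl return a non-zero status on HTTP errors such as 500.
# The pipeline's exit status is curl's (the last command in the pipe).
upload_one() {
  zcat "$1" | curl --silent --show-error --fail -X POST \
    -H "Content-Type: text/turtle" --upload-file - "$2"
}

# Hypothetical driver: load_all ENDPOINT FILE...
# Uploads files in order and stops at the first one that fails.
# UPLOAD_CMD (an assumption, not a curl or Fuseki feature) overrides the uploader.
load_all() {
  endpoint=$1
  shift
  for f in "$@"; do
    echo "Loading $f"
    if ! "${UPLOAD_CMD:-upload_one}" "$f" "$endpoint"; then
      echo "Upload failed at $f, stopping" >&2
      return 1
    fi
  done
}
```

It would be called from the baseKB directory as: load_all "http://localhost:3030/dataset/data?default" *.nt.gz. Stopping early does nothing about the memory problem itself, but it leaves the server in the state it was in when the first error occurred, rather than after twenty more failed requests.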