Hi Paul,

Thanks for the report. This is a known issue in Fuseki (see JENA-309 [1]),
and I have plans to work on it soon. I'm also a little surprised that your
second attempt, after breaking the data into chunks, failed; I'll take a
look at that.
I am also working on a related issue (JENA-330 [2]) that will eliminate
limits on SPARQL Update requests. I hope to have that checked into the
trunk soon.

-Stephen

[1] https://issues.apache.org/jira/browse/JENA-309
[2] https://issues.apache.org/jira/browse/JENA-330

On Fri, Nov 2, 2012 at 5:24 PM, Paul Gearon <gea...@ieee.org> wrote:
> This is probably pushing Jena beyond its design limits, but I thought I'd
> report on it anyway.
>
> I needed to test some things with large data sets, so I tried to load the
> data from http://basekb.com/
>
> Once extracted from the tar.gz file, it creates a directory called baseKB
> filled with 1024 gzipped nt files.
>
> On my first attempt, I grabbed a fresh copy of Fuseki 0.2.5 and started it
> with TDB storage. I didn't want to individually load 1024 files from the
> control panel, so I used zcat to dump everything into one file and tried
> loading from the GUI. This failed in short order with RIOT complaining of
> memory:
>
> 13:24:31 WARN  Fuseki :: [1] RC = 500 : Java heap space
> java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOfRange(Arrays.java:2694)
>     at java.lang.String.<init>(String.java:234)
>     at java.lang.StringBuilder.toString(StringBuilder.java:405)
>     at org.openjena.riot.tokens.TokenizerText.readIRI(TokenizerText.java:476)
>     ...etc...
>
> I'm wondering if RIOT really needed to run out of memory?
>
> Anyway, I went back to the individual files. That meant using a non-GUI
> approach. I wasn't sure about using a media type for nt, but N-Triples is
> compatible with Turtle, so I used text/turtle.
>
> I threw away the DB directory and started again. This time I tried to load
> the files with the following bash:
>
> for i in *.nt.gz; do
>   echo "Loading $i"
>   zcat $i | curl -X POST -H "Content-Type: text/turtle" --upload-file - \
>     "http://localhost:3030/dataset/data?default"
> done
>
> This started reasonably well.
> A number of warnings showed up on the server side, due to bad language
> tags and invalid IRIs, but it kept going. However, on the 20th file I
> started seeing these:
>
> Loading triples0000.nt.gz
> Loading triples0001.nt.gz
> Loading triples0002.nt.gz
> Loading triples0003.nt.gz
> Loading triples0004.nt.gz
> Loading triples0005.nt.gz
> Loading triples0006.nt.gz
> Loading triples0007.nt.gz
> Loading triples0008.nt.gz
> Loading triples0009.nt.gz
> Loading triples0010.nt.gz
> Loading triples0011.nt.gz
> Loading triples0012.nt.gz
> Loading triples0013.nt.gz
> Loading triples0014.nt.gz
> Loading triples0015.nt.gz
> Loading triples0016.nt.gz
> Loading triples0017.nt.gz
> Loading triples0018.nt.gz
> Loading triples0019.nt.gz
> Error 500: GC overhead limit exceeded
>
> Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)
> Loading triples0020.nt.gz
> Error 500: GC overhead limit exceeded
>
> Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)
> Loading triples0021.nt.gz
> Error 500: GC overhead limit exceeded
>
> Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)
>
> This kept going until triples0042.nt.gz, where it hung for hours.
>
> Meanwhile, on the server, I was still seeing parser warnings, but also
> messages like:
>
> 17:01:26 WARN  SPARQL_REST$HttpActionREST :: Transaction still active in
> endWriter - no commit or abort seen (forced abort)
> 17:01:26 WARN  Fuseki :: [33] RC = 500 : GC overhead limit exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> When I finally killed it (with ctrl-C), I got several stack traces in the
> stdout log. They appeared to indicate a bad state, so I've saved them and
> put them up at: http://pastebin.com/yar5Pq85
>
> While OOM is very hard to deal with, I'm still surprised to see it hit
> this way, so I thought you might be interested to see it.
>
> Regards,
> Paul Gearon
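For reference, a self-contained sketch of the per-file upload loop from the report, wrapped in a function for reuse. Note that curl's uppercase -X sets the HTTP request method, while lowercase -x configures a proxy, so a literal `-x POST` would never force POST. The endpoint URL matches the report's local Fuseki instance; adjust the dataset name for your own setup.

```shell
#!/usr/bin/env bash
# Sketch of a per-file loader for Fuseki's SPARQL Graph Store endpoint,
# based on the loop in the report above. Assumptions: a Fuseki instance
# on localhost:3030 with a dataset named "dataset", and the baseKB
# *.nt.gz files in the current directory.

load_basekb() {
  local endpoint="http://localhost:3030/dataset/data?default"
  local f
  for f in *.nt.gz; do
    # With no matching files, the glob stays literal; bail out cleanly.
    [ -e "$f" ] || { echo "no .nt.gz files found" >&2; return 1; }
    echo "Loading $f"
    # N-Triples is a syntactic subset of Turtle, so text/turtle is a safe
    # media type. --data-binary @- streams stdin as the POST body.
    zcat "$f" | curl -sS -X POST -H "Content-Type: text/turtle" \
      --data-binary @- "$endpoint" || return 1
  done
}

# Uncomment to run against a local Fuseki instance:
# load_basekb
```

Posting one file per request, as here, at least bounds each parse; the report suggests the server still accumulates memory across requests, which is the behaviour tracked in JENA-309.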