Hi Paul,

Thanks for the report.  This is a known issue in Fuseki (see JENA-309
[1]), and I plan to work on it soon.  I'm also a little surprised that
your second attempt, after breaking the data into chunks, failed; I'll
take a look at that.

I am also working on a related issue (JENA-330 [2]) that will
eliminate limits on SPARQL Update queries.  I hope to have that
checked into the trunk soon.

-Stephen

[1] https://issues.apache.org/jira/browse/JENA-309
[2] https://issues.apache.org/jira/browse/JENA-330

On Fri, Nov 2, 2012 at 5:24 PM, Paul Gearon <gea...@ieee.org> wrote:
> This is probably pushing Jena beyond its design limits, but I thought I'd
> report it anyway.
>
> I needed to test some things with large data sets, so I tried to load the
> data from http://basekb.com/
>
> Extracting the tar.gz file produces a directory called baseKB
> containing 1024 gzipped N-Triples (.nt.gz) files.
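>
> For anyone following along, the unpacking was roughly this (the archive
> filename here is illustrative, not the exact name):
>
> tar xzf basekb.tar.gz
> ls baseKB/*.nt.gz | wc -l    # expect 1024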
>
> On my first attempt, I grabbed a fresh copy of Fuseki 0.2.5 and started it
> with TDB storage. I didn't want to load 1024 files individually from the
> control panel, so I used zcat to concatenate everything into one file and
> tried loading that from the GUI. This failed in short order with RIOT
> running out of memory:
>
> 13:24:31 WARN  Fuseki               :: [1] RC = 500 : Java heap space
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOfRange(Arrays.java:2694)
> at java.lang.String.<init>(String.java:234)
> at java.lang.StringBuilder.toString(StringBuilder.java:405)
> at org.openjena.riot.tokens.TokenizerText.readIRI(TokenizerText.java:476)
> ...etc...
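>
> For reference, the concatenation step was just something like the
> following (the output filename is made up here):
>
> zcat baseKB/*.nt.gz > basekb-all.nt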
>
> I'm wondering whether RIOT really needed to run out of memory here.
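>
> One thing I haven't tried is simply giving the server a bigger heap. I
> believe the fuseki-server script respects a JVM_ARGS environment
> variable, though I haven't verified that, and 4G below is an arbitrary
> figure:
>
> export JVM_ARGS="-Xmx4G"
> ./fuseki-server --update --loc=DB /dataset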
>
> Anyway, I went back to loading the individual files, which meant using a
> non-GUI approach. I wasn't sure which media type to use for N-Triples, but
> N-Triples is compatible with Turtle, so I used text/turtle.
>
> I threw away the DB directory and started again. This time I tried to load
> the files with the following bash:
>
> # POST each file into the default graph via the SPARQL Graph Store protocol
> for i in *.nt.gz; do
>   echo "Loading $i"
>   zcat "$i" | curl -X POST -H "Content-Type: text/turtle" \
>     --upload-file - "http://localhost:3030/dataset/data?default"
> done
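>
> In hindsight, for a bulk load of this size the offline TDB loader is
> probably a better fit than HTTP. Something like the following, run
> while Fuseki is stopped (assuming the TDB command-line tools are on
> the PATH; I haven't actually tried this yet):
>
> tdbloader --loc=DB baseKB/*.nt.gz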
>
> This started reasonably well. A number of warnings showed up on the server
> side, due to bad language tags and invalid IRIs, but it kept going.
> However, on the 20th file I started seeing these:
> Loading triples0000.nt.gz
> Loading triples0001.nt.gz
> Loading triples0002.nt.gz
> Loading triples0003.nt.gz
> Loading triples0004.nt.gz
> Loading triples0005.nt.gz
> Loading triples0006.nt.gz
> Loading triples0007.nt.gz
> Loading triples0008.nt.gz
> Loading triples0009.nt.gz
> Loading triples0010.nt.gz
> Loading triples0011.nt.gz
> Loading triples0012.nt.gz
> Loading triples0013.nt.gz
> Loading triples0014.nt.gz
> Loading triples0015.nt.gz
> Loading triples0016.nt.gz
> Loading triples0017.nt.gz
> Loading triples0018.nt.gz
> Loading triples0019.nt.gz
> Error 500: GC overhead limit exceeded
>
>
> Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)
> Loading triples0020.nt.gz
> Error 500: GC overhead limit exceeded
>
>
> Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)
> Loading triples0021.nt.gz
> Error 500: GC overhead limit exceeded
>
>
> Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)
>
> This kept going until triples0042.nt.gz, where it hung for hours.
>
> Meanwhile, on the server, I was still seeing parser warnings, but also
> messages like:
> 17:01:26 WARN  SPARQL_REST$HttpActionREST :: Transaction still active in
> endWriter - no commit or abort seen (forced abort)
> 17:01:26 WARN  Fuseki               :: [33] RC = 500 : GC overhead limit
> exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> When I finally killed it (with Ctrl-C), I got several stack traces in the
> stdout log. They appeared to indicate a bad state, so I saved them and put
> them up at http://pastebin.com/yar5Pq85
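>
> If it would help, next time it hangs I can grab a proper thread dump
> instead of killing it, using the standard JDK tools; something like:
>
> jps -l                              # find the Fuseki server's pid
> jstack <pid> > fuseki-threads.txt   # dump all thread stacks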
>
> While OOM conditions are very hard to deal with gracefully, I'm still
> surprised to see one hit this way, so I thought you might be interested.
>
> Regards,
> Paul Gearon
