10M total I hope :-)

Yes, that's the total for this experiment. Would you say that is getting to
the upper limit of what's possible?

Nowhere near.

32 bit isn't great for scaling, but it just gets slower - it doesn't break.

64 bit and 1 billion triples is possible if the queries are simple (no data mining in SPARQL over 1B triples just yet!). Everything works - but too slowly.


Is this on a 32 bit machine or a 64 bit machine? Also, which JVM is it?

32 bit machine and standard 1.6 JVM.

What does the data look like?

Pretty standard RDF/XML, ranging in size from 50 to 400 lines of XML. Here's
one example:


Looks OK - was just checking it's not full of very, very large literals.

<rdf:RDF

</rdf:RDF>


        fResourceDataset.getLock().enterCriticalSection(Lock.WRITE);
        try {
            Model model = fResourceDataset.getNamedModel(resourceURI);
            model.read(instream, null);
            //model.close();
        } finally {
            fResourceDataset.getLock().leaveCriticalSection();
        }
        instream.close();

After calling this code about 2-3 thousand times, it starts to run much
slower, and then eventually I get an exception like this:

        Exception in thread "pool-3-thread-43" java.lang.OutOfMemoryError: Java heap space

Could you provide a complete minimal example, please? There are some
details, like how fResourceDataset is set up, that might make a difference.
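
Even a skeleton along these lines would help (a sketch only - the directory
name, loop count and generated RDF/XML below are invented for illustration,
and it leaves out the thread pool your thread name suggests):

    import java.io.ByteArrayInputStream;
    import java.io.File;
    import java.io.InputStream;

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.shared.Lock;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class LoadLoop {
        public static void main(String[] args) throws Exception {
            // Empty directory each run, matching the original setup.
            new File("DB-test").mkdirs();
            Dataset dataset = TDBFactory.createDataset("DB-test");

            for (int i = 0; i < 3000; i++) {
                String resourceURI = "http://example.org/resource/" + i;

                // Stand-in for the real documents (50-400 lines of RDF/XML).
                String rdfxml =
                    "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'"
                    + " xmlns:ex='http://example.org/ns#'>"
                    + "<rdf:Description rdf:about='" + resourceURI + "'>"
                    + "<ex:index>" + i + "</ex:index>"
                    + "</rdf:Description></rdf:RDF>";
                InputStream instream =
                    new ByteArrayInputStream(rdfxml.getBytes("UTF-8"));

                // Same locking pattern as the code above.
                dataset.getLock().enterCriticalSection(Lock.WRITE);
                try {
                    Model model = dataset.getNamedModel(resourceURI);
                    model.read(instream, null);
                } finally {
                    dataset.getLock().leaveCriticalSection();
                }
                instream.close();
            }
        }
    }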

It might be hard to get a simple example.

fResourceDataset is created like this:

     TDBFactory.createDataset(dirName);

I remove the directory between runs, so it starts with an empty dataset.

I also have this initialization in my program:

       static {
             // Configure Jena TDB so that the default data graph in SPARQL
             // queries will be the union of all named graphs.
             // Each resource added to the index will be stored in a
             // separate TDB data graph.
             // The actual default (hidden) data graph will be used to
             // store configuration information for the index.
             TDB.getContext().set(TDB.symUnionDefaultGraph, true);

For updates you're better off keeping things simpler, but it should not matter.

             TDB.setOptimizerWarningFlag(false); // TODO do we need to provide a BGP optimizer?

No need for updates.

       }

Could any of this be causing problems?

That looks good to me.
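
(For what it's worth, the practical effect of symUnionDefaultGraph is that a
query with no GRAPH clause sees the union of all the named graphs. A sketch,
reusing fResourceDataset with a made-up query:)

    // With TDB.symUnionDefaultGraph set, this default-graph pattern matches
    // triples from every named graph in the dataset.
    String queryString = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";
    fResourceDataset.getLock().enterCriticalSection(Lock.READ);
    try {
        QueryExecution qexec =
            QueryExecutionFactory.create(queryString, fResourceDataset);
        try {
            ResultSetFormatter.out(qexec.execSelect());
        } finally {
            qexec.close();
        }
    } finally {
        fResourceDataset.getLock().leaveCriticalSection();
    }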



The stack trace might be useful as well, although it's not proof of exactly
where the memory is being used.

It might make more sense for me to try to track this down further myself.
If you can just confirm that you don't see anything wrong with how I'm
using Jena, I'll take it from there.

OK

RDF/XML parsing is expensive - N-Triples is fastest.
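
(If the data can arrive as N-Triples instead, only the language argument to
read changes; a sketch against the loop above, assuming instream now carries
N-Triples:)

    // The third argument selects the parser; "N-TRIPLE" skips the XML
    // machinery entirely, which makes each load noticeably cheaper.
    model.read(instream, null, "N-TRIPLE");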

Is the difference really large? Are there any performance numbers available
that show what load speeds can be expected from Jena?

It's really, really difficult to give useful, honest performance numbers. Hardware matters: portables have slow disks, and 64 bit is better than 32.

But a good workflow, if you're getting data from elsewhere, is to check it for parse errors and bad URIs or literals, then load it. Validation can be done by converting RDF/XML to N-Triples and then loading the N-Triples.
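
(A check-and-convert pass is only a few lines of plain Jena; a sketch, with
made-up file names:)

    Model m = ModelFactory.createDefaultModel();
    InputStream in = new FileInputStream("data.rdf");    // hypothetical input
    m.read(in, null);    // parsing RDF/XML reports syntax errors,
    in.close();          // bad URIs and bad literals up front
    OutputStream out = new FileOutputStream("data.nt");  // hypothetical output
    m.write(out, "N-TRIPLE");    // checked data, ready for the bulk loader
    out.close();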

It's in the bulk loader that N-Triples gives a big advantage (2x or more).

        Andy




Thanks a lot for your help!

Frank.
