> > I tried increasing the amount of memory, but that just increased the number
> > of calls that succeed (e.g., 10000 vs 2000) before getting the exception.
>
> What sizes of heap are you using?
I've been experimenting with various heap sizes, and the bigger I make the
heap, the longer the test runs before crashing. With the default (small) heap
size (-Xmx64m) the test program fails at about 5800 graphs. If I bump it all
the way up to -Xmx1200m, like you did, I suspect I will also be able to run
the test to completion (100,000 graphs), but it takes a very long time (more
than 2 hours on my machine). I'm guessing it also runs much faster for you?

Extrapolating from what you're saying, it looks like I would need a heap of 6G
or so to hit my original target of 500,000 graphs (about 50M triples total).
Does that sound right? That is, do I really need such a huge heap?

Thanks,
Frank.

Andy Seaborne <[email protected]> wrote on 03/08/2011 09:23:50 AM:

> (Frank sent me the detached file)
>
> Frank,
>
> I'm on a 64 bit machine, but I'm setting direct mode and limiting the
> Java heap size with -Xmx.
>
> With a heap of 1200M, Java reports 1066M max memory, and the test runs.
> With a heap of 500M, Java reports 444M max memory, and the test stops at
> 11800.
>
> Things will be a little different for 32 bit but should be approximately
> the same. TDB is doing the same things.
>
> Tweaking the block cache sizes (sorry, magic needed) down to 5000 (read,
> default 10000) and 1000 (write, default 2000), it runs at 500M, but
> slower.
>
> There are quite a few files for named graphs, so small changes in cache
> size get multiplied (x12, I think).
>
> > I tried increasing the amount of memory, but that just increased the
> > number of calls that succeed (e.g., 10000 vs 2000) before getting the
> > exception.
>
> What sizes of heap are you using?
>
>         Andy
>
> On 07/03/11 18:34, Frank Budinsky wrote:
> > Hi Andy,
> >
> > I created a simple standalone test program that roughly simulates what
> > my application is doing, and it also crashes with the same
> > OutOfMemoryError exception. I've attached it here. Would it be possible
> > for you to give it a try?
> >
> > (See attached file: TDBOutOfMemoryTest.java)
> >
> > Just change TDB_DIR to some new empty database location and run. It
> > gets the OutOfMemoryError at around 5800 graphs when I run it with
> > default VM params.
> >
> > Thanks,
> > Frank.
> >
> > Andy Seaborne <[email protected]> wrote on 03/02/2011
> > 09:38:51 AM:
> >
> > > Hi Frank,
> > >
> > > On 28/02/11 14:48, Frank Budinsky wrote:
> > > >
> > > > Hi Andy,
> > > >
> > > > I did some further analysis of my OutOfMemoryError problem, and
> > > > this is what I've discovered. The problem seems to be that there is
> > > > one instance of class NodeTupleTableConcrete that contains an
> > > > ever-growing set of tuples, which eventually uses up all the
> > > > available heap space and then crashes.
> > > >
> > > > To be more specific, this field in class TupleTable:
> > > >
> > > >     private final TupleIndex[] indexes ;
> > > >
> > > > seems to contain 6 continually growing TupleIndexRecord instances
> > > > (BPlusTrees). From my measurements, this seems to eat up
> > > > approximately 1G of heap for every 1M triples in the Dataset (i.e.,
> > > > about 1K per datagraph). So, to load my 100K datagraphs (~10M total
> > > > triples) it would seem to need 10G of heap space.
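Aside, in case it helps picture where the heap is going: the attached
TDBOutOfMemoryTest.java is the real test, and the loop below is only a rough
sketch of the same pattern. The URIs, the property, and the literal values are
invented, and the ~100 triples per graph is just the average implied by 10M
triples over 100,000 graphs; the real numbers are in the attached file.

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.rdf.model.ResourceFactory;
import com.hp.hpl.jena.tdb.TDBFactory;

public class TDBLoadSketch {
    // Change to a new, empty database location before running.
    static final String TDB_DIR = "/tmp/tdb-test";

    public static void main(String[] args) {
        // File-backed TDB dataset (i.e., a dataset with a location).
        Dataset dataset = TDBFactory.createDataset(TDB_DIR);
        Property p = ResourceFactory.createProperty("http://example.org/p");
        for (int g = 0; g < 100000; g++) {
            // One named model per datagraph.
            Model model = dataset.getNamedModel("http://example.org/graph/" + g);
            for (int t = 0; t < 100; t++) {
                Resource s = model.createResource("http://example.org/graph/" + g + "/s" + t);
                // The real literal values run to a few hundred characters.
                model.add(s, p, "some moderately long literal value " + t);
            }
            if (g % 1000 == 0)
                System.out.println("loaded " + g + " graphs");
        }
        dataset.close();
    }
}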
> > >
> > > There are 6 indexes for named graphs (see the files GSPO etc). TDB
> > > uses total indexing, which puts a lot of work at load time but means
> > > any lookup needed is always done with an index scan. The code can run
> > > with fewer indexes - the minimum is one - but that is not exposed in
> > > the configuration.
> > >
> > > Each index holds quads (4 NodeIds; a NodeId is 64 bits on disk). As
> > > the index grows, the data goes to disk. There is a finite LRU cache
> > > in front of each index.
> > >
> > > Does your dataset have a location? If it has no location, it's all
> > > in-memory with a RAM-disk-like structure. This is for small-scale
> > > testing only - it really does read and write blocks out of the RAM
> > > disk by copy to give strict disk-like semantics.
> > >
> > > There is also a NodeTable mapping between NodeId and Node (Jena's
> > > graph-level RDF Term class). This has a cache in front of it.
> > > The long-ish literals may be the problem. The node table cache is
> > > fixed-number, not bounded by size.
> > >
> > > The size of the caches is controlled by:
> > >
> > >     SystemTDB.Node2NodeIdCacheSize
> > >     SystemTDB.NodeId2NodeCacheSize
> > >
> > > These are not easy to control, but either (1) get the source code and
> > > alter the default values, or (2) see the code in SystemTDB that uses
> > > a properties file.
> > >
> > > If you can send me a copy of the data, I can try loading it here.
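On the location question: yes, the test has a location - TDB_DIR points at a
directory on disk. For completeness, a rough sketch of the two forms; this is
from memory of the TDB factory API, so treat the exact in-memory call as
approximate:

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.tdb.TDBFactory;

public class LocationSketch {
    public static void main(String[] args) {
        // With a location: the indexes and the node table live in files under
        // this directory, with the block and node caches (the SystemTDB sizes
        // mentioned above) in front of them.
        Dataset onDisk = TDBFactory.createDataset("/tmp/tdb-test");

        // Without a location: everything sits in the RAM-disk-like structure,
        // intended for small-scale testing only.
        Dataset inMemory = TDBFactory.createDataset();

        onDisk.close();
        inMemory.close();
    }
}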
> > >
> > > > Does this make sense? How is it supposed to work? Shouldn't the
> > > > triples from previously loaded named graphs be eligible for GC when
> > > > I'm loading the next named graph? Could it be that I'm holding onto
> > > > something that's preventing GC in the TupleTable?
> > > >
> > > > Also, after looking more carefully at the resources being indexed,
> > > > I noticed that many of them do have relatively large literals (100s
> > > > of characters). I also noticed that when using Fuseki to load the
> > > > resources I get lots of warning messages like this on the console:
> > > >
> > > > Lexical form 'We are currently doing
> > > > this:<br></br><br></br>workspaceConnection.replaceComponents
> > > > (replaceComponents, replaceSource, falses, false,
> > > > monitor);<br></br><br></br>the new way of doing it would be
> > > > something like:<br></br><br></br><br></br>
> > > > ArrayList<IComponentOp> replaceOps = new
> > > > ArrayList<IComponentOp>();<br></br>
> > > > for (Iterator iComponents = components.iterator();
> > > > iComponents.hasNext();) {<br></br>
> > > > IComponentHandle componentHandle = (IComponentHandle)
> > > > iComponents.next();<br></br>
> > > > replaceOps.add(promotionTargetConnection.componentOpFactory()
> > > > .replaceComponent(componentHandle,<br></br>
> > > > buildWorkspaceConnection, false));<br></br> }
> > > > <br></br><br></br>
> > > > promotionTargetConnection.applyComponentOperations(replaceOps,
> > > > monitor);' not valid for datatype
> > > > http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral
> > > >
> > > > Could this be part of the problem?
> > >
> > > No - it's a different issue. This is something coming from the parser.
> > >
> > > RDF XMLLiterals have special rules - they must follow exclusive
> > > canonical XML, which means, amongst a lot of other things, they have
> > > to be a single XML node. The rules for exclusive Canonical XML are
> > > really quite strict (e.g. attributes in alphabetical order).
> > >
> > > http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral
> > >
> > > If you want to store XML or HTML fragments, you can't use RDF
> > > XMLLiterals very easily - you have to mangle them to conform to the
> > > rules. I suggest storing them either as plain strings or with a
> > > datatype of your own.
> > >
> > > You can run the parser on its own using
> > > "riotcmd.riot --validate FILE ..."
> > >
> > >         Andy
> > >
> > > > Thanks,
> > > > Frank.
> >
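Re the XMLLiteral warnings: a rough sketch of the plain-string / own-datatype
route Andy suggests above. The property and datatype URIs are invented, and
this is from memory of the Jena 2.x Model API:

import com.hp.hpl.jena.rdf.model.Literal;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;

public class FragmentLiteralSketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        Resource r = model.createResource("http://example.org/resource/1");
        // The HTML-ish fragment is stored verbatim - no canonicalization needed.
        String html = "We are currently doing this:<br></br>...";

        // Option 1: a plain string literal.
        Literal asString = model.createLiteral(html);
        model.add(r, model.createProperty("http://example.org/description"), asString);

        // Option 2: a datatype URI of our own. An unregistered datatype is just
        // a tag on the literal; nothing validates the lexical form.
        Literal asOwnType = model.createTypedLiteral(html,
                "http://example.org/datatype/htmlFragment");
        model.add(r, model.createProperty("http://example.org/descriptionHtml"), asOwnType);
    }
}

Either way the fragment is stored as-is, so there is nothing to canonicalize.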
