Hi Andy,

I never actually tried to run it past 100,000 graphs, so I don't know what
will happen.

I just finished running it with -Xmx1200m, and as you thought it would, it
did run (with 100,000 graphs) to completion. It took just under 4 hours to
run on my Thinkpad T61 laptop running Windows XP with 3G of RAM. I noticed
that it was loading about 700 graphs/minute at the start, around 550
graphs/minute through most of the run (up till about 90,000 graphs), but
then got significantly slower towards the end (i.e., only about 55
graphs/minute for the last 1000). You mentioned that your JVM was 1.6 G. Is
that something you can configure? I noticed that the total memory of my
java process peaked at about 1.2 G.
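
In case it's useful, the memory numbers above come from nothing fancier
than the standard Runtime calls, along these lines (this is only a sketch
of what I'm doing in my test loop, not code from the attached test):

    // Called every 1000 graphs in my test loop; nothing TDB-specific.
    static void logMemory(long graphsLoaded) {
        Runtime rt = Runtime.getRuntime();
        long usedMB  = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long totalMB = rt.totalMemory() / (1024 * 1024);
        long maxMB   = rt.maxMemory() / (1024 * 1024);   // reflects -Xmx
        System.out.println("graphs=" + graphsLoaded
            + " used=" + usedMB + "M total=" + totalMB + "M max=" + maxMB + "M");
    }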

I'll try running the test again with 500,000 graphs to see if it gets all
the way through. Since it runs so slowly, I'll let it run overnight and see
what happens.

How long does it take to load the 100,000 graphs for you? I assume this
runs much faster on your hardware. I'm wondering if there's a minimum
hardware requirement for running TDB, especially if it's being used to
load tens or hundreds of millions of triples. It would be nice to set
expectations for what kind of hardware is needed for this.

Thanks,
Frank.

Andy Seaborne <[email protected]> wrote on 03/08/2011 11:53:06
AM:

>
> On 08/03/11 15:05, Frank Budinsky wrote:
> >
> >>> I tried increasing the amount of memory, but that just increased the
> >>> number of calls that succeed (e.g., 10000 vs 2000) before getting the
> >>> exception.
> >>
> >> What sizes of heap are you using?
> >
> > I've been experimenting with various heap sizes, noticing that the
> > bigger I make it the longer it runs before crashing. When using the
> > default (small) heap size (-Xmx64m) the test program fails at about
> > 5800 graphs. If I bump it all the way up to -Xmx1200m, like you did, I
> > suspect I will also be able to run the test to completion (100,000
> > graphs), but it takes very long to run (more than 2 hours on my
> > machine). I'm guessing this is also running much faster for you?
> >
> > Extrapolating from what you're saying, it looks like I would need a
> > heap of 6G, or so, to hit my original target of 500,000 graphs (about
> > 50M triples total). Does that sound right? That is, needing to run
> > with such a huge heap?
>
> No - the caches are bounded.  Once they reach steady state, there is no
> further growth and no fixed scale limit.  I've run a trial with 500,000
> with -Xmx1200M and it works for me.  JVM is 1.6G.  The caches were still
> filling up in your test case.
>
> The caches are LRU by slots, which is a bit crude for the node cache as
> nodes vary in size.  Index files have fixed-size units (blocks - they
> are 8 Kbytes).
>
> The default settings are supposed to work for a heap of 1.2G -- it's
> what the scripts set the heap to.
>
> The caches were still filling up in your test case.
>
>    Andy
>
> >
> > Thanks,
> > Frank.
> >
> > Andy Seaborne<[email protected]>  wrote on 03/08/2011
09:23:50
> > AM:
> >
> >>
> >> (Frank sent me the detached file)
> >>
> >> Frank,
> >>
> >> I'm on a 64 bit machine, but I'm setting direct mode and limiting the
> >> Java heap size with -Xmx.
> >>
> >> With a heap of 1200M, java reports 1066M max memory, and the test
> >> runs. With a heap of 500M, java reports 444M max memory, and the test
> >> stops at 11800.
> >>
> >> Things will be a little different for 32 bit but should be
> >> approximately the same.  TDB is doing the same things.
> >>
> >> Tweaking the block cache sizes (sorry, magic needed) down to 5000
> >> (read, default 10000) and 1000 (write, default 2000), it runs at
> >> 500M, but slower.
> >>
> >> There are quite a few files for named graphs so small changes in cache
> >> size get multiplied (x12, I think).
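
(A back-of-envelope check from me on those numbers -- assuming all 12 of
the per-file block caches really do fill to their default sizes, that
alone is already in the neighbourhood of the 1.2G heap we've been using:)

    // Rough estimate only; the counts are the ones quoted above.
    int files      = 12;         // "x12, I think"
    int readSlots  = 10000;      // default read cache, in blocks
    int writeSlots = 2000;       // default write cache, in blocks
    int blockBytes = 8 * 1024;   // 8 Kbyte blocks
    long bytes = (long) files * (readSlots + writeSlots) * blockBytes;
    System.out.println(bytes / (1024 * 1024) + " MB");   // prints 1125 MB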
> >>
> >>> I tried increasing the amount of memory, but that just increased the
> >>> number of calls that succeed (e.g., 10000 vs 2000) before getting the
> >>> exception.
> >>
> >> What sizes of heap are you using?
> >>
> >>     Andy
> >>
> >> On 07/03/11 18:34, Frank Budinsky wrote:
> >>> Hi Andy,
> >>>
> >>> I created a simple standalone test program that roughly simulates
> >>> what my application is doing and it also crashes with the same
> >>> OutOfMemoryError exception. I've attached it here. Would it be
> >>> possible for you to give it a try?
> >>>
> >>> /(See attached file: TDBOutOfMemoryTest.java)/
> >>>
> >>> Just change TDB_DIR to some new empty database location and run. It
> >>> gets the OutOfMemoryError at around 5800 graphs when I run it with
> >>> default VM params.
> >>>
> >>> Thanks,
> >>> Frank.
> >>>
> >>>
> >>> Andy Seaborne<[email protected]>  wrote on 03/02/2011
> >>> 09:38:51 AM:
> >>>
> >>>   >
> >>>   >  Hi Frank,
> >>>   >
> >>>   >  On 28/02/11 14:48, Frank Budinsky wrote:
> >>>   >  >
> >>>   >  >  Hi Andy,
> >>>   >  >
> >>>   >  >  I did some further analysis of my OutOfMemoryError problem,
> >>>   >  >  and this is what I've discovered. The problem seems to be
> >>>   >  >  that there is one instance of class NodeTupleTableConcrete
> >>>   >  >  that contains an ever-growing set of tuples which eventually
> >>>   >  >  uses up all the available heap space and then crashes.
> >>>   >  >
> >>>   >  >  To be more specific, this field in class TupleTable:
> >>>   >  >
> >>>   >  >  private final TupleIndex[] indexes ;
> >>>   >  >
> >>>   >  >  seems to contain 6 continually growing TupleIndexRecord
> >>>   >  >  instances (BPlusTrees). From my measurements, this seems to
> >>>   >  >  eat up approximately 1G of heap for every 1M triples in the
> >>>   >  >  Dataset (i.e., about 1K per datagraph). So, to load my 100K
> >>>   >  >  datagraphs (~10M total triples) it would seem to need 10G of
> >>>   >  >  heap space.
> >>>   >
> >>>   >  There are 6 indexes for named graphs (see the files GSPO etc).
> >>>   >  TDB uses total indexing, which puts a lot of work at load time
> >>>   >  but means any lookup needed is always done with an index scan.
> >>>   >  The code can run with fewer indexes - the minimum is one - but
> >>>   >  that is not exposed in the configuration.
> >>>   >
> >>>   >  Each index holds quads (4 NodeIds, a NodeId is 64 bits on
> >>>   >  disk). As the index grows the data goes to disk. There is a
> >>>   >  finite LRU cache in front of each index.
> >>>   >
> >>>   >  Does your dataset have a location? If it has no location, it's
> >>>   >  all in-memory with a RAM-disk-like structure. This is for
> >>>   >  small-scale testing only - it really does read and write blocks
> >>>   >  out of the RAM disk by copy to give strict disk-like semantics.
> >>>   >
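
(It does have a location -- the attached test does essentially the
following. This is just a sketch: the directory is a placeholder for the
real TDB_DIR value, and the package names are what I believe they are in
the TDB version I'm using:)

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class LocationSketch {
        public static void main(String[] args) {
            // Disk-backed dataset (what my test does, with TDB_DIR as the path).
            Dataset onDisk = TDBFactory.createDataset("/path/to/tdb-db");

            // No location: the RAM-disk-backed dataset described above,
            // intended for small-scale testing only.
            Dataset inMemory = TDBFactory.createDataset();

            onDisk.close();
            inMemory.close();
        }
    }
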
> >>>   >  There is also a NodeTable mapping between NodeId and Node
> >>>   >  (Jena's graph-level RDF Term class). This has a cache in front
> >>>   >  of it. The long-ish literals may be the problem. The node table
> >>>   >  cache is fixed-number, not bounded by size.
> >>>   >
> >>>   >  The sizes of the caches are controlled by:
> >>>   >
> >>>   >  SystemTDB.Node2NodeIdCacheSize
> >>>   >  SystemTDB.NodeId2NodeCacheSize
> >>>   >
> >>>   >  These are not easy to control, but you can either (1) get the
> >>>   >  source code and alter the default values or (2) see the code in
> >>>   >  SystemTDB that uses a properties file.
> >>>   >
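
(Noting for myself -- a minimal sketch of just printing those two values
to see what I'm currently running with. The field names are the ones
named above; I'm assuming SystemTDB lives in com.hp.hpl.jena.tdb.sys in
the version I have, and I haven't tried the properties-file route yet:)

    import com.hp.hpl.jena.tdb.sys.SystemTDB;

    public class CacheSizeCheck {
        public static void main(String[] args) {
            // Compiled-in defaults unless overridden via the source or
            // the properties file mentioned above.
            System.out.println("Node2NodeIdCacheSize = " + SystemTDB.Node2NodeIdCacheSize);
            System.out.println("NodeId2NodeCacheSize = " + SystemTDB.NodeId2NodeCacheSize);
        }
    }
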
> >>>   >  If you can send me a copy of the data, I can try loading it here.
> >>>   >
> >>>   >  >  Does this make sense? How is it supposed to work? Shouldn't
> >>>   >  >  the triples from previously loaded named graphs be eligible
> >>>   >  >  for GC when I'm loading the next named graph? Could it be
> >>>   >  >  that I'm holding onto something that's preventing GC in the
> >>>   >  >  TupleTable?
> >>>   >  >
> >>>   >  >  Also, after looking more carefully at the resources being
> >>>   >  >  indexed, I noticed that many of them do have relatively large
> >>>   >  >  literals (100s of characters). I also noticed that when using
> >>>   >  >  Fuseki to load the resources I get lots of warning messages
> >>>   >  >  like this, on the console:
> >>>   >  >
> >>>   >  >  Lexical form 'We are currently doing
> >>>   >  >  this:<br></br><br></br>workspaceConnection.replaceComponents
> >>>   >  >  (replaceComponents, replaceSource, falses, false,
> >>>   >  >  monitor);<br></br><br></br>the new way of doing it would be
> >>>   >  >  something like:<br></br><br></br><br></br>
> >>>   >  >  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
> >>>   >  >  ArrayList&lt;IComponentOp&gt; replaceOps = new
> >>>   >  >  ArrayList&lt;IComponentOp&gt;();<br></br>
> >>>   >  >  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
> >>>   >  >  for (Iterator iComponents = components.iterator();
> >>>   >  >  iComponents.hasNext();)
> >>>   >  >  {<br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
> >>>   >  >  IComponentHandle componentHandle = (IComponentHandle)
> >>>   >  >  iComponents.next();<br></
> >>>   >  >  br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
> >>>   >  >  replaceOps.add(promotionTargetConnection.componentOpFactory
> >>>   >  >  ().replaceComponent(componentHandle,<br></
> >>>   >  >  br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
> >>>   >  >  buildWorkspaceConnection,
> >>>   >  >  false));<br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }
> >>>   >  >  <br></br><br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
> >>>   >  >  promotionTargetConnection.applyComponentOperations(replaceOps,
> >>>   >  >  monitor);'
> >>>   >  >  not valid for datatype
> >>>   >  >  http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral
> >>>   >  >
> >>>   >  >  Could this be part of the problem?
> >>>   >
> >>>   >  No - it's a different issue. This is something coming from the
> >>>   >  parser.
> >>>   >
> >>>   >  RDF XMLLiterals have special rules - they must follow exclusive
> >>>   >  canonical XML, which means, amongst a lot of other things, they
> >>>   >  have to be a single XML node. The rules for exclusive Canonical
> >>>   >  XML are really quite strict (e.g. attributes in alphabetical
> >>>   >  order).
> >>>   >
> >>>   >  http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral
> >>>   >
> >>>   >  If you want to store XML or HTML fragments, you can't use RDF
> >>>   >  XMLLiterals very easily - you have to mangle them to conform to
> >>>   >  the rules. I suggest storing them either as plain strings or
> >>>   >  inventing your own datatype.
> >>>   >
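
(To make sure I follow -- something like this, where the HTML-ish text
just goes in as a plain (untyped) string literal rather than an
rdf:XMLLiteral? The URIs below are made up for the example:)

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Resource;

    public class PlainStringLiteral {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            Resource r = model.createResource("http://example.org/resource/1");
            String htmlFragment = "We are currently doing this:<br></br>...";
            // Plain string literal: no datatype, so no XMLLiteral canonicalization rules.
            r.addProperty(model.createProperty("http://example.org/ns#comment"),
                          model.createLiteral(htmlFragment));
            model.write(System.out, "N-TRIPLE");
        }
    }
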
> >>>   >  You can run the parser on its own using
> >>>   >  "riotcmd.riot --validate FILE ..."
> >>>   >
> >>>   >
> >>>   >  Andy
> >>>   >
> >>>   >  >
> >>>   >  >  Thanks,
> >>>   >  >  Frank.
> >>>
