> > I tried increasing the amount of memory, but that just increased the number
> > of calls that succeed (e.g., 10000 vs 2000) before getting the exception.
>
> What sizes of heap are you using?
I've been experimenting with various heap sizes, and the bigger I make the
heap, the longer the test runs before crashing. With the default (small) heap
size (-Xmx64m) the test program fails at about 5800 graphs. If I bump it all
the way up to -Xmx1200m, like you did, I suspect I will also be able to run
the test to completion (100,000 graphs), but it takes a very long time (more
than 2 hours on my machine). I'm guessing it also runs much faster for you?

Extrapolating from what you're saying, it looks like I would need a heap of 6G
or so to hit my original target of 500,000 graphs (about 50M triples total).
Does that sound right? That is, do I really need such a huge heap?

Thanks,
Frank.

Andy Seaborne <[email protected]> wrote on 03/08/2011 09:23:50 AM:

> (Frank sent me the detached file)
>
> Frank,
>
> I'm on a 64 bit machine, but I'm setting direct mode and limiting the
> Java heap size with -Xmx.
>
> With a heap of 1200M, Java reports 1066M max memory, and the test runs.
> With a heap of 500M, Java reports 444M max memory, and the test stops at
> 11800.
>
> Things will be a little different for 32 bit but should be approximately
> the same. TDB is doing the same things.
>
> Tweaking the block cache sizes (sorry, magic needed) down to 5000 (read,
> default 10000) and 1000 (write, default 2000), it runs at 500M, but
> slower.
>
> There are quite a few files for named graphs, so small changes in cache
> size get multiplied (x12, I think).
>
> > I tried increasing the amount of memory, but that just increased the
> > number of calls that succeed (e.g., 10000 vs 2000) before getting the
> > exception.
>
> What sizes of heap are you using?
>
>         Andy
>
> On 07/03/11 18:34, Frank Budinsky wrote:
> > Hi Andy,
> >
> > I created a simple standalone test program that roughly simulates what
> > my application is doing, and it also crashes with the same
> > OutOfMemoryError exception. I've attached it here. Would it be possible
> > for you to give it a try?
> >
> > (See attached file: TDBOutOfMemoryTest.java)
> >
> > Just change TDB_DIR to some new empty database location and run. It
> > gets the OutOfMemoryError at around 5800 graphs when I run it with
> > default VM params.
> >
> > Thanks,
> > Frank.
> >
> > Andy Seaborne <[email protected]> wrote on 03/02/2011
> > 09:38:51 AM:
> >
> > > Hi Frank,
> > >
> > > On 28/02/11 14:48, Frank Budinsky wrote:
> > > >
> > > > Hi Andy,
> > > >
> > > > I did some further analysis of my OutOfMemoryError problem, and
> > > > this is what I've discovered. The problem seems to be that there is
> > > > one instance of class NodeTupleTableConcrete that contains an
> > > > ever-growing set of tuples, which eventually uses up all the
> > > > available heap space and then crashes.
> > > >
> > > > To be more specific, this field in class TupleTable:
> > > >
> > > >     private final TupleIndex[] indexes ;
> > > >
> > > > seems to contain 6 continually growing TupleIndexRecord instances
> > > > (BPlusTrees). From my measurements, this seems to eat up
> > > > approximately 1G of heap for every 1M triples in the Dataset (i.e.,
> > > > about 1K per datagraph). So, to load my 100K datagraphs (~10M total
> > > > triples) it would seem to need 10G of heap space.
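Aside, in case it helps picture where the heap is going: the attached
TDBOutOfMemoryTest.java is the real test, and the loop below is only a rough
sketch of the same pattern. The URIs, the property, and the literal values are
invented, and the ~100 triples per graph is just the average implied by 10M
triples over 100,000 graphs; the real numbers are in the attached file.

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.rdf.model.ResourceFactory;
import com.hp.hpl.jena.tdb.TDBFactory;

public class TDBLoadSketch {
    // Change to a new, empty database location before running.
    static final String TDB_DIR = "/tmp/tdb-test";

    public static void main(String[] args) {
        // File-backed TDB dataset (i.e., a dataset with a location).
        Dataset dataset = TDBFactory.createDataset(TDB_DIR);
        Property p = ResourceFactory.createProperty("http://example.org/p");
        for (int g = 0; g < 100000; g++) {
            // One named model per datagraph.
            Model model = dataset.getNamedModel("http://example.org/graph/" + g);
            for (int t = 0; t < 100; t++) {
                Resource s = model.createResource("http://example.org/graph/" + g + "/s" + t);
                // The real literal values run to a few hundred characters.
                model.add(s, p, "some moderately long literal value " + t);
            }
            if (g % 1000 == 0)
                System.out.println("loaded " + g + " graphs");
        }
        dataset.close();
    }
}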
> > >
> > > There are 6 indexes for named graphs (see the files GSPO etc). TDB
> > > uses total indexing, which puts a lot of work at load time but means
> > > any lookup needed is always done with an index scan. The code can run
> > > with fewer indexes - the minimum is one - but that is not exposed in
> > > the configuration.
> > >
> > > Each index holds quads (4 NodeIds; a NodeId is 64 bits on disk). As
> > > the index grows, the data goes to disk. There is a finite LRU cache
> > > in front of each index.
> > >
> > > Does your dataset have a location? If it has no location, it's all
> > > in-memory with a RAM-disk-like structure. This is for small-scale
> > > testing only - it really does read and write blocks out of the RAM
> > > disk by copy to give strict disk-like semantics.
> > >
> > > There is also a NodeTable mapping between NodeId and Node (Jena's
> > > graph-level RDF Term class). This has a cache in front of it.
> > > The long-ish literals may be the problem. The node table cache is
> > > fixed-number, not bounded by size.
> > >
> > > The size of the caches is controlled by:
> > >
> > >     SystemTDB.Node2NodeIdCacheSize
> > >     SystemTDB.NodeId2NodeCacheSize
> > >
> > > These are not easy to control, but either (1) get the source code and
> > > alter the default values, or (2) see the code in SystemTDB that uses
> > > a properties file.
> > >
> > > If you can send me a copy of the data, I can try loading it here.
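On the location question: yes, the test has a location - TDB_DIR points at a
directory on disk. For completeness, a rough sketch of the two forms; this is
from memory of the TDB factory API, so treat the exact in-memory call as
approximate:

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.tdb.TDBFactory;

public class LocationSketch {
    public static void main(String[] args) {
        // With a location: the indexes and the node table live in files under
        // this directory, with the block and node caches (the SystemTDB sizes
        // mentioned above) in front of them.
        Dataset onDisk = TDBFactory.createDataset("/tmp/tdb-test");

        // Without a location: everything sits in the RAM-disk-like structure,
        // intended for small-scale testing only.
        Dataset inMemory = TDBFactory.createDataset();

        onDisk.close();
        inMemory.close();
    }
}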
> > >
> > > > Does this make sense? How is it supposed to work? Shouldn't the
> > > > triples from previously loaded named graphs be eligible for GC when
> > > > I'm loading the next named graph? Could it be that I'm holding onto
> > > > something that's preventing GC in the TupleTable?
> > > >
> > > > Also, after looking more carefully at the resources being indexed,
> > > > I noticed that many of them do have relatively large literals (100s
> > > > of characters). I also noticed that when using Fuseki to load the
> > > > resources I get lots of warning messages like this on the console:
> > > >
> > > > Lexical form 'We are currently doing
> > > > this:<br></br><br></br>workspaceConnection.replaceComponents
> > > > (replaceComponents, replaceSource, falses, false,
> > > > monitor);<br></br><br></br>the new way of doing it would be
> > > > something like:<br></br><br></br><br></br>
> > > > ArrayList<IComponentOp> replaceOps = new
> > > > ArrayList<IComponentOp>();<br></br>
> > > > for (Iterator iComponents = components.iterator();
> > > > iComponents.hasNext();) {<br></br>
> > > > IComponentHandle componentHandle = (IComponentHandle)
> > > > iComponents.next();<br></br>
> > > > replaceOps.add(promotionTargetConnection.componentOpFactory()
> > > > .replaceComponent(componentHandle,<br></br>
> > > > buildWorkspaceConnection, false));<br></br> }
> > > > <br></br><br></br>
> > > > promotionTargetConnection.applyComponentOperations(replaceOps,
> > > > monitor);' not valid for datatype
> > > > http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral
> > > >
> > > > Could this be part of the problem?
> > >
> > > No - it's a different issue. This is something coming from the parser.
> > >
> > > RDF XMLLiterals have special rules - they must follow exclusive
> > > canonical XML, which means, amongst a lot of other things, they have
> > > to be a single XML node. The rules for exclusive Canonical XML are
> > > really quite strict (e.g. attributes in alphabetical order).
> > >
> > > http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral
> > >
> > > If you want to store XML or HTML fragments, you can't use RDF
> > > XMLLiterals very easily - you have to mangle them to conform to the
> > > rules. I suggest storing them either as plain strings or with a
> > > datatype of your own.
> > >
> > > You can run the parser on its own using
> > > "riotcmd.riot --validate FILE ..."
> > >
> > >         Andy
> > >
> > > > Thanks,
> > > > Frank.
> >
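Re the XMLLiteral warnings: a rough sketch of the plain-string / own-datatype
route Andy suggests above. The property and datatype URIs are invented, and
this is from memory of the Jena 2.x Model API:

import com.hp.hpl.jena.rdf.model.Literal;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;

public class FragmentLiteralSketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        Resource r = model.createResource("http://example.org/resource/1");
        // The HTML-ish fragment is stored verbatim - no canonicalization needed.
        String html = "We are currently doing this:<br></br>...";

        // Option 1: a plain string literal.
        Literal asString = model.createLiteral(html);
        model.add(r, model.createProperty("http://example.org/description"), asString);

        // Option 2: a datatype URI of our own. An unregistered datatype is just
        // a tag on the literal; nothing validates the lexical form.
        Literal asOwnType = model.createTypedLiteral(html,
                "http://example.org/datatype/htmlFragment");
        model.add(r, model.createProperty("http://example.org/descriptionHtml"), asOwnType);
    }
}

Either way the fragment is stored as-is, so there is nothing to canonicalize.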
