Hi Andy and all,
I finally managed to get a relatively powerful machine set up, and I
reran the test program I sent you, but unfortunately it still runs
orders of magnitude slower than the numbers you got when you tried it.
The hardware I used this time is similar to what you are using; the
only significant difference is that it's running Windows 7 instead of
Ubuntu. I know Linux is somewhat faster than Windows, but I don't
expect we can blame Microsoft for a jump from 573.87 seconds to about
4 hours :-)
Are you sure that your numbers are correct? How big was the TDB
database on disk at the end of the run? Do you have any other ideas
about what might be wrong with my configuration?
I would very much appreciate it if others on this mailing list could
also give it a quick try. I'd like to know whether it's just me (and
my colleagues), or whether there is some kind of pattern that explains
this huge difference.
Here is the simple test program again (inlined this time, since Apache
seems to throw away attachments). To run it, just change the TDB_DIR
constant to some empty directory. The test program loads 100,000
datagraphs (about 100 triples each -> 10M triples total). It prints a
message to the console every 100 graphs, so if each println takes
seconds, you'll know very quickly that the run will take hours instead
of the few minutes Andy has seen.
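For reference, this is roughly how I launch it (a sketch, not my exact
command line - the classpath placeholder stands for wherever your
Jena/TDB jars live, and -Xmx1200m is the heap size discussed earlier
in this thread):

    java -Xmx1200m -cp <jena-and-tdb-jars>:. com.ibm.lld.test.TDBOutOfMemoryTest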
Thanks,
Frank.
>>>>>>>> BEGIN TDBOutOfMemoryTest.java
package com.ibm.lld.test;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.shared.Lock;
import com.hp.hpl.jena.tdb.TDB;
import com.hp.hpl.jena.tdb.TDBFactory;
public class TDBOutOfMemoryTest
{
    public static final String TDB_DIR = "D:/work/relm/outofmem_jena_DB";
    public static final int NOGRAPHS = 100000; // Number of data graphs to load

    public static void main( String[] args ) {
        System.out.println("> Starting test: " + new java.util.Date());
        Dataset dataset = TDBFactory.createDataset(TDB_DIR);
        System.out.println("> Initial number of indexed graphs: "
            + dataset.asDatasetGraph().size());
        try {
            for (int i = 0; i < NOGRAPHS; ) {
                InputStream instream = getGraph(i); // the RDF graph to load
                // Each graph goes into its own named model, under a write lock
                dataset.getLock().enterCriticalSection(Lock.WRITE);
                try {
                    Model model = dataset.getNamedModel(
                        "https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/" + i);
                    model.read(instream, null); // parses the RDF/XML into the model
                    //model.close();
                } finally { dataset.getLock().leaveCriticalSection(); }
                // Progress marker every 100 graphs
                if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
                instream.close();
            }
            TDB.sync(dataset);
            dataset.close();
            System.out.println("> Done at: " + new java.util.Date());
        }
        catch (IOException e) {
            System.out.println("> Failed: " + e.getMessage());
        }
    }

    private static InputStream getGraph(int no) {
        String graph = GRAPH_TEMPLATE.replaceAll("%NUMBER%", String.valueOf(no));
        return new ByteArrayInputStream(graph.getBytes());
    }
    // One work-item data graph (~100 triples) in RDF/XML; %NUMBER% is
    // substituted per graph so each graph gets a distinct subject URI.
    private static final String GRAPH_TEMPLATE =
        "<rdf:RDF\n" +
        "    xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"\n" +
        "    xmlns:j.0=\"http://open-services.net/ns/core#\"\n" +
        "    xmlns:j.1=\"http://open-services.net/ns/cm-x#\"\n" +
        "    xmlns:j.2=\"http://purl.org/dc/terms/\"\n" +
        "    xmlns:j.3=\"http://open-services.net/ns/cm#\"\n" +
        "    xmlns:j.5=\"http://jazz.net/xmlns/prod/jazz/rtc/ext/1.0/\"\n" +
        "    xmlns:j.4=\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/\" > \n" +
        "  <rdf:Description rdf:nodeID=\"A0\">\n" +
        "    <j.2:title>@mandrew</j.2:title>\n" +
        "    <rdf:type rdf:resource=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
        "    <rdf:object rdf:resource=\"https://jazz.net/jazz/oslc/automation/persons/_y4JlcPYJEdqU64Cr2VV0dQ\"/>\n" +
        "    <rdf:predicate rdf:resource=\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.textualReference.textuallyReferenced\"/>\n" +
        "    <rdf:subject rdf:resource=\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>\n" +
        "  </rdf:Description>\n" +
        "  <rdf:Description rdf:about=\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\">\n" +
        "    <j.2:title rdf:parseType=\"Literal\">Process REST Service doesn't scale</j.2:title>\n" +
        "    <rdf:type rdf:resource=\"http://open-services.net/ns/cm#ChangeRequest\"/>\n" +
        "    <j.4:com.ibm.team.workitem.linktype.textualReference.textuallyReferenced rdf:resource=\"https://jazz.net/jazz/oslc/automation/persons/_P_wUELLTEduhAusIxeOxbA\"/>\n" +
        "    <j.4:com.ibm.team.workitem.linktype.textualReference.textuallyReferenced rdf:resource=\"https://jazz.net/jazz/oslc/automation/persons/_gTuTMG62Edu8R4joT9P1Ug\"/>\n" +
        "    <j.4:com.ibm.team.workitem.linktype.textualReference.textuallyReferenced rdf:resource=\"https://jazz.net/jazz/oslc/automation/persons/_y4JlcPYJEdqU64Cr2VV0dQ\"/>\n" +
        "    <j.4:com.ibm.team.workitem.linktype.textualReference.textuallyReferenced rdf:resource=\"https://jazz.net/jazz/oslc/automation/persons/_vCJP8OeKEduR89vYjZT89g\"/>\n" +
        "    <j.4:com.ibm.team.workitem.linktype.textualReference.textuallyReferenced rdf:resource=\"https://jazz.net/sandbox02/ccm/service/com.ibm.team.process.internal.common.service.IProcessRestService/processAreasForUser?userId=shilpat\"/>\n" +
        "    <j.4:com.ibm.team.workitem.linktype.textualReference.textuallyReferenced rdf:resource=\"https://jazz.net/jazz/oslc/automation/persons/_DziEAHHfEdyLLb7t1B32_A\"/>\n" +
        "    <j.4:com.ibm.team.workitem.linktype.textualReference.textuallyReferenced rdf:resource=\"https://jazz.net/wiki/bin/view/Main/DraftTeamProcessRestApi#Project_Areas_collection\"/>\n" +
        "    <j.4:com.ibm.team.workitem.linktype.textualReference.textuallyReferenced rdf:resource=\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/154737\"/>\n" +
        "    <j.3:fixed>false</j.3:fixed>\n" +
        "    <j.0:discussion rdf:resource=\"https://jazz.net/jazz/oslc/workitems/_p0Lr8D7SEeCl0bUDoWAOSQ/rtc_cm:comments\"/>\n" +
        "    <j.2:contributor rdf:resource=\"https://jazz.net/jazz/oslc/users/_y4JlcPYJEdqU64Cr2VV0dQ\"/>\n" +
        "    <j.5:client rdf:resource=\"https://jazz.net/jazz/oslc/enumerations/_Q2fMII8EEd2Q-OW8dr3S5w/client/client.literal.l12\"/>\n" +
        "    <j.4:plannedFor rdf:resource=\"https://jazz.net/jazz/oslc/iterations/_VceosAh8EeC72Mz-78YBKQ\"/>\n" +
        "    <j.2:modified>2011-02-23T22:33:45.764Z</j.2:modified>\n" +
        "    <j.5:contextId>_Q2fMII8EEd2Q-OW8dr3S5w</j.5:contextId>\n" +
        "    <j.4:timeSheet rdf:resource=\"https://jazz.net/jazz/oslc/workitems/_p0Lr8D7SEeCl0bUDoWAOSQ/rtc_cm:timeSheet\"/>\n" +
        "    <j.4:filedAgainst rdf:resource=\"https://jazz.net/jazz/resource/itemOid/com.ibm.team.workitem.Category/_YNQI4I8FEd2Q-OW8dr3S5w\"/>\n" +
        "    <j.3:verified>false</j.3:verified>\n" +
        "    <j.4:correctedEstimate></j.4:correctedEstimate>\n" +
        "    <j.1:priority rdf:resource=\"https://jazz.net/jazz/oslc/enumerations/_Q2fMII8EEd2Q-OW8dr3S5w/priority/4\"/>\n" +
        "    <j.3:approved>false</j.3:approved>\n" +
        "    <j.3:status>In Progress</j.3:status>\n" +
        "    <j.2:type>Defect</j.2:type>\n" +
        "    <j.4:modifiedBy rdf:resource=\"https://jazz.net/jazz/oslc/users/_y4JlcPYJEdqU64Cr2VV0dQ\"/>\n" +
        "    <j.2:created>2011-02-22T22:35:15.682Z</j.2:created>\n" +
        "    <j.4:timeSpent></j.4:timeSpent>\n" +
        "    <j.3:closed>false</j.3:closed>\n" +
        "    <j.0:shortTitle rdf:parseType=\"Literal\">Defect %NUMBER%</j.0:shortTitle>\n" +
        "    <j.4:state rdf:resource=\"https://jazz.net/jazz/oslc/workflows/_Q2fMII8EEd2Q-OW8dr3S5w/states/bugzillaWorkflow/2\"/>\n" +
        "    <j.4:estimate></j.4:estimate>\n" +
        "    <j.2:identifier>%NUMBER%</j.2:identifier>\n" +
        "    <j.2:description rdf:datatype=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral\">" +
        "On Jazz.net, we have a 3.0 &quot;sandbox&quot; deployed (CCM+JTS) which allows " +
        "any jazz.net user to create a project to try out RTC.&nbsp; We're seeing " +
        "massive performance problems due to an apparent scalability problem in " +
        "process.&nbsp; Currently, the sandbox has &gt; 100 projects created. " +
        "This is causing the following issues:<br></br><br></br>" +
        "1) Sandbox home page loads a list of the current user's projects by calling " +
        "process rest service with user id.&nbsp; This request takes &gt; 60 " +
        "seconds.<br></br>" +
        "2) CCM app gets stuck on &quot;Loading...&quot; for &gt; 60 seconds, " +
        "spinning on the request to InitializationData.&nbsp; InitData is waiting " +
        "to get the response from process's initializer (which is doing a lookup " +
        "based on the name of the project).<br></br>" +
        "3) Home menu hangs for a long time waiting for the list of projects to " +
        "populate (There's also a UI scaleability issue in the home menu... the " +
        "number of projects exceeds available space in the viewport, but that's " +
        "another item).<br></br><br></br>" +
        "This is a blocker. Process must be able to scale to hundreds, possibly " +
        "thousands of projects, without slowing down the loading of the web " +
        "UI.</j.2:description>\n" +
        "    <j.4:foundIn rdf:resource=\"https://jazz.net/jazz/resource/itemOid/com.ibm.team.workitem.Deliverable/_kuFJcPDhEd-1FumPcb1epw\"/>\n" +
        "    <j.5:howfound rdf:resource=\"https://jazz.net/jazz/oslc/enumerations/_Q2fMII8EEd2Q-OW8dr3S5w/howfound/howfound.literal.l3\"/>\n" +
        "    <j.2:subject>service</j.2:subject>\n" +
        "    <j.1:severity rdf:resource=\"https://jazz.net/jazz/oslc/enumerations/_Q2fMII8EEd2Q-OW8dr3S5w/severity/5\"/>\n" +
        "    <j.4:type rdf:resource=\"https://jazz.net/jazz/oslc/types/_Q2fMII8EEd2Q-OW8dr3S5w/defect\"/>\n" +
        "    <j.1:project rdf:resource=\"https://jazz.net/jazz/oslc/projectareas/_Q2fMII8EEd2Q-OW8dr3S5w\"/>\n" +
        "    <j.2:creator rdf:resource=\"https://jazz.net/jazz/oslc/users/_gTuTMG62Edu8R4joT9P1Ug\"/>\n" +
        "    <j.4:teamArea rdf:resource=\"https://jazz.net/jazz/oslc/teamareas/_ER2xcI8FEd2Q-OW8dr3S5w\"/>\n" +
        "    <j.3:reviewed>false</j.3:reviewed>\n" +
        "    <j.5:archived>false</j.5:archived>\n" +
        "    <j.4:resolvedBy rdf:resource=\"https://jazz.net/jazz/oslc/users/_YNh4MOlsEdq4xpiOKg5hvA\"/>\n" +
        "    <j.5:os rdf:resource=\"https://jazz.net/jazz/oslc/enumerations/_Q2fMII8EEd2Q-OW8dr3S5w/OS/OS.literal.l1\"/>\n" +
        "    <j.3:inprogress>true</j.3:inprogress>\n" +
        "    <j.0:serviceProvider rdf:resource=\"https://jazz.net/jazz/oslc/contexts/_Q2fMII8EEd2Q-OW8dr3S5w/workitems/services\"/>\n" +
        "    <j.4:progressTracking rdf:resource=\"https://jazz.net/jazz/oslc/workitems/_p0Lr8D7SEeCl0bUDoWAOSQ/progressTracking\"/>\n" +
        "    <j.4:com.ibm.team.filesystem.workitems.change_set.com.ibm.team.scm.ChangeSet rdf:resource=\"https://jazz.net/jazz/resource/itemOid/com.ibm.team.scm.ChangeSet/_p2Q08T7lEeC50ZOFeYh_9w\"/>\n" +
        "    <j.4:com.ibm.team.filesystem.workitems.change_set.com.ibm.team.scm.ChangeSet rdf:resource=\"https://jazz.net/jazz/resource/itemOid/com.ibm.team.scm.ChangeSet/_S7U3gT8XEeC50ZOFeYh_9w\"/>\n" +
        "  </rdf:Description>\n" +
        "  <rdf:Description rdf:nodeID=\"A1\">\n" +
        "    <j.2:title>https://jazz.net/sandbox02/ccm/service/com.ibm.team.process.internal.common.service.IProcessRestService/processAreasForUser?userId=shilpat</j.2:title>\n" +
        "    <rdf:type rdf:resource=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
        "    <rdf:object rdf:resource=\"https://jazz.net/sandbox02/ccm/service/com.ibm.team.process.internal.common.service.IProcessRestService/processAreasForUser?userId=shilpat\"/>\n" +
        "    <rdf:predicate rdf:resource=\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.textualReference.textuallyReferenced\"/>\n" +
        "    <rdf:subject rdf:resource=\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>\n" +
        "  </rdf:Description>\n" +
        "  <rdf:Description rdf:nodeID=\"A2\">\n" +
        "    <j.2:title>@packham</j.2:title>\n" +
        "    <rdf:type rdf:resource=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
        "    <rdf:object rdf:resource=\"https://jazz.net/jazz/oslc/automation/persons/_vCJP8OeKEduR89vYjZT89g\"/>\n" +
        "    <rdf:predicate rdf:resource=\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.textualReference.textuallyReferenced\"/>\n" +
        "    <rdf:subject rdf:resource=\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>\n" +
        "  </rdf:Description>\n" +
        "  <rdf:Description rdf:nodeID=\"A3\">\n" +
        "    <j.2:title>154737: Replace ProjectAreaWebUIInitializionData with a dynamic module</j.2:title>\n" +
        "    <rdf:type rdf:resource=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
        "    <rdf:object rdf:resource=\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/154737\"/>\n" +
        "    <rdf:predicate rdf:resource=\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.textualReference.textuallyReferenced\"/>\n" +
        "    <rdf:subject rdf:resource=\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>\n" +
        "  </rdf:Description>\n" +
        "  <rdf:Description rdf:nodeID=\"A4\">\n" +
        "    <j.2:title>@retchles</j.2:title>\n" +
        "    <rdf:type rdf:resource=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
        "    <rdf:object rdf:resource=\"https://jazz.net/jazz/oslc/automation/persons/_gTuTMG62Edu8R4joT9P1Ug\"/>\n" +
        "    <rdf:predicate rdf:resource=\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.textualReference.textuallyReferenced\"/>\n" +
        "    <rdf:subject rdf:resource=\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>\n" +
        "  </rdf:Description>\n" +
        "  <rdf:Description rdf:nodeID=\"A5\">\n" +
        "    <j.2:title>Changes in Process - &lt;No Comment&gt; - Jared Burns - Feb 23, 2011 1:36 AM</j.2:title>\n" +
        "    <rdf:type rdf:resource=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
        "    <rdf:object rdf:resource=\"https://jazz.net/jazz/resource/itemOid/com.ibm.team.scm.ChangeSet/_S7U3gT8XEeC50ZOFeYh_9w\"/>\n" +
        "    <rdf:predicate rdf:resource=\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.filesystem.workitems.change_set.com.ibm.team.scm.ChangeSet\"/>\n" +
        "    <rdf:subject rdf:resource=\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>\n" +
        "  </rdf:Description>\n" +
        "  <rdf:Description rdf:nodeID=\"A6\">\n" +
        "    <j.2:title>@mjarvis</j.2:title>\n" +
        "    <rdf:type rdf:resource=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
        "    <rdf:object rdf:resource=\"https://jazz.net/jazz/oslc/automation/persons/_P_wUELLTEduhAusIxeOxbA\"/>\n" +
        "    <rdf:predicate rdf:resource=\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.textualReference.textuallyReferenced\"/>\n" +
        "    <rdf:subject rdf:resource=\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>\n" +
        "  </rdf:Description>\n" +
        "  <rdf:Description rdf:nodeID=\"A7\">\n" +
        "    <j.2:title>Changes in Process - Performance test - Martha Andrews - Feb 22, 2011 9:45 PM</j.2:title>\n" +
        "    <rdf:type rdf:resource=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
        "    <rdf:object rdf:resource=\"https://jazz.net/jazz/resource/itemOid/com.ibm.team.scm.ChangeSet/_p2Q08T7lEeC50ZOFeYh_9w\"/>\n" +
        "    <rdf:predicate rdf:resource=\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.filesystem.workitems.change_set.com.ibm.team.scm.ChangeSet\"/>\n" +
        "    <rdf:subject rdf:resource=\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>\n" +
        "  </rdf:Description>\n" +
        "  <rdf:Description rdf:nodeID=\"A8\">\n" +
        "    <j.2:title>@storaskar</j.2:title>\n" +
        "    <rdf:type rdf:resource=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
        "    <rdf:object rdf:resource=\"https://jazz.net/jazz/oslc/automation/persons/_DziEAHHfEdyLLb7t1B32_A\"/>\n" +
        "    <rdf:predicate rdf:resource=\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.textualReference.textuallyReferenced\"/>\n" +
        "    <rdf:subject rdf:resource=\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>\n" +
        "  </rdf:Description>\n" +
        "  <rdf:Description rdf:nodeID=\"A9\">\n" +
        "    <j.2:title>https://jazz.net/wiki/bin/view/Main/DraftTeamProcessRestApi#Project_Areas_collection</j.2:title>\n" +
        "    <rdf:type rdf:resource=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
        "    <rdf:object rdf:resource=\"https://jazz.net/wiki/bin/view/Main/DraftTeamProcessRestApi#Project_Areas_collection\"/>\n" +
        "    <rdf:predicate rdf:resource=\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.textualReference.textuallyReferenced\"/>\n" +
        "    <rdf:subject rdf:resource=\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>\n" +
        "  </rdf:Description>\n" +
        "</rdf:RDF>\n";
}
>>>>>>>> END TDBOutOfMemoryTest.java
Andy Seaborne <[email protected]> wrote on 03/09/2011 09:46:58
AM:
> On 08/03/11 18:45, Frank Budinsky wrote:
> >
> > Hi Andy,
> >
> > I never actually tried to run it past 100,000 graphs, so I don't
> > know what will happen.
> >
> > I just finished running it with -Xmx1200m, and as you thought it
> > should, it did run (with 100,000 graphs) to completion. It took just
> > under 4 hours to run on my Thinkpad T61 laptop running Windows XP and
> > with 3G of RAM. I noticed that it was loading about 700 graphs/minute
> > at the start, around 550 graphs/minute through most of the run (up
> > till about 90,000 graphs), but then got significantly slower towards
> > the end (i.e., only about 55 graphs/minute for the last 1000). You
> > mentioned that your JVM was 1.6 G. Is that something you can
> > configure? I noticed that the total memory of my java process peaked
> > at about 1.2 G.
>
> No - it might just be an artifact of using a 64 bit JVM (openjdk).
> The heap size (max memory) was 1044M in main() I think.
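>
> A quick way to see what the JVM actually got is to print it at the top
> of main() - a minimal sketch, using only standard java.lang.Runtime:
>
>     long maxMB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
>     System.out.println("Max memory: " + maxMB + "M");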
>
> > I'll try running the test again with 500,000 graphs to see if it
> > gets all the way through. Since it runs so slowly, I'll let it run
> > overnight and see what happens.
>
> It was faster on my machine, but I was using a desktop, and disk
> speeds on portables are notoriously slow for database use.
>
> > How long does it take to load the 100,000 graphs for you? I assume
> > this runs much faster on your hardware.
>
> 573.87 seconds and one cup of coffee running inside Eclipse.
>
> My desktop is a quad core i5 CPU 760 @ 2.80GHz, ext4 filing system in
> Ubuntu 10.10. SATA IDE disk; 7200 RPM.
>
> > I'm wondering if there's a minimum hardware requirement to run TDB,
> > especially if it's being used to load tens or hundreds of millions
> > of triples? It would be nice to set expectations for what kind of
> > hardware is needed for this.
>
> Yes but it's very hard to set expectations. If it's doing subject ->
> properties lookup, it scales much better than GROUP BY and BI queries.
>
> SSDs are the next new thing.
>
> The bulk loader is faster, and you can load on one machine and run
> queries on another.
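>
> For example (a sketch - tdbloader is the script in the TDB
> distribution's bin directory, and the data file name here is just a
> placeholder):
>
>     tdbloader --loc /path/to/empty_db data.nt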
>
> Andy
>
> >
> > Thanks,
> > Frank.
> >
> > Andy Seaborne<[email protected]> wrote on 03/08/2011
11:53:06
> > AM:
> >
> >> On 08/03/11 15:05, Frank Budinsky wrote:
> >>>
> >>>>> I tried increasing the amount of memory, but that just increased
> >>>>> the number of calls that succeed (e.g., 10000 vs 2000) before
> >>>>> getting the exception.
> >>>>
> >>>> What sizes of heap are you using?
> >>>
> >>> I've been experimenting with various heap sizes, noticing that the
> >>> bigger I make it the longer it runs before crashing. When using the
> >>> default (small) heap size (-Xmx64m) the test program fails at about
> >>> 5800 graphs. If I bump it all the way up to -Xmx1200m, like you
> >>> did, I suspect I will also be able to run the test to completion
> >>> (100,000 graphs), but it takes very long to run (more than 2 hours
> >>> on my machine). I'm guessing this is also running much faster for
> >>> you?
> >>>
> >>> Extrapolating from what you're saying, it looks like I would need
> >>> a heap of 6G, or so, to hit my original target of 500,000 graphs
> >>> (about 50M triples total). Does that sound right? That is, needing
> >>> to run with such a huge heap?
> >>
> >> No - the caches are bounded. Once they reach steady state, there is
> >> no further growth and no fixed scale limit. I've run a trial with
> >> 500,000 with -Xmx1200M and it works for me. JVM is 1.6G. The caches
> >> were still filling up in your test case.
> >>
> >> The caches are LRU by slots, which is a bit crude for the node
> >> cache as nodes vary in size. Index files have fixed-size units
> >> (blocks - they are 8 Kbytes).
> >>
> >> The default settings are supposed to work for a heap of 1.2G --
> >> it's what the scripts set the heap to.
> >>
> >> Andy
> >>
> >>>
> >>> Thanks,
> >>> Frank.
> >>>
> >>> Andy Seaborne<[email protected]> wrote on 03/08/2011
> > 09:23:50
> >>> AM:
> >>>
> >>>> (Frank sent me the detached file)
> >>>>
> >>>> Frank,
> >>>>
> >>>> I'm on a 64 bit machine, but I'm setting direct mode and limiting
> >>>> the Java heap size with -Xmx.
> >>>>
> >>>> With a heap of 1200M, java reports 1066M max memory, and the test
> >>>> runs. With a heap of 500M, java reports 444M max memory, and the
> >>>> test stops at 11800.
> >>>>
> >>>> Things will be a little different for 32 bit but should be
> >>>> approximately the same. TDB is doing the same things.
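> >>>>
> >>>> For the record, this is roughly how I run it (a sketch; I'm
> >>>> assuming the tdb:fileMode system property that TDB checks at
> >>>> startup - verify against SystemTDB before relying on it):
> >>>>
> >>>>     java -Xmx1200M -Dtdb:fileMode=direct com.ibm.lld.test.TDBOutOfMemoryTest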
> >>>>
> >>>> Tweaking the block cache sizes (sorry, magic needed) down to 5000
> >>>> (read, default 10000) and 1000 (write, default 2000), it runs at
> >>>> 500M, but slower.
> >>>>
> >>>> There are quite a few files for named graphs so small changes in
> >>>> cache size get multiplied (x12, I think).
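> >>>>
> >>>> (Rough arithmetic, assuming the per-file block caches dominate the
> >>>> heap: 10000 read blocks x 8 Kbytes x 12 files is about 960M at the
> >>>> defaults, versus 5000 x 8 Kbytes x 12 files at about 480M after
> >>>> the tweak - which lines up with the defaults wanting a 1.2G heap
> >>>> and the reduced settings running in 500M.)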
> >>>>
> >>>>> I tried increasing the amount of memory, but that just increased
> >>>>> the number of calls that succeed (e.g., 10000 vs 2000) before
> >>>>> getting the exception.
> >>>>
> >>>> What sizes of heap are you using?
> >>>>
> >>>> Andy
> >>>>
> >>>> On 07/03/11 18:34, Frank Budinsky wrote:
> >>>>> Hi Andy,
> >>>>>
> >>>>> I created a simple standalone test program that roughly simulates
> >>>>> what my application is doing, and it also crashes with the same
> >>>>> OutOfMemoryError exception. I've attached it here. Would it be
> >>>>> possible for you to give it a try?
> >>>>>
> >>>>> /(See attached file: TDBOutOfMemoryTest.java)/
> >>>>>
> >>>>> Just change TDB_DIR to some new empty database location and run.
> >>>>> It gets the OutOfMemoryError at around 5800 graphs when I run it
> >>>>> with default VM params.
> >>>>>
> >>>>> Thanks,
> >>>>> Frank.
> >>>>>
> >>>>>
> >>>>> Andy Seaborne<[email protected]> wrote on 03/02/2011
> >>>>> 09:38:51 AM:
> >>>>>
> >>>>> > Hi Frank,
> >>>>> >
> >>>>> > On 28/02/11 14:48, Frank Budinsky wrote:
> >>>>> > >
> >>>>> > > Hi Andy,
> >>>>> > >
> >>>>> > > I did some further analysis of my OutOfMemoryError problem,
> >>>>> > > and this is what I've discovered. The problem seems to be
> >>>>> > > that there is one instance of class NodeTupleTableConcrete
> >>>>> > > that contains an ever-growing set of tuples which eventually
> >>>>> > > uses up all the available heap space and then crashes.
> >>>>> > >
> >>>>> > > To be more specific, this field in class TupleTable:
> >>>>> > >
> >>>>> > > private final TupleIndex[] indexes ;
> >>>>> > >
> >>>>> > > seems to contain 6 continually growing TupleIndexRecord
> >>>>> > > instances (BPlusTrees). From my measurements, this seems to
> >>>>> > > eat up approximately 1G of heap for every 1M triples in the
> >>>>> > > Dataset (i.e., about 1K per triple). So, to load my 100K
> >>>>> > > datagraphs (~10M total triples) it would seem to need 10G of
> >>>>> > > heap space.
> >>>>> >
> >>>>> > There are 6 indexes for named graphs (see the files GSPO etc).
> >>>>> > TDB uses total indexing, which puts a lot of work at load time
> >>>>> > but means any lookup needed is always done with an index scan.
> >>>>> > The code can run with fewer indexes - the minimum is one - but
> >>>>> > that is not exposed in the configuration.
> >>>>> >
> >>>>> > Each index holds quads (4 NodeIds; a NodeId is 64 bits on
> >>>>> > disk). As the index grows, the data goes to disk. There is a
> >>>>> > finite LRU cache in front of each index.
> >>>>> >
> >>>>> > Does your dataset have a location? If it has no location, it's
> >>>>> > all in-memory with a RAM-disk like structure. This is for
> >>>>> > small-scale testing only - it really does read and write blocks
> >>>>> > out of the RAM disk by copy to give strict disk-like semantics.
> >>>>> >
> >>>>> > There is also a NodeTable mapping between NodeId and Node
> >>>>> > (Jena's graph-level RDF Term class). This has a cache in front
> >>>>> > of it. The long-ish literals may be the problem. The node table
> >>>>> > cache is fixed-number, not bounded by size.
> >>>>> >
> >>>>> > The sizes of the caches are controlled by:
> >>>>> >
> >>>>> > SystemTDB.Node2NodeIdCacheSize
> >>>>> > SystemTDB.NodeId2NodeCacheSize
> >>>>> >
> >>>>> > These are not easy to control: either (1) get the source code
> >>>>> > and alter the default values, or (2) see the code in SystemTDB
> >>>>> > that uses a properties file.
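> >>>>> >
> >>>>> > A sketch of what the properties file route might look like -
> >>>>> > I'm assuming the key names from the SystemTDB source, so check
> >>>>> > them against the code before relying on this:
> >>>>> >
> >>>>> >     tdb:Node2NodeIdCacheSize=50000
> >>>>> >     tdb:NodeId2NodeCacheSize=50000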
> >>>>> >
> >>>>> > If you can send me a copy of the data, I can try loading it
> >>>>> > here.
> >>>>> >
> >>>>> > > Does this make sense? How is it supposed to work? Shouldn't
> >>>>> > > the triples from previously loaded named graphs be eligible
> >>>>> > > for GC when I'm loading the next named graph? Could it be
> >>>>> > > that I'm holding onto something that's preventing GC in the
> >>>>> > > TupleTable?
> >>>>> > >
> >>>>> > > Also, after looking more carefully at the resources being
> >>>>> > > indexed, I noticed that many of them do have relatively large
> >>>>> > > literals (100s of characters). I also noticed that when using
> >>>>> > > Fuseki to load the resources I get lots of warning messages
> >>>>> > > like this, on the console:
> >>>>> > >
> >>>>> > > Lexical form 'We are currently doing
> >>>>> > > this:<br></br><br></br>workspaceConnection.replaceComponents
> >>>>> > > (replaceComponents, replaceSource, falses, false,
> >>>>> > > monitor);<br></br><br></br>the new way of doing it would be
> >>>>> > > something like:<br></br><br></br><br></br>
> >>>>> > > ArrayList<IComponentOp> replaceOps = new
> >>>>> > > ArrayList<IComponentOp>();<br></br>
> >>>>> > > for (Iterator iComponents = components.iterator();
> >>>>> > > iComponents.hasNext();) {<br></br>
> >>>>> > > IComponentHandle componentHandle = (IComponentHandle)
> >>>>> > > iComponents.next();<br></br>
> >>>>> > > replaceOps.add(promotionTargetConnection.componentOpFactory
> >>>>> > > ().replaceComponent(componentHandle,<br></br>
> >>>>> > > buildWorkspaceConnection, false));<br></br> }<br></br><br></br>
> >>>>> > > promotionTargetConnection.applyComponentOperations(replaceOps,
> >>>>> > > monitor);'
> >>>>> > > not valid for datatype
> >>>>> > > http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral
> >>>>> > >
> >>>>> > > Could this be part of the problem?
> >>>>> >
> >>>>> > No - it's a different issue. This is something coming from the
> >>>>> > parser.
> >>>>> >
> >>>>> > RDF XMLLiterals have special rules - they must follow exclusive
> >>>>> > canonical XML, which means, amongst a lot of other things, they
> >>>>> > have to be a single XML node. The rules for exclusive Canonical
> >>>>> > XML are really quite strict (e.g. attributes in alphabetical
> >>>>> > order).
> >>>>> >
> >>>>> > http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral
> >>>>> >
> >>>>> > If you want to store XML or HTML fragments, you can't use RDF
> >>>>> > XMLLiterals very easily - you have to mangle them to conform to
> >>>>> > the rules. I suggest storing them either as plain strings or
> >>>>> > inventing your own datatype.
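> >>>>> >
> >>>>> > For example, as a plain (untyped) string literal - a minimal
> >>>>> > sketch with a made-up subject and property, using the standard
> >>>>> > Jena Model API; an untyped literal has no canonical-XML rules:
> >>>>> >
> >>>>> >     Model m = ModelFactory.createDefaultModel();
> >>>>> >     Resource r = m.createResource("https://example.org/item/1");
> >>>>> >     Property desc = m.createProperty("http://purl.org/dc/terms/", "description");
> >>>>> >     r.addProperty(desc, m.createLiteral("We are doing this:<br></br>..."));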
> >>>>> >
> >>>>> > You can run the parser on its own using
> >>>>> > "riotcmd.riot --validate FILE ..."
> >>>>> >
> >>>>> >
> >>>>> > Andy
> >>>>> >
> >>>>> > >
> >>>>> > > Thanks,
> >>>>> > > Frank.
> >>>>>