Paolo Castagna wrote:
> I have some code to convert Freebase dumps in RDF, it's ~600 million
> triples, I'll use that to gather some numbers. Ideally, comparing
> tdbloader, tdbloader2, tdbloader3 and tdbloader4 (both in terms of
> time and costs).

FYI, the code to convert Freebase dumps to RDF is here:
https://github.com/castagna/freebase2rdf

I have been running a few experiments over the last couple of days on
Amazon EC2 m1.xlarge instances (i.e. 15 GB memory).

tdbloader didn't complete; it just kept getting slower and slower...

With tdbloader2 I got a java.lang.OutOfMemoryError:

Mar 5 05:22:30 ip-10-53-58-155 build: Add: 618,450,000 Data (Batch: 6,547 / Avg: 21,206)
Mar 5 05:35:10 ip-10-53-58-155 build: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Mar 5 05:35:10 ip-10-53-58-155 build:     at java.util.HashMap.<init>(HashMap.java:209)
Mar 5 05:35:10 ip-10-53-58-155 build:     at org.apache.xerces.impl.validation.ValidationState.<init>(Unknown Source)
Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.datatypes.xsd.XSDDatatype.parse(XSDDatatype.java:270)
Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.datatypes.xsd.impl.XSDBaseNumericType.parse(XSDBaseNumericType.java:165)
Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.setValue(LiteralLabelImpl.java:213)
Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.setLiteralLabel_1(LiteralLabelImpl.java:107)
Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.<init>(LiteralLabelImpl.java:96)
Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.impl.LiteralLabelFactory.createLiteralLabel(LiteralLabelFactory.java:28)
Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.Node.createLiteral(Node.java:103)
Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.sparql.util.NodeFactory.intToNode(NodeFactory.java:79)
Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.tdb.solver.stats.Stats.format(Stats.java:195)
Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.tdb.solver.stats.Stats.write(Stats.java:72)
Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:178)
Mar 5 05:35:10 ip-10-53-58-155 build:     at arq.cmdline.CmdMain.mainMethod(CmdMain.java:97)
Mar 5 05:35:10 ip-10-53-58-155 build:     at arq.cmdline.CmdMain.mainRun(CmdMain.java:59)
Mar 5 05:35:10 ip-10-53-58-155 build:     at arq.cmdline.CmdMain.mainRun(CmdMain.java:46)
Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:79)

I'll try giving the JVM more RAM.

tdbloader3 ran out of disk space (because it writes its temporary files
to /tmp, while the available instance disk space is mounted on /mnt :-()

I'll see how to change/fix this and re-run.

Paolo
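
P.S. Before re-running, it may help to check what heap limit and temporary
directory a JVM actually sees on the instance. The following is only a
minimal sketch: the class name JvmResources is made up for illustration,
and it assumes the loader picks up the standard JVM settings (-Xmx for the
heap, the java.io.tmpdir system property for temporary files). Whether the
tdbloader scripts pass those options through, or whether tdbloader3 uses
java.io.tmpdir for its temporary files at all, is not established here.

```java
import java.io.File;

public class JvmResources {

    /** Maximum heap the JVM will attempt to use, in MiB (set with -Xmx). */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Directory this JVM uses for temporary files (java.io.tmpdir). */
    public static File tempDir() {
        return new File(System.getProperty("java.io.tmpdir"));
    }

    /** Usable space, in MiB, on the partition that holds the given directory. */
    public static long usableSpaceMiB(File dir) {
        return dir.getUsableSpace() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        File tmp = tempDir();
        System.out.println("Max heap (MiB):          " + maxHeapMiB());
        System.out.println("Temp dir:                " + tmp.getAbsolutePath());
        System.out.println("Usable temp space (MiB): " + usableSpaceMiB(tmp));
    }
}
```

Running it as, for example, `java -Xmx12g -Djava.io.tmpdir=/mnt/tmp JvmResources`
would show whether the larger heap and the /mnt location are in effect; the
12g value and the /mnt/tmp path are illustrative choices, not tested settings.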
