Paolo Castagna wrote:
> I have some code to convert Freebase dumps in RDF, it's ~600 million
> triples, I'll use that to gather some numbers. Ideally, comparing
> tdbloader, tdbloader2, tdbloader3 and tdbloader4 (both in terms of
> time and costs).

FYI

Code to convert Freebase dumps into RDF is here:
https://github.com/castagna/freebase2rdf

Over the last couple of days I have been running a few experiments on
Amazon EC2 m1.xlarge instances (i.e. 15 GB of memory).

tdbloader didn't complete: it just kept getting slower and slower...


With tdbloader2 I had a java.lang.OutOfMemoryError:

Mar  5 05:22:30 ip-10-53-58-155 build: Add: 618,450,000 Data (Batch: 6,547 / Avg: 21,206)
Mar  5 05:35:10 ip-10-53-58-155 build: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Mar  5 05:35:10 ip-10-53-58-155 build:     at java.util.HashMap.<init>(HashMap.java:209)
Mar  5 05:35:10 ip-10-53-58-155 build:     at org.apache.xerces.impl.validation.ValidationState.<init>(Unknown Source)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.datatypes.xsd.XSDDatatype.parse(XSDDatatype.java:270)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.datatypes.xsd.impl.XSDBaseNumericType.parse(XSDBaseNumericType.java:165)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.setValue(LiteralLabelImpl.java:213)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.setLiteralLabel_1(LiteralLabelImpl.java:107)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.<init>(LiteralLabelImpl.java:96)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.impl.LiteralLabelFactory.createLiteralLabel(LiteralLabelFactory.java:28)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.Node.createLiteral(Node.java:103)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.sparql.util.NodeFactory.intToNode(NodeFactory.java:79)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.tdb.solver.stats.Stats.format(Stats.java:195)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.tdb.solver.stats.Stats.write(Stats.java:72)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:178)
Mar  5 05:35:10 ip-10-53-58-155 build:     at arq.cmdline.CmdMain.mainMethod(CmdMain.java:97)
Mar  5 05:35:10 ip-10-53-58-155 build:     at arq.cmdline.CmdMain.mainRun(CmdMain.java:59)
Mar  5 05:35:10 ip-10-53-58-155 build:     at arq.cmdline.CmdMain.mainRun(CmdMain.java:46)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:79)

I'll try giving the JVM more RAM.
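
To confirm that a larger heap setting actually reaches the JVM, a minimal
sketch like the one below could be run with the same options first. The
class name HeapReport and the helper toMiB are made up for illustration;
only the standard java.lang.Runtime calls are real. How the tdbloader2
script passes options to the JVM is not shown here, so that part needs
checking in the script itself.

```java
public class HeapReport {
    /** Converts a byte count to whole mebibytes (hypothetical helper). */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the ceiling the heap may grow to (set with -Xmx).
        System.out.println("max heap MiB:   " + toMiB(rt.maxMemory()));
        // totalMemory() is what the JVM has currently reserved for the heap.
        System.out.println("total heap MiB: " + toMiB(rt.totalMemory()));
        // freeMemory() is the unused part of the currently reserved heap.
        System.out.println("free heap MiB:  " + toMiB(rt.freeMemory()));
    }
}
```

Running it as, for example, "java -Xmx8G HeapReport" should report a
maximum heap close to the requested size; the 8G value is only an example,
not a measured requirement for this dataset.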


tdbloader3 ran out of disk space (because it writes its temporary files
to /tmp, while the available instance disk space is mounted on /mnt :-()
I'll see how to change/fix this and re-run.
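
A quick way to see where a JVM puts its default temporary files, and how
much room /tmp and /mnt have, is a sketch like the following. The class
name TmpSpace and the helper usableGiB are made up for illustration. Whether
tdbloader3 honours java.io.tmpdir or has its own setting for temporary
files is an assumption still to be verified.

```java
import java.io.File;

public class TmpSpace {
    /** Usable space on the partition holding dir, in whole GiB (hypothetical helper). */
    static long usableGiB(File dir) {
        // getUsableSpace() returns 0 if the path does not name a partition.
        return dir.getUsableSpace() / (1024L * 1024L * 1024L);
    }

    public static void main(String[] args) {
        // Default temp directory for this JVM; can be overridden with
        // -Djava.io.tmpdir=<dir> on the command line.
        String tmp = System.getProperty("java.io.tmpdir");
        System.out.println("java.io.tmpdir = " + tmp);

        // Compare the temp directory with the instance storage mount.
        File[] dirs = { new File(tmp), new File("/mnt") };
        for (File dir : dirs) {
            System.out.println(dir + ": " + usableGiB(dir) + " GiB usable");
        }
    }
}
```

If the temporary files do follow java.io.tmpdir, running the loader with
that property pointing at a directory under /mnt would be one possible fix.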

Paolo
