Hi

Paolo Castagna wrote:
> Paolo Castagna wrote:
>> I have some code to convert Freebase dumps in RDF, it's ~600 million
>> triples, I'll use that to gather some numbers. Ideally, comparing
>> tdbloader, tdbloader2, tdbloader3 and tdbloader4 (both in terms of
>> time and costs).
>
> FYI
>
> Code to convert Freebase dumps in RDF is here:
> https://github.com/castagna/freebase2rdf
>
> I have been using Amazon EC2 instances to run a few experiments during
> the last couple of days with m1.xlarge instances (i.e. 15 GB memory).
>
> tdbloader didn't complete, it was just getting slower and slower...
>
>
> With tdbloader2 I had a java.lang.OutOfMemoryError:
>
> Mar 5 05:22:30 ip-10-53-58-155 build: Add: 618,450,000 Data (Batch: 6,547 / Avg: 21,206)
> Mar 5 05:35:10 ip-10-53-58-155 build: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> Mar 5 05:35:10 ip-10-53-58-155 build:     at java.util.HashMap.<init>(HashMap.java:209)
> Mar 5 05:35:10 ip-10-53-58-155 build:     at org.apache.xerces.impl.validation.ValidationState.<init>(Unknown Source)
> Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.datatypes.xsd.XSDDatatype.parse(XSDDatatype.java:270)
> Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.datatypes.xsd.impl.XSDBaseNumericType.parse(XSDBaseNumericType.java:165)
> Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.setValue(LiteralLabelImpl.java:213)
> Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.setLiteralLabel_1(LiteralLabelImpl.java:107)
> Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.<init>(LiteralLabelImpl.java:96)
> Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.impl.LiteralLabelFactory.createLiteralLabel(LiteralLabelFactory.java:28)
> Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.Node.createLiteral(Node.java:103)
> Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.sparql.util.NodeFactory.intToNode(NodeFactory.java:79)
> Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.tdb.solver.stats.Stats.format(Stats.java:195)
> Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.tdb.solver.stats.Stats.write(Stats.java:72)
> Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:178)
> Mar 5 05:35:10 ip-10-53-58-155 build:     at arq.cmdline.CmdMain.mainMethod(CmdMain.java:97)
> Mar 5 05:35:10 ip-10-53-58-155 build:     at arq.cmdline.CmdMain.mainRun(CmdMain.java:59)
> Mar 5 05:35:10 ip-10-53-58-155 build:     at arq.cmdline.CmdMain.mainRun(CmdMain.java:46)
> Mar 5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:79)
>
> I'll try giving the JVM more RAM.
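
When a heap setting seems to have no effect, it can help to confirm how much heap the JVM actually received, since a wrapper script may override or drop the option. The following is a minimal sketch, not part of TDB; the class name HeapCheck and the method maxHeapMiB are made up for illustration. It only relies on the standard Runtime.maxMemory() call.

```java
public class HeapCheck {

    /** Maximum heap the JVM will attempt to use, in MiB (as reported by the runtime). */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run with the same options as the loader, e.g.: java -Xmx4096m HeapCheck
        System.out.println("max heap (MiB): " + maxHeapMiB());
    }
}
```

The reported value is usually close to, but not always exactly, the -Xmx value, because some collectors reserve part of the heap.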

I tried with -Xmx2048m, but I had the same problem. I'll try with -Xmx4096m.

> tdbloader3 ran out of disk space (because it is writing temporary files
> in /tmp and the available instance disk space is mounted on /mnt :-()
> I'll see how to change/fix this and re-run.

This ran almost to completion this time, but I was using the
--spill-size-auto policy, which clearly needs improvement.

...
Mar 6 04:28:11 ip-10-54-171-206 build: INFO Add: 77,550,000 records to POS (Batch: 605 / Avg: 144,190)
Mar 6 04:29:15 ip-10-54-171-206 build: INFO Add: 77,600,000 records to POS (Batch: 777 / Avg: 128,869)
Mar 6 04:30:20 ip-10-54-171-206 build: INFO Add: 77,650,000 records to POS (Batch: 776 / Avg: 116,492)
Mar 6 04:47:11 ip-10-54-171-206 build: Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Mar 6 04:47:11 ip-10-54-171-206 build:     at java.lang.Long.valueOf(Long.java:557)
Mar 6 04:47:11 ip-10-54-171-206 build:     at cmd.tdbloader3$2.convert(tdbloader3.java:367)
Mar 6 04:47:11 ip-10-54-171-206 build:     at cmd.tdbloader3$2.convert(tdbloader3.java:363)
Mar 6 04:47:11 ip-10-54-171-206 build:     at org.openjena.atlas.iterator.Iter$4.next(Iter.java:293)
Mar 6 04:47:11 ip-10-54-171-206 build:     at org.openjena.atlas.data.AbstractDataBag.addAll(AbstractDataBag.java:76)
Mar 6 04:47:11 ip-10-54-171-206 build:     at cmd.tdbloader3.createBPlusTreeIndex(tdbloader3.java:378)
Mar 6 04:47:11 ip-10-54-171-206 build:     at cmd.tdbloader3.exec(tdbloader3.java:252)
Mar 6 04:47:11 ip-10-54-171-206 build:     at arq.cmdline.CmdMain.mainMethod(CmdMain.java:97)
Mar 6 04:47:11 ip-10-54-171-206 build:     at arq.cmdline.CmdMain.mainRun(CmdMain.java:59)
Mar 6 04:47:11 ip-10-54-171-206 build:     at arq.cmdline.CmdMain.mainRun(CmdMain.java:46)
Mar 6 04:47:11 ip-10-54-171-206 build:     at cmd.tdbloader3.main(tdbloader3.java:129)

I'll try with a fixed --spill-size 10000000.

Paolo

>
> Paolo
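
To make the idea of a fixed spill size and a configurable temporary directory concrete, here is a generic external-sort sketch. It is not tdbloader3's code: it assumes records are plain long values, and the names SpillSortSketch, sortWithSpill, spillSize and tmpDir are hypothetical. At most spillSize records are buffered in memory. Each full buffer is sorted and written as a run under tmpDir (which could be a directory on /mnt instead of /tmp), and the runs are then merged.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class SpillSortSketch {

    /** One sorted run on disk, ordered by its current head value. */
    private static final class Run implements Comparable<Run> {
        final DataInputStream in;
        long head;

        Run(Path file) throws IOException {
            in = new DataInputStream(new BufferedInputStream(Files.newInputStream(file)));
        }

        /** Reads the next value into head; closes the stream and returns false at end of file. */
        boolean advance() throws IOException {
            try {
                head = in.readLong();
                return true;
            } catch (EOFException e) {
                in.close();
                return false;
            }
        }

        @Override
        public int compareTo(Run other) {
            return Long.compare(head, other.head);
        }
    }

    /** Sorts the buffer and writes it to a new temporary file under tmpDir. */
    private static Path spill(List<Long> buffer, Path tmpDir) throws IOException {
        Collections.sort(buffer);
        Path file = Files.createTempFile(tmpDir, "spill-", ".run");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (long value : buffer) {
                out.writeLong(value);
            }
        }
        return file;
    }

    /** Sorts the input while holding at most spillSize records in the in-memory buffer. */
    public static List<Long> sortWithSpill(Iterator<Long> input, int spillSize, Path tmpDir)
            throws IOException {
        List<Path> files = new ArrayList<>();
        List<Long> buffer = new ArrayList<>();
        while (input.hasNext()) {
            buffer.add(input.next());
            if (buffer.size() >= spillSize) {   // fixed threshold, independent of heap state
                files.add(spill(buffer, tmpDir));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            files.add(spill(buffer, tmpDir));
        }

        // K-way merge: the priority queue always yields the run with the smallest head.
        PriorityQueue<Run> heap = new PriorityQueue<>();
        for (Path file : files) {
            Run run = new Run(file);
            if (run.advance()) {
                heap.add(run);
            }
        }
        // Collected in memory only to keep the sketch short; a loader would stream this out.
        List<Long> sorted = new ArrayList<>();
        while (!heap.isEmpty()) {
            Run run = heap.poll();
            sorted.add(run.head);
            if (run.advance()) {
                heap.add(run);
            }
        }
        for (Path file : files) {
            Files.deleteIfExists(file);
        }
        return sorted;
    }

    public static void main(String[] args) throws IOException {
        Path tmpDir = Files.createTempDirectory("spill-sketch");
        List<Long> input = Arrays.asList(5L, 3L, 9L, 1L, 7L, 2L);
        System.out.println(sortWithSpill(input.iterator(), 2, tmpDir)); // prints [1, 2, 3, 5, 7, 9]
    }
}
```

A fixed threshold trades some throughput for predictability: memory use is bounded by spillSize whatever the garbage collector is doing, whereas an automatic policy has to guess when the heap is nearly full.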
