Hi

Paolo Castagna wrote:
> Paolo Castagna wrote:
>> I have some code to convert Freebase dumps into RDF; it's ~600 million
>> triples, and I'll use that to gather some numbers, ideally comparing
>> tdbloader, tdbloader2, tdbloader3 and tdbloader4 (both in terms of
>> time and cost).
> 
> FYI
> 
> Code to convert Freebase dumps into RDF is here:
> https://github.com/castagna/freebase2rdf
> 
> I have been using Amazon EC2 instances to run a few experiments during
> the last couple of days with m1.xlarge instances (i.e. 15 GB memory).
> 
> tdbloader didn't complete; it just kept getting slower and slower...
> 
> 
> With tdbloader2 I had a java.lang.OutOfMemoryError:
> 
> Mar  5 05:22:30 ip-10-53-58-155 build: Add: 618,450,000 Data (Batch: 6,547 / Avg: 21,206)
> Mar  5 05:35:10 ip-10-53-58-155 build: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at java.util.HashMap.<init>(HashMap.java:209)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at org.apache.xerces.impl.validation.ValidationState.<init>(Unknown Source)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.datatypes.xsd.XSDDatatype.parse(XSDDatatype.java:270)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.datatypes.xsd.impl.XSDBaseNumericType.parse(XSDBaseNumericType.java:165)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.setValue(LiteralLabelImpl.java:213)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.setLiteralLabel_1(LiteralLabelImpl.java:107)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.<init>(LiteralLabelImpl.java:96)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.graph.impl.LiteralLabelFactory.createLiteralLabel(LiteralLabelFactory.java:28)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.graph.Node.createLiteral(Node.java:103)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.sparql.util.NodeFactory.intToNode(NodeFactory.java:79)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.tdb.solver.stats.Stats.format(Stats.java:195)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.tdb.solver.stats.Stats.write(Stats.java:72)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:178)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at arq.cmdline.CmdMain.mainMethod(CmdMain.java:97)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at arq.cmdline.CmdMain.mainRun(CmdMain.java:59)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at arq.cmdline.CmdMain.mainRun(CmdMain.java:46)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:79)
> 
> I'll try giving the JVM more RAM.

I tried with -Xmx2048m, but I had the same problem.
I'll try with -Xmx4096m.
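
Before re-running with a larger heap, it may be worth confirming that the -Xmx setting actually reaches the JVM that runs the loader, on the assumption that the loader is started through a wrapper script that may set its own heap options. A minimal sketch using only the standard Runtime API; the class name HeapCheck is made up for illustration and is not part of TDB:

```java
public class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Hypothetical helper, not part of Jena/TDB: prints the effective heap limit.
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Run it as `java -Xmx4096m HeapCheck`. The value printed may differ slightly from the requested -Xmx depending on the JVM and garbage collector.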

> tdbloader3 ran out of disk space (because it writes temporary files
> in /tmp, while the available instance disk space is mounted on /mnt :-( ).
> I'll see how to change/fix this and re-run.

This ran almost to completion this time, but I was using the --spill-size-auto 
policy, which clearly needs improvement.

...
Mar  6 04:28:11 ip-10-54-171-206 build: INFO  Add: 77,550,000 records to POS (Batch: 605 / Avg: 144,190)
Mar  6 04:29:15 ip-10-54-171-206 build: INFO  Add: 77,600,000 records to POS (Batch: 777 / Avg: 128,869)
Mar  6 04:30:20 ip-10-54-171-206 build: INFO  Add: 77,650,000 records to POS (Batch: 776 / Avg: 116,492)
Mar  6 04:47:11 ip-10-54-171-206 build: Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Mar  6 04:47:11 ip-10-54-171-206 build: #011at java.lang.Long.valueOf(Long.java:557)
Mar  6 04:47:11 ip-10-54-171-206 build: #011at cmd.tdbloader3$2.convert(tdbloader3.java:367)
Mar  6 04:47:11 ip-10-54-171-206 build: #011at cmd.tdbloader3$2.convert(tdbloader3.java:363)
Mar  6 04:47:11 ip-10-54-171-206 build: #011at org.openjena.atlas.iterator.Iter$4.next(Iter.java:293)
Mar  6 04:47:11 ip-10-54-171-206 build: #011at org.openjena.atlas.data.AbstractDataBag.addAll(AbstractDataBag.java:76)
Mar  6 04:47:11 ip-10-54-171-206 build: #011at cmd.tdbloader3.createBPlusTreeIndex(tdbloader3.java:378)
Mar  6 04:47:11 ip-10-54-171-206 build: #011at cmd.tdbloader3.exec(tdbloader3.java:252)
Mar  6 04:47:11 ip-10-54-171-206 build: #011at arq.cmdline.CmdMain.mainMethod(CmdMain.java:97)
Mar  6 04:47:11 ip-10-54-171-206 build: #011at arq.cmdline.CmdMain.mainRun(CmdMain.java:59)
Mar  6 04:47:11 ip-10-54-171-206 build: #011at arq.cmdline.CmdMain.mainRun(CmdMain.java:46)
Mar  6 04:47:11 ip-10-54-171-206 build: #011at cmd.tdbloader3.main(tdbloader3.java:129)

I'll try with a fixed --spill-size 10000000.
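
To illustrate what a fixed spill size means in general terms, here is a rough sketch of a bag that holds at most a fixed number of records in memory and writes a sorted run to disk whenever that limit is reached. This is not how tdbloader3 or its DataBag classes are implemented; the class SpillingLongBag and all of its methods are hypothetical. It stores primitive longs (avoiding the boxing visible in the Long.valueOf frame above) and takes the spill directory as a parameter, so it could point at a directory under /mnt rather than /tmp:

```java
import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: buffers longs in memory and spills sorted runs to disk. */
public class SpillingLongBag implements Closeable {
    private final int spillSize;   // maximum number of records held in memory
    private final Path tmpDir;     // directory that receives the spill files
    private final long[] buffer;   // primitive array, so no Long boxing
    private int count = 0;
    private final List<Path> spillFiles = new ArrayList<>();

    public SpillingLongBag(int spillSize, Path tmpDir) {
        this.spillSize = spillSize;
        this.tmpDir = tmpDir;
        this.buffer = new long[spillSize];
    }

    /** Adds one record, spilling the current buffer first if it is full. */
    public void add(long value) throws IOException {
        if (count == spillSize) {
            spill();
        }
        buffer[count++] = value;
    }

    /** Sorts the in-memory records and writes them out as one run. */
    private void spill() throws IOException {
        Arrays.sort(buffer, 0, count);
        Path file = Files.createTempFile(tmpDir, "spill-", ".bin");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (int i = 0; i < count; i++) {
                out.writeLong(buffer[i]);
            }
        }
        spillFiles.add(file);
        count = 0;
    }

    public int spillCount() { return spillFiles.size(); }

    public int inMemoryCount() { return count; }

    /** Deletes all spill files created by this bag. */
    @Override
    public void close() throws IOException {
        for (Path p : spillFiles) {
            Files.deleteIfExists(p);
        }
        spillFiles.clear();
        count = 0;
    }
}
```

With a fixed threshold the memory used by the buffer is bounded by the spill size times the record size, regardless of how the heap behaves during the load; the trade-off is more, smaller files to merge afterwards.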

Paolo

> 
> Paolo
