On 10/08/12 06:34, Michael Brunnbauer wrote:

> Hello Andy,
>
> [tdbloader2]
>
> On Thu, Aug 09, 2012 at 06:53:59PM +0200, Michael Brunnbauer wrote:
>> INFO  Add: 55,550,000 Data (Batch: 52 / Avg: 8,785)
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>> Any idea what a good value for -Xmx for 1B+ triples would be?
>> I will try with 16384 now.
>
> -Xmx16384M throws the memory error after 478 million triples:
>
> INFO  Add: 478,600,000 Data (Batch: 247 / Avg: 13,627)
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

478 million / 16G heap

This is bizarre.

Previously, with 32G heap:
INFO  Add: 55,500,000 Data (Batch: 98 / Avg: 10,335)
INFO    Elapsed: 5,369.59 seconds [2012/08/09 17:45:44 CEST]
INFO  Add: 55,550,000 Data (Batch: 52 / Avg: 8,785)
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

which is 55 million, far fewer triples than you reached after decreasing the heap size.

This morning, I have loaded (this is the end of the data phase, reformatted):

11:46:38 INFO  loader               ::
   Add: 747,400,000 Data (Batch: 187,969 / Avg: 131,924)
11:46:43 INFO  loader               ::
   Total: 747,436,151 tuples : 5,669.75 seconds :
   131,828.81 tuples/sec [2012/08/10 11:46:43 UTC]

with no change to tdbloader2 other than fixing the classpath setting bug, so it's running with -Xmx1200M.
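For reference: if tdbloader2 follows the same convention as the other TDB wrapper scripts and takes its heap setting from the JVM_ARGS environment variable (an assumption on my part; check the script if it has no effect), a bigger heap for a big load can be passed along these lines, where the location and file name are only placeholders:

  # assumes the script honours JVM_ARGS; otherwise edit the -Xmx value in the script itself
  JVM_ARGS="-Xmx16G" tdbloader2 --loc /data/tdb data.nt.gz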

The machine is a 34G machine on Amazon. I even forgot to halt the large dataset it is hosting, but it's not public yet and only the odd developer is testing against it.


What is the data like? The data shape should only affect the building of the node table.

Many long literals? (That might explain why the default setting was not enough.)

But that does not explain why decreasing the heap size lets it get further.
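If it keeps dying in the data phase, GC logging would at least show whether the collector is thrashing near the limit before the "GC overhead limit exceeded" hits. These are standard HotSpot flags; passing them this way assumes the same JVM_ARGS hook as above:

  # print each collection, with details and timestamps, to stdout during the load
  JVM_ARGS="-Xmx16G -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" tdbloader2 --loc /data/tdb data.nt.gz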

Unrelated:

I have noticed the parameters to sort(1) could be a lot better ...

e.g.
--buffer-size=50%  --parallel=3

I'll try that out, but you're crashing out in the data phase before index creation.
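For anyone who wants to try those flags by hand first: with GNU coreutils sort (--parallel needs a reasonably recent version) a run over one of the intermediate files would look roughly like this, file names being placeholders:

  # use up to half of RAM as the sort buffer, 3 sort threads, unique lines, write the result to a file
  sort --buffer-size=50% --parallel=3 -u -o spo.sorted spo.unsorted

Inside tdbloader2 the sort invocation sits in the script itself, so presumably trying this there means editing the script; I have not checked whether there is an environment hook for it.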

        Andy


> Regards,
>
> Michael Brunnbauer

