On 10/08/12 06:34, Michael Brunnbauer wrote:
Hello Andy,
[tdbloader2]
On Thu, Aug 09, 2012 at 06:53:59PM +0200, Michael Brunnbauer wrote:
INFO Add: 55,550,000 Data (Batch: 52 / Avg: 8,785)
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Any idea what a good value for -Xmx for 1B+ triples would be ?
I will try with 16384 now.
-Xmx16384M throws the memory error after 478 million triples:
INFO Add: 478,600,000 Data (Batch: 247 / Avg: 13,627)
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
exceeded
478 million / 16G heap
This is bizarre.
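For the record, a minimal sketch of how a larger heap can be passed to the loader (an assumption, not a confirmed recipe: it presumes the tdbloader2 wrapper script honors a JVM_ARGS environment variable, as the stock TDB scripts do; the dataset location and file name are placeholders):

```shell
# Assumption: the tdbloader2 wrapper reads JVM_ARGS when building
# the java command line. Paths below are placeholders.
export JVM_ARGS="-Xmx16384M"
tdbloader2 --loc /path/to/tdb data.nt
```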
Previously, with 32G heap:
INFO Add: 55,500,000 Data (Batch: 98 / Avg: 10,335)
INFO Elapsed: 5,369.59 seconds [2012/08/09 17:45:44 CEST]
INFO Add: 55,550,000 Data (Batch: 52 / Avg: 8,785)
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
which is 55 million triples, far fewer than you got after decreasing the heap size.
This morning, I loaded (this is the end of the data phase,
reformatted):
11:46:38 INFO loader ::
Add: 747,400,000 Data (Batch: 187,969 / Avg: 131,924)
11:46:43 INFO loader ::
Total: 747,436,151 tuples : 5,669.75 seconds :
131,828.81 tuples/sec [2012/08/10 11:46:43 UTC]
with no change to tdbloader2 other than fixing the classpath setting bug,
so it's -Xmx1200M.
The machine is a 34G machine in Amazon - I even forgot to halt the large
dataset it is hosting but it's not public yet and only the odd developer
is testing against it.
What is the data like? The data shape should only affect the building
of the node table.
Many long literals? That might explain why the default setting was not
enough, but it does not explain why decreasing the heap size lets it get
further.
Unrelated:
I have noticed the parameters to sort(1) could be a lot better ...
e.g.
--buffer-size=50% --parallel=3
I'll try that out, but you're crashing in the data phase, before index
creation.
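To illustrate the flags mentioned above, a small runnable sketch (GNU coreutils sort; --parallel requires coreutils >= 8.6; the input file here is made-up demonstration data, not from the actual load):

```shell
# Create a tiny unsorted file (placeholder data).
printf 'c\na\nb\n' > /tmp/unsorted.txt

# --buffer-size=50%: use up to half of physical RAM for the in-memory
#                    sort buffer before spilling to temporary files.
# --parallel=3:      run up to 3 sorting threads concurrently.
sort --buffer-size=50% --parallel=3 /tmp/unsorted.txt
# prints a, b, c on separate lines
```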
Andy
Regards,
Michael Brunnbauer