Re: abysmal multicore performance, especially on AMD processors

Marko Topolnik Mon, 10 Dec 2012 03:17:07 -0800

The main GC feature here is the Thread-Local Allocation Buffer (TLAB). TLABs 
are on by default and are "automatically sized according to allocation 
patterns". The size can also be fine-tuned with the -XX:TLABSize=n 
configuration option. You may consider tweaking this setting to optimize 
runtime. Basically, everything that one call to your function allocates 
should fit into a TLAB because it is all garbage upon exit. Allocation 
inside a TLAB is ultra-fast and completely concurrent, since each thread 
allocates from its own buffer.


Configure TLAB: <http://docs.oracle.com/javase/7/docs/technotes/tools/windows/java.html>
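
To make the above concrete, here is a minimal sketch of the kind of workload 
I mean: several threads repeatedly calling a function whose allocations are 
all garbage on return. The class name TlabDemo, the work() function and the 
512k value are made up for illustration; only -XX:TLABSize and 
-XX:-UseTLAB are real HotSpot options. Treat it as something to experiment 
with, not as a proper benchmark.

```java
public class TlabDemo {
    // Hypothetical workload: allocates a small temporary array that is
    // garbage as soon as the call returns.
    static long work(int i) {
        long[] tmp = new long[16];
        for (int k = 0; k < tmp.length; k++) {
            tmp[k] = i + k;
        }
        long sum = 0;
        for (long v : tmp) {
            sum += v;
        }
        return sum;
    }

    // Runs work() callsPerThread times on each of `threads` threads and
    // returns the combined result so the work cannot be discarded entirely.
    static long run(int threads, int callsPerThread) {
        long[] partial = new long[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                long sum = 0;
                for (int i = 0; i < callsPerThread; i++) {
                    sum += work(i);
                }
                partial[id] = sum; // each thread writes only its own slot
            });
            workers[t].start();
        }
        long total = 0;
        for (int t = 0; t < threads; t++) {
            try {
                workers[t].join(); // join makes partial[t] visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            total += partial[t];
        }
        return total;
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        long start = System.nanoTime();
        long total = run(threads, 5_000_000);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("threads=" + threads + " total=" + total
                + " time=" + millis + " ms");
    }
}
```

Try running it as "java -XX:TLABSize=512k TlabDemo" and, for comparison, as 
"java -XX:-UseTLAB TlabDemo" to see how much the thread-local buffers 
matter on your machine. Keep in mind that the JIT may optimize away some of 
the allocations in a toy like this.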

On Sunday, December 9, 2012 7:37:09 PM UTC+1, Andy Fingerhut wrote:
>
>
> On Dec 9, 2012, at 6:25 AM, Softaddicts wrote: 
>
> > If the number of object allocations mentioned earlier in this thread is 
> > real, yes, VM heap management can be a bottleneck. There has to be some 
> > locking done somewhere, otherwise the heap would corrupt :) 
> > 
> > The other bottleneck can come from garbage collection, which has to 
> > freeze object allocation completely or partially. 
> > 
> > This internal process has to reclaim unreferenced objects, otherwise you 
> > may end up exhausting the heap. That can even suspend your app while GC 
> > is running, depending on the strategy used. 
>
> Agreed that memory allocation and garbage collection will in some cases 
> need to coordinate between threads to work in the general case of arbitrary 
> allocations and GCs. 
>
> However, one could have a central list of large "pages" of free memory 
> (e.g. a few MBytes, or maybe even larger), and pass these out to concurrent 
> memory allocators in these large chunks, and let them do small object 
> allocations and GC within each thread completely concurrently. 
>
> The only times locking of any kind might be needed with such a strategy 
> would be when one of the parallel threads requests a new big page from the 
> central free list, or returns a completely empty free page back to the 
> central list that it didn't need any more.  All other memory allocation and 
> GC could be completely concurrent.  The idea of making those "pages" large 
> is that such passing pages around would be infrequent, and thus could be 
> made to have no measurable synchronization overhead. 
>
> That is pretty much what is happening when you run Lee's benchmark 
> programs as 1 thread per JVM, but 4 or 8 different JVM processes running 
> in parallel.  In that situation the OS has the central free list of pages, 
> and the JVMs manage their small object allocations and GCs completely 
> concurrently without interfering with each other. 
>
> If HotSpot's JVM could be configured to work like that, he would be seeing 
> big speedups in a single JVM. 
>
> Andy 
>
>
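
The scheme Andy describes above can be sketched roughly as follows. This is 
only a toy model under my own assumptions, not how HotSpot is implemented: 
the names PageAllocatorSketch and LocalAllocator are hypothetical, "memory" 
is just a byte array per page, and there is no GC at all. The point is that 
the shared free list is touched only when a thread needs a fresh page or 
hands one back, while every small allocation is an unsynchronized pointer 
bump.

```java
import java.util.ArrayDeque;

public class PageAllocatorSketch {
    private final int pageSize;
    // Central free list of pages; the only shared, locked structure.
    private final ArrayDeque<byte[]> freePages = new ArrayDeque<>();

    public PageAllocatorSketch(int pages, int pageSize) {
        this.pageSize = pageSize;
        for (int i = 0; i < pages; i++) {
            freePages.add(new byte[pageSize]);
        }
    }

    private synchronized byte[] takePage() {
        return freePages.poll(); // null when the central list is empty
    }

    private synchronized void returnPage(byte[] page) {
        freePages.add(page);
    }

    public synchronized int freePageCount() {
        return freePages.size();
    }

    public LocalAllocator newLocalAllocator() {
        return new LocalAllocator();
    }

    // Intended to be used by exactly one thread, so it needs no locking
    // of its own.
    public final class LocalAllocator {
        private byte[] page; // page currently owned by this thread
        private int top;     // next free offset within the page
        private int refills; // how often the central list was consulted

        // Returns the offset of the new "object" within the current page.
        public int allocate(int size) {
            if (size < 0 || size > pageSize) {
                throw new IllegalArgumentException("bad size: " + size);
            }
            if (page == null || top + size > pageSize) {
                // Slow path: the only place that synchronizes. The full
                // page is simply dropped, since this sketch has no GC.
                page = takePage();
                if (page == null) {
                    throw new IllegalStateException("no free pages left");
                }
                top = 0;
                refills++;
            }
            int offset = top;
            top += size; // fast path: plain pointer bump
            return offset;
        }

        // Hands the current page back to the central free list.
        public void release() {
            if (page != null) {
                returnPage(page);
                page = null;
                top = 0;
            }
        }

        public int refills() {
            return refills;
        }
    }
}
```

With pages of a few MB, as suggested in the quoted message, the slow path 
runs once per many thousands of small allocations.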

