On Thu, Dec 20, 2012 at 11:35 AM, Rong Xu <x...@google.com> wrote: > we have this patch primarily for getting valid profile counts. we > observe that for some high-threaded programs, we are getting poor > counter due to data racing of counter update (like counter value is > only 15% of what it supposed to be for a 10-thread program).
I have seen much worse on Octeon running with 32-threaded program. I think it was only 1% of what it should have been. > > In general, enabling atomic updates slows down programs. (for my some > of my toy programs, it has 3x slow down.) And that the reason I use > options to control value and edge profile count. I think on Octeon, the atomic updates would be a speedup because of the atomic instruction which was added explicitly for incrementing a statistics counter. Internally at Cavium, I might just turn this on by default as it even helps the one thread case :). Thanks, Andrew Pinski > > -Rong > > On Thu, Dec 20, 2012 at 8:57 AM, Andrew Pinski <pins...@gmail.com> wrote: >> On Thu, Dec 20, 2012 at 8:20 AM, Jan Hubicka <hubi...@ucw.cz> wrote: >>>> On Wed, Dec 19, 2012 at 4:29 PM, Andrew Pinski <pins...@gmail.com> wrote: >>>> > >>>> > On Wed, Dec 19, 2012 at 12:08 PM, Rong Xu <x...@google.com> wrote: >>>> > > Hi, >>>> > > >>>> > > This patch adds the supprot of atomic update the profile counters. >>>> > > Tested with google internal benchmarks and fdo kernel build. >>>> > >>>> > I think you should use the __atomic_ functions instead of __sync_ >>>> > functions as they allow better performance for simple counters as you >>>> > can use __ATOMIC_RELAXED. >>>> >>>> You are right. I think __ATOMIC_RELAXED should be OK here. >>>> Thanks for the suggestion. >>>> >>>> > >>>> > And this would be useful for the trunk also. I was going to implement >>>> > this exact thing this week but some other important stuff came up. >>>> >>>> I'll post trunk patch later. >>> >>> Yes, I like that patch, too. Even if the costs are quite high (and this is >>> why >>> atomic updates was sort of voted down in the past) the alternative of using >>> TLS >>> has problems with too-much per-thread memory. >> >> Actually sometimes (on some processors) atomic increments are cheaper >> than doing a regular incremental. Mainly because there is an >> instruction which can handle it in the L2 cache rather than populating >> the L1. Octeon is one such processor where this is true. >> >> Thanks, >> Andrew Pinski >> >>> >>> While there are even more alternatives, like recording the changes and >>> commmiting them in blocks (say at function return), I guess some solution is >>> better than no solution. >>> >>> Thanks, >>> Honza