On 9 Feb 2014, at 15:53, Greg Parker <gpar...@apple.com> wrote:

> On Feb 9, 2014, at 12:19 AM, Gerriet M. Denkmann <gerr...@mdenkmann.de> wrote:
>> The real app (which I am trying to optimise) has actually two loops: one is
>> counting, the other one is modifying. Which seems to be good news.
>>
>> But I would really like to understand what I should do. Trial and error (or
>> blindly groping in the mist) is not really my preferred way of working.
>
> Optimizing small loops like this is a black art. Very small effects become
> critically important, such as the alignment of your loop instructions or the
> associativity of that CPU's L1 cache.
[...]
> Cache associativity can mean that there are some array split sizes that are
> much worse than others. If you choose the wrong size then each thread's
> working memory is on different cache lines, but those cache lines collide
> with each other in memory caches. Changing the work size to avoid collisions
> can help.

sysctl hw.cachelinesize returns:
hw.cachelinesize: 64

I divided my huge array (malloced, address is multiple of 0x1000) into at most
[NSProcessInfo processorCount] chunks, where each chunk starts at a multiple of
2^n (using fewer chunks if required by this rule).

The result of using dispatch_apply:

   n    time
   0    10
   1    5.5
   2    4
   3    3
   4    2
   5    1.7
   6    1.6
   7    1.5
  16    1.4

That is, your statement "that there are some array split sizes that are much
worse than others" is strongly backed up by my tests.

Kind regards,

Gerriet.

_______________________________________________
Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
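
For anyone who wants to reproduce the split, the chunking rule described above (at most processorCount chunks, each starting at a multiple of 2^n) could be sketched in plain C roughly as follows. This is only an illustration, not the code from the real app: the helper names aligned_chunk_size and chunk_count are made up, and it assumes that sizes and the 2^n alignment are counted in bytes from the start of the array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original post): returns a chunk size in
 * bytes such that splitting `total` bytes into pieces of this size gives at
 * most `max_chunks` pieces, each starting at a multiple of 2^n bytes from
 * the array base. Assumes total + max_chunks does not overflow size_t. */
size_t aligned_chunk_size(size_t total, size_t max_chunks, unsigned n)
{
    assert(max_chunks > 0 && n < sizeof(size_t) * 8);

    size_t align = (size_t)1 << n;                       /* 2^n */
    size_t raw   = (total + max_chunks - 1) / max_chunks; /* ceil(total / max_chunks) */
    size_t chunk = (raw + align - 1) / align * align;     /* round up to a multiple of 2^n */

    /* An empty array would give 0; fall back to one alignment unit. */
    return chunk > 0 ? chunk : align;
}

/* Hypothetical helper: number of chunks actually needed. This can be smaller
 * than max_chunks once rounding up to 2^n makes each chunk larger. */
size_t chunk_count(size_t total, size_t chunk)
{
    assert(chunk > 0);
    return (total + chunk - 1) / chunk;
}
```

With dispatch_apply, iteration i would then work on the bytes from i * chunk up to the smaller of (i + 1) * chunk and total. Because the malloced block itself starts at a multiple of 0x1000, chunk starts that are multiples of 2^n relative to the base are also multiples of 2^n as absolute addresses, as long as n is at most 12.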