Re: dispatch_async Performance Issues

Andreas Grosam Wed, 29 Jun 2011 05:17:51 -0700

Hi Jonathan,

I'm happy that your guesses exactly match the results of my investigations. :)

Actually, the compiler (LLVM here) does an amazing job when optimizing. 
Although I'm happy with this, it makes such tests a pain, since often enough 
the compiler simply strips away the very code I wanted to test :)

So I modified the test code again in order to prevent the undesired effects of 
such optimizations. The test code now actually *uses* the future, so that the 
compiler can strip away neither the dereferencing nor the calculation of the 
sum in the inner loops.

After these modifications, more hints about the performance issue became 
apparent ...

The cause is definitely the way the C++ compiler generates code to handle 
possible C++ exceptions. As it happens, my Iterator had a non-trivial 
destructor (or more precisely, the buffer class which is a member of the 
Iterator had one): it releases a CFData object. Because of this, the compiler 
had to generate some extra code, which causes the performance difference.

After fixing the buffers list class and the iterator class, the test performs 
as expected. With C++ Exceptions enabled, USE_FUTURE defined, and all code 
fixed so that it actually uses the Future, the test runs as follows:

**** CFDataBuffers Benchmark ****
Using  Future: Yes
Data size: 131072KB, Buffer size: 8192, N = 16384, C = 2

[Classic]: Elapsed time: 473.539ms  
(Single threaded, naive implementation using an NSMutableData object 
constructed by appending many NSData buffers)

[ConcurrentProduceConsume1]: Elapsed time: 135.162ms   
(GCD, straightforward implementation, using pointers)

[ConcurrentProduceConsumeIter]: Elapsed time: 363.226ms    
(GCD, old and unmodified code, using C++ Iterator concept)

[ConcurrentProduceConsumeIter2]: Elapsed time: 189.002ms 
(Fixed Iterator and fixed buffers list)

As usual, take the numbers with a grain of salt. The "workload" of the consumer 
is minimal, and producing the buffers doesn't block, as would likely happen 
when downloading data.

When dealing with small amounts of data (24 KB), the GCD approach does not 
perform better. In this scenario, the classic approach is faster:

Data size: 24KB, Buffer size: 8192, N = 3, C = 2
[Classic]: Elapsed time: 0.0543303ms
[ConcurrentProduceConsumeIter2]: Elapsed time: 0.116253ms

Regards, and thank you again for your help!
Andreas

On Jun 28, 2011, at 9:37 PM, Jonathan Taylor wrote:

>> In the meantime however, I found one (surprising) cause of the performance 
>> issue. After making the versions *more* equivalent, the issue became 
>> apparent.  I restructured the second version (using the C++ iterators) and 
>> will discuss this in more detail. The culprit is in the consumer part as 
>> follows:
>> 
>> New restructured code:
>> 
>> [...]
>> 
>> The difference compared to the former code provided in the previous mail is 
>> now
>> 
>> 1) The C++ instances, that is the iterators, are defined locally within the 
>> block.
>> 
>> 2) The "Future" (that is the result of the operation) is conditionally 
>> compiled in or out, in order to test its impact.
>>    Here, the __block modifier is used for the "Future" variables "sum" and 
>> "total".  
>>    When using pointers within the block accessing the outside variables, the 
>> performance does not differ, but using __block may be more correct.
> 
> 
> Ah - now then! I will take a very strong guess as to what is happening there 
> (I've done it myself, and seen it done by plenty of others! [*]). In the case 
> where you do NOT define USE_FUTURE, your consumer thread as written in your 
> email does not make any use of the variables "sum_" and "total_". Hence the 
> compiler is entirely justified in optimizing out those variables entirely! It 
> will still have to check the iterator against eof, and may have to 
> dereference the iterator[**], but it does not need to update the "sum_" or 
> "total_" variables. 
> 
> It may well be that there is still a deeper performance issue with your 
> original code, and I'm happy to have another look at that when I have a 
> chance. I suggest you deal with this issue first, though, as it appears to be 
> introducing misleading discrepancies in the execution times you're using for 
> comparison.
> 
> As I say, it's quite a common issue when you start stripping down code with 
> the aim of doing minimal performance comparisons. The easiest solution is 
> either to printf the results at the end (which forces the compiler to 
> actually evaluate them!), or alternatively do the sort of thing you're doing 
> when USE_FUTURE is defined - writing to a shadow variable at the end. If you 
> declare your shadow variable as "volatile" then the compiler is forced to 
> write the result and is not permitted to optimize everything out.
> 
> Hope that helps, even if it may not deal with your original problem yet. 
> Apologies that my first round of guesses were wrong - I'm pretty sure about 
> this one though :)
> 
> Jonny
> 
> 
> [*] Completely unrelated to this thread, but see this rather extreme example 
> where the claimed performance had to be reduced by a factor of twelve due to 
> this problem! 
> http://www.ibm.com/developerworks/forums/thread.jspa?threadID=226415
> [**] I ~think~ ... because this involves a memory access, which is strictly 
> speaking a side effect in itself.
> 

