Re: dispatch_async Performance Issues

2011-06-29 Thread Andreas Grosam
Hi Jonathan,

I'm happy that your guesses exactly match the results of my investigations. :)

Actually, the compiler (LLVM here) is doing an amazing job when optimizing. 
Although I'm happy with this, it makes such tests a pain, since often enough 
the compiler simply strips away the code I wanted to test :)

So once again, I modified the test code to prevent the undesired effects of 
such optimizations: I now actually *use* the Future in the test code, so that 
the compiler strips away neither the dereferencing nor the summing in the 
inner loops.

After these modifications, more hints at performance issues became apparent ...

The cause is definitely the way the C++ compiler handles and generates code 
for possible C++ exceptions. As it happens, my iterator had a non-trivial 
destructor (or more precisely, the buffer class which is a member of the 
iterator does), namely it releases a CFData object. Because of this, the 
compiler had to generate some extra cleanup code, which causes the performance 
difference.

After fixing the buffers list class and the iterator class, the test performs 
as expected. With C++ Exceptions enabled, USE_FUTURE defined, and all code 
fixed so that it actually uses the Future, the test runs as follows:

 CFDataBuffers Benchmark 
Using  Future: Yes
Data size: 131072KB, Buffer size: 8192, N = 16384, C = 2


[Classic]: Elapsed time: 473.539ms  
(Single threaded, naive implementation using a NSMutableData object constructed 
by appending many NSData buffers)

[ConcurrentProduceConsume1]: Elapsed time: 135.162ms   
(GCD, straightforward implementation, using pointers)

[ConcurrentProduceConsumeIter]: Elapsed time: 363.226ms
(GCD, old and unmodified code, using C++ Iterator concept)

[ConcurrentProduceConsumeIter2]: Elapsed time: 189.002ms 
(Fixed Iterator and fixed buffers list)


As usual, take the numbers with a grain of salt. The "workload" of the 
consumer is minimal, and producing the buffers doesn't block as it likely 
would when downloading data.



When dealing with small amounts of data (25 KByte), the GCD approach does not 
perform better. In this scenario, the classic approach is faster:

Data size: 24KB, Buffer size: 8192, N = 3, C = 2
[Classic]: Elapsed time: 0.0543303ms
[ConcurrentProduceConsumeIter2]: Elapsed time: 0.116253ms



Regards, and thank you again for your help!
Andreas



On Jun 28, 2011, at 9:37 PM, Jonathan Taylor wrote:

>> In the meantime however, I found one (surprising) cause of the performance 
>> issue. After making the versions *more* equivalent, the issue became 
>> apparent. I restructured the second version (using the C++ iterators) and 
>> will discuss this in more detail. The culprit is in the consumer part as 
>> follows:
>> 
>> New restructured code:
>> 
>> [...]
>> 
>> The difference compared to the former code provided in the previous mail is 
>> now
>> 
>> 1) The C++ instances, that is the iterators, are defined locally within the 
>> block.
>> 
>> 2) The "Future" (that is, the result of the operation) is conditionally 
>> compiled in or out, in order to test its impact.
>>    Here, the __block modifier is used for the "Future" variables "sum" and 
>> "total".
>>    When using pointers within the block to access the outside variables, 
>> the performance does not differ, but using __block may be more correct.
> 
> 
> Ah - now then! I will take a very strong guess as to what is happening there 
> (I've done it myself, and seen it done by plenty of others! [*]). In the case 
> where you do NOT define USE_FUTURE, your consumer thread as written in your 
> email does not make any use of the variables "sum_" and "total_". Hence the 
> compiler is entirely justified in optimizing out those variables entirely! It 
> will still have to check the iterator against eof, and may have to 
> dereference the iterator[**], but it does not need to update the "sum_" or 
> "total_" variables. 
> 
> It may well be that there is still a deeper performance issue with your 
> original code, and I'm happy to have another look at that when I have a 
> chance. I suggest you deal with this issue first, though, as it appears to be 
> introducing misleading discrepancies in the execution times you're using for 
> comparison.
> 
> As I say, it's quite a common issue when you start stripping down code with 
> the aim of doing minimal performance comparisons. The easiest solution is 
> either to printf the results at the end (which forces the compiler to 
> actually evaluate them!), or alternatively do the sort of thing you're doing 
> when USE_FUTURE is defined - writing to a shadow variable at the end. If you 
> declare your shadow variable as "volatile" then the compiler is forced to 
> write the result and is not permitted to optimize everything out.
> 
> Hope that helps, even if it may not deal with your original problem yet. 
> Apologies that my first round of guesses were wrong - I'm pretty sure about 
> this one though :)

Re: dispatch_async Performance Issues

2011-06-28 Thread Jonathan Taylor
> In the meantime however, I found one (surprising) cause of the performance 
> issue. After making the versions *more* equivalent, the issue became 
> apparent. I restructured the second version (using the C++ iterators) and 
> will discuss this in more detail. The culprit is in the consumer part as 
> follows:
> 
> New restructured code:
> 
> [...]
> 
> The difference compared to the former code provided in the previous mail is 
> now
> 
> 1) The C++ instances, that is the iterators, are defined locally within the 
> block.
> 
> 2) The "Future" (that is, the result of the operation) is conditionally 
> compiled in or out, in order to test its impact.
> Here, the __block modifier is used for the "Future" variables "sum" and 
> "total".
> When using pointers within the block to access the outside variables, the 
> performance does not differ, but using __block may be more correct.


Ah - now then! I will take a very strong guess as to what is happening there 
(I've done it myself, and seen it done by plenty of others! [*]). In the case 
where you do NOT define USE_FUTURE, your consumer thread as written in your 
email does not make any use of the variables "sum_" and "total_". Hence the 
compiler is entirely justified in optimizing out those variables entirely! It 
will still have to check the iterator against eof, and may have to dereference 
the iterator[**], but it does not need to update the "sum_" or "total_" 
variables. 

It may well be that there is still a deeper performance issue with your 
original code, and I'm happy to have another look at that when I have a chance. 
I suggest you deal with this issue first, though, as it appears to be 
introducing misleading discrepancies in the execution times you're using for 
comparison.

As I say, it's quite a common issue when you start stripping down code with the 
aim of doing minimal performance comparisons. The easiest solution is either to 
printf the results at the end (which forces the compiler to actually evaluate 
them!), or alternatively do the sort of thing you're doing when USE_FUTURE is 
defined - writing to a shadow variable at the end. If you declare your shadow 
variable as "volatile" then the compiler is forced to write the result and is 
not permitted to optimize everything out.

Hope that helps, even if it may not deal with your original problem yet. 
Apologies that my first round of guesses were wrong - I'm pretty sure about 
this one though :)

Jonny


[**] Completely unrelated to this thread, but see this rather extreme example 
where the claimed performance had to be reduced by a factor of twelve due to 
this problem! 
http://www.ibm.com/developerworks/forums/thread.jspa?threadID=226415
[*] I ~think~ ... because this involves a memory access, which is strictly 
speaking a side effect in itself.

___

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
http://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com

This email sent to arch...@mail-archive.com


Re: dispatch_async Performance Issues

2011-06-28 Thread Andreas Grosam

On Jun 28, 2011, at 4:07 PM, Jonathan Taylor wrote:

> ... and more importantly I do not believe your code makes correct use of 
> dispatch_group_async. Your first call adds a producer block to the global 
> concurrent queue, which may start executing immediately. At some time soon 
> afterwards you add a consumer block to the global concurrent queue, which 
> again may start executing immediately - while the producer block is still in 
> the process of "producing". As far as I can see the code appears to have a 
> number of major issues:
> 
> 1. Theoretically your producer block may not have produced anything by the 
> time your consumer block executes

The consumer thread may block until a buffer is available. This happens in 
the member function consume() of the buffer list.

> 
> 2. What happens if your consumer block has processed all the currently-added 
> blocks before the producer block has added the eof marker?
The consumer block loops until a buffer is available, or until a timeout 
occurs. This happens in the member function consume().
An empty buffer signals EOF.


> 
> 3. You have only one single consumer block, which means you will only have a 
> single consumer "thread" running under Grand Central Dispatch, which rather 
> defeats the purpose of using it at all! I think you may not have fully 
> understood how GCD is intended to be used...
I know this. But there is no easy way around it, for this reason: the buffers 
represent a contiguous stream of data. The consumer thread is used to execute 
a parser. The parser cannot simply stop and then continue with a new chunk of 
data, since the parser's state is represented in the execution stack. So the 
parser may block when it iterates over the data buffers, and thus saves its 
state that way.

I know that this is not the optimal way to use GCD. But at least I get 
semaphores which do not trap into the kernel that often, and thread switching 
should be faster as well.

The original goals for this approach are:

- Reduce the memory footprint: the buffer list's maximum number of buffers is 
  limited.
- Increase overall performance by downloading and parsing simultaneously.
- Scalability: no need to save the stream to an intermediate file on disk 
  before parsing.
- Avoid copying data entirely.




> 
> 4. Are you sure that your mysterious and rather complicated-seeming 
> CFDataConstBuffers class
:) 
> is sufficiently threadsafe for your producer to be adding to it at the same 
> time as your consumer block is iterating through it? I am 99% sure that the 
> "explicit" consumer code you wrote in ConcurrentProduceConsume is not 
> threadsafe.
I hope it is. But there may be subtle caveats which can, in certain 
circumstances, reduce performance due to frequent context switches. I would 
avoid it if I knew IF and when this can happen ;)
  
The source is available (see the link in the previous mail). Would you mind 
taking a look at it? But be warned: this is C++ code.

Thanks in advance!

Andreas

> 
> Jonny.



Re: dispatch_async Performance Issues

2011-06-28 Thread Andreas Grosam

On Jun 28, 2011, at 3:53 PM, Jonathan Taylor wrote:


Hi Jonathan, 

Thank you very much for taking the time to look at it.

> Hi Andreas,
> 
> If I understand your post correctly, you are saying that you see a 
> performance drop of 3x when using an iterator in your inner loop as opposed 
> to using hand-written C code to do the iterating.

Yes, at least that was my former assumption.

In the meantime however, I found one (surprising) cause of the performance 
issue. After making the versions *more* equivalent, the issue became apparent. 
I restructured the second version (using the C++ iterators) and will discuss 
this in more detail. The culprit is in the consumer part as follows:

New restructured code:

...
#if defined (USE_FUTURE)
__block size_t sum = 0;
__block size_t total = 0;
#endif

dispatch_async(queue,
^{
    CFDataConstBuffers_iterator eof;
    CFDataConstBuffers_iterator iter(*buffersPtr);

    size_t sum_ = 0;
    size_t total_ = 0;

    while (iter != eof)
    {
        sum_ += *iter;
        ++iter;
        ++total_;
    }

#if defined (USE_FUTURE)
    sum = sum_;
    total = total_;
#endif
    semPtr->signal();
});
...


The difference compared to the former code provided in the previous mail is now

1) The C++ instances, that is the iterators, are defined locally within the 
block.

2) The "Future" (that is, the result of the operation) is conditionally 
compiled in or out, in order to test its impact.
   Here, the __block modifier is used for the "Future" variables "sum" and 
"total".
   When using pointers within the block to access the outside variables, the 
performance does not differ, but using __block may be more correct.


Note that I access the variables sum and total only once, at the end of the 
block. There is a reason for this, which I will explain below.


As mentioned, the conditional #if defined(USE_FUTURE) tests its impact on 
performance.

If USE_FUTURE is defined, the performance drops dramatically!
The same happens, if there are pointer variables which access variables defined 
outside the block.

The performance drops even more if I use the __block variables "sum" and 
"total" directly when incrementing, e.g.:

while (iter != eof)
{
    sum += *iter;
    ++iter;
    ++total;
}

Even when I access the Future variables "sum" and "total" only once in the 
block, the performance penalty is significant.


> Unfortunately you say you haven't actually posted the code relating to the 
> iterator... but it seems to me that this is very likely to be the source of 
> your problems!

OK, I've appended the source code. :)
These are three files, and a lot for one mail.
But I don't think this is the cause of the issue.

> 
> Your title states "dispatch_async Performance Issues", but have you actually 
> looked into whether you see the same problem in a single-threaded case that 
> does not use dispatch_async?
I haven't done *exactly* that, but I have a lot of other test cases (see 
below).
> 
> All I can suggest is that you examine the preprocessor output and even the 
> generated assembly code with the aim of spotting any significant differences 
> in the generated code in each case. It may well be that a function call is 
> being generated as part of the iterator code, or something like that. Shark 
> may help you pin down where the problem is, but you will probably need to 
> have some level of appreciation of assembly code to be able to fully 
> interpret the results for your optimized build. 

I examined the assembly and, from the looks of it, couldn't find any hints. 
Both versions looked quite similar. My guess is that it is some 
synchronization primitive like a spin lock.

After modifying the "direct implementation" and "C++ Iterator" versions, both 
now run similarly if no Future is used. If a Future is used, for some reason 
the runtime suffers much more for the C++ Iterator version than for the code 
that mimics its behavior.

But, considering my original goal, performance is a bit slow; that is, not 
faster than a very primitive implementation which uses one thread, allocates 
NSData buffers, and adds them to an NSMutableData object which is then 
processed (see the "Classic" bench). This is a bit disappointing. I guess the 
overhead for dispatch is still high. Here are some results of my benchmarks 
(note that in the single-threaded approaches the Future has no effect, since 
there is none):


 CFDataBuffers Benchmark 
Using  Future: No
Data size: 131072KB, Buffer size: 8192, N = 16384, C = 2
[SimpleProduceConsume1]: Elapsed time: 299.028ms
[SimpleProduceConsume2]: Elapsed time: 125.744ms

Re: dispatch_async Performance Issues

2011-06-28 Thread Jonathan Taylor
... and more importantly I do not believe your code makes correct use of 
dispatch_group_async. Your first call adds a producer block to the global 
concurrent queue, which may start executing immediately. At some time soon 
afterwards you add a consumer block to the global concurrent queue, which again 
may start executing immediately - while the producer block is still in the 
process of "producing". As far as I can see the code appears to have a number 
of major issues:

1. Theoretically your producer block may not have produced anything by the time 
your consumer block executes

2. What happens if your consumer block has processed all the currently-added 
blocks before the producer block has added the eof marker?

3. You have only one single consumer block, which means you will only have a 
single consumer "thread" running under Grand Central Dispatch, which rather 
defeats the purpose of using it at all! I think you may not have fully 
understood how GCD is intended to be used...

4. Are you sure that your mysterious and rather complicated-seeming 
CFDataConstBuffers class is sufficiently threadsafe for your producer to be 
adding to it at the same time as your consumer block is iterating through it? I 
am 99% sure that the "explicit" consumer code you wrote in 
ConcurrentProduceConsume is not threadsafe.

Jonny.



Re: dispatch_async Performance Issues

2011-06-28 Thread Jonathan Taylor
Hi Andreas,

If I understand your post correctly, you are saying that you see a performance 
drop of 3x when using an iterator in your inner loop as opposed to using 
hand-written C code to do the iterating. Unfortunately you say you haven't 
actually posted the code relating to the iterator... but it seems to me that 
this is very likely to be the source of your problems!

Your title states "dispatch_async Performance Issues", but have you actually 
looked into whether you see the same problem in a single-threaded case that 
does not use dispatch_async?

All I can suggest is that you examine the preprocessor output and even the 
generated assembly code with the aim of spotting any significant differences in 
the generated code in each case. It may well be that a function call is being 
generated as part of the iterator code, or something like that. Shark may help 
you pin down where the problem is, but you will probably need to have some 
level of appreciation of assembly code to be able to fully interpret the 
results for your optimized build. Unfortunately without the crucial part of 
your source code there's not a lot anyone else can do to help you in that 
respect...

Hope that helps a bit
Jonny



dispatch_async Performance Issues

2011-06-28 Thread Andreas Grosam
Hi All

I've implemented a typical consumer/producer pattern, using a container holding 
a FIFO list of buffers. This buffer list has functions to store and retrieve 
these buffers concurrently. Consumer and Producer are running in the global 
concurrent dispatch queue.

I'm experiencing huge differences in performance between two almost identical 
versions of the code. I've tried to figure out the cause with Instruments, but 
didn't find any concrete hints. The "xxx_block_invoke" of the consumer part of 
one version just takes considerably more time than the other.




The code basically works as follows:

The producer is simply creating a certain amount of buffers (NSData objects), 
initializing its content and storing them into the concurrent accessible buffer 
list. The producer part of the code is identical in both versions. The 
concurrently accessible buffers list is the same, too.

In the consumer part, the code simply retrieves buffers and processes them. 
The difference is in the processing of the received buffer content, that is, 
how the bytes of the buffer's content are accessed.
This tiny difference in code shouldn't make a huge difference in runtime, but 
there actually is a huge difference, wherever it comes from!

The first uses a "direct" implementation, with pointers accessing the content 
of the buffer.
The second implementation uses a C++ iterator concept. But when compiled with 
-O3, it should produce almost the same code as the former. So I would expect 
only minor differences in performance.

However, the difference is about a factor of 3! I have no idea what the cause 
of this huge difference is. The main part of the code is shown below. The 
source is actually C++ and uses just the dispatch library. Not every piece of 
the source is shown, but I can provide it if necessary.


Additional Info:

Synchronizing the access is achieved using dispatch_semaphore objects. 
Storing a buffer into the buffer list may block if the list has reached its 
maximum number of buffers. Retrieving a buffer may block if there is no buffer 
available.

After testing, the implementation appears to be correct.

Class CFDataConstBuffers is the type of the buffer list. It has two principal 
functions, consume() and produce(), which can be called concurrently.

consume() returns the next buffer in the list (FIFO). It may block until the 
buffer list has one available.

produce() puts back a buffer. It may block if the buffer list's capacity (max 
number of buffers) is reached.


Class CFDataConstBuffer, the buffer class, is basically a thin wrapper 
around a CFDataRef.

Class semaphore is a thin wrapper around a dispatch_semaphore_t.



Below are two functions whose runtime durations are measured. Note that the 
consumer part of the function ConcurrentProduceConsume() is written such that 
it mimics the code produced by the compiler in the second function, 
ConcurrentProduceConsumeIter(), which uses C++ iterators; hence it looks a bit 
more complex than necessary. The code for the iterator isn't shown here, 
though.

The buffer size was varied from 4KB to 32KB, incrementing in steps.
The buffer list's capacity (max number of held buffers) was varied from 1 to 8.
LLVM compiler, Xcode 4.0.2.

For no apparent reason ConcurrentProduceConsume() performs significantly faster 
(about 2.5x) than ConcurrentProduceConsumeIter().

Is there a hidden cost for C++ instances in a block, for C++ exception 
handlers, etc.?


Thanks for tips!



// Create a buffers instance with capacity C, then produce N buffers with 
// BUFFER_SIZE bytes and fill them. Concurrently consume the buffers and 
// iterate over the consumed buffer.
// Performs concurrently on two threads. 
//
void ConcurrentProduceConsume(size_t C = 1, size_t N = 100)
{
    typedef std::pair, bool> result_t;

    // Create a buffers instance with at max C buffers:
    CFDataConstBuffers buffers(C);
    CFDataConstBuffers* buffersPtr = &buffers;

    const size_t TOTAL = BUFFER_SIZE * N;

    // Get the global concurrent queue:
    dispatch_queue_t queue =
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    // Create a group in order to sync the two threads:
    dispatch_group_t group = dispatch_group_create();

    dispatch_group_async(group, queue,
    ^{
        unsigned long k = 0;
        UInt8 data[BUFFER_SIZE];
        for (int i = 0; i < N; ++i) {
            // fill the buffer:
            for (int j = 0; j < BUFFER_SIZE; ++j, ++k) {
                data[j] = char(k);
            }
            CFDataRef d = CFDataCreate(NULL, data, sizeof(data));
            CFDataConstBuffer buffer = d;
            CFRelease(d);
            buffersPtr->produce(buffer);
        }

        // put EOF:
        buffersPtr->produce(CFDataConstBuffer());
    });


    dispatch_group_async(group, queue,
    ^{
        const char* p;
        const char* back;