Re: dispatch_async Performance Issues
Hi Jonathan,

I'm happy that your guesses exactly match the results of my investigation. :) Actually, the compiler (LLVM here) does an amazing job when optimizing. Although I'm happy with this, it makes such tests a pain, since often enough the compiler simply strips away the very code I wanted to test. :) So again, I modified the test code in order to prevent the undesired effects of such optimizations: I now actually *use* the future in the test code, so that the compiler can strip away neither the dereferencing nor the calculation of the sum in the inner loops.

After these modifications, more hints at performance issues became apparent. The cause is definitely the way the C++ compiler handles and generates code for possible C++ exceptions. Accidentally, my iterator had a non-trivial destructor (or more precisely, the buffer class which is a member of the iterator has one): it releases a CFData object. Because of this, the compiler had to generate some extra cleanup code, which causes the performance difference. After fixing the buffers list class and the iterator class, the test performs as expected.

With C++ exceptions enabled, USE_FUTURE defined, and all code fixed so that it actually uses the future, the test runs as follows:

CFDataBuffers Benchmark
Using Future: Yes
Data size: 131072KB, Buffer size: 8192, N = 16384, C = 2

[Classic]:                       Elapsed time: 473.539ms
    (Single threaded, naive implementation using a NSMutableData object
    constructed by appending many NSData buffers)
[ConcurrentProduceConsume1]:     Elapsed time: 135.162ms
    (GCD, straightforward implementation, using pointers)
[ConcurrentProduceConsumeIter]:  Elapsed time: 363.226ms
    (GCD, old and unmodified code, using the C++ iterator concept)
[ConcurrentProduceConsumeIter2]: Elapsed time: 189.002ms
    (Fixed iterator and fixed buffers list)

As usual, take the numbers with a grain of salt.
The "workload" of the consumer is minimal, and producing the buffers doesn't block as it likely would when downloading data. When dealing with small amounts of data (25 KByte), the GCD approach does not perform better. In this scenario, the classic approach is faster:

Data size: 24KB, Buffer size: 8192, N = 3, C = 2

[Classic]:                       Elapsed time: 0.0543303ms
[ConcurrentProduceConsumeIter2]: Elapsed time: 0.116253ms

Regards, and thank you again for your help!

Andreas

On Jun 28, 2011, at 9:37 PM, Jonathan Taylor wrote:

>> In the meantime however, I found one (surprising) cause of the performance
>> issue. After making the versions *more* equivalent, the issue became
>> apparent. I restructured the second version (using the C++ iterators) and
>> will discuss this in more detail. The culprit is in the consumer part as
>> follows:
>>
>> New restructured code:
>>
>> [...]
>>
>> The differences compared to the code provided in the previous mail are:
>>
>> 1) The C++ instances, that is, the iterators, are defined locally within
>> the block.
>>
>> 2) The "Future" (that is, the result of the operation) is conditionally
>> compiled in or out, in order to test its impact. Here, the __block
>> modifier is used for the "Future" variables "sum" and "total". When using
>> pointers within the block to access the outside variables, the performance
>> does not differ, but using __block may be more correct.
>
> Ah - now then! I will take a very strong guess as to what is happening there
> (I've done it myself, and seen it done by plenty of others! [*]). In the
> case where you do NOT define USE_FUTURE, your consumer thread as written in
> your email does not make any use of the variables "sum_" and "total_".
> Hence the compiler is entirely justified in optimizing those variables out
> entirely! It will still have to check the iterator against eof, and may
> have to dereference the iterator [**], but it does not need to update the
> "sum_" or "total_" variables.
> It may well be that there is still a deeper performance issue with your
> original code, and I'm happy to have another look at that when I have a
> chance. I suggest you deal with this issue first, though, as it appears to
> be introducing misleading discrepancies in the execution times you're using
> for comparison.
>
> As I say, it's quite a common issue when you start stripping down code with
> the aim of doing minimal performance comparisons. The easiest solution is
> either to printf the results at the end (which forces the compiler to
> actually evaluate them!), or alternatively to do the sort of thing you're
> doing when USE_FUTURE is defined - writing to a shadow variable at the end.
> If you declare your shadow variable as "volatile" then the compiler is
> forced to write the result and is not permitted to optimize everything out.
>
> Hope that helps, even if it may not deal with your original problem yet.
> Apologies that my first round of guesses was wrong - I'm pretty sure about
> this one though :)
Re: dispatch_async Performance Issues
> In the meantime however, I found one (surprising) cause of the performance
> issue. After making the versions *more* equivalent, the issue became
> apparent. I restructured the second version (using the C++ iterators) and
> will discuss this in more detail. The culprit is in the consumer part as
> follows:
>
> New restructured code:
>
> [...]
>
> The differences compared to the code provided in the previous mail are:
>
> 1) The C++ instances, that is, the iterators, are defined locally within
> the block.
>
> 2) The "Future" (that is, the result of the operation) is conditionally
> compiled in or out, in order to test its impact. Here, the __block modifier
> is used for the "Future" variables "sum" and "total". When using pointers
> within the block to access the outside variables, the performance does not
> differ, but using __block may be more correct.

Ah - now then! I will take a very strong guess as to what is happening there (I've done it myself, and seen it done by plenty of others! [*]). In the case where you do NOT define USE_FUTURE, your consumer thread as written in your email does not make any use of the variables "sum_" and "total_". Hence the compiler is entirely justified in optimizing those variables out entirely! It will still have to check the iterator against eof, and may have to dereference the iterator [**], but it does not need to update the "sum_" or "total_" variables.

It may well be that there is still a deeper performance issue with your original code, and I'm happy to have another look at that when I have a chance. I suggest you deal with this issue first, though, as it appears to be introducing misleading discrepancies in the execution times you're using for comparison.

As I say, it's quite a common issue when you start stripping down code with the aim of doing minimal performance comparisons.
The easiest solution is either to printf the results at the end (which forces the compiler to actually evaluate them!), or alternatively to do the sort of thing you're doing when USE_FUTURE is defined - writing to a shadow variable at the end. If you declare your shadow variable as "volatile" then the compiler is forced to write the result and is not permitted to optimize everything out.

Hope that helps, even if it may not deal with your original problem yet. Apologies that my first round of guesses was wrong - I'm pretty sure about this one though :)

Jonny

[*] I ~think~ ... because this involves a memory access, which is strictly speaking a side effect in itself.

[**] Completely unrelated to this thread, but see this rather extreme example where the claimed performance had to be reduced by a factor of twelve due to this problem!
http://www.ibm.com/developerworks/forums/thread.jspa?threadID=226415

___
Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
http://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com
This email sent to arch...@mail-archive.com
Re: dispatch_async Performance Issues
On Jun 28, 2011, at 4:07 PM, Jonathan Taylor wrote:

> ... and more importantly I do not believe your code makes correct use of
> dispatch_group_async. Your first call adds a producer block to the global
> concurrent queue, which may start executing immediately. At some time soon
> afterwards you add a consumer block to the global concurrent queue, which
> again may start executing immediately - while the producer block is still
> in the process of "producing". As far as I can see the code appears to have
> a number of major issues:
>
> 1. Theoretically your producer block may not have produced anything by the
> time your consumer block executes

The consumer thread may block until a buffer is available. This happens in the member function consume() of the buffer list.

> 2. What happens if your consumer block has processed all the
> currently-added blocks before the producer block has added the eof marker?

The consumer block loops until a buffer is available, or until a timeout occurs. This happens in consume(). An empty buffer signals EOF.

> 3. You have only one single consumer block, which means you will only have
> a single consumer "thread" running under Grand Central Dispatch, which
> rather defeats the purpose of using it at all! I think you may not have
> fully understood how GCD is intended to be used...

I know this. But there is no easy way around it, for the following reason: the buffers represent a contiguous stream of data, and the consumer thread is used to execute a parser. The parser cannot simply stop and then continue with a new chunk of data - its state is represented in the execution stack. So the parser may block while it iterates over the data buffers, and thus saves its state that way. I know that this is not the optimal way to use GCD. But at least I get semaphores which do not trap into the kernel that often, and thread switching should be faster as well.

The original goals of this approach are:

- Reduce memory footprint (the buffer list's maximum number of buffers is limited).
- Increase overall performance by downloading and parsing data simultaneously.
- Scalability.
- No need to save the stream to an intermediate file on disk before parsing.
- Avoid copying data completely.

> 4. Are you sure that your mysterious and rather complicated-seeming
> CFDataConstBuffers class :) is sufficiently threadsafe for your producer to
> be adding to it at the same time as your consumer block is iterating
> through it? I am 99% sure that the "explicit" consumer code you wrote in
> ConcurrentProduceConsume is not threadsafe.

I hope it is. But there may be subtle caveats which can, in certain circumstances, reduce performance due to frequent context switches. I would avoid them if I knew IF and when this can happen ;) The source is available (see link in previous mail). Would you mind taking a look at it? But be warned: this is C++ code.

Thanks in advance!

Andreas

> Jonny.
Re: dispatch_async Performance Issues
On Jun 28, 2011, at 3:53 PM, Jonathan Taylor wrote:

Hi Jonathan,

Thank you very much for taking the time to look at it.

> Hi Andreas,
>
> If I understand your post correctly, you are saying that you see a
> performance drop of 3x when using an iterator in your inner loop as opposed
> to using hand-written C code to do the iterating.

Yes, at least that was my former assumption. In the meantime however, I found one (surprising) cause of the performance issue. After making the versions *more* equivalent, the issue became apparent. I restructured the second version (using the C++ iterators) and will discuss this in more detail. The culprit is in the consumer part, as follows.

New restructured code:

    ...
    #if defined (USE_FUTURE)
        __block size_t sum = 0;
        __block size_t total = 0;
    #endif
        dispatch_async(queue, ^{
            CFDataConstBuffers_iterator eof;
            CFDataConstBuffers_iterator iter(*buffersPtr);
            size_t sum_ = 0;
            size_t total_ = 0;
            while (iter != eof) {
                sum_ += *iter;
                ++iter;
                ++total_;
            }
    #if defined (USE_FUTURE)
            sum = sum_;
            total = total_;
    #endif
            semPtr->signal();
        });
    ...

The differences compared to the code provided in the previous mail are:

1) The C++ instances, that is, the iterators, are defined locally within the block.

2) The "Future" (that is, the result of the operation) is conditionally compiled in or out, in order to test its impact. Here, the __block modifier is used for the "Future" variables "sum" and "total". When using pointers within the block to access the outside variables, the performance does not differ, but using __block may be more correct.

Note that I access the variables sum and total only once, at the end of the block. This has a reason, which I will explain below. As mentioned, the conditional #if defined (USE_FUTURE) tests its impact on performance. If USE_FUTURE is defined, the performance drops dramatically! The same happens if there are pointer variables which access variables defined outside the block.
The performance drops even more if I use the __block variables "sum" and "total" directly when incrementing, e.g.:

    while (iter != eof) {
        sum += *iter;
        ++iter;
        ++total;
    }

Even when I access the future "sum" and "total" only once in the block, the performance penalty is significant.

> Unfortunately you say you haven't actually posted the code relating to the
> iterator... but it seems to me that this is very likely to be the source of
> your problems!

OK, I've appended the source code. :) These are three files, and a lot for this mail. But I guess this is not the cause of the issue.

> Your title states "dispatch_async Performance Issues", but have you
> actually looked into whether you see the same problem in a single-threaded
> case that does not use dispatch_async?

I haven't *exactly* done this one, but I have a lot of other test cases (see below).

> All I can suggest is that you examine the preprocessor output and even the
> generated assembly code with the aim of spotting any significant
> differences in the generated code in each case. It may well be that a
> function call is being generated as part of the iterator code, or something
> like that. Shark may help you pin down where the problem is, but you will
> probably need to have some level of appreciation of assembly code to be
> able to fully interpret the results for your optimized build.

I examined the assembly and, from the looks of it, I couldn't find any hints; both versions looked quite similar. My guess is that it is some synchronization primitive, like a spin lock. After modifying the "directly implemented" and "C++ iterator" versions, both now run similarly if no future is used. If a future is used, for some reason the runtime drops much more for the C++ iterator version than for the code that mimics its behavior.
But, considering my original goal, performance is a bit slow - that is, not faster than a very primitive implementation which uses one thread, allocates NSData buffers, and adds them to a NSMutableData object which is then processed (see the "Classic" bench). This is a bit disappointing. I guess the overhead for dispatch is still high.

Here are some results of my benchmarks (note that in the single-threaded approaches the future has no effect, since there is none):

CFDataBuffers Benchmark
Using Future: No
Data size: 131072KB, Buffer size: 8192, N = 16384, C = 2

[SimpleProduceConsume1]: Elapsed time: 299.028ms
[SimpleProduceConsume2]: Elapsed time: 125.744ms
Re: dispatch_async Performance Issues
... and more importantly I do not believe your code makes correct use of dispatch_group_async. Your first call adds a producer block to the global concurrent queue, which may start executing immediately. At some time soon afterwards you add a consumer block to the global concurrent queue, which again may start executing immediately - while the producer block is still in the process of "producing". As far as I can see the code appears to have a number of major issues:

1. Theoretically your producer block may not have produced anything by the time your consumer block executes.

2. What happens if your consumer block has processed all the currently-added blocks before the producer block has added the eof marker?

3. You have only one single consumer block, which means you will only have a single consumer "thread" running under Grand Central Dispatch, which rather defeats the purpose of using it at all! I think you may not have fully understood how GCD is intended to be used...

4. Are you sure that your mysterious and rather complicated-seeming CFDataConstBuffers class is sufficiently threadsafe for your producer to be adding to it at the same time as your consumer block is iterating through it? I am 99% sure that the "explicit" consumer code you wrote in ConcurrentProduceConsume is not threadsafe.

Jonny.
Re: dispatch_async Performance Issues
Hi Andreas,

If I understand your post correctly, you are saying that you see a performance drop of 3x when using an iterator in your inner loop as opposed to using hand-written C code to do the iterating. Unfortunately you say you haven't actually posted the code relating to the iterator... but it seems to me that this is very likely to be the source of your problems!

Your title states "dispatch_async Performance Issues", but have you actually looked into whether you see the same problem in a single-threaded case that does not use dispatch_async?

All I can suggest is that you examine the preprocessor output and even the generated assembly code with the aim of spotting any significant differences in the generated code in each case. It may well be that a function call is being generated as part of the iterator code, or something like that. Shark may help you pin down where the problem is, but you will probably need to have some level of appreciation of assembly code to be able to fully interpret the results for your optimized build. Unfortunately, without the crucial part of your source code there's not a lot anyone else can do to help you in that respect...

Hope that helps a bit,

Jonny
dispatch_async Performance Issues
Hi All,

I've implemented a typical consumer/producer pattern, using a container holding a FIFO list of buffers. This buffer list has functions to store and retrieve buffers concurrently. Consumer and producer are running in the global concurrent dispatch queue.

I'm experiencing huge differences in performance between two almost identical versions of the code. I've tried to figure out the cause with Instruments, but didn't find any concrete hints. The "xxx_block_invoke" of the consumer part of one version just takes considerably more time than the other.

The code basically works as follows: the producer simply creates a certain number of buffers (NSData objects), initializes their content, and stores them into the concurrently accessible buffer list. The producer part of the code is identical in both versions. The concurrently accessible buffer list is the same, too. In the consumer part, the code simply retrieves buffers and processes them. The difference lies in the processing of the received buffer content - that is, in how the bytes of the buffer's content are accessed. This tiny bit of difference in code shouldn't make a huge difference in runtime, but there actually is a huge one, wherever it comes from!

The first version uses a "direct" implementation, accessing the content of the buffer through pointers. The second implementation uses a C++ iterator concept. But when compiled with -O3, it should produce almost the same code as the former, so I would expect only minor differences in performance. However, the difference is about a factor of 3! I have no idea what the cause of this huge difference is.

The main part of the code is shown below. The source is actually C++ and uses just the dispatch library. Not every piece of source is shown, but I can provide it if necessary.

Additional info: synchronizing the access is achieved using dispatch_semaphore objects. Storing a buffer into the buffer list may block if the buffer list has reached its maximum number of buffers.
Retrieving a buffer may block if there is no buffer available. After testing, it seems the implementation is correct.

Class CFDataConstBuffers is the type of the buffer list. It has two principal functions, consume() and produce(), which can be called concurrently. consume() returns the next buffer in the list (FIFO); it may block until the buffer list has one available. produce() puts back a buffer; it may block if the buffer list's capacity (maximum number of buffers) is reached.

Class CFDataConstBuffer, the buffer class, is basically a thin wrapper around a CFDataRef. Class semaphore is a thin wrapper around a dispatch_semaphore_t.

Below are two functions whose runtime duration is measured. Note that the consumer part of the function ConcurrentProduceConsume() is written such that it mimics the code produced by the compiler in the second function, ConcurrentProduceConsumeIter(), which uses C++ iterators - hence it looks a bit more complex than necessary. The code for the iterator isn't shown here, though.

The buffer size was set from 4KB to 32KB, incrementing in steps. The buffer list's capacity (maximum number of held buffers) was set from 1 to 8. LLVM compiler, Xcode 4.02.

For no apparent reason, ConcurrentProduceConsume() performs significantly faster (about 2.5x) than ConcurrentProduceConsumeIter(). Is there a hidden cost for C++ instances in a block, for C++ exception handlers, etc.? Thanks for tips!

// Create a buffers instance with capacity C, then produce N buffers with
// BUFFER_SIZE bytes and fill them. Concurrently consume the buffers and
// iterate over the consumed buffer.
// Performs concurrently on two threads.
//
void ConcurrentProduceConsume(size_t C = 1, size_t N = 100)
{
    typedef std::pair, bool> result_t;

    // Create a buffers instance with at max C buffers:
    CFDataConstBuffers buffers(C);
    CFDataConstBuffers* buffersPtr = &buffers;

    const size_t TOTAL = BUFFER_SIZE * N;

    // Get the global concurrent queue:
    dispatch_queue_t queue =
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    // Create a group in order to sync the two threads:
    dispatch_group_t group = dispatch_group_create();

    dispatch_group_async(group, queue, ^{
        unsigned long k = 0;
        UInt8 data[BUFFER_SIZE];
        for (int i = 0; i < N; ++i) {
            // fill the buffer:
            for (int j = 0; j < BUFFER_SIZE; ++j, ++k) {
                data[j] = char(k);
            }
            CFDataRef d = CFDataCreate(NULL, data, sizeof(data));
            CFDataConstBuffer buffer = d;
            CFRelease(d);
            buffersPtr->produce(buffer);
        }
        // put EOF:
        buffersPtr->produce(CFDataConstBuffer());
    });

    dispatch_group_async(group, queue, ^{
        const char* p;
        const char* back;