Re: Mystery Threads
> On 1 Oct 2016, at 01:33, Quincey Morris wrote:
>
>> On Sep 30, 2016, at 02:57, Gerriet M. Denkmann wrote:
>>
>>> Any ideas where to look for a reason?
>
> The next step is probably to clarify the times between:
>
> a. Accumulated execution time — the amount of time your code actually spends executing in CPUs.
> b. Elapsed time in your process — the amount of time that’s accounted for by your process, whether executing or waiting.
> c. Elapsed time outside your process — the amount of time that’s accounted for by system code, also whether executing or waiting.

Time Profiler tells me that 99.9 % of the time is spent in Running and Blocked, each very roughly half of the time (or 2/3 to 1/3).

Running is almost 100 % in my function.
Blocked is in roughly equal parts: my function, mach_msg_trap (from RunLoop) and workq_kernreturn (from start_wq_thread).

There are some minor variations between 8 and 20,000 iterations, but nothing to explain a difference factor of 8.

My function reports its running time:

	start = [NSDate date];
	… dispatch_apply …
	time = -start.timeIntervalSinceNow;

which shows the same factor of 8 between 8 and 20,000 iterations.

> You can also play around with change isolation. Instead of changing two contextual conditions (the number of dispatch_apply calls, the number of iterations in a single block’s loop), change only one of them and observe the effect in Instruments.

Well, to compare, I want in all cases (independent of the number of iterations):
The iterations have disjoint working ranges, and the union of the working ranges of all iterations covers the whole bigArray.
The same number of operations (at the same indices) should be done on bigArray.
The operations in the working range should be done randomly (well: at least not sequentially).

One thing is quite clear: each iteration of my function accesses its working range sort of randomly. If one uses sequential access, then a very different behaviour emerges.
> You can also try out some other instruments speculatively. For example, is there a different pattern in the Allocations instrument, indicating that one form of your code is doing vast numbers of memory allocations for some (unknown) reason? Or is I/O being done, unexpectedly?

There are no allocations (except, at the start, one huge malloc of 400 MB). There is no I/O.

I tried the System Trace instrument and learned:

20,000 iterations (180 msec): the whole time there is zero-filling being done (very rarely a page fault).
8 iterations (1500 msec): for the first 100 msec there is zero-filling, then the 8 threads just keep slugging along. There are far fewer context switches (almost none after the zero-filling has ceased). But still, I cannot see any reason why this should take so much longer.

My hypothesis is: with a big number of iterations (each having a working range ≤ 500 KB), any 8 iterations running concurrently together use ≤ 4 MB, which might just fit into some cache. With 8 iterations (each using a working range of 50 MB) there is probably a lot of cache reloading going on. But I failed to see any proof of this hypothesis in Instruments.

Kind regards,

Gerriet.

___
Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com
This email sent to arch...@mail-archive.com
Re: Mystery Threads
On Sep 30, 2016, at 02:57, Gerriet M. Denkmann wrote:
>
> dispatch_apply(8,…):
> My function is running 3090 msec and blocked 970 ms.
> And with dispatch_apply(20,000,…):
> My function is running 196 msec and blocked 27 ms.

In a way, this is good news, because a difference that gross ought to be relatively easy to account for. Clearly there is nothing subtle about the difference between the two scenarios.

> Any ideas where to look for a reason?

The next step is probably to clarify the times between:

a. Accumulated execution time — the amount of time your code actually spends executing in CPUs.
b. Elapsed time in your process — the amount of time that’s accounted for by your process, whether executing or waiting.
c. Elapsed time outside your process — the amount of time that’s accounted for by system code, also whether executing or waiting.

You can also play around with change isolation. Instead of changing two contextual conditions (the number of dispatch_apply calls, the number of iterations in a single block’s loop), change only one of them and observe the effect in Instruments.

You can also try out some other instruments speculatively. For example, is there a different pattern in the Allocations instrument, indicating that one form of your code is doing vast numbers of memory allocations for some (unknown) reason? Or is I/O being done, unexpectedly?

Basically, at this point, you’re looking for a lever.
Re: Mystery Threads
> On 30 Sep 2016, at 10:57, Gerriet M. Denkmann wrote:
>
> But I just cannot see anything which forces my function to run 16 times longer in the first case.
>
> Any ideas where to look for a reason?

https://github.com/apple/swift-corelibs-libdispatch/blob/ab16f5e62859ff2f54996b8838f8304a8d125102/src/apply.c
Re: Mystery Threads
> On 29 Sep 2016, at 16:05, Roland King wrote:
>
>> On 29 Sep 2016, at 16:59, Gerriet M. Denkmann wrote:
>>
>> Well, I count this as (bigArray = 4 GB):
>> (a) one call of dispatch_apply which schedules 40,000 times a block to GCD, each handling 0.1 MB
>> (b) one call of dispatch_apply which schedules 8 times a block to GCD, each handling 500 MB
>>
>> Could be that these blocks sometimes collide (maybe when they are operating on adjacent areas), which slows them down. Such a collision is rather unlikely if only 8 of 40,000 are running.
>
> Why guess - this is exactly what Instruments is designed to tell you. It’s even dispatch-aware, so it can show you results broken down by dispatch queue and worker thread inside the dispatch queue. Run the two under Instruments and find out where all the time is spent.

Ok. So I did run the Time Profiler instrument (as suggested by Quincey):

dispatch_apply(8,…):
My function is running 3090 msec and blocked 970 ms.
Other blockings: 690 ms workq_kernreturn, 560 ms mach_msg_trap.

And with dispatch_apply(20,000,…):
My function is running 196 msec and blocked 27 ms.
21 ms workq_kernreturn, 34 ms mach_msg_trap.

But I just cannot see anything which forces my function to run 16 times longer in the first case.
Any ideas where to look for a reason?

Kind regards,

Gerriet.
Re: Mystery Threads
> On 29 Sep 2016, at 16:59, Gerriet M. Denkmann wrote:
>
>> On 29 Sep 2016, at 15:34, Quincey Morris wrote:
>>
>>> On Sep 29, 2016, at 01:05, Gerriet M. Denkmann wrote:
>>>
>>> Well, nothing. Just let’s call it nbrOfBlocksToBeUsedByDispatchApply, or whatever. But ultimately any of these things has to run on a CPU, of which there are no more than 8.
>>
>> Well, here’s my narrative. It may be fiction or non-fiction.
>>
>> You said you tried “nbrOf…” as a few ten-thousands, vs. 8. Let’s be concrete and call this (a) 40,000 vs. (b) 8. So, for each set of 40,000 iterations of your block, you’re doing 1 dispatch_apply in case #a, and 5,000 dispatch_apply calls in case #b. So, you’ve established that 4,999 dispatch_apply calls — and related per-dispatch_apply overhead — take a long time.
>
> Well, I count this as (bigArray = 4 GB):
> (a) one call of dispatch_apply which schedules 40,000 times a block to GCD, each handling 0.1 MB
> (b) one call of dispatch_apply which schedules 8 times a block to GCD, each handling 500 MB
>
> Could be that these blocks sometimes collide (maybe when they are operating on adjacent areas), which slows them down. Such a collision is rather unlikely if only 8 of 40,000 are running.
>
>> Of course, I’m relying on the fact that you’re doing the same number of *total* iterations of your inner loop in case #a and case #b. This is not quite the whole story, because there are loop setup overheads per block. However, the loop setup that you’ve shown is very simple — a couple of Int operations — so the additional 4,999 loop setup executions are likely dwarfed by 4,999 dispatch_apply executions.
>
> The actual story is: one outer loop (same in all cases) which sets up some parameters, then another loop which covers the area which is assigned to this block.
> In case (a) this area is small: 0.1 MB, whereas in case (b) it is large: 500 MB. Which seems to be in favour of case (b).

Why guess - this is exactly what Instruments is designed to tell you. It’s even dispatch-aware, so it can show you results broken down by dispatch queue and worker thread inside the dispatch queue. Run the two under Instruments and find out where all the time is spent.
Re: Mystery Threads
> On 29 Sep 2016, at 15:34, Quincey Morris wrote:
>
>> On Sep 29, 2016, at 01:05, Gerriet M. Denkmann wrote:
>>
>> Well, nothing. Just let’s call it nbrOfBlocksToBeUsedByDispatchApply, or whatever. But ultimately any of these things has to run on a CPU, of which there are no more than 8.
>
> Well, here’s my narrative. It may be fiction or non-fiction.
>
> You said you tried “nbrOf…” as a few ten-thousands, vs. 8. Let’s be concrete and call this (a) 40,000 vs. (b) 8. So, for each set of 40,000 iterations of your block, you’re doing 1 dispatch_apply in case #a, and 5,000 dispatch_apply calls in case #b. So, you’ve established that 4,999 dispatch_apply calls — and related per-dispatch_apply overhead — take a long time.

Well, I count this as (bigArray = 4 GB):
(a) one call of dispatch_apply which schedules 40,000 times a block to GCD, each handling 0.1 MB
(b) one call of dispatch_apply which schedules 8 times a block to GCD, each handling 500 MB

Could be that these blocks sometimes collide (maybe when they are operating on adjacent areas), which slows them down. Such a collision is rather unlikely if only 8 of 40,000 are running.

> Of course, I’m relying on the fact that you’re doing the same number of *total* iterations of your inner loop in case #a and case #b. This is not quite the whole story, because there are loop setup overheads per block. However, the loop setup that you’ve shown is very simple — a couple of Int operations — so the additional 4,999 loop setup executions are likely dwarfed by 4,999 dispatch_apply executions.

The actual story is: one outer loop (same in all cases) which sets up some parameters, then another loop which covers the area which is assigned to this block.
In case (a) this area is small: 0.1 MB, whereas in case (b) it is large: 500 MB. Which seems to be in favour of case (b).

Kind regards,

Gerriet.
Re: Mystery Threads
> On 29 Sep 2016, at 10:05, Gerriet M. Denkmann wrote:
>
>> On 29 Sep 2016, at 14:38, Quincey Morris wrote:
>>
>>> On Sep 29, 2016, at 00:15, Gerriet M. Denkmann wrote:
>>>
>>> dispatch_apply( nbrOfThreads, queue, ^void(size_t idx)
>>>
>>> As my computer has just 8 CPUs, I thought that using nbrOfThreads > 8 would be silly: adding overhead without gaining anything.
>>>
>>> Turns out this is quite wrong. One function (called threadHappyFunction) works more than 10 times faster using nbrOfThreads = a few ten-thousand (as compared to nbrOfThreads = 8).
>>>
>>> This is nice, but I would like to know why this can happen.
>>
>> What makes you think “nbrOfThreads” represents a number of threads?
>
> Well, nothing. Just let’s call it nbrOfBlocksToBeUsedByDispatchApply, or whatever. But ultimately any of these things has to run on a CPU, of which there are no more than 8.

There are more shared resources in your system than just execution units (e.g. memory bandwidth or, for non-linear accesses, latency). Maybe one of your blocks is bandwidth bound, while the other is compute bound? Your second function might be memory bound (with lots of read-modify-write traffic).

There are many other factors, such as caches or hyper-threading (and the dispatch_apply man page tells you that the appropriate number of invocations is very dependent on your block). The performance counters in Instruments may be able to guide you.

Daniel.
Re: Mystery Threads
On Sep 29, 2016, at 01:05, Gerriet M. Denkmann wrote:
>
> Well, nothing. Just let’s call it nbrOfBlocksToBeUsedByDispatchApply, or whatever. But ultimately any of these things has to run on a CPU, of which there are no more than 8.

Well, here’s my narrative. It may be fiction or non-fiction.

You said you tried “nbrOf…” as a few ten-thousands, vs. 8. Let’s be concrete and call this (a) 40,000 vs. (b) 8. So, for each set of 40,000 iterations of your block, you’re doing 1 dispatch_apply in case #a, and 5,000 dispatch_apply calls in case #b. So, you’ve established that 4,999 dispatch_apply calls — and related per-dispatch_apply overhead — take a long time.

Of course, I’m relying on the fact that you’re doing the same number of *total* iterations of your inner loop in case #a and case #b. This is not quite the whole story, because there are loop setup overheads per block. However, the loop setup that you’ve shown is very simple — a couple of Int operations — so the additional 4,999 loop setup executions are likely dwarfed by 4,999 dispatch_apply executions.

>> Isn’t this what Instruments is for?
>
> Might be. I already looked at Time Profiler, but failed to notice anything (maybe I was looking at the wrong things), and System Trace, but did not understand what to observe or what the instrument was telling me.

Unfortunately, I agree, Instruments is inscrutable initially, and has a nasty learning curve. You kind of have to persist, poking around till things start to make a little sense.

Since you want to know where time is being spent, Time Profiler sounds like the right place to start. One possible approach is to profile case #a and case #b, and compare the Instruments output. Since you know what the actual performance difference is (in general terms), you should be able to see that reflected in what Instruments tells you. That should give you some reference points.
Re: Mystery Threads
> On 29 Sep 2016, at 14:38, Quincey Morris wrote:
>
>> On Sep 29, 2016, at 00:15, Gerriet M. Denkmann wrote:
>>
>> dispatch_apply( nbrOfThreads, queue, ^void(size_t idx)
>>
>> As my computer has just 8 CPUs, I thought that using nbrOfThreads > 8 would be silly: adding overhead without gaining anything.
>>
>> Turns out this is quite wrong. One function (called threadHappyFunction) works more than 10 times faster using nbrOfThreads = a few ten-thousand (as compared to nbrOfThreads = 8).
>>
>> This is nice, but I would like to know why this can happen.
>
> What makes you think “nbrOfThreads” represents a number of threads?

Well, nothing. Just let’s call it nbrOfBlocksToBeUsedByDispatchApply, or whatever. But ultimately any of these things has to run on a CPU, of which there are no more than 8.

>> Why are the threads seemingly blocking each other?
>> What is going on here?
>> How can I investigate this?
>
> Isn’t this what Instruments is for?

Might be. I already looked at Time Profiler, but failed to notice anything (maybe I was looking at the wrong things), and System Trace, but did not understand what to observe or what the instrument was telling me.

Maybe some other instrument would help? What should I look for?

Kind regards,

Gerriet.
Re: Mystery Threads
On Sep 29, 2016, at 00:15, Gerriet M. Denkmann wrote:
>
> dispatch_apply( nbrOfThreads, queue, ^void(size_t idx)
>
> As my computer has just 8 CPUs, I thought that using nbrOfThreads > 8 would be silly: adding overhead without gaining anything.
>
> Turns out this is quite wrong. One function (called threadHappyFunction) works more than 10 times faster using nbrOfThreads = a few ten-thousand (as compared to nbrOfThreads = 8).
>
> This is nice, but I would like to know why this can happen.

What makes you think “nbrOfThreads” represents a number of threads?

> Why are the threads seemingly blocking each other?
> What is going on here?
> How can I investigate this?

Isn’t this what Instruments is for?
Re: Mystery Threads
My thoughts are general, not specific to Mac OS...

The idea that the best performance comes from threads = #CPUs is attractive, but will work only if the threads do not sleep and do not interfere with each other. A classic example is dividing up a complex calculation that runs without touching the disk, reading large arrays, or sharing variables.

To construct a case where more than #CPUs might help, consider a thread which sets up and reads from a network connection, then does something trivial and repeats. If you have threads = #CPUs, each thread will start and issue a network read. Then all the threads will wait for a reply and the CPU is idle. More threads might help. But they might overwhelm the network, so it might not be a good plan to have thousands.

Exactly the same applies if the thread reads or writes the disk. The thread might read the disk deliberately, or it might get a page fault. More threads might help, but the bottleneck is likely to be the disk, so the number of threads is almost irrelevant.

Where threads run simultaneously on different CPUs, for best performance you want to allow each CPU to work with data in its cache. Where one thread writes a memory location that is read by another thread, the CPUs are likely to have to flush their caches, and the multi-CPU performance becomes that of an uncached CPU, which can be terrible. The hardware issues here are complex and I haven't kept up with them; perhaps cache sharing has improved.

On 29 September 2016 at 08:15, Gerriet M. Denkmann wrote:

> I have a big array (like a few GB) which is operated upon by some functions.
> As these functions act purely locally, an obvious idea is:
>
>     - (void)someFunction
>     {
>         nbrOfThreads = ...
>         sizeOfBigArray = ... a few GB
>         stride = sizeOfBigArray / nbrOfThreads
>
>         dispatch_apply( nbrOfThreads, queue, ^void(size_t idx)
>         {
>             start = idx * stride
>             end = start + stride
>
>             index = start
>             while ( index < end )
>             {
>                 mask = ...
>                 bigArray[index] |= mask
>                 index += … something positive …
>             }
>         } )
>     }
>
> As my computer has just 8 CPUs, I thought that using nbrOfThreads > 8 would be silly: adding overhead without gaining anything.
>
> Turns out this is quite wrong. One function (called threadHappyFunction) works more than 10 times faster using nbrOfThreads = a few ten-thousand (as compared to nbrOfThreads = 8).
>
> This is nice, but I would like to know why this can happen.
>
> Another function does not like threads at all:
>
>     - (void)threadShyFunction
>     {
>         nbrOfThreads = ...
>         uint64_t *bigArrayAsLongs = (uint64_t *)bigArray
>         sizeOfBigArrayInLongs = ...
>         stride = sizeOfBigArrayInLongs / nbrOfThreads
>
>         uint64_t *template = ...
>         sizeOfTemplate = not more than a few dozen longs
>
>         dispatch_apply( nbrOfThreads, queue, ^void(size_t idx)
>         {
>             start = idx * stride
>             end = start + stride
>
>             offset = start
>             while ( offset + sizeOfTemplate < end )
>             {
>                 for ( i = 0 ..< sizeOfTemplate )
>                     bigArrayAsLongs[offset + i] |= template[i]
>                 offset += sizeOfTemplate
>             }
>         } )
>     }
>
> This works, but for nbrOfThreads > 1 it gets slower instead of faster. Up to a hundred times slower for moderately big nbrOfThreads.
> This really bothers me.
> Why are the threads seemingly blocking each other?
> What is going on here?
> How can I investigate this?
>
> Gerriet.
>
> P.S. macOS 10.12, Xcode 8, ObjC or Swift.
Mystery Threads
I have a big array (like a few GB) which is operated upon by some functions. As these functions act purely locally, an obvious idea is:

	- (void)someFunction
	{
		nbrOfThreads = ...
		sizeOfBigArray = ... a few GB
		stride = sizeOfBigArray / nbrOfThreads

		dispatch_apply( nbrOfThreads, queue, ^void(size_t idx)
		{
			start = idx * stride
			end = start + stride

			index = start
			while ( index < end )
			{
				mask = ...
				bigArray[index] |= mask
				index += … something positive …
			}
		} )
	}

As my computer has just 8 CPUs, I thought that using nbrOfThreads > 8 would be silly: adding overhead without gaining anything.

Turns out this is quite wrong. One function (called threadHappyFunction) works more than 10 times faster using nbrOfThreads = a few ten-thousand (as compared to nbrOfThreads = 8).

This is nice, but I would like to know why this can happen.

Another function does not like threads at all:

	- (void)threadShyFunction
	{
		nbrOfThreads = ...
		uint64_t *bigArrayAsLongs = (uint64_t *)bigArray
		sizeOfBigArrayInLongs = ...
		stride = sizeOfBigArrayInLongs / nbrOfThreads

		uint64_t *template = ...
		sizeOfTemplate = not more than a few dozen longs

		dispatch_apply( nbrOfThreads, queue, ^void(size_t idx)
		{
			start = idx * stride
			end = start + stride

			offset = start
			while ( offset + sizeOfTemplate < end )
			{
				for ( i = 0 ..< sizeOfTemplate )
					bigArrayAsLongs[offset + i] |= template[i]
				offset += sizeOfTemplate
			}
		} )
	}

This works, but for nbrOfThreads > 1 it gets slower instead of faster. Up to a hundred times slower for moderately big nbrOfThreads.
This really bothers me.
Why are the threads seemingly blocking each other?
What is going on here?
How can I investigate this?

Gerriet.

P.S. macOS 10.12, Xcode 8, ObjC or Swift.