Re: Mystery Threads

2016-10-01 Thread Gerriet M. Denkmann

> On 1 Oct 2016, at 01:33, Quincey Morris  
> wrote:
> 
> On Sep 30, 2016, at 02:57 , Gerriet M. Denkmann  wrote:
> 
>> Any ideas where to look for a reason?
> 
> The next step is probably to clarify the times between:
> 
> a. Accumulated execution time — the amount of time your code actually spends 
> executing in CPUs.
> 
> b. Elapsed time in your process — the amount of time that’s accounted for by 
> your process, whether executing or waiting.
> 
> c. Elapsed time outside your process — the amount of time that’s accounted 
> for by system code, also whether executing or waiting.

Time Profiler tells me that 99.9 % of the time is in Running and Blocked, each 
very roughly half of the total (or 2/3 to 1/3).
Running is almost 100 % in my function.
Blocked is split in roughly equal parts between my function, mach_msg_trap 
(from RunLoop), and workq_kernreturn (from start_wq_thread).

There are some minor variations between 8 and 20,000 iterations, but nothing to 
explain a factor-of-8 difference.

My function reports its running time like this:

NSDate *start = [NSDate date];
…
dispatch_apply( … );
NSTimeInterval time = -start.timeIntervalSinceNow;

which shows the same factor of 8 between 8 and 20,000 iterations.

> 
> You can also play around with change isolation. Instead of changing two 
> contextual conditions (the number of dispatch_apply calls, the number of 
> iterations in a single block’s loop), change only one of them and observe the 
> effect in Instruments.

Well, to compare, I want the following in all cases (independent of the number 
of iterations):
The iterations have disjoint working ranges, and together the working ranges of 
all iterations cover the whole bigArray.
The same number of operations (at the same indices) should be done on bigArray.
The operations within a working range should be done randomly (well: at least 
not sequentially).

One thing is quite clear: each iteration of my function accesses its working 
range sort of randomly.
If one uses sequential access, then a very different behaviour emerges.

> You can also try out some other instruments speculatively. For example, is 
> there a different pattern in the Allocations instrument, indicating that one 
> form of your code is doing vast numbers of memory allocations for some 
> (unknown) reason. Or is I/O being done, unexpectedly?

There are no allocations (except at the start one huge malloc of 400 MB).
There is no I/O.

Then I tried the System Trace instrument and learned:

2k iterations (180 msec):
the whole time there is zero-fill being done (very rarely a page-fault).

8 iterations (1500 msec):
the first 100 msec there is zero-filling, then the 8 threads just keep slugging 
along.
There are far fewer context switches (almost none after the zero-filling has 
ceased).

But still, I cannot see any reason why this should take so much longer.

My hypothesis is: with a big number of iterations (each having a working range 
≤ 500 KB ) any 8 iterations running concurrently use together ≤ 4 MB, which 
might just fit into some cache.

With 8 iterations (each using a working range of 50 MB) there probably is a lot 
of cache reloading going on.
But I failed to see any proof of this hypothesis in Instruments.


Kind regards,

Gerriet.


___

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com

This email sent to arch...@mail-archive.com

Re: Mystery Threads

2016-09-30 Thread Quincey Morris
On Sep 30, 2016, at 02:57 , Gerriet M. Denkmann  wrote:
> 
> dispatch_apply(8,…):
> My function is running 3090 msec and blocked 970 ms. 

> And with dispatch_apply(20,000,…):
> My function is running 196 msec and blocked 27 ms.

In a way, this is good news, because a difference that gross ought to be 
relatively easy to account for. Clearly there is nothing subtle about the 
difference between the two scenarios.

> Any ideas where to look for a reason?

The next step is probably to clarify the times between:

a. Accumulated execution time — the amount of time your code actually spends 
executing in CPUs.

b. Elapsed time in your process — the amount of time that’s accounted for by 
your process, whether executing or waiting.

c. Elapsed time outside your process — the amount of time that’s accounted for 
by system code, also whether executing or waiting.

You can also play around with change isolation. Instead of changing two 
contextual conditions (the number of dispatch_apply calls, the number of 
iterations in a single block’s loop), change only one of them and observe the 
effect in Instruments.

You can also try out some other instruments speculatively. For example, is 
there a different pattern in the Allocations instrument, indicating that one 
form of your code is doing vast numbers of memory allocations for some 
(unknown) reason. Or is I/O being done, unexpectedly?

Basically, at this point, you’re looking for a lever.


Re: Mystery Threads

2016-09-30 Thread Jonathan Mitchell

> On 30 Sep 2016, at 10:57, Gerriet M. Denkmann  wrote:
> 
> 
> 
> But I just cannot see anything which forces my function to run 16 times 
> longer in the first case.
> 
> Any ideas where to look for a reason?

https://github.com/apple/swift-corelibs-libdispatch/blob/ab16f5e62859ff2f54996b8838f8304a8d125102/src/apply.c



Re: Mystery Threads

2016-09-30 Thread Gerriet M. Denkmann

> On 29 Sep 2016, at 16:05, Roland King  wrote:
> 
> 
>> On 29 Sep 2016, at 16:59, Gerriet M. Denkmann  wrote:
>> 
>> 
>>> On 29 Sep 2016, at 15:34, Quincey Morris 
>>>  wrote:
>>> 
>> 
>> Well, I count this as (bigArea = 4 GB):
>> (a) one call of dispatch_apply which schedules 40 000 times a block to GCD 
>> which handles 0.1 MB
>> (b) one call of dispatch_apply which schedules 8 times a block to GCD which 
>> handles 500 MB
>> 
>> Could be that these blocks sometimes collide (maybe when they are operating 
>> on adjacent areas), which slows them down. Such a collision is rather 
>> unlikely if only 8 of 40 000 are running.
> 
> Why guess - this is exactly what Instruments is designed to tell you. It’s 
> even dispatch-aware so it can show you results broken down by dispatch queue 
> and worker thread inside the dispatch queue. Run the two under instruments 
> and find out where all the time is spent. 

Ok. So I did run the Time Profiler Instrument (as suggested by Quincey):

dispatch_apply(8,…):
My function is running 3090 msec and blocked 970 ms. 
Other blockings: 
690 ms workq_kernreturn,  
560 ms mach_msg_trap.

And with dispatch_apply(20,000,…):
My function is running 196 msec and blocked 27 ms.
21 ms workq_kernreturn,  
34 ms mach_msg_trap.

But I just cannot see anything which forces my function to run 16 times longer 
in the first case.

Any ideas where to look for a reason?

Kind regards,

Gerriet.




Re: Mystery Threads

2016-09-29 Thread Roland King

> On 29 Sep 2016, at 16:59, Gerriet M. Denkmann  wrote:
> 
> 
>> On 29 Sep 2016, at 15:34, Quincey Morris 
>>  wrote:
>> 
>> On Sep 29, 2016, at 01:05 , Gerriet M. Denkmann  wrote:
>>> 
>>> Well, nothing. Just let’s call it nbrOfBlocksToBeUsedByDispatchApply, or 
>>> whatever. But ultimately any of these things has to run on a CPU, of which 
>>> there are no more than 8.
>> 
>> Well, here’s my narrative. It may be fiction or non-fiction.
>> 
>> You said you tried “nbrOf…” as a few ten-thousands, vs. 8. Let’s be concrete 
>> and call this (a) 40,000 vs. (b) 8. So, for each set of 40,000 iterations of 
>> your block, you’re doing 1 dispatch_apply in case #a, and 5,000 
>> dispatch_apply calls in case #b. So, you’ve established that 4,999 
> dispatch_apply calls — and related per-dispatch_apply overhead — take a long 
>> time.
> 
> Well, I count this as (bigArea = 4 GB):
> (a) one call of dispatch_apply which schedules 40 000 times a block to GCD 
> which handles 0.1 MB
> (b) one call of dispatch_apply which schedules 8 times a block to GCD which 
> handles 500 MB
> 
> Could be that these blocks sometimes collide (maybe when they are operating 
> on adjacent areas), which slows them down. Such a collision is rather 
> unlikely if only 8 of 40 000 are running.
> 
> 
>> Of course, I’m relying on the fact that you’re doing the same number of 
>> *total* iterations of your inner loop in case #a and case #b. This is not 
>> quite the whole story, because there are loop setup overheads per block. 
>> However, the loop setup that you’ve shown is very simple — a couple of Int 
>> operations — so the additional 4,999 loop setup executions are likely 
>> dwarfed by 4,999 dispatch_apply executions.
> 
> The actual story is: one outer loop (same in all cases) which sets up some 
> parameters, then another loop which covers the area which is assigned to this 
> block.
> In case (a) this area is small: 0.1 MB, whereas in case (b) it is large: 500 
> MB. Which seems to be in favour of case (b).
> 
> 

Why guess - this is exactly what Instruments is designed to tell you. It’s even 
dispatch-aware so it can show you results broken down by dispatch queue and 
worker thread inside the dispatch queue. Run the two under instruments and find 
out where all the time is spent. 



Re: Mystery Threads

2016-09-29 Thread Gerriet M. Denkmann

> On 29 Sep 2016, at 15:34, Quincey Morris 
>  wrote:
> 
> On Sep 29, 2016, at 01:05 , Gerriet M. Denkmann  wrote:
>> 
>> Well, nothing. Just let’s call it nbrOfBlocksToBeUsedByDispatchApply, or 
>> whatever. But ultimately any of these things has to run on a CPU, of which 
>> there are no more than 8.
> 
> Well, here’s my narrative. It may be fiction or non-fiction.
> 
> You said you tried “nbrOf…” as a few ten-thousands, vs. 8. Let’s be concrete 
> and call this (a) 40,000 vs. (b) 8. So, for each set of 40,000 iterations of 
> your block, you’re doing 1 dispatch_apply in case #a, and 5,000 
> dispatch_apply calls in case #b. So, you’ve established that 4,999 
> dispatch_apply calls — and related per-dispatch_apply overhead — take a long 
> time.

Well, I count this as (bigArea = 4 GB):
(a) one call of dispatch_apply which schedules 40 000 times a block to GCD which 
handles 0.1 MB
(b) one call of dispatch_apply which schedules 8 times a block to GCD which 
handles 500 MB

Could be that these blocks sometimes collide (maybe when they are operating on 
adjacent areas), which slows them down. Such a collision is rather unlikely if 
only 8 of 40 000 are running.


> Of course, I’m relying on the fact that you’re doing the same number of 
> *total* iterations of your inner loop in case #a and case #b. This is not 
> quite the whole story, because there are loop setup overheads per block. 
> However, the loop setup that you’ve shown is very simple — a couple of Int 
> operations — so the additional 4,999 loop setup executions are likely dwarfed 
> by 4,999 dispatch_apply executions.

The actual story is: one outer loop (same in all cases) which sets up some 
parameters, then another loop which covers the area which is assigned to this 
block.
In case (a) this area is small: 0.1 MB, whereas in case (b) it is large: 500 
MB. Which seems to be in favour of case (b).


Kind regards,

Gerriet.



Re: Mystery Threads

2016-09-29 Thread Daniel Vollmer

> On 29 Sep 2016, at 10:05, Gerriet M. Denkmann  wrote:
> 
> 
>> On 29 Sep 2016, at 14:38, Quincey Morris 
>>  wrote:
>> 
>> On Sep 29, 2016, at 00:15 , Gerriet M. Denkmann  wrote:
>>> 
>>> dispatch_apply( nbrOfThreads, queue, ^void(size_t idx)
>>> 
>>> As my computer has just 8 CPUs, I thought that using nbrOfThreads > 8 would 
>>> be silly: adding overhead without gaining anything.
>>> 
>>> Turns out this is quite wrong. One function (called threadHappyFunction) 
>>> works more than 10 times faster using nbrOfThreads = a few ten-thousand (as 
>>> compared to nbrOfThreads = 8).
>>> 
>>> This is nice, but I would like to know why this can happen.
>> 
>> What makes you think “nbrOfThreads” represents a number of threads?
> 
> Well, nothing. Just let’s call it nbrOfBlocksToBeUsedByDispatchApply, or 
> whatever. But ultimately any of these things has to run on a CPU, of which 
> there are no more than 8.

There are more shared resources than just execution units in your system (e.g. 
memory bandwidth, or latency for non-linear accesses). Maybe one of your blocks 
is bandwidth-bound, while the other is compute-bound? Your second function 
might be memory-bound (with lots of read-modify-write traffic). There are many 
other factors (and the dispatch_apply man page tells you that the right number 
of invocations is very dependent on your block), such as caches or 
hyper-threading. The performance counters in Instruments may be able to guide 
you.

Daniel.



Re: Mystery Threads

2016-09-29 Thread Quincey Morris
On Sep 29, 2016, at 01:05 , Gerriet M. Denkmann  wrote:
> 
> Well, nothing. Just let’s call it nbrOfBlocksToBeUsedByDispatchApply, or 
> whatever. But ultimately any of these things has to run on a CPU, of which 
> there are no more than 8.

Well, here’s my narrative. It may be fiction or non-fiction.

You said you tried “nbrOf…” as a few ten-thousands, vs. 8. Let’s be concrete 
and call this (a) 40,000 vs. (b) 8. So, for each set of 40,000 iterations of 
your block, you’re doing 1 dispatch_apply in case #a, and 5,000 dispatch_apply 
calls in case #b. So, you’ve established that 4,999 dispatch_apply calls — and 
related per-dispatch_apply overhead — take a long time.

Of course, I’m relying on the fact that you’re doing the same number of *total* 
iterations of your inner loop in case #a and case #b. This is not quite the 
whole story, because there are loop setup overheads per block. However, the 
loop setup that you’ve shown is very simple — a couple of Int operations — so 
the additional 4,999 loop setup executions are likely dwarfed by 4,999 
dispatch_apply executions.

>> Isn’t this what Instruments is for?
> 
> Might be. I already looked at Time Profiler, but failed to notice anything 
> (maybe was looking at the wrong things) and System Trace, but did not 
> understand what to observe or what the Instrument was telling me.

Unfortunately, I agree, Instruments is inscrutable initially, and has a nasty 
learning curve. You kind of have to persist, poking around till things start to 
make a little sense. Since you want to know where time is being spent, Time 
Profiler sounds like the right place to start.

One possible approach is to profile case #a and case #b, and compare the 
Instruments output. Since you know what the actual performance difference is 
(in general terms), you should be able to see that reflected in what 
Instruments tells you. That should give you some reference points.


Re: Mystery Threads

2016-09-29 Thread Gerriet M. Denkmann

> On 29 Sep 2016, at 14:38, Quincey Morris 
>  wrote:
> 
> On Sep 29, 2016, at 00:15 , Gerriet M. Denkmann  wrote:
>> 
>>  dispatch_apply( nbrOfThreads, queue, ^void(size_t idx)
>> 
>> As my computer has just 8 CPUs, I thought that using nbrOfThreads > 8 would 
>> be silly: adding overhead without gaining anything.
>> 
>> Turns out this is quite wrong. One function (called threadHappyFunction) 
>> works more than 10 times faster using nbrOfThreads = a few ten-thousand (as 
>> compared to nbrOfThreads = 8).
>> 
>> This is nice, but I would like to know why this can happen.
> 
> What makes you think “nbrOfThreads” represents a number of threads?

Well, nothing. Just let’s call it nbrOfBlocksToBeUsedByDispatchApply, or 
whatever. But ultimately any of these things has to run on a CPU, of which 
there are no more than 8.

> 
>> Why are the threads seemingly blocking each other?
>> What is going on here?
>> How can I investigate this?
> 
> Isn’t this what Instruments is for?

Might be. I already looked at Time Profiler, but failed to notice anything 
(maybe was looking at the wrong things) and System Trace, but did not 
understand what to observe or what the Instrument was telling me.
Maybe some other Instrument would help?
What should I look for?


Kind regards,

Gerriet.



Re: Mystery Threads

2016-09-29 Thread Quincey Morris
On Sep 29, 2016, at 00:15 , Gerriet M. Denkmann  wrote:
> 
>   dispatch_apply( nbrOfThreads, queue, ^void(size_t idx)
> 
> As my computer has just 8 CPUs, I thought that using nbrOfThreads > 8 would 
> be silly: adding overhead without gaining anything.
> 
> Turns out this is quite wrong. One function (called threadHappyFunction) 
> works more than 10 times faster using nbrOfThreads = a few ten-thousand (as 
> compared to nbrOfThreads = 8).
> 
> This is nice, but I would like to know why this can happen.

What makes you think “nbrOfThreads” represents a number of threads?

> Why are the threads seemingly blocking each other?
> What is going on here?
> How can I investigate this?

Isn’t this what Instruments is for?


Re: Mystery Threads

2016-09-29 Thread Aandi Inston
My thoughts are general, not specific to Mac OS... The idea that the best
performance comes from threads = #CPUs is attractive, but will work only if
the threads do not sleep and do not interfere with each other. A classic
example is dividing up a complex calculation that runs without touching the
disk, reading large arrays, or sharing variables.

To construct a case where more than #CPUs might help, consider a thread
which sets up and reads from a network connection, then does something
trivial and repeats. If you have threads=#CPUs, each thread will start and
issue a network read. Then all the threads will wait for a reply and the
CPU is idle. More threads might help. But they might overwhelm the network,
so it might not be a good plan to have thousands. Exactly the same applies
if the thread reads or writes the disk. The thread might read the disk
deliberately, or it might get a page fault. More threads might help, but
the bottleneck is likely to be the disk, so the number of threads is almost
irrelevant.

Where threads run simultaneously on different CPUs, you want for best
performance to allow each CPU to work with data in its cache. Where one
thread writes a memory location that is read by another thread, the CPUs
are likely to have to flush their cache, and the multi-CPU performance
becomes that of an uncached CPU, which can be terrible. The hardware issues
here are complex and I haven't kept up with them; perhaps cache sharing has
improved.

On 29 September 2016 at 08:15, Gerriet M. Denkmann  wrote:

> I have a big array (like a few GB) which is operated upon by some
> functions.
> As these functions act purely local, an obvious idea is:
>
> - (void)someFunction
> {
> nbrOfThreads = ...
> sizeOfBigArray = ...  a few GB
> stride = sizeOfBigArray / nbrOfThreads
>
> dispatch_apply( nbrOfThreads, queue, ^void(size_t idx)
> {
> start = idx * stride
> end = start + stride
>
> index = start
> while ( index < end )
> {
> mask = ...
> bigArray[index] |= mask
> index += … something positive…
> }
> }
> )
> }
>
> As my computer has just 8 CPUs, I thought that using nbrOfThreads > 8
> would be silly: adding overhead without gaining anything.
>
> Turns out this is quite wrong. One function (called threadHappyFunction)
> works more than 10 times faster using nbrOfThreads = a few ten-thousand (as
> compared to nbrOfThreads = 8).
>
> This is nice, but I would like to know why this can happen.
>
> Another function does not like threads at all:
>
> - (void)threadShyFunction
> {
> nbrOfThreads = ...
> uint64_t *bigArrayAsLongs = (uint64_t *)bigArray
> sizeOfBigArrayInLongs = ...
> stride = sizeOfBigArrayInLongs / nbrOfThreads
>
> uint64_t *template = ...
> sizeOfTemplate = not more than a few dozen longs
>
> dispatch_apply( nbrOfThreads, queue, ^void(size_t idx)
> {
> start = idx * stride
> end = start + stride
>
> offset = start
>
> while ( offset + sizeOfTemplate < end )
> {
> for ( i = 0 ..< sizeOfTemplate )
> bigArrayAsLongs[offset + i] |= template[i]
> offset += sizeOfTemplate
> }
> }
> )
> }
>
> This works, but for nbrOfThreads > 1 it gets slower instead of faster. Up
> to a hundred times slower for a moderately big nbrOfThreads.
> This really bothers me.
> Why are the threads seemingly blocking each other?
> What is going on here?
> How can I investigate this?
>
> Gerriet.
>
> P.S. macOS 12, Xcode 8, ObjC or Swift.
>
>