> On Sep 3, 2017, at 10:26 AM, Chris Lattner via swift-evolution 
> <swift-evolution@swift.org> wrote:
> 
> On Sep 2, 2017, at 11:09 PM, Pierre Habouzit <phabou...@apple.com 
> <mailto:phabou...@apple.com>> wrote:
>>> On Sep 2, 2017, at 12:19 PM, Pierre Habouzit <pie...@habouzit.net 
>>> <mailto:pie...@habouzit.net>> wrote:
>>>>>> What do you mean by this?
>>>>> 
>>>>> My understanding is that GCD doesn’t currently scale to 1M concurrent 
>>>>> queues / tasks.
>>>> 
>>>> It completely does provided these 1M queues / tasks are organized on 
>>>> several well known independent contexts.
>>> 
>>> Ok, I stand corrected.  My understanding was that you could run into 
>>> situations where you get stack explosions, fragment your VM and run out of 
>>> space, but perhaps that is a relic of 32-bit systems.
>> 
>> a queue on 64bit systems is 128 bytes (nowadays). Provided you have that 
>> amount of VM available to you (1M queues is 128M after all) then you're good.
>> If a large amount of them fragments the VM beyond this, it's a malloc/VM bug 
>> on 64-bit systems that are supposed to have enough address space.
> 
> Right, I was referring to the fragmentation you get by having a large number 
> of 2M stacks allocated for each kernel thread.  I recognize the queues 
> themselves are small.

Yes, this is still a problem: if every single queue needs to make a thread at 
the same time, we've lost. Whatever abstraction you pick on top, it'll be true: 
physical OS threads (pthreads or workqueue threads) are inherently expensive 
and don't scale well.


>> What doesn't scale is asking for threads, not having queues.
> 
> Right, agreed.
> 
>>> Agreed, to be clear, I have no objection to building actors on top of 
>>> (perhaps enhanced) GCD queues.  In fact I *hope* that this can work, since 
>>> it leads to a naturally more incremental path forward, which is therefore 
>>> much more likely to actually happen.
>> 
>> Good :)
> 
> I meant to be pretty clear about that all along, but perhaps I missed the 
> mark.  In any case, I’ve significantly revised the “scalable runtime” section 
> of the doc to reflect some of this discussion, please let me know what you 
> think:
> https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782#scalable-runtime
I'm much happier with this new wording, which is IMO a very fair assessment. 
Thanks for rewording it.

> 
>> 
>>>> My currently not very well formed opinion on this subject is that GCD 
>>>> queues are just what you need with these possibilities:
>>>> - this Actor queue can be targeted to other queues by the developer when 
>>>> he means for these actor to be executed in an existing execution context / 
>>>> locking domain,
>>>> - we disallow Actors to be directly targeted to GCD global concurrent 
>>>> queues ever
>>>> - for the other ones we create a new abstraction with stronger and better 
>>>> guarantees (typically limiting the number of possible threads servicing 
>>>> actors to a low number, not greater than NCPU).
>>> 
>>> Is there a specific important use case for being able to target an actor to 
>>> an existing queue?  Are you looking for advanced patterns where multiple 
>>> actors (each providing disjoint mutable state) share an underlying queue? 
>>> Would this be for performance reasons, for compatibility with existing 
>>> code, or something else?
>> 
>> Mostly for interaction with current designs where being on a given bottom 
>> serial queue gives you the locking context for resources naturally attached 
>> to it.
> 
> Ok.  I don’t understand the use-case well enough to know how we should model 
> this.  For example, is it important for an actor to be able to change its 
> queue dynamically as it goes (something that sounds really scary to me) or 
> can the “queue to use” be specified at actor initialization time?

I think I need to read more on actors, because the same way you're not an OS 
runtime expert, I'm not a language expert at all (or rather no longer; I 
started down that path a lifetime ago), and I feel like I need to understand 
your world better to try to explain this part better to you.
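A minimal sketch of the "targeting" being discussed, using today's dispatch API 
(the queue names here are made up): two "actors" share one locking context by 
targeting a common bottom serial queue, so their work is serialized in a single 
exclusion domain.

```swift
import Dispatch

// Hypothetical names: one bottom serial queue acts as the locking context,
// and two "actor" queues target it, sharing that exclusion domain.
let databaseContext = DispatchQueue(label: "com.example.database")
let actorA = DispatchQueue(label: "com.example.actorA", target: databaseContext)
let actorB = DispatchQueue(label: "com.example.actorB", target: databaseContext)

var log: [String] = []          // protected by the shared bottom queue
let group = DispatchGroup()
for (queue, name) in [(actorA, "A"), (actorB, "B")] {
    // Both blocks ultimately run serially on the database context,
    // so mutating `log` here is safe without extra locking.
    queue.async(group: group) { log.append(name) }
}
group.wait()
```

Because both actors bottom out on the same serial queue, their bodies can never 
run concurrently, which is exactly the "locking context for resources naturally 
attached to it" idea.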

> 
>>> One plausible way to model this is to say that it is a “multithreaded 
>>> actor” of some sort, where the innards of the actor allow arbitrary number 
>>> of client threads to call into it concurrently.  The onus would be on the 
>>> implementor of the NIC or database to implement the proper synchronization 
>>> on the mutable state within the actor.
>> 
>> I think what you said made sense.
> 
> Ok, I captured this in yet-another speculative section:
> https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782#intra-actor-concurrency
Great. BTW I agree 100% with:

That said, this is definitely a power-user feature, and we should understand, 
build, and get experience using the basic system before considering adding 
something like this.

Private concurrent queues are not a success in dispatch and cause several 
issues: these queues are second-class citizens in GCD in terms of the features 
they support, and building something with concurrency *within* is hard. I would 
keep it as "that's where we'll go some day" but not try to attempt it until 
we've built the simpler (or rather less hard) purely serial case first.

> 
> 
>> But it wasn't what I meant. I was really thinking at sqlite where the 
>> database is strongly serial (you can't use it in a multi-threaded way well, 
>> or rather you can but it has a big lock inside). It is much better to 
>> interact with that dude on the same exclusion context all the time. What I 
>> meant is really having some actors that have a "strong affinity" with a 
>> given execution context which eases the task of the actor scheduler.
> 
> Ah ok.  Yes, I think that wrapping a “subsystem with a big lock” in an actor 
> is a very natural thing to do, just as much as it makes sense to wrap a 
> non-threadsafe API in an actor.  Any internal locking would be subsumed by 
> the outer actor queue, but that’s ok - all the lock acquires would be 
> uncontended and fast :)
> 
>> Another problem I haven't touched either is kernel-issued events (inbound 
>> IPC from other processes, networking events, etc...). Dispatch for the 
>> longest time used an indirection through a manager thread for all such 
>> events, and that had two major issues:
>> 
>> - the thread hops it caused, causing networking workloads to utilize up to 
>> 15-20% more CPU time than an equivalent manually made pthread parked in 
>> kevent(), because networking pace even when busy idles back all the time as 
>> far as the CPU is concerned, so dispatch queues never stay hot, and the 
>> context switch is not only a scheduled context switch but also has the cost 
>> of a thread bring up
>> 
>> - if you deliver all possible events this way you also deliver events that 
>> cannot possibly make progress because the execution context that will handle 
>> them is already "locked" (as in busy running something else).
>> 
>> It took us several years to get to the point we presented at WWDC this year 
>> where we deliver events directly to the right dispatch queue. If you only 
>> have very anonymous execution contexts then all this machinery is wasted and 
>> unused. However, this machinery has been evaluated and saves full percents 
>> of CPU load system-wide. I'd hate for us to go back 5 years here.
> 
> I don’t have anything intelligent to say here, but it sounds like you 
> understand the issues well :-)  I agree that giving up 5 years of progress is 
> not appealing.

TBH our team has to explain in more depth how eventing works on Darwin; the 
same way we (or maybe it's just I, I don't want to disparage my colleagues here 
:P) need to understand Actors and what they mean better, I think the Swift core 
team (and whoever works on concurrency) needs to be able to understand what I 
explained above.

> 
>> Declaring that an actor targets a given existing serial context also means 
>> that if that actor needs to make urgent progress the context in question has 
>> to be rushed, and its priority elevated. It's really hard to do the same on 
>> an anonymous global context (the way dispatch does it still is to actually 
>> enqueue stealer work that try to steal the "actor" at a higher priority. 
>> this approach is terribly wasteful).
> 
> Ok.
> 
>> It is clear today that it was a mistake and that there should have been 3 
>> kind of queues:
>> - the global queues, which aren't real queues but represent which family of 
>> system attributes your execution context requires (mostly priorities), and 
>> we should have disallowed enqueuing raw work on it
>> - the bottom queues (which GCD since last year tracks and call "bases" in 
>> the source code) that are known to the kernel when they have work enqueued
>> - any other "inner" queue, which the kernel couldn't care less about
>> 
>> In dispatch, we regret every passing day that the difference between the 2nd 
>> and 3rd group of queues wasn't made clear in the API originally.
> 
> Well this is an opportunity to encourage the right thing to happen.  IMO it 
> seems that the default should strongly be for an “everyday actor” to be the 
> third case, which is what developers will reach for when building out their 
> app abstractions.

Correct.

> We can provide some API or other system that allows sufficiently advanced 
> developers to tie an actor into the second bucket, and we can disallow the 
> first entirely if that is desirable (e.g. by a runtime trap if nothing else).

Yes, I would be pleased with something like that, this is exactly what I'm 
advocating.


>> I like to call the 2nd category execution contexts, but I can also see why 
>> you want to pass them as Actors, it's probably more uniform (and GCD did the 
>> same by presenting them both as queues). Such top-level "Actors" should be 
>> few, because if they all become active at once, they will need as many 
>> threads in your process, and this is not a resource that scales. This is why 
>> it is important to distinguish them. And like we're discussing they usually 
>> also wrap some kind of shared mutable state, resource, or similar, which 
>> inner actors probably won't do.
> 
> The thing I still don’t understand is where #2 comes from: are these system 
> defined queues that the developer interacts with, or are these things that 
> developers define in their code?  How does the kernel know about them if the 
> developer defines new ones?

This is, sir, a very good question, and I think anyone working on Actors will 
need to understand most of this ;) You "just" need to understand most of the 
kevent and workqueue integration to answer it. It ties into what I was saying 
about kernel-issued events above. But to try to give you a sense quickly, there 
are two cases: pure queues, and queues with kernel event registrations:

1) For the former case, the kernel learns just in time when the queue is woken 
up (the term GCD uses to describe the idle -> non-empty transition for a 
queue), because the kernel never needs to know before that point. The GCD 
runtime will evaluate that the queue is rooted at "the bottom", and will make 
it known to the kernel then.

2) For the second case, queue hierarchies that have event monitoring set up 
(dispatch sources, XPC connections) are registered upfront with the kernel. The 
evaluation of which bottom queue an event is handled on is done by the GCD 
runtime when the first registration happens. This allows both kernel and user 
space to make thread requests on behalf of this hierarchy, and for these 
requests to coalesce in the kernel. (This is btw awesome technology that has 
been very difficult in the making FWIW, but I digress.)

(2) is what scares me: the kernel has stuff to deliver, and kernels don't like 
to hold on to data on behalf of userspace forever, because this burns wired 
memory. This means that this doesn't quite play nice with a global anonymous 
pool that can be starved at any time, especially if we're talking about XPC 
connections in a daemon, where super-high-priority clients such as the 
frontmost app (or even worse, SpringBoard) can ask you questions that you'd 
better answer as fast as you can.

The solution we recommend to developers (1st and 3rd party) is to root all the 
XPC connections for your clients at the same bottom queue, one that represents 
your "communication" subsystem. You can imagine it being the "incoming request 
multiplexer" Actor or something; this one is in the 2nd category and is known 
to the kernel, so that the kernel can instantiate the Actor itself without 
asking permission of userspace, directly make the execution context, and rely 
on the scheduler to get the priorities right.
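XPC itself isn't portable, but the shape of the recommendation (many event 
registrations rooted at a single known bottom queue) can be sketched with plain 
dispatch sources; the names here are invented for illustration:

```swift
import Dispatch

// Hypothetical "incoming request multiplexer": a single bottom serial queue
// at which several event sources are rooted, so the kernel-facing side of
// the subsystem is one known execution context.
let multiplexer = DispatchQueue(label: "incoming-request-multiplexer")
let sem = DispatchSemaphore(value: 0)
var handled = 0

let sources: [DispatchSourceUserDataAdd] = (0..<3).map { _ -> DispatchSourceUserDataAdd in
    let source = DispatchSource.makeUserDataAddSource(queue: multiplexer)
    source.setEventHandler {
        handled += Int(source.data)   // serialized on the multiplexer queue
        sem.signal()
    }
    source.resume()
    return source
}

// Fire one event per source; all three are handled on the same bottom queue.
for source in sources { source.add(data: 1) }
for _ in 0..<3 { sem.wait() }
```

Three independent registrations, one bottom queue: this is the "communication 
subsystem" shape, in miniature, that the kernel can target directly.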

The alternative is for userspace to have a construct similar to a kernel 
scheduler in every process, with a strong tie to the kernel so that the context 
can be created within the shared pool; the userspace scheduler would then take 
over and try to schedule this. This is where all the discussion about physical 
threads starts: it becomes very tempting to have "lightweight" user-space stack 
switching (I really believe that for sufficiently complex code, with C 
interleaved, these are actually merely slightly less heavyweight) and basically 
make every process replicate what a kernel does. This is where I'd like for us 
not to go, because it means solving the same problems in two places (kernel and 
userspace), which requires synchronization between their states and doubles 
your bug surface. It also requires a large amount of diagnostic tooling to be 
able to represent these things in meaningful ways. And last, it doesn't work at 
all with TSDs (which Swift uses extensively in its runtime, and that Obj-C & C 
frameworks use all over the place) or with most of our locking primitives 
(which, when you take them, tie the locking primitive to the thread ID / port 
that took the lock; so if you have a fluid association between userspace 
threads and physical ones, you're in for a very bad time). All these reasons 
are why we took the direction with the OS of making the kernel more aware of 
the runtime, rather than having the runtime be more like a kernel.

This is why our belief (in my larger K(ernel) & R(untime) team) is that instead 
of trying to paper over the fact that there's a real OS running your stuff, we 
need to embrace it: we explain to people that everywhere in a traditional POSIX 
world they would have used a real pthread_create()d thread to perform the work 
of a given subsystem, they create one such category #2 bottom queue that 
represents this thread (and make this subsystem an Actor). And the same way 
that in POSIX you can't quite expect a process to have thousands of threads, 
you can't really have thousands of such top-level actors either. It doesn't 
mean there can't be some anonymous pool to run the stuff that is less your 
raison d'être; it just means that the things your app was really built to do 
should be organized in a resource-aware way, to take maximum advantage of the 
system. Throwing a million actors at the system without more upfront 
organization and initial thinking by the developers is not optimizable.


FWIW, for other people wanting to participate in this discussion: what I'm 
saying feels very Darwin-centric because I use XPC as an example, but on Linux 
replace XPC with D-Bus and all my arguments apply the same. OS design meets the 
same class of problems everywhere; I'm just taking a concrete example that 
people who've worked at Apple understand. I don't know Windows at all, but I'd 
be very surprised if they don't have something exactly like this too.


>>>> You can obviously model this as "the database is an actor itself", and 
>>>> have the queue of other actors that only do database work target this 
>>>> database actor queue, but while this looks very appealing, in practice 
>>>> this creates a system which is hard to understand for developers. Actors 
>>>> talking to each other/messaging each other is fine. Actors nesting their 
>>>> execution inside each other is not because the next thing people will then 
>>>> ask from such a system is a way to execute code from the outer actor when 
>>>> in the context of the inner Actor, IOW what a recursive mutex is to a 
>>>> mutex, but for the Actor queue. This obvious has all the terrible issues 
>>>> of recursive locks where you think you hold the lock for the first time 
>>>> and expect your object invariants to be valid, except that you're really 
>>>> in a nested execution and see broken invariants from the outer call and 
>>>> this creates terribly hard bugs to find.
>>> 
>>> Yes, I understand the problem of recursive locks, but I don’t see how or 
>>> why you’d want an outer actor to have an inner actor call back to it.
>> 
>> I don't see why you'd need this with dispatch queues either today, however 
>> my radar queue disagrees strongly with this statement. People want this all 
>> the time, mostly because the outer actor has a state machine and the inner 
>> actor wants it to make progress before it continues working.
>> 
>> In CFRunloop terms it's like running the runloop from your work by calling 
>> CFRunloopRun() yourself again until you can observe some event happened.
>> 
>> It's not great and problematic for tons of reasons. If actors are nested, we 
>> need a way to make sure people don't have to ever do something like that.
>> 
>> Just as a data point, my reactions to this thread yielded a private 
>> discussion off list *exactly* about someone wanting us to provide something 
>> like this for dispatch (or rather in this instance having the inner actor be 
>> able to suspend the outer one, but it's just a corollary / similar layering 
>> violation where the inner actor wants to affect the outer one in a way the 
>> outer one didn't expect).
> 
> I get that this comes up now and then, but I also don’t understand what we 
> can do about it.  Allowing “recursive lock” style reentrancy into an actor 
> breaks fundamental aspects of the design of actors: that you know at the 
> start of any actor message that the actor invariants are intact.

Exactly.

> It would be possible to relax this and give up on that (perhaps rather 
> theoretical) win, but I’d rather not.

Me neither, I hate this, and all the radars I was alluding to are either NTBFed 
with an explanation to the reporter on how to do the same without using 
recursive patterns, or sit in our queue with us being perplexed. It's just that 
we have to know these questions will happen.

> I think that any recursive case can be expressed through a refactoring of the 
> code, e.g. to use async sending instead of sync calls.

This is true, but I would argue that if you reached the point where you need a 
refactor, then you screwed up during design first. What I mean is that we 
should explain upfront how not to design things that lead to this problem, with 
simple examples that people can use to build complex systems that never fall 
into this trap.
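A minimal sketch of that refactoring, using dispatch queues as stand-ins for 
actor mailboxes (all names are made up): the inner actor replies with an async 
send instead of synchronously re-entering the outer one, so every outer message 
starts with the outer actor's invariants intact.

```swift
import Dispatch

// Hypothetical two-actor setup: `outer` asks `inner` for work and passes a
// continuation, rather than blocking or having `inner` call outer.sync
// (which would be the recursive-lock-shaped anti-pattern).
let outer = DispatchQueue(label: "outer.actor")
let inner = DispatchQueue(label: "inner.actor")
let done = DispatchSemaphore(value: 0)
var outerState = 0

outer.async {
    inner.async {
        let result = 42            // the inner actor's computation
        outer.async {              // async "send" back, never outer.sync here
            outerState = result    // runs as a fresh outer message
            done.signal()
        }
    }
}
done.wait()
```

The reply is just another message in the outer mailbox, so there is no nested 
execution and no way to observe broken invariants from an outer call in flight.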

>  Also, I think we should strongly encourage pure async “fire and forget” 
> actor methods anyway - IOW, we should encourage push, not pull


I almost agree. We should strongly encourage pure async "account for, fire and 
forget" actor methods. The `account for` part is really backpressure: you don't 
actually fire if the remote queue is full, and instead rely on some kind of 
reactive pattern to pull from you. (But I know you wrote that in your proposal 
and you're aware of it.)
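One possible shape for the "account for" part, sketched with a counting 
semaphore as the mailbox bound; this is only an illustration, not dispatch's 
actual backpressure mechanism, and the names are invented:

```swift
import Dispatch

// Hypothetical bounded "actor": a counting semaphore accounts for in-flight
// messages, and a full mailbox rejects the send instead of queueing unbounded
// work; the caller can then fall back to a pull/reactive pattern.
let actor = DispatchQueue(label: "bounded.actor")
let capacity = DispatchSemaphore(value: 2)   // mailbox of 2
let gate = DispatchSemaphore(value: 0)       // keeps the actor busy for the demo

func trySend(_ work: @escaping () -> Void) -> Bool {
    // "account for": grab a mailbox slot, or report the queue is full
    guard capacity.wait(timeout: .now()) == .success else { return false }
    actor.async { work(); capacity.signal() }   // release the slot when done
    return true
}

let accepted = (0..<4).map { _ in trySend { gate.wait() } }
// The first two sends are accepted; the next two see a full mailbox.
for _ in 0..<2 { gate.signal() }             // let the accepted work drain
```

The point is only that "fire and forget" still needs a cheap, synchronous way 
to learn that the target is saturated before firing.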


The other problem is the following. I used to work on iCloud Drive, and the 
design of the daemon is pure GCD (including its interaction with the SQLite 
metadata database(s) it has, which probably explains why I took this example 
earlier), which makes it relevant to this discussion. Every time you propose an 
idea, I'm really trying to ask myself "what would we do in bird".

One problem that is interesting, and that I don't know how to handle in a pure 
fire-and-forget world, is "iCloud account logout". This is something that today 
is implemented in a very synchronous fashion, because getting it wrong means 
the worst thing you can do for privacy: if you have a very quick succession of 
login/logout/login/logout and you screw this up, then you can have 
cross-pollination between accounts, and we take that very seriously as you'd 
imagine.

That being said, the solution that was chosen is that the part of the code 
performing the logout sends tons of cancellations to all the things in flight, 
starts deleting your files on disk, and waits *synchronously* on all the 
subsystems for them to report that they've forgotten everything about the 
previous account. For debugging purposes, this is the right model, because when 
one subsystem has a bug and doesn't complete, you can see which one it is by 
inspecting the blocking relationships, which is trivial with regular OS 
introspection tools. I don't believe that writing the same code in a purely 
async way would be as easy to debug.
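The synchronous barrier described here can be sketched with a dispatch group; 
the subsystem names below are invented for illustration. Cancellations fan out, 
then the logout path blocks until every subsystem confirms it has forgotten the 
account:

```swift
import Dispatch

// Hypothetical subsystems, each with its own serial queue.
let subsystems = ["sync-engine", "uploader", "metadata-db"]
    .map { DispatchQueue(label: $0) }
let state = DispatchQueue(label: "logout.state")   // protects `confirmed`
var confirmed: [String] = []
let group = DispatchGroup()

for queue in subsystems {
    queue.async(group: group) {
        // ... cancel in-flight work, wipe caches for the old account ...
        state.sync { confirmed.append(queue.label) }
    }
}
group.wait()   // synchronous barrier: a stuck subsystem is visible as a
               // blocking relationship right here, with ordinary tools
```

If one subsystem never confirms, the thread parked in `group.wait()` points 
straight at the culprit, which is the debuggability argument being made above.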

> - since they provide much stronger guarantees in general.

It depends which guarantees you're talking about; I don't think this statement 
is true. Async work has good and strong properties when you write code in the 
"normal" priority ranges, what we refer to as "the QoS world" on Darwin (from 
background up to UI work).

However, there are tons of snowflakes on any platform that can't be in that 
world:
- media rendering (video/audio)
- HID (touch, gesture recognition, keyboard, mice, trackpads, ...)
- some use cases of networking (Bluetooth is a very good example; you hate when 
your audio drops with your Bluetooth headset, don't you?)
- ...

And these use cases are many, and run in otherwise regular processes all the 
time.

Today, the only way these subsystems avoid massive priority inversions when one 
of them has to perform something that interacts with generic code (such as 
loading a user pref, doing an IPC, etc...) is to do everything synchronously, 
because blocking is how the kernel can resolve these inversions locally, 
without priority overhangs. Asynchronous propagation has a ton of overhang (the 
term we use to describe the fact that your current effective priority is 
tainted by a higher-priority piece of work that happened in the past). Overhang 
outside of the QoS world is not an option: if you run at the priority of 
gesture recognition by accident and use a lot of CPU, you've just forfeited the 
touch interface of your phone; you can see how people would be upset with this.

However, it is a fact of life that these subsystems have to interact with 
generic subsystems sometimes, and that means they need to be able to 
synchronously wait on an actor, so that this actor's priority is elevated. And 
you can't wave this off: there are tons of legitimate reasons for 
very-high-priority subsystems to have to interact with and wait on 
regular-priority work.

I'll take one compelling example: logging. When your HID stack wants to report 
a runtime fault of some sort, it has to interact with the logging subsystem, 
which is mostly utility work. If the HID stack logs all the time, that is 
obviously wrong; but if it has to report a critical issue, then it's fair for 
this subsystem to log, and asyncing the log is not a good option, as this 
breaks the stacks and many things you want for a report. However, once this log 
is done, you really want to reset your HID stack and move on with your life.
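A sketch of that interaction, assuming QoS-decorated dispatch queues (the names 
are invented): the high-priority caller blocks in sync(), and that blocking 
relationship is what lets the system resolve the inversion by boosting whoever 
services the utility-level logging queue, with no lasting priority overhang.

```swift
import Dispatch

// Hypothetical logging subsystem running at utility QoS.
let logging = DispatchQueue(label: "logging", qos: .utility)
var report: String?

// A high-QoS "HID-like" caller must log synchronously to keep the stack
// intact for the report; the sync wait is what makes priority elevation
// of the logging work possible and locally scoped.
DispatchQueue.global(qos: .userInteractive).sync {
    logging.sync { report = "runtime fault report" }
}
```

An async submission at the same point would leave the high-priority caller 
unable to express "I am waiting on this specific work", which is the whole 
mechanism the kernel uses to fix the inversion.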

I agree with you 100% that if *everything* were asynchronous and written this 
way, our lives would be great. I don't however think it's possible on a 
real-life operating system to write all your code this way. And this is exactly 
where things start to get *very* messy.


However, if you're not in one of these "extra difficult" cases, you should use 
pure async. Which means that we still need to provide synchronous waiting, but 
explain very clearly when it's okay to use. I think, e.g., that if you're 
running on the anonymous pool, you should not be allowed to synchronously wait 
on another actor: in all the examples I took (whether the iCloud Drive logout 
case or the high-priority subsystems), the code that waits on actors only runs 
on a non-anonymous context, and it's "fine" to block that context, because you 
made it as a developer in the first place, so you understand what you're doing 
to that unique serial resource. If you block the shared pool, you can't 
possibly understand, since by definition you don't know what else runs there.




>> It is definitely the family of problems I'm worried about. I want to make 
>> sure we have an holistic approach here because I think that recursive 
>> mutexes, recursive CFRunloop runs and similar ideas are flawed and 
>> dangerous. I want to make sure we understand which limitations we want to 
>> impose here, and by limitations I really mean layering/architecture rules, 
>> and that we document them upfront and explain how to work with them.
> 
> +1
> 
>> I think that what I'm getting at is that we can't build Actors without 
>> defining how they interact with the operating system and what guarantees of 
>> latency and liveness they have.
> 
> Agreed, thanks Pierre!
> 
> -Chris
> 
> 
> _______________________________________________
> swift-evolution mailing list
> swift-evolution@swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution
