This patch series is an initial implementation of a coroutine feature, expected to be standardised in C++20.

Standardisation status (and potential impact on this implementation):
----------------------

The facility was accepted into the working draft for C++20 by WG21 in
February 2019.  During the two following WG21 meetings, design and
national body comments have been reviewed, with no significant change
resulting.

Mature implementations (several years) of this exist in MSVC, clang and
EDG, with some experience using the clang one in production - so that the
underlying principles are thought to be sound.

At this stage, the remaining potential for change comes from two areas of
national body comments that were not resolved during the last WG21
meeting:
(a) handling of the situation where aligned allocation is available.
(b) handling of the situation where a user wants coroutines, but does not
    want exceptions (e.g. on a GPU).

It is not expected that the resolution to either of these will produce any
major change.

The current GCC implementation is against n4835 [1].

ABI
---

The various compiler developers have discussed a minimal ABI to allow one
implementation to call coroutines compiled by another; this amounts to:

1. The layout of a public portion of the coroutine frame.
2. A number of compiler builtins that the standard library might use.

The eventual home for the ABI is not decided yet; I will put a draft onto
the wiki this week.

The ABI currently has no target-specific content (a given psABI might
elect to mandate alignment, but the common ABI does not do this).

There is no need to add any new mangling, since the components of this are
regular functions, with manipulation of the coroutine via a type-erased
handle.

Standard Library impact
-----------------------

The current implementations require addition of only a single header to
the standard library (no change to the runtime).  This header is part of
the patch series.
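
To make the "public portion of the frame" and "type-erased handle" ideas
from the ABI section concrete, here is a minimal sketch.  It is an
illustration only and not the ABI draft: the names FrameHeader,
ErasedHandle and CounterFrame are hypothetical, and the choice of two
function pointers at the start of the frame is an assumption made for the
example.

```cpp
#include <cassert>

// Hypothetical public prefix of a coroutine frame (an assumption for this
// sketch): two entry points that any holder of the frame may call.
struct FrameHeader {
  void (*resume)(FrameHeader *);
  void (*destroy)(FrameHeader *);
};

// Hypothetical type-erased handle: it knows only about the public prefix,
// so no per-coroutine type (and hence no new mangling) is involved.
class ErasedHandle {
public:
  explicit ErasedHandle(void *frame)
      : frame_(static_cast<FrameHeader *>(frame)) {
    assert(frame != nullptr);
  }
  void resume() const { frame_->resume(frame_); }
  void destroy() const { frame_->destroy(frame_); }

private:
  FrameHeader *frame_;
};

// One concrete frame.  The header must be the first member so that a
// pointer to the frame is also a pointer to the header.
struct CounterFrame {
  FrameHeader header;
  int resumed;     // private state, invisible through the handle
  bool destroyed;  // recorded instead of freeing, so the demo can inspect it
};

void counter_resume(FrameHeader *h) {
  ++reinterpret_cast<CounterFrame *>(h)->resumed;
}

void counter_destroy(FrameHeader *h) {
  reinterpret_cast<CounterFrame *>(h)->destroyed = true;
}

// Drives the frame purely through the erased handle.  Returns the number
// of resumes seen, or -1 if the destroy entry point was never reached.
int demo_erased_handle() {
  CounterFrame frame = {{counter_resume, counter_destroy}, 0, false};
  ErasedHandle h(&frame);
  h.resume();
  h.resume();
  h.destroy();
  return frame.destroyed ? frame.resumed : -1;
}
```

The point of the sketch is that code compiled elsewhere needs to agree
only on the prefix layout to resume or destroy a coroutine it did not
compile.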

GCC Implementation outline
--------------------------

The standard's design for coroutines does not decorate the definition of
a coroutine in any way, so that a function is only known to be a coroutine
when one of the keywords (co_await, co_yield, co_return) is encountered.

This means that we cannot special-case such functions from the outset, but
must process them differently when they are finalised - which we do from
"finish_function ()".

At a high level, this design of coroutine produces four pieces from the
original user's function:

1. A coroutine state frame (taking the logical place of the activation
   record for a regular function).  One item stored in that state is the
   index of the current suspend point.
2. A "ramp" function.
   This is what the user calls to construct the coroutine frame and start
   the coroutine execution.  This will return some object representing the
   coroutine's eventual return value (or a means to continue it when it is
   suspended).
3. A "resume" function.
   This is what gets called when a suspended coroutine is resumed.
4. A "destroy" function.
   This is what gets called when the coroutine state should be destroyed
   and its memory returned.

The standard's coroutines involve cooperation of the user's authored
function with a provided "promise" class, which includes mandatory methods
for handling the state transitions and providing output values.  Most
realistic coroutines will also have one or more 'awaiter' classes that
implement the user's actions for each suspend point.  As we parse (or
during template expansion) the types of the promise and awaiter classes
become known, and can then be verified against the signatures expected by
the standard.

Once the function is parsed (and templates expanded) we are able to make
the transformation into the four pieces noted above.

The implementation here takes the approach of a series of AST transforms.
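
The four pieces listed above can be pictured with a hand-written analogue.
This is a sketch only, not what the patches generate: all names
(RangeFrame, range_ramp, range_resume, range_destroy, sum_range) are
hypothetical, the promise and awaiter classes are left out, and the ramp
hands back a raw frame pointer rather than a proper return object.  It
models a coroutine that would yield 0, 1, ..., limit-1.

```cpp
#include <cassert>

// 1. State frame: stands in for the activation record.
struct RangeFrame {
  int suspend_index;  // index of the current suspend point
  int limit;          // copy of the function parameter
  int i;              // local variable that lives across suspends
  int current;        // value published at the latest yield
  bool done;          // set once the body has run to completion
};

// 2. Ramp: allocates and initialises the frame, then returns to the
//    caller with the coroutine parked at its initial suspend point.
RangeFrame *range_ramp(int limit) {
  RangeFrame *f = new RangeFrame();
  f->suspend_index = 0;
  f->limit = limit;
  f->i = 0;
  f->current = 0;
  f->done = false;
  return f;
}

// 3. Resume: dispatches on the stored suspend index and runs the body
//    until the next suspend point.
void range_resume(RangeFrame *f) {
  switch (f->suspend_index) {
  case 0:  // resumed from the initial suspend: enter the loop
    f->i = 0;
    break;
  case 1:  // resumed after a yield: advance the loop
    ++f->i;
    break;
  default:  // already finished, nothing left to run
    return;
  }
  if (f->i < f->limit) {
    f->current = f->i;     // "yield" the value
    f->suspend_index = 1;  // and remember where to continue
  } else {
    f->done = true;
    f->suspend_index = 2;  // final suspend point
  }
}

// 4. Destroy: releases the frame's memory.
void range_destroy(RangeFrame *f) { delete f; }

// A caller that drives the pieces: ramp, resume until done, destroy.
int sum_range(int limit) {
  RangeFrame *f = range_ramp(limit);
  int sum = 0;
  for (range_resume(f); !f->done; range_resume(f))
    sum += f->current;
  range_destroy(f);
  return sum;
}
```

Here the local "i" has moved into the frame because it must survive
across suspends, which is the reason the frame takes the logical place of
the activation record.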

The state machine suspend points are encoded in three internal functions
(one of which represents an exit from scope without cleanups).  These
three IFNs are lowered early in the middle end, such that the majority of
GCC's optimisers can be run on the resulting output.

As a design choice, we have carried out the outlining of the user's
function in the front end, and taken advantage of the existing middle
end's abilities to inline and DCE where that is profitable.

Since the state machine is actually common to both resumer and destroyer
functions, we make only a single function "actor" that contains both the
resume and destroy paths.  The destroy function is represented by a small
stub that sets a value to signal the use of the destroy path and calls the
actor.  The idea is that optimisation of the state machine need only be
done once - and then the resume and destroy paths can be identified,
allowing the middle end's inline and DCE machinery to optimise where
profitable, as noted above.

The middle end components for this implementation are:

1. Lower the coroutine builtins that allow the standard library header to
   interact with the coroutine frame (these are fairly simple logical or
   numerical substitutions of values, given a coroutine frame pointer).
2. Lower the IFN that represents the exit from state without cleanup.
   Essentially, this becomes a gimple goto.
3. Lower the IFNs that represent the state machine paths for the resume
   and destroy cases.
4. A very late pass that is able to re-size the coroutine frame when there
   are unused entries and therefore choose the minimum allocation for it.

There are no back-end implications to this current design.

GCC Implementation Status
-------------------------

The current implementation should be considered somewhat experimental and
is guarded by a "-fcoroutines" flag.  I have set out to minimise impact on
the compiler (such that with the switch off, coroutines should be a NOP).

The branch has been feature-complete for a few weeks and published on
Compiler Explorer since late September.  I have been keeping a copy of the
branch on my github page, and some bug reports have been filed there (and
dealt with).

The only common resource taken is a single bit in the function decl to
flag that this function is determined to be a coroutine.

Patch Series
------------

The patch series is against r278049 (Mon 11th Nov).

There are 6 pieces, to try and localise the reviewer interest areas.
However, it would not make sense to commit them except, possibly, as two
(main and testsuite).  I have not tested that the compiler would even
build part-way through this series.

1) Common code and base definitions.
   This is the background content, defining the gating flag, keywords etc.

2) Builtins and internal functions.
   Definitions of the builtins used by the standard library header and the
   internal functions used to implement the state machine.

3) Front end parsing and AST transforms.
   This is the largest part of the code, and has essentially two phases:
   1. parse (and template expansion)
   2. analysis and transformation, which does the code generation for the
      state machine.

4) Middle end expanders and transforms.
   As per the description above.

5) Standard library header.
   This is mostly mandated by the standard, although (of course) the
   decision to implement the interaction with the coroutine frame by
   inline builtin calls is pertinent.
   There is no runtime addition for this (the builtins are expanded
   directly).

6) Testsuite.
   There are two chunks of tests:
   1. those that check for correct error handling
   2. those that check for the correct lowering of the state machine
   Since the second set are checking code-gen, they are run as 'torture'
   tests with the default options list.

======

I will put this patch series onto a git branch for those that would prefer
to view it in that form.

thanks
Iain

======

[1] https://wg21.link/n4835