Re: [C++ coroutines] Initial implementation pushed to master.

H.J. Lu Tue, 05 Mar 2024 09:32:23 -0800

On Sat, Jan 18, 2020 at 4:54 AM Iain Sandoe <i...@sandoe.co.uk> wrote:
>
> Hi,
>
> Thanks to:
>
>    * the reviewers, the code was definitely improved by your reviews.
>
>    * those folks who tested the branch and/or compiler explorer
>      instance and reported problems with reproducers.
>
>   * WG21 colleagues, especially Lewis and Gor for valuable input
>     and discussions on the design.
>
> ===== TL;DR:
>
> * This is not enabled by default (even for -std=c++2a), it needs -fcoroutines.
>
> * Like all the C++20 support, it is experimental, perhaps more experimental
>   than some other pieces because wording is still being amended.
>
> * The FE/ME tests are run for ALL targets; in principle this should be
>   target-agnostic; if we see fails then that is probably interesting input
>   for the ABI panel.
>
>  * I regstrapped on 64b LE and BE platforms and a 32b LE host with no observed
>   issues or regressions.
>
>  * it’s just slightly too big to send uncompressed so attached as a bz2.
>
>  * commit is r10-6063-g49789fd08
>
> thanks again to all those who helped,
> Iain
>
> ======  The full covering note:
>
> This is the squashed version of the first 6 patches that were split to
> facilitate review.
>
> The changes to libiberty (7th patch) to support demangling the co_await
> operator stand alone and are applied separately.
>
> The patch series is an initial implementation of a coroutine feature,
> expected to be standardised in C++20.
>
> Standardisation status (and potential impact on this implementation)
> --------------------------------------------------------------------
>
> The facility was accepted into the working draft for C++20 by WG21 in
> February 2019.  During following WG21 meetings, design and national body
> comments have been reviewed, with no significant change resulting.
>
> The current GCC implementation is against n4835 [1].
>
> At this stage, the remaining potential for change comes from:
>
> * Areas of national body comments that were not resolved in the version we
>   have worked to:
>   (a) handling of the situation where aligned allocation is available.
>   (b) handling of the situation where a user wants coroutines, but does not
>       want exceptions (e.g. a GPU).
>
> * Agreed changes that have not yet been worded in a draft standard that we
>   have worked to.
>
> It is not expected that the resolution to these can produce any major
> change at this phase of the standardisation process.  Such changes should be
> limited to the coroutine-specific code.
>
> ABI
> ---
>
> The various compiler developers ('vendors') have discussed a minimal ABI to
> allow one implementation to call coroutines compiled by another.
>
> This amounts to:
>
> 1. The layout of a public portion of the coroutine frame.
>
>  Coroutines need to preserve state across suspension points, the storage for
>  this is called a "coroutine frame".
>
>  The ABI mandates that pointers into the coroutine frame point to an area
>  beginning with two function pointers (to the resume and destroy functions
>  described below); these are immediately followed by the "promise object"
>  described in the standard.
>
>  This is sufficient that the builtins can take a coroutine frame pointer and
>  determine the address of the promise (or call the resume/destroy functions).
>
> 2. A number of compiler builtins that the standard library might use.
>
>   These are implemented by this patch series.
>
> 3. This introduces a new operator 'co_await', the mangling for which is also
> agreed between vendors (and has an issue filed for that against the upstream
> c++abi).  Demangling for this is added to libiberty in a separate patch.
>
> The ABI has currently no target-specific content (a given psABI might elect
> to mandate alignment, but the common ABI does not do this).
>
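To make the frame layout in item 1 of the ABI description concrete, here is a
minimal sketch in plain C++ (no coroutine support needed).  The names
DemoPromise, DemoFrame, promise_from_frame and frame_from_promise are
hypothetical and are not GCC or ABI names.  It also assumes the promise needs
no more alignment than a function pointer, so no padding sits between the two
pointers and the promise; a real implementation has to account for alignment.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a user's promise type.
struct DemoPromise {
  int current_value;
};

// Public portion of the frame as described above: two function pointers
// (resume, destroy) immediately followed by the promise object.
struct DemoFrame {
  void (*resume_fn)(DemoFrame *);
  void (*destroy_fn)(DemoFrame *);
  DemoPromise promise;
  // Everything from here on is private to the implementation.
  unsigned resume_index;
};

// Assumption: the promise starts exactly two function pointers into the frame.
constexpr std::size_t promise_offset = 2 * sizeof(void (*)(DemoFrame *));
static_assert(offsetof(DemoFrame, promise) == promise_offset,
              "unexpected padding before the promise");

// What a "frame pointer -> promise address" builtin conceptually computes.
inline DemoPromise *promise_from_frame(void *frame) {
  assert(frame != nullptr);
  return reinterpret_cast<DemoPromise *>(
      static_cast<unsigned char *>(frame) + promise_offset);
}

// The inverse mapping: recover the frame pointer from the promise address.
inline void *frame_from_promise(DemoPromise *promise) {
  assert(promise != nullptr);
  return reinterpret_cast<unsigned char *>(promise) - promise_offset;
}
```

Because only the two pointers and the promise are public, a library header
can resume, destroy, or reach the promise without knowing anything about the
rest of the frame.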
> Standard Library impact
> -----------------------
>
> The current implementations require addition of only a single header to
> the standard library (no change to the runtime).  This header is part of
> the patch.
>
> GCC Implementation outline
> --------------------------
>
> The standard's design for coroutines does not decorate the definition of
> a coroutine in any way, so that a function is only known to be a coroutine
> when one of the keywords (co_await, co_yield, co_return) is encountered.
>
> This means that we cannot special-case such functions from the outset, but
> must process them differently when they are finalised - which we do from
> "finish_function ()".
>
> At a high level, this design of coroutine produces four pieces from the
> original user's function:
>
>   1. A coroutine state frame (taking the logical place of the activation
>      record for a regular function).  One item stored in that state is the
>      index of the current suspend point.
>   2. A "ramp" function
>      This is what the user calls to construct the coroutine frame and start
>      the coroutine execution.  This will return some object representing the
>      coroutine's eventual return value (or means to continue it when it is
>      suspended).
>   3. A "resume" function.
>      This is what gets called when the coroutine is resumed after suspension.
>   4. A "destroy" function.
>      This is what gets called when the coroutine state should be destroyed
>      and its memory released.
>
> The standard's coroutines involve cooperation of the user's authored function
> with a provided "promise" class, which includes mandatory methods for
> handling the state transitions and providing output values.  Most realistic
> coroutines will also have one or more 'awaiter' classes that implement the
> user's actions for each suspend point.  As we parse (or during template
> expansion) the types of the promise and awaiter classes become known, and can
> then be verified against the signatures expected by the standard.
>
> Once the function is parsed (and templates expanded) we are able to make the
> transformation into the four pieces noted above.
>
> The implementation here takes the approach of a series of AST transforms.
> The state machine suspend points are encoded in three internal functions
> (one of which represents an exit from scope without cleanups).  These three
> IFNs are lowered early in the middle end, such that the majority of GCC's
> optimisers can be run on the resulting output.
>
> As a design choice, we have carried out the outlining of the user's function
> in the front end, and taken advantage of the existing middle end's abilities
> to inline and DCE where that is profitable.
>
> Since the state machine is actually common to both resumer and destroyer
> functions, we make only a single function "actor" that contains both the
> resume and destroy paths.  The destroy function is represented by a small
> stub that sets a value to signal the use of the destroy path and calls the
> actor.  The idea is that optimisation of the state machine need only be done
> once - and then the resume and destroy paths can be identified allowing the
> middle end's inline and DCE machinery to optimise as profitable as noted
> above.
>
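The ramp / actor / destroy-stub split described above can be modelled by hand
in plain C++.  This is only a sketch of the idea, not what GCC emits: every
name below (CounterFrame, counter_ramp, counter_actor, counter_destroy) is
made up, the promise is reduced to one int, and the initial suspend is assumed
never to suspend.  It models a coroutine whose body is roughly
"for (i = 0; i < limit; ++i) co_yield i;".

```cpp
#include <cassert>

// Hand-written model; all names are hypothetical.
struct CounterFrame {
  void (*resume_fn)(CounterFrame *);
  void (*destroy_fn)(CounterFrame *);
  int current;             // stands in for the promise: last yielded value
  unsigned suspend_index;  // index of the current suspend point
  bool destroying;         // set by the destroy stub to select the destroy path
  bool done;               // true once the final suspend point is reached
  int i, limit;            // the user's locals, promoted into the frame
};

// The single "actor": one state machine holding both resume and destroy paths.
void counter_actor(CounterFrame *f) {
  assert(f != nullptr);
  if (f->destroying) {
    // Destroy path: in this toy every suspend point has the same cleanup.
    delete f;
    return;
  }
  switch (f->suspend_index) {
  case 0:  // entry from the ramp
    f->i = 0;
    break;
  case 1:  // resumed after a yield
    ++f->i;
    break;
  default:  // resumed at the final suspend point: nothing left to run
    return;
  }
  if (f->i < f->limit) {
    f->current = f->i;  // "co_yield i"
    f->suspend_index = 1;
  } else {
    f->done = true;  // fell off the end of the body: final suspend
    f->suspend_index = 2;
  }
}

// The destroy function is a stub: flag the destroy path, then call the actor.
void counter_destroy(CounterFrame *f) {
  f->destroying = true;
  counter_actor(f);
}

// The ramp: allocate the frame, then start the body from resume point 0.
CounterFrame *counter_ramp(int limit) {
  auto *f = new CounterFrame{counter_actor, counter_destroy,
                             0, 0, false, false, 0, limit};
  counter_actor(f);
  return f;
}
```

A caller would read f->current while f->done is false, call f->resume_fn (f)
to advance, and finish with f->destroy_fn (f).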
> The middle end components for this implementation are:
>
> A pass that:
>  1. Lowers the coroutine builtins that allow the standard library header to
>     interact with the coroutine frame (these are fairly simple logical or
>     numerical substitutions of values, given a coroutine frame pointer).
>  2. Lowers the IFN that represents the exit from state without cleanup.
>     Essentially, this becomes a gimple goto.
>  3. Sets the final size of the coroutine frame at this stage.
>
> A second pass (which requires the revised CFG that results from the lowering
> of the scope exit IFNs in the first) that:
>
>  1. Lowers the IFNs that represent the state machine paths for the resume and
>     destroy cases.
>
> Patches squashed into this commit:
>
> [C++ coroutines 1] Common code and base definitions.
>
> This part of the patch series provides the gating flag, the keywords,
> cpp defines etc.
>
> [C++ coroutines 2] Define builtins and internal functions.
>
> This part of the patch series provides the builtin functions
> used by the standard library code and the internal functions
> used to implement lowering of the coroutine state machine.
>
> [C++ coroutines 3] Front end parsing and transforms.
>
> There are two parts to this.
>
> 1. Parsing, template instantiation and diagnostics for the standard-
>    mandated class entries.
>
>   The user authors a function that becomes a coroutine (lazily) by
>   making use of any of the co_await, co_yield or co_return keywords.
>
>   Unlike a regular function, where the activation record is placed on the
>   stack, and is destroyed on function exit, a coroutine has some state that
>   persists between calls - the 'coroutine frame' (thus analogous to a stack
>   frame).
>
>   We transform the user's function into three pieces:
>   1. A so-called ramp function, that establishes the coroutine frame and
>      begins execution of the coroutine.
>   2. An actor function that contains the state machine corresponding to the
>      user's suspend/resume structure.
>   3. A stub function that calls the actor function in 'destroy' mode.
>
>   The actor function is executed:
>    * from "resume point 0" by the ramp.
>    * from resume point N ( > 0 ) for handle.resume() calls.
>    * from the destroy stub for destroy point N for handle.destroy() calls.
>
>   The C++ coroutine design described in the standard makes use of some helper
>   methods that are authored in a so-called "promise" class provided by the
>   user.
>
>   At parse time (or post substitution) the type of the coroutine promise
>   will be determined.  At that point, we can look up the required promise
>   class methods and issue diagnostics if they are missing or incorrect.  To
>   avoid repeating these actions at code-gen time, we make use of temporary
>   'proxy' variables for the coroutine handle and the promise - which will
>   eventually be instantiated in the coroutine frame.
>
>   Each of the keywords will expand to a code sequence (although co_yield is
>   just syntactic sugar for a co_await).
>
>   We defer the analysis and transformation until template expansion is
>   complete so that we have complete types at that time.
>
> 2. AST analysis and transformation which performs the code-gen for the
>    outlined state machine.
>
>    The entry point here is morph_fn_to_coro () which is called from
>    finish_function () when we have completed any template expansion.
>
>    This is preceded by helper functions that implement the phases below.
>
>    The process proceeds in four phases.
>
>    A Initial framing.
>      The user's function body is wrapped in the initial and final suspend
>      points and we begin building the coroutine frame.
>      We build empty decls for the actor and destroyer functions at this
>      time too.
>      When exceptions are enabled, the user's function body will also be
>      wrapped in a try-catch block with the catch invoking the promise
>      class 'unhandled_exception' method.
>
>    B Analysis.
>      The user's function body is analysed to determine the suspend points,
>      if any, and to capture local variables that might persist across such
>      suspensions.  In most cases, it is not necessary to capture compiler
>      temporaries, since the tree-lowering nests the suspensions correctly.
>      However, in the case of a captured reference, there is a lifetime
>      extension to the end of the full expression - which can mean across a
>      suspend point in which case it must be promoted to a frame variable.
>
>      At the conclusion of analysis, we have a conservative frame layout and
>      maps of the local variables to their frame entry points.
>
>    C Build the ramp function.
>      Carry out the allocation for the coroutine frame (NOTE; the actual size
>      computation is deferred until late in the middle end to allow for future
>      optimisations that will be allowed to elide unused frame entries).
>      We build the return object.
>
>    D Build and expand the actor and destroyer function bodies.
>      The destroyer is a trivial shim that sets a bit to indicate that the
>      destroy dispatcher should be used and then calls into the actor.
>
>      The actor function is the implementation of the user's state machine.
>      The current suspend point is noted in an index.
>      Each suspend point is encoded as a pair of internal functions, one in
>      the relevant dispatcher, and one representing the suspend point.
>
>      During this process, the user's local variables and the proxies for the
>      self-handle and the promise class instance are re-written to their
>      coroutine frame equivalents.
>
>      The complete bodies for the ramp, actor and destroy function are passed
>      back to finish_function for folding and gimplification.
>
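The wrapping described in phase A (initial suspend, user body inside a
try-catch that calls the promise's unhandled_exception, then final suspend)
can be illustrated without any coroutine machinery.  The sketch below is a
simplification under stated assumptions: FramingPromise and framed_body are
hypothetical names, the suspend hooks are plain calls rather than awaited
expressions, and nothing actually suspends.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Hypothetical promise exposing only the hooks named in phase A.
struct FramingPromise {
  std::exception_ptr error;  // filled in by unhandled_exception ()
  int initial_calls = 0;
  int final_calls = 0;

  void initial_suspend() { ++initial_calls; }       // real one returns an awaitable
  void final_suspend() noexcept { ++final_calls; }  // likewise
  void unhandled_exception() {
    error = std::current_exception();
    assert(error && "must be called from inside a catch handler");
  }
};

// Models the shape of the wrapped function body.
template <typename Body>
void framed_body(FramingPromise &promise, Body body) {
  promise.initial_suspend();  // stands in for the initial suspend point
  try {
    body();  // the user's original function body
  } catch (...) {
    promise.unhandled_exception();
  }
  promise.final_suspend();  // stands in for the final suspend point
}
```

The point of the shape is that an exception escaping the user's body is
routed to the promise, and the final suspend point is still reached.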
> [C++ coroutines 4] Middle end expanders and transforms.
>
> The first part of this is a pass that provides:
>  * expansion of the library support builtins, these are simple boolean
>    or numerical substitutions.
>
>  * The functionality of implementing an exit from scope without cleanup
>    is performed here by lowering an IFN to a gimple goto.
>
> This pass has to run for non-coroutine functions, since functions calling
> the builtins are not necessarily coroutines (i.e. they are implementing the
> library interfaces which may be called from anywhere).
>
> The second part is the expansion of the coroutine IFNs that describe the
> state machine connections to the dispatchers.  This only has to be run
> for functions that are coroutine components.  The work done by this pass
> is:
>
>    In the front end we construct a single actor function that contains
>    the coroutine state machine.
>
>    The actor function has three entry conditions:
>     1. from the ramp, resume point 0 - to initial-suspend.
>     2. when resume () is executed (resume point N).
>     3. from the destroy () shim when that is executed.
>
>    The actor function begins with two dispatchers; one for resume and
>    one for destroy (where the initial entry from the ramp is a special-
>    case of resume point 0).
>
>    Each suspend point and each dispatch entry is marked with an IFN such
>    that we can connect the relevant dispatchers to their target labels.
>
>    So, if we have:
>
>    CO_YIELD (NUM, FINAL, RES_LAB, DEST_LAB, FRAME_PTR)
>
>    This is await point NUM, and is the final await if FINAL is non-zero.
>    The resume point is RES_LAB, and the destroy point is DEST_LAB.
>
>    We expect to find a CO_ACTOR (NUM) in the resume dispatcher and a
>    CO_ACTOR (NUM+1) in the destroy dispatcher.
>
>    Initially, the intent of keeping the resume and destroy paths together
>    is that the conditionals controlling them are identical, and thus there
>    would be duplication of any optimisation of those paths if the split
>    were earlier.
>
>    Subsequent inlining of the actor (and DCE) is then able to extract the
>    resume and destroy paths as separate functions if that is found
>    profitable by the optimisers.
>
>    Once we have remade the connections to their correct positions, we elide
>    the labels that the front end inserted.
>
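One way to read the CO_ACTOR (NUM) / CO_ACTOR (NUM+1) pairing, together with
the earlier remark that the destroy shim "sets a bit", is that resume entries
use even indices and the matching destroy entries use the next odd index.
That reading is my assumption, not something the note states outright.  A
small sketch of such an encoding follows; resume_index, destroy_index,
select_destroy, Path, Target and dispatch are all hypothetical names.

```cpp
#include <cassert>

// Assumed encoding: suspend point k resumes at index 2k and is destroyed
// at index 2k + 1.
constexpr unsigned resume_index(unsigned suspend_point) {
  return 2 * suspend_point;
}
constexpr unsigned destroy_index(unsigned suspend_point) {
  return resume_index(suspend_point) + 1;
}

// What a destroy shim could do before calling the actor: set the low bit.
inline unsigned select_destroy(unsigned index) {
  assert((index & 1u) == 0 && "expected a resume index");
  return index | 1u;
}

enum class Path { Resume, Destroy };

struct Target {
  Path path;
  unsigned suspend_point;
};

// Models the two dispatchers at the head of the actor: the low bit picks
// the dispatcher, the remaining bits pick the suspend point.
inline Target dispatch(unsigned index) {
  Target t;
  t.path = (index & 1u) ? Path::Destroy : Path::Resume;
  t.suspend_point = index / 2;
  return t;
}
```

With this encoding the resume and destroy dispatchers share one index
variable in the frame, which fits the single-actor design described above.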
> [C++ coroutines 5] Standard library header.
>
> This provides the interfaces mandated by the standard and implements
> the interaction with the coroutine frame by means of inline use of
> builtins expanded at compile-time.  There should be a 1:1 correspondence
> with the standard sections which are cross-referenced.
>
> There is no runtime content.
>
> At this stage, we have the content in an inline namespace "__n4835" for
> the CD we worked to.
>
> [C++ coroutines 6] Testsuite.
>
> There are two categories of test:
>
> 1. Checks for correctly formed source code and the error reporting.
> 2. Checks for transformation and code-gen.
>
> The second set are run as 'torture' tests for the standard options
> set, including LTO.  These are also intentionally run with no options
> provided (from the coroutines.exp script).
>
> gcc/ChangeLog:
>
> 2020-01-18  Iain Sandoe  <i...@sandoe.co.uk>
>
>         * Makefile.in: Add coroutine-passes.o.
>         * builtin-types.def (BT_CONST_SIZE): New.
>         (BT_FN_BOOL_PTR): New.
>         (BT_FN_PTR_PTR_CONST_SIZE_BOOL): New.
>         * builtins.def (DEF_COROUTINE_BUILTIN): New.
>         * coroutine-builtins.def: New file.
>         * coroutine-passes.cc: New file.


There are

              tree res_tgt = TREE_OPERAND (gimple_call_arg (stmt, 2), 0);
              tree &res_dest = destinations.get_or_insert (idx, &existed);
              if (existed && dump_file)
                {
                  fprintf (
                    dump_file,
                    "duplicate YIELD RESUME point (" HOST_WIDE_INT_PRINT_DEC
                    ") ?\n",
                    idx);
                  print_gimple_stmt (dump_file, stmt, 0, TDF_VOPS|TDF_MEMSYMS);
                }
              else
                res_dest = res_tgt;

Why does this behavior depend on dump_file?  When a duplicate exists and
dump_file is NULL, control falls to the else branch and res_dest is
overwritten; when dump_file is set, the duplicate is reported and res_dest
is left alone.

H.J.
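A stand-alone model of the quoted control flow shows the difference.  This is
a sketch only: std::unordered_map stands in for GCC's hash_map, an int stands
in for the label tree, and record_as_quoted / record_independent are
hypothetical names.  Keeping the first entry in the second function is just
one possible choice, not necessarily the intended fix.

```cpp
#include <cstdio>
#include <unordered_map>

// index -> stand-in for the resume label
using DestMap = std::unordered_map<long, int>;

// Mirrors the quoted code: a duplicate is kept only if a dump file is open.
void record_as_quoted(DestMap &destinations, long idx, int res_tgt,
                      std::FILE *dump_file) {
  auto result = destinations.insert({idx, 0});
  bool existed = !result.second;
  if (existed && dump_file)
    std::fprintf(dump_file, "duplicate YIELD RESUME point (%ld) ?\n", idx);
  else
    result.first->second = res_tgt;  // also reached when existed && !dump_file
}

// One possible alternative: the decision depends only on 'existed';
// the dump file controls reporting and nothing else.
void record_independent(DestMap &destinations, long idx, int res_tgt,
                        std::FILE *dump_file) {
  auto result = destinations.insert({idx, res_tgt});
  bool existed = !result.second;
  if (existed && dump_file)
    std::fprintf(dump_file, "duplicate YIELD RESUME point (%ld) ?\n", idx);
}
```

In the first function, recording the same index twice leaves the second
target in the map when dump_file is null and the first target when it is not;
the second function gives the same map contents either way.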
