Re: Fwd: Proposal: Roundtrip serialization of Cmm (parser-compatible pretty-printer output)

Hécate via ghc-devs Mon, 28 Jul 2025 14:31:16 -0700

Thanks a lot Diego, that indeed addresses my concerns. :)


On 28/07/2025 at 20:26, Diego Antonio Rosario Palomino wrote:

---------- Forwarded message ---------
From: *Diego Antonio Rosario Palomino* <[email protected]>
Date: Mon, 28 Jul 2025 at 12:56 p.m.
Subject: Re: Proposal: Roundtrip serialization of Cmm (parser-compatible pretty-printer output)
To: Hécate <[email protected]>


Hello all,
Thank you for the thoughtful responses so far, and thank you Simon for summarizing Andreas's comments.
    /"Do you have any use-cases in mind? Suppose you were 100%
    successful — would anyone use it?"/
Yes, my mentor, *Csaba Hruska*, would. He's currently working on a custom STG optimizer that uses experimental techniques to enable whole-program optimizations for Haskell code. The intended pipeline is:
*GHC STG → custom optimizer → textual Cmm → code generation*
However, the current /parseable/ Cmm is not sufficient for his use case, because it *cannot represent everything the Cmm AST can express*.
Beyond this specific use case, achieving *roundtrip serializability* for Cmm could make it a *viable alternative to LLVM* for Haskell projects. Native code generation via Cmm is much faster than through LLVM. And while outputting LLVM from Cmm currently produces /less performant/ code than directly targeting LLVM, I believe the inefficiencies could be fixed relatively easily. Enabling such improvements is part of the motivation for my documentation work: to help developers understand and work with Cmm and its infrastructure.
    /"You need a compelling reason to change the input language
    (understood by the parser) since libraries may include .cmm files,
    which will break. (It'd be interesting to audit Hackage to see how
    many libraries do include such .cmm files.)"/
To clarify, this proposal would *not* break backwards compatibility. There are two implementation paths:

1. Introduce a *second parser* that accepts a syntax 100% identical
   to the pretty printer output.

2. Extend the *current parser* by adding a mode (or block) that uses
   a distinct keyword (e.g., |low_level_unwrapped|) to indicate:
   "expect exact syntax, no convenience fills."

In either case, existing |.cmm| files would continue to be supported as-is. The current parser wouldn't need features removed or changed; the new syntax would *only add capabilities*.
    /"It’s unclear from your example how those blocks would work
    exactly. Is |low_level_unwrapped| a label? If so can we |goto| it?
    Is it a keyword? Something else entirely?"/ — Andreas
Apologies for the confusion. I’m not well-versed in the formal terminology.
To clarify: |low_level_unwrapped| (or |very_low_level|, or another name) would be a *keyword or syntactic construct* that tells the parser to interpret the contents of the block |{ ... }| using a syntax *identical to what the pretty printer emits*.
For example:
    function1 { }           // existing low-level syntax
    function2() { }         // existing high-level syntax
    very_low_level { ... }  // new mode: code with exact pretty-printed syntax inside the block

    /"Rather than change the language understood by the parser, would
    it not be easier to change the language spat out by the
    pretty-printer to be compatible with the parser?"/

Unfortunately, that’s not a practical path forward.
At the start of the project, Csaba (my mentor) recommended leaving the parser mostly untouched and focusing instead on extending the pretty printer. However, we’ve realized that the differences between the parser and the pretty printer are not trivial. The parser, even in its current “low-level” mode, *inserts inferred data* via convenience functions. It *abstracts part of the structure*, meaning we cannot fully recover the original Cmm ADT just by parsing.
In other words, *modifying the pretty printer to match the parser would require it to /lose information/*, which I strongly oppose. If Cmm is generated programmatically, the pretty-printed version would lack structural information present in the internal data structure. And parseable Cmm would still be *incapable of expressing all features of the AST*.
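
To illustrate the information-loss argument, here is a minimal sketch on a toy model. It is not GHC's Cmm AST or parser: the `Call` and `Jump` types, the `elaborate` and `forget` functions, and the default values are all hypothetical, with field names loosely borrowed from the `args: 8, res: 0, upd: 8` annotations in the pretty-printed examples quoted further down.

```haskell
-- Toy model only; none of these names exist in GHC.

-- "Full" node: carries every field explicitly, like the in-memory AST.
data Call = Call
  { target   :: String
  , argSpace :: Int
  , resSpace :: Int
  , updOff   :: Int
  } deriving (Eq, Show)

-- "Surface" node: what a convenience syntax lets the user write.
newtype Jump = Jump String
  deriving (Eq, Show)

-- Hypothetical convenience fill: the parser invents the missing fields.
elaborate :: Jump -> Call
elaborate (Jump t) = Call { target = t, argSpace = 8, resSpace = 0, updOff = 8 }

-- Printing in the surface syntax has to drop the extra fields.
forget :: Call -> Jump
forget = Jump . target

main :: IO ()
main = do
  let original  = Call { target = "f", argSpace = 16, resSpace = 0, updOff = 8 }
      recovered = elaborate (forget original)
  print original
  print recovered
  -- The two differ in argSpace, so the roundtrip is lossy.
  print (original == recovered)
```

In this sketch the final line prints `False`: once a field is filled in by a default rather than read from the text, any AST whose field differs from that default cannot survive a print-then-parse cycle.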
I hope that also addresses your concern, Hécate.
This GSoC project runs until *November 10th*. I was granted extra time since, unlike most participants, I’m not working through summer vacation: I’m in the Southern Hemisphere.
(Also, I realize I previously used the wrong project name in this thread; the correct title of my GSoC project is *“Documenting and improving Cmm.”*)
Regarding the risk of *bitrot* in a new parser or new syntax mode: one possible mitigation would be to add *regression tests* that check whether parsing a file and pretty-printing it results in compatible output.
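
A minimal sketch of what such a roundtrip check could look like, written against a toy expression type rather than GHC's real Cmm AST. Everything here (`Expr`, `pretty`, `parse`, `roundtrips`) is a hypothetical stand-in; an actual test would call GHC's own Cmm pretty printer and parser instead.

```haskell
import Data.Char (isDigit)
import Data.List (stripPrefix)

-- Toy stand-in for a Cmm expression; NOT GHC's real AST.
data Expr
  = Reg Int        -- register, printed as R<n>
  | Lit Integer    -- non-negative literal
  | Add Expr Expr  -- printed fully parenthesised
  deriving (Eq, Show)

-- Pretty printer: the only syntax the parser below has to accept.
pretty :: Expr -> String
pretty (Reg n)   = "R" ++ show n
pretty (Lit i)   = show i
pretty (Add a b) = "(" ++ pretty a ++ " + " ++ pretty b ++ ")"

-- Parser for exactly the pretty-printed form; returns the unconsumed rest.
parseExpr :: String -> Maybe (Expr, String)
parseExpr ('R' : rest) =
  case span isDigit rest of
    ("", _)     -> Nothing
    (ds, rest') -> Just (Reg (read ds), rest')
parseExpr ('(' : rest) = do
  (a, r1) <- parseExpr rest
  r2      <- stripPrefix " + " r1
  (b, r3) <- parseExpr r2
  case r3 of
    ')' : r4 -> Just (Add a b, r4)
    _        -> Nothing
parseExpr s =
  case span isDigit s of
    ("", _)     -> Nothing
    (ds, rest') -> Just (Lit (read ds), rest')

-- Succeeds only if the whole input is consumed.
parse :: String -> Maybe Expr
parse s = case parseExpr s of
  Just (e, "") -> Just e
  _            -> Nothing

-- The regression property: printing and re-parsing gives back the same AST.
roundtrips :: Expr -> Bool
roundtrips e = parse (pretty e) == Just e

main :: IO ()
main = do
  let samples =
        [ Reg 1
        , Lit 8
        , Add (Reg 1) (Reg 2)
        , Add (Add (Reg 1) (Lit 0)) (Reg 2)
        ]
  mapM_ (\e -> putStrLn (pretty e ++ " -> " ++ show (roundtrips e))) samples
```

The same property could be checked in the other direction (parse a file, print it, compare the text), or driven by randomly generated ASTs instead of a fixed sample list.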
On a related note, I’ve noticed that *some Cmm examples in the documentation and even in source code comments are incorrect or outdated*. Part of my work includes identifying and correcting these inconsistencies.
Thanks again to everyone for your time and input. I greatly appreciate the discussion and feedback.
Best regards,
*Diego Antonio Rosario Palomino*
GSoC 2025 – Documenting and improving Cmm
On Mon, 28 Jul 2025 at 11:04 a.m., Hécate via ghc-devs ([email protected]) wrote:
    Hi Diego,

    Thank you very much for your work in this direction, it's sorely
    needed.

    I'm all for having proper roundtrip correctness for Cmm, but I am
    not sure altering the parser is the way to go.
    In my opinion, GHC should produce valid textual Cmm, that can be
    ingested by the parser as it is today.

    Have a nice day,
    Hécate

    On 28/07/2025 at 02:16, Diego Antonio Rosario Palomino wrote:
    Hello GHC devs,

    I'm currently working on Cmm documentation and tooling
    improvements as part of my Google Summer of Code project. One of
    my core goals is to make Cmm roundtrip serializable.

    Right now, the in-memory Cmm data structure—generated
    programmatically (e.g., from STG via GHC)—can be pretty-printed,
    and Cmm can also be parsed. However, the pretty-printed version
    is not compatible with the parser. That is, we cannot take the
    output of the pretty printer and feed it directly back into the
    parser.

    Example:

    Parseable version:

    sum {
      cr:
          bits64 x;
          x = R1 + R2;
          R1 = x;
          jump %ENTRY_CODE(Sp(0))[R1];
    }

    Pretty-printed version:

    sum() { // []
            { info_tbls: []
              stack_info: arg_space: 8
            }
        {offset
          cf: // global
              _ce::I64 = R1 + R2;
              R1 = _ce::I64;
              call (I64[Sp + 0 * 8])(R1) args: 8, res: 0, upd: 8;
        }
    }

    Another example:

    Parseable version:

    simple_sum_4 { // [R2, R1]
      cr: // global
          bits64 _cq;
          _cq = R2;
          bits64 _cp;
          _cp = R1;
          R1 = _cq + _cp;
          jump (bits64[Sp])[R1];
    }

    Pretty-printed version:

    simple_sum_4() { // []
            { info_tbls: []
              stack_info: arg_space: 8
            }
        {offset
          cs: // global
              _cq::I64 = R2;
              _cr::I64 = R1;
              R1 = _cq::I64 + _cr::I64;
              call (I64[Sp])(R1) args: 8, res: 0, upd: 8;
        }
    }

    While it’s possible to write parseable Cmm that resembles the
    pretty-printed version (and hence the internal ADT), they don’t
    fully match—mainly because the parser inserts inferred fields
    using convenience functions.

    Proposal:

    To make roundtrip serialization possible, I propose supporting a
    new syntax that matches the pretty printer output exactly.

    There are a couple of design options:

    1. Create a separate parser that accepts the pretty-printed
       syntax. Files could then use either the current parser or the
       new strict one.

    2. Extend the current parser with a dedicated block syntax like:

       |low_level_unwrapped { ... } |

    This second option is the one my mentor recommends, as it may
    better reflect GHC developers' preferences. In this mode, the
    parser would not insert any inferred data and would expect the
    input to match the pretty-printed form exactly.

    This would enable a true roundtrip:

     * Compile Haskell to Cmm (in-memory AST)

     * Pretty-print and write it to disk (wrapped in
       low_level_unwrapped { ... })

     * Later read it back using the parser and continue with codegen

    Optional future direction:

    As a side note: currently the parser has both a “high-level” and
    a “low-level” mode. The low-level mode resembles the AST more
    closely but still inserts some inferred data.

    If we introduce this new “exact” low-level form, it's possible
    the existing low-level mode could become redundant. We might then
    have:

     * High-level syntax

     * New low-level (exact)

     * And possibly deprecate the current low-level variant

    I’d be interested in your thoughts on whether that direction
    makes sense.

    Serialization libraries?

    One technically possible—but likely unacceptable—alternative
    would be to derive serialization via a library like |aeson|. That
    would enable serializing and deserializing the Cmm AST directly.
    However, I understand that |aeson| adds a large dependency
    footprint, and likely wouldn't be suitable for inclusion in GHC.

    Final question:

    Lastly—I’ve heard that parts of the Cmm pipeline may currently be
    under refactoring. If that’s the case, could you point me to
    which parts (parser, pretty printer, internal representation,
    etc.) are being modified? I’d like to align my efforts
    accordingly and avoid conflicts.

    Thanks very much for your time and input! I'm happy to iterate on
    this based on your feedback.

    Best regards,
    Diego Antonio Rosario Palomino
    GSoC 2025 – Cmm Documentation & Tooling


    _______________________________________________
    ghc-devs mailing list
    [email protected]
    http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
    --
    Hécate ✨
    🐦: @TechnoEmpress
    IRC: Hecate
    WWW:https://glitchbra.in
    RUN: BSD



--
Hécate ✨
🐦: @TechnoEmpress
IRC: Hecate
WWW:https://glitchbra.in
RUN: BSD
