Re: Draft GSoC 2026 Proposal: CPython API Checking, PR 107646

David Malcolm via Gcc Sun, 15 Mar 2026 17:54:51 -0700

On Sun, 2026-03-15 at 12:23 +0530, Saksham Gupta wrote:
> Hi David,
> 
> I’ve attached the draft of my GSoC proposal for the CPython API checker. I
> haven't submitted it to the official portal yet—I wanted to run it by
> you first to catch any mistakes and make sure the technical direction
> actually makes sense.
> 
> I made sure to include your recent advice. The scope now explicitly
> targets Python 3.11+ to handle the PEP 683 changes. My Compile Farm
> account (am-saksham) is also fully set up, so I added that to the
> testing strategy, along with a quick example of handling CFG
> bifurcation for PyList_New failures.
>


Hi Saksham

> If you have a few minutes next week, I’d love your brutal honesty on this.

Challenge accepted :)

One thing that might not be mentioned yet on the wiki page is that the
existing plugin is the result of a previous GSoC project (by Eric Feng,
in 2023):
https://summerofcode.withgoogle.com/archive/2023/projects/EzIUWs5x
https://gist.github.com/efric/9faa9cb19fe829b97a54d5c7eabf5e72

(I've added a link to the wiki)

You should update the wording of your proposal to mention this (and
e.g. how 3.11 broke the old code).

Re: 1. Abstract; probably worth noting that there are multiple ways to
interface CPython with C: using libffi, using a binding generator (such
as Cython), or writing C by hand.  This project is focusing on the
"writing C by hand" case, but we don't recommend people use this
approach; this is more about supporting legacy code.

Re 2. Motivation & Background:

"Crucially, the analysis will explicitly target CPython 3.11+ headers
as a baseline. This ensures accurate struct layouts,":  a nitpick: note
that we don't want to have to care about precise in-memory layouts,
GCC's C frontend does this for us; what we care about is what fields
there are and what their types are.  The region_model/store.cc code
does track things in terms of bit offsets, so we'll see those when
debugging, but the plugin should be written in terms of types and
fields.

"this project will integrate Python-specific domain knowledge directly
into the analyzer core."  Really?  I was thinking that it's best to
keep this as a plugin, albeit an in-tree plugin.

"Crucially, the analysis will explicitly target CPython 3.11+ headers 
as a baseline."  note that there have been other recent changes beyond
PEP 683 as CPython developers have tried to optimize more aggressively
than in the past (e.g. for JIT compilation).  The most recent release
is 3.14, and that might well have other changes that the plugin needs
to be aware of.  The ideal would be to support a wide range of 3.*
headers, but it's good to pick one and get that working first, to avoid
getting swamped by compatibility concerns.

"Illustrative Example: The Silent Leak": looks good.

Re 3.2. Phase 2: Implementing the Reference Count State Machine:
Your implementation plan is rather different to what we tried before,
in that you're proposing using a state_machine subclass to associate
state with a pointer.  What we tried in 2023 was to count the number of
pointers being stored pointing at each PyObject, and then compare
against the ob_refcnt, and complain at certain points when they got
out-of-sync (e.g. when the stack frame is popped).  This was working
purely with the region_model/state code and didn't need a new
state_machine.  That approach did seem to work with the pre-PEP-683
implementation, but IIRC Eric got stuck spending a lot of his time on
PyList_Append, and thus we only got a tiny subset of the API covered -
but it did work.  Py_INCREF and Py_DECREF are typically macros, and so
by the time the analyzer "sees" the user's code, all we see are
reference count increments, decrements, and conditionals, and this is
captured for us in the store by the region_model code; I think it would
be hard to implement using a state_machine (though maybe I'm wrong).
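
To illustrate the 2023 approach in isolation, here is a minimal sketch
in plain C.  It uses a mock object type rather than the real CPython
headers; mock_object, MOCK_INCREF, count_pointers_to and refcnt_in_sync
are hypothetical names, not part of CPython or the plugin.  The macro
expands to a bare increment, which is all the analyzer would see, and
the check compares ob_refcnt against the number of pointers known to
refer to the object:

```c
#include <stddef.h>

/* Mock stand-in for PyObject (an assumption for illustration): only the
   reference count field matters here.  */
typedef struct mock_object
{
  long ob_refcnt;
} mock_object;

/* Roughly the pre-PEP-683 shape of Py_INCREF once expanded: a plain
   increment of a field, with no function call left to intercept.  */
#define MOCK_INCREF(op) ((op)->ob_refcnt++)

/* Count how many of the N slots in PTRS point at OBJ.  */
static long
count_pointers_to (mock_object *const *ptrs, size_t n,
                   const mock_object *obj)
{
  long count = 0;
  for (size_t i = 0; i < n; i++)
    if (ptrs[i] == obj)
      count++;
  return count;
}

/* Return nonzero if OBJ's reference count matches the number of
   pointers to it found in PTRS.  */
static int
refcnt_in_sync (mock_object *const *ptrs, size_t n,
                const mock_object *obj)
{
  return obj->ob_refcnt == count_pointers_to (ptrs, n, obj);
}
```

Storing a second pointer to an object without a matching MOCK_INCREF
makes refcnt_in_sync return zero; that is the kind of mismatch to
complain about when, say, a stack frame is popped.  The real check would
be done symbolically on the analyzer's store rather than on concrete
arrays like this.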

Note that there's huge amounts of repetition in the API (e.g.
"succeeds, returning a new reference, or fails, returning null" is a
very common pattern).  So please make plenty of use of helper
subroutines, or the attributes idea described on the project wiki page.

re "DejaGnu Regression Suite": re "the ascii-art execution paths": note
that these tests tend to be "brittle", so we don't want many tests
expressed this way, if any at all - dg-warning and dg-message tend to
be much more robust.
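
For comparison, a dg-warning/dg-message style test might look like the
following minimal sketch.  It uses a plain malloc leak as reported by
-fanalyzer rather than a CPython diagnostic, since the plugin's message
wording isn't settled here; the function name "leaky" is hypothetical,
and the placement of the directives is an assumption that may need
adjusting against real compiler output:

```c
/* { dg-do compile } */
/* { dg-options "-fanalyzer" } */

#include <stdlib.h>

int
leaky (void)
{
  char *buf = malloc (16); /* { dg-message "allocated here" } */
  if (!buf)
    return -1;
  buf[0] = '\0';
  /* buf is never freed and goes out of scope at the return.  */
  return 0; /* { dg-warning "leak of 'buf'" } */
}
```

Each directive matches a substring of one diagnostic on one line, so the
test keeps passing if the rendering of the execution path changes.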

re "5. Timeline & Milestones (350 Hours)": I suggest dropping the
mentions of the state_machine approach, which implies rewriting this
section.  I like the idea of building up a suite of buggy
extensions.  You'll want most of them to be as simple as possible,
along with some larger examples for "integration testing".  I recommend
early on categorizing the API into the various patterns of
ownership/borrowing/stealing etc, and identifying examples of each, and
trying a simple example of each early on, to verify that the overall
approach will work on all the cases.
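
As a rough sketch of what such a categorization could look like, here is
a small lookup table in plain C.  The enum, struct and function names
(ref_pattern, api_entry, classify_api) are hypothetical and not part of
the existing plugin; the entries reflect the documented reference
behaviour of those CPython functions:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical categories of reference behaviour.  */
enum ref_pattern
{
  REF_UNKNOWN,
  REF_RETURNS_NEW_OR_NULL, /* new reference on success, NULL on failure */
  REF_RETURNS_BORROWED,    /* caller does not own the result */
  REF_STEALS_ARG           /* takes over the caller's reference to an arg */
};

struct api_entry
{
  const char *name;
  enum ref_pattern pattern;
};

/* A few representative entries per category; a real table would be
   far longer.  */
static const struct api_entry api_table[] = {
  { "PyList_New", REF_RETURNS_NEW_OR_NULL },
  { "PyLong_FromLong", REF_RETURNS_NEW_OR_NULL },
  { "PyList_GetItem", REF_RETURNS_BORROWED },
  { "PyDict_GetItem", REF_RETURNS_BORROWED },
  { "PyList_SetItem", REF_STEALS_ARG },
  { "PyTuple_SetItem", REF_STEALS_ARG },
};

/* Look up NAME in the table; REF_UNKNOWN if it is not listed.  */
static enum ref_pattern
classify_api (const char *name)
{
  for (size_t i = 0; i < sizeof (api_table) / sizeof (api_table[0]); i++)
    if (strcmp (api_table[i].name, name) == 0)
      return api_table[i].pattern;
  return REF_UNKNOWN;
}
```

With a table like this, one simple test case per category is enough to
check the overall approach early on.  PyList_Append, unlike
PyList_SetItem, does not steal its argument, so it would need a further
category.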

I don't like "strict formatting to GNU coding standard" being done at
the end.  Better to set up your editor early on to adhere to these, and
then have this happen throughout.  IIRC we have a .editorconfig file,
so this should be trivial.  So this should be in the "community
bonding" phase.

The other thing you might like to try is some of the other subprojects
within https://gcc.gnu.org/wiki/StaticAnalyzer/CPython ; some of these
are relatively easy compared to reference count checking, e.g.
"Verification of PyMethodDef tables" and "Checking arguments of "call"
calls" (though note the word "relatively" here).

Hope this makes sense; let me know if you have questions.  I need to
move on, but note I may have missed some things, so consider running an
update past me.

Dave


> I really want to make sure my plan for the state machine over GIMPLE
> aligns with the new class api. If my approach is off base anywhere,
> please let me know so I can rewrite it before the deadline.
> 
> Working on this project is my absolute top priority right now, so I'm
> ready to iterate on this draft as much as needed to get it right.
> 
> Thanks again for the atoi patch review earlier this week!
> 
> Best,
> Saksham Gupta
