Re: [PATCH] [RFC] Delayed parsing for bounds safety attributes

Aaron Ballman Thu, 24 Jul 2025 09:26:41 -0700

On Thu, Jul 24, 2025 at 3:03 PM Martin Uecker <ma.uec...@gmail.com> wrote:
>
> Am Donnerstag, dem 24.07.2025 um 14:08 +0000 schrieb Aaron Ballman:
> > On Wed, Jul 23, 2025 at 8:38 PM Martin Uecker <ma.uec...@gmail.com> wrote:
> > >
> ...
>
> >
> > > > That said, John McCall pointed out some usage patterns Apple has with
> > > > their existing feature:
> > > >
> > > > * 655 simple references to variables or struct members: 
> > > > __counted_by(len)
> > > > * 73 dereferences of variables or struct members: __counted_by(*lenp)
> > > > * 80 integer literals: __counted_by(8)
> > > > * 60 macro references: __counted_by(NUM_EIGHT) [1]
> > > > * 9 simple sizeof expressions: __counted_by(sizeof(eight_bytes_t))
> > > > * 28 others my script couldn’t categorize:
> > > >   * 7 more complicated integer constant expressions:
> > > > __counted_by(num_bytes_for_bits(NUM_FIFTY_SEVEN)) [2]
> > > >   * 16 arithmetically-adjusted references to a single variable or
> > > > struct member: __counted_by(2 * len + 8)
> > > >   * 1 nested struct member: __counted_by(header.len)
> > > >   * 4 combinations of struct members: __counted_by(len + cnt) [3]
> > > >
> > > > Do the Linux kernel folks think this looks somewhat like what their
> > > > usage patterns will be as well? If so, I'd like to argue for my
> > > > personal stake in the ground: we don't need any new language features
> > > > to solve this problem, we can use the existing facilities to do so and
> > > > downscope the initial feature set until a better solution comes along
> > > > for forward references. Use two attributes: counted_by (whose argument
> > > > specifies an already in-scope identifier of what holds the count) and
> > > > counts (whose argument specifies an already in-scope identifier of
> > > > what it counts). e.g.,
> > > > ```
> > > > struct S {
> > > >   char *start_buffer;
> > > >   int start_len __counts(start_buffer);
> > > >   int end_len;
> > > >   char *end_buffer __counted_by(end_len);
> > > > };
> > > >
> > > > > void func(char *start_buffer, int N __counts(start_buffer), int M,
> > > > >           char *end_buffer __counted_by(M));
> > > > ```
> > > > It's kind of gross to need two attributes to do the same notional
> > > > thing, but it does solve the vast majority of the usages seen in the
> > > > wild if you're willing to accept some awkwardness around things like:
> > > > ```
> > > > struct S {
> > > >   char *buffer;
> > > >   int *len __counts(buffer); // Note that len is a pointer
> > > > };
> > > > ```
> > > > because we'd need the semantics of `counts` to include dereferencing
> > > > to the `int` in order to be a valid count. We'd be sacrificing the
> > > > ability to handle the "others my script couldn't categorize", but
> > > > that's 28 out of the 905 total cases and maybe that's acceptable?
> > >
> > > So what do you think about the solution Qing mentioned:
> > >
> > > struct {
> > >   char *buf __counted_by_expr(int len; len + 7);
> > >   int len;
> > > };
> > >
> > > which would be very flexible and support all possible use cases
> > > and has no parsing or semantic interpretation issues.
> >
> > Personally, I'm not excited by it because one of the big sticking
> > points in the Clang community is shared header files with C++. Because
> > these attributes are used on structures and functions, the two most
> > common things you'll find in a shared header file, we *really* want
> > the feature to be workable in both languages to the greatest extent
> > possible. And once we care about C++, things get so much harder due to
> > the extra complexity it brings. So, for example, we'd have to figure
> > out how to handle things like:
> > ```
> > template <typename Ty>
> > struct S {
> >   char *buffer __counted_by_expr(Ty len; len + 7);
> >   int len; // Oooooops
> > };
> >
> > template <typename Ty, typename Uy>
> > struct T {
> >   char *buffer __counted_by_expr(Ty len; len + 7);
> >   Uy len; // Grrrrr
> > };
> > ```
>
> For my understanding: What is the problem here?  It would be an
> error if the declared type of len is inconsistent between the
> attribute and the type that comes later in the member. I guess
> a compiler could also warn already when it sees a template
> like this where it refers to different template arguments.
>
> But then, templates also certainly do not appear in shared headers,
> so I am not sure why Clang could not simply offer both,
> a late-parsing version and also a C-compatible __counted_by_expr.
>
> I can understand if this is not your first choice,  but it seems
> to be a reasonable compromise to me.


Ah, apologies, I wasn't clear. My thinking is: we're (Clang folks)
going to want it to work in C++ mode because of shared headers. If it
works in C++ mode, then we have to figure out what it means with all
the various C++ features that are possible, not just the use cases
that appear within a shared header. That's how I got onto templates as
just one example of where we'd have to figure out what the behavior
should be. Other questions come up outside of templates as well (use
in default arguments, lambda captures, use of types with conversion
operators, etc). I think some of these questions may even apply in C.
Like... is this valid?

void func(char *buffer [[counted_by(int N[12]; N[10])]], int N[8]);

Or this?

void func(char *buffer [[counted_by(int N[*]; N[10])]], int N[*]);
/* Can't usually spell int N[*] outside of a parameter list, but this
   is a parameter list of sorts, so maybe it's fine? */

I'm not saying it's not a reasonable compromise, FWIW. More just... I
think allowing for two declarations increases the complexity of the
feature in ways that aren't well understood yet.
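
To make the C side of that question concrete, here is a minimal sketch in standard C only, with no new attribute syntax. The names `read_tenth` and `read_tenth_via_pointer` are hypothetical, picked for illustration. It shows that array bounds on parameters are already not part of the function type, so two declarations that disagree on the bound are accepted today (some compilers may warn):

```c
#include <assert.h>

/* Two prototypes with different array bounds. In C, an array parameter
   is adjusted to a pointer, so both declare the same function type. */
int read_tenth(int N[12]);
int read_tenth(int N[8]);

/* The definition only ever sees a pointer; nothing in the type system
   checks index 10 against the 8 or 12 written in the prototypes. */
int read_tenth(int *N)
{
    return N[10];
}

/* The adjusted parameter type is also what a function pointer sees. */
int read_tenth_via_pointer(int *values)
{
    int (*fp)(int *) = read_tenth;
    return fp(values);
}
```

So whatever rule is picked for a repeated declaration inside the attribute, it would have to say whether `int N[12]` and `int N[8]` count as "the same" declaration, given that the language itself treats them as compatible.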

> > I think it's possible to handle these situations, but we'd have to sit
> > down and think through all the edge cases and whether we can handle
> > them all with some reasonable QoI. I think we'd ultimately run into
> > the same concerns here as we ran into with forward declared
> > parameters. I think the reason folks in Clang are more comfortable
> > with late parsing is because it means the user doesn't have to repeat
> > the type and the name, which means fewer chances for the user to
> > screw things up and get into these weird situations. There can be
> > other weird situations with late parsing too, of course, but I think
> > the scope of those edge cases is a bit narrower.
>
> TBH, I am not terribly convinced by this argument.
>
> If I understood it correctly, the late parsing design seems to make
> no distinction as to which identifier is used, the local or
> the global one, and just prefers the local one if it exists, possibly
> giving a warning if there is also a global one.

I think I'd describe it as following typical lexical scoping behavior
-- the closest declaration of the identifier is what's found by the
lookup. But in the event that causes a different lookup result from
what the current standard behavior would give you, it should be
diagnosed. Personally, I'd feel most comfortable if that diagnostic
was a warning which defaults to an error; basically, make the user
decide how to handle it on a case by case basis but "standard behavior
wins" if you disable the diagnostic.
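
As a minimal sketch of the case that would need the diagnostic, consider standard C as it exists today (compile as C, not C++). The names `len`, `struct packet`, and `packet_buf_size` are hypothetical. Struct members live in their own name space in C, so the identifier in the array bound below resolves to the file-scope constant, not to the member declared just above it:

```c
#include <assert.h>
#include <stddef.h>

enum { len = 4 };      /* ordinary identifier at file scope */

struct packet {
    int len;           /* member: separate name space, hides nothing */
    char buf[len];     /* standard C lookup finds the enum constant: 4 */
};

/* Reports the size the compiler actually chose for buf. */
size_t packet_buf_size(void)
{
    return sizeof(((struct packet *)0)->buf);
}
```

A "closest declaration wins" lookup inside an attribute argument on `buf` would pick the member `len` instead, which is the kind of differing lookup result that should be diagnosed.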

> My C++ examples show that you can easily run into UB here in C++,
> especially since subtly different rules apply in different but very
> similar scenarios. How can this not be error prone?
>
> The forward declaration, the [.N] syntax, and also __self__ etc.
> would all make explicit which identifier is meant.

I think they come with tradeoffs but so far, everything seems to be
error prone in different ways. :-(

> > The other downside is that we have more attributes that need to
> > support something similar, like the thread safety attributes (which I
> > believe is also an important use case for the Linux kernel folks?). We
> > could do this dance on a per-attribute basis, but if the solution
> > worked for all attributes *and* array extents at the same time, that
> > would be nice. Not certain it's a requirement though.
>
> True.  But if it is to work properly for arrays in C too, then the
> C constraints are also important IMHO, not just the C++ rules.

Agreed! I don't think "C++ rules should win, that's the end of the
discussion" is tenable; I think it's more that we need to handle all
the oddities of both languages otherwise we're going to end up with
something like C99's array parameter features that didn't get adopted
into C++ (e.g., [static 12] or [*], etc). I think that hampered their
adoption in the wild at least in part because of the cross-language
issues.
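
For readers who have not run into those C99 features, a minimal sketch follows. The function names `sum12` and `sum_n` are hypothetical. This is valid C99 and later, but it will not compile as C++, which is exactly the shared-header problem:

```c
#include <assert.h>

/* C99: `static 12` tells the compiler the caller passes at least
   12 elements. */
int sum12(const int values[static 12]);

/* C99: `[*]` denotes a variable length array of unspecified size;
   it is only allowed in a prototype, not in a definition. */
int sum_n(int n, const int values[*]);

int sum12(const int values[static 12])
{
    int total = 0;
    for (int i = 0; i < 12; ++i)
        total += values[i];
    return total;
}

/* The definition spells out the bound that `[*]` left unspecified. */
int sum_n(int n, const int values[n])
{
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += values[i];
    return total;
}
```

A header containing either prototype cannot be included from a C++ translation unit as written.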

> > > The thing is that WG14 had (weak) consensus for parameter
> > > forward declarations and I think more consensus for [.N]
> > > syntax in structures already.  So I had hoped that we will be
> > > able to make progress on this.
> >
> > Question on the .N syntax: I thought I heard that this was something
> > GCC could handle, but that it still requires late parsing to ensure
> > type information for N is available and that was a problem. e.g.,
> >
> > void func(char *buffer __counted_by(.N * sizeof(.N)), int N);
>
> >
> > where we'd need to know both the name and the type. Am I wrong about
> > that being a challenge for GCC to support?
>
> I think it is generally a challenge to support.

Thanks for the confirmation!

> One could certainly
> store away the tokens and parse them later (this is certainly doable),
> but it adds a lot of issues because you need to add a lot of constraints
> for things which should then not be allowed.  And it is still not an
> acceptable solution for size arguments in C.

Yeah, that's basically the same gripe I have about having forward
declarations; we have to figure out all the weird edge cases and what
constraints are necessary to have decent QoI.

> .N would work here if you combine it with a rule such as ".N" is always
> converted to "size_t".   Or you require an explicit cast if it is different
> from "size_t".

I think converting .N to size_t could work if the feature was limited
to just bounds safety, but it would be unusable for things like thread
safety attributes where the argument needs to be some kind of
mutex-like object. But even for bounds safety, I think we end up with
really weird behavior:

void func(char *buffer __counted_by(sizeof(.N) /* sizeof(size_t) */),
          typeof(.N) x /* size_t x */, int N /* wait, what? */);
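
The mismatch can be sketched in standard C, with an explicit cast standing in for the hypothetical ".N is always converted to size_t" rule. The function names are hypothetical, and no `.N` syntax is used since no compiler is assumed to implement it:

```c
#include <assert.h>
#include <stddef.h>

/* What sizeof reports for the parameter as declared. */
size_t size_as_declared(int N)
{
    return sizeof(N);            /* sizeof(int) */
}

/* What sizeof would report if a reference to N were always converted
   to size_t first (modelled here with an explicit cast). */
size_t size_if_converted(int N)
{
    return sizeof((size_t)N);    /* sizeof(size_t) */
}
```

On a typical LP64 target these return 4 and 8, so the same spelling would mean different things depending on whether it appears inside or outside the attribute argument.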

If we were to require explicit casts... I think we're back to the same
thing as having forward declarations in terms of concerns because
that's what opens up the possibility of type mismatches.

I guess the way I see it is:

* If there's only one declaration involved (late parsing approach), then
  there's potential name lookup confusion which I think is worse in C
  than it is in C++.
* If there's more than one declaration involved (any kind of forward
  declaration syntax, use of .N with explicit casts, etc), then there's
  potential type confusion which I think is worse in C++ than it is in
  C.

Either one is going to need constraints to help with the confusion. I
don't know which one ends up with fewer constraints or more readable
code.

~Aaron

>
>
> Martin
>
>
> > If so, I think it may be
> > plausible in Clang (implementation-wise, if we can handle late parsing
> > without the dot, it sure seems like handling it with the dot won't be
> > any harder). Whether the community will go for it or not, I'm not
> > certain, but if GCC can support it, I can try to sell it to Clang
> > folks as a good compromise.
> >
> > ~Aaron
>
