On Thu, 2026-02-19 at 16:50 +0530, Ridham Khurana wrote:
> Hi Dave

Hi Ridham.

Thanks for taking this discussion to the mailing list.

For reference of other people, this is the GSoC proposal relating to PR
107017 (adding format string support to -fanalyzer).

> Thanks for the clarifications and the detailed pointers. Here is my
> understanding of the design architecture and progress so far:
> 
> I have been going through both files c-format.cc and gimple-ssa-
> sprintf.cc,
> and I think I understand the structure/architecture you are
> suggesting. So,
> the idea is to introduce a format_parser class (that you mentioned in
> last
> email), that takes the raw format string as input and breaks it into
> different but sequential components, for example literal text
> segments or
> directives like %d or %f. I have been calling this sequential
> structure the
> "action-map", just to address it with a name. The main point of this
> shared
> approach is to avoid having the frontend, GIMPLE and
> analyzer(upcoming) to
> run their own logic to decode it, but use this common parser, and
> instead
> have them all start from the same parsed action-map. 

(nods)

Maybe an "action_list" rather than "action_map", given that it has a
sequence?  (but that risks a "what color shall we paint the bikeshed"
discussion)

> The
> "format_string_iterator" will then provide us an easy and consistent
> way
> for different files to walk through the action-map, at the same time
> make
> sure the internals of the action-map are not exposed. 

(nods)

> From this, every
> subsystem that requires going through the action-map can use it with
> its
> own logic as required, like argument checking in frontend, sprintf
> folding
> in GIMPLE, and mainly modelling memory-effects in analyzer, while
> keeping
> all the existing semantic algorithms in their own respective files.

A simple way of doing this might be for the shared code to generate a
std::list<format_action> (or a std::vector), which would be your
action-map, but another approach might be to eschew an object for the
action-map, and instead have format_string_iterator provide "actions"
on demand as clients iterate through the actions, something like:

class format_string_iterator
{
public:
   bool done_p () const;
   void next ();
   format_string_action get_next_action () const;

private:
   /* implementation-specific details here */
};

or somesuch.

I'm not sure which approach is better.

IIRC, the current implementations in both c-format.cc and gimple-ssa-
sprintf.cc are structured around iterations, rather than reusing an
existing container and populating/iterating over that.

> 
> One more thing that I was thinking is, where should the shared
> format_parser and format_string_iterator classes live in the tree?
> Right
> now, c-format.cc is under gcc/c-family/, gimple-ssa-sprintf.cc is
> directly
> under gcc, and the analyzer is under gcc/analyzer/. Since the shared
> code
> needs to be accessible for all three, I am guessing it should go
> directly
> under gcc, but I wanted to confirm if you have some other approach to
> deal
> with this before I start thinking about the layout.

I think directly under the gcc/ directory, as part of gcc/Makefile.in's
OBJs.

Maybe we should introduce a gcc::format_string namespace for the shared
code to live in???

One thing to remember is that patches need to be reviewable - it's good
to avoid a huge patch that deletes a big chunk of code and replaces it
with a different other chunk of code.  So it might be best to build
this up as a series of patches, expressing refactorings in c-format.cc
that eventually move the new classes elsewhere (given that IIRC c-
format.cc currently got the most complicated logic).  But maybe we
don't need to worry about that during prototyping/experimentation.

A lot of the complexity in c-format.cc is that if we emit a warning, we
have to do some work to generate source-location information for the
parts of the format string of interest, but we don't want to do that
work unless it's going to be needed (e.g. finding the exact location of
a "%i").  So we'll want the action-map to keep that deferral of work,
so that we only do it if we issue a diagnostic relating to part of the
format string.

> On the region_model vs sm_state_map distinction, I think I follow
> what you
> mean. I am looking at how both appear in the .dot dumps (e.g. the
> rmodel:
> part showing memory state, and the malloc: part showing checker
> states like
> unchecked → freed). I am still going through the callbacks to
> understand
> how the two sides connect, but I will keep this in mind as I continue
> studying more about the analyzer internals.

(nods)

> 
> My next step is to continue my deep dive into both implementations
> and
> start planning out how the shared parser API could look, depending on
> where
> it will live.

Sounds great.

Let me know if you have any questions.

Thanks
Dave

Reply via email to