On Mon, 16 Mar 2026 17:01:11 -0600 Jonathan Corbet <[email protected]> wrote:
> On Thu, 12 Mar 2026 15:54:25 +0100, Mauro Carvalho Chehab
> <[email protected]> wrote:
>
> > Handling C code purely using regular expressions doesn't work well.
> >
> > Add a C tokenizer to help doing it the right way.
> >
> > The tokenizer was written using as basis the Python re documentation
> > tokenizer example from:
> > https://docs.python.org/3/library/re.html#writing-a-tokenizer
> >
> > Signed-off-by: Mauro Carvalho Chehab <[email protected]>
> > Message-ID: <c63ad36c81fe043e9e33ca55630414893f127413.1773074166.git.mchehab+hua...@kernel.org>
> > Message-ID: <8541ffa469647db1a7154f274fb2d55b4c127dcb.1773326442.git.mchehab+hua...@kernel.org>
>
> This is a combined effort to review this patch and to try out "b4 review",
> we'll see how it goes :).
>
> > diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
> > index 085b89a4547c0..7bed4e9a88108 100644
> > --- a/tools/lib/python/kdoc/kdoc_re.py
> > +++ b/tools/lib/python/kdoc/kdoc_re.py
> > @@ -141,6 +141,240 @@ class KernRe:
> > [ ... skip 4 lines ... ]
> > +
> > +    @staticmethod
> > +    def __str__(val):
> > +        """Return the name of an enum value"""
> > +        return TokType._name_by_val.get(val, f"UNKNOWN({val})")
> > +
>
> What is this class supposed to do?

This __str__() method ensures that, when printing a CToken object, the
name will be displayed, instead of a number. This is really useful when
debugging.
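For reference, the name-lookup trick can be reproduced standalone. This
is a minimal sketch, assuming an enum-like class along the lines of the
patch; the class name TokType is from the patch, but the name() helper
and the member set here are illustrative, not the patch's exact API:

```python
class TokType:
    """Enum-like token kinds, stored as plain ints for fast comparison."""
    COMMENT = 0
    NAME = 1
    NUMBER = 2

    # Reverse map: integer value -> symbolic name. vars() here is the
    # class namespace built so far, so only the int members are kept.
    _name_by_val = {v: k for k, v in dict(vars()).items()
                    if isinstance(v, int)}

    @staticmethod
    def name(val):
        """Return the symbolic name of an enum value, for debug output."""
        return TokType._name_by_val.get(val, f"UNKNOWN({val})")


print(TokType.name(TokType.NUMBER))  # NUMBER
print(TokType.name(99))              # UNKNOWN(99)
```

The dict comprehension works because its iterable, dict(vars()), is
evaluated in the class-body scope, capturing the members defined above
it.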
See, if I add a print:

<snip>
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -87,6 +87,7 @@ def trim_private_members(text):
     """

     tokens = CTokenizer(text)
+    print(tokens.tokens)

     return str(tokens)
</snip>

the tokens will appear as names at the output:

$ ./scripts/kernel-doc -none er.c
[CToken(CToken.ENUM, "enum", 0, (0, 0, 0)),
 CToken(CToken.SPACE, " ", 4, (0, 0, 0)),
 CToken(CToken.NAME, "dmub_abm_ace_curve_type", 5, (0, 0, 0)),
 CToken(CToken.SPACE, " ", 28, (0, 0, 0)),
 CToken(CToken.BEGIN, "{", 29, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 30, (0, 0, 1)),
 CToken(CToken.COMMENT, "/** * ACE curve as defined by the SW layer. */", 31, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 86, (0, 0, 1)),
 CToken(CToken.NAME, "ABM_ACE_CURVE_TYPE__SW", 87, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 109, (0, 0, 1)),
 CToken(CToken.OP, "=", 110, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 111, (0, 0, 1)),
 CToken(CToken.NUMBER, "0", 112, (0, 0, 1)),
 CToken(CToken.PUNC, ",", 113, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 114, (0, 0, 1)),
 CToken(CToken.COMMENT, "/** * ACE curve as defined by the SW to HW translation interface layer. */", 115, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 198, (0, 0, 1)),
 CToken(CToken.NAME, "ABM_ACE_CURVE_TYPE__SW_IF", 199, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 224, (0, 0, 1)),
 CToken(CToken.OP, "=", 225, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 226, (0, 0, 1)),
 CToken(CToken.NUMBER, "1", 227, (0, 0, 1)),
 CToken(CToken.PUNC, ",", 228, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 229, (0, 0, 1)),
 CToken(CToken.END, "}", 230, (0, 0, 0)),
 CToken(CToken.PUNC, ";", 231, (0, 0, 0))]

> > [ ... skip 27 lines ... ]
> > +    _name_by_val = {v: k for k, v in dict(vars()).items()
> > +                    if isinstance(v, int)}
> > +
> > +    # Dict to convert from string to an enum-like integer value.
> > +    _name_to_val = {k: v for v, k in _name_by_val.items()}
> > +
> > +    @staticmethod
>
> This stuff strikes me as a bit overdone; _name_to_val is really just the
> variable list for the class, right?

Those two vars are a kind of magic: they create two dictionaries:

- _name_by_val converts a token integer into a string;
- _name_to_val converts a string into an integer.

I opted to use this approach for a couple of reasons:

1. using tok.kind == "BEGIN" (and similar) everywhere is harder to
   maintain, as Python won't check for typos. Now, if one writes
   CToken.BEGHIN, an error will be raised;

2. the cost to convert from string to int is O(1), so it is not much of
   a performance issue at the conversion;

3. using an integer on all checks should make the code faster, as it
   doesn't require a loop to check the string.

> > [ ... skip 30 lines ... ]
> > +                f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
> > +
> > +#: Tokens to parse C code.
> > +TOKEN_LIST = [
> > +    (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
> > +
>
> So these aren't "tokens", this is a list of regexes; how is it intended
> to be used?

Right. I could have named it better, like RE_TOKEN_LIST or TOKEN_REGEX,
as in the original example, which came from the Python documentation
for the "re" module:

    https://docs.python.org/3/library/re.html#writing-a-tokenizer

Basically, we have the token type as the first element of the tuple and
the regex as the second one. When the regex matches, the CToken will be
filled with kind=tuple[0].

The loop:

    for match in re.finditer(TOKEN_LIST, code):

will parse the entire C source code, in order, converting it into a
token list. So, a file like this:

    /**
     * enum dmub_abm_ace_curve_type - ACE curve type.
     */
    enum dmub_abm_ace_curve_type {
            /**
             * ACE curve as defined by the SW layer.
             */
            ABM_ACE_CURVE_TYPE__SW = 0,

            /**
             * ACE curve as defined by the SW to HW translation interface layer.
             */
            ABM_ACE_CURVE_TYPE__SW_IF = 1,
    };

will become (I used pprint here to better align the tokens):

[CToken(CToken.ENUM, "enum", 0, (0, 0, 0)),
 CToken(CToken.SPACE, " ", 4, (0, 0, 0)),
 CToken(CToken.NAME, "dmub_abm_ace_curve_type", 5, (0, 0, 0)),
 CToken(CToken.SPACE, " ", 28, (0, 0, 0)),
 CToken(CToken.BEGIN, "{", 29, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 30, (0, 0, 1)),
 CToken(CToken.COMMENT, "/** * ACE curve as defined by the SW layer. */", 31, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 86, (0, 0, 1)),
 CToken(CToken.NAME, "ABM_ACE_CURVE_TYPE__SW", 87, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 109, (0, 0, 1)),
 CToken(CToken.OP, "=", 110, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 111, (0, 0, 1)),
 CToken(CToken.NUMBER, "0", 112, (0, 0, 1)),
 CToken(CToken.PUNC, ",", 113, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 114, (0, 0, 1)),
 CToken(CToken.COMMENT, "/** * ACE curve as defined by the SW to HW translation interface layer. */", 115, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 198, (0, 0, 1)),
 CToken(CToken.NAME, "ABM_ACE_CURVE_TYPE__SW_IF", 199, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 224, (0, 0, 1)),
 CToken(CToken.OP, "=", 225, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 226, (0, 0, 1)),
 CToken(CToken.NUMBER, "1", 227, (0, 0, 1)),
 CToken(CToken.PUNC, ",", 228, (0, 0, 1)),
 CToken(CToken.SPACE, " ", 229, (0, 0, 1)),
 CToken(CToken.END, "}", 230, (0, 0, 0)),
 CToken(CToken.PUNC, ";", 231, (0, 0, 0))]

> > +    (CToken.STRING, r'"(?:\\.|[^"\\])*"'),
> > +    (CToken.CHAR, r"'(?:\\.|[^'\\])'"),
> > +
> > +    (CToken.NUMBER, r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
>
> How does "[\s\S]*" differ from plain old "*" ?

They are not identical, as "." doesn't match "\n" while "[\s\S]" does.
As the tokenizer also picks "\n" in several cases, like in comments,
r"[\s\S]" works better.

> > [ ... skip 15 lines ... ]
> > +    (CToken.STRUCT, r"\bstruct\b"),
> > +    (CToken.UNION, r"\bunion\b"),
> > +    (CToken.ENUM, r"\benum\b"),
> > +    (CToken.TYPEDEF, r"\bkinddef\b"),
> > +
> > +    (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
>
> "-" and "!"
> never need to be escaped.

"-" usually needs to be escaped, because it can be a range. I had some
trouble with the parser due to the lack of escapes, so I ended up being
conservative.

(I'm finding it a little bit hard to follow your comments... Here, for
instance, "!" is not in the CToken.NAME regex. Did b4 review place your
comment at the wrong place?)

> > +
> > +    (CToken.SPACE, r"[\s]+"),
> > +
> > +    (CToken.MISMATCH, r"."),
> > +]
> > +
>
> "kinddef"?  Should be "typedef".

This was due to a "sed s,type,kind," I applied to avoid using "type"
for the token type, as, when I started integrating it with kdoc_re, it
became confusing. I'll fix it at the next respin.

> > +#: Handle C continuation lines.
> > +RE_CONT = KernRe(r"\\\n")
> > +
> > +RE_COMMENT_START = KernRe(r'/\*\s*')
> > +
>
> Don't need the [brackets] here

Where?

> > [ ... skip 6 lines ... ]
> > +
> > +    When converted to string, it drops comments and handle public/private
> > +    values, respecting depth.
> > +    """
> > +
> > +    # This class is inspired and follows the basic concepts of:
>
> That seems weird, why don't you just initialize it here?

Hard to tell what you're referring to. Maybe this:

    RE_SCANNER = fill_re_scanner(TOKEN_LIST)

The rationale is that I don't want to re-create this every time, as
this is const.

> > [ ... skip 14 lines ... ]
> > +    source = RE_CONT.sub("", source)
> > +
> > +    brace_level = 0
> > +    paren_level = 0
> > +    bracket_level = 0
> > +
>
> Do you mean "iterator" here?

If you mean the typo at the _tokenize() help text, yes:
interactor -> iterator

> > [ ... skip 33 lines ... ]
> > +    in this particular case, it makes sense, as we can pick the name
> > +    when matching a code via re_scanner().
> > +    """
> > +    global re_scanner
> > +
> > +    if not re_scanner:
>
> Putting __init__() first is fairly standard, methinks.

The CTokenizer class __init__() method calls the _tokenize() method.
My personal preference is to have the caller methods before the methods
they actually call, even inside a class, where the order doesn't matter
- or even in C, when we have an include with all prototypes. But if you
prefer, I can reorder it.

> > [ ... skip 15 lines ... ]
> > +
> > +        for tok in self.tokens:
> > +            if tok.kind == CToken.BEGIN:
> > +                show_stack.append(show_stack[-1])
> > +
> > +            elif tok.kind == CToken.END:
>
> I still don't understand why you do this here - this is all constant,
> right?

This one I didn't get: what part of the code are you referring to? All
constants in this code use upper case names. They are:

- TOKEN_LIST (which should probably be named TOKEN_REGEX_LIST);
- the CToken enum-like names (BEGIN, END, OP, NAME, ...);
- three regexes (RE_CONT, RE_COMMENT_START, RE_SCANNER).

See, what the tokenizer does is a linear transformation from a C source
string into a token list. So, for each instance, its content will
change. Also, when we apply the CMatch logic, its content will also
change.

> > +                prev = show_stack[-1]
> > +                if len(show_stack) > 1:
> > +                    show_stack.pop()
> > +
> > +                if not prev and show_stack[-1]:
>
> So you create a nice iterator structure, then just put it all together
> into a list anyway?

Not sure what you meant here. The end result of the tokenizer is a list
of tokens, in the order they appear at the source code. To be able to
handle public/private and do code transforms, using it as a list is
completely fine.

Now, if we want to use the tokenizer to parse things like:

    typedef struct ca_descr_info {
        unsigned int num;
        unsigned int type;
    } ca_descr_info_t;

then having iterators to parse tokens in both directions would be
great, as the typedef identifier is at the end.

Thanks,
Mauro
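The tokenizing scheme discussed throughout this thread, a (kind, regex)
table compiled into a single scanner with one named group per kind, as
in the Python "re" docs tokenizer example, can be sketched standalone.
The names and regexes below are simplified stand-ins, not the patch's
actual TOKEN_LIST:

```python
import re

# Simplified (kind, regex) pairs, ordered by match priority; the real
# TOKEN_LIST in the patch covers far more of C (keywords, ops, etc.).
TOKEN_LIST = [
    ("COMMENT",  r"//[^\n]*|/\*[\s\S]*?\*/"),
    ("NUMBER",   r"0[xX][0-9a-fA-F]+|\d+"),
    ("ENUM",     r"\benum\b"),
    ("NAME",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP",       r"="),
    ("PUNC",     r"[{};,]"),
    ("SPACE",    r"\s+"),
    ("MISMATCH", r"."),
]

# Join everything into one regex with a named group per kind, as in the
# "re" docs recipe; match.lastgroup then tells which alternative matched.
TOKEN_RE = re.compile("|".join(f"(?P<{kind}>{rx})" for kind, rx in TOKEN_LIST))

def tokenize(code):
    """Yield (kind, text) pairs, in source order."""
    for m in TOKEN_RE.finditer(code):
        yield m.lastgroup, m.group()

toks = [t for t in tokenize("enum e { A = 0, };") if t[0] != "SPACE"]
print(toks)
# [('ENUM', 'enum'), ('NAME', 'e'), ('PUNC', '{'), ('NAME', 'A'),
#  ('OP', '='), ('NUMBER', '0'), ('PUNC', ','), ('PUNC', '}'), ('PUNC', ';')]
```

Ordering matters: ENUM must come before NAME so keywords win over
identifiers, and MISMATCH last acts as the catch-all for invalid input.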

