Re: [PATCH 2/5] json-parser: replace with a push parser

Markus Armbruster Mon, 16 Feb 2026 23:40:29 -0800

Paolo Bonzini <[email protected]> writes:

> On 2/12/26 14:12, Markus Armbruster wrote:
>>>> We know the stack is at most 1024 deep (MAX_NESTING).  Have you
>>>> considered using an array instead of GQueue?
>>>
>>> It'd be only 8K, but I preferred to keep MAX_NESTING hidden within
>>> json-streamer.c.  Not having bounds checking makes me a bit nervous
>>> about having fixed-size arrays.
>> 
>> Looks mostly harmless to me: there's exactly one spot where we push
>> something onto the stack, one where we pop, and one where we pop
>> everything.  We only ever look at the top of the stack.
>
> Yes, code-wise it's not hard.  But it's ugly/brittle to have two pieces 
> of code that have to sync on MAX_NESTING.
>
>>>> The parser's interesting part follows.  The parser is a pushdown
>>>> automaton.  The pushdown automaton is coded in C.  On the one hand,
>>>> d'oh!  Of course it is.  On the other hand, it's hard to see the actual
>>>> automaton in C.  May I have a comment explaining state and state
>>>> transitions?  Perhaps we better start with an informal description,
>>>> discuss it, then distill the discussion into a comment.
>>>
>>> That was the purpose of the mysterious comment above.  If a mix between
>>> BNF and automaton is okay for you, it could be something like
>>>
>>> input := value        -> END_OF_VALUE
>>>            END_OF_INPUT -> check stack is empty, return parsed value
>>>
>>> // entered on BEFORE_VALUE; after any of these rules are processed, the
>>> // parser has completed a QObject and is in the END_OF_VALUE state.
>> 
>> What do you mean by "after any of these rules are processed, ..."?
>
> That "literal", "'[' after_lsquare" and "'{' after_lcurly" both end up 
> with the state machine in the END_OF_VALUE state.


So this is a "how to read the annotated BNF" comment?  

>>> //
>>> // When the parser reaches the END_OF_VALUE state, it examines the
>>> // top of the stack to see if it's coming from "input" (stack empty),
>>> // "array_items" (TOS is a QList) or "dict_pairs" (TOS is a QString; the
>>> // item below will be a QDict).  It then proceeds with the corresponding
>>> // actions, which will be one of:
>>> // - return parsed value
>>> // - add value to QList
>>> // - pop QString with the key, add key/value to the QDict
>> 
>> Forward references to array_items and dict_pairs.  Suggest to add a
>> suitable "below", or indicate the forward reference some other way.
>
> I can put instead after_lsquare and after_lcurly.

BNF usually reads easiest when written top down, i.e. start symbol
production first, other non-terminals' productions after their use.

I don't mind the forward references in the comment here, it just took me
a second to recognize them.  No real problem, but perhaps adding "above"
/ "below" could help the reader.

>>> value := literal       -> END_OF_VALUE
>>>         | '['            -> push empty QList -> AFTER_LSQUARE
>>>            after_lsquare -> END_OF_VALUE
>>>         | '{'            -> push empty QDict -> AFTER_LCURLY
>>>            after_lcurly  -> END_OF_VALUE
>>>
>>> // non-recursive values, entered on BEFORE_VALUE
>>> literal := INTEGER   -> END_OF_VALUE
>>>        | FLOAT         -> END_OF_VALUE
>>>        | KEYWORD       -> END_OF_VALUE
>>>        | STRING        -> END_OF_VALUE
>>>        | INTERP        -> END_OF_VALUE
>>>
>>> // entered on AFTER_LSQUARE
>>> after_lsquare := ']' -> pop completed QList -> END_OF_VALUE
>>>          | ϵ           -> BEFORE_VALUE
>>>            array_items -> END_OF_VALUE
>>>
>>> // entered on BEFORE_VALUE, with TOS being a QList
>>> array_items := value   -> add value to QList -> END_OF_VALUE
>>>          (']'            -> pop completed QList -> END_OF_VALUE
>>>           | ','          -> BEFORE_VALUE
>>>             array_items) -> END_OF_VALUE
>>>
>>> // entered on AFTER_LCURLY
>>> after_lcurly := '}'  -> pop completed QDict -> END_OF_VALUE
>>>          | ϵ           -> BEFORE_KEY
>>>            dict_pairs  -> END_OF_VALUE
>>>
>>> // entered on BEFORE_KEY, with TOS being a QDict
>>> dict_pairs := STRING  -> push QString -> END_OF_KEY
>>>          ':'            -> BEFORE_VALUE
>>>          value          -> pop QString + add pair to QDict -> END_OF_VALUE
>>>          ('}'           -> pop completed QDict -> END_OF_VALUE
>>>           | ','         -> BEFORE_KEY
>>>             dict_pairs) -> END_OF_VALUE
>> 
>> The BNF part is everything but the -> ...  Easy enough.  However, I'm
>> not quite certain how to interpret the --> ...
>> 
>> The "// entered on STATE ..." I can read as assertion.
>
> Yes, that's a comment (with C++ syntax to have something familiar).
>
>> The text after "->" seems to be one ore more state / stack transitions.
>> Let me describe them in my own words to make sure I understand you.
>> Note that I had to also look at the source code to produce my
>> description.
>> 
>> -> STATE
>> 
>>     Set top stack entry's state to STATE.
>> 
>> -> push OBJ -> STATE
>> 
>>     Push (STATE, OBJ) onto the stack.
>
> Intended more like push (OBJ, undefined), set top stack entry state to 
> STATE, but yes.
>
>> -> pop completed OBJ -> END_OF_VALUE
>> 
>>     Pop (_, OBJ) from the stack, set new top stack entry's state to
>>     END_OF_VALUE.  OBJ is the result of parsing non-terminal @value.
>> 
>> -> add value to QList -> END_OF_VALUE
>> 
>>     Append the result of parsing non-terminal @value to the top stack
>>     entry's QList.  Set the top stack entry's state to END_OF_VALUE.
>> 
>> -> pop QString + add pair to QDict -> END_OF_VALUE
>> 
>>     Pop (_, QString) from the stack, it's the key.  Its value is the
>>     result of parsing non-terminal value.  Store this key: value pair in
>>     the top stack entry's QDict.  Set the top stack entry's state to
>>     END_OF_VALUE.
>
> Correct. "-> STATE" generally means "go to state STATE", where teh state 
> is stored in the top stack entry.

So I was able to work out how to read the annotations.  How can we make
that easier for the next guy?  Could well be me in six months...

>> Stack entry members @partial from bottom to top:
>> 
>> * a QList element for an unclosed '[',
>> 
>> * a QDict and a QString element for an unclosed '{', except the
>>    innermost one need not have a QString.  This is because the QString is
>>    on the stack only while we're parsing ':' value.
>> 
>> So, stack contents invariant:
>> 
>>      ( QList | QDict QString )* QDict?
>
> Makes sense.

Let's write it down where we describe the structure of the stack.

>> I found the QString business somewhat confusing, probably because I
>> expected one stack entry per nesting level, and tossing that invalid
>> assumption took mental effort.
>
> ... while I wrote the parser as one per pending nonterminal.  Coming 
> from the recursive descent probably suggested that approach, somewhat 
> subconsciously.
>
>> I figure we could have one stack entry per: add a const char *key member
>> to JSONParserStackEntry, non-null exactly when we have a QString sitting
>> on a QDict now.  We'd trade representable invalid state "QString
>> directly above non-QDict in the stack" for "QList stack entry has member
>> @key set".  Would be slightly more local.  Matter of taste, I guess.
>
> I can check what the code looks like, sure.
>
>> More representable invalid state, as far as I can tell (observation, not
>> objection):
>> 
>> * a QList stack entry can only have state AFTER_LSQUARE, BEFORE_VALUE,
>>    END_OF_VALUE, never AFTER_LCURLY, BEFORE_KEY, END_OF_KEY
>> 
>> * a QDict stack entry can only have state AFTER_LCURLY, BEFORE_KEY,
>>    END_OF_VALUE, never AFTER_LSQUARE, BEFORE_VALUE, END_OF_KEY
>> 
>> * a QString stack entry can only have state BEFORE_VALUE and END_OF_KEY,
>>    never AFTER_LCURLY, AFTER_LSQUARE, BEFORE_VALUE, END_OF_VALUE.
>> 
>> Stack entries by @state instead of by @partial:
>>
>> * AFTER_LCURLY: @partial is QDict
>>
>> * AFTER_LSQUARE: @partial is QList
>>
>> * BEFORE_KEY: @partial is QDict
>>
>> * BEFORE_VALUE: @partial is QList or QString
>
> ... or empty:
>
>      state = entry ? entry->state : BEFORE_VALUE;

Yes, forgot that one.

>> * END_OF_KEY: @partial is QString
>>
>> * END_OF_VALUE: @partial is QList or QString
>>
>> Invariants worth documenting, I think.
>
> Ok---as assertions in the code?

Making invalid states unrepresentable is best, but that's often not
practical in C.

I generally prefer assertions to comments as long as the assertions are
easy enough to read.

>> Here's my attempt at explaining the automaton and how states and stack
>> fit together: annotated JSON syntax diagrams.  The diagrams are from
>> www.json.org with whitespace nonterminal omitted for brevity.
>> 
>> Semantic actions are connected with dashed lines.  State names are
>> abbreviated to ALC, ALS, BK, BV, EOK, EOV.  (STATE, QTYPE) describes the
>> top stack entry.  "assert (STATE, QTYPE)" asserts the top stack entry's
>> @state is STATE, and its @partial is of QTYPE.  "set STATE" means the
>> top stack entry's state is set to STATE.
>> 
>>                                 set EOV
>>                            result is QObject
>>           value                    ┆
>>           >────────╮─── string ──╭───>
>>           ┆        │             │
>>   assert (BV, _)   ╰─── number ──╯
>>                    │             │
>>                    ╰─── object ──╯
>>                    │             │
>>                    ╰─── array ───╯
>>                    │             │
>>                    ╰─── true ────╯
>>                    │             │
>>                    ╰─── false ───╯
>>                    │             │
>>                    ╰─── null ────╯
>> 
>> 
>>               push (ALS, QList)   pop (ALS, QList)
>>                      ┆             assert (BV, _)
>>                      ┆               set EOV
>>                      ┆          result is popped QList
>>           array      ┆                  ┆
>>           >─────── [ ─╮────────────╭─ ] ──>
>>           ┆           │            │ 
>>   assert (BV, _)      │ ┄ set BV   │
>>                       │            │
>>               set BV  │            │
>>                   ┆   │            │
>>              ╭─ , ────╰── value ──╮╯
>>              │                    │ ┄ assert (EOV, QList)
>>              ╰────────────────────╯   put value into QList
>> 
>> 
>> 
>>               push (ALC, QDict)                  pop (ALC, QDict)
>>                      ┆                            assert (BV, _)
>>                      ┆                              set EOV
>>                      ┆                         result is popped QDict
>>           object     ┆                                 ┆
>>           >─────── { ─╮───────────────────────────╭─ } ──>
>>           ┆           │                           │
>>   assert (BV, _)      │ ┄ set BK                  │
>>                       │                           │
>>               set BK  │  push (EOK, QString)      │
>>                   ┆   │         ┆                 │
>>              ╭─ , ────╰─ string ─── : ─── value ─╮╯
>>              │                        ┆          │   pop (BV, QString)
>>              │                     set BV        │ ┄ assert (EOV, QDict)
>>              │                                   │   put string: value
>>              ╰───────────────────────────────────╯      into QDict
>> 
>> Might be your turn to be confused ;)
>
> I'm not confused but maybe it is a bit too detailed.

We can elide detail.  For instance, if we document the stack structure
elsewhere, assertions on stack contents are low value here.

>>>>> +static QObject *json_parser_parse_token(JSONParserContext *ctxt, const 
>>>>> JSONToken *token)
>>>>
>>>> Rename to parse_token()?  Or inline into json_parser_feed()?
>>>
>>> Renaming is good.
>>>
>>>>> +            if (!key) {
>>>>> +                return NULL;
>>>>> +            }
>>>>
>>>> Key to understand the switch is the meaning of "break" and "return
>>>> NULL".
>>>>
>>>> We do the latter when we're done for this token.
>>>>
>>>> We do the former when this token completed a (sub-)value.  @value holds
>>>> it.  If the stack is empty, @value is the result of the parse, and the
>>>> code after the switch returns it.  Else, @value is a sub-value, and the
>>>> code after the switch stores it in the object being constructed.
>>>
>>> Correct.  Want me to include it somehow or is the grammar above fine?
>> 
>> Whether the grammar makes the meaning of "break" and "return NULL"
>> obvious enough is hard to tell now: I know too much.  Should fix itself
>> within a few days ;)
>
> Probably not. :)  I'll see if I can move the post-pr

I can offer a pair of fresher eyes on v2.

Re: [PATCH 2/5] json-parser: replace with a push parser

Reply via email to