Paolo Bonzini <[email protected]> writes:
> On 2/9/26 10:36, Markus Armbruster wrote:
>>> +typedef struct JSONParserContext {
>>> + Error *err;
>>> + GQueue *stack;
>>
>> This tells us we have a stack of something, but not what the something
>> is. Further down, we'll see what the something is.
>>
>> The commit message explains what the stack means.
>>
>> Worth a comment here?
>
> Yes, makes sense.
>
>> We know the stack is at most 1024 deep (MAX_NESTING). Have you
>> considered using an array instead of GQueue?
>
> It'd be only 8K, but I preferred to keep MAX_NESTING hidden within
> json-streamer.c. Not having bounds checking makes me a bit nervous
> about having fixed-size arrays.
Looks mostly harmless to me: there's exactly one spot where we push
something onto the stack, one where we pop, and one where we pop
everything. We only ever look at the top of the stack.
> See more below about responsibility split between parser and streamer.
>
>>> + va_list *ap;
>>> +} JSONParserContext;
>>> +
>>
>> Having to move struct definitions to headers is always a bit sad. Could
>> this go into json-parser-int.h instead?
>
> No because it's embedded in JSONMessageParser, just like JSONLexer. It
> was previously hidden only because the parser was created and destroyed
> after parentheses were balanced, but not doing that is kind of the whole
> story here. :)
I see.
To avoid putting private parser state in json-parser.h, we'd need to let
json_message_parser_init() allocate it dynamically. Reasonable option,
but let's stay on target here. We can always do it later.
[...]
>>> +typedef enum JSONParserState {
>>> + AFTER_LCURLY,
>>> + AFTER_LSQUARE,
>>> + BEFORE_KEY,
>>> + BEFORE_VALUE,
>>> + END_OF_KEY,
>>> + END_OF_VALUE,
>>> +} JSONParserState;
>>> +
>>> +typedef struct JSONParserStackEntry {
>>> + /* A QString with the last parsed key, or a QList/QDict for the
>>> current container. */
>>> + QObject *partial;
>>> +
>>> + /* Needed to distinguish whether the parser is waiting for a colon or
>>> comma. */
>>> + JSONParserState state;
>> If the comment was 100% accurate, a bool would do.
>
> Hmm when I wrote I did think that a bool + the type of the top QObject would
> do. But really there are three cases that the top QObject does not
> distinguish:
>
> - "after opening bracket, closed parenthesis or next element accepted",
>
> - "after comma, next element required"
>
> - "after element, closed parenthesis or comma accepted"
>
> So, a bool would not be enough. Will fix.
>
>>> +/* Note that entry->partial does *not* lose its reference count even if
>>> value == NULL. */
>>> +static JSONParserStackEntry *pop_entry(JSONParserContext *ctxt, QObject
>>> **value)
>>> +{
>>> + JSONParserStackEntry *entry = g_queue_pop_tail(ctxt->stack);
>>> + if (value) {
>>> + *value = entry->partial;
>>> + }
>>> + g_free(entry);
>>> + return current_entry(ctxt);
>>> +}
>>
>> This is a rather peculiar function. If you disagree, try writing a
>> contract for it :)
>
> Hmm... yes, better pull the "value = entry->partial" to the callers even
> if there are several of them. The code is overall simpler.
That's what I figured.
>> The parser's interesting part follows. The parser is a pushdown
>> automaton. The pushdown automaton is coded in C. On the one hand,
>> d'oh! Of course it is. On the other hand, it's hard to see the actual
>> automaton in C. May I have a comment explaining state and state
>> transitions? Perhaps we better start with an informal description,
>> discuss it, then distill the discussion into a comment.
>
> That was the purpose of the mysterious comment above. If a mix between
> BNF and automaton is okay for you, it could be something like
>
> input := value -> END_OF_VALUE
> END_OF_INPUT -> check stack is empty, return parsed value
>
> // entered on BEFORE_VALUE; after any of these rules are processed, the
> // parser has completed a QObject and is in the END_OF_VALUE state.
What do you mean by "after any of these rules are processed, ..."?
> //
> // When the parser reaches the END_OF_VALUE state, it examines the
> // top of the stack to see if it's coming from "input" (stack empty),
> // "array_items" (TOS is a QList) or "dict_pairs" (TOS is a QString; the
> // item below will be a QDict). It then proceeds with the corresponding
> // actions, which will be one of:
> // - return parsed value
> // - add value to QList
> // - pop QString with the key, add key/value to the QDict
Forward references to array_items and dict_pairs. Suggest to add a
suitable "below", or indicate the forward reference some other way.
>
> value := literal -> END_OF_VALUE
> | '[' -> push empty QList -> AFTER_LSQUARE
> after_lsquare -> END_OF_VALUE
> | '{' -> push empty QDict -> AFTER_LCURLY
> after_lcurly -> END_OF_VALUE
>
> // non-recursive values, entered on BEFORE_VALUE
> literal := INTEGER -> END_OF_VALUE
> | FLOAT -> END_OF_VALUE
> | KEYWORD -> END_OF_VALUE
> | STRING -> END_OF_VALUE
> | INTERP -> END_OF_VALUE
>
> // entered on AFTER_LSQUARE
> after_lsquare := ']' -> pop completed QList -> END_OF_VALUE
> | ϵ -> BEFORE_VALUE
> array_items -> END_OF_VALUE
>
> // entered on BEFORE_VALUE, with TOS being a QList
> array_items := value -> add value to QList -> END_OF_VALUE
> (']' -> pop completed QList -> END_OF_VALUE
> | ',' -> BEFORE_VALUE
> array_items) -> END_OF_VALUE
>
> // entered on AFTER_LCURLY
> after_lcurly := '}' -> pop completed QDict -> END_OF_VALUE
> | ϵ -> BEFORE_KEY
> dict_pairs -> END_OF_VALUE
>
> // entered on BEFORE_KEY, with TOS being a QDict
> dict_pairs := STRING -> push QString -> END_OF_KEY
> ':' -> BEFORE_VALUE
> value -> pop QString + add pair to QDict -> END_OF_VALUE
> ('}' -> pop completed QDict -> END_OF_VALUE
> | ',' -> BEFORE_KEY
> dict_pairs) -> END_OF_VALUE
The BNF part is everything but the -> ... Easy enough. However, I'm
not quite certain how to interpret the --> ...
The "// entered on STATE ..." I can read as assertion.
The text after "->" seems to be one ore more state / stack transitions.
Let me describe them in my own words to make sure I understand you.
Note that I had to also look at the source code to produce my
description.
-> STATE
Set top stack entry's state to STATE.
-> push OBJ -> STATE
Push (STATE, OBJ) onto the stack.
-> pop completed OBJ -> END_OF_VALUE
Pop (_, OBJ) from the stack, set new top stack entry's state to
END_OF_VALUE. OBJ is the result of parsing non-terminal @value.
-> add value to QList -> END_OF_VALUE
Append the result of parsing non-terminal @value to the top stack
entry's QList. Set the top stack entry's state to END_OF_VALUE.
-> pop QString + add pair to QDict -> END_OF_VALUE
Pop (_, QString) from the stack, it's the key. Its value is the
result of parsing non-terminal value. Store this key: value pair in
the top stack entry's QDict. Set the top stack entry's state to
END_OF_VALUE.
Unsaid, but reasonably obvious: the result of parsing non-terminal
@value when it's not an array or object.
The structure of the stack isn't explicitly explained, but can be
extracted from the above.
Stack entry members @partial from bottom to top:
* a QList element for an unclosed '[',
* a QDict and a QString element for an unclosed '{', except the
innermost one need not have a QString. This is because the QString is
on the stack only while we're parsing ':' value.
So, stack contents invariant:
( QList | QDict QString )* QDict?
I found the QString business somewhat confusing, probably because I
expected one stack entry per nesting level, and tossing that invalid
assumption took mental effort.
I figure we could have one stack entry per: add a const char *key member
to JSONParserStackEntry, non-null exactly when we have a QString sitting
on a QDict now. We'd trade representable invalid state "QString
directly above non-QDict in the stack" for "QList stack entry has member
@key set". Would be slightly more local. Matter of taste, I guess.
More representable invalid state, as far as I can tell (observation, not
objection):
* a QList stack entry can only have state AFTER_LSQUARE, BEFORE_VALUE,
END_OF_VALUE, never AFTER_LCURLY, BEFORE_KEY, END_OF_KEY
* a QDict stack entry can only have state AFTER_LCURLY, BEFORE_KEY,
END_OF_VALUE, never AFTER_LSQUARE, BEFORE_VALUE, END_OF_KEY
* a QString stack entry can only have state BEFORE_VALUE and END_OF_KEY,
never AFTER_LCURLY, AFTER_LSQUARE, BEFORE_VALUE, END_OF_VALUE.
Stack entries by @state instead of by @partial:
* AFTER_LCURLY: @partial is QDict
* AFTER_LSQUARE: @partial is QList
* BEFORE_KEY: @partial is QDict
* BEFORE_VALUE: @partial is QList or QString
* END_OF_KEY: @partial is QString
* END_OF_VALUE: @partial is QList or QString
Invariants worth documenting, I think.
Here's my attempt at explaining the automaton and how states and stack
fit together: annotated JSON syntax diagrams. The diagrams are from
www.json.org with whitespace nonterminal omitted for brevity.
Semantic actions are connected with dashed lines. State names are
abbreviated to ALC, ALS, BK, BV, EOK, EOV. (STATE, QTYPE) describes the
top stack entry. "assert (STATE, QTYPE)" asserts the top stack entry's
@state is STATE, and its @partial is of QTYPE. "set STATE" means the
top stack entry's state is set to STATE.
set EOV
result is QObject
value ┆
>────────╮─── string ──╭───>
┆ │ │
assert (BV, _) ╰─── number ──╯
│ │
╰─── object ──╯
│ │
╰─── array ───╯
│ │
╰─── true ────╯
│ │
╰─── false ───╯
│ │
╰─── null ────╯
push (ALS, QList) pop (ALS, QList)
┆ assert (BV, _)
┆ set EOV
┆ result is popped QList
array ┆ ┆
>─────── [ ─╮────────────╭─ ] ──>
┆ │ │
assert (BV, _) │ ┄ set BV │
│ │
set BV │ │
┆ │ │
╭─ , ────╰── value ──╮╯
│ │ ┄ assert (EOV, QList)
╰────────────────────╯ put value into QList
push (ALC, QDict) pop (ALC, QDict)
┆ assert (BV, _)
┆ set EOV
┆ result is popped QDict
object ┆ ┆
>─────── { ─╮───────────────────────────╭─ } ──>
┆ │ │
assert (BV, _) │ ┄ set BK │
│ │
set BK │ push (EOK, QString) │
┆ │ ┆ │
╭─ , ────╰─ string ─── : ─── value ─╮╯
│ ┆ │ pop (BV, QString)
│ set BV │ ┄ assert (EOV, QDict)
│ │ put string: value
╰───────────────────────────────────╯ into QDict
Might be your turn to be confused ;)
>>> +static QObject *json_parser_parse_token(JSONParserContext *ctxt, const
>>> JSONToken *token)
>>
>> Rename to parse_token()? Or inline into json_parser_feed()?
>
> Renaming is good.
>
>>> + if (!key) {
>>> + return NULL;
>>> + }
>>
>> Key to understand the switch is the meaning of "break" and "return
>> NULL".
>>
>> We do the latter when we're done for this token.
>>
>> We do the former when this token completed a (sub-)value. @value holds
>> it. If the stack is empty, @value is the result of the parse, and the
>> code after the switch returns it. Else, @value is a sub-value, and the
>> code after the switch stores it in the object being constructed.
>
> Correct. Want me to include it somehow or is the grammar above fine?
Whether the grammar makes the meaning of "break" and "return NULL"
obvious enough is hard to tell now: I know too much. Should fix itself
within a few days ;)
>>> +
>>> + /* Store key in a special entry on the stack */
>>> + push_entry(ctxt, QOBJECT(key), END_OF_KEY);
>>> + } else {
>>> + parse_error(ctxt, token, "expecting key");
>>
>> What's the state machine's parse error recovery story?
>
> As simple as it can be:
>
> assert(!ctxt->err);
>
> because a push parser can actually refuse to deal with tokens after an
> error, and make recovery someone else's problem.
>
> In other words a push parser can actually be the pure theoretical thing
> from your compilers and automata book, and all the engineering sits on
> top. IMO this (at least for a JSON parser that you won't touch that
> often) balances the complexity of the state machine.
>
> So, error recovery is handled entirely in json-streamer.c. As of this
> patch, json-streamer.c has gathered all tokens up to the next balanced
> parentheses, and that's implicitly the place where the error recovery is
> complete.
This became clear to me once I had processed the remainder of the
series.
Peeling error recovery off the pushdown automaton that way indeed
simplifies the automaton. Neat! Possible because JSON's simple syntax
lets us get away with simple parenthesis counting error recovery.
>>> +QObject *json_parser_feed(JSONParserContext *ctxt, const JSONToken *token,
>>> Error **errp)
>>
>> The return value isn't immediately obvious. A brief function contract
>> could help the reader.
>
> Ok, will add.
>
>> Before the patch: flush the queue when both counts are zero or at least
>> one is negative.
>>
>> After the patch: additionally flush it on end of input.
>>
>> I figure the two hunks together make the parser see a JSON_END_OF_INPUT
>> token at the end of input. Correct?
>
> Yes, see the commit message. This is the only change I have made here
> to the parser/streamer split, because it makes sense to have the
> unbalanced parentheses error happen in the parser. Otherwise you're not
> really implementing the whole grammar.
Yes.
At the end of the series, the only error checks in json-streamer.c are
about limits. These aren't part of the syntax. Everything that is part
of the syntax is checked in json-parser.c, except for lexical errors,
which are diagnosed in json-lexer.c, and reported in json-parser.c.
Perfectly sensible.
[...]