Paolo Bonzini <[email protected]> writes:
> On 2/12/26 14:12, Markus Armbruster wrote:
>>>> We know the stack is at most 1024 deep (MAX_NESTING). Have you
>>>> considered using an array instead of GQueue?
>>>
>>> It'd be only 8K, but I preferred to keep MAX_NESTING hidden within
>>> json-streamer.c. Not having bounds checking makes me a bit nervous
>>> about having fixed-size arrays.
>>
>> Looks mostly harmless to me: there's exactly one spot where we push
>> something onto the stack, one where we pop, and one where we pop
>> everything. We only ever look at the top of the stack.
>
> Yes, code-wise it's not hard. But it's ugly/brittle to have two pieces
> of code that have to sync on MAX_NESTING.
>
>>>> The parser's interesting part follows. The parser is a pushdown
>>>> automaton. The pushdown automaton is coded in C. On the one hand,
>>>> d'oh! Of course it is. On the other hand, it's hard to see the actual
>>>> automaton in C. May I have a comment explaining state and state
>>>> transitions? Perhaps we better start with an informal description,
>>>> discuss it, then distill the discussion into a comment.
>>>
>>> That was the purpose of the mysterious comment above. If a mix between
>>> BNF and automaton is okay for you, it could be something like
>>>
>>> input := value -> END_OF_VALUE
>>> END_OF_INPUT -> check stack is empty, return parsed value
>>>
>>> // entered on BEFORE_VALUE; after any of these rules are processed, the
>>> // parser has completed a QObject and is in the END_OF_VALUE state.
>>
>> What do you mean by "after any of these rules are processed, ..."?
>
> That "literal", "'[' after_lsquare" and "'{' after_lcurly" both end up
> with the state machine in the END_OF_VALUE state.
So this is a "how to read the annotated BNF" comment?
>>> //
>>> // When the parser reaches the END_OF_VALUE state, it examines the
>>> // top of the stack to see if it's coming from "input" (stack empty),
>>> // "array_items" (TOS is a QList) or "dict_pairs" (TOS is a QString; the
>>> // item below will be a QDict). It then proceeds with the corresponding
>>> // actions, which will be one of:
>>> // - return parsed value
>>> // - add value to QList
>>> // - pop QString with the key, add key/value to the QDict
>>
>> Forward references to array_items and dict_pairs. Suggest to add a
>> suitable "below", or indicate the forward reference some other way.
>
> I can put instead after_lsquare and after_lcurly.
BNF usually reads easiest when written top down, i.e. start symbol
production first, other non-terminals' productions after their use.
I don't mind the forward references in the comment here, it just took me
a second to recognize them. No real problem, but perhaps adding "above"
/ "below" could help the reader.
>>> value := literal -> END_OF_VALUE
>>> | '[' -> push empty QList -> AFTER_LSQUARE
>>> after_lsquare -> END_OF_VALUE
>>> | '{' -> push empty QDict -> AFTER_LCURLY
>>> after_lcurly -> END_OF_VALUE
>>>
>>> // non-recursive values, entered on BEFORE_VALUE
>>> literal := INTEGER -> END_OF_VALUE
>>> | FLOAT -> END_OF_VALUE
>>> | KEYWORD -> END_OF_VALUE
>>> | STRING -> END_OF_VALUE
>>> | INTERP -> END_OF_VALUE
>>>
>>> // entered on AFTER_LSQUARE
>>> after_lsquare := ']' -> pop completed QList -> END_OF_VALUE
>>> | ϵ -> BEFORE_VALUE
>>> array_items -> END_OF_VALUE
>>>
>>> // entered on BEFORE_VALUE, with TOS being a QList
>>> array_items := value -> add value to QList -> END_OF_VALUE
>>> (']' -> pop completed QList -> END_OF_VALUE
>>> | ',' -> BEFORE_VALUE
>>> array_items) -> END_OF_VALUE
>>>
>>> // entered on AFTER_LCURLY
>>> after_lcurly := '}' -> pop completed QDict -> END_OF_VALUE
>>> | ϵ -> BEFORE_KEY
>>> dict_pairs -> END_OF_VALUE
>>>
>>> // entered on BEFORE_KEY, with TOS being a QDict
>>> dict_pairs := STRING -> push QString -> END_OF_KEY
>>> ':' -> BEFORE_VALUE
>>> value -> pop QString + add pair to QDict -> END_OF_VALUE
>>> ('}' -> pop completed QDict -> END_OF_VALUE
>>> | ',' -> BEFORE_KEY
>>> dict_pairs) -> END_OF_VALUE
>>
>> The BNF part is everything but the -> ... Easy enough. However, I'm
>> not quite certain how to interpret the --> ...
>>
>> The "// entered on STATE ..." I can read as assertion.
>
> Yes, that's a comment (with C++ syntax to have something familiar).
>
>> The text after "->" seems to be one ore more state / stack transitions.
>> Let me describe them in my own words to make sure I understand you.
>> Note that I had to also look at the source code to produce my
>> description.
>>
>> -> STATE
>>
>> Set top stack entry's state to STATE.
>>
>> -> push OBJ -> STATE
>>
>> Push (STATE, OBJ) onto the stack.
>
> Intended more like push (OBJ, undefined), set top stack entry state to
> STATE, but yes.
>
>> -> pop completed OBJ -> END_OF_VALUE
>>
>> Pop (_, OBJ) from the stack, set new top stack entry's state to
>> END_OF_VALUE. OBJ is the result of parsing non-terminal @value.
>>
>> -> add value to QList -> END_OF_VALUE
>>
>> Append the result of parsing non-terminal @value to the top stack
>> entry's QList. Set the top stack entry's state to END_OF_VALUE.
>>
>> -> pop QString + add pair to QDict -> END_OF_VALUE
>>
>> Pop (_, QString) from the stack, it's the key. Its value is the
>> result of parsing non-terminal value. Store this key: value pair in
>> the top stack entry's QDict. Set the top stack entry's state to
>> END_OF_VALUE.
>
> Correct. "-> STATE" generally means "go to state STATE", where teh state
> is stored in the top stack entry.
So I was able to work out how to read the annotations. How can we make
that easier for the next guy? Could well be me in six months...
>> Stack entry members @partial from bottom to top:
>>
>> * a QList element for an unclosed '[',
>>
>> * a QDict and a QString element for an unclosed '{', except the
>> innermost one need not have a QString. This is because the QString is
>> on the stack only while we're parsing ':' value.
>>
>> So, stack contents invariant:
>>
>> ( QList | QDict QString )* QDict?
>
> Makes sense.
Let's write it down where we describe the structure of the stack.
>> I found the QString business somewhat confusing, probably because I
>> expected one stack entry per nesting level, and tossing that invalid
>> assumption took mental effort.
>
> ... while I wrote the parser as one per pending nonterminal. Coming
> from the recursive descent probably suggested that approach, somewhat
> subconsciously.
>
>> I figure we could have one stack entry per: add a const char *key member
>> to JSONParserStackEntry, non-null exactly when we have a QString sitting
>> on a QDict now. We'd trade representable invalid state "QString
>> directly above non-QDict in the stack" for "QList stack entry has member
>> @key set". Would be slightly more local. Matter of taste, I guess.
>
> I can check what the code looks like, sure.
>
>> More representable invalid state, as far as I can tell (observation, not
>> objection):
>>
>> * a QList stack entry can only have state AFTER_LSQUARE, BEFORE_VALUE,
>> END_OF_VALUE, never AFTER_LCURLY, BEFORE_KEY, END_OF_KEY
>>
>> * a QDict stack entry can only have state AFTER_LCURLY, BEFORE_KEY,
>> END_OF_VALUE, never AFTER_LSQUARE, BEFORE_VALUE, END_OF_KEY
>>
>> * a QString stack entry can only have state BEFORE_VALUE and END_OF_KEY,
>> never AFTER_LCURLY, AFTER_LSQUARE, BEFORE_VALUE, END_OF_VALUE.
>>
>> Stack entries by @state instead of by @partial:
>>
>> * AFTER_LCURLY: @partial is QDict
>>
>> * AFTER_LSQUARE: @partial is QList
>>
>> * BEFORE_KEY: @partial is QDict
>>
>> * BEFORE_VALUE: @partial is QList or QString
>
> ... or empty:
>
> state = entry ? entry->state : BEFORE_VALUE;
Yes, forgot that one.
>> * END_OF_KEY: @partial is QString
>>
>> * END_OF_VALUE: @partial is QList or QString
>>
>> Invariants worth documenting, I think.
>
> Ok---as assertions in the code?
Making invalid states unrepresentable is best, but that's often not
practical in C.
I generally prefer assertions to comments as long as the assertions are
easy enough to read.
>> Here's my attempt at explaining the automaton and how states and stack
>> fit together: annotated JSON syntax diagrams. The diagrams are from
>> www.json.org with whitespace nonterminal omitted for brevity.
>>
>> Semantic actions are connected with dashed lines. State names are
>> abbreviated to ALC, ALS, BK, BV, EOK, EOV. (STATE, QTYPE) describes the
>> top stack entry. "assert (STATE, QTYPE)" asserts the top stack entry's
>> @state is STATE, and its @partial is of QTYPE. "set STATE" means the
>> top stack entry's state is set to STATE.
>>
>> set EOV
>> result is QObject
>> value ┆
>> >────────╮─── string ──╭───>
>> ┆ │ │
>> assert (BV, _) ╰─── number ──╯
>> │ │
>> ╰─── object ──╯
>> │ │
>> ╰─── array ───╯
>> │ │
>> ╰─── true ────╯
>> │ │
>> ╰─── false ───╯
>> │ │
>> ╰─── null ────╯
>>
>>
>> push (ALS, QList) pop (ALS, QList)
>> ┆ assert (BV, _)
>> ┆ set EOV
>> ┆ result is popped QList
>> array ┆ ┆
>> >─────── [ ─╮────────────╭─ ] ──>
>> ┆ │ │
>> assert (BV, _) │ ┄ set BV │
>> │ │
>> set BV │ │
>> ┆ │ │
>> ╭─ , ────╰── value ──╮╯
>> │ │ ┄ assert (EOV, QList)
>> ╰────────────────────╯ put value into QList
>>
>>
>>
>> push (ALC, QDict) pop (ALC, QDict)
>> ┆ assert (BV, _)
>> ┆ set EOV
>> ┆ result is popped QDict
>> object ┆ ┆
>> >─────── { ─╮───────────────────────────╭─ } ──>
>> ┆ │ │
>> assert (BV, _) │ ┄ set BK │
>> │ │
>> set BK │ push (EOK, QString) │
>> ┆ │ ┆ │
>> ╭─ , ────╰─ string ─── : ─── value ─╮╯
>> │ ┆ │ pop (BV, QString)
>> │ set BV │ ┄ assert (EOV, QDict)
>> │ │ put string: value
>> ╰───────────────────────────────────╯ into QDict
>>
>> Might be your turn to be confused ;)
>
> I'm not confused but maybe it is a bit too detailed.
We can elide detail. For instance, if we document the stack structure
elsewhere, assertions on stack contents are low value here.
>>>>> +static QObject *json_parser_parse_token(JSONParserContext *ctxt, const
>>>>> JSONToken *token)
>>>>
>>>> Rename to parse_token()? Or inline into json_parser_feed()?
>>>
>>> Renaming is good.
>>>
>>>>> + if (!key) {
>>>>> + return NULL;
>>>>> + }
>>>>
>>>> Key to understand the switch is the meaning of "break" and "return
>>>> NULL".
>>>>
>>>> We do the latter when we're done for this token.
>>>>
>>>> We do the former when this token completed a (sub-)value. @value holds
>>>> it. If the stack is empty, @value is the result of the parse, and the
>>>> code after the switch returns it. Else, @value is a sub-value, and the
>>>> code after the switch stores it in the object being constructed.
>>>
>>> Correct. Want me to include it somehow or is the grammar above fine?
>>
>> Whether the grammar makes the meaning of "break" and "return NULL"
>> obvious enough is hard to tell now: I know too much. Should fix itself
>> within a few days ;)
>
> Probably not. :) I'll see if I can move the post-pr
I can offer a pair of fresher eyes on v2.