On Saturday, 18 July 2020 08:47:07 CEST Akim Demaille wrote:
> > On 14 July 2020 at 13:31, Christian Schoenebeck
> > <schoeneb...@crudebyte.com> wrote:
> >
> > On Monday, 13 July 2020 07:56:57 CEST Akim Demaille wrote:
> >
> > For the 'married' parser, in the form imagined by me, there would be no
> > lookahead token, as a lookahead token implies a context-unaware scanner.
> >
> > Instead, when the parser would get to a state where it had to decide
> > between reducing or shifting (depending on a next token), it would not
> > decide and fork the parser state instead, in order to allow
> > context-specific tokenization.
>
> That's indeed a possible choice in the implementation of GLR parser
> generators: to sit GLR on top of LR(0). I've never seen performance
> reports that compare both approaches though.
>
> However, most implementations use SLR(1), because it is super simple to
> implement (especially compared to LALR(1)), yet makes a significant
> difference.
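To make the fork-and-collapse idea more concrete, here is a little toy sketch of mine (Python, invented grammar, nothing resembling real Bison internals): at a conflict the parser forks its stack instead of consulting a lookahead token, and forks that cannot proceed simply die on the next token.

```python
# Toy sketch (hypothetical, not Bison): grammar with a reduce/reduce
# conflict that a lookahead-free parser resolves by forking:
#   S -> A 'c' | B 'd'     A -> 'a'     B -> 'a'
GRAMMAR = {
    ("a",): ["A", "B"],      # the conflict: 'a' reduces to A or to B
    ("A", "c"): ["S"],
    ("B", "d"): ["S"],
}

def reductions(stack):
    """Yield every stack reachable from `stack` via reduce moves."""
    yield stack
    for rhs, lhss in GRAMMAR.items():
        if tuple(stack[-len(rhs):]) == rhs:
            for lhs in lhss:
                yield from reductions(stack[:-len(rhs)] + [lhs])

def parse(tokens):
    stacks = [[]]                       # the set of live forks
    for tok in tokens:
        # shift `tok` onto every fork, then fork again on each possible
        # reduction; stacks that lead nowhere die on the next iteration
        stacks = [s for st in stacks for s in reductions(st + [tok])]
    return [st for st in stacks if st == ["S"]]

assert parse(list("ac")) == [["S"]]     # resolved via A
assert parse(list("ad")) == [["S"]]     # resolved via B
assert parse(list("ab")) == []          # no surviving fork
```

Here the conflict is reduce/reduce; the shift/reduce case works analogously, and a real implementation would of course fork automaton states, not raw symbol stacks.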
Forking at this point would merely be an implementation trick to buy the
powerful feature of context-aware scanning. Those forks would be very
short-lived though, as they would usually already be collapsed again in
the next step. So the theoretical overhead would probably be O(c n^3)
instead of O(n^3), and assuming that c is constant, that overhead would
probably be negligible.

What I am wondering, independent of implementation aspects: would there be
an automated way (from a theoretical POV) to identify the minimum required
value of k (i.e. the number of lookahead tokens) for _individual_ parts of
the language, allowing a variable k within one language, in order to find
a good compromise between runtime efficiency and flexibility while still
being easy to use? For that to happen, a specific value of k would need to
be assigned to each parser state, but the total number of parser states
(and with it, the transitions to other states) in turn depends on the
value of k, so that's probably where the dog bites its own tail, right?
Another option would be to automatically identify at least a global
minimum k for the entire language instead. Yet another option would be for
the user to define k (in some way) for individual language parts in the
BNF, but an automated way would obviously be much more beneficial for user
friendliness and reliability.

> >> Yes, indeed. That's something which Joel detailed in his PhD. Note
> >> though that PSLR targets deterministic parsing. Bison would (first)
> >> target that, not GLR. But contributions are always welcome :)
> >
> > The longer I think about it, the more I wonder whether it wouldn't
> > make sense to make a clear and radical cut with the past when
> > developing a 'married' parser, to allow unfolding its full potential.
> >
> > I mean, I look at what you are working on, Akim, and AFAICS the
> > majority of time you have to spend on maintaining (i.e. not breaking)
> > backward compatibility for ancient use cases.
> > Most of Bison's conceptional design and internals are still based on
> > requirements and opinions from around 1980 which simply no longer
> > exist or have proved wrong.
>
> I don't subscribe to this opinion :) The core of Bison is well
> established, and I have very little maintenance to do on it. As a
> matter of fact, if you look at the recent history, there's hardly any
> work in the automaton construction. Most of my recent efforts are about
> diagnostics. Diagnostics about the grammar (which is also what the
> current work on counterexamples is about), and in the generated parsers
> (to provide better error messages).

I did not mean the core automaton of Bison. I simply had the impression
that when it comes to new features, the large number of existing Bison
use cases and all their differences in individual Bison configurations
("explosions") add such maintenance complexity that it seems to
significantly constrain (or at least vastly slow down) the potential
development of new Bison features.

> Many many people *want* a *deterministic* and *fast* parser.
>
> They want determinism because with GLR (with conflicts) you just don't
> know if some day your parser will fail because of an unforeseen
> ambiguity. I

With 'fail' in regard to GLR, what exactly do you mean: memory
exhaustion, the sequence of user actions being executed in the wrong
order, or something else?

> know people who build critical software with Bison, and "possible
> failure" on valid input is not an option. Bison is strong in this
> regard. IELR is arguably the most powerful form of LR(1) parsing you
> can find on the market, with the size of LR(0). And you can have
> canonical LR(1) if you want.

I am not questioning that designing a language to fit into LALR(1) makes
sense, for the reasons you gave. But the reality is also a 'bit'
different, as many languages are not truly LALR(1) with context-free
tokens to their full extent.
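To illustrate what I mean, here is a hypothetical sketch (invented, not taken from any real scanner) of the kind of context-dependent tokenization that cannot be expressed with context-free tokens, and which typically forces state hacks onto the scanner side:

```python
# Hypothetical sketch: the scanner keeps its own hand-maintained mode
# stack because '>' means different things depending on context (think
# of C++'s template argument lists).
def tokenize(src):
    modes = []                        # scanner-side state stack (the hack)
    tokens = []
    for c in src:
        if c == "<":
            modes.append("tmpl")      # guess: a template list opens here
            tokens.append("TMPL_OPEN")
        elif c == ">" and modes:
            modes.pop()               # context says: close the template
            tokens.append("TMPL_CLOSE")
        elif c == ">":
            tokens.append("GT")       # no context: plain greater-than
        else:
            tokens.append(("ID", c))
    return tokens

# The same character yields different tokens depending on scanner state:
assert tokenize("a<b>") == [("ID", "a"), "TMPL_OPEN", ("ID", "b"), "TMPL_CLOSE"]
assert tokenize("a>b") == [("ID", "a"), "GT", ("ID", "b")]
```

The fragile part is the guess at '<': exactly such guesses are what tends to break determinism in practice.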
So what happens is that people usually resort to hacks to fit their
language into that concept (e.g. pushing/popping states on the scanner
side, bloated tokenization rules, hand-crafted procedural tokenization
code e.g. in C, etc.). And that's where determinism has often proved to
break, and it is often hard to fix (manually) and to maintain as well, as
there is always a danger of other edge cases still being present. I still
prefer a machine handling such complexity over relying on humans to
understand and handle a large number of possibilities reliably. Because
the larger the number of possibilities, the higher the chance the human
loses that battle against the machine.

> GLR is slow compared to LR. People rely on our generated LR parsers
> being fast. That's a true feature of Bison's. Not only does determinism
> provide people with guarantees ("your grammar is unambiguous"), it also
> provides them with efficiency.
>
> Although I don't have figures about "blind GLR" (i.e., seated on
> LR(0)), I'm pretty sure it would be disastrous for anything serious. So
> you'd have to go at least to SLR(1). But then, what would be the point
> of throwing away all the good technology we have for LALR(1), IELR(1)
> and canonical LR(1)?
>
> If I were to build a GLR parser from scratch today, I would certainly
> aim at SLR(1). You are right, LALR(1) is a nightmare, but Robert
> Corbett fought that fight some 40 years ago, and IELR(1) was certainly
> super tricky, but Joel E. Denny won that battle more than 10 years ago.
> Why would I discard such great achievements?

To sort out a misapprehension: I was not suggesting to ditch LALR(1)
and/or Bison 3.x for good. Not at all. They are still the most important
parsers for good reasons and will remain so. But I do think that after 40
years, it is legitimate to consider conceptual alternatives alongside
them. Simply comparing O(n) vs. O(n^3), which was taken as the decisive
reason 40 years ago, should not be a reason to immediately dismiss
questioning the status quo today. Those are worst-case figures anyway.

For instance, the concept of deep neural networks (AI) also existed
decades ago, and despite what individual companies claim, the theoretical
concepts have not really improved much since then. Still, we now see
real-life applications of them everywhere, simply because today we have
the appropriate hardware to run them, despite e.g. still having the same
horrible complexity/inefficiency of training such networks.

Bison 4 is probably still far ahead. So some discussions in the meantime
could be useful. I am also considering writing some proof-of-concept
code, toying around with some (core) concept ideas, and seeing where it
goes. Maybe it just ends up in the trash, maybe not. We will see.

> > Besides what we already discussed, I think it would make sense to get
> > rid of a bunch of other things that Bison still retains:
> >
> > 1. No longer multiple parser types, only one: a somewhat GLR-alike
> > parser like discussed, as a real superset of LALR, but without a
> > lookahead token (so not really LALR-based under the hood).
> > Theoretically there is no disadvantage in doing so, because if a
> > developer feeds it with an LALR(1)-capable grammar, it would still end
> > up with the same parsing complexity=efficiency as an old-school LALR
> > parser would, probably just with a slightly higher memory footprint.
> >
> > Advantages though: much more user friendly and powerful, and a lot
> > less code to maintain.
>
> I bet we are talking about making the parsers at least one order of
> magnitude slower here (blind GLR). Of course this is not an option.
>
> Besides, if people want to look for brand new trendy techniques, why
> would they come to Bison? There are plenty of alternatives out there,
> and if

For instance?
I don't see many parser generators having popped up in recent years. And
AFAICS they did not really add new conceptual core features, except maybe
for some fixed, higher (predefined) k. The rest is more about things
'around' the core, like support for specific programming languages, or
EBNF instead of BNF, etc.

> Bison still exists today, it's because there's already tons of software
> that rely on its current behavior. I will *not* break that. *Never*.
> That's our "raison d'être" (some English native should check my use of
> French expressions, I'm not sure I'm using it properly :-).
>
> I can add *more* options, provide better alternatives, but Bison cannot
> afford to drop features just like that.

Please do, I would appreciate it if you shared your ideas!

> > 2. No longer raw C tables as pregenerated parser and scanner tables;
> > instead of such C tables: an intermediate, dynamic in-memory model
> > with a high-level API allowing to access and modify grammar and
> > patterns on the fly.
> >
> > Advantage: it would allow self-extensible languages (e.g. an input
> > that adds keywords or patterns, drops or adds grammar rules while
> > being parsed). And implementation-specific things would finally be
> > clearly isolated by introducing such an API -> easier to maintain.
> >
> > Disadvantages: probably none, except consuming slightly more memory
> > for meta information.
>
> This is interesting, but that's not Bison you are talking about.

Yeah, I know, it would be too different.

> > 3. No longer aggressive compression of parser states (which is
> > somewhat already implied by [2.]), because such compressions lead to
> > important information loss when it comes to grammar debugging, auto
> > completion features and other user-relevant purposes. The motivation
> > that led to such compressions (very low RAM space) no longer exists
> > in >99% of use cases today.
>
> Come on!
> We all know that today the very same constraints ("be small") still
> apply, but indeed for completely different reasons. Compressing used to
> be about fitting in the RAM, today it's about fitting in the cache.

Today's memory optimization strategy is usually not trying to fit
everything into the cache, but rather doing cache-line optimizations:
aligning data to certain memory boundaries, choosing appropriate layouts
for data structures (e.g. appropriate dimensions for matrices), and
taking care that code portions which execute for a longer period of time
only access a limited set of cache lines during such individual time
slices. Whether these considerations make sense for a parser generator on
modern (and not low-end) devices is questionable though, as it probably
does not affect the overall efficiency of applications much in practice.

> Not to mention lower-end devices.

Low-end devices would probably still use LALR(1) instead, yes. They
'could' benefit from this as well though: if their language is mostly
LALR(1), with probably only a few, limited exceptions here and there,
they could gain reduced maintenance complexity (e.g. while their language
is still evolving) without (or with little) negative runtime impact.

> I don't believe in the Unique True Omnipotent Parser Yielder (let's
> call it UTOPY for short). There are far too many different use cases.

Yep, I totally agree with that. Bison, as it is now, has and will keep
its place, but I can also imagine that many use cases might prefer
another solution in the future. Because in many applications today, the
runtime efficiency of the core parser is not the biggest factor in the
application's overall runtime. For instance, a modern compiler spends
most of its execution time in its backend, or at least in general on
tasks after an initial AST has become available.
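Coming back to point 2 of my list for a moment, here is a hypothetical sketch (API names invented by me, nothing that exists in Bison or elsewhere) of such a dynamic in-memory grammar model:

```python
# Hypothetical sketch of a dynamic in-memory grammar model: rules live
# in a mutable structure instead of pregenerated C tables, so the
# accepted language can change at runtime. All names are invented.
class DynamicGrammar:
    def __init__(self):
        self.rules = {}                     # lhs -> list of rhs tuples

    def add_rule(self, lhs, rhs):
        self.rules.setdefault(lhs, []).append(tuple(rhs))

    def drop_rule(self, lhs, rhs):
        self.rules[lhs].remove(tuple(rhs))

    def derives(self, sym, toks):
        """True if `sym` derives exactly the token sequence `toks`
        (naive exhaustive search; fine for a sketch, not production)."""
        if sym not in self.rules:
            return list(toks) == [sym]      # terminal symbol
        return any(self._seq(rhs, toks) for rhs in self.rules[sym])

    def _seq(self, rhs, toks):
        if not rhs:
            return not toks
        return any(self.derives(rhs[0], toks[:cut])
                   and self._seq(rhs[1:], toks[cut:])
                   for cut in range(len(toks) + 1))

g = DynamicGrammar()
g.add_rule("S", ["a"])
assert g.derives("S", ["a"]) and not g.derives("S", ["b"])
g.add_rule("S", ["b"])              # the language grew at runtime
assert g.derives("S", ["b"])
```

The recognizer is deliberately naive; the point is only that the rule set is a mutable runtime structure instead of a pregenerated table.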
I rather got the impression that many companies today have a different
problem: acquiring personnel capable of dealing with LALR(1) parser
generators at all. I also see many developers still writing all their
parsers manually, because, as they told me, they found using generators
too hard. Those would probably be the ones profiting the most. But there
would be many other use cases: prototyping, CI tools, natural language
processing, ...

> Some want to change the language dynamically, some want determinism,
> some want something DOM-oriented, not SAX-like, some have strong
> CPU/time constraints, some don't care, some experiment while others
> need to be production-ready, etc., etc., etc.
>
> Bison fits into a niche, and it would be an error to depart from this.

I wouldn't call it a 'niche'. That would still be a big niche, frankly.
;-)

Best regards,
Christian Schoenebeck