On Saturday, 18 July 2020 08:47:07 CEST Akim Demaille wrote:
> > On 14 July 2020 at 13:31, Christian Schoenebeck
> > <schoeneb...@crudebyte.com> wrote:
> >
> > On Monday, 13 July 2020 07:56:57 CEST Akim Demaille wrote:
> >
> > For the 'married' parser, in the form imagined by me, there would be no
> > lookahead token, as a lookahead token implies a context-unaware scanner.
> >
> > Instead, when the parser would get to a state where it had to decide
> > between reducing or shifting (depending on a next token), it would not
> > decide and fork the parser state instead, in order to allow
> > context-specific tokenization.
>
> That's indeed a possible choice in the implementation of GLR parser
> generators: to sit GLR on top of LR(0). I've never seen performance
> reports that compare both approaches though.
>
> However, most implementations use SLR(1), because it is super simple to
> implement (especially compared to LALR(1)), yet makes a significant
> difference.
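To make the fork-and-collapse idea more concrete, here is a little toy sketch of mine (Python, invented grammar, nothing resembling real Bison internals): at a conflict the parser forks its stack instead of consulting a lookahead token, and forks that cannot proceed simply die on the next token.

```python
# Toy sketch (hypothetical, not Bison): grammar with a reduce/reduce
# conflict that a lookahead-free parser resolves by forking:
#   S -> A 'c' | B 'd'     A -> 'a'     B -> 'a'
GRAMMAR = {
    ("a",): ["A", "B"],      # the conflict: 'a' reduces to A or to B
    ("A", "c"): ["S"],
    ("B", "d"): ["S"],
}

def reductions(stack):
    """Yield every stack reachable from `stack` via reduce moves."""
    yield stack
    for rhs, lhss in GRAMMAR.items():
        if tuple(stack[-len(rhs):]) == rhs:
            for lhs in lhss:
                yield from reductions(stack[:-len(rhs)] + [lhs])

def parse(tokens):
    stacks = [[]]                       # the set of live forks
    for tok in tokens:
        # shift `tok` onto every fork, then fork again on each possible
        # reduction; stacks that lead nowhere die on the next iteration
        stacks = [s for st in stacks for s in reductions(st + [tok])]
    return [st for st in stacks if st == ["S"]]

assert parse(list("ac")) == [["S"]]     # resolved via A
assert parse(list("ad")) == [["S"]]     # resolved via B
assert parse(list("ab")) == []          # no surviving fork
```

Here the conflict is reduce/reduce; the shift/reduce case works analogously, and a real implementation would of course fork automaton states, not raw symbol stacks.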
Forking at this point would merely be an implementation trick to buy the
powerful feature of context-aware scanning. Those forks would be very
short-lived though, as they would usually already be collapsed again in
the next step. So the theoretical overhead would probably be O(c n^3)
instead of O(n^3), and assuming that c is constant, that overhead would
probably be negligible.

What I am wondering, independent of implementation aspects: would there be
an automated way (from a theoretical POV) to identify the minimum required
value of k (i.e. the number of lookahead tokens) for _individual_ parts of
the language, allowing a variable k within one language, in order to find
a good compromise between runtime efficiency and flexibility while still
being easy to use? For that to happen, a specific value of k would need to
be assigned to each parser state, but the total number of parser states
(and with it, the transitions to other states) in turn depends on the
value of k, so that's probably where the dog bites its own tail, right?
Another option would be to automatically identify at least a global
minimum k for the entire language instead. Yet another option would be for
the user to define k (in some way) for individual language parts in the
BNF, but an automated way would obviously be much more beneficial for user
friendliness and reliability.

> >> Yes, indeed. That's something which Joel detailed in his PhD. Note
> >> though that PSLR targets deterministic parsing. Bison would (first)
> >> target that, not GLR. But contributions are always welcome :)
> >
> > The longer I think about it, the more I wonder whether it wouldn't
> > make sense to make a clear and radical cut with the past when
> > developing a 'married' parser, to allow unfolding its full potential.
> >
> > I mean, I look at what you are working on, Akim, and AFAICS the
> > majority of time you have to spend on maintaining (i.e. not breaking)
> > backward compatibility for ancient use cases.
> > Most of Bison's conceptional design and internals are still based on
> > requirements and opinions from around 1980 which simply no longer
> > exist or have proved wrong.
>
> I don't subscribe to this opinion :) The core of Bison is well
> established, and I have very little maintenance to do on it. As a
> matter of fact, if you look at the recent history, there's hardly any
> work in the automaton construction. Most of my recent efforts are about
> diagnostics. Diagnostics about the grammar (which is also what the
> current work on counterexamples is about), and in the generated parsers
> (to provide better error messages).

I did not mean the core automaton of Bison. I simply had the impression
that when it comes to new features, the large number of existing Bison
use cases and all their differences in individual Bison configurations
("explosions") add such maintenance complexity that it seems to
significantly constrain (or at least vastly slow down) the potential
development of new Bison features.

> Many many people *want* a *deterministic* and *fast* parser.
>
> They want determinism because with GLR (with conflicts) you just don't
> know if some day your parser will fail because of an unforeseen
> ambiguity. I

With 'fail' in regard to GLR, what exactly do you mean: memory
exhaustion, the sequence of user actions being executed in the wrong
order, or something else?

> know people who build critical software with Bison, and "possible
> failure" on valid input is not an option. Bison is strong in this
> regard. IELR is arguably the most powerful form of LR(1) parsing you
> can find on the market, with the size of LR(0). And you can have
> canonical LR(1) if you want.

I am not questioning that designing a language to fit into LALR(1) makes
sense, for the reasons you gave. But the reality is also a 'bit'
different, as many languages are not truly LALR(1) with context-free
tokens to their full extent.
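To illustrate what I mean, here is a hypothetical sketch (invented, not taken from any real scanner) of the kind of context-dependent tokenization that cannot be expressed with context-free tokens, and which typically forces state hacks onto the scanner side:

```python
# Hypothetical sketch: the scanner keeps its own hand-maintained mode
# stack because '>' means different things depending on context (think
# of C++'s template argument lists).
def tokenize(src):
    modes = []                        # scanner-side state stack (the hack)
    tokens = []
    for c in src:
        if c == "<":
            modes.append("tmpl")      # guess: a template list opens here
            tokens.append("TMPL_OPEN")
        elif c == ">" and modes:
            modes.pop()               # context says: close the template
            tokens.append("TMPL_CLOSE")
        elif c == ">":
            tokens.append("GT")       # no context: plain greater-than
        else:
            tokens.append(("ID", c))
    return tokens

# The same character yields different tokens depending on scanner state:
assert tokenize("a<b>") == [("ID", "a"), "TMPL_OPEN", ("ID", "b"), "TMPL_CLOSE"]
assert tokenize("a>b") == [("ID", "a"), "GT", ("ID", "b")]
```

The fragile part is the guess at '<': exactly such guesses are what tends to break determinism in practice.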
So what happens is that people usually resort to hacks to fit their
language into that concept (e.g. pushing/popping states on the scanner
side, bloated tokenization rules, hand-crafted procedural tokenization
code e.g. in C, etc.). And that's where determinism has often proved to
break, and it is often hard to fix (manually) and to maintain as well, as
there is always a danger of other edge cases still being present. I still
prefer a machine handling such complexity over relying on humans to
understand and handle a large number of possibilities reliably. Because
the larger the number of possibilities, the higher the chance the human
loses that battle against the machine.

> GLR is slow compared to LR. People rely on our generated LR parsers
> being fast. That's a true feature of Bison's. Not only does determinism
> provide people with guarantees ("your grammar is unambiguous"), it also
> provides them with efficiency.
>
> Although I don't have figures about "blind GLR" (i.e., seated on
> LR(0)), I'm pretty sure it would be disastrous for anything serious. So
> you'd have to go at least to SLR(1). But then, what would be the point
> of throwing away all the good technology we have for LALR(1), IELR(1)
> and canonical LR(1)?
>
> If I were to build a GLR parser from scratch today, I would certainly
> aim at SLR(1). You are right, LALR(1) is a nightmare, but Robert
> Corbett fought that fight some 40 years ago, and IELR(1) was certainly
> super tricky, but Joel E. Denny won that battle more than 10 years ago.
> Why would I discard such great achievements?

To sort out a misapprehension: I was not suggesting to ditch LALR(1)
and/or Bison 3.x for good. Not at all. They are still the most important
parsers for good reasons and will remain so. But I do think that after 40
years, it is legitimate to consider conceptual alternatives alongside
them. Simply comparing O(n) vs. O(n^3), which was taken as the decisive
reason 40 years ago, should not be a reason to immediately dismiss
questioning the status quo today. Those are worst-case figures anyway.

For instance, the concept of deep neural networks (AI) also existed
decades ago, and despite what individual companies claim, the theoretical
concepts have not really improved much since then. Still, we now see
real-life applications of them everywhere, simply because today we have
the appropriate hardware to run them, despite e.g. still having the same
horrible complexity/inefficiency of training such networks.

Bison 4 is probably still far ahead. So some discussions in the meantime
could be useful. I am also considering writing some proof-of-concept
code, toying around with some (core) concept ideas, and seeing where it
goes. Maybe it just ends up in the trash, maybe not. We will see.

> > Besides what we already discussed, I think it would make sense to get
> > rid of a bunch of other things that Bison still retains:
> >
> > 1. No longer multiple parser types, only one: a somewhat GLR-alike
> > parser like discussed, as a real superset of LALR, but without a
> > lookahead token (so not really LALR-based under the hood).
> > Theoretically there is no disadvantage in doing so, because if a
> > developer feeds it with an LALR(1)-capable grammar, it would still end
> > up with the same parsing complexity=efficiency as an old-school LALR
> > parser would, probably just with a slightly higher memory footprint.
> >
> > Advantages though: much more user friendly and powerful, and a lot
> > less code to maintain.
>
> I bet we are talking about making the parsers at least one order of
> magnitude slower here (blind GLR). Of course this is not an option.
>
> Besides, if people want to look for brand new trendy techniques, why
> would they come to Bison? There are plenty of alternatives out there,
> and if

For instance?
I don't see many parser generators having popped up in recent years. And
AFAICS they did not really add new conceptual core features, except maybe
for some fixed, higher (predefined) k. The rest is more about things
'around' the core, like support for specific programming languages, or
EBNF instead of BNF, etc.

> Bison still exists today, it's because there's already tons of software
> that rely on its current behavior. I will *not* break that. *Never*.
> That's our "raison d'être" (some English native should check my use of
> French expressions, I'm not sure I'm using it properly :-).
>
> I can add *more* options, provide better alternatives, but Bison cannot
> afford to drop features just like that.

Please do, I would appreciate it if you shared your ideas!

> > 2. No longer raw C tables as pregenerated parser and scanner tables;
> > instead of such C tables: an intermediate, dynamic in-memory model
> > with a high-level API allowing to access and modify grammar and
> > patterns on the fly.
> >
> > Advantage: it would allow self-extensible languages (e.g. an input
> > that adds keywords or patterns, drops or adds grammar rules while
> > being parsed). And implementation-specific things would finally be
> > clearly isolated by introducing such an API -> easier to maintain.
> >
> > Disadvantages: probably none, except consuming slightly more memory
> > for meta information.
>
> This is interesting, but that's not Bison you are talking about.

Yeah, I know, it would be too different.

> > 3. No longer aggressive compression of parser states (which is
> > somewhat already implied by [2.]), because such compressions lead to
> > important information loss when it comes to grammar debugging, auto
> > completion features and other user-relevant purposes. The motivation
> > that led to such compressions (very low RAM space) no longer exists
> > in >99% of use cases today.
>
> Come on!
> We all know that today the very same constraints ("be small") still
> apply, but indeed for completely different reasons. Compressing used to
> be about fitting in the RAM, today it's about fitting in the cache.

Today's memory optimization strategy is usually not trying to fit
everything into the cache, but rather doing cache-line optimizations:
aligning data to certain memory boundaries, choosing appropriate layouts
for data structures (e.g. appropriate dimensions for matrices), and
taking care that code portions which execute for a longer period of time
only access a limited set of cache lines during such individual time
slices. Whether these considerations make sense for a parser generator on
modern (and not low-end) devices is questionable though, as it probably
does not affect the overall efficiency of applications much in practice.

> Not to mention lower-end devices.

Low-end devices would probably still use LALR(1) instead, yes. They
'could' benefit from this as well though: if their language is mostly
LALR(1), with probably only a few, limited exceptions here and there,
they could gain reduced maintenance complexity (e.g. while their language
is still evolving) without (or with little) negative runtime impact.

> I don't believe in the Unique True Omnipotent Parser Yielder (let's
> call it UTOPY for short). There are far too many different use cases.

Yep, I totally agree with that. Bison, as it is now, has and will keep
its place, but I can also imagine that many use cases might prefer
another solution in the future. Because in many applications today, the
runtime efficiency of the core parser is not the biggest factor in the
application's overall runtime. For instance, a modern compiler spends
most of its execution time in its backend, or at least in general on
tasks after an initial AST has become available.
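Coming back to point 2 of my list for a moment, here is a hypothetical sketch (API names invented by me, nothing that exists in Bison or elsewhere) of such a dynamic in-memory grammar model:

```python
# Hypothetical sketch of a dynamic in-memory grammar model: rules live
# in a mutable structure instead of pregenerated C tables, so the
# accepted language can change at runtime. All names are invented.
class DynamicGrammar:
    def __init__(self):
        self.rules = {}                     # lhs -> list of rhs tuples

    def add_rule(self, lhs, rhs):
        self.rules.setdefault(lhs, []).append(tuple(rhs))

    def drop_rule(self, lhs, rhs):
        self.rules[lhs].remove(tuple(rhs))

    def derives(self, sym, toks):
        """True if `sym` derives exactly the token sequence `toks`
        (naive exhaustive search; fine for a sketch, not production)."""
        if sym not in self.rules:
            return list(toks) == [sym]      # terminal symbol
        return any(self._seq(rhs, toks) for rhs in self.rules[sym])

    def _seq(self, rhs, toks):
        if not rhs:
            return not toks
        return any(self.derives(rhs[0], toks[:cut])
                   and self._seq(rhs[1:], toks[cut:])
                   for cut in range(len(toks) + 1))

g = DynamicGrammar()
g.add_rule("S", ["a"])
assert g.derives("S", ["a"]) and not g.derives("S", ["b"])
g.add_rule("S", ["b"])              # the language grew at runtime
assert g.derives("S", ["b"])
```

The recognizer is deliberately naive; the point is only that the rule set is a mutable runtime structure instead of a pregenerated table.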
I rather got the impression that many companies today have a different
problem: acquiring personnel capable of dealing with LALR(1) parser
generators at all. I also see many developers still writing all their
parsers manually, because, as they told me, they found using generators
too hard. Those would probably be the ones profiting the most. But there
would be many other use cases: prototyping, CI tools, natural language
processing, ...

> Some want to change the language dynamically, some want determinism,
> some want something DOM-oriented, not SAX-like, some have strong
> CPU/time constraints, some don't care, some experiment while others
> need to be production-ready, etc., etc., etc.
>
> Bison fits into a niche, and it would be an error to depart from this.

I wouldn't call it a 'niche'. That would still be a big niche, frankly.
;-)

Best regards,
Christian Schoenebeck