(Following up from 
https://twitter.com/jeffreykegler/status/1097628806091816960)

Dear Jeffrey,

I was delighted when I found out about your parsing timeline a couple of
months ago. I'm a PhD student researching parsing myself, and it's great to
see resources like this bringing awareness to what happened in the field; it
also makes a great reference for people in the field. It got me more
interested in Earley parsing and your own work on it, which is quite cool.

That being said, I do have some reservations about the timeline, especially
its end. It's always easy to criticize, so I would like to offer my help
instead, and see if I can contribute something useful to the timeline.

In my view, there are two points that could use improvement. First, I feel
that top-down recursive-descent formalisms, especially PEG, are being
unfairly dismissed. Second, it seems to me that quite a few new developments
deserve inclusion at the end of the timeline (especially by comparison to the
level of detail towards its start).

I don't pretend to have the objective truth, but the goal is to have a
productive conversation that can hopefully lead to improved accuracy.

---

Regarding PEG, I find the whole description minus the first paragraph to be
rather contestable and (I feel) uncharitable.

It's entirely correct that PEG does not cover BNF. But BNF is essentially a
notation for the CFG formalism, and PEG is a different formalism altogether.

It seems your main beef with PEG is the problem that has been described as
"language hiding" or "prefix capture": the fact that a rule like
`A ::= a | aa` will not match the input "aa".
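
To make that concrete, here is a minimal sketch of PEG-style ordered choice,
written for this email (my own illustration, not tied to any particular PEG
tool), showing how the first successful alternative "captures" the prefix:

    # PEG-style ordered choice: commit to the first alternative that succeeds.
    # Each parser takes the input and a position; it returns the new position
    # on success, or None on failure.

    def lit(s):
        def parse(inp, pos):
            return pos + len(s) if inp.startswith(s, pos) else None
        return parse

    def choice(*alts):
        def parse(inp, pos):
            for alt in alts:
                end = alt(inp, pos)
                if end is not None:
                    return end   # committed: later alternatives are never tried
            return None
        return parse

    # PEG reading of `A ::= a | aa`, i.e. A <- "a" / "aa"
    A = choice(lit("a"), lit("aa"))

    end = A("aa", 0)
    print(end)        # 1 -- the first alternative matched, capturing the prefix
    print(end == 2)   # False -- "aa" as a whole is not matched, even though
                      # the CFG rule A ::= a | aa derives it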

This is indeed an important problem with PEGs. But the way I see it, it is
the dual of the problem of ambiguity in CFGs. Each formalism trades one issue
for the other: PEGs are unambiguous but suffer from prefix capture, while
CFGs do not suffer from prefix capture but can be ambiguous. Statically
detecting either issue is provably intractable, and the two may be more
related than we know, since relatively similar machinery can be put to work
to partially detect both ambiguity in CFGs and prefix capture in PEGs, cf.
"Modular Syntax Demands Verification" by Sylvain Schmitz (*).

(*) http://www.i3s.unice.fr/~mh/RR/2006/RR-06.32-S.SCHMITZ.pdf

I have written a fair amount of both PEGs and CFGs, and the two seem about
equally difficult to write. Non-general CFG formalisms (LALR, LL) are much
harder, and so is PEG when the tool offers no support for building
left-associative parse trees.

I also would never advise anyone not to test a grammar, whatever the
formalism. I've certainly never written a big chunk of grammar without making
mistakes, whatever the formalism. And in practice, the overwhelming majority
of those mistakes had to do with neither prefix capture nor ambiguity.

If your practical experience differs, I'd be interested to hear about it.

---

Regarding new developments, here are a couple of ideas off the top of my
head. The first two items of the list are not random: they are the things I
worked on in my own research (http://norswap.com/publications/).

- Left-recursion and left-associativity handling in PEG, since it fixes the
  foremost problem (in my opinion) with PEGs

- Context-sensitive parsing

  You already mention monadic parsing (but do not emphasize its
  context-sensitive properties). Another oldie that deserves a mention is
  Prolog-style definite clause grammars (DCGs).

  There are some other works worth mentioning if this is a topic of interest
  for you.

  My own work in the area essentially pushes an idea which has been expressed
  in a lovely manner:

  > Recursive descent does have a huge advantage, one which, despite its
  > severe limitations, will save it from obsolescence time and again.
  > Hand-written recursive descent is essentially calling subroutines. Adding
  > custom modification to recursive descent is very straight-forward.

  The twist is to make it so that these subroutines may manipulate state in a
  way that is safe in the presence of backtracking, which enables
  context-sensitive parsing (see the small sketch after this list).

- Packrat parsing, if only because it enables linear-time (O(n)) parsing of
  PEG, a class of languages that may well strictly include all of the
  context-free languages: as far as I know, we do not know of a simple
  language that can be expressed as a CFG but not as a PEG (whereas we do
  know of languages expressible as a PEG but not as a CFG).

  Note that, even if the PEG languages do include all of the CFG languages,
  that does not imply the existence of an algorithm to convert one formalism
  into the other.

- Tackling the problem of ambiguity formally,
  "Disambiguating Grammars with Tree Automata"
  https://michaeldadams.org/papers/disambiguating-grammars/

- GLL parsing probably deserves a mention for taking LL-style machinery to
  the point of finally parsing fully general CFGs, through the use of
  Tomita-style graph-structured stacks:
  http://www.cs.rhul.ac.uk/research/languages/csle/GLLparsers.html

- Similarly, and while I'm not a fan of the way this work is marketed,
  parsing using the Brzozowski derivative is certainly an interesting
  direction: http://matt.might.net/articles/parsing-with-derivatives/

- This might be outside the scope of your timeline, but there is also a whole
  lot of work (some quite old) on "semi-parsing", i.e. fuzzy or partial
  parsing: http://grammarware.net/text/2014/semiparsing.pdf
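
Coming back to the context-sensitive parsing item above, here is a minimal
sketch of the idea, written for this email (an illustration of the principle
only, not the code of any existing tool; all the names in it are made up).
State changes go through an undo log, so that when an ordered choice
backtracks, it rolls back the input position and the context together:

    # A recursive-descent parser whose subroutines may mutate parse state.
    # Every change is recorded in an undo log, so backtracking can roll back
    # both the input position and the context.

    class Parser:
        def __init__(self, inp):
            self.inp = inp
            self.pos = 0
            self.declared = set()  # context: names declared so far
            self.log = []          # undo log of state changes

        def declare(self, name):
            if name not in self.declared:
                self.declared.add(name)
                self.log.append(lambda: self.declared.discard(name))

        def snapshot(self):
            return (self.pos, len(self.log))

        def restore(self, snap):
            pos, log_len = snap
            while len(self.log) > log_len:  # undo the newest changes first
                self.log.pop()()
            self.pos = pos

        def literal(self, s):
            if self.inp.startswith(s, self.pos):
                self.pos += len(s)
                return True
            return False

        def name(self):
            start = self.pos
            while self.pos < len(self.inp) and self.inp[self.pos].isalpha():
                self.pos += 1
            return self.inp[start:self.pos] or None

        def choice(self, *alternatives):
            for alt in alternatives:
                snap = self.snapshot()
                if alt():
                    return True
                self.restore(snap)  # undoes position AND context changes
            return False

    # The first alternative declares a name, then requires a trailing "!"; on
    # the input below it fails, so the parser backtracks out of it and must
    # also undo the declaration it made along the way.

    def exported_decl(p):
        if not p.literal("decl "):
            return False
        n = p.name()
        if n is None:
            return False
        p.declare(n)
        return p.literal("!")

    def plain_decl(p):
        return p.literal("decl ") and p.name() is not None

    p = Parser("decl x")
    ok = p.choice(lambda: exported_decl(p), lambda: plain_decl(p))
    print(ok, p.declared)  # True set() -- plain_decl never declares anything,
                           # so the empty set shows that the declaration made
                           # by the failed first alternative was rolled back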

---

There you have it. Let me know what you think, and I'm looking forward to
talking with you! I hope I can help the project in whatever capacity I'm
able.

Cheers,

Nicolas LAURENT
