parsing, PEG parsers, and backtracking HTML templating: toward a bright parsing future (examples in OCaml, C++, and Ruby)

Kragen Javier Sitaker Mon, 16 Apr 2012 00:38:36 -0700

On Fri, Apr 13, 2012 at 10:19:36PM -0400, Kragen Javier Sitaker wrote (on
kragen-discuss):
> No parser yet, but ...


(in
<http://lists.canonical.org/pipermail/kragen-discuss/2012-April/001227.html>,
referring to
<http://lists.canonical.org/pipermail/kragen-tol/2012-April/000951.html>.)

So this isn’t a parser either, but it’s some directions for parsing
research to go in.

A PEG for the backtracking HTML templating language
---------------------------------------------------

Here’s a PEG grammar for the template language, leaving out the 
`<if a == b>` construct for now until I can figure out how `a` and `b`
should be delimited. I wrote it down in my notebook waiting for the
bus home on Saturday night:

    doc <- seq loop?.
    loop <- loopstart doc ('<,>' doc)?.
    seq <- alt , '<;>'.
    alt <- jux , '<|>'.
    jux <- atom*.
    atom <- '<{>' doc '<}> / var / lit.

    lit <- ('\\' byte / !special byte)+.
    special <- '<' [{},;|] '>' / var / loopstart.

    var <- '$' name.
    loopstart <- '<@' name '>'.
    name <- (!asciipunct !space byte)+.

    asciipunct <- [][\\'".,{|}^~<>!?():;&%@`#$+/*=-]. # no _
    space <- [ \n\r\t].

Here, by `x , y`, I mean `x (y x)*`; this function of `,` in this PEG
is very similar to the function of `<,>` in the template language.
Also, `'\\'` means a single backslash, as usual. `?`, `*`, and `+`
have their usual regexp meanings (optional, optional and repeated, and
repeated, respectively) except that they only maintain one
backtracking position.

I think this grammar has the right precedence and associativity, and
should avoid the exponential-time worst case of PEGs on all inputs,
even with a naïve nonmemoizing (i.e. non-Packrat) implementation.  At
least the obvious cases of inefficiency in my first attempt have been
removed.

It’d be pretty awesome to have something that could take the above
grammar and produce some kind of usable parse tree from it, without
filthying up the host-language-independent grammar with semantic
actions.  That would allow you to write your target-language grammar
once, and then reuse it in many different host languages and for
applications with different requirements.

Automatically building a parse tree in OCaml
--------------------------------------------

Ideally the parse tree would even be statically typed:

    type doc = Doc of seq * loop option
    and loop = Loop of loopstart * doc * doc option
    and seq = Seq of alt list
    and alt = Alt of jux list
    and jux = Jux of atom list
    and atom = DocAtom of doc | VarAtom of var | LitAtom of lit

    and lit = Lit of byte list

    and var = Var of name
    and loopstart = Loopstart of name
    and name = Name of byte list

    and byte = char
    ;;

The above does parse in OCaml.  The rules I applied to get it:

1. Omit types for productions that only occur in negations, such as special,
   asciipunct, and space.
2. `?` applies "option" to everything inside of it.
3. Literal strings are omitted from the output.
4. Alternatives for a nonterminal are named by prepending the names of the
   contained nonterminals to the name of the nonterminal, plus some other
   undetermined hack if that’s insufficient.
5. `byte = char`.
6. Alternatives with the same set of nonterminals are merged.
7. `+`, `,`, and `*` apply "list" to everything they govern.

This kind of thing is adequate for most of the PEGs I’ve written, and the
others can be adapted to it relatively easily.

Anyway, given that, you can reduce it to the `Alt, `Concat, `Var, `Lit, and
`Loop used by the evaluator I posted at <http://ideone.com/TGqua> with some
straightforward functions, which would look something like this:

    let (++) a b = `Concat(a, b)
    and (|+) a b = `Alt(a, b)
    and e = `Lit "" ;;                  (* ε: empty string *)

    (* Convert a list of bytes to a string. *)
    let string_of bytes =
      let buf = String.create(List.length bytes)
      and remaining = ref bytes in
        for i = 0 to (List.length bytes) - 1 do
          match !remaining with
            byte :: tail -> buf.[i] <- byte; remaining := tail
          | [] -> raise (Invalid_argument buf)
        done;
        buf
    ;;

    let rec cdoc = function
      | Doc(seq, None) -> cseq seq
      | Doc(seq, Some(loop)) -> cseq seq ++ cloop loop
    and cloop = function
      | Loop(Loopstart (Name name), body, None) -> 
        `Loop(string_of name, cdoc body, e)
      | Loop(Loopstart (Name name), body, Some(comma)) -> 
        `Loop(string_of name, cdoc body, cdoc comma)
    and cseq = function
      | Seq [] -> e
      | Seq [alt] -> calt alt
      | Seq (alt :: alts) -> (calt alt |+ e) ++ cseq (Seq alts)
    and calt = function
      | Alt [] -> e
      | Alt [jux] -> cjux jux
      | Alt (jux :: juxes) -> cjux jux |+ calt (Alt juxes)
    and cjux = function
      | Jux [] -> e
      | Jux [atom] -> catom atom
      | Jux (atom :: atoms) -> catom atom ++ cjux (Jux atoms)
    and catom = function
      | DocAtom doc -> cdoc doc
      | VarAtom (Var (Name name)) -> `Var(string_of name)
      | LitAtom (Lit text) -> `Lit(string_of text)
    ;;

That parses and typechecks too. `string_of` probably exists in the
library somewhere.

All of the above seems like substantially more code than would be
optimal, though.  The cases for empty lists (guaranteed by the grammar
not to exist in the `Alt` and `Seq` cases, but included to avoid OCaml
warnings) are particularly annoying; they might go away with the use
of polymorphic variants.

Records with named fields instead of positional tuples: C++
-----------------------------------------------------------

In a language oriented to records rather than tuples, like C++, you’d end up
with something like this instead:

    typedef char Byte;
    struct Doc;
    struct Name { vector< shared_ptr<Byte> > bytes; };
    struct Loopstart { shared_ptr<Name> name; };
    struct Var { shared_ptr<Name> name; };

    struct Lit { vector< shared_ptr<Byte> > bytes; };

    struct Atom { virtual ~Atom(); };
    struct DocAtom: Atom { shared_ptr<Doc> doc; };
    struct VarAtom: Atom { shared_ptr<Var> var; };
    struct LitAtom: Atom { shared_ptr<Lit> lit; };

    struct Jux { vector< shared_ptr<Atom> > atoms; };
    struct Alt { vector< shared_ptr<Jux> > juxes; };
    struct Seq { vector< shared_ptr<Alt> > alts; };
    struct Loop {
      shared_ptr<Loopstart> loopstart;
      shared_ptr<Doc> doc;  
      shared_ptr<Doc> doc2; 
    };
    struct Doc { shared_ptr<Seq> seq;  shared_ptr<Loop> loop; };

This also compiles, at least if you’re `using namespace std;`, #including
`<memory>` and `<vector>`, and compiling with `-std=c++0x` or Boost, but it’s
also a lot more loosely typed.  The difference between `doc` and `doc option`
has become invisible; two of the fifteen pointers in the above are semantically
nullable, but all of them are nullable.

(You have to use RTTI with this approach, since there’s no other way
to treat the different kinds of atoms differently.)

I’ve pluralized, Rails-style, the child nonterminal names that refer
to vectors.  (Maybe I should do the same with nullables?)

I chose to make all of the connections between the types be (smart) pointers.
It would generally be more memory-efficient for them to be by value, so that
`VarAtom` directly contained a `Var` which contained a `Name`, but not all of
them can be; in particular, `Atom` has three variants implemented as derived
structs, so references to it can’t be by value unless we turn it into a
`union`, and the two nullable pointers (`Doc::loop` and `Loop::doc2`) can’t be
by-value inclusions, either, because then they wouldn’t be nullable and the
struct would have infinite size.  It would probably be better to make access to
the nullable pointers different, but having to access `atoms` differently just
because of that would just be confusing.

Three notes before passing on:

* You could maybe argue that `unique_ptr` would be a better fit than
  `shared_ptr`, since these are parse *trees*, after all.

* It would probably also be more efficient to allocate all of this in a region,
  aka an APR pool or obstack, and drop the useless reference-counting overhead.
  Or maybe a copying garbage collector.

* You might think that it would be time-inefficient to, say, put the `Seq` 
struct
  physically inside the `Doc` struct, because it would imply copying the `Seq`
  struct.  But in fact `parse_Doc` could just pass a member pointer to
  `parse_Seq`, avoiding the copying, and in any case none of the structs would
  be very long.

* Probably it would be better to have parse tree nodes include begin
  and end pointers into the original input text, for the sake of error
  reporting.  This kind of thing can be added more easily in a
  record-oriented language like C++ than a tuple-oriented one like
  OCaml.

Non-inline semantic actions in, say, Ruby
-----------------------------------------

The traditional `yacc` approach is that, instead of automatically
building a concrete parse tree in memory that must later be rewritten
into an abstract syntax tree, the parser invokes semantic actions
directly, which can build the AST directly or carry out whatever other
actions are preferred (e.g. directly emitting machine code).  It’s
possible to do this without intermixing your code with the grammar.

One tricky part is backtracking: unlike an LR or LL parser, a PEG
parser might parse the same text with several different nonterminals,
discarding all but the last result, or even all the results.  So
semantic actions must be prepared for the possibility that their
results will be discarded.  (The linear-time guarantee of Packrat
parsing isn’t like the linear-time guarantee of LALR parsing: its
worst case is proportional to the *product* of the grammar size and
the text size, because in the worst case, it can parse every
nonterminal at every input position.  It avoids exponential-time by
only attempting to parse a particular nonterminal at a particular
input position *once*.)

This also means that Packrat actions need to be prepared for their
results to be reused in a context different from the one where they
were originally invoked.

You could wait until the end, when you know you won’t backtrack, to
invoke the semantic actions, which amounts to building a concrete
parse tree in memory and then applying a Visitor to it.

A straightforward way to handle these problems is to limit your
semantic actions to building and returning an AST node, and use
garbage collection or `shared_ptr` to ensure that the nodes get
cleaned up if they’re not needed.  In many cases, it would be nicer to
be able to, say, emit bytecode into a buffer, but I’m not sure what
the interface to that would look like.

So, with those limitations aside, here’s what it might look like, say,
in Ruby:

    class TemplateNodeBase # base class of Lit, Var, Loop, Alt, Concat
      def +(other); Concat.new(self, other); end
      def |(other); Alt.new(self, other); end
    end

    class TemplateParser < TemplateParserBase
      E = Lit.new("")       # The empty string

      on(:doc) { loop ? seq + loop : seq }
      on(:loop) { Loop.new(loopstart, doc, doc2 || E) }
      on(:seq) { alts[0...-1].map { |a| a | E }.inject(E, :+) + alts[-1] }
      on(:alt) { juxes.inject(:|) }
      on(:jux) { atoms.empty? ? E : atoms.inject(:+) }

      on(:lit) { Lit.new(bytes.join) }

      on(:var) { Var.new(name) }
      on(:name) { bytes.join }
    end

Some of the nonterminals here don’t have associated actions; if you
don’t define a semantic action for a given nonterminal, and each of
its alternatives has only a single nonterminal, it just returns the
value given by that nonterminal.  `atom` and `loopstart` are handled
that way, for example.

I’m abusing Ruby’s ability to define methods on the fly (or to handle
the calling of undefined methods) to avoid having to name formal
parameters for the child nodes being passed to each `on` action.  So
when the `doc` handler calls `loop`, that returns the result of the
earlier `loop` handler that ran in its child node, or `nil`, if the
loop nonterminal failed.

Presumably this interface could also provide start and end pointers.

(When I wrote the above and comparing it to the OCaml version, all
day, all night, I thought to myself, what the fuck?  I couldn't
believe what I was writing.  Why did I think OCaml was a good language
for writing compilers in?  So I IMed my friend Amit and said,
“OCaml esta muy loca.”  Perhaps the answer will be apparent
once I am debugging.)

(Thanks to Paul Visscher for his help in designing this interface.)

Misc.
-----

From my point of view it would be nice to not have to write type
declarations in the actual program, but I thought they would be a
useful way to document things here, particularly in the C++ case.
(Perhaps that suggests that they have utility in improving the
readability of actual programs, too, contrary to lesscode/Python/Ruby
doctrine; as Brooks said, “Show me your flowchart and conceal your
tables, and I shall continue to be mystified. Show me your tables, and
I won’t usually need your flowchart; it’ll be obvious.”)

It would be nice to somehow avoid constructing a list of pointers to
all the bytes in a `lit`.

The multiplicity of kinds of parse tree nodes in the OCaml interface
is error-prone.  I wrote some bugs where I was using a Name instead of
a byte list, for example, and it’s pretty annoying to need to invent a
different compilation function name for every nonterminal.
-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol

parsing, PEG parsers, and backtracking HTML templating: toward a bright parsing future (examples in OCaml, C++, and Ruby)

Reply via email to