On Fri, Apr 13, 2012 at 10:19:36PM -0400, Kragen Javier Sitaker wrote (on kragen-discuss): > No parser yet, but ...
(in <http://lists.canonical.org/pipermail/kragen-discuss/2012-April/001227.html>, referring to <http://lists.canonical.org/pipermail/kragen-tol/2012-April/000951.html>.) So this isn’t a parser either, but it’s some directions for parsing research to go in. A PEG for the backtracking HTML templating language --------------------------------------------------- Here’s a PEG grammar for the template language, leaving out the `<if a == b>` construct for now until I can figure out how `a` and `b` should be delimited. I wrote it down in my notebook waiting for the bus home on Saturday night: doc <- seq loop?. loop <- loopstart doc ('<,>' doc)?. seq <- alt , '<;>'. alt <- jux , '<|>'. jux <- atom*. atom <- '<{>' doc '<}> / var / lit. lit <- ('\\' byte / !special byte)+. special <- '<' [{},;|] '>' / var / loopstart. var <- '$' name. loopstart <- '<@' name '>'. name <- (!asciipunct !space byte)+. asciipunct <- [][\\'".,{|}^~<>!?():;&%@`#$+/*=-]. # no _ space <- [ \n\r\t]. Here, by `x , y`, I mean `x (y x)*`; this function of `,` in this PEG is very similar to the function of `<,>` in the template language. Also, `'\\'` means a single backslash, as usual. `?`, `*`, and `+` have their usual regexp meanings (optional, optional and repeated, and repeated, respectively) except that they only maintain one backtracking position. I think this grammar has the right precedence and associativity, and should avoid the exponential-time worst case of PEGs on all inputs, even with a naïve nonmemoizing (i.e. non-Packrat) implementation. At least the obvious cases of inefficiency in my first attempt have been removed. It’d be pretty awesome to have something that could take the above grammar and produce some kind of usable parse tree from it, without filthying up the host-language-independent grammar with semantic actions. That would allow you to write your target-language grammar once, and then reuse it in many different host languages and for applications with different requirements. Automatically building a parse tree in OCaml -------------------------------------------- Ideally the parse tree would even be statically typed: type doc = Doc of seq * loop option and loop = Loop of loopstart * doc * doc option and seq = Seq of alt list and alt = Alt of jux list and jux = Jux of atom list and atom = DocAtom of doc | VarAtom of var | LitAtom of lit and lit = Lit of byte list and var = Var of name and loopstart = Loopstart of name and name = Name of byte list and byte = char ;; The above does parse in OCaml. The rules I applied to get it: 1. Omit types for productions that only occur in negations, such as special, asciipunct, and space. 2. `?` applies "option" to everything inside of it. 3. Literal strings are omitted from the output. 4. Alternatives for a nonterminal are named by prepending the names of the contained nonterminals to the name of the nonterminal, plus some other undetermined hack if that’s insufficient. 5. `byte = char`. 6. Alternatives with the same set of nonterminals are merged. 7. `+`, `,`, and `*` apply "list" to everything they govern. This kind of thing is adequate for most of the PEGs I’ve written, and the others can be adapted to it relatively easily. Anyway, given that, you can reduce it to the `Alt, `Concat, `Var, `Lit, and `Loop used by the evaluator I posted at <http://ideone.com/TGqua> with some straightforward functions, which would look something like this: let (++) a b = `Concat(a, b) and (|+) a b = `Alt(a, b) and e = `Lit "" ;; (* ε: empty string *) (* Convert a list of bytes to a string. *) let string_of bytes = let buf = String.create(List.length bytes) and remaining = ref bytes in for i = 0 to (List.length bytes) - 1 do match !remaining with byte :: tail -> buf.[i] <- byte; remaining := tail | [] -> raise (Invalid_argument buf) done; buf ;; let rec cdoc = function | Doc(seq, None) -> cseq seq | Doc(seq, Some(loop)) -> cseq seq ++ cloop loop and cloop = function | Loop(Loopstart (Name name), body, None) -> `Loop(string_of name, cdoc body, e) | Loop(Loopstart (Name name), body, Some(comma)) -> `Loop(string_of name, cdoc body, cdoc comma) and cseq = function | Seq [] -> e | Seq [alt] -> calt alt | Seq (alt :: alts) -> (calt alt |+ e) ++ cseq (Seq alts) and calt = function | Alt [] -> e | Alt [jux] -> cjux jux | Alt (jux :: juxes) -> cjux jux |+ calt (Alt juxes) and cjux = function | Jux [] -> e | Jux [atom] -> catom atom | Jux (atom :: atoms) -> catom atom ++ cjux (Jux atoms) and catom = function | DocAtom doc -> cdoc doc | VarAtom (Var (Name name)) -> `Var(string_of name) | LitAtom (Lit text) -> `Lit(string_of text) ;; That parses and typechecks too. `string_of` probably exists in the library somewhere. All of the above seems like substantially more code than would be optimal, though. The cases for empty lists (guaranteed by the grammar not to exist in the `Alt` and `Seq` cases, but included to avoid OCaml warnings) are particularly annoying; they might go away with the use of polymorphic variants. Records with named fields instead of positional tuples: C++ ----------------------------------------------------------- In a language oriented to records rather than tuples, like C++, you’d end up with something like this instead: typedef char Byte; struct Doc; struct Name { vector< shared_ptr<Byte> > bytes; }; struct Loopstart { shared_ptr<Name> name; }; struct Var { shared_ptr<Name> name; }; struct Lit { vector< shared_ptr<Byte> > bytes; }; struct Atom { virtual ~Atom(); }; struct DocAtom: Atom { shared_ptr<Doc> doc; }; struct VarAtom: Atom { shared_ptr<Var> var; }; struct LitAtom: Atom { shared_ptr<Lit> lit; }; struct Jux { vector< shared_ptr<Atom> > atoms; }; struct Alt { vector< shared_ptr<Jux> > juxes; }; struct Seq { vector< shared_ptr<Alt> > alts; }; struct Loop { shared_ptr<Loopstart> loopstart; shared_ptr<Doc> doc; shared_ptr<Doc> doc2; }; struct Doc { shared_ptr<Seq> seq; shared_ptr<Loop> loop; }; This also compiles, at least if you’re `using namespace std;`, #including `<memory>` and `<vector>`, and compiling with `-std=c++0x` or Boost, but it’s also a lot more loosely typed. The difference between `doc` and `doc option` has become invisible; two of the fifteen pointers in the above are semantically nullable, but all of them are nullable. (You have to use RTTI with this approach, since there’s no other way to treat the different kinds of atoms differently.) I’ve pluralized, Rails-style, the child nonterminal names that refer to vectors. (Maybe I should do the same with nullables?) I chose to make all of the connections between the types be (smart) pointers. It would generally be more memory-efficient for them to be by value, so that `VarAtom` directly contained a `Var` which contained a `Name`, but not all of them can be; in particular, `Atom` has three variants implemented as derived structs, so references to it can’t be by value unless we turn it into a `union`, and the two nullable pointers (`Doc::loop` and `Loop::doc2`) can’t be by-value inclusions, either, because then they wouldn’t be nullable and the struct would have infinite size. It would probably be better to make access to the nullable pointers different, but having to access `atoms` differently just because of that would just be confusing. Three notes before passing on: * You could maybe argue that `unique_ptr` would be a better fit than `shared_ptr`, since these are parse *trees*, after all. * It would probably also be more efficient to allocate all of this in a region, aka an APR pool or obstack, and drop the useless reference-counting overhead. Or maybe a copying garbage collector. * You might think that it would be time-inefficient to, say, put the `Seq` struct physically inside the `Doc` struct, because it would imply copying the `Seq` struct. But in fact `parse_Doc` could just pass a member pointer to `parse_Seq`, avoiding the copying, and in any case none of the structs would be very long. * Probably it would be better to have parse tree nodes include begin and end pointers into the original input text, for the sake of error reporting. This kind of thing can be added more easily in a record-oriented language like C++ than a tuple-oriented one like OCaml. Non-inline semantic actions in, say, Ruby ----------------------------------------- The traditional `yacc` approach is that, instead of automatically building a concrete parse tree in memory that must later be rewritten into an abstract syntax tree, the parser invokes semantic actions directly, which can build the AST directly or carry out whatever other actions are preferred (e.g. directly emitting machine code). It’s possible to do this without intermixing your code with the grammar. One tricky part is backtracking: unlike an LR or LL parser, a PEG parser might parse the same text with several different nonterminals, discarding all but the last result, or even all the results. So semantic actions must be prepared for the possibility that their results will be discarded. (The linear-time guarantee of Packrat parsing isn’t like the linear-time guarantee of LALR parsing: its worst case is proportional to the *product* of the grammar size and the text size, because in the worst case, it can parse every nonterminal at every input position. It avoids exponential-time by only attempting to parse a particular nonterminal at a particular input position *once*.) This also means that Packrat actions need to be prepared for their results to be reused in a context different from the one where they were originally invoked. You could wait until the end, when you know you won’t backtrack, to invoke the semantic actions, which amounts to building a concrete parse tree in memory and then applying a Visitor to it. A straightforward way to handle these problems is to limit your semantic actions to building and returning an AST node, and use garbage collection or `shared_ptr` to ensure that the nodes get cleaned up if they’re not needed. In many cases, it would be nicer to be able to, say, emit bytecode into a buffer, but I’m not sure what the interface to that would look like. So, with those limitations aside, here’s what it might look like, say, in Ruby: class TemplateNodeBase # base class of Lit, Var, Loop, Alt, Concat def +(other); Concat.new(self, other); end def |(other); Alt.new(self, other); end end class TemplateParser < TemplateParserBase E = Lit.new("") # The empty string on(:doc) { loop ? seq + loop : seq } on(:loop) { Loop.new(loopstart, doc, doc2 || E) } on(:seq) { alts[0...-1].map { |a| a | E }.inject(E, :+) + alts[-1] } on(:alt) { juxes.inject(:|) } on(:jux) { atoms.empty? ? E : atoms.inject(:+) } on(:lit) { Lit.new(bytes.join) } on(:var) { Var.new(name) } on(:name) { bytes.join } end Some of the nonterminals here don’t have associated actions; if you don’t define a semantic action for a given nonterminal, and each of its alternatives has only a single nonterminal, it just returns the value given by that nonterminal. `atom` and `loopstart` are handled that way, for example. I’m abusing Ruby’s ability to define methods on the fly (or to handle the calling of undefined methods) to avoid having to name formal parameters for the child nodes being passed to each `on` action. So when the `doc` handler calls `loop`, that returns the result of the earlier `loop` handler that ran in its child node, or `nil`, if the loop nonterminal failed. Presumably this interface could also provide start and end pointers. (When I wrote the above and comparing it to the OCaml version, all day, all night, I thought to myself, what the fuck? I couldn't believe what I was writing. Why did I think OCaml was a good language for writing compilers in? So I IMed my friend Amit and said, “OCaml esta muy loca.” Perhaps the answer will be apparent once I am debugging.) (Thanks to Paul Visscher for his help in designing this interface.) Misc. ----- From my point of view it would be nice to not have to write type declarations in the actual program, but I thought they would be a useful way to document things here, particularly in the C++ case. (Perhaps that suggests that they have utility in improving the readability of actual programs, too, contrary to lesscode/Python/Ruby doctrine; as Brooks said, “Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowchart; it’ll be obvious.”) It would be nice to somehow avoid constructing a list of pointers to all the bytes in a `lit`. The multiplicity of kinds of parse tree nodes in the OCaml interface is error-prone. I wrote some bugs where I was using a Name instead of a byte list, for example, and it’s pretty annoying to need to invent a different compilation function name for every nonterminal. -- To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol