Thanks! One clarification: the lookahead+backreference technique is actually just a way to emulate atomic matches (i.e., independent matches) in Python's re module, which doesn't have the feature natively. I think I found that one on StackOverflow, maybe here: https://stackoverflow.com/a/13577411/1441112. If your regex implementation has atomic matches already, then you can just use those instead (the result will be more legible, at least!).
Looking again, the only citations I found were for the other direction, converting regexes to PEGs, but the ideas are the same. Here's a few: * Oikawa et al. 2010: http://www.inf.puc-rio.br/~roberto/docs/ry10-01.pdf * Medeiros et al. 2011: http://www.lua.inf.puc-rio.br/publications/medeiros11regular.pdf * Medeiros et al. 2014: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.763.1773&rep=rep1&type=pdf I found the conversion from PEG to regex fairly straightforward with three special cases: 1. Disabling backtracking with atomic matching (or similar) for repetitions and ordered choices, as above. 2. Making the dot operator match all, including newlines (may depend on implementation; with Python's re I used (?s:.) ). 3. Translating character classes; the PEG and regex ones look similar but have a few differences. E.g., [^-]] in PEG would be a range between ^ and ] (if it were a valid range), whereas in regex it's a negated character class over - followed by a literal ]. If you have some test cases for the bugs you're running into, I'd be happy to see them to check if I'm doing the right thing. Best, On Thu, Sep 30, 2021 at 5:10 PM Moss, Aaron <mo...@up.edu> wrote: > This looks really useful! > > > > I have a project going right now which uses regular expressions as a > sub-matcher for PEGs (I modified a GLL parser-generator to have PEG > semantics, and it uses a regular expression-based lexer), and I’ve been > running into some bugs with the interface between the two, so if you do > remember any of those citations I’d be interested to see them. > > > > Cheers, > Aaron Moss > > > > *From:* PEG [mailto:peg-boun...@lists.csail.mit.edu] * On Behalf Of * > goodman....@gmail.com > *Sent:* Wednesday, September 29, 2021 23:26 > *To:* peg@lists.csail.mit.edu > *Subject:* [PEG] pe: a new PEG implementation for Python > > > > *EXTERNAL EMAIL:* This email originated from outside the University of > Portland. Do not click links or open attachments unless you recognize the > sender and know the content is safe. > ------------------------------ > > Hello all, > > > > I'd like to announce the release of 'pe', a new PEG parser for Python. > While Python has no shortage of PEG parsers already, I think pe has enough > interesting features that it's worth mentioning here. > > > > https://github.com/goodmami/pe > <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fgoodmami%2Fpe&data=04%7C01%7Cmossa%40up.edu%7C4ffd36014b6b4c1da06208d983db3f44%7Cea8f3949231c40b6a33f56873af96f87%7C0%7C0%7C637685800729543305%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&sdata=Oj74pXG%2Fc6yjwibfTvhV9l7w10BQqLX4WFlndGt2JAc%3D&reserved=0> > > > > It's "new" in that I've never properly announced it anywhere, but I've > been developing and using it for several years. Here's a summary: > > > > * Low learning curve for users: the API is closely modeled after Python's > re module and the metasyntax is a superset of Ford's original with only a > few extensions > > > > * Two options for parser engines: (1) a pure-Python memoizing > recursive-descent parser and (2) a compiled Cython (currently > non-memoizing) state machine parser inspired by Medeiros and Ierusalimschy, > 2008 and LPeg > > > > * It's fast (by Python standards): aggressive grammar optimizations and > low overhead show it significantly faster than LALR (SLY, Lark) and PEG > (Parsimonious, PyParsing) parsers in several tests (todo: more tests) > > > > I achieved the speed in the compiled Cython parser rather conventionally: > static typing, reducing function calls, etc., along with some basic grammar > optimizations and some state machine optimizations from Medeiros and > Ierusalimschy. > > > > In the pure-Python parser, I sped things up primarily by replacing large > portions of the grammar with equivalent regular expressions, for which > Python has a backend written in C. However, constantly switching between > Python and C code adds overhead and can actually make things *slower*, so I > put as much into a single regex as I can by first inlining everything in > the grammar that isn't recursive and doesn't have a semantic effect on the > expression (captures, etc.). In addition, to replicate PEG behavior in a > regular expression, I avoided backtracking with a combination of lookaheads > and backreferences (I think there's an academic citation for this > technique, but I'm sorry I don't have it now). The resulting regexes (and > optimized grammar) are utterly illegible, but they perform well. Finally, > I'm not concerned with drastically changing the AST because I don't produce > ASTs: any grammar transformation that accepts the same strings and has the > same semantic effect is fair game. > > > > Thanks for reading. I welcome any comments or questions. > > > > -- > > -Michael Wayne Goodman > -- -Michael Wayne Goodman
_______________________________________________ PEG mailing list PEG@lists.csail.mit.edu https://lists.csail.mit.edu/mailman/listinfo/peg