Yes, our problems are somewhat different in that I don't tokenize
first. I haven't thought deeply about working with tokenized input,
but if tokens are the alphabet for the PEG, then there's a commitment
to use what the lexer provides; so when your first G rule is given an
AB token as input instead of A and B, it seems to be doing the right
thing. But, if I understood correctly, your system automatically
creates this regex-based lexer from a PEG grammar, which is why you
want the grammar's behavior over tokens to match its behavior over
characters. If so, it sounds like you could either construct the lexer
with enough context (e.g., lookarounds, or matching multiple tokens at
once) that it commits to the correct choice given the semantics of the
original grammar, or make the parser back off to character-level
scanning when there are potential conflicts between token choices.
Either way, it sounds like you'd need some static analysis, like
computing FIRST sets.
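As a rough illustration of the kind of FIRST-set analysis I mean, here
is a minimal sketch over a toy PEG encoding (the tuple representation
and the example grammar are invented for illustration, not taken from
either of our tools):

    def first(expr, grammar, seen=frozenset()):
        """Return (set of possible first characters, nullable?) for expr."""
        op = expr[0]
        if op == 'lit':                  # ('lit', 'ab')
            return ({expr[1][0]} if expr[1] else set()), expr[1] == ''
        if op == 'ref':                  # ('ref', 'Name')
            if expr[1] in seen:          # crude guard against (left) recursion
                return set(), False
            return first(grammar[expr[1]], grammar, seen | {expr[1]})
        if op == 'star':                 # ('star', e): e* is always nullable
            return first(expr[1], grammar, seen)[0], True
        if op == 'choice':               # ('choice', [e1, e2, ...])
            f, nullable = set(), False
            for e in expr[1]:
                fe, ne = first(e, grammar, seen)
                f, nullable = f | fe, nullable or ne
            return f, nullable
        if op == 'seq':                  # ('seq', [e1, e2, ...])
            f = set()
            for e in expr[1]:
                fe, ne = first(e, grammar, seen)
                f |= fe
                if not ne:               # a non-nullable item ends the prefix
                    return f, False
            return f, True
        raise ValueError(op)

    # G := 'a'* / 'ab' -- both alternatives can begin with 'a', so a
    # maximal-munch lexer cannot safely commit to a token here.
    g = {'G': ('choice', [('star', ('lit', 'a')), ('lit', 'ab')])}
    print([first(e, g)[0] for e in g['G'][1]])  # [{'a'}, {'a'}]

Overlapping FIRST sets between alternatives would flag exactly the
places where the lexer needs more context or the parser should fall
back to character-level scanning.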
I'm recalling a thread on this list from May 2019
(https://lists.csail.mit.edu/pipermail/peg/2019-May/thread.html). I
think the people who replied there would have more insightful things
to say about this.

In my case, 'pe' produces the following regex for G:

    (?=(?P<_2>(?=(?P<_1>(?:a)*))(?P=_1)|ab))(?P=_2)

...which should be equivalent to the following in PCRE:

    (?>(?>a*)|ab)

(The outer atomic group is only necessary when G is part of some
larger construction in the grammar.)

On Fri, Oct 1, 2021 at 11:20 AM Moss, Aaron <mo...@up.edu> wrote:

> Thanks, I'll need to take a look at those citations, but it's
> possible we're solving different problems.
>
> As an example, I have a "test ordered choice" PEG that's just this
> (it should not match "ab"):
>
>     G := 'a'* / 'ab'
>
> Of course, if you run the input through a tokenizer first, it does
> match "ab", whether the tokens are 'a'* and 'ab' or 'a' and 'ab'
> (because maximal munch takes the 'ab' token in both cases). The only
> way I could get the correct semantics was this (with tokens 'a' and
> 'b'):
>
>     G := 'a'* / 'a' 'b'
>
> I ran into some similar issues with XML and overlapping character
> classes. I don't remember the exact details offhand, but something
> like this gives the sense of it:
>
>     ALPHA    := [A-Za-z_]
>     DIGIT    := [0-9]
>     ALPHANUM := [A-Za-z_0-9]
>
> Even though ALPHANUM could be a regular expression, I needed to make
> it a parsing expression over disjoint character classes, `ALPHANUM :=
> ALPHA / DIGIT`, because otherwise expressions looking for an ALPHANUM
> might fail when they got an ALPHA token (or, depending on context, I
> might get an ALPHANUM token when I was trying to match an ALPHA).
>
> I'm considering modifying my tokenizer to return sets of
> possibly-matched tokens (rather than the single maximal munch), which
> seems like a better fit for PEG semantics, but I'm worried that would
> lose the runtime benefits of doing regular expression matching. I
> might also be able to do some sort of left-factoring in the parser
> generator based on FIRST sets, but that sounds like a pretty
> heavyweight change.
>
> --Aaron
>
> From: goodman....@gmail.com
> Sent: Friday, October 1, 2021 00:21
> To: Moss, Aaron <mo...@up.edu>
> Cc: peg@lists.csail.mit.edu
> Subject: Re: [PEG] pe: a new PEG implementation for Python
>
> Thanks!
>
> One clarification: the lookahead+backreference technique is actually
> just a way to emulate atomic matches (i.e., independent matches) in
> Python's re module, which doesn't have the feature natively. I think
> I found it on StackOverflow, maybe here:
> https://stackoverflow.com/a/13577411/1441112. If your regex
> implementation has atomic matches already, then you can just use
> those instead (the result will be more legible, at least!).
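For a concrete check of that emulation, here is what the regex for G
shown above does in Python's re, next to a naive translation without
the atomicity trick (the comparison is mine, easy to verify in a
REPL):

    import re

    G_atomic = r'(?=(?P<_2>(?=(?P<_1>(?:a)*))(?P=_1)|ab))(?P=_2)'  # pe's regex for G
    G_naive  = r'(?:a)*|ab'                                        # plain alternation

    # PEG semantics: 'a'* consumes the 'a' and the ordered choice
    # commits, so G matches only the prefix 'a', never all of "ab".
    print(re.match(G_atomic, 'ab').group())      # 'a'
    print(re.fullmatch(G_atomic, 'ab'))          # None

    # The naive regex backtracks into the choice and tries 'ab',
    # which a PEG's ordered choice would never do.
    print(re.fullmatch(G_naive, 'ab').group())   # 'ab'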
> Looking again, the only citations I found were for the other
> direction, converting regexes to PEGs, but the ideas are the same.
> Here are a few:
>
> * Oikawa et al. 2010:
>   http://www.inf.puc-rio.br/~roberto/docs/ry10-01.pdf
> * Medeiros et al. 2011:
>   http://www.lua.inf.puc-rio.br/publications/medeiros11regular.pdf
> * Medeiros et al. 2014:
>   https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.763.1773&rep=rep1&type=pdf
>
> I found the conversion from PEG to regex fairly straightforward, with
> three special cases:
>
> 1. Disabling backtracking with atomic matching (or similar) for
>    repetitions and ordered choices, as above.
>
> 2. Making the dot operator match everything, including newlines (this
>    may depend on the implementation; with Python's re I used (?s:.) ).
>
> 3. Translating character classes; the PEG and regex notations look
>    similar but have a few differences. E.g., [^-]] in PEG would be a
>    range between ^ and ] (if it were a valid range), whereas in regex
>    it's a negated character class over - followed by a literal ].
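A quick illustration of points 2 and 3 with Python's re (the examples
are mine):

    import re

    # Point 2: plain '.' stops at newlines; the scoped flag (?s:.) does not.
    print(re.fullmatch(r'.', '\n'))        # None
    print(re.fullmatch(r'(?s:.)', '\n'))   # matches

    # Point 3: in regex, [^-]] is the negated class [^-] followed by a
    # literal ']'; a PEG reader would instead see one class containing
    # the (invalid) range ^-].
    print(re.fullmatch(r'[^-]]', 'x]'))    # matches
    print(re.fullmatch(r'[^-]]', '-]'))    # None: '-' is excluded by [^-]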
> If you have some test cases for the bugs you're running into, I'd be
> happy to see them to check whether I'm doing the right thing.
>
> Best,
>
> On Thu, Sep 30, 2021 at 5:10 PM Moss, Aaron <mo...@up.edu> wrote:
>
> This looks really useful!
>
> I have a project going right now that uses regular expressions as a
> sub-matcher for PEGs (I modified a GLL parser generator to have PEG
> semantics, and it uses a regular-expression-based lexer), and I've
> been running into some bugs at the interface between the two, so if
> you do remember any of those citations I'd be interested to see them.
>
> Cheers,
> Aaron Moss
>
> From: PEG [mailto:peg-boun...@lists.csail.mit.edu] On Behalf Of
> goodman....@gmail.com
> Sent: Wednesday, September 29, 2021 23:26
> To: peg@lists.csail.mit.edu
> Subject: [PEG] pe: a new PEG implementation for Python
>
> Hello all,
>
> I'd like to announce the release of 'pe', a new PEG parser for
> Python. While Python has no shortage of PEG parsers already, I think
> pe has enough interesting features that it's worth mentioning here.
>
> https://github.com/goodmami/pe
>
> It's "new" in that I've never properly announced it anywhere, but
> I've been developing and using it for several years. Here's a
> summary:
>
> * Low learning curve for users: the API is closely modeled after
>   Python's re module, and the metasyntax is a superset of Ford's
>   original with only a few extensions.
>
> * Two options for parser engines: (1) a pure-Python memoizing
>   recursive-descent parser and (2) a compiled Cython (currently
>   non-memoizing) state-machine parser inspired by Medeiros and
>   Ierusalimschy, 2008, and by LPeg.
>
> * It's fast (by Python standards): aggressive grammar optimizations
>   and low overhead make it significantly faster than LALR (SLY, Lark)
>   and PEG (Parsimonious, PyParsing) parsers in several tests (todo:
>   more tests).
>
> I achieved the speed in the compiled Cython parser rather
> conventionally: static typing, reducing function calls, etc., along
> with some basic grammar optimizations and some state-machine
> optimizations from Medeiros and Ierusalimschy.
>
> In the pure-Python parser, I sped things up primarily by replacing
> large portions of the grammar with equivalent regular expressions,
> for which Python has a backend written in C. However, constantly
> switching between Python and C code adds overhead and can actually
> make things *slower*, so I put as much into a single regex as I can
> by first inlining everything in the grammar that isn't recursive and
> doesn't have a semantic effect on the expression (captures, etc.). In
> addition, to replicate PEG behavior in a regular expression, I
> avoided backtracking with a combination of lookaheads and
> backreferences (I think there's an academic citation for this
> technique, but I'm sorry I don't have it now). The resulting regexes
> (and optimized grammar) are utterly illegible, but they perform well.
> Finally, I'm not concerned with drastically changing the AST, because
> I don't produce ASTs: any grammar transformation that accepts the
> same strings and has the same semantic effect is fair game.
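A toy sketch of that inlining step (the {Name} rule-reference syntax
is invented for this example; it is not pe's actual code or grammar
notation): non-recursive, capture-free rules are substituted into
their call sites until the result can be handed to the C regex engine
in one call.

    import re

    def inline(grammar, start):
        """Expand {Name} references into a single regex string. A real
        implementation would also skip rules with semantic effects
        (captures, actions); here, recursion simply raises an error."""
        def expand(name, active):
            if name in active:
                raise ValueError(f'recursive rule, cannot inline: {name}')
            return re.sub(r'\{(\w+)\}',
                          lambda m: '(?:%s)' % expand(m.group(1), active | {name}),
                          grammar[name])
        return expand(start, frozenset())

    g = {
        'ALPHA':    '[A-Za-z_]',
        'DIGIT':    '[0-9]',
        'ALPHANUM': '{ALPHA}|{DIGIT}',
        'WORD':     '{ALPHA}{ALPHANUM}*',
    }
    print(inline(g, 'WORD'))  # (?:[A-Za-z_])(?:(?:[A-Za-z_])|(?:[0-9]))*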
> Thanks for reading. I welcome any comments or questions.
>
> --
> -Michael Wayne Goodman

--
-Michael Wayne Goodman