Thanks; I'll need to take a look at those citations, but it's possible we're 
solving different problems.

As an example, I have a "test ordered choice" PEG that's just this (should not 
match "ab"):

               G := 'a'* / 'ab'

Of course, if you run the input through a tokenizer first, it does match "ab", 
whether the tokens are 'a'* and 'ab' or 'a' and 'ab' (because maximal munch 
takes the 'ab' token in both cases). The only way I could get the correct 
semantics was this (with tokens 'a' and 'b'):

               G := 'a'* / 'a' 'b'
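At the character level the failure is easy to see; here's a minimal Python 
sketch (the function name is mine, purely illustrative) of the ordered-choice 
semantics:

```python
# G := 'a'* / 'ab' at the character level: 'a'* always succeeds, so the
# ordered choice commits to it and the 'ab' alternative is never tried.

def match_g(s):
    """Return the number of characters G consumes from the start of s."""
    # First alternative: 'a'* consumes as many 'a's as possible and
    # never fails, so the second alternative ('ab') is unreachable.
    i = 0
    while i < len(s) and s[i] == 'a':
        i += 1
    return i

# G consumes only the leading 'a' of "ab", so a full match of "ab" fails --
# exactly the behavior a maximal-munch tokenizer destroys.
assert match_g("ab") == 1
assert match_g("aa") == 2
```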

I ran into some similar issues with XML and overlapping character classes. I 
don't remember the exact details offhand, but something like this conveys the 
sense of it:

               ALPHA := [A-Za-z_]
               DIGIT := [0-9]
               ALPHANUM := [A-Za-z_0-9]

Even though ALPHANUM could be a regular expression, I needed to make it a 
parsing expression over disjoint character classes `ALPHANUM := ALPHA / DIGIT` 
because otherwise expressions looking for an ALPHANUM might fail when they got 
an ALPHA token (or, depending on context, I might get an ALPHANUM token when I 
was trying to match an ALPHA).
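For what it's worth, since the alternatives are disjoint, the choice form and 
the single class accept exactly the same characters; the difference only shows 
up once a tokenizer gets involved. A quick check with Python's re (rule names 
as above):

```python
import re

ALPHA = r'[A-Za-z_]'
DIGIT = r'[0-9]'
ALPHANUM_CLASS = r'[A-Za-z_0-9]'           # single character class
ALPHANUM_CHOICE = rf'(?:{ALPHA}|{DIGIT})'  # choice over disjoint classes

# With disjoint alternatives, PEG ordered choice and regex alternation
# agree, and both accept the same characters as the single class.
for ch in map(chr, range(128)):
    assert bool(re.fullmatch(ALPHANUM_CLASS, ch)) == \
           bool(re.fullmatch(ALPHANUM_CHOICE, ch))
```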

I'm considering modifying my tokenizer to return the set of possibly-matched 
tokens (rather than the single maximal-munch match), which seems like a better 
fit for PEG semantics, but I'm worried that it would lose the runtime benefits 
of regular-expression matching. I might also be able to do some sort of 
left-factoring in the parser generator based on FIRST sets, but that sounds 
like a pretty heavyweight change.
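A sketch of what that set-returning tokenizer might look like (the token names 
and the tokens_at helper are hypothetical, not from my actual implementation):

```python
import re

# Hypothetical token definitions; return every token that matches at
# position i, rather than only the maximal munch.
TOKENS = {
    'A':  re.compile(r'a'),
    'AB': re.compile(r'ab'),
}

def tokens_at(s, i):
    """All (name, end) pairs for tokens matching s at position i."""
    out = []
    for name, pat in TOKENS.items():
        m = pat.match(s, i)  # anchored match starting at position i
        if m:
            out.append((name, m.end()))
    return out

# At position 0 of "ab", both 'a' and 'ab' match, so a PEG parser could
# choose between them instead of being forced into the longest token.
assert sorted(tokens_at("ab", 0)) == [('A', 1), ('AB', 2)]
```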

--Aaron

From: goodman....@gmail.com [mailto:goodman....@gmail.com]
Sent: Friday, October 1, 2021 00:21
To: Moss, Aaron <mo...@up.edu>
Cc: peg@lists.csail.mit.edu
Subject: Re: [PEG] pe: a new PEG implementation for Python

EXTERNAL EMAIL: This email originated from outside the University of Portland. 
Do not click links or open attachments unless you recognize the sender and know 
the content is safe.
________________________________
Thanks!

One clarification: the lookahead+backreference technique is actually just a way 
to emulate atomic matches (i.e., independent matches) in Python's re module, 
which doesn't have the feature natively. I think I found that one on 
StackOverflow, maybe here:
https://stackoverflow.com/a/13577411/1441112.
 If your regex implementation has atomic matches already, then you can just use 
those instead (the result will be more legible, at least!).
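For anyone following along, the technique looks like this in Python's re (the 
atomic helper is just illustrative):

```python
import re

# Emulating an atomic match in Python's re, which lacks (?>...) natively
# in older versions: a lookahead captures the match, then a backreference
# consumes it. Once a lookahead succeeds, the engine never backtracks
# into it, so the captured text is effectively possessive.
def atomic(pattern):
    # assumes the pattern contains no capture groups of its own
    return rf'(?=({pattern}))\1'

# Ordinary backtracking: a+ gives back one 'a' so the trailing 'ab' fits.
assert re.fullmatch(r'a+ab', 'aaab') is not None

# Atomic version: a+ keeps all three 'a's, so 'ab' cannot match.
assert re.fullmatch(atomic(r'a+') + r'ab', 'aaab') is None
```

Note that CPython 3.11 added native atomic groups ((?>...)) and possessive 
quantifiers, so on recent Pythons the emulation is no longer necessary.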

Looking again, the only citations I found were for the other direction, 
converting regexes to PEGs, but the ideas are the same. Here are a few:

* Oikawa et al. 2010: 
http://www.inf.puc-rio.br/~roberto/docs/ry10-01.pdf
* Medeiros et al. 2011: 
http://www.lua.inf.puc-rio.br/publications/medeiros11regular.pdf
* Medeiros et al. 2014: 
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.763.1773&rep=rep1&type=pdf

I found the conversion from PEG to regex fairly straightforward with three 
special cases:

1. Disabling backtracking with atomic matching (or similar) for repetitions and 
ordered choices, as above.
2. Making the dot operator match any character, including newlines (this may 
depend on the implementation; with Python's re I used (?s:.) ).
3. Translating character classes: the PEG and regex forms look similar but have 
a few differences. E.g., [^-]] in PEG would be a range between ^ and ] (if that 
were a valid range), whereas in regex it's a negated character class over - 
followed by a literal ].
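Points 2 and 3 are easy to check in Python's re (these are my own examples, 
not from pe's test suite):

```python
import re

# Point 2: regex '.' stops at newline unless DOTALL applies;
# (?s:.) scopes the DOTALL flag to just that one dot.
assert re.fullmatch(r'.', '\n') is None
assert re.fullmatch(r'(?s:.)', '\n') is not None

# Point 3: '-' and ']' need escaping (or careful positioning) when
# translating a PEG class into a regex class, e.g. a class that should
# match either '-' or ']':
assert re.fullmatch(r'[\-\]]', '-') is not None
assert re.fullmatch(r'[\-\]]', ']') is not None
```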

If you have some test cases for the bugs you're running into, I'd be happy to 
see them to check if I'm doing the right thing.

Best,

On Thu, Sep 30, 2021 at 5:10 PM Moss, Aaron <mo...@up.edu> wrote:
This looks really useful!

I have a project going right now which uses regular expressions as a 
sub-matcher for PEGs (I modified a GLL parser-generator to have PEG semantics, 
and it uses a regular expression-based lexer), and I've been running into some 
bugs with the interface between the two, so if you do remember any of those 
citations I'd be interested to see them.

Cheers,
Aaron Moss

From: PEG [mailto:peg-boun...@lists.csail.mit.edu] On Behalf Of 
goodman....@gmail.com
Sent: Wednesday, September 29, 2021 23:26
To: peg@lists.csail.mit.edu
Subject: [PEG] pe: a new PEG implementation for Python

Hello all,

I'd like to announce the release of 'pe', a new PEG parser for Python. While 
Python has no shortage of PEG parsers already, I think pe has enough 
interesting features that it's worth mentioning here.

https://github.com/goodmami/pe

It's "new" in that I've never properly announced it anywhere, but I've been 
developing and using it for several years. Here's a summary:

* Low learning curve for users: the API is closely modeled after Python's re 
module and the metasyntax is a superset of Ford's original with only a few 
extensions

* Two options for parser engines: (1) a pure-Python memoizing recursive-descent 
parser and (2) a compiled Cython (currently non-memoizing) state machine parser 
inspired by Medeiros and Ierusalimschy, 2008 and LPeg

* It's fast (by Python standards): aggressive grammar optimizations and low 
overhead make it significantly faster than LALR (SLY, Lark) and PEG 
(Parsimonious, PyParsing) parsers in several tests (todo: more tests)

I achieved the speed in the compiled Cython parser rather conventionally: 
static typing, reducing function calls, etc., along with some basic grammar 
optimizations and some state machine optimizations from Medeiros and 
Ierusalimschy.

In the pure-Python parser, I sped things up primarily by replacing large 
portions of the grammar with equivalent regular expressions, for which Python 
has a backend written in C. However, constantly switching between Python and C 
code adds overhead and can actually make things *slower*, so I put as much into 
a single regex as I can by first inlining everything in the grammar that isn't 
recursive and doesn't have a semantic effect on the expression (captures, 
etc.). In addition, to replicate PEG behavior in a regular expression, I 
avoided backtracking with a combination of lookaheads and backreferences (I 
think there's an academic citation for this technique, but I'm sorry I don't 
have it now). The resulting regexes (and optimized grammar) are utterly 
illegible, but they perform well. Finally, I'm not concerned with drastically 
changing the AST because I don't produce ASTs: any grammar transformation that 
accepts the same strings and has the same semantic effect is fair game.
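As a rough illustration of the inlining step (this is a sketch of the idea, 
not pe's actual implementation, and the rule names are made up), non-recursive, 
capture-free rules can be substituted into their call sites until a subgrammar 
collapses into a single regex:

```python
import re

# Non-recursive, capture-free rules, with {Name} marking references to
# other rules. (The sketch assumes the rule bodies contain no brace
# quantifiers like a{2,3}, which would confuse str.format.)
RULES = {
    'ALPHA':    r'[A-Za-z_]',
    'DIGIT':    r'[0-9]',
    'ALPHANUM': r'(?:{ALPHA}|{DIGIT})',
    'NAME':     r'(?:{ALPHA}{ALPHANUM}*)',
}

def inline(name):
    """Substitute rule references until one self-contained regex remains."""
    pattern = RULES[name]
    while '{' in pattern:
        pattern = pattern.format(**RULES)
    return pattern

# The whole NAME subgrammar becomes one regex, matched entirely in C.
NAME = re.compile(inline('NAME'))
assert NAME.fullmatch('foo_123') is not None
assert NAME.fullmatch('1foo') is None
```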

Thanks for reading. I welcome any comments or questions.

--
-Michael Wayne Goodman


--
-Michael Wayne Goodman
_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg