Re: Pegged, From EBNF to PEG
On Sat, Mar 17, 2012 at 15:48, Dmitry Olshansky wrote: >> PEG sequences don't backtrack. > > > I'd argue they do. As I see it as: > Expr <- As B C D / B C D > As <- A / A As That's what people doing Regex-to-PEG translations do, yes. But it's not the spontaneous behavior of A* B C D in PEG. But that means I could add a switch to transform the expressions that way. > (or use an epsilon production for As, is it allowed in pegged ?) I called it 'Eps', it's a predefined parser that always matches and consumes nothing. I used the greek epsilon at the beginning (ε) but thought that many people would shout at this :) > OK, back to C-T parsing, I have one crazy idea that I can't get away from - > add operator precedence grammar into the mix. From what I observe it should > integrate into PEG smoothly. Then it would make "military-grade" hybrid that > uses operator precedence parsing of expressions and the like. Few more > tricks and it may beat some > existing parser generators. > > See this post, where I tried to describe that idea early on: > http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=159562 I remember reading this. But I don't feel I'm up to it for now. > I might catch spare time to go about adding it myself, the only tricky thing > is to embed plain semantic actions, as AST generation would be more or less > the same. Cool!
Re: Pegged, From EBNF to PEG
On 17.03.2012 18:11, Philippe Sigaud wrote: On Sat, Mar 17, 2012 at 10:09, Dmitry Olshansky wrote: Ok, let's agree on fact that semantically a* is : As<- a As / a and a*? is this: As<- a / a As Now that's local ;) It's local, yes. But the pb is with Expr<- A* B C D, when D fails. PEG sequences don't backtrack. I'd argue they do. As I see it as: Expr <- As B C D / B C D As <- A / A As (or use an epsilon production for As, is it allowed in pegged ?) I remember trying compile-time backtracking in another similar library in D 1-2 years ago and getting lots of pb. I might add that in Pegged, but I don't know the consequences. How do you do that in std.regex? There are two fundamental ways to get around the problem (snip) Thanks for the explanations. Does std.regex implement this? It does the second way that I haven't told about, and had hard time to implement in C-T :) And the simple version of what I presented (i.e. stack stuff) is used in CT-regex and as fallback in R-T. The problem with doing a bitmap memoization is that regex often used to _search_ things in large inputs. Some compromise and/or thresholds have to be estimated and found. Parsing on the contrary guaranties that the whole input is used or in error, hence bitmap is not wasted. Now if we put aside all crap talk about mathematical purity and monads, and you name it, a packrat is essentially this, a dynamic programming trick applied to recursive top-down parsing. The trick could be extended even more to avoid extra checks on input characters, but the essence is this "memoization". I plan to store indices anyway. I might add this in the future. A read something on using PEGs to parse Java and C and there was no need to do a complete memoization: storing the last parse result as a temporary seemed to be enough. Well even straight no-memorization is somehow enough, it's just a way to ensure no extra work is done. C & Java are not much of backtracky grammars. (nice article btw, I learnt some about regexes) Thanks, I hope it makes them more popular. Might as well keep me busy fixing bugs :) As people said on Reddit, you should make more whooping sounds about CT-regex. That was what wowed me and pushed me to tackle CT-parsing again. It's just I'm also well aware of how much luck it (still) takes to fly ;) The asserts that compare C-T vs R-T go off too often for my taste, other then this the memory usage during compile is too high. I recall once dmd had GC re-enabled for brief period, dmd used no more then ~64Mb doing real crazy ct-regex stuff. OK, back to C-T parsing, I have one crazy idea that I can't get away from - add operator precedence grammar into the mix. From what I observe it should integrate into PEG smoothly. Then it would make "military-grade" hybrid that uses operator precedence parsing of expressions and the like. Few more tricks and it may beat some existing parser generators. See this post, where I tried to describe that idea early on: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=159562 I might catch spare time to go about adding it myself, the only tricky thing is to embed plain semantic actions, as AST generation would be more or less the same. -- Dmitry Olshansky
Re: Pegged, From EBNF to PEG
On Sat, Mar 17, 2012 at 10:09, Dmitry Olshansky wrote: > Ok, let's agree on fact that semantically > a* is : > As <- a As / a > > and a*? is this: > As <- a / a As > > Now that's local ;) It's local, yes. But the pb is with Expr <- A* B C D, when D fails. PEG sequences don't backtrack. >> I remember trying compile-time backtracking in another similar library >> in D 1-2 years ago and getting lots of pb. I might add that in Pegged, >> but I don't know the consequences. How do you do that in std.regex? > > > > There are two fundamental ways to get around the problem (snip) Thanks for the explanations. Does std.regex implement this? > Now if we put aside all crap talk about mathematical purity and monads, and > you name it, a packrat is essentially this, a dynamic programming trick > applied to recursive top-down parsing. The trick could be extended even more > to avoid extra checks on input characters, but the essence is this > "memoization". I plan to store indices anyway. I might add this in the future. A read something on using PEGs to parse Java and C and there was no need to do a complete memoization: storing the last parse result as a temporary seemed to be enough. >> (nice article btw, I learnt some about regexes) > > > Thanks, I hope it makes them more popular. > Might as well keep me busy fixing bugs :) As people said on Reddit, you should make more whooping sounds about CT-regex. That was what wowed me and pushed me to tackle CT-parsing again.
Re: Pegged, From EBNF to PEG
On 17.03.2012 10:59, Philippe Sigaud wrote: On Wed, Mar 14, 2012 at 10:48, Dmitry Olshansky wrote: That's one of the caveats on PEG. That and greedy operators. 'a*a' never succeeds because 'a*' consumes all the available a's. Hey, wait, I thought there has to be backtrack here, i.e. when second 'a' finally fails? PEG only do local backtracking in 'or' choices, not in sequences. I think that's because the original author was dealing with packrat parsing and its O(input-size) guarantee. Ok, let's agree on fact that semantically a* is : As <- a As / a and a*? is this: As <- a / a As Now that's local ;) I remember trying compile-time backtracking in another similar library in D 1-2 years ago and getting lots of pb. I might add that in Pegged, but I don't know the consequences. How do you do that in std.regex? There are two fundamental ways to get around the problem, the easiest to implement is to use a stack to record a position in input + number of alternative/production (I call it a PC - program counter) every time there is a "fork" like that. Then when the current path ultimately fails pop stack, reset input and go over again until you either match or empty the stack. That's how to avoid dp recursion that happens here. The problematic thing now is combinatorial explosion of a* a* a* //simplified, but you get the idea The trick to avoid this sort of crap is to 1) agree that whatever matches first has higher priority (left-most match) that's a simple disambiguation rule. 2) record what positions + pc you place on stack e.g. use a set, or in fact, a bitmap to push every unique combination (pc,index) only once. Voila! Now you have a linear worst-case algorithm with (M*N) complexity where M is number of all possible PCs, and N is number of all possible indexes in input. Now if we put aside all crap talk about mathematical purity and monads, and you name it, a packrat is essentially this, a dynamic programming trick applied to recursive top-down parsing. The trick could be extended even more to avoid extra checks on input characters, but the essence is this "memoization". (nice article btw, I learnt some about regexes) Thanks, I hope it makes them more popular. Might as well keep me busy fixing bugs :) -- Dmitry Olshansky
Re: Pegged, From EBNF to PEG
On Wed, Mar 14, 2012 at 10:48, Dmitry Olshansky wrote: >> That's one of the caveats on PEG. That and greedy operators. >> >> 'a*a' never succeeds because 'a*' consumes all the available a's. >> > > Hey, wait, I thought there has to be backtrack here, i.e. when second 'a' > finally fails? PEG only do local backtracking in 'or' choices, not in sequences. I think that's because the original author was dealing with packrat parsing and its O(input-size) guarantee. I remember trying compile-time backtracking in another similar library in D 1-2 years ago and getting lots of pb. I might add that in Pegged, but I don't know the consequences. How do you do that in std.regex? (nice article btw, I learnt some about regexes)
Re: Pegged, From EBNF to PEG
On 14.03.2012 2:49, Philippe Sigaud wrote: On Tue, Mar 13, 2012 at 17:17, Dmitry Olshansky wrote: PEG defines order of alternatives, that is pretty much like a top-down recursive descent parser would parse it. Alternatives are tried from left to right, if first one fails, it tries next and so on. Yes. In an example I give B is always picked first and so C is never ever looked at. That's one of the caveats on PEG. That and greedy operators. 'a*a' never succeeds because 'a*' consumes all the available a's. Hey, wait, I thought there has to be backtrack here, i.e. when second 'a' finally fails? Literal<- ("'" .+ "'") / ('"' .+ '"') This needs escaping. Plain '.+' in pattern asks for trouble 99% of time. This is an example were PEG would munch anything till the end, and fail (since """ is not found in an empty string). The standard PEG way to do that is (!EndMarker .)+ EndMarker It's a common enough pattern I tend to define a parameterized rule for that: Until(Expr)<- (!Expr .)* Expr Gotta love parametrized rules! 15' to code, a neverending stream of joy. Nice! In fact grammars are usually devised the other way around, e.g. Start: Program -> ... Ehm... what the whole program is exactly ? Ok, let it be Declaration* for now. What kind of declarations do we have? ... and so on. Latter grammars get tweaked and extended numerous times. At any rate production order has no effect on the grammar, it's still the same. The only thing of importance is what non-terminal considered final (or start if you are LL-centric). Yes. The PEG standard seem to be that the first rule is the start (root) rule. Pegged let you decide which rule you use for a start. The rest is automatic. I might change that. -- Dmitry Olshansky
Re: Pegged, From EBNF to PEG
On Mon, Mar 12, 2012 at 13:43, bls wrote: > Just WOW! Thanks! Don't be too excited, it's still quite slow as a parser. But that is a fun project :) > Nice to have on your WIKI would be a EBNF to PEG sheet. > > Wirth EBNF Pegged > A = BC. A <- B C > A = B|C. A <- C / C > A = [B]. A <- B? > A = {B}. A <- B* fact is, I don't know EBNF that much. I basically learned everything I know about parsing or grammars while coding Pegged in February :) I probably made every mistakes in the book. Hey, it's a github public wiki, I guess you can create a new page?
Re: Pegged, From EBNF to PEG
On Tue, Mar 13, 2012 at 17:17, Dmitry Olshansky wrote: > PEG defines order of alternatives, that is pretty much like a top-down > recursive descent parser would parse it. Alternatives are tried from left to > right, if first one fails, it tries next and so on. Yes. > In an example I give B is always picked first and so C is never ever looked > at. That's one of the caveats on PEG. That and greedy operators. 'a*a' never succeeds because 'a*' consumes all the available a's. >> Literal <- ("'" .+ "'") / ('"' .+ '"') >> > This needs escaping. Plain '.+' in pattern asks for trouble 99% of time. This is an example were PEG would munch anything till the end, and fail (since """ is not found in an empty string). The standard PEG way to do that is (!EndMarker .)+ EndMarker It's a common enough pattern I tend to define a parameterized rule for that: Until(Expr) <- (!Expr .)* Expr Gotta love parametrized rules! 15' to code, a neverending stream of joy. > In fact grammars are usually devised the other way around, e.g. > Start: > Program -> ... > Ehm... what the whole program is exactly ? Ok, let it be Declaration* for > now. What kind of declarations do we have? ... and so on. Latter grammars > get tweaked and extended numerous times. > > At any rate production order has no effect on the grammar, it's still the > same. The only thing of importance is what non-terminal considered final (or > start if you are LL-centric). Yes. The PEG standard seem to be that the first rule is the start (root) rule. Pegged let you decide which rule you use for a start. The rest is automatic. I might change that.
Re: Pegged, From EBNF to PEG
On Tue, Mar 13, 2012 at 18:05, Alex Rønne Petersen wrote: >>> lowerCase <- [a-z] >>> upperCase <- [A-Z] >>> Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)* >> >> >> Why not: >> Identifier <- [a-zA-Z]+ > > > That was an illustrative example from the Pegged docs. But yeah, you should > just use a range; reads nicer. The docs are for teaching PEG :) (btw, it's the docs describe C-like identifiers, that's why I chose a longer approach) It's always this 'tension', between inlining and refactoring. [a-zA-Z]+ is shorter and more readable. But If you decide to extend your grammar to UTF-32, it'd be easier to just change the 'letter' rule.
Re: Pegged, From EBNF to PEG
On 13-03-2012 17:17, Dmitry Olshansky wrote: On 12.03.2012 17:45, bls wrote: On 03/13/2012 04:28 AM, Dmitry Olshansky wrote: On 12.03.2012 16:43, bls wrote: On 03/10/2012 03:28 PM, Philippe Sigaud wrote: Hello, I created a new Github project, Pegged, a Parsing Expression Grammar (PEG) generator in D. https://github.com/PhilippeSigaud/Pegged docs: https://github.com/PhilippeSigaud/Pegged/wiki Just WOW! Nice to have on your WIKI would be a EBNF to PEG sheet. Wirth EBNF Pegged A = BC. A <- B C A = B|C. A <- C / C Maybe A <- B / C. And even then it's not exactly equivalent if the grammar was ambiguous. Imagine: B <- a, C <- aa PEG is pretty new to me. Can you elaborate a bit ? PEG defines order of alternatives, that is pretty much like a top-down recursive descent parser would parse it. Alternatives are tried from left to right, if first one fails, it tries next and so on. In an example I give B is always picked first and so C is never ever looked at. Somewhat less artificial example: Literal <- IntL| FloatL FloatL <- [0-9]+(.[0-9]+)? IntL <- [0-9]+ If you change it to: Literal <- FloatL| IntL then integer literals would get parsed as floating point. My mistake.. cleaned up stuff.. Pegged Wirth EBNF Sequence A <- B C A = BC. B or C A <- B / C A = B|C. Zero or one B A <- B? A = [B]. Zero or more Bs A <- B* A = {B}. One or more Bs A <- B+ Not available PEG description of EBNF EBNF <- Procuction+ Production <- Identifier '=' Expression '.' Expression <- Term ( '|' Term)* Term <- Factor Factor* Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression '}' / '(' Expression ')' lowerCase <- [a-z] upperCase <- [A-Z] Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)* Why not: Identifier <- [a-zA-Z]+ That was an illustrative example from the Pegged docs. But yeah, you should just use a range; reads nicer. Literal <- ("'" .+ "'") / ('"' .+ '"') This needs escaping. Plain '.+' in pattern asks for trouble 99% of time. Still not sure if this is correct. Especially : Term <- Factor Factor* Another thing I never really understand is the "production" order, In other words : Why not top down .. Start : lowerCase <- [a-z] upperCase <- [A-Z] Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)* End : EBNF <- Procuction+ where End is Root.. In fact grammars are usually devised the other way around, e.g. Start: Program -> ... Ehm... what the whole program is exactly ? Ok, let it be Declaration* for now. What kind of declarations do we have? ... and so on. Latter grammars get tweaked and extended numerous times. At any rate production order has no effect on the grammar, it's still the same. The only thing of importance is what non-terminal considered final (or start if you are LL-centric). TIA, Bjoern -- - Alex
Re: Pegged, From EBNF to PEG
On 12.03.2012 17:45, bls wrote: On 03/13/2012 04:28 AM, Dmitry Olshansky wrote: On 12.03.2012 16:43, bls wrote: On 03/10/2012 03:28 PM, Philippe Sigaud wrote: Hello, I created a new Github project, Pegged, a Parsing Expression Grammar (PEG) generator in D. https://github.com/PhilippeSigaud/Pegged docs: https://github.com/PhilippeSigaud/Pegged/wiki Just WOW! Nice to have on your WIKI would be a EBNF to PEG sheet. Wirth EBNF Pegged A = BC. A <- B C A = B|C. A <- C / C Maybe A <- B / C. And even then it's not exactly equivalent if the grammar was ambiguous. Imagine: B <- a, C <- aa PEG is pretty new to me. Can you elaborate a bit ? PEG defines order of alternatives, that is pretty much like a top-down recursive descent parser would parse it. Alternatives are tried from left to right, if first one fails, it tries next and so on. In an example I give B is always picked first and so C is never ever looked at. Somewhat less artificial example: Literal <- IntL| FloatL FloatL <- [0-9]+(.[0-9]+)? IntL <- [0-9]+ If you change it to: Literal <- FloatL| IntL then integer literals would get parsed as floating point. My mistake.. cleaned up stuff.. Pegged Wirth EBNF Sequence A <- B C A = BC. B or C A <- B / C A = B|C. Zero or one B A <- B? A = [B]. Zero or more Bs A <- B* A = {B}. One or more Bs A <- B+ Not available PEG description of EBNF EBNF <- Procuction+ Production <- Identifier '=' Expression '.' Expression <- Term ( '|' Term)* Term <- Factor Factor* Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression '}' / '(' Expression ')' lowerCase <- [a-z] upperCase <- [A-Z] Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)* Why not: Identifier <- [a-zA-Z]+ Literal <- ("'" .+ "'") / ('"' .+ '"') This needs escaping. Plain '.+' in pattern asks for trouble 99% of time. Still not sure if this is correct. Especially : Term <- Factor Factor* Another thing I never really understand is the "production" order, In other words : Why not top down .. Start : lowerCase <- [a-z] upperCase <- [A-Z] Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)* End : EBNF <- Procuction+ where End is Root.. In fact grammars are usually devised the other way around, e.g. Start: Program -> ... Ehm... what the whole program is exactly ? Ok, let it be Declaration* for now. What kind of declarations do we have? ... and so on. Latter grammars get tweaked and extended numerous times. At any rate production order has no effect on the grammar, it's still the same. The only thing of importance is what non-terminal considered final (or start if you are LL-centric). TIA, Bjoern -- Dmitry Olshansky
Re: Pegged, From EBNF to PEG
On 12-03-2012 13:43, bls wrote: On 03/10/2012 03:28 PM, Philippe Sigaud wrote: Hello, I created a new Github project, Pegged, a Parsing Expression Grammar (PEG) generator in D. https://github.com/PhilippeSigaud/Pegged docs: https://github.com/PhilippeSigaud/Pegged/wiki Just WOW! Nice to have on your WIKI would be a EBNF to PEG sheet. Wirth EBNF Pegged A = BC. A <- B C A = B|C. A <- C / C A = [B]. A <- B? A = {B}. A <- B* Having EBNF expressed in Pegged ... EBNF <- Procuction+ Production <- Identifier '=' Expression '.' Expression <- Term ( '|' Term)* Term <- Factor Factor* Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression }'/ '(' Expression ')' lowerCase <- [a-z] upperCase <- [A-Z] Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)* Literal <- ("'" .+ "'" / '"' .+ '"') Due to the fact that EBNF can be expressed in EBNF it should be possible to parse an arbitrary EBNF file and generate PEG output. What do you think ? BTW.. Is my EBNF PEG description correct ? TIA Bjoern The thing is, PEG cannot represent an ambiguous grammar. It is fully ordered, so it's just plain impossible. That's not to say you can't turn an EBNF grammar into PEG, but it won't necessarily be useful in all cases. -- - Alex
Re: Pegged, From EBNF to PEG
On 03/13/2012 04:28 AM, Dmitry Olshansky wrote: On 12.03.2012 16:43, bls wrote: On 03/10/2012 03:28 PM, Philippe Sigaud wrote: Hello, I created a new Github project, Pegged, a Parsing Expression Grammar (PEG) generator in D. https://github.com/PhilippeSigaud/Pegged docs: https://github.com/PhilippeSigaud/Pegged/wiki Just WOW! Nice to have on your WIKI would be a EBNF to PEG sheet. Wirth EBNF Pegged A = BC. A <- B C A = B|C. A <- C / C Maybe A <- B / C. And even then it's not exactly equivalent if the grammar was ambiguous. Imagine: B <- a, C <- aa PEG is pretty new to me. Can you elaborate a bit ? My mistake.. cleaned up stuff.. Pegged Wirth EBNF Sequence A <- B C A = BC. B or C A <- B / C A = B|C. Zero or one B A <- B? A = [B]. Zero or more Bs A <- B* A = {B}. One or more Bs A <- B+ Not available PEG description of EBNF EBNF <- Procuction+ Production <- Identifier '=' Expression '.' Expression <- Term ( '|' Term)* Term <- Factor Factor* Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression '}' / '(' Expression ')' lowerCase <- [a-z] upperCase <- [A-Z] Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)* Literal <- ("'" .+ "'") / ('"' .+ '"') Still not sure if this is correct. Especially : Term <- Factor Factor* Another thing I never really understand is the "production" order, In other words : Why not top down .. Start : lowerCase <- [a-z] upperCase <- [A-Z] Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)* End : EBNF <- Procuction+ where End is Root.. TIA, Bjoern
Re: Pegged, From EBNF to PEG
On 12.03.2012 16:43, bls wrote: On 03/10/2012 03:28 PM, Philippe Sigaud wrote: Hello, I created a new Github project, Pegged, a Parsing Expression Grammar (PEG) generator in D. https://github.com/PhilippeSigaud/Pegged docs: https://github.com/PhilippeSigaud/Pegged/wiki Just WOW! Nice to have on your WIKI would be a EBNF to PEG sheet. Wirth EBNF Pegged A = BC. A <- B C A = B|C. A <- C / C Maybe A <- B / C. And even then it's not exactly equivalent if the grammar was ambiguous. Imagine: B <- a, C <- aa -- Dmitry Olshansky
Re: Pegged, From EBNF to PEG
On 03/10/2012 03:28 PM, Philippe Sigaud wrote: Hello, I created a new Github project, Pegged, a Parsing Expression Grammar (PEG) generator in D. https://github.com/PhilippeSigaud/Pegged docs: https://github.com/PhilippeSigaud/Pegged/wiki Just WOW! Nice to have on your WIKI would be a EBNF to PEG sheet. Wirth EBNF Pegged A = BC. A <- B C A = B|C.A <- C / C A = [B].A <- B? A = {B}.A <- B* Having EBNF expressed in Pegged ... EBNF <- Procuction+ Production <- Identifier '=' Expression '.' Expression <- Term ( '|' Term)* Term <- Factor Factor* Factor <-Identifier / Literal / '[' Expression ']' / '{' Expression }'/ '(' Expression ')' lowerCase <- [a-z] upperCase <- [A-Z] Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)* Literal <- ("'" .+ "'" / '"' .+ '"') Due to the fact that EBNF can be expressed in EBNF it should be possible to parse an arbitrary EBNF file and generate PEG output. What do you think ? BTW.. Is my EBNF PEG description correct ? TIA Bjoern