On 1/18/19 6:00 PM, guile-user-requ...@gnu.org wrote:
>> For example, what is the difference
>> between PEG parsing and parser combinators?
>
> Both of them do top-down parsing (try to match the top-level grammar
> rule, which is done by trying to match the lower-level rules which make
> it up, and so on until we get down to single tokens). They differ in the
> amount of lookahead that's provided.
>
> Parser combinators are a facade over recursive-descent parsing, where
> you write a procedure corresponding to each rule in the parser, and you
> can only look ahead one token to guide parsing. This is known as LL(1)
> parsing. It's very easy to write these procedures, so you don't really
> need an engine like yacc to do the job. One character of lookahead isn't
> usually enough, so there is a lower-level lexer that uses (typically)
> regular expressions to aggregate things like numbers and identifiers
> into single parsing tokens.
>
> PEG parsing allows *infinite* lookahead, all the way to the end of the
> input. However, it memoizes the results of parsing to avoid reparsing
> repeatedly. To make this possible, alternatives are ordered: instead of
> rules like A | B, where the first token has to be enough to tell you if
> you have an A or a B, you have rules like A / B, which means "assume
> it's an A, and only if it fails, assume it's a B". If A and B overlap,
> non-PEG parsing has to treat A | B as a grammar ambiguity and recover,
> but A / B is never ambiguous.
Hi swedebugia,

Thanks for that explanation. I understand lookahead, but this
description seems to go against what I have seen with parser combinators
so far: I could have "or-connected" parsers where each of them requires
more than one character, or else the parser fails for the string. For
example, one expects "aa0", the other "aa1", and if neither is present,
the parsing fails. This suggests that a lookahead of more than one
character is possible with parser combinators.

In code, using Dave Thompson's parser-combinators library:

~~~
(load "parser-combinators.scm")

(use-modules (parser-combinators)
             (srfi srfi-41))

(define string->stream (compose list->stream string->list))

((parse-any (parse-string "aa0")
            (parse-string "aa1"))
 (string->stream "aa2"))
;; fails

((parse-any (parse-string "aa0")
            (parse-string "aa1"))
 (string->stream "aa1"))
;; succeeds

((parse-any (parse-each (parse-string "aa") (parse-string "0"))
            (parse-each (parse-string "aa") (parse-string "1")))
 (string->stream "aa1"))
;; succeeds
~~~

Instead of parsing a literal string, I could also have a complex parser
there, which parses all kinds of numbers, and the second alternative
could be something else, which itself contains an "or-connected" parser
inside. It seems that the abstract examples you wrote are already
possible with parser combinators then. I am probably misunderstanding
something. In the examples A and B overlap, so maybe the parser already
has to "recover" (decide which branch to take next)? But then again the
part after "aa" is not ambiguous, so it does not seem like a problem in
the example.

Maybe if I had an example like the following: the string "aaa" could
become the tokens "aa" and "a", or it could become "a" and "aa".
Basically A is "aa" and B is "a", and the parsing process has to decide
between the two. Would that be a case where in PEG parsing there is no
ambiguity, but in the parser-combinators approach there is?
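To make the "aaa" case concrete with the same combinators (a sketch; it
assumes, as the examples above suggest, that parse-any tries its
alternatives left to right and commits to the first one that succeeds):

~~~
;; A is "aa", B is "a"; both decompositions of "aaa" are possible.
;; With an ordered choice the first succeeding branch wins:
((parse-any (parse-each (parse-string "aa") (parse-string "a"))  ; A then B
            (parse-each (parse-string "a") (parse-string "aa"))) ; B then A
 (string->stream "aaa"))
;; succeeds via the first branch; a grammar with an unordered A | B
;; would have to treat this input as ambiguous
~~~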
I think the recovery here is that the first parser mentioned in the code
is the first one tried, and if it fails, the next one is tried. The
first one to succeed "wins". This seems to be a backtracking approach.
The Wikipedia article
(https://en.wikipedia.org/wiki/Recursive_descent_parser) says:

"Recursive descent with backtracking is a technique that determines
which production to use by trying each production in turn. Recursive
descent with backtracking is not limited to LL(k) grammars, but is not
guaranteed to terminate unless the grammar is LL(k). Even when they
terminate, parsers that use recursive descent with backtracking may
require exponential time."

So I guess the answer is: "Yes, I can handle these cases using parser
combinators; however, in the general case it may take exponential
time." -- which is usually bad.

I wonder how common ambiguities are in the languages we use. For
example, if I wanted to parse Markdown, would there be ambiguities that
are difficult to handle? I think there might be some around indentation,
for example a block quote inside a list inside a list, or things like
that. But I guess those could be handled by adding the assumption that
the indentation always belongs first to the list and only after that to
anything indented inside the list, like a block quote. One could argue
that the specification has a different assumption, or none at all (I did
not actually go and read it now), and that the language being parsed is
then slightly different. Maybe if one wanted 100% compliance, PEG
parsing might be necessary? Would that be such a case?

Regards,
Zelphir
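P.S.: I tried to make the exponential-time worst case concrete for
myself. This is only a sketch using the same library as above; whether
it actually blows up depends on how much backtracking parse-any does
when a later parser fails:

~~~
(load "parser-combinators.scm")

(use-modules (parser-combinators)
             (srfi srfi-41))

(define string->stream (compose list->stream string->list))

;; n nested alternatives whose branches match the same prefix, followed
;; by a parser that can never match the all-'a' input:
(define (pathological n)
  (if (zero? n)
      (parse-string "b")
      (parse-each (parse-any (parse-string "a") (parse-string "a"))
                  (pathological (- n 1)))))

;; This fails, since there is no "b" in the input. A fully backtracking
;; recursive-descent parser would explore on the order of 2^n branch
;; orderings before giving up, while a packrat/PEG parser memoizes
;; intermediate results and stays linear.
((pathological 20) (string->stream (make-string 20 #\a)))
~~~

By the way, Guile itself ships a PEG module, (ice-9 peg), so one could
compare the two approaches directly.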