On 11.03.2012 21:40, Sergiu Ivanov wrote:
I've never had any serious experience with operator precedence
parsing, but I have the intuition that this technique is going to be
quite unwieldy if we would like to go beyond simple expressions like
the ones you have shown.

It's a trade-off. The decision should be driven by the project goals, since no single parsing technique covers every need.

Here's a run-down of available techniques:

Operator-precedence:
Somewhat limited (but not as badly as many people think).
Extending the grammar requires the ability to grasp the concepts of precedence and associativity.
The parser can easily generate useful error messages.
Extremely easy to implement.
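To show just how easy, here is a minimal precedence-climbing parser, one common way to implement the operator-precedence idea (the token set and precedence table are made up for illustration, not anything from SymPy):

```python
# Minimal precedence-climbing parser for infix expressions -- a sketch
# of the operator-precedence family, not a production parser.
import re

PRECEDENCE = {'+': 1, '-': 1, '*': 2, '/': 2}   # higher binds tighter

def tokenize(text):
    return re.findall(r'\d+|[+\-*/()]', text)

def parse(tokens):
    tree, rest = parse_expr(tokens, 0)
    assert not rest, "trailing input: %r" % rest
    return tree

def parse_expr(tokens, min_prec):
    left, tokens = parse_atom(tokens)
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens[0]
        # left-associative: parse the right-hand side at one level tighter
        right, tokens = parse_expr(tokens[1:], PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left, tokens

def parse_atom(tokens):
    if tokens[0] == '(':
        tree, tokens = parse_expr(tokens[1:], 0)
        assert tokens and tokens[0] == ')', "expected ')'"
        return tree, tokens[1:]
    return int(tokens[0]), tokens[1:]

print(parse(tokenize("1 + 2 * 3")))   # ('+', 1, ('*', 2, 3))
```

Adding a new operator is one entry in the precedence table, which is why extending such a grammar only requires understanding precedence and associativity.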

LL(k):
Surprisingly limited: it cannot parse any input that starts with k or more left parentheses; without parentheses, it cannot parse any grammar that has more than k precedence levels with infix or postfix operators.
Extending the grammar requires the ability to compute the sets of prefixes, which is moderately difficult and mostly doable in one's head.
The parser has a hard time generating useful error messages.
Implementation is quite complex.

SLR:
Only marginally less limited than operator-precedence.
Otherwise, the same advantages and disadvantages as LR(k) hold.

LALR(k):
Almost as powerful as LR(k+1).
Generating error messages is somewhat more difficult than with LR(k).
Otherwise, the same advantages and disadvantages as LR(k) hold.

LR(k):
This is the most powerful widely-known algorithm for which ambiguities can be detected by inspecting the grammar.
Extending the grammar requires intimate knowledge of the parsing algorithm; otherwise, you'll be unable to interpret the "shift-reduce" and "reduce-reduce" conflict error messages.
Generating useful error messages is moderately difficult.
Implementation is complex.

Earley:
Can handle arbitrary context-free grammars. The downside is that there is no way to check whether a grammar is ambiguous; however, the algorithm will detect whether a given input is ambiguous.
Extending the grammar is trivial; checking it for ambiguity requires an arsenal of techniques (in practice, most teams just hope there is no problem and tweak the grammar until practical usage no longer throws ambiguity errors).
I don't know how hard generating useful error messages is. I suspect it's hard or impossible, but I'd have to ask a search engine to verify that.
Implementation is almost as complex as for LR(k): the parsing algorithm itself is more complex, but you do not need to preprocess the grammar.
The algorithm is worst-case polynomial in input length: quadratic if the input is unambiguous, cubic if it is ambiguous. This is not a serious issue for small inputs. There are variants with better worst-case complexity, but they are more complex to implement.
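To make the "implementation is almost as complex" claim concrete, here is a sketch of an Earley recognizer (recognition only, no parse-tree construction; the toy grammar is made up for illustration):

```python
# Minimal Earley recognizer -- a sketch, not production code.
# grammar: nonterminal -> list of productions (tuples of symbols);
# any symbol that is not a key of the grammar is treated as a terminal.
def earley_recognize(grammar, start, tokens):
    # an item is (head, body, dot, origin)
    chart = [set() for _ in range(len(tokens) + 1)]
    for body in grammar[start]:
        chart[0].add((start, body, 0, 0))
    for i in range(len(tokens) + 1):
        changed = True
        while changed:                       # run predict/scan/complete to a fixpoint
            changed = False
            for head, body, dot, origin in list(chart[i]):
                if dot < len(body):
                    sym = body[dot]
                    if sym in grammar:       # predict
                        for prod in grammar[sym]:
                            item = (sym, prod, 0, i)
                            if item not in chart[i]:
                                chart[i].add(item)
                                changed = True
                    elif i < len(tokens) and tokens[i] == sym:   # scan
                        chart[i + 1].add((head, body, dot + 1, origin))
                else:                        # complete
                    for h2, b2, d2, o2 in list(chart[origin]):
                        if d2 < len(b2) and b2[d2] == head:
                            item = (h2, b2, d2 + 1, o2)
                            if item not in chart[i]:
                                chart[i].add(item)
                                changed = True
    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in chart[len(tokens)])

# E -> E + E | n  -- deliberately ambiguous; Earley does not mind
grammar = {'E': [('E', '+', 'E'), ('n',)]}
print(earley_recognize(grammar, 'E', ['n', '+', 'n']))   # True
print(earley_recognize(grammar, 'E', ['n', '+']))        # False
```

Note there is no grammar preprocessing at all: the grammar dictionary is consumed as-is, which is exactly why extending the grammar is trivial here.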

However, I'd recommend using a parser generator instead of writing a
parser from scratch yourself.  It will introduce some slowdown, but
will instead make the whole thing a lot easier to write, maintain and,
more importantly, to extend.

A parser generator is required for the LL and LR variants.
The problem with these is that they usually come with their own grammar syntax, so using them requires learning new skills.

 I'm not sure about the state of parser
generators for Python, but this page
http://wiki.python.org/moin/LanguageParsing may provide some
information.

A parser that's embedded into Python as a DSL would be an option.
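To illustrate what "embedded as a DSL" means, here is a tiny parser-combinator sketch; the combinator names and the toy grammar are made up for illustration, not taken from any existing Python library:

```python
# A parser "embedded into Python as a DSL": the grammar is plain Python
# expressions built from combinators, so no separate grammar syntax is needed.
def lit(s):
    def p(toks):
        if toks and toks[0] == s:
            return s, toks[1:]
        return None                 # no match
    return p

def seq(*parsers):
    def p(toks):
        out = []
        for q in parsers:
            r = q(toks)
            if r is None:
                return None
            val, toks = r
            out.append(val)
        return out, toks
    return p

def alt(*parsers):
    def p(toks):
        for q in parsers:
            r = q(toks)
            if r is not None:
                return r
        return None
    return p

number = lambda toks: (toks[0], toks[1:]) if toks and toks[0].isdigit() else None
expr = alt(seq(number, lit('+'), number), number)   # the grammar, in Python

print(expr(['1', '+', '2']))   # (['1', '+', '2'], [])
```

The appeal is that extending the grammar is just writing more Python; the cost is that such naive backtracking combinators inherit the limitations (and worst cases) of recursive descent.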

I can see a different problem here though: choosing a parsing method
and producing the grammar of the language (or what information the
parser would require) may not be enough to get the desired
functionality.  What we are trying to get at should incorporate
something similar to natural language processing.  Therefore, before
actually starting the parsing process, I think the input should be
brought to some kind of canonical form.  I am strongly inclined to
believe that Wolfram|Alpha takes exactly this way of doing things (but
obviously with a lot more details).

The first step to transformation would be a parse, so I'm not really sure what point you're making here.

One simple thing I can think of is detecting synonyms.  For example,
"roots x**2 == 1" and "solutions x**2 == 1" and "solve x**2 == 1" are
one and the same thing.  Therefore, it may be useful to start with
substituting each of these synonyms with some canonical word ("solve",
for example).

This is stuff that is easiest to do *after* a parse.
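As a sketch of why this is easy after a parse: with a tree in hand, synonym substitution is a single table lookup on head symbols (the tree shape and synonym table here are made up for illustration):

```python
# Canonicalize synonymous command words on an already-built parse tree.
# A tree node is (head, *children); leaves are plain strings.
SYNONYMS = {'roots': 'solve', 'solutions': 'solve',
            'antiderivative': 'integrate'}

def canonicalize(tree):
    if isinstance(tree, tuple):
        head, *args = tree
        head = SYNONYMS.get(head, head)            # substitute the canonical word
        return (head,) + tuple(canonicalize(a) for a in args)
    return tree

print(canonicalize(('roots', ('Eq', 'x**2', '1'))))
# ('solve', ('Eq', 'x**2', '1'))
```

Doing the same substitution on the raw character stream would instead require knowing which occurrences of "roots" are command words and which are, say, variable names -- which is precisely what the parse tells you.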

It is possible to argue that this can be solved by
introducing additional rules into the grammar.

It's easier to build alternate definitions into the semantic processing that happens after the parse.

However, I am inclined
to believe that this will make the grammar unnecessarily large and,
even worse, slow down the parser.

The parsing algorithms enumerated above do not care much about whether each function and operator name gets its own grammar rule, or whether there is one rule for function calls and one rule for all operators. Initializing the tables requires time roughly linear in the number of grammar symbols, and that's all.

Having a larger grammar isn't much of a maintenance problem either, as long as each rule follows a pattern from a small set of standard patterns. What causes trouble is having many different patterns. Rules interact, and the interactions become more complex when there are many rule patterns: you get "conflicts", which are generally hard to resolve, and adding a new rule can create a conflict in a faraway rule (unless you're using operator precedence, where the number of patterns is restricted, or Earley, where there is no checking on the grammar).

Another simple thing is detecting spelling errors.  Suppose I type
"intgrate" or any of the other (numerous) wrong possibilities.  I
think it would be very nice to have such mistakes detected and
corrected automatically.  This http://packages.python.org/pyenchant/
seems on topic.

Spell checking, again, is something that is easily done on the parse tree after it has been created. Just make sure that the parse tree contains all the information needed to generate a good message: Each symbol should carry not just the text but also originating line and column number.
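A sketch of this, using the standard library's fuzzy matching instead of an external package (the list of known names and the token format are made up for illustration):

```python
# Correct misspelled function names on the token stream, keeping
# line/column info so error messages can point at the source.
import difflib

KNOWN_NAMES = ['solve', 'integrate', 'diff', 'limit', 'expand', 'factor']

def correct(word, cutoff=0.8):
    # returns the closest known name, or the word unchanged if nothing is close
    matches = difflib.get_close_matches(word, KNOWN_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# each token carries (text, line, column), as suggested above
tokens = [('intgrate', 1, 0), ('x**2', 1, 9)]
fixed = [(correct(text), line, col) for text, line, col in tokens]
print(fixed)   # [('integrate', 1, 0), ('x**2', 1, 9)]
```

Because the positions survive the correction, a later message can say exactly which word at which column was reinterpreted, which feeds directly into the "show the substitutions" idea below.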

I'd also recommend showing the substitutions and the resulting
canonical form.  In this way the user will be able to see how their
input was interpreted and, maybe, change their formulation to get
nearer to what they wanted.

+1.

The list of actions the preprocessor should perform can be extended
arbitrarily, I guess.

Yes.

For example, it could try to fix wrongly
balanced parentheses.

That one would be very hard to do well. There is too little redundancy in the input to guess the proper place for a parenthesis. For an example of how it doesn't work, try entering unbalanced parentheses in Excel.

It might also try to cope with a different
order of keywords in the string, like in "sin(x) integral".  It would
be nice to parse single-argument functions written without parentheses
("sin x" instead of "sin(x)").

That one can be done at the grammar level: just make a function name bind tighter than any operator (this is typically how functional programming languages define their grammars).
Then,
  sin x * y
will parse as
  (sin x) * y

Of course, this is not Python's grammar anymore. (It does follow mathematical convention more closely though.)
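A sketch of that grammar choice in code: in the atom rule, a known function name consumes the next *atom* rather than the next expression, so application outbinds every operator (the set of function names is an assumption for illustration):

```python
# Sketch: function application binds tighter than any infix operator,
# so "sin x * y" parses as (sin x) * y.
import re

FUNCTIONS = {'sin', 'cos', 'exp'}               # assumed set of known names
PRECEDENCE = {'+': 1, '-': 1, '*': 2, '/': 2}

def tokenize(text):
    return re.findall(r'[A-Za-z_]\w*|\d+|[+\-*/()]', text)

def parse_expr(tokens, min_prec=0):
    left, tokens = parse_atom(tokens)
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens[0]
        right, tokens = parse_expr(tokens[1:], PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left, tokens

def parse_atom(tokens):
    if tokens[0] == '(':
        tree, tokens = parse_expr(tokens[1:])
        return tree, tokens[1:]                 # drop the ')'
    if tokens[0] in FUNCTIONS:
        # the argument is the next ATOM, not the next expression --
        # this is what makes application outbind the operators
        arg, rest = parse_atom(tokens[1:])
        return (tokens[0], arg), rest
    return tokens[0], tokens[1:]

tree, _ = parse_expr(tokenize("sin x * y"))
print(tree)   # ('*', ('sin', 'x'), 'y')
```

Write `sin (x * y)` to get the other grouping, exactly as in the functional-language convention mentioned above.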

The preprocessor could also drop
incomprehensible (and thus supposedly meaningless) words, like in
"find the integral of x^2".

Experience with this kind of approach is that it tends to reduce predictability. In this case, the user might consider a word meaningless while the program has some obscure definition for it, and it suddenly spits out error messages referring to stuff the user doesn't know about - say, if there is an "of" function somewhere in a category-theoretic module.
