On 11.03.2012 21:40, Sergiu Ivanov wrote:
I've never had any serious experience with operator precedence
parsing, but I have the intuition that this technique is going to be
quite unwieldy if we would like to go beyond simple expressions like
the ones you have shown.

It's a trade-off. The decision should be driven by the project goals, since no single parsing technique covers every need.

Here's a run-down of available techniques:

Operator-precedence:
Somewhat limited (but not as badly as many people think).
Extending the grammar requires the ability to grasp the concepts of precedence and associativity.
The parser can easily generate useful error messages.
Extremely easy to implement.
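To show just how easy, here is a minimal precedence-climbing parser, one common way to implement the operator-precedence idea (the token set and precedence table are made up for illustration, not anything from SymPy):

```python
# Minimal precedence-climbing parser for infix expressions -- a sketch
# of the operator-precedence family, not a production parser.
import re

PRECEDENCE = {'+': 1, '-': 1, '*': 2, '/': 2}   # higher binds tighter

def tokenize(text):
    return re.findall(r'\d+|[+\-*/()]', text)

def parse(tokens):
    tree, rest = parse_expr(tokens, 0)
    assert not rest, "trailing input: %r" % rest
    return tree

def parse_expr(tokens, min_prec):
    left, tokens = parse_atom(tokens)
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens[0]
        # left-associative: parse the right-hand side at one level tighter
        right, tokens = parse_expr(tokens[1:], PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left, tokens

def parse_atom(tokens):
    if tokens[0] == '(':
        tree, tokens = parse_expr(tokens[1:], 0)
        assert tokens and tokens[0] == ')', "expected ')'"
        return tree, tokens[1:]
    return int(tokens[0]), tokens[1:]

print(parse(tokenize("1 + 2 * 3")))   # ('+', 1, ('*', 2, 3))
```

Adding a new operator is one entry in the precedence table, which is why extending such a grammar only requires understanding precedence and associativity.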

LL(k):
Surprisingly limited: it cannot parse any input that starts with k or more left parentheses; without parentheses, it cannot parse any grammar that has more than k precedence levels with infix or postfix operators.
Extending the grammar requires the ability to compute the sets of prefixes, which is moderately difficult and mostly doable in one's head.
The parser has a hard time generating useful error messages.
Implementation is quite complex.

SLR:
Only marginally less limited than operator-precedence.
Otherwise, the same advantages and disadvantages as LR(k) hold.

LALR(k):
Almost as powerful as LR(k+1).
Generating error messages is somewhat more difficult than with LR(k).
Otherwise, the same advantages and disadvantages as LR(k) hold.

LR(k):
This is the most powerful widely-known algorithm for which ambiguities can be detected by inspecting the grammar.
Extending the grammar requires intimate knowledge of the parsing algorithm; otherwise, you'll be unable to interpret the "shift-reduce" and "reduce-reduce" conflict error messages.
Generating useful error messages is moderately difficult.
Implementation is complex.

Earley:
Can handle arbitrary context-free grammars. The downside is that there is no way to check whether a grammar is ambiguous; however, the algorithm will detect whether a given input is ambiguous.
Extending the grammar is trivial; checking it for ambiguity requires an arsenal of techniques (in practice, most teams just hope there is no problem and tweak the grammar until practical usage no longer throws ambiguity errors).
I don't know how hard generating useful error messages is. I suspect it's hard or impossible, but I'd have to ask a search engine to verify that.
Implementation is almost as complex as for LR(k): the parsing algorithm itself is more complex, but you do not need to preprocess the grammar.
The algorithm is worst-case polynomial in input length: quadratic if the input is unambiguous, cubic if it is ambiguous. This is not a serious issue for small inputs. There are variants with better worst-case complexity, but they are more complex to implement.
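To make the "implementation is almost as complex" claim concrete, here is a sketch of an Earley recognizer (recognition only, no parse-tree construction; the toy grammar is made up for illustration):

```python
# Minimal Earley recognizer -- a sketch, not production code.
# grammar: nonterminal -> list of productions (tuples of symbols);
# any symbol that is not a key of the grammar is treated as a terminal.
def earley_recognize(grammar, start, tokens):
    # an item is (head, body, dot, origin)
    chart = [set() for _ in range(len(tokens) + 1)]
    for body in grammar[start]:
        chart[0].add((start, body, 0, 0))
    for i in range(len(tokens) + 1):
        changed = True
        while changed:                       # run predict/scan/complete to a fixpoint
            changed = False
            for head, body, dot, origin in list(chart[i]):
                if dot < len(body):
                    sym = body[dot]
                    if sym in grammar:       # predict
                        for prod in grammar[sym]:
                            item = (sym, prod, 0, i)
                            if item not in chart[i]:
                                chart[i].add(item)
                                changed = True
                    elif i < len(tokens) and tokens[i] == sym:   # scan
                        chart[i + 1].add((head, body, dot + 1, origin))
                else:                        # complete
                    for h2, b2, d2, o2 in list(chart[origin]):
                        if d2 < len(b2) and b2[d2] == head:
                            item = (h2, b2, d2 + 1, o2)
                            if item not in chart[i]:
                                chart[i].add(item)
                                changed = True
    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in chart[len(tokens)])

# E -> E + E | n  -- deliberately ambiguous; Earley does not mind
grammar = {'E': [('E', '+', 'E'), ('n',)]}
print(earley_recognize(grammar, 'E', ['n', '+', 'n']))   # True
print(earley_recognize(grammar, 'E', ['n', '+']))        # False
```

Note there is no grammar preprocessing at all: the grammar dictionary is consumed as-is, which is exactly why extending the grammar is trivial here.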

However, I'd recommend using a parser generator instead of writing a
parser from scratch yourself.  It will introduce some slowdown, but
will instead make the whole thing a lot easier to write, maintain and,
more importantly, to extend.

A parser generator is required for the LL and LR variants.
The problem with these is that they usually come with their own grammar syntax, so using them requires learning new skills.

 I'm not sure about the state of parser
generators for Python, but this page
http://wiki.python.org/moin/LanguageParsing may provide some
information.

A parser that's embedded into Python as a DSL would be an option.
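To illustrate what "embedded as a DSL" means, here is a tiny parser-combinator sketch; the combinator names and the toy grammar are made up for illustration, not taken from any existing Python library:

```python
# A parser "embedded into Python as a DSL": the grammar is plain Python
# expressions built from combinators, so no separate grammar syntax is needed.
def lit(s):
    def p(toks):
        if toks and toks[0] == s:
            return s, toks[1:]
        return None                 # no match
    return p

def seq(*parsers):
    def p(toks):
        out = []
        for q in parsers:
            r = q(toks)
            if r is None:
                return None
            val, toks = r
            out.append(val)
        return out, toks
    return p

def alt(*parsers):
    def p(toks):
        for q in parsers:
            r = q(toks)
            if r is not None:
                return r
        return None
    return p

number = lambda toks: (toks[0], toks[1:]) if toks and toks[0].isdigit() else None
expr = alt(seq(number, lit('+'), number), number)   # the grammar, in Python

print(expr(['1', '+', '2']))   # (['1', '+', '2'], [])
```

The appeal is that extending the grammar is just writing more Python; the cost is that such naive backtracking combinators inherit the limitations (and worst cases) of recursive descent.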

I can see a different problem here though: choosing a parsing method
and producing the grammar of the language (or what information the
parser would require) may not be enough to get the desired
functionality.  What we are trying to get at should incorporate
something similar to natural language processing.  Therefore, before
actually starting the parsing process, I think the input should be
brought to some kind of canonical form.  I am strongly inclined to
believe that Wolfram|Alpha takes exactly this way of doing things (but
obviously with a lot more details).

The first step to transformation would be a parse, so I'm not really sure what point you're making here.

One simple thing I can think of is detecting synonyms.  For example,
"roots x**2 == 1" and "solutions x**2 == 1" and "solve x**2 == 1" are
one and the same thing.  Therefore, it may be useful to start with
substituting each of these synonyms with some canonical word ("solve",
for example).

This is stuff that is easiest to do *after* a parse.
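As a sketch of why this is easy after a parse: with a tree in hand, synonym substitution is a single table lookup on head symbols (the tree shape and synonym table here are made up for illustration):

```python
# Canonicalize synonymous command words on an already-built parse tree.
# A tree node is (head, *children); leaves are plain strings.
SYNONYMS = {'roots': 'solve', 'solutions': 'solve',
            'antiderivative': 'integrate'}

def canonicalize(tree):
    if isinstance(tree, tuple):
        head, *args = tree
        head = SYNONYMS.get(head, head)            # substitute the canonical word
        return (head,) + tuple(canonicalize(a) for a in args)
    return tree

print(canonicalize(('roots', ('Eq', 'x**2', '1'))))
# ('solve', ('Eq', 'x**2', '1'))
```

Doing the same substitution on the raw character stream would instead require knowing which occurrences of "roots" are command words and which are, say, variable names -- which is precisely what the parse tells you.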

It is possible to argue that this can be solved by
introducing additional rules into the grammar.

It's easier to build alternate definitions into the semantic processing that happens after the parse.

However, I am inclined
to believe that this will make the grammar unnecessarily large and,
even worse, slow down the parser.

The parsing algorithms enumerated above do not care much about whether each function and operator name gets its own grammar rule, or whether there is one rule for function calls and one rule for all operators. Initializing the tables requires time roughly linear in the number of grammar symbols, and that's all.

Having a larger grammar isn't much of a maintenance problem either, as long as each rule follows a pattern from a small set of standard patterns. What causes trouble is having many different patterns. Rules interact, and the interactions become more complex when there are many rule patterns: you get "conflicts", which are generally hard to resolve, and adding a new rule can create a conflict in a faraway rule (unless you're using operator precedence, where the number of patterns is restricted, or Earley, where there is no checking on the grammar).

Another simple thing is detecting spelling errors.  Suppose I type
"intgrate" or any of the other (numerous) wrong possibilities.  I
think it would be very nice to have such mistakes detected and
corrected automatically.  This http://packages.python.org/pyenchant/
seems on topic.

Spell checking, again, is something that is easily done on the parse tree after it has been created. Just make sure that the parse tree contains all the information needed to generate a good message: Each symbol should carry not just the text but also originating line and column number.
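A sketch of this, using the standard library's fuzzy matching instead of an external package (the list of known names and the token format are made up for illustration):

```python
# Correct misspelled function names on the token stream, keeping
# line/column info so error messages can point at the source.
import difflib

KNOWN_NAMES = ['solve', 'integrate', 'diff', 'limit', 'expand', 'factor']

def correct(word, cutoff=0.8):
    # returns the closest known name, or the word unchanged if nothing is close
    matches = difflib.get_close_matches(word, KNOWN_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# each token carries (text, line, column), as suggested above
tokens = [('intgrate', 1, 0), ('x**2', 1, 9)]
fixed = [(correct(text), line, col) for text, line, col in tokens]
print(fixed)   # [('integrate', 1, 0), ('x**2', 1, 9)]
```

Because the positions survive the correction, a later message can say exactly which word at which column was reinterpreted, which feeds directly into the "show the substitutions" idea below.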

I'd also recommend showing the substitutions and the resulting
canonical form.  In this way the user will be able to see how their
input was interpreted and, maybe, change their formulation to get
nearer to what they wanted.

+1.

The list of actions the preprocessor should perform can be extended
arbitrarily, I guess.

Yes.

For example, it could try to fix wrongly
balanced parentheses.

That one would be very hard to do well. There is too little redundancy in the input to guess the proper place for a parenthesis. For an example of how it doesn't work, try entering unbalanced parentheses in Excel.

It might also try to cope with a different
order of keywords in the string, like in "sin(x) integral".  It would
be nice to parse single-argument functions written without parentheses
("sin x" instead of "sin(x)").

That one can be done at the grammar level: just make a function name bind tighter than any operator (this is typically how functional programming languages define their grammars).
Then,
  sin x * y
will parse as
  (sin x) * y

Of course, this is not Python's grammar anymore. (It does follow mathematical convention more closely though.)
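A sketch of that grammar choice in code: in the atom rule, a known function name consumes the next *atom* rather than the next expression, so application outbinds every operator (the set of function names is an assumption for illustration):

```python
# Sketch: function application binds tighter than any infix operator,
# so "sin x * y" parses as (sin x) * y.
import re

FUNCTIONS = {'sin', 'cos', 'exp'}               # assumed set of known names
PRECEDENCE = {'+': 1, '-': 1, '*': 2, '/': 2}

def tokenize(text):
    return re.findall(r'[A-Za-z_]\w*|\d+|[+\-*/()]', text)

def parse_expr(tokens, min_prec=0):
    left, tokens = parse_atom(tokens)
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens[0]
        right, tokens = parse_expr(tokens[1:], PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left, tokens

def parse_atom(tokens):
    if tokens[0] == '(':
        tree, tokens = parse_expr(tokens[1:])
        return tree, tokens[1:]                 # drop the ')'
    if tokens[0] in FUNCTIONS:
        # the argument is the next ATOM, not the next expression --
        # this is what makes application outbind the operators
        arg, rest = parse_atom(tokens[1:])
        return (tokens[0], arg), rest
    return tokens[0], tokens[1:]

tree, _ = parse_expr(tokenize("sin x * y"))
print(tree)   # ('*', ('sin', 'x'), 'y')
```

Write `sin (x * y)` to get the other grouping, exactly as in the functional-language convention mentioned above.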

The preprocessor could also drop
incomprehensible (and thus supposedly meaningless) words, like in
"find the integral of x^2".

Experience with this kind of approach is that it tends to reduce predictability. In this case, the user might consider a word meaningless while the program has some obscure definition for it, and it suddenly spits out error messages referring to stuff the user doesn't know about - say, if there is an "of" function somewhere in a category-theoretic module.
