On Tue, Mar 13, 2012 at 9:42 AM, Joachim Durchholz <j...@durchholz.org> wrote:
> Am 11.03.2012 21:40, schrieb Sergiu Ivanov:
>
> Here's a run-down of available techniques:
>
> Operator-precedence:
> Somewhat limited (but not as badly as many people think).
> Extending the grammar requires the ability to grasp the concepts of
> precedence and associativity.
> The parser can easily generate useful error messages.
> Extremely easy to implement.
>
> LL(k):
> Surprisingly limited: it cannot parse any input that starts with k or more
> left parentheses; without parentheses, it cannot handle any grammar that has
> more than k precedence levels with infix or postfix operators.
> Extending the grammar requires the ability to compute the sets of prefixes,
> which is moderately difficult and mostly doable in one's head.
> The parser has a hard time generating useful error messages.
> Implementation is quite complex.
>
> SLR:
> Only marginally less limited than operator-precedence.
> Otherwise, the same advantages and disadvantages as LR(k) hold.
>
> LALR(k):
> Almost as powerful as LR(k+1).
> Generating error messages is somewhat more difficult than with LR(k).
> Otherwise, the same advantages and disadvantages as LR(k) hold.
>
> LR(k):
> This is the most powerful widely-known algorithm for which ambiguities can
> be detected by inspecting the grammar.
> Extending the grammar requires intimate knowledge of the parsing algorithm;
> otherwise, you'll be unable to interpret the "shift-reduce" and
> "reduce-reduce" conflict error messages.
> Generating useful error messages is moderately difficult.
> Implementation is complex.
>
> Earley:
> Can handle arbitrary context-free grammars. The downside is that there is no
> way to check whether a grammar is ambiguous. However, the algorithm will
> detect if a given input is ambiguous.
> Extending the grammar is trivial; checking it for ambiguity requires an
> arsenal of techniques (in practice, most teams just hope that there is no
> problem there and tweak the grammar until practical usage does not throw
> ambiguity errors anymore).
> I don't know how hard generating useful error messages is. I suspect it's
> hard or impossible, but I'd have to ask a search engine to verify that.
> Implementation is almost as complex as for LR(k) (the parsing
> algorithm is more complex, but you do not need to preprocess the grammar).
> The algorithm is worst-case polynomial in input length: quadratic if the
> input is unambiguous, cubic if it is ambiguous. This is not a serious issue
> for small inputs. There are variants that have better worst-case
> complexities, but these are more complex.

Thank you for the very good overview; I wish I had been given
something like this in my course on compilers.
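
Just to check that I follow the operator-precedence part: the minimal
sketch I have in mind is precedence climbing over an already tokenized
input, roughly like this (the precedence table and the tuple-based
parse tree are only for illustration):

    # Left-associative binary operators only; atoms are single tokens.
    PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

    def parse(tokens):
        tree, rest = parse_expr(list(tokens), 0)
        assert not rest, "trailing input"
        return tree

    def parse_expr(tokens, min_prec):
        lhs, tokens = tokens[0], tokens[1:]          # an atom
        while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
            op, tokens = tokens[0], tokens[1:]
            rhs, tokens = parse_expr(tokens, PREC[op] + 1)
            lhs = (op, lhs, rhs)
        return lhs, tokens

    # parse("1 + 2 * 3".split())  ->  ("+", "1", ("*", "2", "3"))

Is that roughly what you mean by "extremely easy to implement"?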

> A parser generator is required for LL and LR variants.
> The problem with these is that they usually come with their own syntax, so
> using them requires learning new skills.

I think the effort required to acquire these new skills is still less
than that of mastering a particular parsing technique and then
implementing it yourself.  (Just for reference, I remember once
learning the basics of JavaCC in a day.)

On the other hand, you can switch parsing techniques much more easily
if you are using parser generators.

>>  I'm not sure about the state of parser
>> generators for Python, but this page
>> http://wiki.python.org/moin/LanguageParsing may provide some
>> information.
>
>
> A parser that's embedded into Python as a DSL would be an option.

I'm not sure I can properly understand your idea; why do we need to
*embed* the DSL into Python?

I've checked the link I provided one more time; it lists a number of
implementations that look quite appealing, including
http://www.acooke.org/lepl/ and http://pypi.python.org/pypi/modgrammar.

>> I can see a different problem here though: choosing a parsing method
>> and producing the grammar of the language (or what information the
>> parser would require) may not be enough to get the desired
>> functionality.  What we are trying to get at should incorporate
>> something similar to natural language processing.  Therefore, before
>> actually starting the parsing process, I think the input should be
>> brought to some kind of canonical form.  I am strongly inclined to
>> believe that Wolfram|Alpha takes exactly this way of doing things (but
>> obviously with a lot more details).
>
>
> The first step to transformation would be a parse, so I'm not really sure
> what point you're making here.

Hm, well, the transformations I listed were meant to be done by plain
find-and-replace, so no, the first step in the transformations I
suggested would not be parsing.  Maybe lexing, but not parsing.

>> One simple thing I can think of is detecting synonyms.  For example,
>> "roots x**2 == 1" and "solutions x**2 == 1" and "solve x**2 == 1" are
>> one and the same thing.  Therefore, it may be useful to start with
>> substituting each of these synonyms with some canonical word ("solve",
>> for example).
>
>
> This is stuff that is easiest to do *after* a parse.

How so?  I can't see how it is hard to replace ("roots" or
"solutions") with "solve" before parsing.

>> It is possible to argue that this can be solved by
>> introducing additional rules into the grammar.
>
>
> It's easier to build alternate definitions into the semantic processing that
> happens after the parse.

Hm, I think I can see your point now.  It is indeed more correct to
handle alternate definitions in the semantic processing that happens
after the parse.  However, I'm not sure the resulting code is going to
be simpler.

That said, I am not going to fight for this point of mine; your
approach looks more correct.

>> However, I am inclined
>> to believe that this will make the grammar unnecessarily large and,
>> even worse, slow down the parser.
>
>
> The parsing algorithms enumerated above do not care much whether each
> function and operator name gets a separate grammar rule, or whether there
> is one rule for function calls and one rule for all operators.
> Initializing the tables requires time roughly linear in the number of
> grammar symbols, and that's all.

Yes, sure.

> Having a larger grammar isn't so much a maintenance problem either if each
> rule follows a pattern from a small set of standard patterns.

I think I don't understand this.  Could you please give more details?

By the way, I can see the reason for the difference between your
suggestion and mine: I am too used to (theoretical) rewriting rules,
which made me forget about the lexing phase completely.

>> Another simple thing is detecting spelling errors.  Suppose I type
>> "intgrate" or any of the other (numerous) wrong possibilities.  I
>> think it would be very nice to have such mistakes detected and
>> corrected automatically.  This http://packages.python.org/pyenchant/
>> seems on topic.
>
>
> Spell checking, again, is something that is easily done on the parse tree
> after it has been created.

I'm not sure I agree.  Consider the (supposedly valid) sentences
"integrate sin(x) by x" and "limit sin(x) when x goes to zero".  I
don't think I'd recommend parsing these two sentences with one
(general) rule, which means that the words "integrate" and "limit"
actually determine which of the rules to use.  If spell checking
doesn't happen before lexing, a misspelt "integrate" or "limit" will
not be recognised, and the parser will not know which rule to apply.

>> For example, it could try to fix wrongly
>> balanced parentheses.
>
> That one would be very hard to do well. There is too little redundancy in
> the input to guess the proper place for a parenthesis.

Yes, of course.  That was just a wild suggestion with brainstorming
mode on.

> For an example of how it doesn't work, try entering unbalanced parentheses
> in Excel.

LibreOffice Spreadsheet seems to cope with the simplest cases, but I
guess it doesn't go much beyond those.

>> It might also try to cope with a different
>> order of keywords in the string, like in "sin(x) integral".  It would
>> be nice to parse single-argument functions written without parentheses
>> ("sin x" instead of "sin(x)").
>
> That one can be done on the grammar level: just make a function name bind
> tighter than any operator (this is typically how functional programming
> languages have their grammar defined).
> Then,
>  sin x * y
> will parse as
>  (sin x) * y

Indeed.
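
In code, I picture it roughly like a hand-written recursive-descent
fragment where a (hypothetical) function name binds tighter than "*",
so "sin x * y" comes out as ((sin x) * y):

    FUNCS = {"sin", "cos"}

    def parse_term(tokens):
        lhs, tokens = parse_factor(tokens)
        while tokens and tokens[0] == "*":
            rhs, tokens = parse_factor(tokens[1:])
            lhs = (lhs, "*", rhs)
        return lhs, tokens

    def parse_factor(tokens):
        if tokens[0] in FUNCS:            # function application binds tightest
            name, tokens = tokens[0], tokens[1:]
            arg, tokens = parse_factor(tokens)
            return (name, arg), tokens
        return tokens[0], tokens[1:]

    # parse_term("sin x * y".split())  ->  ((("sin", "x"), "*", "y"), [])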

>> The preprocessor could also drop
>> incomprehensible (and thus supposedly meaningless) words, like in
>> "find the integral of x^2".
>
>
> Experience with this kind of approach has been that it tends to reduce
> predictability. In this case, the user might consider a word meaningless, but
> the program has some obscure definition for it and suddenly spits out error
> messages that refer to stuff that the user doesn't know about

I don't think this is a problem because the user is not supposed to
purposefully input meaningless words in normal scenarios.

> - say, if
> there is an "of" function somewhere in a category-theoretic module.

:-D

Sergiu
