Re: [HACKERS] Status report: getting plpgsql to use the core lexer

2009-07-16 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote: 
 
 One problem that wasn't obvious when I started is that if you are
 trying to use a reentrant lexer, Bison insists on including its
 YYSTYPE union in the call signature of the lexer.  Of course,
 YYSTYPE means different things to the core grammar and plpgsql's
 grammar.  I tried to work around that by having an interface layer
 that would (among other duties) translate as needed.  It turned out
 to be a real PITA, not least because you can't include both
 definitions into the same C file.  The scheme I have has more or
 less failed --- I think I'd need *two* interface layers to make it
 work without unmaintainable kluges.  It would probably be better to
 try to adjust the core lexer's API some more so that it does not
 depend on the core YYSTYPE, but I'm not sure yet how to get Bison to
 play along without injecting an interface layer (and hence wasted
 cycles) into the core grammar/lexer interface.
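
To illustrate the alternative mentioned above (a lexer API that does not depend on either grammar's YYSTYPE), here is a minimal sketch in C. Every name in it (LexValue, Scanner, shared_lex, GrammarA_YYSTYPE, GrammarB_YYSTYPE) is hypothetical and not taken from the PostgreSQL sources. It assumes each grammar can embed the lexer's small value union as a member of its own semantic type. It does not address how to get Bison to emit such calls, which is the open question in the text.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: the only semantic values the shared lexer ever produces. */
typedef union LexValue {
    int         ival;   /* value of an integer literal */
    const char *str;    /* text of an identifier */
} LexValue;

enum { TOK_EOF = 0, TOK_INT = 258, TOK_IDENT = 259 };

/* Hypothetical reentrant scanner state: everything lives here, no globals. */
typedef struct Scanner {
    const char *pos;      /* next unread character */
    char        buf[64];  /* storage for the current identifier */
} Scanner;

/* The shared lexer knows LexValue only, never a grammar's YYSTYPE. */
int shared_lex(LexValue *lval, Scanner *s)
{
    size_t n = 0;

    assert(lval != NULL && s != NULL);
    while (*s->pos == ' ')
        s->pos++;
    if (*s->pos == '\0')
        return TOK_EOF;
    if (isdigit((unsigned char) *s->pos)) {
        int v = 0;
        while (isdigit((unsigned char) *s->pos))
            v = v * 10 + (*s->pos++ - '0');
        lval->ival = v;
        return TOK_INT;
    }
    /* Anything else up to the next space is treated as an identifier. */
    while (*s->pos != '\0' && *s->pos != ' ' && n < sizeof(s->buf) - 1)
        s->buf[n++] = *s->pos++;
    s->buf[n] = '\0';
    lval->str = s->buf;
    return TOK_IDENT;
}

/* Each grammar's value type embeds LexValue next to its own members. */
typedef union GrammarA_YYSTYPE {
    LexValue lex;
    void    *node;   /* stand-in for grammar A's parse-tree pointers */
} GrammarA_YYSTYPE;

typedef union GrammarB_YYSTYPE {
    LexValue lex;
    double   dval;   /* stand-in for grammar B's own values */
} GrammarB_YYSTYPE;

/* Thin per-grammar entry points: pass the embedded member through,
 * so no value has to be copied or translated. */
int grammar_a_lex(GrammarA_YYSTYPE *lvalp, Scanner *s)
{
    return shared_lex(&lvalp->lex, s);
}

int grammar_b_lex(GrammarB_YYSTYPE *lvalp, Scanner *s)
{
    return shared_lex(&lvalp->lex, s);
}
```

Because the two value types have different names here, both can be declared in the same C file. Each wrapper only forwards a pointer, so the extra cost is one call that a compiler can usually inline.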
 
 Another pretty serious issue is that the current plpgsql lexer
 treats various sorts of qualified names as single tokens.  I had
 thought this could be worked around in the interface layer by doing
 more lookahead.  You can do that, and it mostly works, but it's
 mighty tedious.  The big problem is that yytext gets out of step
 --- it will point at the last token the core lexer has processed,
 and there's no good way to back it up after lookahead.  I spent a
 fair amount of time trying to work around that by eliminating uses
 of yytext in plpgsql, and mostly succeeded, but there are still
 some left.  (Some of the remaining regression failures are error
 messages that point at the wrong token because they rely on yytext.)
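
The yytext problem described above can be illustrated with a small C sketch in which each token records its own offset and length in the source string, so that lookahead and push-back never disturb what an error message would point at. This is an assumption-laden illustration, not the plpgsql code: the names (Token, TokenStream, next_token, push_back, read_name) are hypothetical, and only two-part names of the form a.b are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { T_EOF, T_WORD, T_DOT };

/* A token carries its own position, so no global yytext is needed. */
typedef struct Token {
    int kind;
    int start;   /* offset of the first character in the source */
    int len;     /* number of characters covered */
} Token;

typedef struct TokenStream {
    const char *src;
    int         pos;
    Token       pushed[2];   /* small LIFO push-back buffer */
    int         npushed;
} TokenStream;

void stream_init(TokenStream *ts, const char *src)
{
    ts->src = src;
    ts->pos = 0;
    ts->npushed = 0;
}

/* Stand-in for the underlying lexer: words and dots only. */
static Token raw_lex(TokenStream *ts)
{
    Token t;

    while (ts->src[ts->pos] == ' ')
        ts->pos++;
    t.start = ts->pos;
    if (ts->src[ts->pos] == '\0') {
        t.kind = T_EOF;
        t.len = 0;
        return t;
    }
    if (ts->src[ts->pos] == '.') {
        ts->pos++;
        t.kind = T_DOT;
        t.len = 1;
        return t;
    }
    while (ts->src[ts->pos] != '\0' && ts->src[ts->pos] != ' ' &&
           ts->src[ts->pos] != '.')
        ts->pos++;
    t.kind = T_WORD;
    t.len = ts->pos - t.start;
    return t;
}

Token next_token(TokenStream *ts)
{
    if (ts->npushed > 0)
        return ts->pushed[--ts->npushed];
    return raw_lex(ts);
}

void push_back(TokenStream *ts, Token t)
{
    assert(ts->npushed < 2);
    ts->pushed[ts->npushed++] = t;
}

/* Read a name; if it is followed by ".word", return one token that
 * spans the whole qualified name. Otherwise un-read the lookahead. */
Token read_name(TokenStream *ts)
{
    Token first = next_token(ts);
    Token dot;

    if (first.kind != T_WORD)
        return first;
    dot = next_token(ts);
    if (dot.kind == T_DOT) {
        Token second = next_token(ts);
        if (second.kind == T_WORD) {
            first.len = second.start + second.len - first.start;
            return first;
        }
        push_back(ts, second);   /* pushed first, so returned last */
    }
    push_back(ts, dot);
    return first;
}

/* Copy a token's text out of the source, truncating if necessary. */
void token_text(const TokenStream *ts, Token t, char *out, size_t outsz)
{
    size_t n = (size_t) t.len < outsz - 1 ? (size_t) t.len : outsz - 1;

    memcpy(out, ts->src + t.start, n);
    out[n] = '\0';
}
```

An error report would then use the start and len of the offending token rather than whatever the lexer scanned most recently.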
 
 Now, having name lookup happen at the lexical level is pretty bogus
 anyhow.  The long-term solution here is probably to avoid doing
 lookup in the plpgsql lexer and move it into some sort of callback
 hook in the main parser, as we've discussed before.  I didn't want
 to get into that right away, but I'm now thinking it has to happen
 before, not after, refactoring the lexer code.  One issue that has to
 be surmounted before that can happen is that plpgsql currently
 throws away all knowledge of syntactic scope after initial
 processing of a function --- the name stack is no longer available
 when we want to parse individual SQL commands.  We can probably
 rearrange that design but it's another bit of work I don't have time
 for right now.
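
The callback-hook idea above might look roughly like the following C sketch. The parser knows nothing about procedural-language variables. It calls a hook supplied by the caller, and the hook searches a chain of scopes that has been kept alive after function compilation. All names (Var, Scope, NameHook, ParseState, resolve_identifier, scope_lookup_hook) are hypothetical and are not the actual PostgreSQL or plpgsql interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One declared variable: its name and a slot number in the function. */
typedef struct Var {
    const char *name;
    int         slot;
} Var;

/* A block scope; parent links form the retained name stack. */
typedef struct Scope {
    const struct Scope *parent;
    const Var          *vars;
    int                 nvars;
} Scope;

/* Hook signature: return the variable's slot, or -1 if unknown. */
typedef int (*NameHook)(const char *name, void *arg);

/* The part of parser state that the caller fills in. */
typedef struct ParseState {
    NameHook hook;
    void    *hook_arg;
} ParseState;

/* Called by the parser when it meets an identifier it cannot resolve. */
int resolve_identifier(const ParseState *ps, const char *name)
{
    if (ps->hook != NULL)
        return ps->hook(name, ps->hook_arg);
    return -1;
}

/* A hook a procedural language could supply: innermost scope wins. */
int scope_lookup_hook(const char *name, void *arg)
{
    const Scope *sc;
    int          i;

    for (sc = arg; sc != NULL; sc = sc->parent)
        for (i = 0; i < sc->nvars; i++)
            if (strcmp(sc->vars[i].name, name) == 0)
                return sc->vars[i].slot;
    return -1;
}
```

The sketch relies on the scope chain still existing when an individual SQL command is parsed, which is exactly what the text says is not yet the case.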
 
All of this sounds pretty familiar to me.  As you may recall, our
framework includes a SQL parser which parses the subset of standard
SQL we feel is portable enough, and generates Java classes to
implement the code in lowest common denominator SQL with all
procedural code for triggers and stored procedures handled in Java
(which runs in our middle tier database service).  We use ANTLR, and
initially had a three-phase process: lexer, parser, and tree-walkers
to generate code.  We were doing way too much in the parser phase --
checking for table names, column names, data types, etc.  The syntax
of SQL forced us to do a lot of scanning forward and remembering where
we were (especially to get the FROM clause information so we could
process columns in the result list).
 
We were able to get to much cleaner code by rewriting the parser to
have a dumb phase to get the overall structure into an AST, and then
use a tree-walker phase to do all the lookups and type resolution
after we had the rough structure, writing another AST to walk for code
generation.  Besides making the code cleaner and easier to maintain,
it helped us give better error messages pointing more accurately to
the source of the problem.  I don't know if a similar approach is
feasible in flex/bison, but if it is, refactoring for an extra pass
might be worth the trouble.
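
As a rough sketch of the second pass described above, written in C rather than ANTLR: the tree is assumed to have been built already by a purely structural first pass, so the select list can be resolved against the FROM clause without any scanning ahead. All type and function names (ColumnRef, TableDef, SelectNode, resolve_select) are hypothetical. For brevity a column is attributed to the first table that has it, with no ambiguity check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A column reference as left by the first pass: just text and position. */
typedef struct ColumnRef {
    const char *name;
    int         location;     /* character offset in the query text */
    int         table_index;  /* filled in by the second pass, -1 if unknown */
} ColumnRef;

typedef struct TableDef {
    const char        *name;
    const char *const *columns;
    int                ncolumns;
} TableDef;

/* Structure-only parse result for "SELECT targets FROM tables". */
typedef struct SelectNode {
    ColumnRef             *targets;
    int                    ntargets;
    const TableDef *const *from;
    int                    nfrom;
} SelectNode;

/* Second pass: resolve every target column against the FROM list.
 * Returns -1 on success, or the source location of the first column
 * that could not be found, so the error can point at the right token. */
int resolve_select(SelectNode *sel)
{
    int t, f, c;

    for (t = 0; t < sel->ntargets; t++) {
        ColumnRef *ref = &sel->targets[t];

        ref->table_index = -1;
        for (f = 0; f < sel->nfrom && ref->table_index < 0; f++)
            for (c = 0; c < sel->from[f]->ncolumns; c++)
                if (strcmp(sel->from[f]->columns[c], ref->name) == 0) {
                    ref->table_index = f;
                    break;
                }
        if (ref->table_index < 0)
            return ref->location;
    }
    return -1;
}
```

For a query such as "SELECT id, nme FROM emp", the first pass would record offset 11 for nme, and the second pass can report that offset once it finds no such column in emp.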
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Status report: getting plpgsql to use the core lexer

2009-07-16 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 ...
 We were able to get to much cleaner code by rewriting the parser to
 have a dumb phase to get the overall structure into an AST, and then
 use a tree-walker phase to do all the lookups and type resolution
 after we had the rough structure, writing another AST to walk for code
 generation.  Besides making the code cleaner and easier to maintain,
 it helped us give better error messages pointing more accurately to
 the source of the problem.  I don't know if a similar approach is
 feasible in flex/bison, but if it is, refactoring for an extra pass
 might be worth the trouble.

That's actually what we have in the core parser.  plpgsql is trying to
take shortcuts, and this whole project is exactly about weaning it away
from that.  The bottom line is I tried to tackle the sub-projects in the
wrong order...

regards, tom lane



Re: [HACKERS] Status report: getting plpgsql to use the core lexer

2009-07-15 Thread Alvaro Herrera
Tom Lane wrote:

 Another pretty serious issue is that the current plpgsql lexer treats
 various sorts of qualified names as single tokens.  I had thought this
 could be worked around in the interface layer by doing more lookahead.
 You can do that, and it mostly works, but it's mighty tedious.  The big
 problem is that yytext gets out of step --- it will point at the last
 token the core lexer has processed, and there's no good way to back it up
 after lookahead.  I spent a fair amount of time trying to work around that
 by eliminating uses of yytext in plpgsql, and mostly succeeded, but
 there are still some left.  (Some of the remaining regression failures are
 error messages that point at the wrong token because they rely on yytext.)

Just wondering if there are additional regressions not detected due to
pg_regress using the ignore-whitespace option to diff.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] Status report: getting plpgsql to use the core lexer

2009-07-15 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Tom Lane wrote:
 ...  I spent a fair amount of time trying to work around that
 by eliminating uses of yytext in plpgsql, and mostly succeeded, but
 there are still some left.  (Some of the remaining regression failures are
 error messages that point at the wrong token because they rely on yytext.)

 Just wondering if there are additional regressions not detected due to
 pg_regress using the ignore-whitespace option to diff.

Good question, but I doubt it --- those messages all use double quotes
around the yytext string, and I believe that e.g. "foo" and " foo" are
different even under --ignore-whitespace.

I just finished wiping all that stuff from my work directory, or I'd
be able to give you a non-guesswork answer :-( ... but it's not
worth regenerating the build for.

regards, tom lane
