Re: [HACKERS] scanner/parser minimization

2013-03-14 Thread Heikki Linnakangas

On 13.03.2013 10:50, Simon Riggs wrote:

On 2 March 2013 18:47, Heikki Linnakangas hlinnakan...@vmware.com wrote:

Interestingly, the yy_transition array generated by flex used to be much
smaller:

8.3: 22072 elements
8.4: 62623 elements
master: 64535 elements

The big jump between 8.3 and 8.4 was caused by introduction of the unicode
escapes: U&'foo' [UESCAPE 'x']. And in particular, the error rule for the
UESCAPE, which we use to avoid backtracking.

I experimented with a patch that uses two extra flex states to shorten the
error rules, see attached. The idea is that after lexing a unicode literal
like U&'foo', you enter a new state, in which you check whether an
UESCAPE 'x' follows. This slashes the size of the array to 36581 elements.


+1 to do this sooner rather than later


I hear no objection, so committed. (after fixing some small bugs in the 
patch, and adding some comments)


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] scanner/parser minimization

2013-03-14 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 I hear no objection, so committed. (after fixing some small bugs in the 
 patch, and adding some comments)

Please keep psqlscan.l in sync with scan.l.

regards, tom lane




Re: [HACKERS] scanner/parser minimization

2013-03-13 Thread Simon Riggs
On 2 March 2013 18:47, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 Uh ... no.  I haven't looked into why the flex tables are so large,
 but this theory is just wrong.  See ScanKeywordLookup().


 Interestingly, the yy_transition array generated by flex used to be much
 smaller:

 8.3: 22072 elements
 8.4: 62623 elements
 master: 64535 elements

 The big jump between 8.3 and 8.4 was caused by introduction of the unicode
 escapes: U&'foo' [UESCAPE 'x']. And in particular, the error rule for the
 UESCAPE, which we use to avoid backtracking.

 I experimented with a patch that uses two extra flex states to shorten the
 error rules, see attached. The idea is that after lexing a unicode literal
 like U&'foo', you enter a new state, in which you check whether an
 UESCAPE 'x' follows. This slashes the size of the array to 36581 elements.

+1 to do this sooner rather than later

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] scanner/parser minimization

2013-03-08 Thread Bruce Momjian
On Thu, Feb 28, 2013 at 04:09:11PM -0500, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  A whole lot of those state transitions are attributable to states
  which have separate transitions for each of many keywords.
 
 Yeah, that's no surprise.
 
 The idea that's been in the back of my mind for awhile is to try to
 solve the problem at the lexer level not the parser level: that is,
 have the lexer return IDENT whenever a keyword appears in a context
 where it should be interpreted as unreserved.  You suggest somehow
 driving that off mid-rule actions, but that seems to me to be probably
 a nonstarter from a code complexity and maintainability standpoint.
 
 I believe however that it's possible to extract an idea of which
 tokens the parser believes it can see next at any given parse state.
 (I've seen code for this somewhere on the net, but am too lazy to go
 searching for it again right now.)  So we could imagine a rule along

I believe tokenizing of typedefs requires the lexer to peek at the
parser state:

http://calculist.blogspot.com/2009/02/c-typedef-parsing-problem.html

The well-known typedef problem with parsing C is that the standard C
grammar is ambiguous unless the lexer distinguishes identifiers bound by
typedef and other identifiers as two separate lexical classes. This
means that the parser needs to feed scope information to the lexer
during parsing. One upshot is that lexing must be done concurrently with
parsing. 

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] scanner/parser minimization

2013-03-02 Thread Greg Stark
Regarding yytransition I think the problem is we're using flex to
implement keyword recognition which is usually not what it's used for.
Usually people use flex to handle syntax things like quoting and
numeric formats. All identifiers are handled by flex as equivalent.
Then the last step in the scanner for identifiers is to look up the
identifier in a hash table and return the keyword token if it's a
keyword. That would massively simplify the scanner tables.

This hash table can be heavily optimized because it's a static lookup.
c.f. http://www.gnu.org/software/gperf/

In theory this is more expensive since it needs to do a strcmp in
addition to scanning the identifier to determine whether the token
ends. But I suspect in practice the smaller tables might outweigh
that cost.


On Thu, Feb 28, 2013 at 9:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I believe however that it's possible to extract an idea of which
 tokens the parser believes it can see next at any given parse state.
 (I've seen code for this somewhere on the net, but am too lazy to go
 searching for it again right now.)  So we could imagine a rule along
 the lines of "if IDENT is allowed as a next token, and $KEYWORD is
 not, then return IDENT not the keyword's own token".

That's a pretty cool idea. I'm afraid it might be kind of slow to
produce that list and load it into a hash table for every ident token
though. I suppose if you can request it only if the ident is a keyword
of some type then it wouldn't actually kick in often.

That would also imply we could simply use IDENT everywhere we
currently have col_name_keyword or type_function_name. Just by having
a rule that accepts a keyword that would implicitly make it not be
accepted as an IDENT when unquoted in that location. That might make
the documentation a bit trickier and it would make it harder for users
to make their code forward-compatible. It would also make it harder
for hackers to determine when we've accidentally narrowed the
allowable identifiers for more cases than they expect.

-- 
greg




Re: [HACKERS] scanner/parser minimization

2013-03-02 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 Regarding yytransition I think the problem is we're using flex to
 implement keyword recognition which is usually not what it's used for.
 Usually people use flex to handle syntax things like quoting and
 numeric formats. All identifiers are handled by flex as equivalent.
 Then the last step in the scanner for identifiers is to look up the
 identifier in a hash table and return the keyword token if it's a
 keyword. That would massively simplify the scanner tables.

Uh ... no.  I haven't looked into why the flex tables are so large,
but this theory is just wrong.  See ScanKeywordLookup().

regards, tom lane




Re: [HACKERS] scanner/parser minimization

2013-03-02 Thread Robert Haas
On Thu, Feb 28, 2013 at 4:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I believe however that it's possible to extract an idea of which
 tokens the parser believes it can see next at any given parse state.
 (I've seen code for this somewhere on the net, but am too lazy to go
 searching for it again right now.)  So we could imagine a rule along
 the lines of "if IDENT is allowed as a next token, and $KEYWORD is
 not, then return IDENT not the keyword's own token".

 This might be unworkable from a speed standpoint, depending on how
 expensive it is to make the determination about allowable next symbols.
 But it seems worth looking into.

Interesting idea.  But wouldn't that change the semantics of the
grammar in some places?  In particular, keywords would generally
become less-reserved than they are now, but in a context-dependent
way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] scanner/parser minimization

2013-03-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Feb 28, 2013 at 4:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I believe however that it's possible to extract an idea of which
 tokens the parser believes it can see next at any given parse state.
 (I've seen code for this somewhere on the net, but am too lazy to go
 searching for it again right now.)  So we could imagine a rule along
 the lines of "if IDENT is allowed as a next token, and $KEYWORD is
 not, then return IDENT not the keyword's own token".
 
 This might be unworkable from a speed standpoint, depending on how
 expensive it is to make the determination about allowable next symbols.
 But it seems worth looking into.

 Interesting idea.  But wouldn't that change the semantics of the
 grammar in some places?  In particular, keywords would generally
 become less-reserved than they are now, but in a context-dependent
 way.

Yeah, Greg already raised that point.  It shouldn't change the semantics
of the grammar *today*, but it would become much harder to tell whether
future changes create any conflicts; we'd lose the mechanical
verification we get from bison with the current method.  That might be
sufficient reason not to pursue the idea.

It's possible that static analysis of the grammar could replace bison
verification; that is, assuming we have a way to identify all states in
which both IDENT and some-keyword are allowable, we could generate a
report listing which keywords are potentially going to be considered
non-reserved, and then watch for changes in that report.  But this would
require writing even more code that we don't have today.

I suspect also that we'd still need a notion of fully reserved keywords
that don't get flipped to IDENT even if that's a legal next symbol.

regards, tom lane




Re: [HACKERS] scanner/parser minimization

2013-03-02 Thread Heikki Linnakangas

On 02.03.2013 17:09, Tom Lane wrote:

Greg Stark st...@mit.edu writes:

Regarding yytransition I think the problem is we're using flex to
implement keyword recognition which is usually not what it's used for.
Usually people use flex to handle syntax things like quoting and
numeric formats. All identifiers are handled by flex as equivalent.
Then the last step in the scanner for identifiers is to look up the
identifier in a hash table and return the keyword token if it's a
keyword. That would massively simplify the scanner tables.


Uh ... no.  I haven't looked into why the flex tables are so large,
but this theory is just wrong.  See ScanKeywordLookup().


Interestingly, the yy_transition array generated by flex used to be much 
smaller:


8.3: 22072 elements
8.4: 62623 elements
master: 64535 elements

The big jump between 8.3 and 8.4 was caused by introduction of the 
unicode escapes: U&'foo' [UESCAPE 'x']. And in particular, the error 
rule for the UESCAPE, which we use to avoid backtracking.


I experimented with a patch that uses two extra flex states to shorten 
the error rules, see attached. The idea is that after lexing a unicode 
literal like U&'foo', you enter a new state, in which you check 
whether an UESCAPE 'x' follows. This slashes the size of the array to 
36581 elements.


- Heikki
*** a/src/backend/parser/scan.l
--- b/src/backend/parser/scan.l
***************
*** 162,168 ****
--- 162,170 ----
  %x xq
  %x xdolq
  %x xui
+ %x xuiend
  %x xus
+ %x xusend
  %x xeu
  
  /*
***************
*** 279,295 ****
  /* Unicode escapes */
  uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
  /* error rule to avoid backup */
! uescapefail		("-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU])
  
  /* Quoted identifier with Unicode escapes */
  xuistart		[uU]&{dquote}
- xuistop1		{dquote}{whitespace}*{uescapefail}?
- xuistop2		{dquote}{whitespace}*{uescape}
  
  /* Quoted string with Unicode escapes */
  xusstart		[uU]&{quote}
! xusstop1		{quote}{whitespace}*{uescapefail}?
! xusstop2		{quote}{whitespace}*{uescape}
  
  /* error rule to avoid backup */
  xufailed		[uU]&
--- 281,297 ----
  /* Unicode escapes */
  uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
  /* error rule to avoid backup */
! uescapefail		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
  
  /* Quoted identifier with Unicode escapes */
  xuistart		[uU]&{dquote}
  
  /* Quoted string with Unicode escapes */
  xusstart		[uU]&{quote}
! 
! xustop1		{uescapefail}?
! xustop2		{uescape}
! 
  
  /* error rule to avoid backup */
  xufailed		[uU]&
***************
*** 536,549 ****
  	yylval->str = litbufdup(yyscanner);
  	return SCONST;
  }
! <xus>{xusstop1} {
! 	/* throw back all but the quote */
  	yyless(1);
  	BEGIN(INITIAL);
  	yylval->str = litbuf_udeescape('\\', yyscanner);
  	return SCONST;
  }
! <xus>{xusstop2} {
  	BEGIN(INITIAL);
  	yylval->str = litbuf_udeescape(yytext[yyleng-2], yyscanner);
  	return SCONST;
--- 538,558 ----
  	yylval->str = litbufdup(yyscanner);
  	return SCONST;
  }
! <xus>{quotestop} |
! <xus>{quotefail} {
  	yyless(1);
+ 	BEGIN(xusend);
+ }
+ <xusend>{whitespace}
+ <xusend>{other} |
+ <xusend>{xustop1} {
+ 	/* throw back everything */
+ 	yyless(0);
  	BEGIN(INITIAL);
  	yylval->str = litbuf_udeescape('\\', yyscanner);
  	return SCONST;
  }
! <xusend>{xustop2} {
  	BEGIN(INITIAL);
  	yylval->str = litbuf_udeescape(yytext[yyleng-2], yyscanner);
  	return SCONST;
***************
*** 702,708 ****
  	yylval->str = ident;
  	return IDENT;
  }
! <xui>{xuistop1}	{
  	char		   *ident;
  
  	BEGIN(INITIAL);
--- 711,723 ----
  	yylval->str = ident;
  	return IDENT;
  }
! <xui>{dquote} {
! 	yyless(1);
! 	BEGIN(xuiend);
! }
! <xuiend>{whitespace} { }
! <xuiend>{other} |
! <xuiend>{xustop1}	{
  	char		   *ident;
  
  	BEGIN(INITIAL);
***************
*** 712,722 ****
  	if (yyextra->literallen >= NAMEDATALEN)
  		truncate_identifier(ident, yyextra->literallen, true);
  	yylval->str = ident;
! 	/* throw back all but the quote */
! 	yyless(1);
  	return IDENT;
  }
! <xui>{xuistop2}	{
  	char		   *ident;
  
  	BEGIN(INITIAL);
--- 727,736 ----
  	if (yyextra->literallen >= NAMEDATALEN)
  		truncate_identifier(ident, yyextra->literallen, true);
  	yylval->str = ident;
! 	/* throw back everything that follows the end quote */
  	return IDENT;
  }
! <xuiend>{xustop2}	{
  	char		  

Re: [HACKERS] scanner/parser minimization

2013-03-01 Thread Peter Eisentraut
On 2/28/13 3:34 PM, Robert Haas wrote:
 It's
 possible to vastly reduce the size of the scanner output, and
 therefore of gram.c, by running flex with -Cf rather than -CF, which
 changes the table representation completely.  I assume there is a
 sound performance reason why we don't do this, but it might be worth
 checking that if we haven't lately.  When compiled with -Cf, the size
 of gram.o drops from 1019260 bytes to 703600, which is a large
 savings.

The option choice is based on the recommendation in the flex manual.  It
wouldn't hurt to re-test it.




[HACKERS] scanner/parser minimization

2013-02-28 Thread Robert Haas
Today's b^Hdiscussion on materialized views reminded me that I spent a
little bit of time looking at gram.y and thinking about what we might
be able to do to reduce the amount of bloat it spits out.   On my
system, without debugging symbols, gram.o is 1019260 bytes.  Using "nm
gram.o | sort | less" to compare the starting addresses of each symbol
with the next symbol, I figured out the size of some of the larger
symbols: yy_transition is 516288 bytes, and yycheck and yytable are
each 145984 bytes.  Thus, anything which doesn't reduce the size of
one of those three symbols isn't going to help very much.

yy_transition appears in scan.c - that is, it's part of the flex
output, not the yacc output.  It is an array of 64,535 yy_trans_info
structures, each containing two flex_int32_t members.  Off-hand the
values all look small enough to fit in 2-byte integers rather than
4-byte integers; it's not clear to me why flex doesn't do that.  It's
possible to vastly reduce the size of the scanner output, and
therefore of gram.c, by running flex with -Cf rather than -CF, which
changes the table representation completely.  I assume there is a
sound performance reason why we don't do this, but it might be worth
checking that if we haven't lately.  When compiled with -Cf, the size
of gram.o drops from 1019260 bytes to 703600, which is a large
savings.

yytable and yycheck are both part of the parser; that is, they appear
in gram.c.  Each is an array of 2-byte integers.  I'm not clear on
exactly how these relate to the size of the state table, but it seems
that the parser has the idea that each state has a list of actions
associated with specific next-tokens, and then it also has a default
action which is followed for next-tokens for which no specific rule
is present.  I believe that yytable and yycheck somehow encode these
transition tables, but the details are not altogether clear to me.  If
that view of the situation is correct, then a reasonable way of
approaching the task of reducing the parser size is to look for states
in gram.output (generated by running bison with the -v option) that
have a lot of non-default rules and mulling over how we might reduce
that number.

A whole lot of those state transitions are attributable to states
which have separate transitions for each of many keywords.  They
transition to states which then reduce the corresponding keyword to
unreserved_keyword, col_name_keyword, type_func_name_keyword, or
reserved_keyword.  So it would seem that if we could reduce those
transitions in some way, it would help a lot.  I experimented with
this a little.  Consider the following patch:

--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -12489,8 +12489,6 @@ SignedIconst: Iconst					{ $$ = $1; }
 /* Column identifier --- names that can be column, table, etc names.
  */
 ColId:		IDENT									{ $$ = $1; }
-			| unreserved_keyword					{ $$ = pstrdup($1); }
-			| col_name_keyword						{ $$ = pstrdup($1); }
 		;

 /* Type/function identifier --- names that can be type or function names.

On my machine, that two-line patch has the effect of reducing the size
of gram.o from 1019260 bytes to 844844 bytes, a savings of 164kB.
Pretty good for ripping out two lines of code.  Applying
similarly-crude hacks to type_function_name or ColLabel is not nearly
as effective.  If I hack up all three, gram.o drops to 794612 bytes, a
savings of 49 additional kB.  If I hack up ONLY one of the other two
and NOT ColId, the savings are much less.  Clearly ColId is the big
offender by far.  I suspect, but am not positive, that this is simply
because ColId is more widely-used throughout the grammar.  For
example, the following patch actually makes gram.o about 4kB larger:

--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -5381,7 +5381,7 @@ SecLabelStmt:
}
;

-opt_provider:  FOR ColId_or_Sconst { $$ = $2; }
+opt_provider:  FOR ColLabel { $$ = $2; }
| /* empty */   { $$ = NULL; }
;

Accepting more rather than fewer keywords in that position makes
things worse, which makes sense.

Interestingly, this analysis suggests that while reserved keywords and
type_func_name keywords are surely worse from an application
compatibility standpoint, unreserved and col_name_keywords are worse
from a parser bloat standpoint, because there are more contexts where
we have to laboriously ignore their status as keywords, via additional
parser state transitions.

I don't have a brilliant idea on what to do about this, but one
thought that did occur to me is that we might be able to use mid-rule
actions to change the scanner behavior.  In other words, suppose we
reach a point in the input string where we know that we no longer care
about parsing unreserved keywords as keywords, but are instead happy
to have 

Re: [HACKERS] scanner/parser minimization

2013-02-28 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 A whole lot of those state transitions are attributable to states
 which have separate transitions for each of many keywords.

Yeah, that's no surprise.

The idea that's been in the back of my mind for awhile is to try to
solve the problem at the lexer level not the parser level: that is,
have the lexer return IDENT whenever a keyword appears in a context
where it should be interpreted as unreserved.  You suggest somehow
driving that off mid-rule actions, but that seems to me to be probably
a nonstarter from a code complexity and maintainability standpoint.

I believe however that it's possible to extract an idea of which
tokens the parser believes it can see next at any given parse state.
(I've seen code for this somewhere on the net, but am too lazy to go
searching for it again right now.)  So we could imagine a rule along
the lines of "if IDENT is allowed as a next token, and $KEYWORD is
not, then return IDENT not the keyword's own token".

This might be unworkable from a speed standpoint, depending on how
expensive it is to make the determination about allowable next symbols.
But it seems worth looking into.

regards, tom lane

