Re: [HACKERS] scanner/parser minimization
On 13.03.2013 10:50, Simon Riggs wrote:
> On 2 March 2013 18:47, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
>> Interestingly, the yy_transition array generated by flex used to be much
>> smaller:
>> 8.3: 22072 elements
>> 8.4: 62623 elements
>> master: 64535 elements
>>
>> The big jump between 8.3 and 8.4 was caused by introduction of the
>> unicode escapes: U&'foo' [UESCAPE 'x']. And in particular, the error
>> rule for the UESCAPE, which we use to avoid backtracking.
>>
>> I experimented with a patch that uses two extra flex states to shorten
>> the error rules, see attached. The idea is that after lexing a unicode
>> literal like U&'foo', you enter a new state, in which you check whether
>> an UESCAPE 'x' follows. This slashes the size of the array to 36581
>> elements.
>
> +1 to do this sooner rather than later

I hear no objection, so committed (after fixing some small bugs in the patch, and adding some comments).

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] scanner/parser minimization
Heikki Linnakangas <hlinnakan...@vmware.com> writes:
> I hear no objection, so committed. (after fixing some small bugs in the
> patch, and adding some comments)

Please keep psqlscan.l in sync with scan.l.

			regards, tom lane
Re: [HACKERS] scanner/parser minimization
On 2 March 2013 18:47, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
>> Uh ... no. I haven't looked into why the flex tables are so large,
>> but this theory is just wrong. See ScanKeywordLookup().
>
> Interestingly, the yy_transition array generated by flex used to be much
> smaller:
> 8.3: 22072 elements
> 8.4: 62623 elements
> master: 64535 elements
>
> The big jump between 8.3 and 8.4 was caused by introduction of the
> unicode escapes: U&'foo' [UESCAPE 'x']. And in particular, the error
> rule for the UESCAPE, which we use to avoid backtracking.
>
> I experimented with a patch that uses two extra flex states to shorten
> the error rules, see attached. The idea is that after lexing a unicode
> literal like U&'foo', you enter a new state, in which you check whether
> an UESCAPE 'x' follows. This slashes the size of the array to 36581
> elements.

+1 to do this sooner rather than later

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] scanner/parser minimization
On Thu, Feb 28, 2013 at 04:09:11PM -0500, Tom Lane wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> A whole lot of those state transitions are attributable to states
>> which have separate transitions for each of many keywords.
>
> Yeah, that's no surprise. The idea that's been in the back of my mind
> for awhile is to try to solve the problem at the lexer level not the
> parser level: that is, have the lexer return IDENT whenever a keyword
> appears in a context where it should be interpreted as unreserved.
> You suggest somehow driving that off mid-rule actions, but that seems
> to me to be probably a nonstarter from a code complexity and
> maintainability standpoint. I believe however that it's possible to
> extract an idea of which tokens the parser believes it can see next at
> any given parse state. (I've seen code for this somewhere on the net,
> but am too lazy to go searching for it again right now.) So we could
> imagine a rule along

I believe tokenizing of typedefs requires the lexer to peek at the
parser state:

	http://calculist.blogspot.com/2009/02/c-typedef-parsing-problem.html

	The well-known typedef problem with parsing C is that the standard
	C grammar is ambiguous unless the lexer distinguishes identifiers
	bound by typedef and other identifiers as two separate lexical
	classes. This means that the parser needs to feed scope information
	to the lexer during parsing. One upshot is that lexing must be done
	concurrently with parsing.

--
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
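[Editor's note: the ambiguity the quoted blog post describes fits in a few lines of C; the function names and values below are illustrative, not from any real parser.]

```c
/* The typedef ambiguity in miniature.  Whether "T * x" is a pointer
 * declaration or a multiplication depends on whether T is currently
 * bound by typedef -- information that lives in the parser's scope
 * tracking, which is why the lexer must be fed scope data while
 * parsing proceeds. */

typedef int T;				/* T is a type name at file scope */

/* Here "T *x" is a declaration: x is a pointer to int. */
int
deref_through_typedef(void)
{
	int		value = 42;
	T	   *x = &value;

	return *x;
}

/* A local variable shadows the typedef, so "T * y" in the return
 * statement is a multiplication, not a declaration. */
int
multiply_after_shadowing(void)
{
	int		T = 6;
	int		y = 7;

	return T * y;
}
```

The same token sequence lexes to different parses in the two functions, which is exactly the scope feedback the blog post is talking about.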
Re: [HACKERS] scanner/parser minimization
Regarding yy_transition, I think the problem is that we're using flex to
implement keyword recognition, which is usually not what it's used for.
Usually people use flex to handle syntax things like quoting and numeric
formats, and all identifiers are handled by flex as equivalent. Then the
last step in the scanner for identifiers is to look up the identifier in
a hash table and return the keyword token if it's a keyword. That would
massively simplify the scanner tables. This hash table can be heavily
optimized because it's a static lookup, c.f.
http://www.gnu.org/software/gperf/

In theory this is more expensive since it needs to do a strcmp in
addition to scanning the identifier to determine where the token ends.
But I suspect in practice the smaller tables might outweigh that cost.

On Thu, Feb 28, 2013 at 9:09 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I believe however that it's possible to extract an idea of which
> tokens the parser believes it can see next at any given parse state.
> (I've seen code for this somewhere on the net, but am too lazy to go
> searching for it again right now.) So we could imagine a rule along
> the lines of: if IDENT is allowed as a next token, and $KEYWORD is
> not, then return IDENT not the keyword's own token.

That's a pretty cool idea. I'm afraid it might be kind of slow to
produce that list and load it into a hash table for every ident token,
though. I suppose if you can request it only when the ident is a keyword
of some type then it wouldn't actually kick in often.

That would also imply we could simply use IDENT everywhere we currently
have col_name_keyword or type_function_name. Just having a rule that
accepts a keyword would implicitly make it not be accepted as an IDENT
when unquoted in that location. That might make the documentation a bit
trickier, and it would make it harder for users to make their code
forward-compatible.

It would also make it harder for hackers to determine when we've
accidentally narrowed the allowable identifiers for more cases than
they expect.

--
greg
Re: [HACKERS] scanner/parser minimization
Greg Stark <st...@mit.edu> writes:
> Regarding yy_transition I think the problem is we're using flex to
> implement keyword recognition which is usually not what it's used for.
> Usually people use flex to handle syntax things like quoting and
> numeric formats. All identifiers are handled by flex as equivalent.
> Then the last step in the scanner for identifiers is to look up the
> identifier in a hash table and return the keyword token if it's a
> keyword. That would massively simplify the scanner tables.

Uh ... no. I haven't looked into why the flex tables are so large, but
this theory is just wrong. See ScanKeywordLookup().

			regards, tom lane
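[Editor's note: Tom's point is that PostgreSQL already recognizes keywords outside the flex tables. A sketch of that approach — a generic identifier rule followed by a binary search over a static, sorted keyword table — is below. This is illustrative, not PostgreSQL's actual code; the tiny table and token numbers are made up.]

```c
#include <stddef.h>
#include <string.h>

/* A static keyword table, sorted by name so a binary search works.
 * flex scans every identifier with one generic rule; only afterwards
 * is the word checked against this table. */
typedef struct ScanKeyword
{
	const char *name;			/* lower case, sorted */
	int			token;			/* grammar token to return */
} ScanKeyword;

static const ScanKeyword keywords[] = {
	{"between", 1},
	{"select", 2},
	{"uescape", 3},
	{"where", 4},
};

/* Return the table entry for ident, or NULL if it is an ordinary
 * identifier rather than a keyword. */
const ScanKeyword *
keyword_lookup(const char *ident)
{
	int			lo = 0;
	int			hi = (int) (sizeof(keywords) / sizeof(keywords[0])) - 1;

	while (lo <= hi)
	{
		int			mid = (lo + hi) / 2;
		int			cmp = strcmp(ident, keywords[mid].name);

		if (cmp == 0)
			return &keywords[mid];
		if (cmp < 0)
			hi = mid - 1;
		else
			lo = mid + 1;
	}
	return NULL;
}
```

Because the table is static, the search could equally be a perfect hash (the gperf idea Greg mentions); the point either way is that keyword recognition adds no states to the flex automaton.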
Re: [HACKERS] scanner/parser minimization
On Thu, Feb 28, 2013 at 4:09 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I believe however that it's possible to extract an idea of which
> tokens the parser believes it can see next at any given parse state.
> (I've seen code for this somewhere on the net, but am too lazy to go
> searching for it again right now.) So we could imagine a rule along
> the lines of: if IDENT is allowed as a next token, and $KEYWORD is
> not, then return IDENT not the keyword's own token. This might be
> unworkable from a speed standpoint, depending on how expensive it is
> to make the determination about allowable next symbols. But it seems
> worth looking into.

Interesting idea. But wouldn't that change the semantics of the grammar
in some places? In particular, keywords would generally become
less-reserved than they are now, but in a context-dependent way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] scanner/parser minimization
Robert Haas <robertmh...@gmail.com> writes:
> On Thu, Feb 28, 2013 at 4:09 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> I believe however that it's possible to extract an idea of which
>> tokens the parser believes it can see next at any given parse state.
>> (I've seen code for this somewhere on the net, but am too lazy to go
>> searching for it again right now.) So we could imagine a rule along
>> the lines of: if IDENT is allowed as a next token, and $KEYWORD is
>> not, then return IDENT not the keyword's own token. This might be
>> unworkable from a speed standpoint, depending on how expensive it is
>> to make the determination about allowable next symbols. But it seems
>> worth looking into.
>
> Interesting idea. But wouldn't that change the semantics of the
> grammar in some places? In particular, keywords would generally become
> less-reserved than they are now, but in a context-dependent way.

Yeah, Greg already raised that point. It shouldn't change the semantics
of the grammar *today*, but it would become much harder to tell whether
future changes create any conflicts; we'd lose the mechanical
verification we get from bison with the current method. That might be
sufficient reason not to pursue the idea.

It's possible that static analysis of the grammar could replace bison
verification; that is, assuming we have a way to identify all states in
which both IDENT and some keyword are allowable, we could generate a
report listing which keywords are potentially going to be considered
non-reserved, and then watch for changes in that report. But this would
require writing even more code that we don't have today. I suspect also
that we'd still need a notion of fully reserved keywords that don't get
flipped to IDENT even if that's a legal next symbol.

			regards, tom lane
Re: [HACKERS] scanner/parser minimization
On 02.03.2013 17:09, Tom Lane wrote:
> Greg Stark <st...@mit.edu> writes:
>> Regarding yy_transition I think the problem is we're using flex to
>> implement keyword recognition which is usually not what it's used
>> for. Usually people use flex to handle syntax things like quoting and
>> numeric formats. All identifiers are handled by flex as equivalent.
>> Then the last step in the scanner for identifiers is to look up the
>> identifier in a hash table and return the keyword token if it's a
>> keyword. That would massively simplify the scanner tables.
>
> Uh ... no. I haven't looked into why the flex tables are so large, but
> this theory is just wrong. See ScanKeywordLookup().

Interestingly, the yy_transition array generated by flex used to be much
smaller:

8.3: 22072 elements
8.4: 62623 elements
master: 64535 elements

The big jump between 8.3 and 8.4 was caused by introduction of the
unicode escapes: U&'foo' [UESCAPE 'x']. And in particular, the error
rule for the UESCAPE, which we use to avoid backtracking.

I experimented with a patch that uses two extra flex states to shorten
the error rules, see attached. The idea is that after lexing a unicode
literal like U&'foo', you enter a new state, in which you check whether
an UESCAPE 'x' follows. This slashes the size of the array to 36581
elements.

- Heikki

*** a/src/backend/parser/scan.l
--- b/src/backend/parser/scan.l
***************
*** 162,168 ****
--- 162,170 ----
  %x xq
  %x xdolq
  %x xui
+ %x xuiend
  %x xus
+ %x xusend
  %x xeu
  
  /*
***************
*** 279,295 ****
  /* Unicode escapes */
  uescape		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
  /* error rule to avoid backup */
! uescapefail	(-|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*-|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU])
  
  /* Quoted identifier with Unicode escapes */
  xuistart	[uU]{dquote}
- xuistop1	{dquote}{whitespace}*{uescapefail}?
- xuistop2	{dquote}{whitespace}*{uescape}
  
  /* Quoted string with Unicode escapes */
  xusstart	[uU]{quote}
! xusstop1	{quote}{whitespace}*{uescapefail}?
! xusstop2	{quote}{whitespace}*{uescape}
  
  /* error rule to avoid backup */
  xufailed	[uU]
--- 281,297 ----
  /* Unicode escapes */
  uescape		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
  /* error rule to avoid backup */
! uescapefail	[uU][eE][sS][cC][aA][pP][eE]{whitespace}*-|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
  
  /* Quoted identifier with Unicode escapes */
  xuistart	[uU]{dquote}
  
  /* Quoted string with Unicode escapes */
  xusstart	[uU]{quote}
! 
! xustop1		{uescapefail}?
! xustop2		{uescape}
! 
  /* error rule to avoid backup */
  xufailed	[uU]
***************
*** 536,549 ****
  					yylval->str = litbufdup(yyscanner);
  					return SCONST;
  				}
! <xus>{xusstop1} {
! 					/* throw back all but the quote */
  					yyless(1);
  					BEGIN(INITIAL);
  					yylval->str = litbuf_udeescape('\\', yyscanner);
  					return SCONST;
  				}
! <xus>{xusstop2} {
  					BEGIN(INITIAL);
  					yylval->str = litbuf_udeescape(yytext[yyleng-2], yyscanner);
  					return SCONST;
--- 538,558 ----
  					yylval->str = litbufdup(yyscanner);
  					return SCONST;
  				}
! <xus>{quotestop} |
! <xus>{quotefail} {
  					yyless(1);
+ 					BEGIN(xusend);
+ 				}
+ <xusend>{whitespace}
+ <xusend>{other} |
+ <xusend>{xustop1} {
+ 					/* throw back everything */
+ 					yyless(0);
  					BEGIN(INITIAL);
  					yylval->str = litbuf_udeescape('\\', yyscanner);
  					return SCONST;
  				}
! <xusend>{xustop2} {
  					BEGIN(INITIAL);
  					yylval->str = litbuf_udeescape(yytext[yyleng-2], yyscanner);
  					return SCONST;
***************
*** 702,708 ****
  					yylval->str = ident;
  					return IDENT;
  				}
! <xui>{xuistop1} {
  					char	   *ident;
  
  					BEGIN(INITIAL);
--- 711,723 ----
  					yylval->str = ident;
  					return IDENT;
  				}
! <xui>{dquote} {
! 					yyless(1);
! 					BEGIN(xuiend);
! 				}
! <xuiend>{whitespace} { }
! <xuiend>{other} |
! <xuiend>{xustop1} {
  					char	   *ident;
  
  					BEGIN(INITIAL);
***************
*** 712,722 ****
  					if (yyextra->literallen >= NAMEDATALEN)
  						truncate_identifier(ident, yyextra->literallen, true);
  					yylval->str = ident;
! 					/* throw back all but the quote */
! 					yyless(1);
  					return IDENT;
  				}
! <xui>{xuistop2} {
  					char	   *ident;
  
  					BEGIN(INITIAL);
--- 727,736 ----
  					if (yyextra->literallen >= NAMEDATALEN)
  						truncate_identifier(ident, yyextra->literallen, true);
  					yylval->str = ident;
! 					/* throw back everything that follows the end quote */
  					return IDENT;
  				}
! <xuiend>{xustop2} {
  					char
Re: [HACKERS] scanner/parser minimization
On 2/28/13 3:34 PM, Robert Haas wrote:
> It's possible to vastly reduce the size of the scanner output, and
> therefore of gram.c, by running flex with -Cf rather than -CF, which
> changes the table representation completely. I assume there is a sound
> performance reason why we don't do this, but it might be worth
> checking that if we haven't lately.

When compiled with -Cf, the size of gram.o drops from 1019260 bytes to
703600, which is a large savings. The option choice is based on the
recommendation in the flex manual. It wouldn't hurt to re-test it.
Re: [HACKERS] scanner/parser minimization
Robert Haas <robertmh...@gmail.com> writes:
> A whole lot of those state transitions are attributable to states
> which have separate transitions for each of many keywords.

Yeah, that's no surprise. The idea that's been in the back of my mind
for awhile is to try to solve the problem at the lexer level not the
parser level: that is, have the lexer return IDENT whenever a keyword
appears in a context where it should be interpreted as unreserved. You
suggest somehow driving that off mid-rule actions, but that seems to me
to be probably a nonstarter from a code complexity and maintainability
standpoint.

I believe however that it's possible to extract an idea of which tokens
the parser believes it can see next at any given parse state. (I've
seen code for this somewhere on the net, but am too lazy to go
searching for it again right now.) So we could imagine a rule along the
lines of: if IDENT is allowed as a next token, and $KEYWORD is not,
then return IDENT not the keyword's own token.

This might be unworkable from a speed standpoint, depending on how
expensive it is to make the determination about allowable next symbols.
But it seems worth looking into.

			regards, tom lane
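[Editor's note: the rule Tom sketches could look roughly like the following. This is a hypothetical illustration only — the "allowed" set stands in for exactly the piece he says would still need to be written, an extraction of the tokens the bison parser can shift in its current state; the token names and codes are made up, not PostgreSQL's.]

```c
#include <stdbool.h>

/* Illustrative token codes (not real PostgreSQL tokens). */
enum
{
	IDENT = 1,
	K_BETWEEN = 2,
	K_SELECT = 3
};

/* If IDENT may come next in the current parse state but this keyword's
 * own token may not, demote the keyword to IDENT so it behaves as
 * unreserved in this context; otherwise keep its reserved meaning.
 * "allowed" is indexed by token code and is assumed to have been
 * derived from the parser state by machinery that does not exist. */
int
filter_keyword(int keyword_token, const bool *allowed)
{
	if (allowed[IDENT] && !allowed[keyword_token])
		return IDENT;
	return keyword_token;
}
```

For example, in a state where IDENT and K_SELECT are shiftable but K_BETWEEN is not, filter_keyword would pass K_SELECT through unchanged and demote K_BETWEEN to IDENT — which is also where the verification worry in the follow-ups comes from, since the demotion is decided per state rather than checked once by bison.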