Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Wed, Mar 20, 2019 at 9:24 PM Tom Lane wrote:
> Joel Jacobson writes:
> > I've seen a performance trick in other hash functions [1]
> > to instead read multiple bytes in each iteration,
> > and then handle the remaining bytes after the loop.
> > [1] https://github.com/wangyi-fudan/wyhash/blob/master/wyhash.h#L29
>
> I can't get very excited about this, seeing that we're only going to
> be hashing short strings.  I don't really believe your 30% number
> for short strings; and even if I did, there's no evidence that the
> hash functions are worth any further optimization in terms of our
> overall performance.

I went ahead and tested this approach anyway, since I need this
algorithm in a completely different project.

The benchmark below shows stats for three different keywords per
length, compiled with -O2:

$ c++ -O2 -std=c++14 -o bench_perfect_hash bench_perfect_hash.cc
$ ./bench_perfect_hash

keyword            length  char-a-time (ns)  word-a-time (ns)  diff (%)
as                      2              3.30              2.62     -0.21
at                      2              3.54              2.66     -0.25
by                      2              3.30              2.59     -0.22
add                     3              4.01              3.15     -0.21
all                     3              4.04              3.11     -0.23
and                     3              3.84              3.11     -0.19
also                    4              4.50              3.17     -0.30
both                    4              4.49              3.06     -0.32
call                    4              4.95              3.42     -0.31
abort                   5              6.09              4.02     -0.34
admin                   5              5.26              3.65     -0.31
after                   5              5.18              3.76     -0.27
access                  6              5.97              3.91     -0.34
action                  6              5.86              3.89     -0.34
always                  6              6.10              3.77     -0.38
analyse                 7              6.67              4.64     -0.30
analyze                 7              7.09              4.87     -0.31
between                 7              7.02              4.66     -0.34
absolute                8              7.49              3.82     -0.49
backward                8              7.13              3.88     -0.46
cascaded                8              7.23              4.17     -0.42
aggregate               9              8.04              4.49     -0.44
assertion               9              7.98              4.52     -0.43
attribute               9              8.03              4.44     -0.45
assignment             10              8.58              4.67     -0.46
asymmetric             10              9.07              4.57     -0.50
checkpoint             10              9.15              4.53     -0.51
constraints            11              9.58              5.14     -0.46
insensitive            11              9.62              5.30     -0.45
publication            11             10.30              5.60     -0.46
concurrently           12             10.36              4.81     -0.54
current_date           12             11.17              5.48     -0.51
current_role           12             11.15              5.10     -0.54
authorization          13             11.87              5.50     -0.54
configuration          13             11.50              5.51     -0.52
xmlattributes          13             11.72              5.66     -0.52
current_schema         14             12.17              5.58     -0.54
localtimestamp         14             11.78              5.46     -0.54
characteristics        15             12.77              5.97     -0.53
current_catalog        15             12.65              5.87     -0.54
current_timestamp      17             14.19              6.12     -0.57

> Also, as best I can tell, the approach you propose would result
> in an endianness dependence, meaning we'd have to have separate
> lookup tables for BE and LE machines.  That's not a dealbreaker
> perhaps, but it is certainly another point on the "it's not worth it"
> side of the argument.

I can see how the same problem has been worked around in e.g. pg_crc32.h:

#ifdef WORDS_BIGENDIAN
#define FIN_CRC32C(crc) ((crc) = pg_bswap32(crc) ^ 0xFFFFFFFF)
#else
#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
#endif

So I used the same trick in PerfectHash.pm:

$f .= sprintf "#ifdef WORDS_BIGENDIAN\n";
$f .= sprintf "\t\tc4 = pg_bswap32(c4);\n";
$f .= sprintf "#endif\n";

I've also tried to measure the overall effect by hacking postgres.c:

+	struct timespec start, stop;
+	clock_gettime( CLOCK_REALTIME, &start);
+	for (int i=0; i<10; i++)
+	{
+		List	   *parsetree_list2;
+		MemoryContext oldcontext2;
+
+		oldcontext2 = MemoryContextSwitchTo(MessageContext);
+		parsetree_list2 = pg_parse_query(query_string);
+		MemoryContextSwitchTo(oldcontext2);
+//		MemoryContextReset(MessageContext);
+		CHECK_F
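The pg_bswap32 trick above can be sketched as a standalone C function. This is an illustrative sketch, not the generated PerfectHash.pm output: bswap32 stands in for pg_bswap32, and the runtime endianness probe stands in for the configure-time WORDS_BIGENDIAN macro.

```c
#include <stdint.h>
#include <string.h>

/* Portable byte swap, standing in for pg_bswap32. */
static uint32_t
bswap32(uint32_t x)
{
	return ((x & 0x000000FF) << 24) |
		   ((x & 0x0000FF00) << 8) |
		   ((x & 0x00FF0000) >> 8) |
		   ((x & 0xFF000000) >> 24);
}

/* Runtime endianness probe, standing in for WORDS_BIGENDIAN. */
static int
is_big_endian(void)
{
	const uint32_t one = 1;
	unsigned char first;

	memcpy(&first, &one, 1);
	return first == 0;
}

/*
 * Load 4 bytes and normalize to the little-endian interpretation, so that
 * BE and LE machines feed identical values into the hash and can share a
 * single lookup table.
 */
static uint32_t
load_word_le(const char *k)
{
	uint32_t	v;

	memcpy(&v, k, 4);
	if (is_big_endian())
		v = bswap32(v);
	return v;
}
```

Either way, the loaded word is the same on both architectures, which is what lets the generated table be endianness-independent.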
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Joel Jacobson writes:
> I've seen a performance trick in other hash functions [1]
> to instead read multiple bytes in each iteration,
> and then handle the remaining bytes after the loop.
> [1] https://github.com/wangyi-fudan/wyhash/blob/master/wyhash.h#L29

I can't get very excited about this, seeing that we're only going to
be hashing short strings.  I don't really believe your 30% number
for short strings; and even if I did, there's no evidence that the
hash functions are worth any further optimization in terms of our
overall performance.

Also, as best I can tell, the approach you propose would result
in an endianness dependence, meaning we'd have to have separate
lookup tables for BE and LE machines.  That's not a dealbreaker
perhaps, but it is certainly another point on the "it's not worth it"
side of the argument.

			regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Many thanks for working on this, amazing work, and really nice that you
made it a separate reusable Perl module.

The generated hash function reads one character at a time.  I've seen a
performance trick in other hash functions [1] to instead read multiple
bytes in each iteration, and then handle the remaining bytes after the
loop.

[1] https://github.com/wangyi-fudan/wyhash/blob/master/wyhash.h#L29

I've done some testing and it looks like a ~30% speed-up of the
generated ScanKeywords_hash_func() function would be possible.

If you think this approach is promising, I would be happy to prepare a
patch for it, but I wanted to check with the project that this idea has
not already been considered and ruled out for some technical reason
I've failed to see.  Is there any?

For this to work you would need to use larger constants for $hash_mult1
and $hash_mult2 though.  I've successfully used these values:

$hash_mult1: 0x2c1b3c6d
$hash_mult2: (0x297a2d39, 0x85ebca6b, 0xc2b2ae35, 0x7feb352d, 0x846ca68b)

Here is the idea.  Generated C code:

	for (; keylen >= 4; keylen -= 4, k += 4)
	{
		uint32_t	v;

		memcpy(&v, k, 4);
		v |= 0x20202020;
		a = a * 739982445 + v;
		b = b * 2246822507 + v;
	}

	uint32_t	v = 0;

	switch (keylen)
	{
		case 3:
			memcpy(&v, k, 3);
			v |= 0x202020;
			break;
		case 2:
			memcpy(&v, k, 2);
			v |= 0x2020;
			break;
		case 1:
			memcpy(&v, k, 1);
			v |= 0x20;
			break;
	}
	a = a * 739982445 + v;
	b = b * 2246822507 + v;
	return h[a % 883] + h[b % 883];

(Reading 8 bytes at a time instead would perhaps be a win, since some
keywords are quite long.)
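As a self-contained illustration, a runnable version of the word-at-a-time loop above might look like the following. The multipliers, the collapsed tail handling, and the final combination are illustrative choices only, not the output of an actual PerfectHash.pm run, and the byte layout assumed by the tail masks is little-endian (the endianness issue discussed in this thread).

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/*
 * Word-at-a-time hash sketch: read 4 bytes per iteration with memcpy
 * (safe with respect to alignment), OR in 0x20 per byte for ASCII case
 * folding, then fold the 0-3 trailing bytes in one final step.
 */
static uint32_t
hash_word_at_a_time(const char *k, size_t keylen)
{
	uint32_t	a = 0;
	uint32_t	b = 0;
	uint32_t	v;

	for (; keylen >= 4; keylen -= 4, k += 4)
	{
		memcpy(&v, k, 4);
		v |= 0x20202020;		/* fold 'A'-'Z' to 'a'-'z' in each byte */
		a = a * 0x2c1b3c6d + v;
		b = b * 0x85ebca6b + v;
	}

	/* Tail: keylen is now 0..3; unread bytes of v stay zero. */
	v = 0;
	memcpy(&v, k, keylen);
	if (keylen > 0)
		v |= 0x20202020 >> (8 * (4 - keylen));	/* mask assumes LE layout */
	a = a * 0x2c1b3c6d + v;
	b = b * 0x85ebca6b + v;

	/* A real perfect hash would instead return h[a % N] + h[b % N]. */
	return a + b;
}
```

Because every byte is OR-ed with 0x20, upper- and lower-case spellings of a keyword hash identically, matching the case-folding behavior of the generated char-at-a-time function.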
Perl code:

sub _calc_hash
{
	my ($key, $mult, $seed) = @_;

	my $result = $seed;
	my $i      = 0;
	my $keylen = length($key);
	for (; $keylen >= 4; $keylen -= 4, $i += 4)
	{
		my $cn =
		  (ord(substr($key, $i + 0, 1)) << 0) |
		  (ord(substr($key, $i + 1, 1)) << 8) |
		  (ord(substr($key, $i + 2, 1)) << 16) |
		  (ord(substr($key, $i + 3, 1)) << 24);
		$cn |= 0x20202020 if $case_fold;
		$result = ($result * $mult + $cn) % 4294967296;
	}

	my $cn = 0;
	if ($keylen == 3)
	{
		$cn =
		  (ord(substr($key, $i + 0, 1)) << 0) |
		  (ord(substr($key, $i + 1, 1)) << 8) |
		  (ord(substr($key, $i + 2, 1)) << 16);
		$cn |= 0x202020 if $case_fold;
	}
	elsif ($keylen == 2)
	{
		$cn =
		  (ord(substr($key, $i + 0, 1)) << 0) |
		  (ord(substr($key, $i + 1, 1)) << 8);
		$cn |= 0x2020 if $case_fold;
	}
	elsif ($keylen == 1)
	{
		$cn = (ord(substr($key, $i + 0, 1)) << 0);
		$cn |= 0x20 if $case_fold;
	}
	$result = ($result * $mult + $cn) % 4294967296;
	return $result;
}
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
John Naylor writes:
> On Wed, Jan 9, 2019 at 5:33 PM Tom Lane wrote:
>> really need here.  We could go with "[no-]case-insensitive", perhaps.
>> Or "[no-]case-fold", which is at least a little shorter and less
>> double-negative-y.

> I'd be in favor of --[no-]case-fold.

Yeah, I like that better too; I've been having to stop and think every
time as to which direction is which with the [in]sensitive terminology.
I'll make it "case-fold" throughout.

> This comment in PerfectHash.pm:
>	# (It does change '_', else we could just skip adjusting
>	# $cn here at all, for typical keyword strings.)
> ...seems a bit out of place in the module, because of its reference to
> keywords, of interest right now to its only caller.  Maybe a bit of
> context here.  (I also don't quite understand why we could
> hypothetically skip the adjustment.)

Were it not for the underscore case, we could plausibly assume that the
supplied keywords are already all-lower-case and don't need any further
folding.  But I agree that this comment is probably more confusing than
helpful; it's easier just to see that the code is applying the same
transform as the runtime lookup will do.

> Lastly, the keyword headers still have a dire warning about ASCII
> order and binary search.  Those could be softened to match the comment
> in gen_keywordlist.pl.

Agreed, will do.  Thanks for reviewing!

			regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Wed, Jan 9, 2019 at 5:33 PM Tom Lane wrote:
> really need here.  We could go with "[no-]case-insensitive", perhaps.
> Or "[no-]case-fold", which is at least a little shorter and less
> double-negative-y.

I'd be in favor of --[no-]case-fold.

On Tue, Jan 8, 2019 at 5:53 PM Tom Lane wrote:
> I improved the comment about how come the hash table entry assignment
> works.

I've gone over the algorithm in more detail and I don't see any nicer
way to write it.

This comment in PerfectHash.pm:

	# (It does change '_', else we could just skip adjusting
	# $cn here at all, for typical keyword strings.)

...seems a bit out of place in the module, because of its reference to
keywords, of interest right now to its only caller.  Maybe a bit of
context here.  (I also don't quite understand why we could
hypothetically skip the adjustment.)

Lastly, the keyword headers still have a dire warning about ASCII
order and binary search.  Those could be softened to match the comment
in gen_keywordlist.pl.

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
John Naylor writes:
> On Tue, Jan 8, 2019 at 5:31 PM Tom Lane wrote:
>> The length macro was readily available there so I used it.  AFAIR
>> that wasn't true elsewhere, though I might've missed something.
>> It's pretty much just belt-and-suspenders coding anyway, since all
>> those arrays are machine generated ...

> I tried using the available num_keywords macro in plpgsql and it
> worked fine, but it makes the lines really long.  Alternatively, as in
> the attached, we could remove the single use of the core macro and
> maybe add comments to the generated magic numbers.

Meh, I'm not excited about removing the option just because there's
only one use of it now.  There might be more-compelling uses later.

			regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
John Naylor writes:
> On Wed, Jan 9, 2019 at 2:04 PM Tom Lane wrote:
>> Now the impedance mismatch about case sensitivity is handled with
>>
>> my $f = PerfectHash::generate_hash_function(\@keywords, $funcname,
>>     case_insensitive => !$case_sensitive);
>>
>> which is at least a little clearer than before, though I'm not sure
>> if it entirely solves the problem.

> It's a bit clearer, but thinking about this some more, it makes sense
> for gen_keywordlist.pl to use $case_insensitive, because right now
> every instance of the var is "!$case_sensitive".  In the attached (on
> top of v4), I change the command line option to --citext, and add the
> ability to negate it within the option, as '--no-citext'.  It's kind of
> a double negative for the C-keywords invocation, but we can have the
> option for both cases, so we don't need to worry about what the
> default is (which is case_insensitive=1).

Ah, I didn't realize that Getopt allows having a boolean option
defaulting to "on".  That makes it more practical to do something here.

I'm not in love with "citext" as the option name, though ... that has
little to recommend it except brevity, which is not a property we
really need here.  We could go with "[no-]case-insensitive", perhaps.
Or "[no-]case-fold", which is at least a little shorter and less
double-negative-y.

			regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
John Naylor writes:
> On Wed, Jan 9, 2019 at 2:44 PM Tom Lane wrote:
>> [patch to shrink oid index]

> It would help maintain its newfound svelteness if we warned when a
> higher oid was assigned, as in the attached.  I used 6200 as a soft
> limit, but that could be anything similar.

I think the reason we have this issue is that people tend to use high
OIDs during development of a patch, so that their elbows won't be
joggled by unrelated changes.  Then sometimes they forget to renumber
them down before committing.

A warning like this would lead to lots of noise during the development
stage, which nobody would thank us for.  If we could find a way to
notice this only when we were about to commit, it'd be good ... but I
don't have an idea about a nice way to do that.  (No, I don't want a
commit hook on gitmaster; that's warning too late, which is not better.)

			regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Wed, Jan 9, 2019 at 2:44 PM Tom Lane wrote:
> [patch to shrink oid index]

It would help maintain its newfound svelteness if we warned when a
higher oid was assigned, as in the attached.  I used 6200 as a soft
limit, but that could be anything similar.

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl
index e7ead7dcc0..201343c2f6 100644
--- a/src/backend/catalog/genbki.pl
+++ b/src/backend/catalog/genbki.pl
@@ -74,6 +74,7 @@ my %catalog_data;
 my @toast_decls;
 my @index_decls;
 my %oidcounts;
+my $max_proc_oid = 6200;
 
 foreach my $header (@input_files)
 {
@@ -230,6 +231,13 @@ foreach my $row (@{ $catalog_data{pg_opfamily} })
 my %procoids;
 foreach my $row (@{ $catalog_data{pg_proc} })
 {
+	# Warn if a large OID has been newly assigned.
+	if ($row->{oid} > $max_proc_oid)
+	{
+		warn sprintf "Consider using an OID smaller than %d for %s (see commit 8ff5f824dca75)\n",
+		  $max_proc_oid, $row->{proname};
+	}
+
 	# Generate an entry under just the proname (corresponds to regproc lookup)
 	my $prokey = $row->{proname};
 	if (defined $procoids{$prokey})
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Tue, Jan 8, 2019 at 5:31 PM Tom Lane wrote:
> John Naylor writes:
>> In the committed keyword patch, I noticed that in common/keywords.c,
>> the array length is defined with
>> ScanKeywordCategories[SCANKEYWORDS_NUM_KEYWORDS]
>> but other keyword arrays just have ...[].  Is there a reason for the
>> difference?
>
> The length macro was readily available there so I used it.  AFAIR
> that wasn't true elsewhere, though I might've missed something.
> It's pretty much just belt-and-suspenders coding anyway, since all
> those arrays are machine generated ...

I tried using the available num_keywords macro in plpgsql and it
worked fine, but it makes the lines really long.  Alternatively, as in
the attached, we could remove the single use of the core macro and
maybe add comments to the generated magic numbers.

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/common/keywords.c b/src/common/keywords.c
index 84f779feb9..edec4a5094 100644
--- a/src/common/keywords.c
+++ b/src/common/keywords.c
@@ -26,7 +26,7 @@
 
 #define PG_KEYWORD(kwname, value, category) category,
 
-const uint8 ScanKeywordCategories[SCANKEYWORDS_NUM_KEYWORDS] = {
+const uint8 ScanKeywordCategories[] = {
 #include "parser/kwlist.h"
 };
 
diff --git a/src/tools/gen_keywordlist.pl b/src/tools/gen_keywordlist.pl
index 927511d7c4..8c4c5da828 100644
--- a/src/tools/gen_keywordlist.pl
+++ b/src/tools/gen_keywordlist.pl
@@ -147,11 +147,6 @@ foreach my $name (@keywords)
 
 print $kwdef "};\n\n";
 
-# Emit a macro defining the number of keywords.
-# (In some places it's useful to have access to that as a constant.)
-
-printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords;
-
 # Emit the definition of the hash function.
 
 my $funcname = $varname . "_hash_func";
@@ -168,8 +163,8 @@ printf $kwdef "const ScanKeywordList %s = {\n", $varname;
 printf $kwdef qq|\t%s_kw_string,\n|, $varname;
 printf $kwdef qq|\t%s_kw_offsets,\n|, $varname;
 printf $kwdef qq|\t%s,\n|, $funcname;
-printf $kwdef qq|\t%s_NUM_KEYWORDS,\n|, uc $varname;
-printf $kwdef qq|\t%d\n|, $max_len;
+printf $kwdef qq|\t%d,\t/* number of keywords */\n|, scalar @keywords;
+printf $kwdef qq|\t%d\t/* max keyword length */\n|, $max_len;
 printf $kwdef "};\n\n";
 
 printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename;
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Wed, Jan 9, 2019 at 2:04 PM Tom Lane wrote:
> I wrote:
>> John Naylor writes:
>>> -There is a bit of a cognitive clash between $case_sensitive in
>>> gen_keywordlist.pl and $case_insensitive in PerfectHash.pm. They each
>>> make sense in their own file, but might it be worth using one or the
>>> other?
>
> Working on the fmgr-oid-lookup idea gave me the thought that
> PerfectHash.pm ought to support fixed-length keys.  Rather than start
> adding random parameters to the function, I borrowed an idea from
> PostgresNode.pm and made the options be keyword-style parameters.  Now
> the impedance mismatch about case sensitivity is handled with
>
> my $f = PerfectHash::generate_hash_function(\@keywords, $funcname,
>     case_insensitive => !$case_sensitive);
>
> which is at least a little clearer than before, though I'm not sure
> if it entirely solves the problem.

It's a bit clearer, but thinking about this some more, it makes sense
for gen_keywordlist.pl to use $case_insensitive, because right now
every instance of the var is "!$case_sensitive".  In the attached (on
top of v4), I change the command line option to --citext, and add the
ability to negate it within the option, as '--no-citext'.  It's kind of
a double negative for the C-keywords invocation, but we can have the
option for both cases, so we don't need to worry about what the
default is (which is case_insensitive=1).

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/common/Makefile b/src/common/Makefile
index d0c2b970eb..5f9327de59 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -124,7 +124,7 @@ libpgcommon_srv.a: $(OBJS_SRV)
 # generate SQL keyword lookup table to be included into keywords*.o.
 kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(GEN_KEYWORDLIST_DEPS)
-	$(GEN_KEYWORDLIST) --extern $<
+	$(GEN_KEYWORDLIST) --extern --citext $<
 
 # Dependencies of keywords*.o need to be managed explicitly to make sure
 # that you don't get broken parsing code, even in a non-enable-depend build.
diff --git a/src/interfaces/ecpg/preproc/Makefile b/src/interfaces/ecpg/preproc/Makefile
index 6c02f97622..96dd76317f 100644
--- a/src/interfaces/ecpg/preproc/Makefile
+++ b/src/interfaces/ecpg/preproc/Makefile
@@ -60,10 +60,10 @@ preproc.y: ../../../backend/parser/gram.y parse.pl ecpg.addons ecpg.header ecpg.
 # generate keyword headers
 c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST_DEPS)
-	$(GEN_KEYWORDLIST) --varname ScanCKeywords --case $<
+	$(GEN_KEYWORDLIST) --varname ScanCKeywords --no-citext $<
 
 ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST_DEPS)
-	$(GEN_KEYWORDLIST) --varname ScanECPGKeywords $<
+	$(GEN_KEYWORDLIST) --varname ScanECPGKeywords --citext $<
 
 # Force these dependencies to be known even without dependency info built:
 ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h
diff --git a/src/pl/plpgsql/src/Makefile b/src/pl/plpgsql/src/Makefile
index 8a0f294323..bc4ee50682 100644
--- a/src/pl/plpgsql/src/Makefile
+++ b/src/pl/plpgsql/src/Makefile
@@ -80,10 +80,10 @@ plerrcodes.h: $(top_srcdir)/src/backend/utils/errcodes.txt generate-plerrcodes.p
 # generate keyword headers for the scanner
 pl_reserved_kwlist_d.h: pl_reserved_kwlist.h $(GEN_KEYWORDLIST_DEPS)
-	$(GEN_KEYWORDLIST) --varname ReservedPLKeywords $<
+	$(GEN_KEYWORDLIST) --varname ReservedPLKeywords --citext $<
 
 pl_unreserved_kwlist_d.h: pl_unreserved_kwlist.h $(GEN_KEYWORDLIST_DEPS)
-	$(GEN_KEYWORDLIST) --varname UnreservedPLKeywords $<
+	$(GEN_KEYWORDLIST) --varname UnreservedPLKeywords --citext $<
 
 check: submake
diff --git a/src/tools/gen_keywordlist.pl b/src/tools/gen_keywordlist.pl
index 2744e1dc26..927511d7c4 100644
--- a/src/tools/gen_keywordlist.pl
+++ b/src/tools/gen_keywordlist.pl
@@ -35,13 +35,13 @@ use PerfectHash;
 my $output_path = '';
 my $extern = 0;
-my $case_sensitive = 0;
+my $case_insensitive = 1;
 my $varname = 'ScanKeywords';
 
 GetOptions(
 	'output:s' => \$output_path,
 	'extern' => \$extern,
-	'case-sensitive' => \$case_sensitive,
+	'citext!' => \$case_insensitive,
 	'varname:s' => \$varname) || usage();
 
 my $kw_input_file = shift @ARGV || die "No input file.\n";
@@ -97,7 +97,7 @@ while (<$kif>)
 }
 
 # When being case-insensitive, insist that the input be all-lower-case.
-if (!$case_sensitive)
+if ($case_insensitive)
 {
 	foreach my $kw (@keywords)
 	{
@@ -157,7 +157,7 @@ printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords;
 my $funcname = $varname . "_hash_func";
 
 my $f = PerfectHash::generate_hash_function(\@keywords, $funcname,
-	case_insensitive => !$case_sensitive);
+	case_insensitive => $case_insensitive);
 
 printf $kwdef qq|static %s\n|, $f;
 
@@ -178,11 +178,11 @@ printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename;
 sub usage
 {
 	die <] [--varname/-v ] [--extern/-e] input_file
+Usage: gen_keywordlist.pl [--output
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Wed, Jan 09, 2019 at 02:04:15PM -0500, Tom Lane wrote:
> Also, in view of finding that the original multiplier choices failed
> on the fmgr oid problem, I spent a little effort making the code
> able to try more combinations of hash multipliers and seeds.  It'd
> be nice to have some theory rather than just heuristics about what
> will work, though ...

The theory is that the code needs two families of pair-wise independent
hash functions to give the O(1) number of tries.  For practical
purposes, the Jenkins hash easily qualifies, as does any cryptographic
hash function.  The downside is that those hash functions are very
expensive, though.

A multiplicative hash, especially one using the high bits of the result
(e.g. the top 32 bits of a 32x32->64 multiplication), qualifies for the
requirements, but for string input it would need a pair of constants
per word.

So the choice of a hash function family for a performance-sensitive
part is a heuristic in the sense of trying to get away with as simple a
function as possible.

Joerg
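A minimal sketch of the multiplicative scheme Joerg describes, taking the high bits of a 32x32->64 multiplication; the constant in the usage below is an arbitrary example, not one selected by PerfectHash.pm:

```c
#include <stdint.h>

/*
 * Multiplicative hash: multiply the 32-bit key by a 32-bit constant and
 * keep the top `bits` bits of the 64-bit product.  The high bits mix
 * information from the whole key, unlike the low bits of the product.
 */
static uint32_t
mult_hash(uint32_t key, uint32_t mult, int bits)
{
	uint64_t	product = (uint64_t) key * mult;

	/* Take the top `bits` bits of the 32x32->64 multiplication. */
	return (uint32_t) (product >> (64 - bits));
}
```

For string keys, each 32-bit word of input would be mixed with its own constant (a "pair of constants per word" for the two independent families), which is essentially what the word-at-a-time proposal earlier in this thread does.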
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Andres Freund writes:
> On 2019-01-09 15:03:35 -0500, Tom Lane wrote:
>> (Speaking of which, I've been wondering for awhile if libpq ought not
>> obtain the OIDs of lo_create and friends by #including fmgroids.h
>> instead of doing a runtime query on every connection.  If we did that,
>> we'd be forever giving up the option to renumber them ... but do you
>> really want to bet that nobody else has done this already in some
>> other client code?)

> I'm not enthusiastic about that.  I kinda hope we're going to evolve
> that interface further, which'd make it version dependent anyway (we
> don't require all of them right now...).  And it's not that expensive
> to query their oids once.

Version dependency doesn't seem like much of an argument: we'd just
teach libpq to pay attention to the server version, which it knows
anyway (and uses for other purposes already).

But this is a bit off-topic for this thread, perhaps.

			regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Hi,

On 2019-01-09 15:03:35 -0500, Tom Lane wrote:
> Alvaro Herrera writes:
>> On 2019-Jan-09, Tom Lane wrote:
>>> We could make the index table still smaller if we wanted to reassign
>>> a couple dozen high-numbered functions down to lower OIDs, but I dunno
>>> if it's worth the trouble.  It certainly isn't from a performance
>>> standpoint, because those unused entry ranges will never be touched
>>> in normal usage; but it'd make the server executable a couple KB smaller.
>
>> Or two couples KB smaller, if we abandoned the idea that pg_proc OIDs
>> must not collide with those in any other catalog, and we renumbered all
>> functions to start at OID 1 or so.  duplicate_oids would complain about
>> that, though, I suppose ... and nobody who has ever hardcoded a function
>> OID would love this idea much.
>
> I think that'd be a nonstarter for commonly-used functions.  I'm guessing
> that pg_replication_origin_create() and so on, which are the immediate
> problem, haven't been around long enough or get used often enough for
> someone to have hard-coded their OIDs.  But I could be wrong.

I don't think it's likely that it'd be useful to hardcode them, and
therefore hope that nobody would do so.  I personally feel limited
sympathy for people hardcoding oids across major versions.  The benefits
of making pg easier to maintain and more efficient seem higher than
allowing for that.

> (Speaking of which, I've been wondering for awhile if libpq ought not
> obtain the OIDs of lo_create and friends by #including fmgroids.h
> instead of doing a runtime query on every connection.  If we did that,
> we'd be forever giving up the option to renumber them ... but do you
> really want to bet that nobody else has done this already in some
> other client code?)

I'm not enthusiastic about that.  I kinda hope we're going to evolve
that interface further, which'd make it version dependent anyway (we
don't require all of them right now...).  And it's not that expensive to
query their oids once.

Greetings,

Andres Freund
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Alvaro Herrera writes:
> On 2019-Jan-09, Tom Lane wrote:
>> We could make the index table still smaller if we wanted to reassign
>> a couple dozen high-numbered functions down to lower OIDs, but I dunno
>> if it's worth the trouble.  It certainly isn't from a performance
>> standpoint, because those unused entry ranges will never be touched
>> in normal usage; but it'd make the server executable a couple KB smaller.

> Or two couples KB smaller, if we abandoned the idea that pg_proc OIDs
> must not collide with those in any other catalog, and we renumbered all
> functions to start at OID 1 or so.  duplicate_oids would complain about
> that, though, I suppose ... and nobody who has ever hardcoded a function
> OID would love this idea much.

I think that'd be a nonstarter for commonly-used functions.  I'm guessing
that pg_replication_origin_create() and so on, which are the immediate
problem, haven't been around long enough or get used often enough for
someone to have hard-coded their OIDs.  But I could be wrong.

(Speaking of which, I've been wondering for awhile if libpq ought not
obtain the OIDs of lo_create and friends by #including fmgroids.h
instead of doing a runtime query on every connection.  If we did that,
we'd be forever giving up the option to renumber them ... but do you
really want to bet that nobody else has done this already in some
other client code?)

			regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Andres Freund writes:
> On 2019-01-09 14:44:24 -0500, Tom Lane wrote:
>> 	/* fast lookup only possible if original oid still assigned */
>> -	if (id >= FirstGenbkiObjectId)
>> +	if (id > fmgr_last_builtin_oid)
>> 		return NULL;

> An extern reference here will make the code a bit less efficient, but
> it's probably not worth generating a header with a define for it
> instead...

Yeah, also that would be significantly more fragile, in that it'd be
hard to be sure where that OID had propagated to when rebuilding.  We
haven't chosen to make fmgr_nbuiltins a #define either, and I think
it's best to treat this the same way.

			regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Hi,

On 2019-01-09 14:44:24 -0500, Tom Lane wrote:
> I wrote:
>> Also, I fail to understand why fmgr_builtin_oid_index has 10000 entries
>> anyway.  We could easily have fmgrtab.c expose the last actually assigned
>> builtin function OID (presently 6121) and make the index array only
>> that big, which just about eliminates the space advantage completely.
>
> Concretely, like the attached.

Seems like a good improvement.

> We could make the index table still smaller if we wanted to reassign
> a couple dozen high-numbered functions down to lower OIDs, but I dunno
> if it's worth the trouble.  It certainly isn't from a performance
> standpoint, because those unused entry ranges will never be touched
> in normal usage; but it'd make the server executable a couple KB smaller.

Probably indeed not worth it.  I'm not 100% convinced on the performance
POV, but in contrast to the earlier binary search either approach is
fast enough that it's probably hard to measure any difference.

> diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c
> index b41649f..506eeef 100644
> --- a/src/backend/utils/fmgr/fmgr.c
> +++ b/src/backend/utils/fmgr/fmgr.c
> @@ -75,12 +75,12 @@ fmgr_isbuiltin(Oid id)
> 	uint16		index;
>
> 	/* fast lookup only possible if original oid still assigned */
> -	if (id >= FirstGenbkiObjectId)
> +	if (id > fmgr_last_builtin_oid)
> 		return NULL;

An extern reference here will make the code a bit less efficient, but
it's probably not worth generating a header with a define for it
instead...

Greetings,

Andres Freund
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 2019-Jan-09, Tom Lane wrote:
> We could make the index table still smaller if we wanted to reassign
> a couple dozen high-numbered functions down to lower OIDs, but I dunno
> if it's worth the trouble.  It certainly isn't from a performance
> standpoint, because those unused entry ranges will never be touched
> in normal usage; but it'd make the server executable a couple KB smaller.

Or two couples KB smaller, if we abandoned the idea that pg_proc OIDs
must not collide with those in any other catalog, and we renumbered all
functions to start at OID 1 or so.  duplicate_oids would complain about
that, though, I suppose ... and nobody who has ever hardcoded a function
OID would love this idea much.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
I wrote: > Also, I fail to understand why fmgr_builtin_oid_index has 1 entries > anyway. We could easily have fmgrtab.c expose the last actually assigned > builtin function OID (presently 6121) and make the index array only > that big, which just about eliminates the space advantage completely. Concretely, like the attached. We could make the index table still smaller if we wanted to reassign a couple dozen high-numbered functions down to lower OIDs, but I dunno if it's worth the trouble. It certainly isn't from a performance standpoint, because those unused entry ranges will never be touched in normal usage; but it'd make the server executable a couple KB smaller. regards, tom lane diff --git a/src/backend/utils/Gen_fmgrtab.pl b/src/backend/utils/Gen_fmgrtab.pl index cafe408..f970940 100644 *** a/src/backend/utils/Gen_fmgrtab.pl --- b/src/backend/utils/Gen_fmgrtab.pl *** foreach my $datfile (@input_files) *** 80,90 $catalog_data{$catname} = Catalog::ParseData($datfile, $schema, 0); } - # Fetch some values for later. - my $FirstGenbkiObjectId = - Catalog::FindDefinedSymbol('access/transam.h', $include_path, - 'FirstGenbkiObjectId'); - # Collect certain fields from pg_proc.dat. my @fmgr = (); --- 80,85 *** my %bmap; *** 225,230 --- 220,226 $bmap{'t'} = 'true'; $bmap{'f'} = 'false'; my @fmgr_builtin_oid_index; + my $last_builtin_oid = 0; my $fmgr_count = 0; foreach my $s (sort { $a->{oid} <=> $b->{oid} } @fmgr) { *** foreach my $s (sort { $a->{oid} <=> $b-> *** 232,237 --- 228,234 " { $s->{oid}, $s->{nargs}, $bmap{$s->{strict}}, $bmap{$s->{retset}}, \"$s->{prosrc}\", $s->{prosrc} }"; $fmgr_builtin_oid_index[ $s->{oid} ] = $fmgr_count++; + $last_builtin_oid = $s->{oid}; if ($fmgr_count <= $#fmgr) { *** foreach my $s (sort { $a->{oid} <=> $b-> *** 244,274 } print $tfh "};\n"; ! print $tfh qq| const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin)); ! |; # Create fmgr_builtins_oid_index table. ! # ! 
# Note that the array has to be filled up to FirstGenbkiObjectId, ! # as we can't rely on zero initialization as 0 is a valid mapping. ! print $tfh qq| ! const uint16 fmgr_builtin_oid_index[FirstGenbkiObjectId] = { ! |; ! for (my $i = 0; $i < $FirstGenbkiObjectId; $i++) { my $oid = $fmgr_builtin_oid_index[$i]; ! # fmgr_builtin_oid_index is sparse, map nonexistant functions to # InvalidOidBuiltinMapping if (not defined $oid) { $oid = 'InvalidOidBuiltinMapping'; } ! if ($i + 1 == $FirstGenbkiObjectId) { print $tfh " $oid\n"; } --- 241,270 } print $tfh "};\n"; ! printf $tfh qq| const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin)); ! ! const Oid fmgr_last_builtin_oid = %u; ! |, $last_builtin_oid; # Create fmgr_builtins_oid_index table. ! printf $tfh qq| ! const uint16 fmgr_builtin_oid_index[%u] = { ! |, $last_builtin_oid + 1; ! for (my $i = 0; $i <= $last_builtin_oid; $i++) { my $oid = $fmgr_builtin_oid_index[$i]; ! # fmgr_builtin_oid_index is sparse, map nonexistent functions to # InvalidOidBuiltinMapping if (not defined $oid) { $oid = 'InvalidOidBuiltinMapping'; } ! if ($i == $last_builtin_oid) { print $tfh " $oid\n"; } diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c index b41649f..506eeef 100644 *** a/src/backend/utils/fmgr/fmgr.c --- b/src/backend/utils/fmgr/fmgr.c *** fmgr_isbuiltin(Oid id) *** 75,86 uint16 index; /* fast lookup only possible if original oid still assigned */ ! if (id >= FirstGenbkiObjectId) return NULL; /* * Lookup function data. If there's a miss in that range it's likely a ! * nonexistant function, returning NULL here will trigger an ERROR later. */ index = fmgr_builtin_oid_index[id]; if (index == InvalidOidBuiltinMapping) --- 75,86 uint16 index; /* fast lookup only possible if original oid still assigned */ ! if (id > fmgr_last_builtin_oid) return NULL; /* * Lookup function data. If there's a miss in that range it's likely a ! 
* nonexistent function, returning NULL here will trigger an ERROR later. */ index = fmgr_builtin_oid_index[id]; if (index == InvalidOidBuiltinMapping) diff --git a/src/include/utils/fmgrtab.h b/src/include/utils/fmgrtab.h index a778f88..e981f34 100644 *** a/src/include/utils/fmgrtab.h --- b/src/include/utils/fmgrtab.h *** extern const FmgrBuiltin fmgr_builtins[] *** 36,46 extern const int fmgr_nbuiltins; /* number of entries in table */ /* ! * Mapping from a builtin function's oid to the index in the fmgr_builtins ! * array. */ #define InvalidOidBuiltinMapping PG_UINT16_MAX ! extern const uint16 fmgr_builtin_oid_index[FirstGenbkiObjectId]; #endif /* FMGRTAB_H */ --- 36,48
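The shape of the lookup after this change can be sketched standalone as follows. This is an illustration, not the actual fmgr code: the table contents, OIDs, and function name here are invented; only the control flow (range check against the last builtin OID, then a sentinel check) mirrors the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define InvalidOidBuiltinMapping UINT16_MAX

/* Toy stand-ins for the arrays Gen_fmgrtab.pl emits; the OID
 * assignments below are made up for illustration. */
static const uint32_t last_builtin_oid = 7;
static const uint16_t builtin_oid_index[7 + 1] = {
    InvalidOidBuiltinMapping,   /* OID 0: never assigned */
    0, 1,                       /* OIDs 1-2 are builtins */
    InvalidOidBuiltinMapping,   /* OID 3: hole in the numbering */
    2,
    InvalidOidBuiltinMapping,   /* OID 5: another hole */
    3, 4                        /* OIDs 6-7 */
};

/* Return the fmgr_builtins[] slot for an OID, or -1 for non-builtins. */
int builtin_slot(uint32_t oid)
{
    uint16_t index;

    /* fast lookup only possible while the OID is in the indexed range */
    if (oid > last_builtin_oid)
        return -1;
    index = builtin_oid_index[oid];
    if (index == InvalidOidBuiltinMapping)
        return -1;
    return index;
}
```

The point of sizing the array to `last_builtin_oid + 1` rather than `FirstGenbkiObjectId` is visible here: OIDs above the last assigned one are rejected by the range check, so the trailing unused entries never needed to exist.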
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
I wrote: > John Naylor writes: >> -There is a bit of a cognitive clash between $case_sensitive in >> gen_keywordlist.pl and $case_insensitive in PerfectHash.pm. They each >> make sense in their own file, but might it be worth using one or the >> other? > Yeah, dunno. It seems to make sense for the command-line-level default of > gen_keywordlist.pl to be "case insensitive", since most users want that. > But that surely shouldn't be the default in PerfectHash.pm, and I'm not > very sure how to reconcile the discrepancy. Working on the fmgr-oid-lookup idea gave me the thought that PerfectHash.pm ought to support fixed-length keys. Rather than start adding random parameters to the function, I borrowed an idea from PostgresNode.pm and made the options be keyword-style parameters. Now the impedance mismatch about case sensitivity is handled with my $f = PerfectHash::generate_hash_function(\@keywords, $funcname, case_insensitive => !$case_sensitive); which is at least a little clearer than before, though I'm not sure if it entirely solves the problem. Also, in view of finding that the original multiplier choices failed on the fmgr oid problem, I spent a little effort making the code able to try more combinations of hash multipliers and seeds. It'd be nice to have some theory rather than just heuristics about what will work, though ... Barring objections or further review, I plan to push this soon. 
regards, tom lane diff --git a/src/common/Makefile b/src/common/Makefile index 317b071..d0c2b97 100644 *** a/src/common/Makefile --- b/src/common/Makefile *** OBJS_FRONTEND = $(OBJS_COMMON) fe_memuti *** 63,68 --- 63,73 OBJS_SHLIB = $(OBJS_FRONTEND:%.o=%_shlib.o) OBJS_SRV = $(OBJS_COMMON:%.o=%_srv.o) + # where to find gen_keywordlist.pl and subsidiary files + TOOLSDIR = $(top_srcdir)/src/tools + GEN_KEYWORDLIST = $(PERL) -I $(TOOLSDIR) $(TOOLSDIR)/gen_keywordlist.pl + GEN_KEYWORDLIST_DEPS = $(TOOLSDIR)/gen_keywordlist.pl $(TOOLSDIR)/PerfectHash.pm + all: libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a distprep: kwlist_d.h *** libpgcommon_srv.a: $(OBJS_SRV) *** 118,125 $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@ # generate SQL keyword lookup table to be included into keywords*.o. ! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(top_srcdir)/src/tools/gen_keywordlist.pl ! $(PERL) $(top_srcdir)/src/tools/gen_keywordlist.pl --extern $< # Dependencies of keywords*.o need to be managed explicitly to make sure # that you don't get broken parsing code, even in a non-enable-depend build. --- 123,130 $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@ # generate SQL keyword lookup table to be included into keywords*.o. ! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(GEN_KEYWORDLIST_DEPS) ! $(GEN_KEYWORDLIST) --extern $< # Dependencies of keywords*.o need to be managed explicitly to make sure # that you don't get broken parsing code, even in a non-enable-depend build. diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c index d72842e..6545480 100644 *** a/src/common/kwlookup.c --- b/src/common/kwlookup.c *** *** 35,94 * receive a different case-normalization mapping. */ int ! ScanKeywordLookup(const char *text, const ScanKeywordList *keywords) { ! int len, ! i; ! char word[NAMEDATALEN]; ! const char *kw_string; ! const uint16 *kw_offsets; ! const uint16 *low; ! const uint16 *high; ! ! 
len = strlen(text); if (len > keywords->max_kw_len) ! return -1;/* too long to be any keyword */ ! ! /* We assume all keywords are shorter than NAMEDATALEN. */ ! Assert(len < NAMEDATALEN); /* ! * Apply an ASCII-only downcasing. We must not use tolower() since it may ! * produce the wrong translation in some locales (eg, Turkish). */ ! for (i = 0; i < len; i++) ! { ! char ch = text[i]; ! if (ch >= 'A' && ch <= 'Z') ! ch += 'a' - 'A'; ! word[i] = ch; ! } ! word[len] = '\0'; /* ! * Now do a binary search using plain strcmp() comparison. */ ! kw_string = keywords->kw_string; ! kw_offsets = keywords->kw_offsets; ! low = kw_offsets; ! high = kw_offsets + (keywords->num_keywords - 1); ! while (low <= high) { ! const uint16 *middle; ! int difference; ! middle = low + (high - low) / 2; ! difference = strcmp(kw_string + *middle, word); ! if (difference == 0) ! return middle - kw_offsets; ! else if (difference < 0) ! low = middle + 1; ! else ! high = middle - 1; } ! return -1; } --- 35,85 * receive a different case-normalization mapping. */ int ! ScanKeywordLookup(const char *str, const ScanKeywordList *keywords) { ! size_t len; ! int h; ! const char *kw; + /* + * Reject immediately if too long to be any keyword. This saves useless + * hashing and downcasing work on long strings. +
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Tue, Jan 08, 2019 at 05:53:25PM -0500, Tom Lane wrote: > John Naylor writes: > > -As for the graph algorithm, I'd have to play with it to understand > > how it works. > > I improved the comment about how come the hash table entry assignment > works. One thing I'm not clear about myself is > > # A cycle-free graph is either empty or has some vertex of degree 1. > > That sounds like a standard graph theory result, but I'm not familiar > with a proof for it. Suppose instead that the graph is acyclic, non-empty, and every vertex has degree > 1. Pick any vertex and start constructing a path from it. It is connected to at least one other vertex, so follow that edge. The next vertex must in turn be connected to at least one further vertex, and the path can't return to the starting point (that would be a cycle). Each subsequent vertex still has another connection, and the path can never revisit any already visited vertex. So the path keeps growing until you run out of vertices, which is impossible in a finite graph; hence a non-empty acyclic graph must contain a vertex of degree 1. Joerg
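The same degree-1 argument is what drives the module's assignment step: repeatedly peel off a vertex of degree at most 1, and the graph was acyclic exactly when no edge survives the peeling. A minimal sketch over a toy adjacency-matrix graph (illustrative only, not the PerfectHash.pm code; assumes no self-loops):

```c
#include <stdbool.h>

#define NV 6                    /* number of vertices in the toy graph */

/* Return true if the undirected graph (adjacency matrix, no self-loops)
 * is acyclic, by repeatedly deleting a vertex of degree <= 1.  If any
 * edge survives, every remaining vertex had degree >= 2, which by the
 * argument above implies a cycle. */
bool is_acyclic(bool adj[NV][NV])
{
    bool removed[NV] = {false};

    for (;;)
    {
        int v = -1;

        /* find a live vertex of degree <= 1 */
        for (int i = 0; i < NV && v < 0; i++)
        {
            int deg = 0;

            if (removed[i])
                continue;
            for (int j = 0; j < NV; j++)
                if (!removed[j] && adj[i][j])
                    deg++;
            if (deg <= 1)
                v = i;
        }
        if (v < 0)
            break;              /* every live vertex has degree >= 2 */
        removed[v] = true;      /* peel it, deleting its (at most one) edge */
    }

    /* any surviving edge connects vertices of degree >= 2: a cycle */
    for (int i = 0; i < NV; i++)
        for (int j = 0; j < NV; j++)
            if (!removed[i] && !removed[j] && adj[i][j])
                return false;
    return true;
}
```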
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
John Naylor writes: > -As for the graph algorithm, I'd have to play with it to understand > how it works. I improved the comment about how come the hash table entry assignment works. One thing I'm not clear about myself is # A cycle-free graph is either empty or has some vertex of degree 1. That sounds like a standard graph theory result, but I'm not familiar with a proof for it. regards, tom lane #-- # # PerfectHash.pm #Perl module that constructs minimal perfect hash functions # # This code constructs a minimal perfect hash function for the given # set of keys, using an algorithm described in # "An optimal algorithm for generating minimal perfect hash functions" # by Czech, Havas and Majewski in Information Processing Letters, # 43(5):256-264, October 1992. # This implementation is loosely based on NetBSD's "nbperf", # which was written by Joerg Sonnenberger. # # The resulting hash function is perfect in the sense that if the presented # key is one of the original set, it will return the key's index in the set # (in range 0..N-1). However, the caller must still verify the match, # as false positives are possible. Also, the hash function may return # values that are out of range (negative, or >= N). This indicates that # the presented key is definitely not in the set. # # # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group # Portions Copyright (c) 1994, Regents of the University of California # # src/tools/PerfectHash.pm # #-- package PerfectHash; use strict; use warnings; # At runtime, we'll compute two simple hash functions of the input key, # and use them to index into a mapping table. The hash functions are just # multiply-and-add in uint32 arithmetic, with different multipliers but # the same initial seed. All the complexity in this module is concerned # with selecting hash parameters that will work and building the mapping # table. 
# We support making case-insensitive hash functions, though this only # works for a strict-ASCII interpretation of case insensitivity, # ie, A-Z maps onto a-z and nothing else. my $case_insensitive = 0; # # Construct a C function implementing a perfect hash for the given keys. # The C function definition is returned as a string. # # The keys can be any set of Perl strings; it is caller's responsibility # that there not be any duplicates. (Note that the "strings" can be # binary data, but endianness is the caller's problem.) # # The name to use for the function is caller-specified, but its signature # is always "int f(const void *key, size_t keylen)". The caller may # prepend "static " to the result string if it wants a static function. # # If $ci is true, the function is case-insensitive, for the limited idea # of case-insensitivity explained above. # sub generate_hash_function { my ($keys_ref, $funcname, $ci) = @_; # It's not worth passing this around as a parameter; just use a global. $case_insensitive = $ci; # Try different hash function parameters until we find a set that works # for these keys. In principle we might need to change multipliers, # but these two multipliers are chosen to be primes that are cheap to # calculate via shift-and-add, so don't change them without care. my $hash_mult1 = 31; my $hash_mult2 = 2053; # We just try successive hash seed values until we find one that works. # (Commonly, random seeds are tried, but we want reproducible results # from this program so we don't do that.) my $hash_seed; my @subresult; for ($hash_seed = 0; $hash_seed < 1000; $hash_seed++) { @subresult = _construct_hash_table($keys_ref, $hash_mult1, $hash_mult2, $hash_seed); last if @subresult; } # Choke if we didn't succeed in a reasonable number of tries. die "failed to generate perfect hash" if !@subresult; # Extract info from the function result array. 
my $elemtype = $subresult[0]; my @hashtab = @{ $subresult[1] }; my $nhash= scalar(@hashtab); # OK, construct the hash function definition including the hash table. my $f = ''; $f .= sprintf "int\n"; $f .= sprintf "%s(const void *key, size_t keylen)\n{\n", $funcname; $f .= sprintf "\tstatic const %s h[%d] = {\n", $elemtype, $nhash; for (my $i = 0; $i < $nhash; $i++) { $f .= sprintf "%s%6d,%s", ($i % 8 == 0 ? "\t\t" : " "), $hashtab[$i], ($i % 8 == 7 ? "\n" : ""); } $f .= sprintf "\n" if ($nhash % 8 != 0); $f .= sprintf "\t};\n\n"; $f .= sprintf "\tconst unsigned char *k = key;\n"; $f .=
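The hash bodies the module generates are just this multiply-and-add loop. A standalone C rendering (the function name and parameterization are my own for illustration, but the multipliers 31 and 2053 and the ASCII-only case folding match what the module's comments describe):

```c
#include <stddef.h>
#include <stdint.h>

/* One multiply-and-add string hash of the kind PerfectHash.pm emits.
 * The module uses mult = 31 for one hash and 2053 for the other, with
 * the same searched-for seed; case folding is the strict-ASCII
 * A-Z -> a-z mapping only. */
uint32_t maa_hash(const void *key, size_t keylen, uint32_t mult,
                  uint32_t seed, int case_insensitive)
{
    const unsigned char *k = key;
    uint32_t h = seed;

    while (keylen-- > 0)
    {
        unsigned char ch = *k++;

        if (case_insensitive && ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        h = h * mult + ch;      /* uint32 arithmetic wraps, as intended */
    }
    return h;
}
```

Both multipliers are cheap as shift-and-add: 31*h is (h<<5)-h and 2053*h is (h<<11)+(h<<2)+h, which is why the module warns against changing them casually.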
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
John Naylor writes: > Just a couple comments about the module: > -If you qualify the function's module name as you did > (PerfectHash::generate_hash_function), you don't have to export the > function into the caller's namespace, so you can skip the @EXPORT_OK > setting. Most of our modules don't export. OK by me. I was more concerned about hiding the stuff that isn't supposed to be exported. > -There is a bit of a cognitive clash between $case_sensitive in > gen_keywordlist.pl and $case_insensitive in PerfectHash.pm. They each > make sense in their own file, but might it be worth using one or the > other? Yeah, dunno. It seems to make sense for the command-line-level default of gen_keywordlist.pl to be "case insensitive", since most users want that. But that surely shouldn't be the default in PerfectHash.pm, and I'm not very sure how to reconcile the discrepancy. > In the committed keyword patch, I noticed that in common/keywords.c, > the array length is defined with > ScanKeywordCategories[SCANKEYWORDS_NUM_KEYWORDS] > but other keyword arrays just have ...[]. Is there a reason for the > difference? The length macro was readily available there so I used it. AFAIR that wasn't true elsewhere, though I might've missed something. It's pretty much just belt-and-suspenders coding anyway, since all those arrays are machine generated ... regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Tue, Jan 8, 2019 at 3:04 PM Tom Lane wrote: > > I'll take a crack at separating into a module. I'll wait a bit in > case there are any stylistic suggestions on the patch as it stands. > > I had a go at that myself. I'm sure there's plenty to criticize in > the result, but at least it passes make check-world ;-) Just a couple comments about the module: -If you qualify the function's module name as you did (PerfectHash::generate_hash_function), you don't have to export the function into the caller's namespace, so you can skip the @EXPORT_OK setting. Most of our modules don't export. -There is a bit of a cognitive clash between $case_sensitive in gen_keywordlist.pl and $case_insensitive in PerfectHash.pm. They each make sense in their own file, but might it be worth using one or the other? -As for the graph algorithm, I'd have to play with it to understand how it works. In the committed keyword patch, I noticed that in common/keywords.c, the array length is defined with ScanKeywordCategories[SCANKEYWORDS_NUM_KEYWORDS] but other keyword arrays just have ...[]. Is there a reason for the difference?
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Andres Freund writes: > On 2019-01-08 13:41:16 -0500, John Naylor wrote: >> Do you mean the fmgr table? > Not the entire fmgr table, but just the builtin oid index, generated by > the following section: > ... > The generated fmgr_builtin_oid_index is pretty sparse, and a more dense > hashtable might e.g. be more efficient from a cache perspective. I experimented with this, but TBH I think it's a dead loss. We currently have 2768 built-in functions, so the perfect hash table requires 5537 int16 entries, which is not *that* much less than the 10000 entries that are in fmgr_builtin_oid_index presently. When you consider the extra cycles needed to do the hashing, and the fact that you have to touch (usually) two cache lines not one in the lookup table, it's hard to see how this could net out as a win performance-wise. Also, I fail to understand why fmgr_builtin_oid_index has 10000 entries anyway. We could easily have fmgrtab.c expose the last actually assigned builtin function OID (presently 6121) and make the index array only that big, which just about eliminates the space advantage completely. BTW, I found out while trying this that Joerg's fear of the hash multipliers being too simplistic is valid: the perfect hash generator failed until I changed them. I picked a larger value that should be just as easy to use for shift-and-add purposes. regards, tom lane diff --git a/src/backend/utils/Gen_fmgrtab.pl b/src/backend/utils/Gen_fmgrtab.pl index cafe408..9fceb60 100644 *** a/src/backend/utils/Gen_fmgrtab.pl --- b/src/backend/utils/Gen_fmgrtab.pl *** use Catalog; *** 18,23 --- 18,24 use strict; use warnings; + use PerfectHash; # Collect arguments my @input_files; *** foreach my $s (sort { $a->{oid} <=> $b-> *** 219,237 print $pfh "extern Datum $s->{prosrc}(PG_FUNCTION_ARGS);\n"; } ! # Create the fmgr_builtins table, collect data for fmgr_builtin_oid_index print $tfh "\nconst FmgrBuiltin fmgr_builtins[] = {\n"; my %bmap; $bmap{'t'} = 'true'; $bmap{'f'} = 'false'; ! 
my @fmgr_builtin_oid_index; my $fmgr_count = 0; foreach my $s (sort { $a->{oid} <=> $b->{oid} } @fmgr) { print $tfh " { $s->{oid}, $s->{nargs}, $bmap{$s->{strict}}, $bmap{$s->{retset}}, \"$s->{prosrc}\", $s->{prosrc} }"; ! $fmgr_builtin_oid_index[ $s->{oid} ] = $fmgr_count++; if ($fmgr_count <= $#fmgr) { --- 220,244 print $pfh "extern Datum $s->{prosrc}(PG_FUNCTION_ARGS);\n"; } ! # Create the fmgr_builtins table, collect data for hash function print $tfh "\nconst FmgrBuiltin fmgr_builtins[] = {\n"; my %bmap; $bmap{'t'} = 'true'; $bmap{'f'} = 'false'; ! my @fmgr_builtin_oids; ! my $prev_oid = 0; my $fmgr_count = 0; foreach my $s (sort { $a->{oid} <=> $b->{oid} } @fmgr) { print $tfh " { $s->{oid}, $s->{nargs}, $bmap{$s->{strict}}, $bmap{$s->{retset}}, \"$s->{prosrc}\", $s->{prosrc} }"; ! die "duplicate OIDs" if $s->{oid} <= $prev_oid; ! $prev_oid = $s->{oid}; ! ! push @fmgr_builtin_oids, pack("n", $s->{oid}); ! ! $fmgr_count++; if ($fmgr_count <= $#fmgr) { *** print $tfh "};\n"; *** 246,283 print $tfh qq| const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin)); - |; - - # Create fmgr_builtins_oid_index table. - # - # Note that the array has to be filled up to FirstGenbkiObjectId, - # as we can't rely on zero initialization as 0 is a valid mapping. - print $tfh qq| - const uint16 fmgr_builtin_oid_index[FirstGenbkiObjectId] = { |; - for (my $i = 0; $i < $FirstGenbkiObjectId; $i++) - { - my $oid = $fmgr_builtin_oid_index[$i]; ! # fmgr_builtin_oid_index is sparse, map nonexistant functions to ! # InvalidOidBuiltinMapping ! if (not defined $oid) ! { ! $oid = 'InvalidOidBuiltinMapping'; ! } ! if ($i + 1 == $FirstGenbkiObjectId) ! { ! print $tfh " $oid\n"; ! } ! else ! { ! print $tfh " $oid,\n"; ! } ! } ! print $tfh "};\n"; # And add the file footers. --- 253,267 print $tfh qq| const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin)); |; ! # Create perfect hash function for searching fmgr_builtin by OID. ! 
print $tfh PerfectHash::generate_hash_function(\@fmgr_builtin_oids, ! "fmgr_builtin_oid_hash", ! 0); # And add the file footers. diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c index b41649f..ad93032 100644 *** a/src/backend/utils/fmgr/fmgr.c --- b/src/backend/utils/fmgr/fmgr.c *** extern Datum fmgr_security_definer(PG_FU *** 72,92 static const FmgrBuiltin * fmgr_isbuiltin(Oid id) { ! uint16 index; /* fast lookup only possible if original oid still assigned */ if (id >= FirstGenbkiObjectId) return NULL; /* ! * Lookup function data. If there's a miss in that range it's likely a ! * nonexistant function, returning NULL here will trigger an ERROR later.
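The hashed lookup this patch installs in fmgr_isbuiltin() then has the following shape. Everything below is a toy stand-in: the OIDs are invented and the hand-made "perfect" hash replaces the machine-generated fmgr_builtin_oid_hash(); only the hash/range-check/verify flow reflects the experiment.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for the generated fmgr_builtins[] table and the
 * generated perfect hash; the OIDs 11-13 and the hash are made up. */
typedef struct
{
    uint32_t    foid;
    const char *prosrc;
} ToyBuiltin;

static const ToyBuiltin toy_builtins[] = {
    {11, "f11"}, {12, "f12"}, {13, "f13"},
};
#define TOY_NBUILTINS 3

static int toy_oid_hash(uint32_t oid)
{
    return (int) (oid % 5) - 1; /* happens to be perfect for 11,12,13 */
}

/* Hashed replacement for the index-array lookup: hash the OID,
 * range-check the result, then verify the slot really holds that OID,
 * since a perfect hash promises nothing about non-member inputs. */
const ToyBuiltin *toy_isbuiltin(uint32_t oid)
{
    int h = toy_oid_hash(oid);

    if (h < 0 || h >= TOY_NBUILTINS)
        return NULL;
    if (toy_builtins[h].foid != oid)
        return NULL;
    return &toy_builtins[h];
}
```

The verification step is the part that replaces the old sentinel check: a non-builtin OID either hashes out of range or lands on a slot whose stored OID disagrees.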
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
John Naylor writes: > On Tue, Jan 8, 2019 at 12:06 PM Andrew Dunstan > wrote: >> If he doesn't I will. > I'll take a crack at separating into a module. I'll wait a bit in > case there are any stylistic suggestions on the patch as it stands. I had a go at that myself. I'm sure there's plenty to criticize in the result, but at least it passes make check-world ;-) I resolved the worry I had last night about the range of table values by putting in logic to check the range and choose a suitable table element type. There are a couple of existing calls where we manage to fit the hashtable elements into int8 that way; of course, by definition that doesn't save a whole lot of space since such tables couldn't have many elements, but it seems cleaner anyway. regards, tom lane diff --git a/src/common/Makefile b/src/common/Makefile index 317b071..d0c2b97 100644 *** a/src/common/Makefile --- b/src/common/Makefile *** OBJS_FRONTEND = $(OBJS_COMMON) fe_memuti *** 63,68 --- 63,73 OBJS_SHLIB = $(OBJS_FRONTEND:%.o=%_shlib.o) OBJS_SRV = $(OBJS_COMMON:%.o=%_srv.o) + # where to find gen_keywordlist.pl and subsidiary files + TOOLSDIR = $(top_srcdir)/src/tools + GEN_KEYWORDLIST = $(PERL) -I $(TOOLSDIR) $(TOOLSDIR)/gen_keywordlist.pl + GEN_KEYWORDLIST_DEPS = $(TOOLSDIR)/gen_keywordlist.pl $(TOOLSDIR)/PerfectHash.pm + all: libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a distprep: kwlist_d.h *** libpgcommon_srv.a: $(OBJS_SRV) *** 118,125 $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@ # generate SQL keyword lookup table to be included into keywords*.o. ! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(top_srcdir)/src/tools/gen_keywordlist.pl ! $(PERL) $(top_srcdir)/src/tools/gen_keywordlist.pl --extern $< # Dependencies of keywords*.o need to be managed explicitly to make sure # that you don't get broken parsing code, even in a non-enable-depend build. 
--- 123,130 $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@ # generate SQL keyword lookup table to be included into keywords*.o. ! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(GEN_KEYWORDLIST_DEPS) ! $(GEN_KEYWORDLIST) --extern $< # Dependencies of keywords*.o need to be managed explicitly to make sure # that you don't get broken parsing code, even in a non-enable-depend build. diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c index d72842e..9dc1fee 100644 *** a/src/common/kwlookup.c --- b/src/common/kwlookup.c *** *** 35,94 * receive a different case-normalization mapping. */ int ! ScanKeywordLookup(const char *text, const ScanKeywordList *keywords) { ! int len, ! i; ! char word[NAMEDATALEN]; ! const char *kw_string; ! const uint16 *kw_offsets; ! const uint16 *low; ! const uint16 *high; ! ! len = strlen(text); if (len > keywords->max_kw_len) ! return -1;/* too long to be any keyword */ ! ! /* We assume all keywords are shorter than NAMEDATALEN. */ ! Assert(len < NAMEDATALEN); /* ! * Apply an ASCII-only downcasing. We must not use tolower() since it may ! * produce the wrong translation in some locales (eg, Turkish). */ ! for (i = 0; i < len; i++) ! { ! char ch = text[i]; ! if (ch >= 'A' && ch <= 'Z') ! ch += 'a' - 'A'; ! word[i] = ch; ! } ! word[len] = '\0'; /* ! * Now do a binary search using plain strcmp() comparison. */ ! kw_string = keywords->kw_string; ! kw_offsets = keywords->kw_offsets; ! low = kw_offsets; ! high = kw_offsets + (keywords->num_keywords - 1); ! while (low <= high) { ! const uint16 *middle; ! int difference; ! middle = low + (high - low) / 2; ! difference = strcmp(kw_string + *middle, word); ! if (difference == 0) ! return middle - kw_offsets; ! else if (difference < 0) ! low = middle + 1; ! else ! high = middle - 1; } ! return -1; } --- 35,89 * receive a different case-normalization mapping. */ int ! ScanKeywordLookup(const char *str, const ScanKeywordList *keywords) { ! size_t len; ! int h; ! 
const char *kw; + /* + * Reject immediately if too long to be any keyword. This saves useless + * hashing and downcasing work on long strings. + */ + len = strlen(str); if (len > keywords->max_kw_len) ! return -1; /* ! * Compute the hash function. We assume it was generated to produce ! * case-insensitive results. Since it's a perfect hash, we need only ! * match to the specific keyword it identifies. */ ! h = keywords->hash(str, len); ! /* ! * An out-of-range result implies no match. (This can happen for ! * non-keyword inputs because the hash function will sum two unrelated ! * hashtable entries.) ! */ ! if (h < 0 || h >= keywords->num_keywords) ! return -1; /* ! * Compare character-by-character to see if we have a match, applying an ! * ASCII-only
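Putting the pieces together, the new ScanKeywordLookup() control flow can be sketched as a self-contained function. A hard-wired two-keyword "hash" stands in for the generated perfect hash here; only the length cutoff, range check, and downcase-as-you-compare loop mirror the patch.

```c
#include <stddef.h>
#include <string.h>

static const char *const toy_kws[] = {"add", "abort"};
#define TOY_NKW 2
#define TOY_MAX_KW_LEN 5

/* Invented stand-in for the generated hash: maps "add" -> 0 and
 * "abort" -> 1; other inputs may land anywhere, including out of range. */
static int toy_hash(const char *str, size_t len)
{
    (void) str;
    return ((int) len - 3) / 2;
}

int toy_keyword_lookup(const char *str)
{
    size_t len = strlen(str);
    int h;
    const char *kw;

    if (len > TOY_MAX_KW_LEN)
        return -1;              /* too long to be any keyword */

    h = toy_hash(str, len);
    if (h < 0 || h >= TOY_NKW)
        return -1;              /* out of range: certainly not a keyword */

    /* Verify the single candidate, downcasing ASCII A-Z on the fly;
     * tolower() is avoided because of locale surprises (e.g. Turkish). */
    kw = toy_kws[h];
    while (*str != '\0')
    {
        char ch = *str++;

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        if (ch != *kw++)
            return -1;
    }
    return (*kw == '\0') ? h : -1;
}
```

Compared with the old code, there is no NAMEDATALEN scratch buffer and no binary search: one hash, one candidate, one pass over the input.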
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Hi, On 2019-01-08 13:41:16 -0500, John Naylor wrote: > On Tue, Jan 8, 2019 at 12:06 PM Andrew Dunstan > wrote: > > On 1/7/19 7:52 PM, Andres Freund wrote: > > > Builtin functions for one, which we'd swatted down last time round due > > > to gperf's deficiencies. > > Do you mean the fmgr table? Not the entire fmgr table, but just the builtin oid index, generated by the following section: # Create the fmgr_builtins table, collect data for fmgr_builtin_oid_index print $tfh "\nconst FmgrBuiltin fmgr_builtins[] = {\n"; my %bmap; $bmap{'t'} = 'true'; $bmap{'f'} = 'false'; my @fmgr_builtin_oid_index; my $fmgr_count = 0; foreach my $s (sort { $a->{oid} <=> $b->{oid} } @fmgr) { print $tfh " { $s->{oid}, $s->{nargs}, $bmap{$s->{strict}}, $bmap{$s->{retset}}, \"$s->{prosrc}\", $s->{prosrc} }"; $fmgr_builtin_oid_index[ $s->{oid} ] = $fmgr_count++; if ($fmgr_count <= $#fmgr) { print $tfh ",\n"; } else { print $tfh "\n"; } } print $tfh "};\n"; print $tfh qq| const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin)); |; The generated fmgr_builtin_oid_index is pretty sparse, and a more dense hashtable might e.g. be more efficient from a cache perspective. Greetings, Andres Freund
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Tue, Jan 8, 2019 at 12:06 PM Andrew Dunstan wrote: > On 1/7/19 7:52 PM, Andres Freund wrote: > > Builtin functions for one, which we'd swatted down last time round due > > to gperf's deficiencies. Do you mean the fmgr table? > >> But in any case, that sounds like a task for someone with > >> more sense of Perl style than I have. > > John, any chance you could help out with that... :) > > If he doesn't I will. I'll take a crack at separating into a module. I'll wait a bit in case there are any stylistic suggestions on the patch as it stands.
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 1/7/19 7:52 PM, Andres Freund wrote: > Hi, > > On 2019-01-07 19:37:51 -0500, Tom Lane wrote: >> Andres Freund writes: >>> Hm, shouldn't we extract the perfect hash generation into a perl module >>> or such? It seems that there's plenty of other possible uses for it. >> Such as? > Builtin functions for one, which we'd swatted down last time round due > to gperf's deficiencies. But I think there's plenty more potential, > e.g. it'd make sense from a performance POV to use a perfect hash > function for locks on builtin objects (the hashtable for lookups therein > shows up prominently in a fair number of profiles, and they are a large > percentage of the acquisitions). I'm certain there's plenty more, I've > not thought too much about it. > Yeah, this is pretty neat. >> But in any case, that sounds like a task for someone with >> more sense of Perl style than I have. > John, any chance you could help out with that... :) > If he doesn't I will. cheers andrew -- Andrew Dunstan  https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Hi, On 2019-01-07 19:37:51 -0500, Tom Lane wrote: > Andres Freund writes: > > Hm, shouldn't we extract the perfect hash generation into a perl module > > or such? It seems that there's plenty of other possible uses for it. > > Such as? Builtin functions for one, which we'd swatted down last time round due to gperf's deficiencies. But I think there's plenty more potential, e.g. it'd make sense from a performance POV to use a perfect hash function for locks on builtin objects (the hashtable for lookups therein shows up prominently in a fair number of profiles, and they are a large percentage of the acquisitions). I'm certain there's plenty more, I've not thought too much about it. > But in any case, that sounds like a task for someone with > more sense of Perl style than I have. John, any chance you could help out with that... :) Greetings, Andres Freund
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Andres Freund writes: > Hm, shouldn't we extract the perfect hash generation into a perl module > or such? It seems that there's plenty of other possible uses for it. Such as? But in any case, that sounds like a task for someone with more sense of Perl style than I have. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
I wrote: > Probably there's a lot to be criticized about the Perl style below; > anybody feel a need to rewrite it? Here's a somewhat better version. I realized that I was being too slavishly tied to the data structures used in the C version; in Perl it's easier to manage the lists of edges as hashes. I can't see any need to distinguish left and right edge sets, either, so this just has one such hash per vertex. Also, it seems to me that we *can* make intelligent use of unused hashtable entries to exit early on many non-keyword inputs. The reason the existing code fails to do so is that it computes the sums and differences of hashtable entries in unsigned modulo arithmetic; but if we make the hashtable entries signed, we can set them up as exact differences and drop the final modulo operation in the hash function. Then, any out-of-range sum must indicate an input that is not a keyword (because it is touching a pair of hashtable entries that didn't go together) and we can exit early from the caller. This in turn lets us mark unused hashtable entries with large values to ensure that sums involving them will be out of range. A weak spot in that argument is that it's not entirely clear how large the differences can get --- with an unlucky series of collisions, maybe they could get large enough to overflow int16? I don't think that's likely for the size of problem this script is going to encounter, so I just put in an error check for it. But it could do with closer analysis before deciding that this is a general-purpose solution. regards, tom lane diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c index d72842e..9dc1fee 100644 *** a/src/common/kwlookup.c --- b/src/common/kwlookup.c *** *** 35,94 * receive a different case-normalization mapping. */ int ! ScanKeywordLookup(const char *text, const ScanKeywordList *keywords) { ! int len, ! i; ! char word[NAMEDATALEN]; ! const char *kw_string; ! const uint16 *kw_offsets; ! const uint16 *low; ! const uint16 *high; ! ! 
len = strlen(text); if (len > keywords->max_kw_len) ! return -1;/* too long to be any keyword */ ! ! /* We assume all keywords are shorter than NAMEDATALEN. */ ! Assert(len < NAMEDATALEN); /* ! * Apply an ASCII-only downcasing. We must not use tolower() since it may ! * produce the wrong translation in some locales (eg, Turkish). */ ! for (i = 0; i < len; i++) ! { ! char ch = text[i]; ! if (ch >= 'A' && ch <= 'Z') ! ch += 'a' - 'A'; ! word[i] = ch; ! } ! word[len] = '\0'; /* ! * Now do a binary search using plain strcmp() comparison. */ ! kw_string = keywords->kw_string; ! kw_offsets = keywords->kw_offsets; ! low = kw_offsets; ! high = kw_offsets + (keywords->num_keywords - 1); ! while (low <= high) { ! const uint16 *middle; ! int difference; ! middle = low + (high - low) / 2; ! difference = strcmp(kw_string + *middle, word); ! if (difference == 0) ! return middle - kw_offsets; ! else if (difference < 0) ! low = middle + 1; ! else ! high = middle - 1; } ! return -1; } --- 35,89 * receive a different case-normalization mapping. */ int ! ScanKeywordLookup(const char *str, const ScanKeywordList *keywords) { ! size_t len; ! int h; ! const char *kw; + /* + * Reject immediately if too long to be any keyword. This saves useless + * hashing and downcasing work on long strings. + */ + len = strlen(str); if (len > keywords->max_kw_len) ! return -1; /* ! * Compute the hash function. We assume it was generated to produce ! * case-insensitive results. Since it's a perfect hash, we need only ! * match to the specific keyword it identifies. */ ! h = keywords->hash(str, len); ! /* ! * An out-of-range result implies no match. (This can happen for ! * non-keyword inputs because the hash function will sum two unrelated ! * hashtable entries.) ! */ ! if (h < 0 || h >= keywords->num_keywords) ! return -1; /* ! * Compare character-by-character to see if we have a match, applying an ! * ASCII-only downcasing to the input characters. We must not use ! 
* tolower() since it may produce the wrong translation in some locales ! * (eg, Turkish). */ ! kw = GetScanKeyword(h, keywords); ! while (*str != '\0') { ! char ch = *str++; ! if (ch >= 'A' && ch <= 'Z') ! ch += 'a' - 'A'; ! if (ch != *kw++) ! return -1; } + if (*kw != '\0') + return -1; ! /* Success! */ ! return h; } diff --git a/src/include/common/kwlookup.h b/src/include/common/kwlookup.h index 39efb35..dbff367 100644 *** a/src/include/common/kwlookup.h --- b/src/include/common/kwlookup.h *** *** 14,19 --- 14,22 #ifndef KWLOOKUP_H #define KWLOOKUP_H + /* Hash function used by ScanKeywordLookup */ + typedef int (*ScanKeywordHashFunc) (const void *key, size_t keylen); + /*
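[Editor's sketch] To make the signed-difference trick described above concrete, here is a minimal standalone example of mine, not the generated code: a hand-built table for the two-keyword set {"as", "at"}, with multipliers 31/127 chosen arbitrarily and unused slots set to a large sentinel so that any sum touching them falls outside 0..num_keywords-1 and the caller can reject without a string comparison.

```c
#include <stddef.h>
#include <string.h>

/*
 * Toy perfect-hash table for the keyword set {"as", "at"}.
 * Entries are signed exact differences (the trick described above), so
 * no final modulo is needed; unused slots hold a large sentinel so that
 * any sum touching them is out of range and the caller can exit early.
 */
#define NUM_KEYWORDS 2
#define GSIZE 7

static const short g[GSIZE] = {0, 0, 0, 1, 9999, 9999, 9999};
static const char *const keywords[NUM_KEYWORDS] = {"as", "at"};

static int
keyword_hash(const char *key, size_t len)
{
    unsigned int a = 0, b = 0;

    while (len--)
    {
        unsigned char c = (unsigned char) *key++;

        a = a * 31 + c;
        b = b * 127 + c;
    }
    return (int) g[a % GSIZE] + (int) g[b % GSIZE];
}

/* Returns keyword index, or -1 if not a keyword. */
static int
keyword_lookup(const char *word)
{
    int h = keyword_hash(word, strlen(word));

    /* Out-of-range sum: the input touched unrelated (or unused) slots. */
    if (h < 0 || h >= NUM_KEYWORDS)
        return -1;
    /* An in-range hash still needs one exact comparison. */
    return strcmp(keywords[h], word) == 0 ? h : -1;
}
```

An in-range hash can still be a coincidence (for instance, "by" happens to sum to 0 here), which is why the final comparison is kept, exactly as in the patch.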
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Hi, On 2019-01-07 16:11:04 -0500, Tom Lane wrote: > I wrote: > > I took a quick look through the NetBSD nbperf sources at > > http://cvsweb.netbsd.org/bsdweb.cgi/src/usr.bin/nbperf/ > > and I concur with your judgment that we could manage translating > > that into Perl, especially if we only implement the parts we need. > > Here's an implementation of that, using the hash functions you showed > upthread. The speed of the Perl script seems to be pretty acceptable; > less than 100ms to handle the main SQL keyword list, on my machine. > Yeah, the C version might be less than 1ms, but I don't think that > we need to put up with non-Perl build tooling for that. > > Using the same test case as before (parsing information_schema.sql), > I get runtimes around 3560 ms, a shade better than my jury-rigged > prototype. > > Probably there's a lot to be criticized about the Perl style below; > anybody feel a need to rewrite it? Hm, shouldn't we extract the perfect hash generation into a perl module or such? It seems that there's plenty other possible uses for it. Greetings, Andres Freund
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
I wrote: > I took a quick look through the NetBSD nbperf sources at > http://cvsweb.netbsd.org/bsdweb.cgi/src/usr.bin/nbperf/ > and I concur with your judgment that we could manage translating > that into Perl, especially if we only implement the parts we need. Here's an implementation of that, using the hash functions you showed upthread. The speed of the Perl script seems to be pretty acceptable; less than 100ms to handle the main SQL keyword list, on my machine. Yeah, the C version might be less than 1ms, but I don't think that we need to put up with non-Perl build tooling for that. Using the same test case as before (parsing information_schema.sql), I get runtimes around 3560 ms, a shade better than my jury-rigged prototype. Probably there's a lot to be criticized about the Perl style below; anybody feel a need to rewrite it? regards, tom lane diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c index d72842e..445b99d 100644 *** a/src/common/kwlookup.c --- b/src/common/kwlookup.c *** *** 35,94 * receive a different case-normalization mapping. */ int ! ScanKeywordLookup(const char *text, const ScanKeywordList *keywords) { ! int len, ! i; ! char word[NAMEDATALEN]; ! const char *kw_string; ! const uint16 *kw_offsets; ! const uint16 *low; ! const uint16 *high; ! ! len = strlen(text); if (len > keywords->max_kw_len) ! return -1;/* too long to be any keyword */ ! ! /* We assume all keywords are shorter than NAMEDATALEN. */ ! Assert(len < NAMEDATALEN); /* ! * Apply an ASCII-only downcasing. We must not use tolower() since it may ! * produce the wrong translation in some locales (eg, Turkish). */ ! for (i = 0; i < len; i++) ! { ! char ch = text[i]; ! ! if (ch >= 'A' && ch <= 'Z') ! ch += 'a' - 'A'; ! word[i] = ch; ! } ! word[len] = '\0'; /* ! * Now do a binary search using plain strcmp() comparison. */ ! kw_string = keywords->kw_string; ! kw_offsets = keywords->kw_offsets; ! low = kw_offsets; ! high = kw_offsets + (keywords->num_keywords - 1); ! 
while (low <= high) { ! const uint16 *middle; ! int difference; ! middle = low + (high - low) / 2; ! difference = strcmp(kw_string + *middle, word); ! if (difference == 0) ! return middle - kw_offsets; ! else if (difference < 0) ! low = middle + 1; ! else ! high = middle - 1; } ! return -1; } --- 35,81 * receive a different case-normalization mapping. */ int ! ScanKeywordLookup(const char *str, const ScanKeywordList *keywords) { ! size_t len; ! uint32 h; ! const char *kw; + /* + * Reject immediately if too long to be any keyword. This saves useless + * hashing and downcasing work on long strings. + */ + len = strlen(str); if (len > keywords->max_kw_len) ! return -1; /* ! * Compute the hash function. We assume it was generated to produce ! * case-insensitive results. Since it's a perfect hash, we need only ! * match to the specific keyword it identifies. */ ! h = keywords->hash(str, len); /* ! * Compare character-by-character to see if we have a match, applying an ! * ASCII-only downcasing to the input characters. We must not use ! * tolower() since it may produce the wrong translation in some locales ! * (eg, Turkish). */ ! kw = GetScanKeyword(h, keywords); ! while (*str != '\0') { ! char ch = *str++; ! if (ch >= 'A' && ch <= 'Z') ! ch += 'a' - 'A'; ! if (ch != *kw++) ! return -1; } + if (*kw != '\0') + return -1; ! /* Success! */ ! return h; } diff --git a/src/include/common/kwlookup.h b/src/include/common/kwlookup.h index 39efb35..a6609ee 100644 *** a/src/include/common/kwlookup.h --- b/src/include/common/kwlookup.h *** *** 14,19 --- 14,22 #ifndef KWLOOKUP_H #define KWLOOKUP_H + /* Hash function used by ScanKeywordLookup */ + typedef uint32 (*ScanKeywordHashFunc) (const void *key, size_t keylen); + /* * This struct contains the data needed by ScanKeywordLookup to perform a * search within a set of keywords. 
The contents are typically generated by *** typedef struct ScanKeywordList *** 23,28 --- 26,32 { const char *kw_string; /* all keywords in order, separated by \0 */ const uint16 *kw_offsets; /* offsets to the start of each keyword */ + ScanKeywordHashFunc hash; /* perfect hash function for keywords */ int num_keywords; /* number of keywords */ int max_kw_len; /* length of longest keyword */ } ScanKeywordList; diff --git a/src/interfaces/ecpg/preproc/Makefile b/src/interfaces/ecpg/preproc/Makefile index b5b74a3..abfe3cc 100644 *** a/src/interfaces/ecpg/preproc/Makefile --- b/src/interfaces/ecpg/preproc/Makefile *** preproc.y: ../../../backend/parser/gram. *** 57,63 # generate keyword headers c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST) ! $(PERL) $(GEN_KEYWORDL
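[Editor's sketch] For readers who don't want to dig through nbperf itself, the core of the CHM construction that the script implements can be sketched in a few dozen lines of C. This is a toy of my own: the 31/127 multipliers and table size are arbitrary, and a real generator retries with fresh hash seeds when the edge graph has a cycle, which this sketch merely reports.

```c
#include <string.h>

/*
 * Toy CHM-style perfect-hash construction: each keyword becomes an edge
 * (h1(kw), h2(kw)) in a graph over GSIZE vertices.  If the graph is
 * acyclic, a BFS over each tree can fix up g[] so that
 * g[h1] + g[h2] == the keyword's index.
 */
#define GSIZE 11

static const char *const keywords[] = {"as", "at", "by"};
#define NUM_KEYWORDS 3

static int g[GSIZE];

static void
edge_for(const char *s, int *v1, int *v2)
{
    unsigned int a = 0, b = 0;

    for (; *s; s++)
    {
        a = a * 31 + (unsigned char) *s;
        b = b * 127 + (unsigned char) *s;
    }
    *v1 = a % GSIZE;
    *v2 = b % GSIZE;
}

/* Returns 1 on success, 0 if the edge graph turned out to be cyclic. */
static int
build(void)
{
    int e[NUM_KEYWORDS][2];
    int visited[GSIZE] = {0};
    int queue[GSIZE];
    int i, v;

    for (i = 0; i < NUM_KEYWORDS; i++)
        edge_for(keywords[i], &e[i][0], &e[i][1]);

    for (v = 0; v < GSIZE; v++)
    {
        int qh = 0, qt = 0;

        if (visited[v])
            continue;
        visited[v] = 1;
        g[v] = 0;
        queue[qt++] = v;
        while (qh < qt)
        {
            int u = queue[qh++];

            for (i = 0; i < NUM_KEYWORDS; i++)
            {
                int other;

                if (e[i][0] == u)
                    other = e[i][1];
                else if (e[i][1] == u)
                    other = e[i][0];
                else
                    continue;
                if (!visited[other])
                {
                    visited[other] = 1;
                    g[other] = i - g[u];    /* force g[h1] + g[h2] == i */
                    queue[qt++] = other;
                }
                else if (g[e[i][0]] + g[e[i][1]] != i)
                    return 0;   /* cycle: a real generator would retry */
            }
        }
    }
    return 1;
}

static int
lookup(const char *word)
{
    int v1, v2, h;

    edge_for(word, &v1, &v2);
    h = ((g[v1] + g[v2]) % NUM_KEYWORDS + NUM_KEYWORDS) % NUM_KEYWORDS;
    /* the perfect hash only narrows to one candidate; still must compare */
    return strcmp(keywords[h], word) == 0 ? h : -1;
}
```

The signed-difference refinement described upthread goes one step further by dropping the modulo in lookup() and marking unused g[] slots with sentinels, so out-of-range sums reject non-keywords early.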
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 1/6/19, Tom Lane wrote: > I've pushed that version (v8 + max_kw_len); if the buildfarm doesn't > fall over, we can move on with looking at hashing. Thank you. The API adjustment looks good, and I'm glad that splitting out the aux info led to even further cleanups. -John Naylor
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Joerg Sonnenberger writes: > On Sun, Jan 06, 2019 at 02:29:05PM -0500, Tom Lane wrote: >> * We should extend the ScanKeywordList representation to include a >> field holding the longest keyword length in the table, which >> gen_keywordlist.pl would have no trouble providing. Then we could >> skip downcasing and/or hashing for any word longer than that, replacing >> the current NAMEDATALEN test, and thereby putting a tight bound on >> the cost of downcasing and/or hashing. > Correct, possibly even have an array for each class of keywords. I added that change to v8 and noted a further small improvement in my test case. That probably says something about the prevalence of long identifiers in information_schema.sql ;-), but anyway we can figure it's not a net loss. I've pushed that version (v8 + max_kw_len); if the buildfarm doesn't fall over, we can move on with looking at hashing. I took a quick look through the NetBSD nbperf sources at http://cvsweb.netbsd.org/bsdweb.cgi/src/usr.bin/nbperf/ and I concur with your judgment that we could manage translating that into Perl, especially if we only implement the parts we need. I'm curious what further changes you've made locally, and what parameters you were using. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Joerg Sonnenberger writes: > On Sun, Jan 06, 2019 at 02:29:05PM -0500, Tom Lane wrote: >> So we probably can't have inlined hashing code --- I imagine the >> hash generator needs the flexibility to pick different values of >> those multipliers. > Right now, only the initial values are randomized. Picking a different > set of hash functions is possible, but someone that should be done only > if there is an actual need. That was what I meant with stronger mixing > might be necessary for "annoying" keyword additions. Hmm. I'm still leaning towards using generated, out-of-line hash functions though, because then we could have a generator switch indicating whether to apply the |0x20 case coercion or not. (I realize that we could blow off that consideration and use a case-insensitive hash function all the time, but it seems cleaner to me not to make assumptions about how variable the hash function parameters will need to be.) > There are two ways for dealing with it: > (1) Have one big hash table with all the various keywords and a class > mask stored. If there is enough overlap between the keyword tables, it > can significantly reduce the amount of space needed. In terms of code > complexity, it adds one class check at the end, i.e. a bitmap test. No, this would be a bad idea IMO, because it makes the core, plpgsql, and ecpg keyword sets all interdependent. If you add a keyword to any one of those and forget to rebuild the other components, you got trouble. Maybe we could make that reliable, but I don't think it's worth fooling with for hypothetical benefit. Also, it'd make the net space usage more not less, because each of those executables/shlibs would contain copies of all the keywords for the other ones' needs. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Sun, Jan 06, 2019 at 02:29:05PM -0500, Tom Lane wrote: > * It's too bad that the hash function doesn't have a return convention > that allows distinguishing "couldn't possibly match any keyword" from > "might match keyword 0". I imagine a lot of the zero entries in its > hashtable could be interpreted as the former, so at the cost of perhaps > a couple more if-tests we could skip work at the caller. As this patch > stands, we could only skip the strcmp() so it might not be worth the > trouble --- but if we use Joerg's |0x20 hack then we could hash before > downcasing, allowing us to skip the downcasing step if the word couldn't > possibly be a keyword. Likely that would be worth the trouble. The hash function itself doesn't have enough data in it to know whether it's a match or not. A strcmp or memcmp at the end will always be necessary if you don't already know that it is a keyword. > * We should extend the ScanKeywordList representation to include a > field holding the longest keyword length in the table, which > gen_keywordlist.pl would have no trouble providing. Then we could > skip downcasing and/or hashing for any word longer than that, replacing > the current NAMEDATALEN test, and thereby putting a tight bound on > the cost of downcasing and/or hashing. Correct, possibly even have an array for each class of keywords. > * If we do hash first, then we could replace the downcasing loop and > strcmp with an integrated loop that downcases and compares one > character at a time, removing the need for the NAMEDATALEN-sized > buffer variable. This is also an option. Assuming that the character set for keywords doesn't change (letter or _), one 64-bit bit test per input character would ensure that the |0x20 hack gives a correct result for comparison as well. Any other input would be an instant mismatch. If digits are valid keyword characters, it would be two tests. > * I think it matters to the speed of the hashing loop that the > magic multipliers are hard-coded. 
(Examining the assembly code > shows that, at least on my x86_64 hardware, gcc prefers to use > shift-and-add sequences here instead of multiply instructions.) Right, that's one of the reasons for choosing them. The other is that it gives decent mixing for ASCII-only input. At the moment, they are hard-coded. > So we probably can't have inlined hashing code --- I imagine the > hash generator needs the flexibility to pick different values of > those multipliers. Right now, only the initial values are randomized. Picking a different set of hash functions is possible, but that should be done only if there is an actual need. That was what I meant when I said stronger mixing might be necessary for "annoying" keyword additions. > I envision making this work by having > gen_keywordlist.pl emit a function definition for the hash step and > include a function pointer to it in ScanKeywordList. That extra > function call will make things fractionally slower than what I have > here, but I don't think it will change the conclusions any. This > approach would also greatly alleviate the concern I had yesterday > about ecpg's c_keywords.c having a second copy of the hashing code; > what it would have is its own generated function, which isn't much > of a problem. There are two ways for dealing with it: (1) Have one big hash table with all the various keywords and a class mask stored. If there is enough overlap between the keyword tables, it can significantly reduce the amount of space needed. In terms of code complexity, it adds one class check at the end, i.e. a bitmap test. (2) Build independent hash tables for each input class. A bit simpler to manage, but can result in a bit of wasted space. From the generator side, it doesn't matter which choice is taken. 
> * Given that the generator's runtime is negligible when coded in C, > I suspect that we might able to tolerate the speed hit from translating > it to Perl, and frankly I'd much rather do that than cope with the > consequences of including C code in our build process. I'm just not fluent enough in Perl to be much help for that, but I can sit down and write a trivial Python version of it :) There are a couple of changes that are useful to have in this context, e.g. the ability to directly provide the offsets in the result table to allow dropping the index -> offset table completely. Joerg
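[Editor's sketch] The "one 64-bit bit test per input character" Joerg describes can be sketched like this; the mask construction is my own illustration of the idea, not nbperf code. After c | 0x20, every valid keyword character ([A-Za-z_]) lands in a single 64-value window, so one shift-and-test both validates the character class and guarantees the |0x20 case-fold is correct.

```c
#include <stdint.h>

/*
 * One 64-bit test to check that a byte can be part of a keyword
 * ([A-Za-z_]) *and* that c | 0x20 case-folds it correctly.
 * After c | 0x20, letters land on 'a'..'z' (97..122) and '_' (0x5F)
 * lands on 0x7F (127); subtracting 0x40 maps those to bits 33..58
 * and 63 of a 64-bit mask.  (DEL, 0x7F, also maps to bit 63, but a
 * false positive is harmless: the final memcmp still decides.)
 */
static int
is_keyword_char(unsigned char c)
{
    /* bits 33..58 for 'a'..'z', bit 63 for '_' */
    const uint64_t mask = (((UINT64_C(1) << 26) - 1) << 33) | (UINT64_C(1) << 63);
    unsigned int x = (unsigned int) (c | 0x20) - 0x40;

    /* x >= 64 (including wrapped-around values) is an instant mismatch */
    return x < 64 && ((mask >> x) & 1);
}
```

Digits fall outside this window ('0'..'9' | 0x20 stays at 0x30..0x39, which underflows after the subtraction), so admitting them would need a second window and test, matching Joerg's remark above.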
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Joerg Sonnenberger writes: > On Mon, Jan 07, 2019 at 03:11:55AM +1300, David Rowley wrote: >> What I'm most interested in is how long it took to generate the hash >> function in hash2.c? > It's within the noise floor of time(1) on my laptop, e.g. ~1ms. I decided to do some simple performance measurements to see if we're actually getting any useful results here. I set up a test case that just runs raw_parser in a tight loop: while (count-- > 0) { List *parsetree_list; MemoryContext oldcontext; oldcontext = MemoryContextSwitchTo(mycontext); parsetree_list = pg_parse_query(query_string); MemoryContextSwitchTo(oldcontext); MemoryContextReset(mycontext); CHECK_FOR_INTERRUPTS(); } and exercised it on the contents of information_schema.sql. I think that's a reasonably representative test case considering that we're only examining the speed of the flex+bison stages. (Since it's mostly DDL, including parse analysis might be a bit unlike normal workloads, but for raw parsing it should be fine.) On my workstation, in a non-cassert build, HEAD requires about 4700 ms for 1000 iterations (with maybe 1% cross-run variation). Applying the v8 patch I posted yesterday, the time drops to ~4520 ms or about a 4% savings. So not a lot, but it's pretty repeatable, and it shows that reducing the cache footprint of keyword lookup is worth something. I then tried plastering in Joerg's example hash function, as in the attached delta patch on top of v8. This is *not* a usable patch; it breaks plpgsql and ecpg, because ScanKeywordLookup no longer works for non-core keyword sets. But that doesn't matter for the information_schema test case, and on that I find the runtime drops to ~3820 ms, or 19% better than HEAD. So this is definitely an idea worth pursuing. Some additional thoughts: * It's too bad that the hash function doesn't have a return convention that allows distinguishing "couldn't possibly match any keyword" from "might match keyword 0". 
I imagine a lot of the zero entries in its hashtable could be interpreted as the former, so at the cost of perhaps a couple more if-tests we could skip work at the caller. As this patch stands, we could only skip the strcmp() so it might not be worth the trouble --- but if we use Joerg's |0x20 hack then we could hash before downcasing, allowing us to skip the downcasing step if the word couldn't possibly be a keyword. Likely that would be worth the trouble. * We should extend the ScanKeywordList representation to include a field holding the longest keyword length in the table, which gen_keywordlist.pl would have no trouble providing. Then we could skip downcasing and/or hashing for any word longer than that, replacing the current NAMEDATALEN test, and thereby putting a tight bound on the cost of downcasing and/or hashing. * If we do hash first, then we could replace the downcasing loop and strcmp with an integrated loop that downcases and compares one character at a time, removing the need for the NAMEDATALEN-sized buffer variable. * I think it matters to the speed of the hashing loop that the magic multipliers are hard-coded. (Examining the assembly code shows that, at least on my x86_64 hardware, gcc prefers to use shift-and-add sequences here instead of multiply instructions.) So we probably can't have inlined hashing code --- I imagine the hash generator needs the flexibility to pick different values of those multipliers. I envision making this work by having gen_keywordlist.pl emit a function definition for the hash step and include a function pointer to it in ScanKeywordList. That extra function call will make things fractionally slower than what I have here, but I don't think it will change the conclusions any. This approach would also greatly alleviate the concern I had yesterday about ecpg's c_keywords.c having a second copy of the hashing code; what it would have is its own generated function, which isn't much of a problem. 
* Given that the generator's runtime is negligible when coded in C, I suspect that we might be able to tolerate the speed hit from translating it to Perl, and frankly I'd much rather do that than cope with the consequences of including C code in our build process. I'm eagerly awaiting seeing Joerg's code, but I think in the meantime I'm going to go look up NetBSD's nbperf to get an idea of how painful it might be to do in Perl. (Now, bearing in mind that I'm not exactly fluent in Perl, there are probably other people around here who could produce a better translation ...) regards, tom lane diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c index db62623..8c55f40 100644 *** a/src/common/kwlookup.c --- b/src/common/kwlookup.c *** *** 17,22 --- 17,149 #include "common/kwlookup.h" + static uint32 + hash(const void *key, size_t keylen) + { + static const uint16 g[881] = { + 0x015b, 0x, 0x0070, 0x
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Mon, Jan 07, 2019 at 03:11:55AM +1300, David Rowley wrote: > What I'm most interested in is how long it took to generate the hash > function in hash2.c? It's within the noise floor of time(1) on my laptop, e.g. ~1ms. Joerg
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Sat, 5 Jan 2019 at 09:20, John Naylor wrote: > > On 1/3/19, Joerg Sonnenberger wrote: > > Hello John, > > I was pointed at your patch on IRC and decided to look into adding my > > own pieces. What I can provide you is a fast perfect hash function > > generator. I've attached a sample hash function based on the current > > main keyword list. hash() essentially gives you the number of the only > > possible match, a final strcmp/memcmp is still necessary to verify that > > it is an actual keyword though. The |0x20 can be dropped if all cases > > have pre-lower-cased the input already. This would replace the binary > > search in the lookup functions. Returning offsets directly would be easy > > as well. That allows writing a single string where each entry is prefixed > > with a type mask, the token id, the length of the keyword and the actual > > keyword text. Does that sound useful to you? > > Judging by previous responses, there is still interest in using > perfect hash functions, so thanks for this. I'm not knowledgeable > enough to judge its implementation, so I'll leave that for others. Well, I'm quite impressed by the resulting hash function. The resulting hash value can be used directly to index the existing 440 element ScanKeywords[] array (way better than the 1815 element array Andrew got from gperf). If we also happened to store the length of the keyword in that array then we could compare the length of the word after hashing. If the length is the same then we could perform a memcmp() to confirm the match, which should be a little cheaper than a strcmp(), and we should be able to store the length for free on 64-bit machines. If the length is not the same then it's not a keyword. It may also save some cycles to determine the input word's length at the same time as lowering it. 
The keyword length could also be easily determined by changing the PG_KEYWORD macro to become: #define PG_KEYWORD(a,b,c) {a,0,c,sizeof(a)-1}, after, of course adding a new field to the ScanKeyword struct. What I'm most interested in is how long it took to generate the hash function in hash2.c? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
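[Editor's sketch] The length-in-the-struct idea above can be sketched as follows. The struct layout and PG_KEYWORD shape follow David's outline; the token/category values here are dummies, and the match_keyword() helper is illustrative, not an actual PostgreSQL function.

```c
#include <string.h>

/*
 * Keyword entry that carries its own length, so the post-hash check can
 * reject on length before comparing any bytes, and use memcmp() instead
 * of strcmp() when the lengths do match.
 */
typedef struct ScanKeyword
{
    const char *name;
    short       value;          /* grammar token (dummy 0 here) */
    short       category;
    unsigned char len;          /* keyword length, computed at compile time */
} ScanKeyword;

/* sizeof(a) - 1 is strlen(a) for a string literal, evaluated by the compiler */
#define PG_KEYWORD(a,b,c) {a, 0, c, sizeof(a) - 1},

static const ScanKeyword ScanKeywords[] = {
    PG_KEYWORD("as", 0, 0)
    PG_KEYWORD("at", 0, 0)
};

/*
 * Given the candidate index h from a perfect hash, confirm the match.
 * The caller is assumed to have computed wordlen while downcasing the
 * input, as suggested above.
 */
static int
match_keyword(int h, const char *word, size_t wordlen)
{
    const ScanKeyword *kw = &ScanKeywords[h];

    if (wordlen != kw->len)
        return -1;              /* cheap rejection, no byte comparison */
    return memcmp(kw->name, word, wordlen) == 0 ? h : -1;
}
```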
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
I wrote: > I spent some time hacking on this today, and I think it's committable > now, but I'm putting it back up in case anyone wants to have another > look (and also so the cfbot can check it on Windows). ... and indeed, the cfbot doesn't like it. Here's v8, with the missing addition to Mkvcbuild.pm. regards, tom lane diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index e8ef966..9131991 100644 *** a/contrib/pg_stat_statements/pg_stat_statements.c --- b/contrib/pg_stat_statements/pg_stat_statements.c *** fill_in_constant_lengths(pgssJumbleState *** 3075,3082 /* initialize the flex scanner --- should match raw_parser() */ yyscanner = scanner_init(query, &yyextra, ! ScanKeywords, ! NumScanKeywords); /* we don't want to re-emit any escape string warnings */ yyextra.escape_string_warning = false; --- 3075,3082 /* initialize the flex scanner --- should match raw_parser() */ yyscanner = scanner_init(query, &yyextra, ! &ScanKeywords, ! ScanKeywordTokens); /* we don't want to re-emit any escape string warnings */ yyextra.escape_string_warning = false; diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c index 7e9b122..4c0c258 100644 *** a/src/backend/parser/parser.c --- b/src/backend/parser/parser.c *** raw_parser(const char *str) *** 41,47 /* initialize the flex scanner */ yyscanner = scanner_init(str, &yyextra.core_yy_extra, ! ScanKeywords, NumScanKeywords); /* base_yylex() only needs this much initialization */ yyextra.have_lookahead = false; --- 41,47 /* initialize the flex scanner */ yyscanner = scanner_init(str, &yyextra.core_yy_extra, ! 
&ScanKeywords, ScanKeywordTokens); /* base_yylex() only needs this much initialization */ yyextra.have_lookahead = false; diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l index fbeb86f..e1cae85 100644 *** a/src/backend/parser/scan.l --- b/src/backend/parser/scan.l *** bool escape_string_warning = true; *** 67,72 --- 67,87 bool standard_conforming_strings = true; /* + * Constant data exported from this file. This array maps from the + * zero-based keyword numbers returned by ScanKeywordLookup to the + * Bison token numbers needed by gram.y. This is exported because + * callers need to pass it to scanner_init, if they are using the + * standard keyword list ScanKeywords. + */ + #define PG_KEYWORD(kwname, value, category) value, + + const uint16 ScanKeywordTokens[] = { + #include "parser/kwlist.h" + }; + + #undef PG_KEYWORD + + /* * Set the type of YYSTYPE. */ #define YYSTYPE core_YYSTYPE *** other . *** 504,521 * We will pass this along as a normal character string, * but preceded with an internally-generated "NCHAR". */ ! const ScanKeyword *keyword; SET_YYLLOC(); yyless(1); /* eat only 'n' this time */ ! keyword = ScanKeywordLookup("nchar", ! yyextra->keywords, ! yyextra->num_keywords); ! if (keyword != NULL) { ! yylval->keyword = keyword->name; ! return keyword->value; } else { --- 519,536 * We will pass this along as a normal character string, * but preceded with an internally-generated "NCHAR". */ ! int kwnum; SET_YYLLOC(); yyless(1); /* eat only 'n' this time */ ! kwnum = ScanKeywordLookup("nchar", ! yyextra->keywordlist); ! if (kwnum >= 0) { ! yylval->keyword = GetScanKeyword(kwnum, ! yyextra->keywordlist); ! return yyextra->keyword_tokens[kwnum]; } else { *** other . *** 1021,1039 {identifier} { ! const ScanKeyword *keyword; char *ident; SET_YYLLOC(); /* Is it a keyword? */ ! keyword = ScanKeywordLookup(yytext, ! yyextra->keywords, ! yyextra->num_keywords); ! if (keyword != NULL) { ! yylval->keyword = keyword->name; ! 
return keyword->value; } /* --- 1036,1054 {identifier} { ! int kwnum; char *ident; SET_YYLLOC(); /* Is it a keyword? */ ! kwnum = ScanKeywordLookup(yytext, ! yyextra->keywordlist); ! if (kwnum >= 0) { ! yylval->keyword = GetScanKeyword(kwnum, ! yyextra->keywordlist); ! return yyextra->keyword_tokens[kwnum]; } /* *** scanner_yyerror(const char *message, cor *** 1142,1149 core_yyscan_t scanner_init(const char *str, core_yy_extra_type *yyext, ! const ScanKeyword *keywords, ! int num_keywords) { Size slen = strlen(str); yyscan_t scanner; --- 115
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
John Naylor writes: > [ v6-0001-Use-offset-based-keyword-lookup.patch ] I spent some time hacking on this today, and I think it's committable now, but I'm putting it back up in case anyone wants to have another look (and also so the cfbot can check it on Windows). Given the discussion about possibly switching to perfect hashing, I thought it'd be a good idea to try to make the APIs less dependent on the exact table representation. So in the attached, I created a struct ScanKeywordList that holds all the data ScanKeywordLookup needs, and the generated headers declare variables of that type, and we just pass around a pointer to that instead of passing several different things. I also went ahead with the idea of splitting the category and token data into separate arrays. That allows moving the backend token array out of src/common entirely, which I think is a good thing because of the dependency situation: we no longer need to run the bison build before we can compile src/common/keywords_srv.o. There's one remaining refactoring issue that I think we'd want to consider before trying to jack this up and wheel a perfect-hash lookup under it: where to do the downcasing transform. Right now, ecpg's c_keywords.c has its own copy of the binary-search logic because it doesn't want the downcasing transform that ScanKeywordLookup does. So unless we want it to also have a copy of the hash lookup logic, we need to rearrange that somehow. We could give ScanKeywordLookup a "bool downcase" argument, or we could refactor things so that the downcasing is done by callers if they need it (which many don't). I'm not very sure which of those three alternatives is best. 
My argument upthread that we could always do the downcasing before keyword lookup now feels a bit shaky, because I was reminded while working on this code that we actually have different downcasing rules for keywords and identifiers (yes, really), so that it's not possible for those code paths to share a downcasing transform. So the idea of moving the keyword-downcasing logic to the callers is likely to not work out quite as nicely as I thought. (This might also mean that I was overly hasty to reject Joerg's |0x20 hack. It's still an ugly hack, but it would save doing the keyword downcasing transform if we don't get a hashcode match...) regards, tom lane diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index e8ef966..9131991 100644 *** a/contrib/pg_stat_statements/pg_stat_statements.c --- b/contrib/pg_stat_statements/pg_stat_statements.c *** fill_in_constant_lengths(pgssJumbleState *** 3075,3082 /* initialize the flex scanner --- should match raw_parser() */ yyscanner = scanner_init(query, &yyextra, ! ScanKeywords, ! NumScanKeywords); /* we don't want to re-emit any escape string warnings */ yyextra.escape_string_warning = false; --- 3075,3082 /* initialize the flex scanner --- should match raw_parser() */ yyscanner = scanner_init(query, &yyextra, ! &ScanKeywords, ! ScanKeywordTokens); /* we don't want to re-emit any escape string warnings */ yyextra.escape_string_warning = false; diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c index 7e9b122..4c0c258 100644 *** a/src/backend/parser/parser.c --- b/src/backend/parser/parser.c *** raw_parser(const char *str) *** 41,47 /* initialize the flex scanner */ yyscanner = scanner_init(str, &yyextra.core_yy_extra, ! ScanKeywords, NumScanKeywords); /* base_yylex() only needs this much initialization */ yyextra.have_lookahead = false; --- 41,47 /* initialize the flex scanner */ yyscanner = scanner_init(str, &yyextra.core_yy_extra, ! 
&ScanKeywords, ScanKeywordTokens); /* base_yylex() only needs this much initialization */ yyextra.have_lookahead = false; diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l index fbeb86f..e1cae85 100644 *** a/src/backend/parser/scan.l --- b/src/backend/parser/scan.l *** bool escape_string_warning = true; *** 67,72 --- 67,87 bool standard_conforming_strings = true; /* + * Constant data exported from this file. This array maps from the + * zero-based keyword numbers returned by ScanKeywordLookup to the + * Bison token numbers needed by gram.y. This is exported because + * callers need to pass it to scanner_init, if they are using the + * standard keyword list ScanKeywords. + */ + #define PG_KEYWORD(kwname, value, category) value, + + const uint16 ScanKeywordTokens[] = { + #include "parser/kwlist.h" + }; + + #undef PG_KEYWORD + + /* * Set the type of YYSTYPE. */ #define YYSTYPE core_YYSTYPE *** other . *** 504,521 * We will pass this along as a normal character string, * but preceded with an internally-gene
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Fri, Jan 04, 2019 at 02:36:15PM -0800, Andres Freund wrote: > Hi, > > On 2019-01-04 16:43:39 -0500, Tom Lane wrote: > > Joerg Sonnenberger writes: > > >> * What's the generator written in? (if the answer's not "Perl", wedging > > >> it into our build processes might be painful) > > > > > Plain C, nothing really fancy in it. > > > > That's actually a bigger problem than you might think, because it > > doesn't fit in very nicely in a cross-compiling build: we might not > > have any C compiler at hand that generates programs that can execute > > on the build machine. That's why we prefer Perl for tools that need > > to execute during the build. However, if the code is pretty small > > and fast, maybe translating it to Perl is feasible. Or perhaps > > we could add sufficient autoconfiscation infrastructure to identify > > a native C compiler. It's not very likely that there isn't one, > > but it is possible that nothing we learned about the configured > > target compiler would apply to it :-( There is a pre-made autoconf macro for doing the basic glue for CC_FOR_BUILD, it's been used by various projects already including libXt and friends. > I think it might be ok if we included the output of the generator in the > buildtree? Not being able to add keywords while cross-compiling sounds like > an acceptable restriction to me. I assume we'd likely grow further users > of such a generator over time, and some of the input lists might be big > enough that we'd not want to force it to be recomputed on every machine. This is quite reasonable as well. I wouldn't worry about the size of the input list at all. Processing the Webster dictionary needs something less than 0.4s on my laptop for 235k entries. Joerg
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Hi, On 2019-01-04 16:43:39 -0500, Tom Lane wrote: > Joerg Sonnenberger writes: > >> * What's the generator written in? (if the answer's not "Perl", wedging > >> it into our build processes might be painful) > > > Plain C, nothing really fancy in it. > > That's actually a bigger problem than you might think, because it > doesn't fit in very nicely in a cross-compiling build: we might not > have any C compiler at hand that generates programs that can execute > on the build machine. That's why we prefer Perl for tools that need > to execute during the build. However, if the code is pretty small > and fast, maybe translating it to Perl is feasible. Or perhaps > we could add sufficient autoconfiscation infrastructure to identify > a native C compiler. It's not very likely that there isn't one, > but it is possible that nothing we learned about the configured > target compiler would apply to it :-( I think it might be ok if we included the output of the generator in the buildtree? Not being able to add keywords while cross-compiling sounds like an acceptable restriction to me. I assume we'd likely grow further users of such a generator over time, and some of the input lists might be big enough that we'd not want to force it to be recomputed on every machine. Greetings, Andres Freund
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Joerg Sonnenberger writes: > On Fri, Jan 04, 2019 at 03:31:11PM -0500, Tom Lane wrote: >> The sample hash function certainly looks great. I'm not terribly on board >> with using |0x20 as a substitute for lower-casing, but that's a minor >> detail. > Yeah, I've included that part more because I don't know the current use > cases enough. If all instances are already doing lower-casing in > advance, it is trivial to drop. I think we probably don't need that, because we'd always need to generate a lower-cased version of the input anyway: either to compare to the potential keyword match, or to use as the normalized identifier if it turns out not to be a keyword. I don't think there are any cases where it's useful to delay downcasing till after the keyword lookup. >> * What's the generator written in? (if the answer's not "Perl", wedging >> it into our build processes might be painful) > Plain C, nothing really fancy in it. That's actually a bigger problem than you might think, because it doesn't fit in very nicely in a cross-compiling build: we might not have any C compiler at hand that generates programs that can execute on the build machine. That's why we prefer Perl for tools that need to execute during the build. However, if the code is pretty small and fast, maybe translating it to Perl is feasible. Or perhaps we could add sufficient autoconfiscation infrastructure to identify a native C compiler. It's not very likely that there isn't one, but it is possible that nothing we learned about the configured target compiler would apply to it :-( >> * What license is it under? > Two clause BSD license. OK, that works, at least. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Fri, Jan 04, 2019 at 03:31:11PM -0500, Tom Lane wrote:
> John Naylor writes:
> > On 1/3/19, Joerg Sonnenberger wrote:
> >> I was pointed at your patch on IRC and decided to look into adding my
> >> own pieces. What I can provide you is a fast perfect hash function
> >> generator. I've attached a sample hash function based on the current
> >> main keyword list. hash() essentially gives you the number of the only
> >> possible match, a final strcmp/memcmp is still necessary to verify that
> >> it is an actual keyword though. The |0x20 can be dropped if all cases
> >> have pre-lower-cased the input already. This would replace the binary
> >> search in the lookup functions. Returning offsets directly would be easy
> >> as well. That allows writing a single string where each entry is prefixed
> >> with a type mask, the token id, the length of the keyword and the actual
> >> keyword text. Does that sound useful to you?
>
> > Judging by previous responses, there is still interest in using
> > perfect hash functions, so thanks for this. I'm not knowledgeable
> > enough to judge its implementation, so I'll leave that for others.
>
> We haven't actually seen the implementation, so it's hard to judge ;-).

It's a temporary hacked version of nbperf in the NetBSD tree.

> The sample hash function certainly looks great. I'm not terribly on board
> with using |0x20 as a substitute for lower-casing, but that's a minor
> detail.

Yeah, I've included that part more because I don't know the current use
cases enough. If all instances are already doing lower-casing in
advance, it is trivial to drop.

> The foremost questions in my mind are:
>
> * What's the generator written in? (if the answer's not "Perl", wedging
> it into our build processes might be painful)

Plain C, nothing really fancy in it.

> * What license is it under?

Two clause BSD license.

> * Does it always succeed in producing a single-level lookup table?

This question is a bit tricky. The short answer is: yes.
The longer answer: The chosen hash function in the example is very
simple (e.g. just two variations of a DJB-style hash), so with that:
no, not without potentially fiddling a bit with the hash function if
things ever get nasty, like having two keywords that hit a funnel for
both variants. The main concern for the choice was to be fast.

When using two families of independent hash functions, the generator
requires probabilistic linear time in the number of keys. That means
with a strong enough hash function, like the Jenkins hash used in PG
elsewhere, it will succeed very fast. So if it fails on new keywords,
making the mixing a bit stronger should be enough.

Joerg
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 12/27/18, Tom Lane wrote: > If you really are hot about saving that other 440 bytes, the way to > do it would be to drop the struct entirely and use two parallel > arrays, an int16[] for value and a char[] (or better uint8[]) for > category. Those would be filled by reading kwlist.h twice with > different definitions for PG_KEYWORD. Not sure it's worth the > trouble though --- in particular, not clear that it's a win from > the standpoint of number of cache lines touched. Understood. That said, after re-implementing all keyword lookups, I wondered if there'd be a notational benefit to dropping the struct, especially since as yet no caller uses both token and category. It makes pl_scanner.c and its reserved keyword list a bit nicer, and gets rid of the need to force frontend to have 'zero' token numbers, but I'm not sure it's a clear win. I've attached a patch (applies on top of v6), gzipped to avoid confusing the cfbot. -John Naylor keyword-nostruct.patch.gz Description: GNU Zip compressed data
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Andres Freund writes: > On 2019-01-04 12:26:18 -0500, Tom Lane wrote: >> I'm wondering where we want to go with this. Is anyone excited >> about pursuing the perfect-hash-function idea? (Joerg's example >> function looked pretty neat to me.) If we are going to do that, >> does it make sense to push this version beforehand? > I think it does make sense to push this version beforehand. Most of > the code would be needed anyway, so it's not like this is going to > cause a lot of churn. Yeah, I'm leaning in that direction too, first on the grounds of "don't let the perfect be the enemy of the good", and second because if we do end up with perfect hashing, we'd still need a table-generation step. The build infrastructure this adds would support a generator that produces perfect hashes just as well as what this is doing, even if we end up having to whack the API of ScanKeywordLookup around some more. So barring objections, I'll have a look at pushing this, and then we can think about using perfect hashing instead. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
John Naylor writes:
> On 1/3/19, Joerg Sonnenberger wrote:
>> I was pointed at your patch on IRC and decided to look into adding my
>> own pieces. What I can provide you is a fast perfect hash function
>> generator. I've attached a sample hash function based on the current
>> main keyword list. hash() essentially gives you the number of the only
>> possible match, a final strcmp/memcmp is still necessary to verify that
>> it is an actual keyword though. The |0x20 can be dropped if all cases
>> have pre-lower-cased the input already. This would replace the binary
>> search in the lookup functions. Returning offsets directly would be easy
>> as well. That allows writing a single string where each entry is prefixed
>> with a type mask, the token id, the length of the keyword and the actual
>> keyword text. Does that sound useful to you?

> Judging by previous responses, there is still interest in using
> perfect hash functions, so thanks for this. I'm not knowledgeable
> enough to judge its implementation, so I'll leave that for others.

We haven't actually seen the implementation, so it's hard to judge ;-).

The sample hash function certainly looks great. I'm not terribly on board
with using |0x20 as a substitute for lower-casing, but that's a minor
detail.

The foremost questions in my mind are:

* What's the generator written in? (if the answer's not "Perl", wedging
it into our build processes might be painful)

* What license is it under?

* Does it always succeed in producing a single-level lookup table?

regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 2019-01-04 12:26:18 -0500, Tom Lane wrote:
> I wrote:
> > Andres Freund writes:
> >> On 2018-12-29 16:59:52 -0500, John Naylor wrote:
> >>> I think 0001 with complete keyword lookup replacement is in decent
> >>> enough shape to post. Make check-world passes. A few notes and
> >>> caveats:
>
> >> I tried to take this for a spin, and for me the build fails because various
> >> frontend programs don't have KeywordOffsets/Strings defined, but reference it
> >> through various functions exposed to the frontend (like fmtId()). That I see
> >> that error but you don't is probably related to me using -fuse-ld=gold in
> >> CFLAGS.
>
> > I was just about to point out that the cfbot is seeing that too ...
>
> Aside from the possible linkage problem, this will need a minor rebase
> over 4879a5172, which rearranged some of plpgsql's calls of
> ScanKeywordLookup.
>
> While I don't think it's going to be hard to resolve these issues,
> I'm wondering where we want to go with this. Is anyone excited
> about pursuing the perfect-hash-function idea? (Joerg's example
> function looked pretty neat to me.) If we are going to do that,
> does it make sense to push this version beforehand?

I think it does make sense to push this version beforehand. Most of
the code would be needed anyway, so it's not like this is going to
cause a lot of churn.

Greetings,

Andres Freund
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 1/3/19, Joerg Sonnenberger wrote: > Hello John, > I was pointed at your patch on IRC and decided to look into adding my > own pieces. What I can provide you is a fast perfect hash function > generator. I've attached a sample hash function based on the current > main keyword list. hash() essentially gives you the number of the only > possible match, a final strcmp/memcmp is still necessary to verify that > it is an actual keyword though. The |0x20 can be dropped if all cases > have pre-lower-cased the input already. This would replace the binary > search in the lookup functions. Returning offsets directly would be easy > as well. That allows writing a single string where each entry is prefixed > with a type mask, the token id, the length of the keyword and the actual > keyword text. Does that sound useful to you? Judging by previous responses, there is still interest in using perfect hash functions, so thanks for this. I'm not knowledgeable enough to judge its implementation, so I'll leave that for others. -John Naylor
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 12/30/18, Andres Freund wrote:
> I tried to take this for a spin, and for me the build fails because various
> frontend programs don't have KeywordOffsets/Strings defined, but reference it
> through various functions exposed to the frontend (like fmtId()). That I see
> that error but you don't is probably related to me using -fuse-ld=gold in
> CFLAGS.
>
> I can "fix" this by including kwlist_d.h in common/keywords.c
> regardless of FRONTEND. That also led me to discover that the build
> dependencies somewhere aren't correctly set up, because I need to
> force a clean rebuild to trigger the problem again, just changing
> keywords.c back doesn't trigger the problem.

Hmm, that was a typo, and I didn't notice even when I found I had to
include kwlist_d.h in ecpg/keywords.c. :-( I've fixed both of those in
the attached v6. As far as dependencies, I'm far from sure I have it up
to par. That piece could use some discussion.

On 1/4/19, Tom Lane wrote:
> Aside from the possible linkage problem, this will need a minor rebase
> over 4879a5172, which rearranged some of plpgsql's calls of
> ScanKeywordLookup.
>
> While I don't think it's going to be hard to resolve these issues,
> I'm wondering where we want to go with this. Is anyone excited
> about pursuing the perfect-hash-function idea? (Joerg's example
> function looked pretty neat to me.) If we are going to do that,
> does it make sense to push this version beforehand?

If it does, for v6 I've also done the rebase, updated the copyright
year, and fixed an error in MSVC.

-John Naylor

From fb846eeabd450b6932e57f7e8d4f0aebc125d178 Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Fri, 4 Jan 2019 14:35:40 -0500
Subject: [PATCH v6] Use offset-based keyword lookup.

Replace binary search over an array of ScanKeyword structs with a
binary search over an array of offsets into a keyword string. Access
auxiliary data only after a keyword hit. This has better locality of
reference and a smaller memory footprint.
--- .../pg_stat_statements/pg_stat_statements.c | 2 + src/backend/parser/parser.c | 4 +- src/backend/parser/scan.l | 38 ++-- src/backend/utils/adt/misc.c | 2 +- src/backend/utils/adt/ruleutils.c | 9 +- src/common/.gitignore | 1 + src/common/Makefile | 25 ++- src/common/keywords.c | 42 +++-- src/fe_utils/string_utils.c | 9 +- src/include/common/keywords.h | 19 +- src/include/parser/kwlist.h | 2 +- src/include/parser/scanner.h | 8 +- src/interfaces/ecpg/preproc/.gitignore| 2 + src/interfaces/ecpg/preproc/Makefile | 20 ++- src/interfaces/ecpg/preproc/c_keywords.c | 65 ++- src/interfaces/ecpg/preproc/c_kwlist.h| 53 ++ src/interfaces/ecpg/preproc/ecpg_keywords.c | 81 ++--- src/interfaces/ecpg/preproc/ecpg_kwlist.h | 68 +++ src/interfaces/ecpg/preproc/keywords.c| 6 +- src/interfaces/ecpg/preproc/pgc.l | 22 +-- src/interfaces/ecpg/preproc/preproc_extern.h | 8 +- src/pl/plpgsql/src/.gitignore | 2 + src/pl/plpgsql/src/Makefile | 16 +- src/pl/plpgsql/src/pl_reserved_kwlist.h | 55 ++ src/pl/plpgsql/src/pl_scanner.c | 167 -- src/pl/plpgsql/src/pl_unreserved_kwlist.h | 111 src/tools/gen_keywords.pl | 136 ++ src/tools/msvc/Solution.pm| 44 + src/tools/msvc/clean.bat | 6 + 29 files changed, 696 insertions(+), 327 deletions(-) create mode 100644 src/common/.gitignore create mode 100644 src/interfaces/ecpg/preproc/c_kwlist.h create mode 100644 src/interfaces/ecpg/preproc/ecpg_kwlist.h create mode 100644 src/pl/plpgsql/src/pl_reserved_kwlist.h create mode 100644 src/pl/plpgsql/src/pl_unreserved_kwlist.h create mode 100644 src/tools/gen_keywords.pl diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index e8ef966bb5..4776de7595 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -3076,6 +3076,8 @@ fill_in_constant_lengths(pgssJumbleState *jstate, const char *query, yyscanner = scanner_init(query, &yyextra, ScanKeywords, + KeywordString, + KeywordOffsets, 
NumScanKeywords); /* we don't want to re-emit any escape string warnings */ diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c index 7e9b1222fd..19fa7bb8ae 100644 --- a/src/backend/parser/parser.c +++ b/src/backend/parser/parser.c @@ -40,8 +40,8 @@ raw_parser(const char *str) int yyresult; /* initialize the flex scanner */ - yyscanner = scanner_init(str, &yyextra.core_yy_extra, - ScanKeywords, Nu
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
I wrote:
> Andres Freund writes:
>> On 2018-12-29 16:59:52 -0500, John Naylor wrote:
>>> I think 0001 with complete keyword lookup replacement is in decent
>>> enough shape to post. Make check-world passes. A few notes and
>>> caveats:

>> I tried to take this for a spin, and for me the build fails because various
>> frontend programs don't have KeywordOffsets/Strings defined, but reference it
>> through various functions exposed to the frontend (like fmtId()). That I see
>> that error but you don't is probably related to me using -fuse-ld=gold in
>> CFLAGS.

> I was just about to point out that the cfbot is seeing that too ...

Aside from the possible linkage problem, this will need a minor rebase
over 4879a5172, which rearranged some of plpgsql's calls of
ScanKeywordLookup.

While I don't think it's going to be hard to resolve these issues,
I'm wondering where we want to go with this. Is anyone excited
about pursuing the perfect-hash-function idea? (Joerg's example
function looked pretty neat to me.) If we are going to do that,
does it make sense to push this version beforehand?

regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Sun, Dec 16, 2018 at 11:50:15AM -0500, John Naylor wrote: > A few months ago I was looking into faster search algorithms for > ScanKeywordLookup(), so this is interesting to me. While an optimal > full replacement would be a lot of work, the above ideas are much less > invasive and would still have some benefit. Unless anyone intends to > work on this, I'd like to flesh out the offset-into-giant-string > approach a bit further: Hello John, I was pointed at your patch on IRC and decided to look into adding my own pieces. What I can provide you is a fast perfect hash function generator. I've attached a sample hash function based on the current main keyword list. hash() essentially gives you the number of the only possible match, a final strcmp/memcmp is still necessary to verify that it is an actual keyword though. The |0x20 can be dropped if all cases have pre-lower-cased the input already. This would replace the binary search in the lookup functions. Returning offsets directly would be easy as well. That allows writing a single string where each entry is prefixed with a type mask, the token id, the length of the keyword and the actual keyword text. Does that sound useful to you? 
Joerg

/* Attached sample hash function; the archived copy is garbled (the
 * included header names were stripped and the mixing table is cut off
 * mid-array).  The recoverable skeleton: */

#include <stddef.h>
#include <stdint.h>

uint32_t
hash(const void * __restrict key, size_t keylen)
{
	static const uint16_t g[881] = {
		/* ... 881 mixing-table entries; the values and the remainder
		 * of the function are garbled and truncated in the archive ... */
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Andres Freund writes:
> On 2018-12-29 16:59:52 -0500, John Naylor wrote:
>> I think 0001 with complete keyword lookup replacement is in decent
>> enough shape to post. Make check-world passes. A few notes and
>> caveats:

> I tried to take this for a spin, and for me the build fails because various
> frontend programs don't have KeywordOffsets/Strings defined, but reference it
> through various functions exposed to the frontend (like fmtId()). That I see
> that error but you don't is probably related to me using -fuse-ld=gold in
> CFLAGS.

I was just about to point out that the cfbot is seeing that too ...

regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Hi,

On 2018-12-29 16:59:52 -0500, John Naylor wrote:
> I think 0001 with complete keyword lookup replacement is in decent
> enough shape to post. Make check-world passes. A few notes and
> caveats:

I tried to take this for a spin, and for me the build fails because
various frontend programs don't have KeywordOffsets/Strings defined,
but reference it through various functions exposed to the frontend
(like fmtId()). That I see that error but you don't is probably related
to me using -fuse-ld=gold in CFLAGS.

I can "fix" this by including kwlist_d.h in common/keywords.c
regardless of FRONTEND. That also led me to discover that the build
dependencies somewhere aren't correctly set up, because I need to
force a clean rebuild to trigger the problem again, just changing
keywords.c back doesn't trigger the problem.

Greetings,

Andres Freund
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
I think 0001 with complete keyword lookup replacement is in decent enough shape to post. Make check-world passes. A few notes and caveats: -I added an --extern option to the script for the core keyword headers. This also capitalizes variables. -ECPG keyword lookup is a bit different in that the ecpg and sql lookup functions are wrapped in a single function rather than called separately within pgc.l. It might be worth untangling that, but I have not done so. -Some variable names haven't changed even though now they're only referring to token values, which might be confusing. -I haven't checked if I need to install the generated headers. -I haven't measured performance or binary size. If anyone is excited enough to do that, great, otherwise I'll do that as time permits. -There are probably makefile bugs. Now, on to previous review points: On 12/27/18, Tom Lane wrote: > +/* Payload data for keywords */ > +typedef struct ScanKeywordAux > +{ > + int16 value; /* grammar's token code */ > + charcategory; /* see codes above */ > +} ScanKeywordAux; > > There isn't really any point in changing category to "char", because > alignment considerations will mandate that sizeof(ScanKeywordAux) be > a multiple of 2 anyway. With some compilers we could get around that > with a pragma to force non-aligned storage, but doing so would be a > net loss on most non-Intel architectures. Reverted, especially since we can skip the struct entirely for some callers as you pointed out below. > diff --git a/src/pl/plpgsql/src/.gitignore b/src/pl/plpgsql/src/.gitignore > @@ -1,3 +1,4 @@ > +/*kwlist_d.h > > Not a fan of using wildcards in .gitignore files, at least not when > there's just one or two files you intend to match. Removed. 
> # Force these dependencies to be known even without dependency info built: > -pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o: > plpgsql.h pl_gram.h plerrcodes.h > +pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o: > plpgsql.h pl_gram.h plerrcodes.h pl_unreserved_kwlist_d.h > > Hm, do we really need any more than pl_scanner.o to depend on that header? I think you're right, so separated into a new rule. > +# Parse keyword header for names. > +my @keywords; > +while (<$kif>) > +{ > + if (/^PG_KEYWORD\("(\w+)",\s*\w+,\s*\w+\)/) > > This is assuming more than it should about the number of arguments for > PG_KEYWORD, as well as what's in them. I think it'd be sufficient to > match like this: > > if (/^PG_KEYWORD\("(\w+)",/) ...and... > diff --git a/src/pl/plpgsql/src/pl_unreserved_kwlist.h > b/src/pl/plpgsql/src/pl_unreserved_kwlist.h > +/* name, value, category */ > +PG_KEYWORD("absolute", K_ABSOLUTE, UNRESERVED_KEYWORD) > > Likewise, I'd just have these be two-argument macros. There's no reason > for the various kwlist.h headers to agree on the number of payload > arguments for PG_KEYWORD. Both done, however... > +/* FIXME: Have to redefine this symbol for the WIP. */ > +#undef PG_KEYWORD > +#define PG_KEYWORD(kwname, value, category) {value, category}, > + > +static const ScanKeywordAux unreserved_keywords[] = { > +#include "pl_unreserved_kwlist.h" > }; > > The category isn't useful for this keyword list, so couldn't you > just make this an array of uint16 values? Yes, this works for the unreserved keywords. The reserved ones still need the aux struct to work with the core scanner, even though scan.l doesn't reference category either. This has the consequence that we can't dispense with category, e.g.: PG_KEYWORD("all", K_ALL, RESERVED_KEYWORD) ...unless we do without the struct entirely, but that's not without disadvantages as you mentioned. 
I decided to export the struct (rather than just int16 for category) to the frontend, even though we have to set the token values to zero, since there might someday be another field of use to the frontend. Also to avoid confusion. > I don't mind allowing the prefix to default to empty. What I was > concerned about was that base_filename could end up undefined. > Probably the thing to do is to generate base_filename separately, > say by stripping any initial ".*/" sequence and then substitute > '_' for '.'. I removed assumptions about the filename. > +Options: > +-o output path > +-p optional prefix for generated data structures > > This usage message is pretty vague about how you write the options > (cf gripe above). I tried it like this: Usage: gen_keywords.pl [--output/-o ] [--prefix/-p ] input_file --output Output directory --prefix String prepended to var names in the output file On 12/27/18, Andrew Dunstan wrote: > I would rather we used the standard perl module Getopt::Long, as > numerous programs we have already do. Done. I'll also send a patch later to bring some other scripts in line. -John Naylor From 87e3e5e045823070a865933e969262d55e156137 Mon Sep 17 00:00:00 2001 From: Joh
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 2018-12-27 14:22:11 -0500, Tom Lane wrote:
> Andrew Dunstan writes:
> > A smaller project might be to see if we can replace the binary keyword
> > search in ScanKeyword with a perfect hashing function generated by
> > gperf, or something similar. I had a quick look at that, too.
>
> Yeah, we've looked at gperf before, eg
>
> https://www.postgresql.org/message-id/20170927183156.jqzcsy7ocjcbd...@alap3.anarazel.de
>
> Perhaps it'd be a win but I'm not very convinced.

Note that the tradeoffs mentioned there, by memory, aren't necessarily
applicable here. As we're dealing with strings anyway, gperf wanting to
deal with strings rather than being able to deal with numbers isn't
problematic.

> I don't know much about the theory of perfect hashing, but I wonder
> if we could just roll our own tool for that. Since we're not dealing
> with extremely large keyword sets, perhaps brute force search for a
> set of multipliers for a hash computation like
> (char[0] * some_prime + char[1] * some_other_prime ...) mod table_size
> would work.

The usual way to do perfect hashing is to basically have a two-stage
hashtable, with the first stage keyed by a "normal" hash function, and
the second one disambiguating the values that hash into the same
bucket, by additionally keying a hash function with the value in the
cell of the intermediate hash table. Determining the parameters of the
intermediate table is what takes time.

That most perfect hash functions look that way is also a good part of
the reason why I doubt it's worthwhile to go there over a simple linear
probing hashtable with a good hash function - computing two hash values
will usually be worse than linear probing for *small* and *not
modified* hashtables.

A simple (i.e. slow for large numbers of keys) implementation for
generating a perfect hash function isn't particularly hard. E.g.
look at the python implementation at http://iswsa.acm.org/mphf/index.html and http://stevehanov.ca/blog/index.php?id=119 for an easy explanation with graphics. Greetings, Andres Freund
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Thu, Dec 27, 2018 at 07:04:41PM -0300, Alvaro Herrera wrote:
> On 2018-Dec-27, Tom Lane wrote:
>
> > I poked around a little on my own machines, and I can confirm that
> > Getopt::Long is present in a default Perl install-from-source at
> > least as far back as perl 5.6.1. It's barely conceivable that some
> > packager might omit it from their minimal package, but Red Hat,
> > Apple, NetBSD, and OpenBSD all include it. So it sure looks to
> > me like relying on it should be non-problematic.
>
> In Debian it's included in package perl-modules-5.24, which packages
> perl and libperl5.24 depend on. I suppose it's possible to install
> perl-base and not install perl-modules, but it'd be a really bare-bones
> machine. I'm not sure it's possible to build Postgres in such a
> machine.

$ corelist -a Getopt::Long

Data for 2018-11-29
Getopt::Long was first released with perl 5
  5         undef
  5.001     undef
  5.002     2.01
  5.00307   2.04
  5.004     2.10
  5.00405   2.19
  5.005     2.17
  5.00503   2.19
  5.00504   2.20
  [much output elided]

Fortunately, this has been part of Perl core a lot further back than we
promise to support for builds, so I think we're clear to use it
everywhere we process options.

Best,
David.

--
David Fetter http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 2018-Dec-27, Tom Lane wrote: > I poked around a little on my own machines, and I can confirm that > Getopt::Long is present in a default Perl install-from-source at > least as far back as perl 5.6.1. It's barely conceivable that some > packager might omit it from their minimal package, but Red Hat, > Apple, NetBSD, and OpenBSD all include it. So it sure looks to > me like relying on it should be non-problematic. In Debian it's included in package perl-modules-5.24, which packages perl and libperl5.24 depend on. I suppose it's possible to install perl-base and not install perl-modules, but it'd be a really bare-bones machine. I'm not sure it's possible to build Postgres in such a machine. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Andrew Dunstan writes: > On 12/27/18 3:34 PM, Tom Lane wrote: >> ... that's a pretty thin argument, and if Getopt::Long is present even >> in the most minimal Perl installations then it's certainly moot. > It's bundled separately, but on both systems I looked at it's needed by > the base perl package. I don't recall ever seeing a system where it's > not available. I'm reasonably careful about what packages the buildfarm > requires, and it's used Getopt::Long from day one. I poked around a little on my own machines, and I can confirm that Getopt::Long is present in a default Perl install-from-source at least as far back as perl 5.6.1. It's barely conceivable that some packager might omit it from their minimal package, but Red Hat, Apple, NetBSD, and OpenBSD all include it. So it sure looks to me like relying on it should be non-problematic. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 12/27/18 3:34 PM, Tom Lane wrote: > Andrew Dunstan writes: >> On 12/27/18 3:00 PM, John Naylor wrote: >>> This style was cargo-culted from the catalog scripts. I can settle on >>> just the first form if you like. >> I would rather we used the standard perl module Getopt::Long, as >> numerous programs we have already do. > Hmm ... grepping finds that used only in > > src/tools/pgindent/pgindent > src/tools/git_changelog > src/pl/plperl/text2macro.pl > > so I'm not quite sure about the "numerous" claim. Adopting that > here would possibly impose the requirement of having Getopt::Long > on some developers who are getting by without it today. However, > that's a pretty thin argument, and if Getopt::Long is present even > in the most minimal Perl installations then it's certainly moot. It's bundled separately, but on both systems I looked at it's needed by the base perl package. I don't recall ever seeing a system where it's not available. I'm reasonably careful about what packages the buildfarm requires, and it's used Getopt::Long from day one. > > On the whole I'm +1 for this. Perhaps also, as an independent patch, > we should change the catalog scripts to use Getopt::Long. > > Probably some others, too. cheers andrew -- Andrew Dunstan  https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Andrew Dunstan writes: > On 12/27/18 3:00 PM, John Naylor wrote: >> This style was cargo-culted from the catalog scripts. I can settle on >> just the first form if you like. > I would rather we used the standard perl module Getopt::Long, as > numerous programs we have already do. Hmm ... grepping finds that used only in src/tools/pgindent/pgindent src/tools/git_changelog src/pl/plperl/text2macro.pl so I'm not quite sure about the "numerous" claim. Adopting that here would possibly impose the requirement of having Getopt::Long on some developers who are getting by without it today. However, that's a pretty thin argument, and if Getopt::Long is present even in the most minimal Perl installations then it's certainly moot. On the whole I'm +1 for this. Perhaps also, as an independent patch, we should change the catalog scripts to use Getopt::Long. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 12/27/18 3:00 PM, John Naylor wrote: > On 12/27/18, Tom Lane wrote: >> diff --git a/src/tools/gen_keywords.pl b/src/tools/gen_keywords.pl >> +elsif ($arg =~ /^-o/) >> +{ >> +$output_path = length($arg) > 2 ? substr($arg, 2) : shift @ARGV; >> +} >> >> My perl-fu is not great, but it looks like this will accept arguments >> like "-ofilename", which is a style I don't like at all. I'd rather >> either insist on the filename being separate or write the switch like >> "-o=filename". Also, project style when taking both forms is usually >> more like >> -o filename >> --offset=filename > This style was cargo-culted from the catalog scripts. I can settle on > just the first form if you like. > I would rather we used the standard perl module Getopt::Long, as numerous programs we have already do. cheers andrew -- Andrew Dunstan  https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
John Naylor writes: > On 12/27/18, Tom Lane wrote: >> +$kw_input_file =~ /((\w*)kwlist)\.h/; >> +my $base_filename = $1; >> +$prefix = $2 if !defined $prefix; >> >> Hmm, what happens if the input filename does not end with "kwlist.h"? > If that's a maintainability hazard, I can force every invocation to > provide a prefix instead. I don't mind allowing the prefix to default to empty. What I was concerned about was that base_filename could end up undefined. Probably the thing to do is to generate base_filename separately, say by stripping any initial ".*/" sequence and then substitute '_' for '.'. >> I looked very briefly at v4-0002, and I'm not very convinced about >> the "middle" aspect of that optimization. It seems unmaintainable, >> plus you've not exhibited how the preferred keywords would get selected >> in the first place (wiring them into the Perl script is surely not >> acceptable). > What if the second argument of the macro held this info? Yeah, you'd have to do something like that. But I'm still concerned about the maintainability aspect: if we mark say "commit" as the starting point in the "c" group, future additions or deletions of keywords starting with "c" might render that an increasingly poor choice. But most likely nobody would ever notice that the marking was getting more and more suboptimal. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 12/27/18, Tom Lane wrote: > diff --git a/src/tools/gen_keywords.pl b/src/tools/gen_keywords.pl > + elsif ($arg =~ /^-o/) > + { > + $output_path = length($arg) > 2 ? substr($arg, 2) : shift @ARGV; > + } > > My perl-fu is not great, but it looks like this will accept arguments > like "-ofilename", which is a style I don't like at all. I'd rather > either insist on the filename being separate or write the switch like > "-o=filename". Also, project style when taking both forms is usually > more like > -o filename > --offset=filename This style was cargo-culted from the catalog scripts. I can settle on just the first form if you like. > +$kw_input_file =~ /((\w*)kwlist)\.h/; > +my $base_filename = $1; > +$prefix = $2 if !defined $prefix; > > Hmm, what happens if the input filename does not end with "kwlist.h"? If that's a maintainability hazard, I can force every invocation to provide a prefix instead. > I looked very briefly at v4-0002, and I'm not very convinced about > the "middle" aspect of that optimization. It seems unmaintainable, > plus you've not exhibited how the preferred keywords would get selected > in the first place (wiring them into the Perl script is surely not > acceptable). What if the second argument of the macro held this info? Something like: PG_KEYWORD("security", FULL_SEARCH, UNRESERVED_KEYWORD) PG_KEYWORD("select", OPTIMIZE, SELECT, RESERVED_KEYWORD) with a warning emitted if more than one keyword per range has OPTIMIZE. That would require all keyword lists to have that second argument, but selecting a preferred keyword would be optional. -John Naylor
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Andrew Dunstan writes: > RD parsers are not terribly hard to write. Sure, as long as they are for grammars that are (a) small, (b) static, and (c) LL(1), which is strictly weaker than the LALR(1) grammar class that bison can handle. We already have a whole lot of constructs that are at the edges of what bison can handle, which makes me dubious that an RD parser could be built at all without a lot of performance-eating lookahead and/or backtracking. > A smaller project might be to see if we can replace the binary keyword > search in ScanKeyword with a perfect hashing function generated by > gperf, or something similar. I had a quick look at that, too. Yeah, we've looked at gperf before, eg https://www.postgresql.org/message-id/20170927183156.jqzcsy7ocjcbd...@alap3.anarazel.de Perhaps it'd be a win but I'm not very convinced. I don't know much about the theory of perfect hashing, but I wonder if we could just roll our own tool for that. Since we're not dealing with extremely large keyword sets, perhaps brute force search for a set of multipliers for a hash computation like (char[0] * some_prime + char[1] * some_other_prime ...) mod table_size would work. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
John Naylor writes: > Barring additional bikeshedding on 0001, I'll plan on implementing > offset-based lookup for the other keyword types and retire the old > ScanKeyword. Once that's done, we can benchmark and compare with the > optimizations in 0002. Sounds like a plan. Assorted minor bikeshedding on v4-0001 (just from eyeballing it, I didn't test it): +/* Like ScanKeywordLookup, but uses offsets into a keyword string. */ +int +ScanKeywordLookupOffset(const char *string_to_lookup, + const char *kw_strings, Not really "like" it, since the return value is totally different and so is the representation of the keyword list. I realize that your plan is probably to get rid of ScanKeywordLookup and then adapt the latter's comment for this code, but don't forget that you need to adjust said comment. +/* Payload data for keywords */ +typedef struct ScanKeywordAux +{ + int16 value; /* grammar's token code */ + char category; /* see codes above */ +} ScanKeywordAux; There isn't really any point in changing category to "char", because alignment considerations will mandate that sizeof(ScanKeywordAux) be a multiple of 2 anyway. With some compilers we could get around that with a pragma to force non-aligned storage, but doing so would be a net loss on most non-Intel architectures. If you really are hot about saving that other 440 bytes, the way to do it would be to drop the struct entirely and use two parallel arrays, an int16[] for value and a char[] (or better uint8[]) for category. Those would be filled by reading kwlist.h twice with different definitions for PG_KEYWORD. Not sure it's worth the trouble though --- in particular, not clear that it's a win from the standpoint of number of cache lines touched. diff --git a/src/pl/plpgsql/src/.gitignore b/src/pl/plpgsql/src/.gitignore @@ -1,3 +1,4 @@ +/*kwlist_d.h Not a fan of using wildcards in .gitignore files, at least not when there's just one or two files you intend to match. 
# Force these dependencies to be known even without dependency info built: -pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o: plpgsql.h pl_gram.h plerrcodes.h +pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o: plpgsql.h pl_gram.h plerrcodes.h pl_unreserved_kwlist_d.h Hm, do we really need any more than pl_scanner.o to depend on that header? +/* FIXME: Have to redefine this symbol for the WIP. */ +#undef PG_KEYWORD +#define PG_KEYWORD(kwname, value, category) {value, category}, + +static const ScanKeywordAux unreserved_keywords[] = { +#include "pl_unreserved_kwlist.h" }; The category isn't useful for this keyword list, so couldn't you just make this an array of uint16 values? diff --git a/src/pl/plpgsql/src/pl_unreserved_kwlist.h b/src/pl/plpgsql/src/pl_unreserved_kwlist.h +/* name, value, category */ +PG_KEYWORD("absolute", K_ABSOLUTE, UNRESERVED_KEYWORD) Likewise, I'd just have these be two-argument macros. There's no reason for the various kwlist.h headers to agree on the number of payload arguments for PG_KEYWORD. diff --git a/src/tools/gen_keywords.pl b/src/tools/gen_keywords.pl + elsif ($arg =~ /^-o/) + { + $output_path = length($arg) > 2 ? substr($arg, 2) : shift @ARGV; + } My perl-fu is not great, but it looks like this will accept arguments like "-ofilename", which is a style I don't like at all. I'd rather either insist on the filename being separate or write the switch like "-o=filename". Also, project style when taking both forms is usually more like -o filename --offset=filename +$kw_input_file =~ /((\w*)kwlist)\.h/; +my $base_filename = $1; +$prefix = $2 if !defined $prefix; Hmm, what happens if the input filename does not end with "kwlist.h"? +# Parse keyword header for names. +my @keywords; +while (<$kif>) +{ + if (/^PG_KEYWORD\("(\w+)",\s*\w+,\s*\w+\)/) This is assuming more than it should about the number of arguments for PG_KEYWORD, as well as what's in them. 
I think it'd be sufficient to match like this: if (/^PG_KEYWORD\("(\w+)",/) +Options: +-o output path +-p optional prefix for generated data structures This usage message is pretty vague about how you write the options (cf gripe above). I looked very briefly at v4-0002, and I'm not very convinced about the "middle" aspect of that optimization. It seems unmaintainable, plus you've not exhibited how the preferred keywords would get selected in the first place (wiring them into the Perl script is surely not acceptable). If you want to pursue that, please separate it into an 0002 that just adds the letter-range aspect and then an 0003 that adds the "middle" business on top. Then we can do testing to see whether either of those ideas are worthwhile. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 12/27/18 12:12 PM, Tom Lane wrote: >> I don't buy that we're inable to write a descent parser that way. > I do not think that we could write one for the current state of the > PG grammar without an investment of effort so large that it's not > going to happen. Even if such a parser were to spring fully armed > from somebody's forehead, we absolutely cannot expect that it would > continue to work correctly after non-wizard contributors modify it. I just did a quick survey of generator tools. Unfortunately, the best candidate alternative (ANTLR) no longer supports generating plain C code. I don't know of another tool that is well maintained, supports C, and generates top down parsers. Twenty-five years ago or so I wrote a top-down table-driven parser generator, but that was in another country, and besides, the wench is dead. There are well known techniques (See s 4.4 of the Dragon Book, if you have a copy) for formal analysis of grammars to determine predictive parser action. They aren't hard, and the tables they produce are typically much smaller than those used for LALR parsers. Still, probably not for the faint of heart. The tools that have moved to using hand cut RD parsers have done so precisely because they get a significant performance benefit from doing so. RD parsers are not terribly hard to write. Yes, the JSON grammar is tiny, but I think I wrote the basics of the RD parser we use for JSON in about an hour. I think arguing that our hacker base is not competent to maintain such a thing for the SQL grammar is wrong. We successfully maintain vastly more complex pieces of code. Having said all that, I don't intend to spend any time on implementing an alternative parser. It would as you say involve a heck of a lot of time, which I don't have. It would be a fine academic research project for some student. 
A smaller project might be to see if we can replace the binary keyword search in ScanKeyword with a perfect hashing function generated by gperf, or something similar. I had a quick look at that, too. Unfortunately the smallest hash table I could generate for our 440 symbols had 1815 entries, so I'm not sure how well that would work. Worth investigating, though. cheers andrew -- Andrew Dunstan  https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Andres Freund writes: > On 2018-12-26 14:03:57 -0500, Tom Lane wrote: >> It's impossible to write correct RD parsers by hand for any but the most >> trivial, conflict-free languages, and what we have got to deal with >> is certainly neither of those; moreover, it's a constantly moving target. >> We'd be buying into an endless landscape of parser bugs if we go that way. >> It's *not* worth it. > It's not exactly new that people end up moving to bison to recursive > descent parsers once they hit the performance problems and want to give > better error messages. E.g. both gcc and clang have hand-written > recursive-descent parsers for C and C++ these days. Note that they are dealing with fixed language definitions. Furthermore, there's no need to worry about whether that code has to be hacked on by less-than-expert people. Neither condition applies to us. The thing that most concerns me about not using a grammar tool of some sort is that with handwritten RD, it's very easy to get into situations where you've "defined" (well, implemented, because you never did have a formal definition) a language that is ambiguous, admitting of more than one valid parse interpretation. You won't find out until someone files a bug report complaining that some apparently-valid statement isn't doing what they expect. At that point you are in a world of hurt, because it's too late to fix it without changing the language definition and thus creating user-visible compatibility breakage. Now bison isn't perfect in this regard, because you can shoot yourself in the foot with ill-considered precedence specifications (and we've done so ;-(), but it is light-years more likely to detect ambiguous grammar up-front than any handwritten parser logic is. If we had a tool that proved a BNF grammar non-ambiguous and then wrote an RD parser for it, that'd be fine with me --- but we need a tool, not somebody claiming he can write an error-free RD parser for an arbitrary language. 
My position is that anyone claiming that is just plain deluded. I also do not buy your unsupported-by-any-evidence claim that the error reports would be better. I've worked on RD parsers in the past, and they're not really better, at least not without expending enormous amounts of effort --- and run-time cycles --- specifically on the error reporting aspect. Again, I don't see that happening for us. > I don't buy that we're unable to write a descent parser that way. I do not think that we could write one for the current state of the PG grammar without an investment of effort so large that it's not going to happen. Even if such a parser were to spring fully armed from somebody's forehead, we absolutely cannot expect that it would continue to work correctly after non-wizard contributors modify it. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 12/26/18, John Naylor wrote: > On 12/26/18, Robert Haas wrote: >> I wonder if we could do something really simple like a lookup based on >> the first character of the scan keyword. It looks to me like there are >> 440 keywords right now, and the most common starting letter is 'c', >> which is the first letter of 51 keywords. So dispatching based on the >> first letter clips at least 3 steps off the binary search. I don't >> know whether that's enough to be worthwhile, but it's probably pretty >> simple to implement. > I agree it'd be fairly simple to do, and might raise the bar for doing > anything more complex. I went ahead and did this for v4, but split out into a separate patch. In addition, I used a heuristic to bypass binary search for the most common keywords. Normally, the middle value is computed mathematically, but I found that in each range of keywords beginning with the same letter, there is often 1 or 2 common keywords that are good first guesses, such as select, from, join, limit, where. I taught the lookup to try those first, and then compute subsequent steps the usual way. Barring additional bikeshedding on 0001, I'll plan on implementing offset-based lookup for the other keyword types and retire the old ScanKeyword. Once that's done, we can benchmark and compare with the optimizations in 0002. -John Naylor From ed78318df69bd3fa1543f3dc04d0a6ca9794a81c Mon Sep 17 00:00:00 2001 From: John Naylor Date: Wed, 26 Dec 2018 21:28:27 -0500 Subject: [PATCH v4 1/2] WIP Use offset-based keyword lookup. (For WIP, only for plpgsql unreserved keywords) Replace binary search over an array of ScanKeyword structs with a binary search over an array of offsets into a keyword string. Access auxiliary data only after a keyword hit. This has better locality of reference and a smaller memory footprint. 
--- src/common/keywords.c | 55 + src/include/common/keywords.h | 12 ++ src/pl/plpgsql/src/.gitignore | 1 + src/pl/plpgsql/src/Makefile | 13 +- src/pl/plpgsql/src/pl_scanner.c | 128 +--- src/pl/plpgsql/src/pl_unreserved_kwlist.h | 108 + src/tools/gen_keywords.pl | 137 ++ src/tools/msvc/Solution.pm| 10 ++ 8 files changed, 361 insertions(+), 103 deletions(-) create mode 100644 src/pl/plpgsql/src/pl_unreserved_kwlist.h create mode 100644 src/tools/gen_keywords.pl diff --git a/src/common/keywords.c b/src/common/keywords.c index 0c0c794c68..b0e5a721b6 100644 --- a/src/common/keywords.c +++ b/src/common/keywords.c @@ -112,3 +112,58 @@ ScanKeywordLookup(const char *text, return NULL; } + +/* Like ScanKeywordLookup, but uses offsets into a keyword string. */ +int +ScanKeywordLookupOffset(const char *string_to_lookup, + const char *kw_strings, + const uint16 *kw_offsets, + int num_keywords) +{ + int len, +i; + char word[NAMEDATALEN]; + const uint16 *low; + const uint16 *high; + + len = strlen(string_to_lookup); + /* We assume all keywords are shorter than NAMEDATALEN. */ + if (len >= NAMEDATALEN) + return -1; + + /* + * Apply an ASCII-only downcasing. We must not use tolower() since it may + * produce the wrong translation in some locales (eg, Turkish). + */ + for (i = 0; i < len; i++) + { + char ch = string_to_lookup[i]; + + if (ch >= 'A' && ch <= 'Z') + ch += 'a' - 'A'; + word[i] = ch; + } + word[len] = '\0'; + + /* + * Now do a binary search using plain strcmp() comparison. 
+ */ + low = kw_offsets; + high = kw_offsets + (num_keywords - 1); + while (low <= high) + { + const uint16 *middle; + int difference; + + middle = low + (high - low) / 2; + difference = strcmp(kw_strings + *middle, word); + if (difference == 0) + return middle - kw_offsets; + else if (difference < 0) + low = middle + 1; + else + high = middle - 1; + } + + return -1; +} diff --git a/src/include/common/keywords.h b/src/include/common/keywords.h index 0b31505b66..201d0fcc7a 100644 --- a/src/include/common/keywords.h +++ b/src/include/common/keywords.h @@ -28,6 +28,13 @@ typedef struct ScanKeyword int16 category; /* see codes above */ } ScanKeyword; +/* Payload data for keywords */ +typedef struct ScanKeywordAux +{ + int16 value; /* grammar's token code */ + char category; /* see codes above */ +} ScanKeywordAux; + #ifndef FRONTEND extern PGDLLIMPORT const ScanKeyword ScanKeywords[]; extern PGDLLIMPORT const int NumScanKeywords; @@ -41,4 +48,9 @@ extern const ScanKeyword *ScanKeywordLookup(const char *text, const ScanKeyword *keywords, int num_keywords); +int ScanKeywordLookupOffset(const char *string_to_lookup, + const char *kw_strings, + const uint16 *kw_offsets, + int num_keywords); + #endif /* KEYWORDS_H */ diff --git a/src/pl/plpgsql/src/.gitignore b/src/pl/plpgsql/src/.gitignore index ff6ac965fd..a649302fdb 100644 --- a/src/pl/plpgsql/src/.gitignore +++ b/s
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Hi, On 2018-12-26 14:03:57 -0500, Tom Lane wrote: > Andres Freund writes: > > My bet is, and has been for quite a while, that we'll have to go for a > > hand-written recursive descent type parser. > > I will state right up front that that will happen over my dead body. > > It's impossible to write correct RD parsers by hand for any but the most > trivial, conflict-free languages, and what we have got to deal with > is certainly neither of those; moreover, it's a constantly moving target. > We'd be buying into an endless landscape of parser bugs if we go that way. > It's *not* worth it. It's not exactly new that people end up moving from bison to recursive descent parsers once they hit the performance problems and want to give better error messages. E.g. both gcc and clang have hand-written recursive-descent parsers for C and C++ these days. I don't buy that we're unable to write a descent parser that way. What I *do* buy is that it's more problematic for the design of our SQL dialect, because the use of bison often uncovers ambiguities in new extensions of the language. And I don't really have a good idea how to handle that. Greetings, Andres Freund
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 12/26/18, Robert Haas wrote: > I wonder if we could do something really simple like a lookup based on > the first character of the scan keyword. It looks to me like there are > 440 keywords right now, and the most common starting letter is 'c', > which is the first letter of 51 keywords. So dispatching based on the > first letter clips at least 3 steps off the binary search. I don't > know whether that's enough to be worthwhile, but it's probably pretty > simple to implement. Using radix tree structures for the top couple of node levels is a known technique to optimize tries that need to be more space-efficient at lower levels, so this has precedent. In this case there would be a space trade off of (alphabet size, rounded up) * (size of index to lower boundary + size of index to upper boundary) = 32 * (2 + 2) = 128 bytes which is pretty small compared to what we'll save by offset-based lookup. On average, there'd be 4.1 binary search steps, which is nice. I agree it'd be fairly simple to do, and might raise the bar for doing anything more complex. -John Naylor
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Andres Freund writes: > On 2018-12-26 10:45:11 -0500, Robert Haas wrote: >> I'm not sure that I understand quite what you have in mind for a >> serialized non-perfect hashtable. Are you thinking that we'd just >> construct a simplehash and serialize it? > I was basically thinking that we'd have the perl script implement a > simple hash and put the keyword (pointers) into an array, handling > conflicts with the simplest linear probing thinkable. As there's never a > need for modifications, that ought to be fairly simple. I think it was Knuth who said that when you use hashing, you are putting a great deal of faith in the average case, because the worst case is terrible. The applicability of that to this problem is that if you hit a bad case (say, a long collision chain affecting some common keywords) you could end up with poor performance that affects a lot of people for a long time. And our keyword list is not so static that you could prove once that the behavior is OK and then forget about it. So I'm suspicious of proposals to use simplistic hashing here. There might well be some value in Robert's idea of keying off the first letter to get rid of the first few binary-search steps, not least because those steps are particularly terrible from a cache-footprint perspective. I'm not sold on doing anything significantly more invasive than that. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Andres Freund writes: > My bet is, and has been for quite a while, that we'll have to go for a > hand-written recursive descent type parser. I will state right up front that that will happen over my dead body. It's impossible to write correct RD parsers by hand for any but the most trivial, conflict-free languages, and what we have got to deal with is certainly neither of those; moreover, it's a constantly moving target. We'd be buying into an endless landscape of parser bugs if we go that way. It's *not* worth it. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Hi, On 2018-12-26 10:45:11 -0500, Robert Haas wrote: > I'm not sure that I understand quite what you have in mind for a > serialized non-perfect hashtable. Are you thinking that we'd just > construct a simplehash and serialize it? I was basically thinking that we'd have the perl script implement a simple hash and put the keyword (pointers) into an array, handling conflicts with the simplest linear probing thinkable. As there's never a need for modifications, that ought to be fairly simple. Greetings, Andres Freund
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Hi, On 2018-12-26 11:50:18 -0500, Robert Haas wrote: > On Wed, Dec 26, 2018 at 11:22 AM Tom Lane wrote: > > I think there's a lot of goalpost-moving going on here. The original > > idea was to trim the physical size of the data structure, as stated > > in the thread subject, and just reap whatever cache benefits we got > > along the way from that. I am dubious that we actually have any > > performance problem in this code that needs a big dollop of added > > complexity to fix. > > I have seen ScanKeywordLookup show up in profiles quite often and > fairly prominently -- like several percent of total runtime. I'm not > trying to impose requirements on John's patch, and I agree that > reducing the physical size of the structure is a good step whether > anything else is done or not. However, I don't see that as a reason to > shut down further discussion of other possible improvements. If his > patch makes this disappear from profiles, cool, but if it doesn't, > then sooner or later somebody's going to want to do more. I agree. And most of the patch would be a pre-requisite for anything more elaborate anyway. > FWIW, my bet is this helps but isn't enough to get rid of the problem > completely. A 9-step binary search has got to be slower than a really > well-optimized hash table lookup. Yea, at least with a non-optimized layout. If we'd used a binary search optimized lookup order it might be different, but probably at best equivalent to a good hashtable. > > In my hands, the only part of the low-level parsing code that commonly > > shows up as interesting in profiles is the Bison engine. That's probably > > because the grammar tables are circa half a megabyte and blow out cache > > pretty badly :-(. I don't know of any way to make that better, > > unfortunately. I suspect that it's just going to get worse, because > > people keep submitting additions to the grammar. 
> > I'm kinda surprised that you haven't seen ScanKeywordLookup() in > there, but I agree with you that the size of the main parser tables is > a real issue, and that there's no easy solution. At various times > there has been discussion of using some other parser generator, and > I've also toyed with the idea of writing one specifically for > PostgreSQL. Unfortunately, it seems like bison is all but > unmaintained; the alternatives are immature and have limited adoption > and limited community; and writing something from scratch is a ton of > work. :-( My bet is, and has been for quite a while, that we'll have to go for a hand-written recursive descent type parser. They can be *substantially* faster, and performance isn't as affected by the grammar size. And, about as important, they also allow for a lot more heuristics around grammar errors - I do think we'll soon have to do better than to throw a generic syntax error for the cases where the grammar doesn't match at all. Greetings, Andres Freund
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Robert Haas writes: > I'm kinda surprised that you haven't seen ScanKeywordLookup() in > there, but I agree with you that the size of the main parser tables is > a real issue, and that there's no easy solution. At various times > there has been discussion of using some other parser generator, and > I've also toyed with the idea of writing one specifically for > PostgreSQL. Unfortunately, it seems like bison is all but > unmaintained; the alternatives are immature and have limited adoption > and limited community; and writing something from scratch is a ton of > work. :-( Yeah, and also: SQL is a damn big and messy language, and so it's not very clear that it's really bison's fault that it's slow to parse. We might do a ton of work to implement an alternative, and then find ourselves no better off. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
David Fetter writes: > On Wed, Dec 26, 2018 at 11:22:39AM -0500, Tom Lane wrote: >> In my hands, the only part of the low-level parsing code that >> commonly shows up as interesting in profiles is the Bison engine. > Should we be considering others? We've looked around before, IIRC, and not really seen any arguably better tools. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Wed, Dec 26, 2018 at 11:22 AM Tom Lane wrote: > I think there's a lot of goalpost-moving going on here. The original > idea was to trim the physical size of the data structure, as stated > in the thread subject, and just reap whatever cache benefits we got > along the way from that. I am dubious that we actually have any > performance problem in this code that needs a big dollop of added > complexity to fix. I have seen ScanKeywordLookup show up in profiles quite often and fairly prominently -- like several percent of total runtime. I'm not trying to impose requirements on John's patch, and I agree that reducing the physical size of the structure is a good step whether anything else is done or not. However, I don't see that as a reason to shut down further discussion of other possible improvements. If his patch makes this disappear from profiles, cool, but if it doesn't, then sooner or later somebody's going to want to do more. FWIW, my bet is this helps but isn't enough to get rid of the problem completely. A 9-step binary search has got to be slower than a really well-optimized hash table lookup. In a perfect world the latter touches the cache line containing the keyword -- which presumably is already in cache since we just scanned it -- then computes a hash value without touching any other cache lines -- and then goes straight to the right entry. So it touches ONE new cache line. That might be a level of optimization that's hard to achieve in practice, but I don't think it's crazy to want to get there. > In my hands, the only part of the low-level parsing code that commonly > shows up as interesting in profiles is the Bison engine. That's probably > because the grammar tables are circa half a megabyte and blow out cache > pretty badly :-(. I don't know of any way to make that better, > unfortunately. I suspect that it's just going to get worse, because > people keep submitting additions to the grammar. 
I'm kinda surprised that you haven't seen ScanKeywordLookup() in there, but I agree with you that the size of the main parser tables is a real issue, and that there's no easy solution. At various times there has been discussion of using some other parser generator, and I've also toyed with the idea of writing one specifically for PostgreSQL. Unfortunately, it seems like bison is all but unmaintained; the alternatives are immature and have limited adoption and limited community; and writing something from scratch is a ton of work. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Wed, Dec 26, 2018 at 11:22:39AM -0500, Tom Lane wrote: > > In my hands, the only part of the low-level parsing code that > commonly shows up as interesting in profiles is the Bison engine. Should we be considering others? As I understand it, steps have been made in this field since yacc was originally designed. Is LALR actually suitable for languages like SQL, or is it just there for historical reasons? Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Robert Haas writes: > On Wed, Dec 19, 2018 at 8:01 PM Andres Freund wrote: >> The last time I looked into perfect hash functions, it wasn't easy to >> find a generator that competed with a decent normal hashtable (in >> particular gperf's are very unconvincing). The added tooling is a >> concern imo. OTOH, we're comparing not with a hashtable, but a binary >> search, where the latter will usually lose. Wonder if we shouldn't >> generate a serialized non-perfect hashtable instead. The lookup code for >> a read-only hashtable without concern for adversarial input is pretty >> trivial. > I wonder if we could do something really simple like a lookup based on > the first character of the scan keyword. It looks to me like there are > 440 keywords right now, and the most common starting letter is 'c', > which is the first letter of 51 keywords. So dispatching based on the > first letter clips at least 3 steps off the binary search. I don't > know whether that's enough to be worthwhile, but it's probably pretty > simple to implement. I think there's a lot of goalpost-moving going on here. The original idea was to trim the physical size of the data structure, as stated in the thread subject, and just reap whatever cache benefits we got along the way from that. I am dubious that we actually have any performance problem in this code that needs a big dollop of added complexity to fix. In my hands, the only part of the low-level parsing code that commonly shows up as interesting in profiles is the Bison engine. That's probably because the grammar tables are circa half a megabyte and blow out cache pretty badly :-(. I don't know of any way to make that better, unfortunately. I suspect that it's just going to get worse, because people keep submitting additions to the grammar. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On Wed, Dec 19, 2018 at 8:01 PM Andres Freund wrote: > The last time I looked into perfect hash functions, it wasn't easy to > find a generator that competed with a decent normal hashtable (in > particular gperf's are very unconvincing). The added tooling is a > concern imo. OTOH, we're comparing not with a hashtable, but a binary > search, where the latter will usually lose. Wonder if we shouldn't > generate a serialized non-perfect hashtable instead. The lookup code for > a read-only hashtable without concern for adversarial input is pretty > trivial. I wonder if we could do something really simple like a lookup based on the first character of the scan keyword. It looks to me like there are 440 keywords right now, and the most common starting letter is 'c', which is the first letter of 51 keywords. So dispatching based on the first letter clips at least 3 steps off the binary search. I don't know whether that's enough to be worthwhile, but it's probably pretty simple to implement. I'm not sure that I understand quite what you have in mind for a serialized non-perfect hashtable. Are you thinking that we'd just construct a simplehash and serialize it? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 12/22/18, Tom Lane wrote: > John Naylor writes: >> Using a single file also gave me another idea: Take value and category >> out of ScanKeyword, and replace them with an index into another array >> containing those, which will only be accessed in the event of a hit. >> That would shrink ScanKeyword to 4 bytes (offset, index), further >> increasing locality of reference. Might not be worth it, but I can try >> it after moving on to the core scanner. > > I like that idea a *lot*, actually, because it offers the opportunity > to decouple this mechanism from all assumptions about what the > auxiliary data for a keyword is. Okay, in that case I went ahead and did it for WIP v3. > (it'd be a good idea to have a switch that allows specifying the > prefix of these constant names). Done as an optional switch, and tested, but not yet used in favor of the previous method as a fallback. I'll probably do it in the final version to keep lines below 80, and to add 'core_' to the core keyword vars.

> /* Payload data for keywords */
> typedef struct MyKeyword
> {
>     int16    value;
>     int16    category;
> } MyKeyword;

I tweaked this a bit to

typedef struct ScanKeywordAux
{
	int16		value;			/* grammar's token code */
	char		category;		/* see codes above */
} ScanKeywordAux;

It seems that category was only 2 bytes to make ScanKeyword a power of 2 (of course that was on 32 bit machines and doesn't hold true anymore). Using char will save another few hundred bytes in the core scanner. Since we're only accessing this once per identifier, we may not need to worry so much about memory alignment. > Aside from being arguably better from the locality-of-reference > standpoint, this gets us out of the weird ifdef'ing you've got in > the v2 patch. The kwlist_d.h headers can be very ordinary headers. Yeah, that's a nice (and for me unexpected) bonus.
-John Naylor

 src/common/keywords.c                     |  55
 src/include/common/keywords.h             |  12 +++
 src/pl/plpgsql/src/.gitignore             |   1 +
 src/pl/plpgsql/src/Makefile               |  13 +--
 src/pl/plpgsql/src/pl_scanner.c           | 127 +++
 src/pl/plpgsql/src/pl_unreserved_kwlist.h | 107 +++
 src/tools/gen_keywords.pl                 | 139 ++
 src/tools/msvc/Solution.pm                |  10 +++
 8 files changed, 361 insertions(+), 103 deletions(-)

diff --git a/src/common/keywords.c b/src/common/keywords.c
index 0c0c794c68..b0e5a721b6 100644
--- a/src/common/keywords.c
+++ b/src/common/keywords.c
@@ -112,3 +112,58 @@ ScanKeywordLookup(const char *text,
 	return NULL;
 }
+
+/* Like ScanKeywordLookup, but uses offsets into a keyword string. */
+int
+ScanKeywordLookupOffset(const char *string_to_lookup,
+						const char *kw_strings,
+						const uint16 *kw_offsets,
+						int num_keywords)
+{
+	int			len,
+				i;
+	char		word[NAMEDATALEN];
+	const uint16 *low;
+	const uint16 *high;
+
+	len = strlen(string_to_lookup);
+	/* We assume all keywords are shorter than NAMEDATALEN. */
+	if (len >= NAMEDATALEN)
+		return -1;
+
+	/*
+	 * Apply an ASCII-only downcasing. We must not use tolower() since it may
+	 * produce the wrong translation in some locales (eg, Turkish).
+	 */
+	for (i = 0; i < len; i++)
+	{
+		char		ch = string_to_lookup[i];
+
+		if (ch >= 'A' && ch <= 'Z')
+			ch += 'a' - 'A';
+		word[i] = ch;
+	}
+	word[len] = '\0';
+
+	/*
+	 * Now do a binary search using plain strcmp() comparison.
+	 */
+	low = kw_offsets;
+	high = kw_offsets + (num_keywords - 1);
+	while (low <= high)
+	{
+		const uint16 *middle;
+		int			difference;
+
+		middle = low + (high - low) / 2;
+		difference = strcmp(kw_strings + *middle, word);
+		if (difference == 0)
+			return middle - kw_offsets;
+		else if (difference < 0)
+			low = middle + 1;
+		else
+			high = middle - 1;
+	}
+
+	return -1;
+}
diff --git a/src/include/common/keywords.h b/src/include/common/keywords.h
index 0b31505b66..201d0fcc7a 100644
--- a/src/include/common/keywords.h
+++ b/src/include/common/keywords.h
@@ -28,6 +28,13 @@ typedef struct ScanKeyword
 	int16		category;		/* see codes above */
 } ScanKeyword;
+/* Payload data for keywords */
+typedef struct ScanKeywordAux
+{
+	int16		value;			/* grammar's token code */
+	char		category;		/* see codes above */
+} ScanKeywordAux;
+
 #ifndef FRONTEND
 extern PGDLLIMPORT const ScanKeyword ScanKeywords[];
 extern PGDLLIMPORT const int NumScanKeywords;
@@ -41,4 +48,9 @@ extern const ScanKeyword *ScanKeywordLookup(const char *text,
 			const ScanKeyword *keywords,
 			int num_keywords);
+int ScanKeywordLookupOffset(const char *string_to_lookup,
+						const char *kw_strings,
+						const uint16 *kw_offsets,
+						int num_keywords);
+
 #endif							/* KEYWORDS_H */
diff --git a/src/pl/plpgsql/src/.gitignore b/src/pl/plpgsql/src/.gitignore
index ff6ac965fd..a649302fdb 100644
--- a/src/pl/plpgsql/src/.
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Andres Freund writes: > On 2018-12-22 12:20:00 -0500, Tom Lane wrote: >> I like that idea a *lot*, actually, because it offers the opportunity >> to decouple this mechanism from all assumptions about what the >> auxiliary data for a keyword is. > OTOH, it doubles or triples the number of cachelines accessed when > encountering a keyword. Compared to what? The current situation in that regard is a mess. Also, AFAICS this proposal involves the least amount of data touched during the lookup phase of anything we've discussed, so I do not even accept that your criticism is correct. One extra cacheline fetch to get the aux data for a particular keyword after the search is not going to tip the scales away from this being a win. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Hi, On 2018-12-22 12:20:00 -0500, Tom Lane wrote: > John Naylor writes: > > Using a single file also gave me another idea: Take value and category > > out of ScanKeyword, and replace them with an index into another array > > containing those, which will only be accessed in the event of a hit. > > That would shrink ScanKeyword to 4 bytes (offset, index), further > > increasing locality of reference. Might not be worth it, but I can try > > it after moving on to the core scanner. > > I like that idea a *lot*, actually, because it offers the opportunity > to decouple this mechanism from all assumptions about what the > auxiliary data for a keyword is. OTOH, it doubles or triples the number of cachelines accessed when encountering a keyword. The fraction of keywords to not-keywords in SQL makes me wonder whether that makes it a good deal. Greetings, Andres Freund
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
John Naylor writes: > Using a single file also gave me another idea: Take value and category > out of ScanKeyword, and replace them with an index into another array > containing those, which will only be accessed in the event of a hit. > That would shrink ScanKeyword to 4 bytes (offset, index), further > increasing locality of reference. Might not be worth it, but I can try > it after moving on to the core scanner. I like that idea a *lot*, actually, because it offers the opportunity to decouple this mechanism from all assumptions about what the auxiliary data for a keyword is. Basically, we'd redefine ScanKeywordLookup as having the API "given a string, return a keyword index if it is a keyword, -1 if it isn't"; then the caller would use the keyword index to look up the auxiliary data in a table that it owns, and ScanKeywordLookup doesn't know about at all. So that leads to a design like this: the master data is in a header that's just like kwlist.h is today, except now we are thinking of PG_KEYWORD as an N-argument macro not necessarily exactly 3 arguments. The Perl script reads that, paying attention only to the first argument of the macro calls, and outputs a file containing, say,

static const uint16 kw_offsets[] = { 0, 6, 15, ... };

static const char kw_strings[] =
	"abort\0"
	"absolute\0"
	... ;

(it'd be a good idea to have a switch that allows specifying the prefix of these constant names). 
Then ScanKeywordLookup has the signature

int ScanKeywordLookup(const char *string_to_lookup,
					  const char *kw_strings,
					  const uint16 *kw_offsets,
					  int num_keywords);

and a file using this stuff looks something like

/* Payload data for keywords */
typedef struct MyKeyword
{
	int16		value;
	int16		category;
} MyKeyword;

#define PG_KEYWORD(kwname, value, category) {value, category},

static const MyKeyword MyKeywords[] = {
#include "kwlist.h"
};

/* String lookup table for keywords */
#include "kwlist_d.h"

/* Lookup code looks about like this: */

kwnum = ScanKeywordLookup(str, kw_strings, kw_offsets,
						  lengthof(kw_offsets));
if (kwnum >= 0)
	... look into MyKeywords[kwnum] for info ...

Aside from being arguably better from the locality-of-reference standpoint, this gets us out of the weird ifdef'ing you've got in the v2 patch. The kwlist_d.h headers can be very ordinary headers. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 12/20/18, Tom Lane wrote: > I'd be inclined to put the script in src/tools, I think. IMO src/common > is for code that actually gets built into our executables. Done. >> which takes >> pl_unreserved_kwlist.h as input and outputs >> pl_unreserved_kwlist_offset.h and pl_unreserved_kwlist_string.h. > > I wonder whether we'd not be better off producing just one output > file, in which we have the offsets emitted as PG_KEYWORD macros > and then the giant string emitted as a macro definition, ie > something like > > #define PG_KEYWORD_STRING \ > "absolute\0" \ > "alias\0" \ > ... > > That simplifies the Makefile-hacking, at least, and it possibly gives > callers more flexibility about what they actually want to do with the > string. Okay, I tried that. Since the script is turning one header into another, I borrowed the "*_d.h" nomenclature from the catalogs. Using a single file required some #ifdef hacks in the output file. Maybe there's a cleaner way to do this, but I don't know what it is. Using a single file also gave me another idea: Take value and category out of ScanKeyword, and replace them with an index into another array containing those, which will only be accessed in the event of a hit. That would shrink ScanKeyword to 4 bytes (offset, index), further increasing locality of reference. Might not be worth it, but I can try it after moving on to the core scanner. > I'm for "not change things unnecessarily". People might well be > scraping the keyword list out of parser/kwlist.h for other purposes > right now --- indeed, it's defined the way it is exactly to let > people do that. I don't see a good reason to force them to redo > whatever tooling they have that depends on that. So let's build > kwlist_offsets.h alongside that, but not change kwlist.h itself. Done. >> I used the global .gitignore, but maybe that's an abuse of it. > > Yeah, I'd say it is. Moved. >> +# TODO: Error out if the keyword names are not in ASCII order. 
> > +many for including such a check. Done. > Also note that we don't require people to have Perl installed when > building from a tarball. Therefore, these derived headers must get > built during "make distprep" and removed by maintainer-clean but > not distclean. I think this also has some implications for VPATH > builds, but as long as you follow the pattern used for other > derived header files (e.g. fmgroids.h), you should be fine. Done. I also blindly added support for MSVC.

-John Naylor

 src/common/keywords.c                     |  55
 src/include/common/keywords.h             |  14 +++
 src/pl/plpgsql/src/.gitignore             |   1 +
 src/pl/plpgsql/src/Makefile               |   9 +-
 src/pl/plpgsql/src/pl_scanner.c           | 119 ++
 src/pl/plpgsql/src/pl_unreserved_kwlist.h | 107 +++
 src/tools/gen_keywords.pl                 | 136 ++
 src/tools/msvc/Solution.pm                |  10 +++
 8 files changed, 356 insertions(+), 95 deletions(-)

diff --git a/src/common/keywords.c b/src/common/keywords.c
index 0c0c794c68..0fb14a0372 100644
--- a/src/common/keywords.c
+++ b/src/common/keywords.c
@@ -112,3 +112,58 @@ ScanKeywordLookup(const char *text,
 	return NULL;
 }
+
+/* Like ScanKeywordLookup, but uses offsets into a keyword string. */
+const ScanKeywordOffset *
+ScanKeywordLookupOffset(const char *text,
+						const ScanKeywordOffset *keywords,
+						int num_keywords,
+						const char *keyword_string)
+{
+	int			len,
+				i;
+	char		word[NAMEDATALEN];
+	const ScanKeywordOffset *low;
+	const ScanKeywordOffset *high;
+
+	len = strlen(text);
+	/* We assume all keywords are shorter than NAMEDATALEN. */
+	if (len >= NAMEDATALEN)
+		return NULL;
+
+	/*
+	 * Apply an ASCII-only downcasing. We must not use tolower() since it may
+	 * produce the wrong translation in some locales (eg, Turkish).
+	 */
+	for (i = 0; i < len; i++)
+	{
+		char		ch = text[i];
+
+		if (ch >= 'A' && ch <= 'Z')
+			ch += 'a' - 'A';
+		word[i] = ch;
+	}
+	word[len] = '\0';
+
+	/*
+	 * Now do a binary search using plain strcmp() comparison.
+	 */
+	low = keywords;
+	high = keywords + (num_keywords - 1);
+	while (low <= high)
+	{
+		const ScanKeywordOffset *middle;
+		int			difference;
+
+		middle = low + (high - low) / 2;
+		difference = strcmp(keyword_string + middle->offset, word);
+		if (difference == 0)
+			return middle;
+		else if (difference < 0)
+			low = middle + 1;
+		else
+			high = middle - 1;
+	}
+
+	return NULL;
+}
diff --git a/src/include/common/keywords.h b/src/include/common/keywords.h
index 0b31505b66..7cd7bf4461 100644
--- a/src/include/common/keywords.h
+++ b/src/include/common/keywords.h
@@ -28,11 +28,20 @@ typedef struct ScanKeyword
 	int16		category;		/* see codes above */
 } ScanKeyword;
+typedef struct ScanKeywordOffset
+{
+	int32		offset;			/* offset into a keyword string */
+	int16		value;			/* grammar's token code */
+	int16		category;		/* see codes
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
John Naylor writes: > On 12/18/18, Tom Lane wrote: >> I'd be kind of inclined to convert all uses of ScanKeyword to the new way, >> if only for consistency's sake. On the other hand, I'm not the one >> volunteering to do the work. > That's reasonable, as long as the design is nailed down first. Along > those lines, attached is a heavily WIP patch that only touches plpgsql > unreserved keywords, to test out the new methodology in a limited > area. After settling APIs and name/directory bikeshedding, I'll move > on to the other four keyword types. Let the bikeshedding begin ... > There's a new Perl script, src/common/gen_keywords.pl, I'd be inclined to put the script in src/tools, I think. IMO src/common is for code that actually gets built into our executables. > which takes > pl_unreserved_kwlist.h as input and outputs > pl_unreserved_kwlist_offset.h and pl_unreserved_kwlist_string.h. I wonder whether we'd not be better off producing just one output file, in which we have the offsets emitted as PG_KEYWORD macros and then the giant string emitted as a macro definition, ie something like

#define PG_KEYWORD_STRING \
	"absolute\0" \
	"alias\0" \
	...

That simplifies the Makefile-hacking, at least, and it possibly gives callers more flexibility about what they actually want to do with the string. > The > output headers are not installed or symlinked anywhere. Since the > input keyword lists will never be #included directly, they might be > better as .txt files, like errcodes.txt. If we went that far, we might > also remove the PG_KEYWORD macros (they'd still be in the output > files) and rename parser/kwlist.h to common/core_kwlist.txt. There's > also a case for not changing things unnecessarily, especially if > there's ever a new reason to include the base keyword list directly. I'm for "not change things unnecessarily". 
People might well be scraping the keyword list out of parser/kwlist.h for other purposes right now --- indeed, it's defined the way it is exactly to let people do that. I don't see a good reason to force them to redo whatever tooling they have that depends on that. So let's build kwlist_offsets.h alongside that, but not change kwlist.h itself. > To keep the other keyword types functional, I had to add a separate > new struct ScanKeywordOffset and new function > ScanKeywordLookupOffset(), so the patch is a bit messier than the > final will be. Check. > I used the global .gitignore, but maybe that's an abuse of it. Yeah, I'd say it is. > +# TODO: Error out if the keyword names are not in ASCII order. +many for including such a check. Also note that we don't require people to have Perl installed when building from a tarball. Therefore, these derived headers must get built during "make distprep" and removed by maintainer-clean but not distclean. I think this also has some implications for VPATH builds, but as long as you follow the pattern used for other derived header files (e.g. fmgroids.h), you should be fine. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 12/19/18, Andrew Gierth wrote: > Is there any particular reason not to go further and use a perfect hash > function for the lookup, rather than binary search? When I was investigating faster algorithms, I ruled out gperf based on discussions in the archives. The approach here has modest goals and shouldn't be too invasive. With the makefile support and separate keyword files in place, that'll be one less thing to do if we ever decide to replace binary search. The giant string will likely be useful as well. Since we're on the subject, I think some kind of trie would be ideal performance-wise, but a large amount of work. The nice thing about a trie is that it can be faster than a hash table for a key miss. I found a paper that described some space-efficient trie variations [1], but we'd likely have to code the algorithm and a way to emit a C code representation of it. I've found some libraries, but that would have more of the same difficulties in practicality that gperf had. [1] https://infoscience.epfl.ch/record/64394/files/triesearches.pdf -John Naylor
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Andrew Gierth writes: > Is there any particular reason not to go further and use a perfect hash > function for the lookup, rather than binary search? Tooling? I seem to recall having looked at gperf and deciding that it pretty much sucked, so it's not real clear to me what we would use. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Hi, On 2018-12-20 00:54:39 +, Andrew Gierth wrote: > > "John" == John Naylor writes: > > > On 12/18/18, Tom Lane wrote: > >> I'd be kind of inclined to convert all uses of ScanKeyword to the > >> new way, if only for consistency's sake. On the other hand, I'm not > >> the one volunteering to do the work. > > John> That's reasonable, as long as the design is nailed down first. > John> Along those lines, attached is a heavily WIP patch that only > John> touches plpgsql unreserved keywords, to test out the new > John> methodology in a limited area. After settling APIs and > John> name/directory bikeshedding, I'll move on to the other four > John> keyword types. > > Is there any particular reason not to go further and use a perfect hash > function for the lookup, rather than binary search? The last time I looked into perfect hash functions, it wasn't easy to find a generator that competed with a decent normal hashtable (in particular gperf's are very unconvincing). The added tooling is a concern imo. OTOH, we're comparing not with a hashtable, but a binary search, where the latter will usually lose. Wonder if we shouldn't generate a serialized non-perfect hashtable instead. The lookup code for a read-only hashtable without concern for adversarial input is pretty trivial. Greetings, Andres Freund
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
> "John" == John Naylor writes: > On 12/18/18, Tom Lane wrote: >> I'd be kind of inclined to convert all uses of ScanKeyword to the >> new way, if only for consistency's sake. On the other hand, I'm not >> the one volunteering to do the work. John> That's reasonable, as long as the design is nailed down first. John> Along those lines, attached is a heavily WIP patch that only John> touches plpgsql unreserved keywords, to test out the new John> methodology in a limited area. After settling APIs and John> name/directory bikeshedding, I'll move on to the other four John> keyword types. Is there any particular reason not to go further and use a perfect hash function for the lookup, rather than binary search? -- Andrew (irc:RhodiumToad)
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 12/18/18, Tom Lane wrote: > I'd be kind of inclined to convert all uses of ScanKeyword to the new way, > if only for consistency's sake. On the other hand, I'm not the one > volunteering to do the work. That's reasonable, as long as the design is nailed down first. Along those lines, attached is a heavily WIP patch that only touches plpgsql unreserved keywords, to test out the new methodology in a limited area. After settling APIs and name/directory bikeshedding, I'll move on to the other four keyword types. There's a new Perl script, src/common/gen_keywords.pl, which takes pl_unreserved_kwlist.h as input and outputs pl_unreserved_kwlist_offset.h and pl_unreserved_kwlist_string.h. The output headers are not installed or symlinked anywhere. Since the input keyword lists will never be #included directly, they might be better as .txt files, like errcodes.txt. If we went that far, we might also remove the PG_KEYWORD macros (they'd still be in the output files) and rename parser/kwlist.h to common/core_kwlist.txt. There's also a case for not changing things unnecessarily, especially if there's ever a new reason to include the base keyword list directly. To keep the other keyword types functional, I had to add a separate new struct ScanKeywordOffset and new function ScanKeywordLookupOffset(), so the patch is a bit messier than the final will be. With a 4-byte offset, ScanKeywordOffset is 8 bytes, down from 12, and is now a power of 2. I used the global .gitignore, but maybe that's an abuse of it. Make check passes, but I don't know how well it stresses keyword use. I'll create a commitfest entry soon. 
-John Naylor

 .gitignore                                |   2 +
 src/common/gen_keywords.pl                | 151 ++
 src/common/keywords.c                     |  55 +++
 src/include/common/keywords.h             |  14 +++
 src/pl/plpgsql/src/Makefile               |  13 ++-
 src/pl/plpgsql/src/pl_scanner.c           | 108 -
 src/pl/plpgsql/src/pl_unreserved_kwlist.h | 107 +
 7 files changed, 355 insertions(+), 95 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b73c..8a9a05c20a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -31,6 +31,8 @@ win32ver.rc
 *.exe
 lib*dll.def
 lib*.pc
+*kwlist_offset.h
+*kwlist_string.h
 # Local excludes in root directory
 /GNUmakefile
diff --git a/src/common/gen_keywords.pl b/src/common/gen_keywords.pl
new file mode 100644
index 00..a8a8576d43
--- /dev/null
+++ b/src/common/gen_keywords.pl
@@ -0,0 +1,151 @@
+#--
+
+# gen_keywords.pl
+#	Perl script that generates *_offset.h and *_string.h from a given
+#	keyword list file. These headers are then included into files that
+#	call ScanKeywordLookup() on that keyword list. The keyword name is
+#	is represented as an offset into a single string.
+#
+# Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/commen/gen_keywords.pl
+#
+#--
+
+my $output_path = '';
+my $kw_input_file;
+
+# Process command line switches.
+while (@ARGV)
+{
+	my $arg = shift @ARGV;
+	if ($arg !~ /^-/)
+	{
+		$kw_input_file = $arg;
+	}
+	elsif ($arg =~ /^-o/)
+	{
+		$output_path = length($arg) > 2 ? substr($arg, 2) : shift @ARGV;
+	}
+	else
+	{
+		usage();
+	}
+}
+
+# Make sure output_path ends in a slash.
+if ($output_path ne '' && substr($output_path, -1) ne '/')
+{
+	$output_path .= '/';
+}
+
+$kw_input_file =~ /(\w*kwlist)\.h/;
+my $base_filename = $1;
+
+my $kw_offset_file = $output_path . $base_filename . '_offset.h';
+my $kw_string_file = $output_path . $base_filename . '_string.h';
+
+open(my $kif, '<', $kw_input_file) || die "$kw_input_file: $!";
+open(my $kof, '>', $kw_offset_file) || die "$kw_offset_file: $!";
+open(my $ksf, '>', $kw_string_file) || die "$kw_string_file: $!";
+
+# Opening boilerplate for keyword offset header.
+printf $kof <)
+{
+	if (/^PG_KEYWORD\("(\w+)",\s+(\w+),\s+(\w+)\)/)
+	{
+		$name = $1;
+		$value = $2;
+		$category = $3;
+
+		push @keywords, $name;
+
+		# Emit ScanKeyword macros with numerical offsets instead of text.
+		print $kof "PG_KEYWORD($offset, $value, $category)\n";
+
+		# Calculate the cumulative offset of the next keyword,
+		# taking into account the null terminator.
+		$offset += length($name) + 1;
+	}
+}
+
+# TODO: Error out if the keyword names are not in ASCII order.
+
+printf $ksf qq|const char *%s_string =\n"|, $base_filename;
+print $ksf join qq|\\0"\n"|, @keywords;
+printf $ksf qq|";\n\n#endif\t\t\t\t\t\t\t/* %s_STRING_H */\n|, uc $base_filename;
+
+sub usage
+{
+	die <= NAMEDATALEN)
+		return NULL;
+
+	/*
+	 * Apply an ASCII-only downcasing. We must not use tolower() since it may
+	 * produce the wrong translation in some locales (eg, Turkish).
+	 */
+	for (i = 0;
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
John Naylor writes:
> On 12/17/18, Tom Lane wrote:
>> Also, wouldn't we also adopt this technology for its unreserved keywords,
>> too?

> We wouldn't be forced to, but there might be other reasons to do so.
> Were you thinking of code consistency (within pl_scanner.c or
> globally)? Or something else?

> If we did adopt this setup for plpgsql unreserved keywords,
> ecpg/preproc/ecpg_keywords.c and ecpg/preproc/c_keywords.c would be
> left using the current ScanKeyword struct for search. Using offset
> search for all 5 types of keywords would be globally consistent, but
> it also means additional headers, generated headers, and makefile
> rules.

I'd be kind of inclined to convert all uses of ScanKeyword to the new
way, if only for consistency's sake.  On the other hand, I'm not the
one volunteering to do the work.

			regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 12/17/18, Tom Lane wrote:
> John Naylor writes:
>> Since PL/pgSQL uses the core scanner, we'd need to use offsets in its
>> reserved_keywords[], too. Those don't change much, so we can probably
>> get away with hard-coding the offsets and the giant string in that
>> case. (If that's not acceptable, we could separate that out to
>> pl_reserved_kwlist.h and reuse the above tooling to generate
>> pl_reserved_kwlist_{offset,string}.h, but that's more complex.)
>
> plpgsql isn't as stable as all that: people propose new syntax for it
> all the time.  I do not think a hand-maintained array would be pleasant
> at all.

Okay.

> Also, wouldn't we also adopt this technology for its unreserved keywords,
> too?

We wouldn't be forced to, but there might be other reasons to do so.
Were you thinking of code consistency (within pl_scanner.c or
globally)? Or something else?

If we did adopt this setup for plpgsql unreserved keywords,
ecpg/preproc/ecpg_keywords.c and ecpg/preproc/c_keywords.c would be
left using the current ScanKeyword struct for search. Using offset
search for all 5 types of keywords would be globally consistent, but
it also means additional headers, generated headers, and makefile
rules.

-John Naylor
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
John Naylor writes:
> A few months ago I was looking into faster search algorithms for
> ScanKeywordLookup(), so this is interesting to me. While an optimal
> full replacement would be a lot of work, the above ideas are much less
> invasive and would still have some benefit. Unless anyone intends to
> work on this, I'd like to flesh out the offset-into-giant-string
> approach a bit further:

Have at it...

> Since PL/pgSQL uses the core scanner, we'd need to use offsets in its
> reserved_keywords[], too. Those don't change much, so we can probably
> get away with hard-coding the offsets and the giant string in that
> case. (If that's not acceptable, we could separate that out to
> pl_reserved_kwlist.h and reuse the above tooling to generate
> pl_reserved_kwlist_{offset,string}.h, but that's more complex.)

plpgsql isn't as stable as all that: people propose new syntax for it
all the time.  I do not think a hand-maintained array would be pleasant
at all.

Also, wouldn't we also adopt this technology for its unreserved keywords,
too?

			regards, tom lane
reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 10/15/18, Tom Lane wrote:
> Andres Freund writes:
>> On 2018-10-15 16:36:26 -0400, Tom Lane wrote:
>>> We could possibly fix these by changing the data structure so that
>>> what's in a ScanKeywords entry is an offset into some giant string
>>> constant somewhere.  No idea how that would affect performance, but
>>> I do notice that we could reduce the sizeof(ScanKeyword), which can't
>>> hurt.
>
>> Yea, that might even help performancewise. Alternatively we could change
>> ScanKeyword to store the keyword name inline, but that'd be a measurable
>> size increase...
>
> Yeah.  It also seems like doing it this way would improve locality of
> access: the pieces of the giant string would presumably be in the same
> order as the ScanKeywords entries, whereas with the current setup,
> who knows where the compiler has put 'em or in what order.
>
> We'd need some tooling to generate the constants that way, though;
> I can't see how to make it directly from kwlist.h.

A few months ago I was looking into faster search algorithms for
ScanKeywordLookup(), so this is interesting to me. While an optimal
full replacement would be a lot of work, the above ideas are much less
invasive and would still have some benefit. Unless anyone intends to
work on this, I'd like to flesh out the offset-into-giant-string
approach a bit further:

Since there are several callers of the current approach that don't use
the core keyword list, we'd have to keep the existing struct and lookup
function, to keep the complexity manageable. Once we have an
offset-based struct and function, it makes sense to use it for all
searches of core keywords. This includes not only the core scanner, but
also adt/ruleutils.c, fe_utils/string_utils.c, and
ecpg/preproc/keywords.c.

There would need to be a header with offsets replacing name strings,
generated from parser/kwlist.h, maybe kwlist_offset.h. It'd probably be
convenient if it was emitted into the common/ dir. The giant string
would likely need its own header (kwlist_string.h?).

Since PL/pgSQL uses the core scanner, we'd need to use offsets in its
reserved_keywords[], too. Those don't change much, so we can probably
get away with hard-coding the offsets and the giant string in that
case. (If that's not acceptable, we could separate that out to
pl_reserved_kwlist.h and reuse the above tooling to generate
pl_reserved_kwlist_{offset,string}.h, but that's more complex.)

The rest should be just a SMOP. Any issues I left out?

-John Naylor