Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi Ashutosh: Thanks for your time. On Fri, Feb 7, 2020 at 11:54 PM Ashutosh Bapat wrote: > Hi Andy, > What might help is to add more description to your email message like > giving examples to explain your idea. > > Anyway, I looked at the testcases you added for examples. > +create table select_distinct_a(a int, b char(20), c char(20) not null, > d int, e int, primary key(a, b)); > +set enable_mergejoin to off; > +set enable_hashjoin to off; > +-- no node for distinct. > +explain (costs off) select distinct * from select_distinct_a; > + QUERY PLAN > +--- > + Seq Scan on select_distinct_a > +(1 row) > > From this example, it seems that the distinct operation can be dropped > because (a, b) is a primary key. Is my understanding correct? > Yes, you are correct. Actually I added them to the commit message, but it's true that I should have copied them into this email body as well, so I'm copying them now. [PATCH] Erase the distinctClause if the result is unique by definition. For a single relation, we can tell it if any one of the following is true: 1. The pk is in the target list. 2. The uk is in the target list and its columns are not null. 3. The columns in the group-by clause are also in the target list. For a relation join, we can tell it as follows: if every relation in the jointree yields a unique result set, then the final result is unique as well, regardless of the join method. For semi/anti joins, we ignore the right table. > I like the idea since it eliminates one expensive operation. > > However the patch as presented has some problems > 1. What happens if the primary key constraint or NOT NULL constraint gets > dropped between a prepare and execute? The plan will no more be valid and > thus execution may produce non-distinct results. > Will this still be an issue if the user doesn't use a "read uncommitted" isolation level? I suppose it should be OK for that case. But even so, I should add an isolation level check for this; I have just added that in the patch to continue the discussion of this issue. > PostgreSQL has similar concept of allowing non-grouping expression as part > of targetlist when those expressions can be proved to be functionally > dependent on the GROUP BY clause. See check_functional_grouping() and its > caller. I think, DISTINCT elimination should work on similar lines. > 2. For the same reason described in check_functional_grouping(), using > unique indexes for eliminating DISTINCT should be discouraged. > I checked the comments of check_functional_grouping; the reason is: * Currently we only check to see if the rel has a primary key that is a * subset of the grouping_columns. We could also use plain unique constraints * if all their columns are known not null, but there's a problem: we need * to be able to represent the not-null-ness as part of the constraints added * to *constraintDeps. FIXME whenever not-null constraints get represented * in pg_constraint. Actually I am doubtful about the pg_constraint reasoning, since we are still able to get the not-null information from relation->rd_attr->attrs[n].attnotnull, which is just what this patch does. > 3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY > as well > This is a good point. The rules may be somewhat different for joins, so I prefer to focus on the current one for now. > 4. The patch works only at the query level, but that functionality can be > expanded generally to other places which add Unique/HashAggregate/Group > nodes if the underlying relation can be proved to produce distinct rows. 
> But that's probably more work since we will have to label paths with unique > keys similar to pathkeys. > Do you mean adding some information into PlannerInfo, so that when we create a node for Unique/HashAggregate/Group, we can just create a dummy node? > 5. Have you tested this with OUTER joins, which can render the inner side nullable? > Yes, that part was missed in the test case. I just added them. On Thu, Feb 6, 2020 at 11:31 AM Andy Fan wrote: > >> I updated the patch to consider semi/anti joins. >> >> Can anyone help to review this patch? >> >> Thanks >> >> >> On Fri, Jan 31, 2020 at 8:39 PM Andy Fan >> wrote: >> >>> Hi: >>> >>> I wrote a patch to erase the distinctClause if the result is unique by >>> definition. I noticed this because a user switched this code from Oracle >>> to PG and found the performance was bad due to this, so I adapted PG for >>> this as well. >>> >>> This patch isn't needed for well-written SQL, but such drawbacks in >>> SQL may not be very obvious, and since the cost of the check is pretty >>> low, I think it would be OK to add it. >>> >>> Please see the patch for details. >>> >>> Thank you. >>> >> > > -- > -- > Best Wishes, > Ashutosh Bapat > 0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch Description: Binary data
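To make the three single-relation rules above concrete, here is a small hedged sketch in plain SQL (table names are made up and the snippets are not taken from the patch's regression tests); with the patch applied, one would inspect the resulting plans with EXPLAIN:

```
-- Case 1: the primary key is in the target list, so every row is already unique.
CREATE TABLE t1 (a int PRIMARY KEY, b int);
EXPLAIN (COSTS OFF) SELECT DISTINCT a, b FROM t1;

-- Case 2: a unique key is in the target list and its column is NOT NULL.
CREATE TABLE t2 (a int NOT NULL UNIQUE, b int);
EXPLAIN (COSTS OFF) SELECT DISTINCT a, b FROM t2;

-- Case 3: the GROUP BY columns are all in the target list,
-- so the grouped output is already distinct on (a, b).
CREATE TABLE t3 (a int, b int, c int);
EXPLAIN (COSTS OFF) SELECT DISTINCT a, b FROM t3 GROUP BY a, b;
```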
Re: Internal key management system
On Sat, 8 Feb 2020 at 03:24, Andres Freund wrote: > > Hi, > > On 2020-02-07 20:44:31 +0900, Masahiko Sawada wrote: > > Yeah I'm not going to use pgcrypto for transparent data encryption. > > The KMS patch includes the new basic infrastructure for cryptographic > > functions (mainly AES-CBC). I'm thinking we can expand that > > infrastructure so that we can also use it for TDE purpose by > > supporting new cryptographic functions such as AES-CTR. Anyway, I > > agree to not have it depend on pgcrypto. > > I thought for a minute, before checking the patch, that you were saying > above that the KMS patch includes its *own* implementation of > cryptographic functions. I think it's pretty crucial that it continues > not to do that... I meant that we're going to use OpenSSL for AES encryption and decryption independent of pgcrypto's openssl code, as the first step. That is, KMS is available only when configured --with-openssl. And hopefully we eventually merge this openssl code and have pgcrypto use it, like when we introduced SCRAM. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
POC: rational number type (fractions)
Hi hackers, attached is a proof of concept patch adding a new base type called "rational" to represent fractions. It includes arithmetic, simplification, conversion to/from float, finding intermediates with a stern-brocot tree, custom aggregates, and btree/hash indices. The primary motivation was as a column type to support user-defined ordering of rows (with the ability to dynamically rearrange rows). The postgres wiki has a page [0] about this using pairs of integers to represent fractions, but it's not particularly elegant. I wrote about options for implementing user-defined order in an article [1] and ended up creating a postgres extension, pg_rational [2], to provide the new type. People have been using the extension, but told me they wished they could use it on hosted platforms like Amazon RDS which have a limited set of whitelisted extensions. Thus I'm submitting this patch to discuss getting the feature in core postgres. For usage, see the included regression test. To see how it works for the user-defined order use case see my article. I haven't included docs in the patch since the interface may change with community feedback. 0: https://wiki.postgresql.org/wiki/User-specified_ordering_with_fractions 1: https://begriffs.com/posts/2018-03-20-user-defined-order.html 2: https://github.com/begriffs/pg_rational -- Joe Nelson https://begriffs.com diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile index 13efa9338c..10bdcff5dc 100644 --- a/src/backend/utils/adt/Makefile +++ b/src/backend/utils/adt/Makefile @@ -80,6 +80,7 @@ OBJS = \ rangetypes_selfuncs.o \ rangetypes_spgist.o \ rangetypes_typanalyze.o \ + rational.o \ regexp.o \ regproc.o \ ri_triggers.o \ diff --git a/src/backend/utils/adt/rational.c b/src/backend/utils/adt/rational.c new file mode 100644 index 00..ab91198da3 --- /dev/null +++ b/src/backend/utils/adt/rational.c @@ -0,0 +1,637 @@ +/*- + * + * rational.c + * Rational number data type for the Postgres database system + * + * Copyright (c) 1998-2020, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/backend/utils/adt/rational.c + * + *- + */ + +#include "postgres.h" +#include "fmgr.h" +#include "access/hash.h" +#include "common/int.h" /* portable overflow detection */ +#include "libpq/pqformat.h" /* send/recv functions */ +#include +#include + +PG_MODULE_MAGIC; + +typedef struct +{ + int32 numer; + int32 denom; +} Rational; + +static int32 gcd(int32, int32); +static bool simplify(Rational *); +static int32 cmp(Rational *, Rational *); +static void neg(Rational *); +static Rational * create(long long, long long); +static Rational * add(Rational *, Rational *); +static Rational * mul(Rational *, Rational *); +static void mediant(Rational *, Rational *, Rational *); + +/* + * IO ** + */ + +PG_FUNCTION_INFO_V1(rational_in); +PG_FUNCTION_INFO_V1(rational_in_float); +PG_FUNCTION_INFO_V1(rational_out); +PG_FUNCTION_INFO_V1(rational_out_float); +PG_FUNCTION_INFO_V1(rational_recv); +PG_FUNCTION_INFO_V1(rational_create); +PG_FUNCTION_INFO_V1(rational_in_integer); +PG_FUNCTION_INFO_V1(rational_send); + +Datum +rational_in(PG_FUNCTION_ARGS) +{ + char *s = PG_GETARG_CSTRING(0), + *after; + long long n, +d; + + if (!isdigit(*s) && *s != '-') + ereport(ERROR, +(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + errmsg("Missing or invalid numerator"))); + + n = strtoll(s, &after, 10); + + if (*after == '\0') + { + /* if just a number and no slash, interpret as an int */ + d = 1; + } + else + { + /* otherwise look for denominator */ + if (*after != '/') 
+ ereport(ERROR, + (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + errmsg("Expecting '/' after number but found '%c'", *after))); + if (*(++after) == '\0') + ereport(ERROR, + (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + errmsg("Expecting value after '/' but got '\\0'"))); + + d = strtoll(after, &after, 10); + if (*after != '\0') + ereport(ERROR, + (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + errmsg("Expecting '\\0' but found '%c'", *after))); + } + PG_RETURN_POINTER(create(n, d)); +} + +/* + This function taken from John Kennedy's paper, "Algorithm To Convert a + Decimal to a Fraction." Translated from Pascal. +*/ +Datum +rational_in_float(PG_FUNCTION_ARGS) +{ + float8 target = PG_GETARG_FLOAT8(0), +z, +fnumer, +fdenom, +error; + int32 prev_denom, +sign; + Rational *result = palloc(sizeof(Rational)); + + if (target == (int32) target) + { + result->numer = (int32) target; + result->denom = 1; + PG_RETURN_POINTER(result); + } + + sign = target < 0.0 ? -1 : 1; + target = fabs(target); + + if (!(target <= INT32_MAX)) + { /* also excludes NaN's */ + ereport(ERROR, +(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF
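Since the patch text above is cut off, here is a rough, hedged sketch of how the proposed type is meant to be used for user-defined row ordering, based only on the input syntax visible in rational_in (an integer, optionally followed by '/' and a denominator); the exact operators and casts depend on the final patch:

```
-- Illustrative only; assumes the in-core "rational" type from this proposal.
CREATE TABLE todo (
    task text,
    pos  rational      -- fraction used purely for ordering
);

INSERT INTO todo VALUES ('fix bug', '1/4'),
                        ('write docs', '1/2'),
                        ('release', '3/4');

-- Move 'release' between the first two without renumbering anything else.
UPDATE todo SET pos = '3/8' WHERE task = 'release';

SELECT task FROM todo ORDER BY pos;
```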
Re: Does recovery write to backup_label ?
On Sat, Feb 8, 2020 at 5:05 AM Chapman Flack wrote: > > On 2/7/20 2:55 PM, Andres Freund wrote: > > >> If the file needs to have 0600 permissions, should there be > >> a note in the nonexclusive-mode backup docs to say so? > > > > I'm not convinced that that's useful. The default is that everything > > needs to be writable by postgres. The exceptions should be noted if > > anything, not the default. +1 > Could this arguably be a special case, as most things in the datadir > are put there by postgres, but the backup_label is now to be put there > (and not even 'there' there, but added as a final step only to a > 'backup-copy-of-there' there) by the poor schmuck who reads the > non-exclusive backup docs as saying "retrieve this content from > pg_stop_backup() and preserve in a file named backup_label" and can't > think of any obvious reason to put write permission on a file > that preserves immutable history in a backup? I have no strong objection to add more note about permissions of the files that the users put in the data directory. If we really do that, it'd be better to note about not only backup_label but also other files like tablespace_map, recovery.signal, promotion trigger file, etc. Regards, -- Fujii Masao
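For readers less familiar with the non-exclusive API being discussed, the flow looks roughly like this (a sketch; the two-argument pg_stop_backup form matches PostgreSQL 10 through 12, and saving the returned label text into the backup copy is entirely the caller's job):

```
-- Run in a session that stays connected for the whole backup.
SELECT pg_start_backup('nightly', false, false);  -- label, fast=false, exclusive=false

-- ... copy the data directory or take filesystem snapshots here ...

-- Finish the backup and retrieve the label contents.  The caller must
-- store "labelfile" as backup_label (and "spcmapfile" as tablespace_map,
-- if non-empty) inside the backup copy, writable by the postgres user.
SELECT labelfile, spcmapfile FROM pg_stop_backup(false, true);
```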
Postgres 32 bits client compilation fail. Win32 bits client is supported?
Hi, I am migrating my applications that use the postgres client from MSVC 2010 (32 bits) to MSVC 2019 (32 bits). Compilation using MSVC 2019 (64 bits) works very well, but the build using MSVC 2019 (32 bits) is not working. The 32-bit Platform variable is set to x86, resulting in the first error: "Project "C:\dll\postgres\pgsql.sln" on node 1 (default targets). C:\dll\postgres\pgsql.sln.metaproj: error MSB4126: The specified solution configuration "Release|x86" is invalid. Please specify a valid solution configuration using the Configuration and Platform properties (e.g. MSBuild.exe Solution.sln /p:Configuration=Debug /p:Platform="Any CPU") or leave those properties blank to use the default solution configuration. [C:\dll\postgres\pgsql.sln] Done Building Project "C:\dll\postgres\pgsql.sln" (default targets) - FAILED." This is because the Sub DeterminePlatform function of the Solution.pm program uses the following expression: "my $output = `cl /? 2>&1`;" The result for MSVC 2019 (32 bits) is: "Microsoft (R) C/C++ Optimizing Compiler Version 19.24.28315 for x86" By setting the Platform variable manually to Win32, the compilation process continues until it stops at: "Generating configuration headers ..." where the second error occurs: unused defines: HAVE_STRUCT_CMSGCRED USE_NAMED_POSI ... etc ... ALIGNOF_DOUBLE USE_DEV_URANDOM at C:\dll\postgres\src\tools\msvc/Mkvcbuild.pm line 842. Question: will Postgres continue to support a 32-bit client? regards, Ranier Vilela
Re: Implementing Incremental View Maintenance
Hi. UNION query problem (server crash). When creating an INCREMENTAL MATERIALIZED VIEW, the server process crashes if you specify a query with a UNION. (commit id = 23151be7be8d8f8f9c35c2d0e4e5353aedf2b31e) Execute log: ``` [ec2-user@ip-10-0-1-10 ivm]$ psql testdb -e -f union_query_crash.sql DROP TABLE IF EXISTS table_x CASCADE; psql:union_query_crash.sql:6: NOTICE: drop cascades to view xy_union_v DROP TABLE DROP TABLE IF EXISTS table_y CASCADE; DROP TABLE CREATE TABLE table_x (id int, data numeric); CREATE TABLE CREATE TABLE table_y (id int, data numeric); CREATE TABLE INSERT INTO table_x VALUES (generate_series(1, 3), random()::numeric); INSERT 0 3 INSERT INTO table_y VALUES (generate_series(1, 3), random()::numeric); INSERT 0 3 SELECT * FROM table_x; id |data + 1 | 0.950724735058774 2 | 0.0222670808201144 3 | 0.391258547114841 (3 rows) SELECT * FROM table_y; id |data + 1 | 0.991717347778337 2 | 0.0528458947672874 3 | 0.965044982911163 (3 rows) CREATE VIEW xy_union_v AS SELECT 'table_x' AS name, * FROM table_x UNION SELECT 'table_y' AS name, * FROM table_y ; CREATE VIEW TABLE xy_union_v; name | id |data -++ table_y | 2 | 0.0528458947672874 table_x | 2 | 0.0222670808201144 table_y | 3 | 0.965044982911163 table_x | 1 | 0.950724735058774 table_x | 3 | 0.391258547114841 table_y | 1 | 0.991717347778337 (6 rows) CREATE INCREMENTAL MATERIALIZED VIEW xy_imv AS SELECT 'table_x' AS name, * FROM table_x UNION SELECT 'table_y' AS name, * FROM table_y ; psql:union_query_crash.sql:28: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. psql:union_query_crash.sql:28: fatal: connection to server was lost ``` On Thu, Dec 27, 2018 at 21:57, Yugo Nagata wrote: > Hi, > > I would like to implement Incremental View Maintenance (IVM) on > PostgreSQL. 
> IVM is a technique to maintain materialized views which computes and > applies > only the incremental changes to the materialized views rather than > recomputate the contents as the current REFRESH command does. > > I had a presentation on our PoC implementation of IVM at PGConf.eu 2018 > [1]. > Our implementation uses row OIDs to compute deltas for materialized > views. > The basic idea is that if we have information about which rows in base > tables > are contributing to generate a certain row in a matview then we can > identify > the affected rows when a base table is updated. This is based on an idea of > Dr. Masunaga [2] who is a member of our group and inspired from ID-based > approach[3]. > > In our implementation, the mapping of the row OIDs of the materialized view > and the base tables are stored in "OID map". When a base relation is > modified, > AFTER trigger is executed and the delta is recorded in delta tables using > the transition table feature. The accual udpate of the matview is triggerd > by REFRESH command with INCREMENTALLY option. > > However, we realize problems of our implementation. First, W
Re: Marking some contrib modules as trusted extensions
After looking more closely at these modules, I'm kind of inclined *not* to put the trusted marker on intagg. That module is just a backwards-compatibility wrapper around functionality that exists in the core code nowadays. So I think what we ought to be doing with it is deprecating and eventually removing it, not encouraging people to keep using it. Given that and the other discussion in this thread, I think the initial list of modules to trust is: btree_gin btree_gist citext cube dict_int earthdistance fuzzystrmatch hstore hstore_plperl intarray isn jsonb_plperl lo ltree pg_trgm pgcrypto seg tablefunc tcn tsm_system_rows tsm_system_time unaccent uuid-ossp So attached is a patch to do that. The code changes are trivial; just add "trusted = true" to each control file. We don't need to bump the module version numbers, since this doesn't change the contents of any extension, just who can install it. I do not think any regression test changes are needed either. (Note that commit 50fc694e4 already added a test that trusted extensions behave as expected, see src/pl/plperl/sql/plperl_setup.sql.) So it seems like the only thing that needs much discussion is the documentation changes. I adjusted contrib.sgml's discussion of how to install these modules in general, and then labeled the individual modules if they are trusted. regards, tom lane diff --git a/contrib/btree_gin/btree_gin.control b/contrib/btree_gin/btree_gin.control index d576da7..67d0c99 100644 --- a/contrib/btree_gin/btree_gin.control +++ b/contrib/btree_gin/btree_gin.control @@ -3,3 +3,4 @@ comment = 'support for indexing common datatypes in GIN' default_version = '1.3' module_pathname = '$libdir/btree_gin' relocatable = true +trusted = true diff --git a/contrib/btree_gist/btree_gist.control b/contrib/btree_gist/btree_gist.control index 81c8509..cd2d7eb 100644 --- a/contrib/btree_gist/btree_gist.control +++ b/contrib/btree_gist/btree_gist.control @@ -3,3 +3,4 @@ comment = 'support for indexing common datatypes in GiST' default_version = '1.5' module_pathname = '$libdir/btree_gist' relocatable = true +trusted = true diff --git a/contrib/citext/citext.control b/contrib/citext/citext.control index a872a3f..ccf4454 100644 --- a/contrib/citext/citext.control +++ b/contrib/citext/citext.control @@ -3,3 +3,4 @@ comment = 'data type for case-insensitive character strings' default_version = '1.6' module_pathname = '$libdir/citext' relocatable = true +trusted = true diff --git a/contrib/cube/cube.control b/contrib/cube/cube.control index f39a838..3e238fc 100644 --- a/contrib/cube/cube.control +++ b/contrib/cube/cube.control @@ -3,3 +3,4 @@ comment = 'data type for multidimensional cubes' default_version = '1.4' module_pathname = '$libdir/cube' relocatable = true +trusted = true diff --git a/contrib/dict_int/dict_int.control b/contrib/dict_int/dict_int.control index 6e2d2b3..ec04cce 100644 --- a/contrib/dict_int/dict_int.control +++ b/contrib/dict_int/dict_int.control @@ -3,3 +3,4 @@ comment = 'text search dictionary template for integers' default_version = '1.0' module_pathname = '$libdir/dict_int' relocatable = true +trusted = true diff --git a/contrib/earthdistance/earthdistance.control b/contrib/earthdistance/earthdistance.control index 5816d22..3df666d 100644 --- a/contrib/earthdistance/earthdistance.control +++ b/contrib/earthdistance/earthdistance.control @@ -3,4 +3,5 @@ comment = 'calculate great-circle distances on the surface of the Earth' default_version = '1.1' module_pathname = '$libdir/earthdistance' relocatable = true +trusted = 
true requires = 'cube' diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control index 6b2832a..3cd6660 100644 --- a/contrib/fuzzystrmatch/fuzzystrmatch.control +++ b/contrib/fuzzystrmatch/fuzzystrmatch.control @@ -3,3 +3,4 @@ comment = 'determine similarities and distance between strings' default_version = '1.1' module_pathname = '$libdir/fuzzystrmatch' relocatable = true +trusted = true diff --git a/contrib/hstore/hstore.control b/contrib/hstore/hstore.control index 93688cd..e0fbb8b 100644 --- a/contrib/hstore/hstore.control +++ b/contrib/hstore/hstore.control @@ -3,3 +3,4 @@ comment = 'data type for storing sets of (key, value) pairs' default_version = '1.6' module_pathname = '$libdir/hstore' relocatable = true +trusted = true diff --git a/contrib/hstore_plperl/hstore_plperl.control b/contrib/hstore_plperl/hstore_plperl.control index 16277f6..4b9fd13 100644 --- a/contrib/hstore_plperl/hstore_plperl.control +++ b/contrib/hstore_plperl/hstore_plperl.control @@ -3,4 +3,5 @@ comment = 'transform between hstore and plperl' default_version = '1.0' module_pathname = '$libdir/hstore_plperl' relocatable = true +trusted = true requires = 'hstore,plperl' diff --git a/contrib/intarray/intarray.control b/contrib/intarray/intarray.control index 7e50cc3..bf28804 100644 --- a/contrib/intarray/intarray.control +++ b/contrib/int
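For anyone reviewing the behavior rather than the control files, the practical effect of "trusted = true" is that a non-superuser with CREATE privilege on the current database can install the module without superuser involvement; a sketch (role and database names are made up):

```
-- As superuser: set up an ordinary database owner.
CREATE ROLE app_owner LOGIN;
CREATE DATABASE appdb OWNER app_owner;

-- As app_owner, connected to appdb: with hstore marked trusted,
-- this no longer requires superuser.
CREATE EXTENSION hstore;
SELECT 'a=>1, b=>2'::hstore -> 'b';
```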
Re: error context for vacuum to include block number
On Tue, Feb 04, 2020 at 01:58:20PM +0900, Masahiko Sawada wrote: > Here is the comment for v16 patch: > > 2. > I think we can set the error context for heap scan again after > freespace map vacuum finishing, maybe after reporting the new phase. > Otherwise the user will get confused if an error occurs during > freespace map vacuum. And I think the comment is unclear, how about > "Set the error context fro heap scan again"? Good point > 3. > + if (cbarg->blkno!=InvalidBlockNumber) > + errcontext(_("while scanning block %u of relation \"%s.%s\""), > + cbarg->blkno, cbarg->relnamespace, cbarg->relname); > > We can use BlockNumberIsValid macro instead. Thanks. See attached, now squished together patches. I added functions to initialize the callbacks, so error handling is out of the way and minimally distract from the rest of vacuum. >From 95265412c56f3b308eed16531d7c83243e278f4f Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 12 Dec 2019 20:54:37 -0600 Subject: [PATCH v17 1/2] vacuum errcontext to show block being processed As requested here. https://www.postgresql.org/message-id/20190807235154.erbmr4o4bo6vgnjv%40alap3.anarazel.de --- src/backend/access/heap/vacuumlazy.c | 117 +++ 1 file changed, 117 insertions(+) diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index a23cdef..9358ab4 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -292,6 +292,14 @@ typedef struct LVRelStats bool lock_waiter_detected; } LVRelStats; +typedef struct +{ + char *relnamespace; + char *relname; + char *indname; /* undefined while not processing index */ + BlockNumber blkno; /* undefined while not processing heap */ + int phase; /* Reusing same enums as for progress reporting */ +} vacuum_error_callback_arg; /* A few variables that don't seem worth passing around as parameters */ static int elevel = -1; @@ -361,6 +369,7 @@ static void end_parallel_vacuum(Relation *Irel, IndexBulkDeleteResult **stats, LVParallelState *lps, int nindexes); static LVSharedIndStats *get_indstats(LVShared *lvshared, int n); static bool skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared); +static void vacuum_error_callback(void *arg); /* @@ -724,6 +733,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, PROGRESS_VACUUM_MAX_DEAD_TUPLES }; int64 initprog_val[3]; + ErrorContextCallback errcallback; + vacuum_error_callback_arg errcbarg; pg_rusage_init(&ru0); @@ -870,6 +881,17 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, else skipping_blocks = false; + /* Setup error traceback support for ereport() */ + errcbarg.relnamespace = get_namespace_name(RelationGetNamespace(onerel)); + errcbarg.relname = relname; + errcbarg.blkno = InvalidBlockNumber; /* Not known yet */ + errcbarg.phase = PROGRESS_VACUUM_PHASE_SCAN_HEAP; + + errcallback.callback = vacuum_error_callback; + errcallback.arg = (void *) &errcbarg; + errcallback.previous = error_context_stack; + error_context_stack = &errcallback; + for (blkno = 0; blkno < nblocks; blkno++) { Buffer buf; @@ -891,6 +913,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, #define FORCE_CHECK_PAGE() \ (blkno == nblocks - 1 && should_attempt_truncation(params, vacrelstats)) + errcbarg.blkno = blkno; + pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno); if (blkno == next_unskippable_block) @@ -987,6 +1011,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, 
vmbuffer = InvalidBuffer; } + /* Pop the error context stack */ + error_context_stack = errcallback.previous; + /* Work on all the indexes, then the heap */ lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats, lps, nindexes); @@ -1011,6 +1038,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, /* Report that we are once again scanning the heap */ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, PROGRESS_VACUUM_PHASE_SCAN_HEAP); + + /* Set the error context while continuing heap scan */ + error_context_stack = &errcallback; } /* @@ -1597,6 +1627,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, RecordPageWithFreeSpace(onerel, blkno, freespace); } + /* Pop the error context stack */ + error_context_stack = errcallback.previous; + /* report that everything is scanned and vacuumed */ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno); @@ -1772,11 +1805,26 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats) int npages; PGRUsage ru0; Buffer vmbuffer = InvalidBuffer; + ErrorContextCallback errcallback; + vacuum_error_callback_arg errcbarg; /* Report that we are now vacu
Re: Adding a test for speculative insert abort case
Hi, I'm currently fighting with a race I'm observing in about 1/4 of the runs. I get: @@ -361,16 +361,18 @@ locktype mode granted speculative tokenShareLock f speculative tokenExclusiveLock t step controller_unlock_2_4: SELECT pg_advisory_unlock(2, 4); pg_advisory_unlock t s1: NOTICE: blurt_and_lock_123() called for k1 in session 1 s1: NOTICE: acquiring advisory lock on 2 +s1: NOTICE: blurt_and_lock_123() called for k1 in session 1 +s1: NOTICE: acquiring advisory lock on 2 step s1_upsert: <... completed> step s2_upsert: <... completed> step controller_show: SELECT * FROM upserttest; keydata k1 inserted s2 with conflict update s1 (this is the last permutation) The issue is basically that s1 goes through the check_exclusion_or_unique_constraint() conflict check twice. I added a bit of debugging information (using fprintf, with elog it was much harder to hit): Success: 2020-02-07 16:14:56.501 PST [1167003][5/1254:8465] CONTEXT: PL/pgSQL function blurt_and_lock_123(text) line 7 at RAISE 1167003: acquiring xact lock 2020-02-07 16:14:56.512 PST [1167001][3/0:0] DEBUG: bind to isolationtester_waiting 2020-02-07 16:14:56.522 PST [1167001][3/0:0] DEBUG: bind to isolationtester_waiting 1167002: releasing xact lock 2 3 1167004: acquired xact lock 1167004: xid 8466 acquiring 5 2020-02-07 16:14:56.523 PST [1167004][6/1014:8466] LOG: INSERT @ 0/9CC70F0: - Heap/INSERT: off 2 flags 0x0C 2020-02-07 16:14:56.523 PST [1167004][6/1014:8466] NOTICE: blurt_and_lock_123() called for k1 in session 2 2020-02-07 16:14:56.523 PST [1167004][6/1014:8466] CONTEXT: PL/pgSQL function blurt_and_lock_123(text) line 3 at RAISE 2020-02-07 16:14:56.523 PST [1167004][6/1014:8466] NOTICE: acquiring advisory lock on 2 2020-02-07 16:14:56.523 PST [1167004][6/1014:8466] CONTEXT: PL/pgSQL function blurt_and_lock_123(text) line 7 at RAISE 1167004: acquiring xact lock 2020-02-07 16:14:56.533 PST [1167001][3/0:0] DEBUG: bind to isolationtester_waiting 2020-02-07 16:14:56.544 PST [1167001][3/0:0] DEBUG: bind to isolationtester_waiting 2020-02-07 16:14:56.555 PST [1167001][3/0:0] DEBUG: bind to isolationtester_waiting 2020-02-07 16:14:56.565 PST [1167001][3/0:0] DEBUG: bind to isolationtester_waiting 2020-02-07 16:14:56.576 PST [1167001][3/0:0] DEBUG: bind to isolationtester_waiting 2020-02-07 16:14:56.587 PST [1167001][3/0:0] DEBUG: bind to isolationtester_waiting 1167002: releasing xact lock 2 2 1167004: acquired xact lock 2020-02-07 16:14:56.588 PST [1167004][6/1014:8466] LOG: INSERT @ 0/9CC7150: - Btree/NEWROOT: lev 0 2020-02-07 16:14:56.588 PST [1167004][6/1014:8466] LOG: INSERT @ 0/9CC7190: - Btree/INSERT_LEAF: off 1 2020-02-07 16:14:56.588 PST [1167004][6/1014:8466] NOTICE: blurt_and_lock_4() called for k1 in session 2 2020-02-07 16:14:56.588 PST [1167004][6/1014:8466] CONTEXT: PL/pgSQL function blurt_and_lock_4(text) line 3 at RAISE 2020-02-07 16:14:56.588 PST [1167004][6/1014:8466] NOTICE: acquiring advisory lock on 4 2020-02-07 16:14:56.588 PST [1167004][6/1014:8466] CONTEXT: PL/pgSQL function blurt_and_lock_4(text) line 4 at RAISE 1167004: acquiring xact lock 2020-02-07 16:14:56.598 PST [1167001][3/0:0] DEBUG: bind to isolationtester_waiting 2020-02-07 16:14:56.609 PST [1167001][3/0:0] DEBUG: bind to isolationtester_waiting 2020-02-07 16:14:56.620 PST [1167001][3/0:0] DEBUG: bind to isolationtester_waiting 2020-02-07 16:14:56.630 PST [1167001][3/0:0] DEBUG: bind to isolationtester_waiting 1167002: releasing xact lock 1 2 1167003: acquired xact lock 2020-02-07 16:14:56.631 PST [1167003][5/1254:8465] LOG: INSERT @ 0/9CC71D0: - 
Btree/INSERT_LEAF: off 1 2020-02-07 16:14:56.631 PST [1167003][5/1254:8465] NOTICE: blurt_and_lock_4() called for k1 in session 1 2020-02-07 16:14:56.631 PST [1167003][5/1254:8465] CONTEXT: PL/pgSQL function blurt_and_lock_4(text) line 3 at RAISE 2020-02-07 16:14:56.631 PST [1167003][5/1254:8465] NOTICE: acquiring advisory lock on 4 2020-02-07 16:14:56.631 PST [1167003][5/1254:8465] CONTEXT: PL/pgSQL function blurt_and_lock_4(text) line 4 at RAISE 1167003: acquiring xact lock 1167003: acquired xact lock 2020-02-07 16:14:56.632 PST [1167003][5/1254:8465] LOG: INSERT @ 0/9CC7230: - Btree/NEWROOT: lev 0 2020-02-07 16:14:56.632 PST [1167003][5/1254:8465] LOG: INSERT @ 0/9CC7270: - Btree/INSERT_LEAF: off 1 2020-02-07 16:14:56.632 PST [1167003][5/1254:8465] LOG: INSERT @ 0/9CC72A8: - Heap/DELETE: off 1 flags 0x08 1167003: xid 8465 releasing lock 5 1167003: retry due to conflict 2020-02-07 16:14:56.632 PST [1167003][5/1254:8465] NOTICE: blurt_and_lock_123() called for k1 in session 1 2020-02-07 16:14:56.632 PST [1167003][5/1254:8465] CONTEXT: PL/pgSQL function blurt_and_lock_123(text) line 3 at RAISE 2020-02-07 16:14:56.632 PST [1167003][5/1254:8465] NOTICE: acquiring advisory lock on 2 2020-02-07 16:14:56.632 PST [1167003][5/1254:8465] CONTEXT: PL/pgSQL function blurt_a
Re: Draft release notes are up for review
Daniel Gustafsson writes: > My name is misspelled in master 7d0bcb047, s/Gustaffson/Gustafsson/. Nothing else > stood out from skimming the diff. Ah, looks like I copied and pasted that from the commit log. Sorry, will fix in next draft. regards, tom lane
Re: Dumping/restoring fails on inherited generated column
Peter Eisentraut writes: > On 2020-02-03 20:32, Tom Lane wrote: >> This is showing us at least two distinct problems. Now as for >> "gtest30_1", what we have is that in the parent table "gtest30", column b >> exists but it has no default; the generated property is only added >> at the child table gtest30_1. So we need to emit ALTER COLUMN SET >> GENERATED ALWAYS for gtest30_1.b. HEAD is already doing the wrong >> thing there (it's emitting the expression, but as a plain default >> not GENERATED). And this patch makes it emit nothing, even worse. >> I think the key point here is that "attislocal" refers to whether the >> column itself is locally defined, not to whether its default is. > This is a bit of a mess. Let me explain my thinking on generated > columns versus inheritance. > If a parent table has a generated column, then any inherited column must > also be generated and use the same expression. (Otherwise querying the > parent table would produce results that are inconsistent with the > generation expression if the rows come from the child table.) Check. > If a parent table has a column that is not generated, then I think it > would be semantically sound if a child table had that same column but > generated. However, I think it would be very tricky to support this > correctly, and it doesn't seem useful enough, so I'd rather not do it. So ... why is that so hard exactly? AFAICS, the existing regression test cases show that it works fine. Except that pg_dump gets it wrong. In general, we surely want to support child columns that have defaults different from the parent column's default, so this doesn't seem quite that huge a leap to me. > That's what the gtest30_1 case above shows. It's not even clear whether > it's possible to dump this correctly in all cases because the syntax > that you allude to "turn this existing column into a generated column" > does not exist. I'm a little confused by that statement. What is this doing, if not that: regression=# create table foo (f1 int not null); CREATE TABLE regression=# alter table foo alter column f1 add generated always as identity; ALTER TABLE regression=# \d foo Table "public.foo" Column | Type | Collation | Nullable | Default +-+---+--+-- f1 | integer | | not null | generated always as identity If we didn't have things like ALTER ... SET GENERATED and ALTER ... DROP EXPRESSION, I'd be a lot more content to accept the position that generated-ness is an immutable column property. But we do have those things, so the restriction you're proposing seems mighty arbitrary. regards, tom lane
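To ground the gtest30_1 case being argued about, here is roughly how a child-only generated column can arise with the DROP EXPRESSION support in the development tree at the time; this is a sketch from memory, not copied verbatim from the regression tests, and the generation expression is illustrative:

```
CREATE TABLE gtest30 (
    a int,
    b int GENERATED ALWAYS AS (a * 2) STORED
);
CREATE TABLE gtest30_1 () INHERITS (gtest30);

-- Drop the expression on the parent only: gtest30.b becomes a plain
-- column with no default, while gtest30_1.b remains generated.
ALTER TABLE ONLY gtest30 ALTER COLUMN b DROP EXPRESSION;
```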
Re: Draft release notes are up for review
> On 7 Feb 2020, at 22:56, Tom Lane wrote: > Please send comments before Sunday. My name is misspelled in master 7d0bcb047, s/Gustaffson/Gustafsson/. Nothing else stood out from skimming the diff. cheers ./daniel
Draft release notes are up for review
See https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=388d4351f78dfa6082922074127d496e6f525033 (Note: these cover through 9710d3d4a, but not the additional partitioning fix I see Alvaro just pushed.) Please send comments before Sunday. regards, tom lane
Re: improve transparency of bitmap-only heap scans
On Fri, Feb 07, 2020 at 03:22:12PM +, Alexey Bashtanov wrote: Hello, It took me a while to figure out what those names mean. "unfetched", as you call it in the code, may be more descriptive than "avoided" for the new label. However I think the other two are more confusing. It may be a good idea to change them together with this. It'll be sad if this patch is forgotten only because of the choice of words. I've changed it all to "unfetched", so as at least not to call the same thing differently in the code and in the output, and also rebased it and made it fit in the 80-column width limit. I kinda suspect one of the reasons why this got so little attention is that it was never added to any CF. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
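For anyone wanting to see the counter this patch renames, here is a sketch of a query that can exercise the bitmap-only optimization; whether the planner actually chooses a bitmap heap scan, and how many heap pages end up unfetched, depends on the data, settings, and visibility map state:

```
CREATE TABLE bitmap_demo AS
    SELECT g AS id, g % 100 AS grp FROM generate_series(1, 100000) g;
CREATE INDEX ON bitmap_demo (grp);
VACUUM (ANALYZE) bitmap_demo;      -- sets visibility-map bits

SET enable_indexscan = off;        -- nudge the planner away from index scans
SET enable_indexonlyscan = off;
EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM bitmap_demo WHERE grp = 42;
```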
Re: Reducing WaitEventSet syscall churn
On Sat, Feb 8, 2020 at 10:00 AM Thomas Munro wrote: > On Tue, Jan 21, 2020 at 1:45 PM Thomas Munro wrote: > > Here are some patches to get rid of frequent system calls. > > Here is one more case that I was sitting on because I wasn't sure how > to do it: walreceiver.c. To make that work, libpq needs to be able to > tell you when the socket has changed, which I did with a counter that > is exposed to client code in patch 0004. The walreceiver change in > 0005 works (trace the system calls on walreceiver to see the > difference), but perhaps we can come up with a better way to code it > so that eg logical/worker.c doesn't finish up duplicating the logic. > Thoughts? (To be clear: I know the 0005 patch doesn't clean up after itself in various cases, it's for discussion only to see if others have ideas about how to structure things to suit various potential users of libpqwalreceiver.so.)
Re: pg_basebackup and snapshots
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2020-02-07 14:56:47 -0500, Stephen Frost wrote: > > * Andres Freund (and...@anarazel.de) wrote: > > > Maybe that's looking too far into the future, but I'd like to see > > > improvements to pg_basebackup that make it integrate with root requiring > > > tooling, to do more efficient base backups. E.g. having pg_basebackup > > > handle start/stop backup and WAL handling, but do the actual backup of > > > the data via a snapshot mechanism (yes, one needs start/stop backup in > > > the general case, for multiple FSs), would be nice. > > > > The challenge with this approach is that you need to drop the 'backup > > label' file into place as part of this operation, either by putting it > > into the snapshot after it's been taken, or by putting it into the data > > directory at restore time. Of course, you have to keep track of WAL > > anyway from the time the snapshots are taken until the restore is done, > > so it's certainly possible, as with all of this, it's just somewhat > > complicated. > > It's not dead trivial, but also doesn't seem *that* hard to me compared > to the other challenges of adding features like this? How to best > approach it I think depends somewhat on what exact type of backup > (mainly whether to set up a new system or to make a PITR base backup) > we'd want to focus on. And what kind of snapshotting system / what kind > of target data store. I'm also not sure that pg_basebackup is the right tool for this though, really, given the complications and how it's somewhat beyond what pg_basebackup's mandate is. This isn't something you'd like do remotely, for example, due to the need to take the snapshot, mount the snapshot, etc. I don't see this as really in line with "just another option to -F", there'd be a fair bit of configuring, it seems, and a good deal of what pg_basebackup would really be doing with this feature is just running bits of code the user has given us, except for the actual calls to PG to do start/stop backup. > Plenty of snapshotting systems allow write access to the snapshot once > it finished, so that's one way one can deal with that. I have a hard > time believing that it'd be hard to have pg_basebackup delay writing the > backup label in that case. The WAL part would probably be harder, since > there we want to start writing before the snapshot is done. And copying > all the WAL at the end isn't enticing either. pg_basebackup already delays writing out the backup label until the end. But, yes, there's also timing issues to deal with, which are complicated because there isn't just a syscall we can use to say "take a snapshot for us" or to say "mount this snapshot over here" (at least, not in any kind of portable way, even in places where such things do exist). Maybe we could have shell commands that a user provides for "take a snapshot" and "mount this snapshot", but putting all of that on the user has its own drawbacks (more on that below..). > For the PITR base backup case it'd definitely be nice to support writing > (potentially with callbacks instead of implementing all of them in core) > into $cloud_provider's blob store, without having to transfer all data > first through a replication connection and then again to the blob store > (and without manually implementing non-exclusive base backup). Adding > WAL after the fact to the same blob really a thing for anything like > that (obviously - even if one can hack it by storing tars etc). We seem to be mixing things now.. 
You've moved into talking about 'blob stores' which are rather different from snapshots, no? I certainly agree with the general idea of supporting blob stores (pgbackrest has supported s3 for quite some time, with a nicely pluggable architecture that we'll be using to write drivers for other blob storage, all in very well tested C99 code, and it's all done directly, if you want, without going over the network in some other way first..). I don't really care for the idea of using callbacks for this, at least if what you mean by "callback" is "something like archive_command". There's a lot of potential failure cases and issues, writing to most s3 stores requires retries, and getting it all to work right when you're going through a shell to run some other command to actually get the data across safely and durably is, ultimately, a bit of a mess. I feel like we should be learning from the mess that is archive_command and avoiding anything like that if at all possible when it comes to moving data around that needs to be confirmed durably written. Making users have to piece together the bits to make it work just isn't a good idea either (see, again, archive command, and our own documentation for why that's a bad idea...). > Wonder if the the WAL part in particular would actually be best solved > by having recovery probe more than one WAL directory when looking for > WAL segments (i.e. doing so before switching methods). Much fast
Re: Reducing WaitEventSet syscall churn
On Tue, Jan 21, 2020 at 1:45 PM Thomas Munro wrote: > Here are some patches to get rid of frequent system calls. Here is one more case that I was sitting on because I wasn't sure how to do it: walreceiver.c. To make that work, libpq needs to be able to tell you when the socket has changed, which I did with a counter that is exposed to client code in patch 0004. The walreceiver change in 0005 works (trace the system calls on walreceiver to see the difference), but perhaps we can come up with a better way to code it so that eg logical/worker.c doesn't finish up duplicating the logic. Thoughts? 0001-Add-WaitMyLatch-to-replace-many-WaitLatch-calls.patch Description: Binary data 0002-Use-WaitMyLatch-for-condition-variables.patch Description: Binary data 0003-Introduce-a-reusable-WaitEventSet-for-the-stats-coll.patch Description: Binary data 0004-libpq-Add-PQsocketChangeCount-to-advertise-socket-ch.patch Description: Binary data 0005-Reuse-a-WaitEventSet-in-walreceiver.patch Description: Binary data
Re: pg_basebackup and snapshots
Hi, On 2020-02-07 14:56:47 -0500, Stephen Frost wrote: > * Andres Freund (and...@anarazel.de) wrote: > > Maybe that's looking too far into the future, but I'd like to see > > improvements to pg_basebackup that make it integrate with root requiring > > tooling, to do more efficient base backups. E.g. having pg_basebackup > > handle start/stop backup and WAL handling, but do the actual backup of > > the data via a snapshot mechanism (yes, one needs start/stop backup in > > the general case, for multiple FSs), would be nice. > > The challenge with this approach is that you need to drop the 'backup > label' file into place as part of this operation, either by putting it > into the snapshot after it's been taken, or by putting it into the data > directory at restore time. Of course, you have to keep track of WAL > anyway from the time the snapshots are taken until the restore is done, > so it's certainly possible, as with all of this, it's just somewhat > complicated. It's not dead trivial, but also doesn't seem *that* hard to me compared to the other challenges of adding features like this? How to best approach it I think depends somewhat on what exact type of backup (mainly whether to set up a new system or to make a PITR base backup) we'd want to focus on. And what kind of snapshotting system / what kind of target data store. Plenty of snapshotting systems allow write access to the snapshot once it finished, so that's one way one can deal with that. I have a hard time believing that it'd be hard to have pg_basebackup delay writing the backup label in that case. The WAL part would probably be harder, since there we want to start writing before the snapshot is done. And copying all the WAL at the end isn't enticing either. For the PITR base backup case it'd definitely be nice to support writing (potentially with callbacks instead of implementing all of them in core) into $cloud_provider's blob store, without having to transfer all data first through a replication connection and then again to the blob store (and without manually implementing non-exclusive base backup). Adding WAL after the fact to the same blob really a thing for anything like that (obviously - even if one can hack it by storing tars etc). Wonder if the the WAL part in particular would actually be best solved by having recovery probe more than one WAL directory when looking for WAL segments (i.e. doing so before switching methods). Much faster than using restore_command, and what one really wants in a pretty decent number of cases. And it'd allow to just restore the base backup (e.g. mount [copy of] the snapshot) and the received WAL stream separately, without needing more complicated orchestration. Perhaps I am also answering something completely besides what you were wondering about? Greetings, Andres Freund
Re: Does recovery write to backup_label ?
On 2/7/20 2:55 PM, Andres Freund wrote: >> If the file needs to have 0600 permissions, should there be >> a note in the nonexclusive-mode backup docs to say so? > > I'm not convinced that that's useful. The default is that everything > needs to be writable by postgres. The exceptions should be noted if > anything, not the default. Could this arguably be a special case, as most things in the datadir are put there by postgres, but the backup_label is now to be put there (and not even 'there' there, but added as a final step only to a 'backup-copy-of-there' there) by the poor schmuck who reads the non-exclusive backup docs as saying "retrieve this content from pg_stop_backup() and preserve in a file named backup_label" and can't think of any obvious reason to put write permission on a file that preserves immutable history in a backup? Regards, -Chap
pg_basebackup and snapshots
Greetings, (Moving to -hackers, changing thread title) * Andres Freund (and...@anarazel.de) wrote: > Maybe that's looking too far into the future, but I'd like to see > improvements to pg_basebackup that make it integrate with root requiring > tooling, to do more efficient base backups. E.g. having pg_basebackup > handle start/stop backup and WAL handling, but do the actual backup of > the data via a snapshot mechanism (yes, one needs start/stop backup in > the general case, for multiple FSs), would be nice. The challenge with this approach is that you need to drop the 'backup label' file into place as part of this operation, either by putting it into the snapshot after it's been taken, or by putting it into the data directory at restore time. Of course, you have to keep track of WAL anyway from the time the snapshots are taken until the restore is done, so it's certainly possible, as with all of this, it's just somewhat complicated. Certainly open to ideas on how to improve this. Thanks, Stephen signature.asc Description: PGP signature
Re: Does recovery write to backup_label ?
Hi, On 2020-02-07 11:08:48 -0500, Chapman Flack wrote: > Just saw this in a PG 11.6 cluster starting a recovery: > > 2020-02-07 10:45:40 EST FATAL: 42501: could not open file > "backup_label": Permission denied > 2020-02-07 10:45:40 EST LOCATION: fsync_fname_ext, fd.c:3531 Well, we generally assume that files in the data directory are writable, with a very few exceptions. And we do need to be able rename backup_label to backup_label.old. Which strictly speaking doesn't require write permissions on the file - but I assume that's what triggers the issue here. There's IIRC platforms that don't like fsyncing files with a O_RDONLY fd, so if we want to rename safely (which entails fsyncing beforehand), we don't have much choice. > I had assumed the label file would be treated as readonly > during recovery. It is IIRC documented that it does get renamed... > If the file needs to have 0600 permissions, should there be > a note in the nonexclusive-mode backup docs to say so? I'm not convinced that that's useful. The default is that everything needs to be writable by postgres. The exceptions should be noted if anything, not the default. Greetings, Andres Freund
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
Hi, On 2020-02-07 20:02:01 +0100, Tomas Vondra wrote: > On Fri, Feb 07, 2020 at 10:33:48AM -0800, Andres Freund wrote: > > Hi, > > > > On 2020-02-04 10:15:01 +0530, Kuntal Ghosh wrote: > > > And, the issue got reproduced with the same error: > > > WARNING: problem in Generation Tuples: number of free chunks 0 in > > > block 0x7fe9e9e74010 exceeds 1018 allocated > > > > That seems like a problem in generation.c - because this should be > > unreachable, I think? > That's rather strange. How could we print this message? The code looks > like this > > if (block->nfree >= block->nchunks) > elog(WARNING, "problem in Generation %s: number of free chunks %d in > block %p exceeds %d allocated", > name, block->nfree, block, block->nchunks); > > so this says 0 >= 1018. Or am I missing something? Indeed, it's pretty weird. I can't reproduce it either. Kuntal, which exact git version did you repro this on? What precise settings / platform? Greetings, Andres Freund
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
Hi, On 2020-01-11 23:20:56 -0500, Tom Lane wrote: > Tomas Vondra writes: > Nah, don't think I believe that: the test inserts a bunch of tuples, > but they look like they will all be *exactly* the same size. > > CREATE TABLE decoding_test(x integer, y text); > ... > > FOR i IN 1..10 LOOP > BEGIN > INSERT INTO decoding_test(x) SELECT generate_series(1,5000); > EXCEPTION > when division_by_zero then perform 'dummy'; > END; I think the issue this triggers higher memory usage in in older versions is that before commit cec2edfa7859279f36d2374770ca920c59c73dd8 Author: Amit Kapila Date: 2019-11-16 17:49:33 +0530 Add logical_decoding_work_mem to limit ReorderBuffer memory usage. we enforced how many changes to keep in memory (vs on disk) /* * Maximum number of changes kept in memory, per transaction. After that, * changes are spooled to disk. * * The current value should be sufficient to decode the entire transaction * without hitting disk in OLTP workloads, while starting to spool to disk in * other workloads reasonably fast. * * At some point in the future it probably makes sense to have a more elaborate * resource management here, but it's not entirely clear what that would look * like. */ static const Size max_changes_in_memory = 4096; on a per-transaction basis. And that subtransactions are *different* transactions for that purpose (as they can be rolled back separately). As the test generates loads of records for different subtransactions, they each end up having quite a few changes (including the tuples pointed to!) in memory at the same time. Due to the way the limit of 4096 interacts with the 5000 rows inserted above, we only hit the out of memory error when loading. That's because when decoding (before the commit has been seen), we spill after 4096 changes: 2020-02-07 11:18:22.399 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 585 to disk 2020-02-07 11:18:22.419 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 586 to disk 2020-02-07 11:18:22.431 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 587 to disk 2020-02-07 11:18:22.443 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 588 to disk 2020-02-07 11:18:22.454 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 589 to disk 2020-02-07 11:18:22.465 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 590 to disk 2020-02-07 11:18:22.477 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 591 to disk 2020-02-07 11:18:22.488 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 592 to disk 2020-02-07 11:18:22.499 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 593 to disk 2020-02-07 11:18:22.511 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 594 to disk so there's each 5000 - 4096 changes in memory, times 10. But when actually calling the output plugin (at the commit record), we start with loading changes back into memory from the start of each subtransaction. 
That first entails spilling the tail of that transaction to disk, and then loading the start: 2020-02-07 11:18:22.515 PST [1136134][3/2] DEBUG: StartSubTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0 2020-02-07 11:18:22.515 PST [1136134][3/2] DEBUG: StartSubTransaction(2) name: replay; blockState: SUB BEGIN; state: INPROGR, xid/subid/cid: 0/2/0 2020-02-07 11:18:22.515 PST [1136134][3/2] DEBUG: spill 904 changes in XID 585 to disk 2020-02-07 11:18:22.524 PST [1136134][3/2] DEBUG: restored 4096 changes in XID 585 into memory 2020-02-07 11:18:22.524 PST [1136134][3/2] DEBUG: spill 904 changes in XID 586 to disk 2020-02-07 11:18:22.534 PST [1136134][3/2] DEBUG: restored 4096 changes in XID 586 into memory 2020-02-07 11:18:22.534 PST [1136134][3/2] DEBUG: spill 904 changes in XID 587 to disk 2020-02-07 11:18:22.544 PST [1136134][3/2] DEBUG: restored 4096 changes in XID 587 into memory 2020-02-07 11:18:22.544 PST [1136134][3/2] DEBUG: spill 904 changes in XID 588 to disk 2020-02-07 11:18:22.554 PST [1136134][3/2] DEBUG: restored 4096 changes in XID 588 into memory 2020-02-07 11:18:22.554 PST [1136134][3/2] DEBUG: spill 904 changes in XID 589 to disk TopMemoryContext: 161440 total in 7 blocks; 80240 free (68 chunks); 81200 used ... Because each transaction has 4096 changes in memory, we actually need more memory here than we did during the decoding phase, where all but the "current" subtransaction only have 5000 - 4096 changes in memory. If we instead change the test to insert 4096*2 - 1 tuples each, we run out of memory earlier: 2020-02-07 11:23:20.540 PST [1136134][3/12] DEBUG: spill 4096 changes in XID 610 to disk 2020-02-07 11:23:20.565 PST [1136134][3/12] DEBUG: spill 4096 changes in XID 611 to disk 2020-02-07 11:23:20.587 PST [1136134][3/12] DEBUG: spill 4096 changes in XID 612 to disk 2020-02-07 11:23:20.608 PST [1136134][3/12] DEBUG: spill 4096 changes in XID 613 to disk 2020-02-07 11:23:20.630 PST [1136134][3/
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Fri, Feb 07, 2020 at 10:33:48AM -0800, Andres Freund wrote: Hi, On 2020-02-04 10:15:01 +0530, Kuntal Ghosh wrote: And, the issue got reproduced with the same error: WARNING: problem in Generation Tuples: number of free chunks 0 in block 0x7fe9e9e74010 exceeds 1018 allocated That seems like a problem in generation.c - because this should be unreachable, I think? Tomas? That's rather strange. How could we print this message? The code looks like this if (block->nfree >= block->nchunks) elog(WARNING, "problem in Generation %s: number of free chunks %d in block %p exceeds %d allocated", name, block->nfree, block, block->nchunks); so this says 0 >= 1018. Or am I missing something? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Thu, 6 Feb 2020 at 09:44, Amit Kapila wrote: > > On Thu, Feb 6, 2020 at 1:57 AM Mahendra Singh Thalor wrote: > > > > On Wed, 5 Feb 2020 at 12:07, Masahiko Sawada wrote: > > > > > > On Mon, Feb 3, 2020 at 8:03 PM Amit Kapila wrote: > > > > > > > > On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada < sawada.m...@gmail.com> wrote: > > > > > > > > > > On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas < robertmh...@gmail.com> wrote: > > > > > > On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund < and...@anarazel.de> wrote: > > > > > >>> I think the real question is whether the scenario is common enough to > > > > > >>> worry about. In practice, you'd have to be extremely unlucky to be > > > > > >>> doing many bulk loads at the same time that all happened to hash to > > > > > >>> the same bucket. > > > > > >> > > > > > >> With a bunch of parallel bulkloads into partitioned tables that really > > > > > >> doesn't seem that unlikely? > > > > > > > > > > > > It increases the likelihood of collisions, but probably decreases the > > > > > > number of cases where the contention gets really bad. > > > > > > > > > > > > For example, suppose each table has 100 partitions and you are > > > > > > bulk-loading 10 of them at a time. It's virtually certain that you > > > > > > will have some collisions, but the amount of contention within each > > > > > > bucket will remain fairly low because each backend spends only 1% of > > > > > > its time in the bucket corresponding to any given partition. > > > > > > > > > > > > > > > > I share another result of performance evaluation between current HEAD > > > > > and current HEAD with v13 patch(N_RELEXTLOCK_ENTS = 1024). > > > > > > > > > > Type of table: normal table, unlogged table > > > > > Number of child tables : 16, 64 (all tables are located on the same tablespace) > > > > > Number of clients : 32 > > > > > Number of trials : 100 > > > > > Duration: 180 seconds for each trials > > > > > > > > > > The hardware spec of server is Intel Xeon 2.4GHz (HT 160cores), 256GB > > > > > RAM, NVMe SSD 1.5TB. > > > > > Each clients load 10kB random data across all partitioned tables. > > > > > > > > > > Here is the result. > > > > > > > > > > childs | type | target | avg_tps | diff with HEAD > > > > > +--+-++-- > > > > > 16 | normal | HEAD| 1643.833 | > > > > > 16 | normal | Patched | 1619.5404 | 0.985222 > > > > > 16 | unlogged | HEAD| 9069.3543 | > > > > > 16 | unlogged | Patched | 9368.0263 | 1.032932 > > > > > 64 | normal | HEAD| 1598.698 | > > > > > 64 | normal | Patched | 1587.5906 | 0.993052 > > > > > 64 | unlogged | HEAD| 9629.7315 | > > > > > 64 | unlogged | Patched | 10208.2196 | 1.060073 > > > > > (8 rows) > > > > > > > > > > For normal tables, loading tps decreased 1% ~ 2% with this patch > > > > > whereas it increased 3% ~ 6% for unlogged tables. There were > > > > > collisions at 0 ~ 5 relation extension lock slots between 2 relations > > > > > in the 64 child tables case but it didn't seem to affect the tps. > > > > > > > > > > > > > AFAIU, this resembles the workload that Andres was worried about. I > > > > think we should once run this test in a different environment, but > > > > considering this to be correct and repeatable, where do we go with > > > > this patch especially when we know it improves many workloads [1] as > > > > well. We know that on a pathological case constructed by Mithun [2], > > > > this causes regression as well. 
I am not sure if the test done by > > > > Mithun really mimics any real-world workload as he has tested by > > > > making N_RELEXTLOCK_ENTS = 1 to hit the worst case. > > > > > > > > Sawada-San, if you have a script or data for the test done by you, > > > > then please share it so that others can also try to reproduce it. > > > > > > Unfortunately the environment I used for performance verification is > > > no longer available. > > > > > > I agree to run this test in a different environment. I've attached the > > > rebased version patch. I'm measuring the performance with/without > > > patch, so will share the results. > > > > > > > Thanks Sawada-san for patch. > > > > From last few days, I was reading this thread and was reviewing v13 patch. To debug and test, I did re-base of v13 patch. I compared my re-based patch and v14 patch. I think, ordering of header files is not alphabetically in v14 patch. (I haven't reviewed v14 patch fully because before review, I wanted to test false sharing). While debugging, I didn't noticed any hang or lock related issue. > > > > I did some testing to test false sharing(bulk insert, COPY data, bulk insert into partitions tables). Below is the testing summary. > > > > Test setup(Bulk insert into partition tables): > > autovacuum=off > > shared_buffers=512MB -c max_wal_size=20GB -c checkpoint_timeout=12min > > > > Basically, I created a table with 13 partit
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
Hi,

On 2020-02-04 10:15:01 +0530, Kuntal Ghosh wrote:
> I performed the same test in pg11 and reproduced the issue on the
> commit prior to a4ccc1cef5a04 (Generational memory allocator).
>
> ulimit -s 1024
> ulimit -v 30
>
> wal_level = logical
> max_replication_slots = 4
>
> [...]
> After that, I applied the "Generational memory allocator" patch and
> that solved the issue. From the error message, it is evident that the
> underlying code is trying to allocate a MaxTupleSize memory for each
> tuple. So, I re-introduced the following lines (which are removed by
> a4ccc1cef5a04) on top of the patch:
>
> --- a/src/backend/replication/logical/reorderbuffer.c
> +++ b/src/backend/replication/logical/reorderbuffer.c
> @@ -417,6 +417,9 @@ ReorderBufferGetTupleBuf(ReorderBuffer *rb, Size
> tuple_len)
>
>  alloc_len = tuple_len + SizeofHeapTupleHeader;
>
> + if (alloc_len < MaxHeapTupleSize)
> + alloc_len = MaxHeapTupleSize;

Maybe I'm being slow here - but what does this actually prove?

Before the generation contexts were introduced we avoided fragmentation
(which would make things unusably slow) using a brute force method (namely
forcing all tuple allocations to be of the same/maximum size). Which means
that yes, we'll need more memory than necessary. Do you think you see
anything but that here?

It's good that the situation is better now, but I don't think this means we
need to necessarily backpatch something nontrivial?

Greetings,

Andres Freund
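For readers following along, the "brute force" scheme being referred to looks roughly like this (a simplified sketch, not the actual pre-a4ccc1cef5a04 reorderbuffer.c code; MAX_TUPLE_ALLOC stands in for MaxHeapTupleSize):

#include <stdlib.h>

#define MAX_TUPLE_ALLOC 8192        /* stand-in for MaxHeapTupleSize */

/*
 * Round every tuple allocation up to one fixed size.  Because all chunks
 * are interchangeable, any freed chunk can be reused for any later tuple,
 * so the memory context never fragments -- at the cost of small tuples
 * wasting most of their chunk, which is the extra memory use mentioned
 * above.
 */
static void *
alloc_tuple_buf(size_t tuple_len)
{
    size_t  alloc_len = tuple_len;

    if (alloc_len < MAX_TUPLE_ALLOC)
        alloc_len = MAX_TUPLE_ALLOC;

    return malloc(alloc_len);
}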
Re: [Proposal] Global temporary tables
pá 7. 2. 2020 v 18:28 odesílatel Konstantin Knizhnik < k.knizh...@postgrespro.ru> napsal: > > > On 07.02.2020 18:15, Robert Haas wrote: > > On Wed, Feb 5, 2020 at 10:48 AM Konstantin Knizhnik > > wrote: > > My answer is - yes. > >> Just because: > >> - Such behavior is compatible with regular tables. So it will not > >> confuse users and doesn't require some complex explanations. > >> - It is compatible with Oracle. > >> - It is what DBA usually want when creating index. > >> - > >> There are several arguments against such behavior: > >> - Concurrent building of index in multiple sessions can consume a lot of > >> memory > >> - Building index can increase query execution time (which can be not > >> expected by clients) > > I think those are good arguments, especially the second one. There's > > no limit on how long building a new index might take, and it could be > > several minutes. A user who was running a query that could have > > completed in a few seconds or even milliseconds will be unhappy to > > suddenly wait a long time for a new index to be built. And that is an > > entirely realistic scenario, because the new index might be better, > > but only marginally. > Yes, I agree that this arguments are important. > But IMHO less important than incompatible behavior (Pavel doesn't agree > with word "incompatible" in this context > since semantic of temp tables is in any case different with semantic of > regular tables). > > Just want to notice that if we have huge GTT (so that creation of index > takes significant amount of time) > sequential scan of this table also will not be fast. > > But in any case, if we agree that we can control thus behavior using GUC > or index property, > then it is ok for me. > > > > > > > Also, an important point to which I've already alluded a few times is > > that creating an index can fail. Now, one way it can fail is that > > there could be some problem writing to disk, or you could run out of > > memory, or whatever. However, it can also fail because the new index > > is UNIQUE and the data this backend has in the table doesn't conform > > to the associated constraint. It will be confusing if all access to a > > table suddenly starts complaining about uniqueness violations. > > Yes, building index can fail (as any other operation with database). > What's wring with it? > If it is fatal error, then backend is terminated and content of its temp > table is disappeared. > If it is non-fatal error, then current transaction is aborted: > > > Session1: > postgres=# create global temp table gtt(x integer); > CREATE TABLE > postgres=# insert into gtt values (generate_series(1,10)); > INSERT 0 10 > > Session2: > postgres=# insert into gtt values (generate_series(1,10)); > INSERT 0 10 > postgres=# insert into gtt values (1); > INSERT 0 1 > What when session 2 has active transaction? Then to be correct, you should to wait with index creation to end of transaction. > Session1: > postgres=# create unique index on gtt(x); > CREATE INDEX > > Sessin2: > postgres=# explain select * from gtt where x=1; > ERROR: could not create unique index "gtt_x_idx" > DETAIL: Key (x)=(1) is duplicated. > This is little bit unexpected behave (probably nobody expect so any SELECT fail with error "could not create index" - I understand exactly to reason and context, but this side effect is something what I afraid. > > > I don't believe that the feature you are proposing can be correctly > > implemented in 10 lines of code. I would be pleasantly surprised if it > > can be done in 1000. 
> > > Right now I do not see any sources of extra complexity. > Will be pleased if you can point them to me. > > -- > Konstantin Knizhnik > Postgres Professional: http://www.postgrespro.com > The Russian Postgres Company > >
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
Hi, On 2020-02-04 10:15:01 +0530, Kuntal Ghosh wrote: > And, the issue got reproduced with the same error: > WARNING: problem in Generation Tuples: number of free chunks 0 in > block 0x7fe9e9e74010 exceeds 1018 allocated That seems like a problem in generation.c - because this should be unreachable, I think? Tomas? Regards, Andres
Re: Assumptions about the number of parallel workers
Hi, On 2020-02-07 09:44:34 +0100, Antonin Houska wrote: > Those Gather nodes still have non-zero num_workers, see this part of > standard_planner: > > if (force_parallel_mode != FORCE_PARALLEL_OFF && top_plan->parallel_safe) > { > ... > gather->num_workers = 1; > gather->single_copy = true; Ick. Looks like you might be right... > Also, if it num_workers was zero for any reason, my patch would probably make > regression tests fail. However I haven't noticed any assertion failure. That however, is not at all guaranteed. The regression tests don't run (or at least not much) with force_parallel_mode set. You can try yourself though, with something like PGOPTIONS='-c force_parallel_mode=regress' make check Greetings, Andres Freund
Re: Internal key management system
Hi, On 2020-02-07 20:44:31 +0900, Masahiko Sawada wrote: > Yeah I'm not going to use pgcrypto for transparent data encryption. > The KMS patch includes the new basic infrastructure for cryptographic > functions (mainly AES-CBC). I'm thinking we can expand that > infrastructure so that we can also use it for TDE purpose by > supporting new cryptographic functions such as AES-CTR. Anyway, I > agree to not have it depend on pgcrypto. I thought for a minute, before checking the patch, that you were saying above that the KMS patch includes its *own* implementation of cryptographic functions. I think it's pretty crucial that it continues not to do that... Greetings, Andres Freund
Re: In PG12, query with float calculations is slower than PG11
Hi, On 2020-02-07 17:17:21 +0900, Amit Langote wrote: > I did some tests using two relatively recent compilers: gcc 8 and > clang-7 and here are the results: Hm, these very much look like they've been done in an unoptimized build? > 40.62% postgres postgres [.] ExecInterpExpr > 9.74% postgres postgres [.] float8_accum > 6.12% postgres libc-2.17.so [.] __isinf > 5.96% postgres postgres [.] float8mul > 5.33% postgres postgres [.] dsqrt > 3.90% postgres postgres [.] ftod > 3.53% postgres postgres [.] Float8GetDatum > 2.34% postgres postgres [.] DatumGetFloat8 > 2.15% postgres postgres [.] AggCheckCallContext > 2.03% postgres postgres [.] slot_deform_tuple > 1.95% postgres libm-2.17.so [.] __sqrt > 1.19% postgres postgres [.] check_float8_array > HEAD > > latency average = 549.071 ms > > 31.74% postgres postgres [.] ExecInterpExpr > 11.02% postgres libc-2.17.so [.] __isinf > 10.58% postgres postgres [.] float8_accum > 4.84% postgres postgres [.] check_float8_val > 4.66% postgres postgres [.] dsqrt > 3.91% postgres postgres [.] float8mul > 3.56% postgres postgres [.] ftod > 3.26% postgres postgres [.] Float8GetDatum > 2.91% postgres postgres [.] float8_mul > 2.30% postgres postgres [.] DatumGetFloat8 > 2.19% postgres postgres [.] slot_deform_heap_tuple > 1.81% postgres postgres [.] AggCheckCallContext > 1.31% postgres libm-2.17.so [.] __sqrt > 1.25% postgres postgres [.] check_float8_array Because DatumGetFloat8, Float8GetDatum, etc aren't functions that normally stay separate. Greetings, Andres Freund
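To spell out the last point: DatumGetFloat8() and Float8GetDatum() are tiny static inline conversion helpers, so in an optimized build the compiler folds them into their callers and they never appear as separate symbols in a profile; seeing them listed individually is a strong hint of an -O0 build. A simplified, hypothetical illustration of such a helper (not the real postgres.h definitions):

#include <stdint.h>
#include <string.h>

typedef uintptr_t Datum;        /* stand-in for the real typedef */

/*
 * Hypothetical float8 <-> Datum conversions, assuming an 8-byte Datum as on
 * 64-bit builds.  Compiled with -O2 these reduce to a register move or a
 * single store and vanish from profiles; compiled with -O0 every call is a
 * real function call that perf reports separately.
 */
static inline Datum
demo_Float8GetDatum(double x)
{
    Datum   d;

    memcpy(&d, &x, sizeof(d));
    return d;
}

static inline double
demo_DatumGetFloat8(Datum d)
{
    double  x;

    memcpy(&x, &d, sizeof(x));
    return x;
}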
Re: In PG12, query with float calculations is slower than PG11
Moin, On 2020-02-07 15:42, Emre Hasegeli wrote: > The patch looks unduly invasive to me, but I think that it might be > right that we should go back to a macro-based implementation, because > otherwise we don't have a good way to be certain that the function > parameter won't get evaluated first. I'd first like to see some actual evidence of this being a problem, rather than just the order of the checks. There seem to be enough evidence of this being the problem. We are better off going back to the macro-based implementation. I polished Keisuke Kuroda's patch commenting about the performance issue, removed the check_float*_val() functions completely, and added unlikely() as Tom Lane suggested. It is attached. I confirmed with different compilers that the macro, and unlikely() makes this noticeably faster. Hm, the diff has the macro tests as: + if (unlikely(isinf(val) && !(inf_is_valid))) ... + if (unlikely((val) == 0.0 && !(zero_is_valid))) But the comment does not explain that this test has to be in that order, or the compiler will for non-constant arguments evalute the (now) right-side first. E.g. if I understand this correctly: + if (!(zero_is_valid) && unlikely((val) == 0.0) would have the same problem of evaluating "zero_is_valid" (which might be an isinf(arg1) || isinf(arg2)) first and so be the same thing we try to avoid with the macro? Maybe adding this bit of info to the comment makes it clearer? Also, a few places use the macro as: + CHECKFLOATVAL(result, true, true); which evaluates to a complete NOP in both cases. IMHO this could be replaced with a comment like: + // No CHECKFLOATVAL() needed, as both inf and 0.0 are valid (or something along the lines of "no error can occur"), as otherwise CHECKFLOATVAL() implies to the casual reader that there are some checks done, while in reality no real checks are done at all (and hopefully the compiler optimizes everything away, which might not be true for debug builds). -- Best regards, TelsFrom e869373ad093e668872f08833de2c5c614aab673 Mon Sep 17 00:00:00 2001 From: Emre Hasegeli Date: Fri, 7 Feb 2020 10:27:25 + Subject: [PATCH] Bring back CHECKFLOATVAL() macro The inline functions added by 6bf0bc842b caused the conditions of overflow/underflow checks to be evaluated when no overflow/underflow happen. This slowed down floating point operations. This commit brings back the macro that was in use before 6bf0bc842b to fix the performace regression. 
Reported-by: Keisuke Kuroda Author: Keisuke Kuroda Discussion: https://www.postgresql.org/message-id/CANDwggLe1Gc1OrRqvPfGE%3DkM9K0FSfia0hbeFCEmwabhLz95AA%40mail.gmail.com --- src/backend/utils/adt/float.c | 66 ++--- src/backend/utils/adt/geo_ops.c | 2 +- src/include/utils/float.h | 75 ++--- 3 files changed, 66 insertions(+), 77 deletions(-) diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c index a90d4db215..5885719850 100644 --- a/src/backend/utils/adt/float.c +++ b/src/backend/utils/adt/float.c @@ -1184,21 +1184,21 @@ ftod(PG_FUNCTION_ARGS) /* * dtof - converts a float8 number to a float4 number */ Datum dtof(PG_FUNCTION_ARGS) { float8 num = PG_GETARG_FLOAT8(0); - check_float4_val((float4) num, isinf(num), num == 0); + CHECKFLOATVAL((float4) num, isinf(num), num == 0); PG_RETURN_FLOAT4((float4) num); } /* * dtoi4 - converts a float8 number to an int4 number */ Datum dtoi4(PG_FUNCTION_ARGS) @@ -1438,36 +1438,36 @@ dsqrt(PG_FUNCTION_ARGS) float8 arg1 = PG_GETARG_FLOAT8(0); float8 result; if (arg1 < 0) ereport(ERROR, (errcode(ERRCODE_INVALID_ARGUMENT_FOR_POWER_FUNCTION), errmsg("cannot take square root of a negative number"))); result = sqrt(arg1); - check_float8_val(result, isinf(arg1), arg1 == 0); + CHECKFLOATVAL(result, isinf(arg1), arg1 == 0); PG_RETURN_FLOAT8(result); } /* * dcbrt - returns cube root of arg1 */ Datum dcbrt(PG_FUNCTION_ARGS) { float8 arg1 = PG_GETARG_FLOAT8(0); float8 result; result = cbrt(arg1); - check_float8_val(result, isinf(arg1), arg1 == 0); + CHECKFLOATVAL(result, isinf(arg1), arg1 == 0); PG_RETURN_FLOAT8(result); } /* * dpow - returns pow(arg1,arg2) */ Datum dpow(PG_FUNCTION_ARGS) { @@ -1525,40 +1525,40 @@ dpow(PG_FUNCTION_ARGS) /* The sign of Inf is not significant in this case. */ result = get_float8_infinity(); else if (fabs(arg1) != 1) result = 0; else result = 1; } else if (errno == ERANGE && result != 0 && !isinf(result)) result = get_float8_infinity(); - check_float8_val(result, isinf(arg1) || isinf(arg2), arg1 == 0); + CHECKFLOATVAL(result, isinf(arg1) || isinf(arg2), arg1 == 0); PG_RETURN_FLOAT8(result); } /* * dexp - returns the exponential function of arg1 */ Datum dexp(PG_FUNCTION_ARGS) { float8 arg1 = PG_GETARG_FLOAT8(0); float8 result; errno = 0;
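As to the evaluation-order question raised earlier in this message: && short-circuits left to right, so putting the cheap test on val first is what keeps the potentially expensive validity expression from being evaluated on every call. A tiny standalone demonstration (plain C, not PostgreSQL code):

#include <stdbool.h>
#include <stdio.h>

static int  n_validity_checks = 0;

/* stands in for something like isinf(arg1) || isinf(arg2) */
static bool
validity_check(void)
{
    n_validity_checks++;
    return false;
}

int
main(void)
{
    double  val = 42.0;     /* neither zero nor infinite */

    /* order used by the macro: cheap test first, so the right side is skipped */
    if (val == 0.0 && !validity_check())
        puts("underflow");

    /* reversed order: the validity expression runs unconditionally */
    if (!validity_check() && val == 0.0)
        puts("underflow");

    printf("validity checks evaluated: %d\n", n_validity_checks);   /* prints 1 */
    return 0;
}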
Re: [Proposal] Global temporary tables
On Fri, Feb 7, 2020 at 12:28 PM Konstantin Knizhnik wrote: > But in any case, if we agree that we can control thus behavior using GUC > or index property, > then it is ok for me. Nope, I am not going to agree to that, and I don't believe that any other committer will, either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Global temporary tables
Fix GTT index initialization. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c index 33e2d28..93059ef 100644 --- a/contrib/pg_prewarm/pg_prewarm.c +++ b/contrib/pg_prewarm/pg_prewarm.c @@ -178,7 +178,7 @@ pg_prewarm(PG_FUNCTION_ARGS) for (block = first_block; block <= last_block; ++block) { CHECK_FOR_INTERRUPTS(); - smgrread(rel->rd_smgr, forkNumber, block, blockbuffer.data); + smgrread(rel->rd_smgr, forkNumber, block, blockbuffer.data, false); ++blocks_done; } } diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 79430d2..39baddc 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -158,6 +158,19 @@ static relopt_bool boolRelOpts[] = }, true }, + /* + * For global temp table only + * use AccessExclusiveLock for ensure safety + */ + { + { + "on_commit_delete_rows", + "global temp table on commit options", + RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED, + ShareUpdateExclusiveLock + }, + false + }, /* list terminator */ {{NULL}} }; @@ -1486,6 +1499,8 @@ bytea * default_reloptions(Datum reloptions, bool validate, relopt_kind kind) { static const relopt_parse_elt tab[] = { + {"on_commit_delete_rows", RELOPT_TYPE_BOOL, + offsetof(StdRdOptions, on_commit_delete_rows)}, {"fillfactor", RELOPT_TYPE_INT, offsetof(StdRdOptions, fillfactor)}, {"autovacuum_enabled", RELOPT_TYPE_BOOL, offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, enabled)}, @@ -1586,13 +1601,17 @@ build_reloptions(Datum reloptions, bool validate, bytea * partitioned_table_reloptions(Datum reloptions, bool validate) { + static const relopt_parse_elt tab[] = { + {"on_commit_delete_rows", RELOPT_TYPE_BOOL, + offsetof(StdRdOptions, on_commit_delete_rows)} + }; /* * There are no options for partitioned tables yet, but this is able to do * some validation. */ return (bytea *) build_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED, - 0, NULL, 0); + sizeof(StdRdOptions), tab, lengthof(tab)); } /* diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index 3fa4b76..a86de50 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -670,6 +670,7 @@ heapam_relation_copy_data(Relation rel, const RelFileNode *newrnode) * init fork of an unlogged relation. 
*/ if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT || +rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION || (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED && forkNum == INIT_FORKNUM)) log_smgrcreate(newrnode, forkNum); diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c index 7d6acae..7c48e5c 100644 --- a/src/backend/catalog/catalog.c +++ b/src/backend/catalog/catalog.c @@ -396,6 +396,9 @@ GetNewRelFileNode(Oid reltablespace, Relation pg_class, char relpersistence) case RELPERSISTENCE_TEMP: backend = BackendIdForTempRelations(); break; + case RELPERSISTENCE_SESSION: + backend = BackendIdForSessionRelations(); + break; case RELPERSISTENCE_UNLOGGED: case RELPERSISTENCE_PERMANENT: backend = InvalidBackendId; diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 8880586..22ce895 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -3707,7 +3707,7 @@ reindex_relation(Oid relid, int flags, int options) if (flags & REINDEX_REL_FORCE_INDEXES_UNLOGGED) persistence = RELPERSISTENCE_UNLOGGED; else if (flags & REINDEX_REL_FORCE_INDEXES_PERMANENT) - persistence = RELPERSISTENCE_PERMANENT; + persistence = rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION ? RELPERSISTENCE_SESSION : RELPERSISTENCE_PERMANENT; else persistence = rel->rd_rel->relpersistence; diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index fddfbf1..9747835 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -92,6 +92,10 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) backend = InvalidBackendId; needs_wal = false; break; + case RELPERSISTENCE_SESSION: + backend = BackendIdForSessionRelations(); + needs_wal = false; + break; case RELPERSISTENCE_PERMANENT: backend = InvalidBackendId; needs_wal = true; @@ -367,7 +371,7 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst, /* If we got a cancel signal during the copy of the data, quit */ CHECK_FOR_INTERRUPTS(); - smgrread(src, forkNum, blkno, buf.data); + smgrread(src, forkNum, blkno, buf.data, false); if (!PageIsVerified(page, blkno)) ereport(ERROR, diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index c9e6060..1f5e52b 10064
allow frontend use of the backend's core hashing functions
Late last year, I did some work to make it possible to use simplehash in frontend code.[1] However, a hash table is not much good unless one also has some hash functions that one can use to hash the keys that need to be inserted into that hash table. I initially thought that solving this problem was going to be pretty annoying, but when I looked at it again today I found what I think is a pretty simple way to adapt things so that the hashing routines we use in the backend are easily available to frontend code. Here are some patches for that. It may make sense to combine some of these in terms of actually getting this committed, but I have separated them here so that it is, hopefully, easy to see what I did and why I did it. There are three basic problems which are solved by the three preparatory patches: 0001 - hashfn.c has a couple of routines that depend on bitmapsets, and bitmapset.c is currently backend-only. Fix by moving this code near related code in bitmapset.c. 0002 - some of the prototypes for functions in hashfn.c are in hsearch.h, mixed with the dynahash stuff, and others are in hashutils.c. Fix by making hashutils.h the one true header for hashfn.c. 0003 - Some of hashfn.c's routines return Datum, but that's backend-only. Fix by renaming the functions and changing the return types to uint32 and uint64, and add static inline wrappers with the old names that convert to Datum. Just changing the return types of the existing functions seemed like it would've required a lot more code churn, and also seems like it could cause silent breakage in the future. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] http://postgr.es/m/CA+Tgmob8oyh02NrZW=xCScB+5GyJ-jVowE3+TWTUmPF=fsg...@mail.gmail.com 0001-Move-bitmap_hash-and-bitmap_match-to-bitmapset.c.patch Description: Binary data 0004-Move-src-backend-utils-hash-hashfn.c-to-src-common.patch Description: Binary data 0003-Adapt-hashfn.c-and-hashutils.h-for-frontend-use.patch Description: Binary data 0002-Put-all-the-prototypes-for-hashfn.c-into-the-same-he.patch Description: Binary data
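The shape of the 0003 change, as described above, is roughly the following (a hedged sketch with made-up names; the real hashfn.c functions and wrappers differ in detail): the worker function returns a plain integer type that frontend code can use, and a thin static inline wrapper restores the old Datum-returning signature for existing backend callers.

#include <stdint.h>

typedef uintptr_t Datum;            /* backend-only in reality; stand-in here */
#define UInt32GetDatum(x) ((Datum) (x))

/* frontend-usable worker: plain uint32 in, plain uint32 out */
static uint32
demo_hash_uint32(uint32 k)
{
    /* placeholder bit mixing, not the real hash function */
    k ^= k >> 16;
    k *= 0x85ebca6b;
    k ^= k >> 13;
    return k;
}

/*
 * Backend-side wrapper: in the real patch the wrapper keeps the pre-existing,
 * Datum-returning name so that backend callers need no changes.
 */
static inline Datum
demo_hash_uint32_datum(uint32 k)
{
    return UInt32GetDatum(demo_hash_uint32(k));
}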
Re: [Proposal] Global temporary tables
On 07.02.2020 18:15, Robert Haas wrote: On Wed, Feb 5, 2020 at 10:48 AM Konstantin Knizhnik wrote: My answer is - yes. Just because: - Such behavior is compatible with regular tables. So it will not confuse users and doesn't require some complex explanations. - It is compatible with Oracle. - It is what DBA usually want when creating index. - There are several arguments against such behavior: - Concurrent building of index in multiple sessions can consume a lot of memory - Building index can increase query execution time (which can be not expected by clients) I think those are good arguments, especially the second one. There's no limit on how long building a new index might take, and it could be several minutes. A user who was running a query that could have completed in a few seconds or even milliseconds will be unhappy to suddenly wait a long time for a new index to be built. And that is an entirely realistic scenario, because the new index might be better, but only marginally. Yes, I agree that this arguments are important. But IMHO less important than incompatible behavior (Pavel doesn't agree with word "incompatible" in this context since semantic of temp tables is in any case different with semantic of regular tables). Just want to notice that if we have huge GTT (so that creation of index takes significant amount of time) sequential scan of this table also will not be fast. But in any case, if we agree that we can control thus behavior using GUC or index property, then it is ok for me. Also, an important point to which I've already alluded a few times is that creating an index can fail. Now, one way it can fail is that there could be some problem writing to disk, or you could run out of memory, or whatever. However, it can also fail because the new index is UNIQUE and the data this backend has in the table doesn't conform to the associated constraint. It will be confusing if all access to a table suddenly starts complaining about uniqueness violations. Yes, building index can fail (as any other operation with database). What's wring with it? If it is fatal error, then backend is terminated and content of its temp table is disappeared. If it is non-fatal error, then current transaction is aborted: Session1: postgres=# create global temp table gtt(x integer); CREATE TABLE postgres=# insert into gtt values (generate_series(1,10)); INSERT 0 10 Session2: postgres=# insert into gtt values (generate_series(1,10)); INSERT 0 10 postgres=# insert into gtt values (1); INSERT 0 1 Session1: postgres=# create unique index on gtt(x); CREATE INDEX Sessin2: postgres=# explain select * from gtt where x=1; ERROR: could not create unique index "gtt_x_idx" DETAIL: Key (x)=(1) is duplicated. I don't believe that the feature you are proposing can be correctly implemented in 10 lines of code. I would be pleasantly surprised if it can be done in 1000. Right now I do not see any sources of extra complexity. Will be pleased if you can point them to me. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] emergency outage requiring database restart
On Tue, Jan 3, 2017 at 1:05 PM Peter Eisentraut wrote:
>
> On 11/7/16 5:31 PM, Merlin Moncure wrote:
> > Regardless, it seems like you might be on to something, and I'm
> > inclined to patch your change, test it, and roll it out to production.
> > If it helps or at least narrows the problem down, we ought to give it
> > consideration for inclusion (unless someone else can think of a good
> > reason not to do that, heh!).
>
> Any results yet?

Not yet. But I do have some interesting findings. At this point I do not
think the problem is within pl/sh itself, but that when a process invoked
from pl/sh misbehaves, that misbehavior can penetrate into the database
processes. I also believe that this problem is fd related, so the
'close on exec' flag might reasonably fix it. All cases of database damage
I have observed remain completely mitigated by enabling database checksums.

Recently, a sqsh process kicked off via pl/sh crashed with signal 11 but
the database process was otherwise intact and fine. This is strong
supporting evidence for my points above, I think. I've also turned up a
fairly reliable reproduction case from some unrelated application changes.
If I can demonstrate that the close-on-exec flag works and prevents these
occurrences, we can close the book on this.

merlin
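For context, the "close on exec" flag being referred to is the POSIX FD_CLOEXEC descriptor flag: a descriptor marked with it is closed automatically in any child that exec()s, so an external program started by pl/sh cannot inherit (and interfere with) files the backend still has open. A minimal standalone example of setting it (ordinary POSIX code, not the proposed pl/sh change):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Mark an already-open descriptor close-on-exec; returns 0 on success. */
static int
set_cloexec(int fd)
{
    int     flags = fcntl(fd, F_GETFD);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}

int
main(void)
{
    int     fd = open("example.dat", O_RDWR | O_CREAT, 0600);

    if (fd < 0 || set_cloexec(fd) != 0)
    {
        perror("FD_CLOEXEC");
        return 1;
    }

    /* any exec()'d child -- e.g. a shell launched by pl/sh -- no longer sees fd */
    execl("/bin/sh", "sh", "-c", "echo child runs without the inherited fd", (char *) NULL);
    perror("execl");
    return 1;
}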
Re: Index Skip Scan
> On Thu, Feb 06, 2020 at 09:22:20PM +0900, Kyotaro Horiguchi wrote: > At Thu, 6 Feb 2020 11:57:07 +0100, Dmitry Dolgov <9erthali...@gmail.com> > wrote in > > > On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote: > > > At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> > > > wrote in > > > We could add an additional parameter "in_cursor" to > > > ExecSupportBackwardScan and let skip scan return false if in_cursor is > > > true, but I'm not sure it's acceptable. > > > > I also was thinking about whether it's possible to use > > ExecSupportBackwardScan here, but skip scan is just a mode of an > > index/indexonly scan. Which means that ExecSupportBackwardScan also need > > to know somehow if this mode is being used, and then, since this > > function is called after it's already decided to use skip scan in the > > resulting plan, somehow correct the plan (exclude skipping and try to > > find next best path?) - do I understand your suggestion correct? > > I didn't thought so hardly, but a bit of confirmation told me that > IndexSupportsBackwardScan returns fixed flag for AM. It seems that > things are not that simple. Yes, I've mentioned that already in one of the previous emails :) The simplest way I see to achieve what we want is to do something like in attached modified version with a new hasDeclaredCursor field. It's not a final version though, but posted just for discussion, so feel free to suggest any improvements or alternatives. >From 22e6b4ccd5f79ca069bd5cd90ba3696dd97f76ea Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Tue, 9 Jul 2019 06:44:57 -0400 Subject: [PATCH v33 1/2] Unique key Design by David Rowley. Author: Jesper Pedersen --- src/backend/nodes/outfuncs.c | 14 +++ src/backend/nodes/print.c | 39 +++ src/backend/optimizer/path/Makefile| 3 +- src/backend/optimizer/path/allpaths.c | 8 ++ src/backend/optimizer/path/indxpath.c | 41 +++ src/backend/optimizer/path/pathkeys.c | 71 ++-- src/backend/optimizer/path/uniquekey.c | 147 + src/backend/optimizer/plan/planagg.c | 1 + src/backend/optimizer/plan/planmain.c | 1 + src/backend/optimizer/plan/planner.c | 17 ++- src/backend/optimizer/util/pathnode.c | 12 ++ src/include/nodes/nodes.h | 1 + src/include/nodes/pathnodes.h | 18 +++ src/include/nodes/print.h | 1 + src/include/optimizer/pathnode.h | 1 + src/include/optimizer/paths.h | 11 ++ 16 files changed, 373 insertions(+), 13 deletions(-) create mode 100644 src/backend/optimizer/path/uniquekey.c diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index d76fae44b8..16083e7a7e 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -1723,6 +1723,7 @@ _outPathInfo(StringInfo str, const Path *node) WRITE_FLOAT_FIELD(startup_cost, "%.2f"); WRITE_FLOAT_FIELD(total_cost, "%.2f"); WRITE_NODE_FIELD(pathkeys); + WRITE_NODE_FIELD(uniquekeys); } /* @@ -2205,6 +2206,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_NODE_FIELD(eq_classes); WRITE_BOOL_FIELD(ec_merging_done); WRITE_NODE_FIELD(canon_pathkeys); + WRITE_NODE_FIELD(canon_uniquekeys); WRITE_NODE_FIELD(left_join_clauses); WRITE_NODE_FIELD(right_join_clauses); WRITE_NODE_FIELD(full_join_clauses); @@ -2214,6 +2216,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_NODE_FIELD(placeholder_list); WRITE_NODE_FIELD(fkey_list); WRITE_NODE_FIELD(query_pathkeys); + WRITE_NODE_FIELD(query_uniquekeys); WRITE_NODE_FIELD(group_pathkeys); WRITE_NODE_FIELD(window_pathkeys); WRITE_NODE_FIELD(distinct_pathkeys); @@ -2401,6 +2404,14 @@ 
_outPathKey(StringInfo str, const PathKey *node) WRITE_BOOL_FIELD(pk_nulls_first); } +static void +_outUniqueKey(StringInfo str, const UniqueKey *node) +{ + WRITE_NODE_TYPE("UNIQUEKEY"); + + WRITE_NODE_FIELD(eq_clause); +} + static void _outPathTarget(StringInfo str, const PathTarget *node) { @@ -4092,6 +4103,9 @@ outNode(StringInfo str, const void *obj) case T_PathKey: _outPathKey(str, obj); break; + case T_UniqueKey: +_outUniqueKey(str, obj); +break; case T_PathTarget: _outPathTarget(str, obj); break; diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c index 42476724d8..d286b34544 100644 --- a/src/backend/nodes/print.c +++ b/src/backend/nodes/print.c @@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable) printf(")\n"); } +/* + * print_uniquekeys - + * uniquekeys list of UniqueKeys + */ +void +print_uniquekeys(const List *uniquekeys, const List *rtable) +{ + ListCell *l; + + printf("("); + foreach(l, uniquekeys) + { + UniqueKey *unique_key = (UniqueKey *) lfirst(l); + EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause; + ListCell *k; + bool first = true; + + /* chase up */ + while (eclass->ec_merged)
Re: proposal: schema variables
Hi rebase Regards Pavel 0002-transactional-variables-20200207.patch.gz Description: application/gzip 0001-schema-variables-20200207.patch.gz Description: application/gzip
Does recovery write to backup_label ?
Just saw this in a PG 11.6 cluster starting a recovery:

2020-02-07 10:45:40 EST FATAL: 42501: could not open file "backup_label": Permission denied
2020-02-07 10:45:40 EST LOCATION: fsync_fname_ext, fd.c:3531

The label file was written with mode 0400 by a script that got the contents
from pg_stop_backup(boolean,boolean). But during recovery, it is being poked
at by fsync_fname_ext which wants to open it O_RDWR. I had assumed the label
file would be treated as readonly during recovery.

If the file needs to have 0600 permissions, should there be a note in the
nonexclusive-mode backup docs to say so?

Regards,
-Chap
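The failure is reproducible outside PostgreSQL (a standalone sketch of the situation described above, not the fd.c code): a file created with mode 0400 can still be written through the descriptor returned at creation time, but a later open with O_RDWR -- which is what an fsync pass that wants a writable descriptor does -- fails with "Permission denied" for the same unprivileged user.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const char *path = "backup_label.demo";
    int         fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0400);

    if (fd < 0)
    {
        perror("create");
        return 1;
    }
    close(fd);

    /* reopen the 0400 file read-write, as a "must be writable" fsync pass would */
    fd = open(path, O_RDWR);
    if (fd < 0)
        printf("could not open file \"%s\": %s\n", path, strerror(errno));
    else
    {
        fsync(fd);
        close(fd);
    }
    unlink(path);
    return 0;
}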
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi Andy, What might help is to add more description to your email message like giving examples to explain your idea. Anyway, I looked at the testcases you added for examples. +create table select_distinct_a(a int, b char(20), c char(20) not null, d int, e int, primary key(a, b)); +set enable_mergejoin to off; +set enable_hashjoin to off; +-- no node for distinct. +explain (costs off) select distinct * from select_distinct_a; + QUERY PLAN +--- + Seq Scan on select_distinct_a +(1 row) >From this example, it seems that the distinct operation can be dropped because (a, b) is a primary key. Is my understanding correct? I like the idea since it eliminates one expensive operation. However the patch as presented has some problems 1. What happens if the primary key constraint or NOT NULL constraint gets dropped between a prepare and execute? The plan will no more be valid and thus execution may produce non-distinct results. PostgreSQL has similar concept of allowing non-grouping expression as part of targetlist when those expressions can be proved to be functionally dependent on the GROUP BY clause. See check_functional_grouping() and its caller. I think, DISTINCT elimination should work on similar lines. 2. For the same reason described in check_functional_grouping(), using unique indexes for eliminating DISTINCT should be discouraged. 3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY as well 4. The patch works only at the query level, but that functionality can be expanded generally to other places which add Unique/HashAggregate/Group nodes if the underlying relation can be proved to produce distinct rows. But that's probably more work since we will have to label paths with unique keys similar to pathkeys. 5. Have you tested this OUTER joins, which can render inner side nullable? On Thu, Feb 6, 2020 at 11:31 AM Andy Fan wrote: > update the patch with considering the semi/anti join. > > Can anyone help to review this patch? > > Thanks > > > On Fri, Jan 31, 2020 at 8:39 PM Andy Fan wrote: > >> Hi: >> >> I wrote a patch to erase the distinctClause if the result is unique by >> definition, I find this because a user switch this code from oracle >> to PG and find the performance is bad due to this, so I adapt pg for >> this as well. >> >> This patch doesn't work for a well-written SQL, but some drawback >> of a SQL may be not very obvious, since the cost of checking is pretty >> low as well, so I think it would be ok to add.. >> >> Please see the patch for details. >> >> Thank you. >> > -- -- Best Wishes, Ashutosh Bapat
Re: improve transparency of bitmap-only heap scans
Hello, It took me a while to figure out what those names mean. "unfetched", as you call it on the code, may be more descriptive than "avoided" for the new label. However I think the other two are more confusing. It may be a good idea to change them together with this. It'll be sad if this patch is forgotten only because of the words choice. I've changed it all to "unfetched" for at least not to call the same thing differently in the code and in the output, and also rebased it and fit in 80 lines width limit. Best, Alex diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index d901dc4a50..c7200d2a21 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -2777,6 +2777,8 @@ show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es) { if (es->format != EXPLAIN_FORMAT_TEXT) { + ExplainPropertyInteger("Unfetched Heap Blocks", NULL, + planstate->unfetched_pages, es); ExplainPropertyInteger("Exact Heap Blocks", NULL, planstate->exact_pages, es); ExplainPropertyInteger("Lossy Heap Blocks", NULL, @@ -2784,10 +2786,14 @@ show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es) } else { - if (planstate->exact_pages > 0 || planstate->lossy_pages > 0) + if (planstate->unfetched_pages > 0 || planstate->exact_pages > 0 + || planstate->lossy_pages > 0) { ExplainIndentText(es); appendStringInfoString(es->str, "Heap Blocks:"); + if (planstate->unfetched_pages > 0) +appendStringInfo(es->str, " unfetched=%ld", + planstate->unfetched_pages); if (planstate->exact_pages > 0) appendStringInfo(es->str, " exact=%ld", planstate->exact_pages); if (planstate->lossy_pages > 0) diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c index ae8a11da30..0456f8592b 100644 --- a/src/backend/executor/nodeBitmapHeapscan.c +++ b/src/backend/executor/nodeBitmapHeapscan.c @@ -231,17 +231,20 @@ BitmapHeapNext(BitmapHeapScanState *node) * node->return_empty_tuples. */ node->return_empty_tuples = tbmres->ntuples; +node->unfetched_pages++; } else if (!table_scan_bitmap_next_block(scan, tbmres)) { /* AM doesn't think this block is valid, skip */ continue; } - - if (tbmres->ntuples >= 0) -node->exact_pages++; else -node->lossy_pages++; + { +if (tbmres->ntuples >= 0) + node->exact_pages++; +else + node->lossy_pages++; + } /* Adjust the prefetch target */ BitmapAdjustPrefetchTarget(node); diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index 5d5b38b879..0bd75c329c 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -1564,6 +1564,7 @@ typedef struct ParallelBitmapHeapState * pvmbuffer ditto, for prefetched pages * exact_pages total number of exact pages retrieved * lossy_pages total number of lossy pages retrieved + * unfetched_pages total number of pages not retrieved due to vm * prefetch_iterator iterator for prefetching ahead of current page * prefetch_pages # pages prefetch iterator is ahead of current * prefetch_targetcurrent target prefetch distance @@ -1588,6 +1589,7 @@ typedef struct BitmapHeapScanState Buffer pvmbuffer; long exact_pages; long lossy_pages; + long unfetched_pages; TBMIterator *prefetch_iterator; int prefetch_pages; int prefetch_target;
Re: Assumptions about the number of parallel workers
On Wed, Feb 5, 2020 at 4:49 AM Antonin Houska wrote: > I can't figure out why ExecGather/ExecGatherMerge do check whether num_workers > is non-zero. I think the code would be a bit clearer if these tests were > replaced with Assert() statements, as the attached patch does. Hmm. There are some cases where we plan on using a Gather node but then can't actually fire up parallelism because we run out of DSM segments or we run out of background workers. But the Gather is just part of the plan, so it would still have num_workers > 0 in those cases. This might just have been a thinko on my part, but I'm not totally sure. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Proposal] Global temporary tables
On Wed, Feb 5, 2020 at 10:48 AM Konstantin Knizhnik wrote: > > I don't understand. A global temporary table, as I understand it, is a > > table for which each session sees separate contents. So you would > > never need to populate it with existing data. > Session 1: > create global temp table gtt(x integer); > insert into gtt values (generate_series(1,10)); > > Session 2: > insert into gtt values (generate_series(1,20)); > > Session1: > create index on gtt(x); > explain select * from gtt where x = 1; > > Session2: > explain select * from gtt where x = 1; > ??? Should we use index here? OK, I see where you're coming from now. > My answer is - yes. > Just because: > - Such behavior is compatible with regular tables. So it will not > confuse users and doesn't require some complex explanations. > - It is compatible with Oracle. > - It is what DBA usually want when creating index. > - > There are several arguments against such behavior: > - Concurrent building of index in multiple sessions can consume a lot of > memory > - Building index can increase query execution time (which can be not > expected by clients) I think those are good arguments, especially the second one. There's no limit on how long building a new index might take, and it could be several minutes. A user who was running a query that could have completed in a few seconds or even milliseconds will be unhappy to suddenly wait a long time for a new index to be built. And that is an entirely realistic scenario, because the new index might be better, but only marginally. Also, an important point to which I've already alluded a few times is that creating an index can fail. Now, one way it can fail is that there could be some problem writing to disk, or you could run out of memory, or whatever. However, it can also fail because the new index is UNIQUE and the data this backend has in the table doesn't conform to the associated constraint. It will be confusing if all access to a table suddenly starts complaining about uniqueness violations. > That is all - just 10 line of code. I don't believe that the feature you are proposing can be correctly implemented in 10 lines of code. I would be pleasantly surprised if it can be done in 1000. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: In PG12, query with float calculations is slower than PG11
> > The patch looks unduly invasive to me, but I think that it might be > > right that we should go back to a macro-based implementation, because > > otherwise we don't have a good way to be certain that the function > > parameter won't get evaluated first. > > I'd first like to see some actual evidence of this being a problem, > rather than just the order of the checks. There seem to be enough evidence of this being the problem. We are better off going back to the macro-based implementation. I polished Keisuke Kuroda's patch commenting about the performance issue, removed the check_float*_val() functions completely, and added unlikely() as Tom Lane suggested. It is attached. I confirmed with different compilers that the macro, and unlikely() makes this noticeably faster. From e869373ad093e668872f08833de2c5c614aab673 Mon Sep 17 00:00:00 2001 From: Emre Hasegeli Date: Fri, 7 Feb 2020 10:27:25 + Subject: [PATCH] Bring back CHECKFLOATVAL() macro The inline functions added by 6bf0bc842b caused the conditions of overflow/underflow checks to be evaluated when no overflow/underflow happen. This slowed down floating point operations. This commit brings back the macro that was in use before 6bf0bc842b to fix the performace regression. Reported-by: Keisuke Kuroda Author: Keisuke Kuroda Discussion: https://www.postgresql.org/message-id/CANDwggLe1Gc1OrRqvPfGE%3DkM9K0FSfia0hbeFCEmwabhLz95AA%40mail.gmail.com --- src/backend/utils/adt/float.c | 66 ++--- src/backend/utils/adt/geo_ops.c | 2 +- src/include/utils/float.h | 75 ++--- 3 files changed, 66 insertions(+), 77 deletions(-) diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c index a90d4db215..5885719850 100644 --- a/src/backend/utils/adt/float.c +++ b/src/backend/utils/adt/float.c @@ -1184,21 +1184,21 @@ ftod(PG_FUNCTION_ARGS) /* * dtof - converts a float8 number to a float4 number */ Datum dtof(PG_FUNCTION_ARGS) { float8 num = PG_GETARG_FLOAT8(0); - check_float4_val((float4) num, isinf(num), num == 0); + CHECKFLOATVAL((float4) num, isinf(num), num == 0); PG_RETURN_FLOAT4((float4) num); } /* * dtoi4 - converts a float8 number to an int4 number */ Datum dtoi4(PG_FUNCTION_ARGS) @@ -1438,36 +1438,36 @@ dsqrt(PG_FUNCTION_ARGS) float8 arg1 = PG_GETARG_FLOAT8(0); float8 result; if (arg1 < 0) ereport(ERROR, (errcode(ERRCODE_INVALID_ARGUMENT_FOR_POWER_FUNCTION), errmsg("cannot take square root of a negative number"))); result = sqrt(arg1); - check_float8_val(result, isinf(arg1), arg1 == 0); + CHECKFLOATVAL(result, isinf(arg1), arg1 == 0); PG_RETURN_FLOAT8(result); } /* * dcbrt - returns cube root of arg1 */ Datum dcbrt(PG_FUNCTION_ARGS) { float8 arg1 = PG_GETARG_FLOAT8(0); float8 result; result = cbrt(arg1); - check_float8_val(result, isinf(arg1), arg1 == 0); + CHECKFLOATVAL(result, isinf(arg1), arg1 == 0); PG_RETURN_FLOAT8(result); } /* * dpow - returns pow(arg1,arg2) */ Datum dpow(PG_FUNCTION_ARGS) { @@ -1525,40 +1525,40 @@ dpow(PG_FUNCTION_ARGS) /* The sign of Inf is not significant in this case. 
*/ result = get_float8_infinity(); else if (fabs(arg1) != 1) result = 0; else result = 1; } else if (errno == ERANGE && result != 0 && !isinf(result)) result = get_float8_infinity(); - check_float8_val(result, isinf(arg1) || isinf(arg2), arg1 == 0); + CHECKFLOATVAL(result, isinf(arg1) || isinf(arg2), arg1 == 0); PG_RETURN_FLOAT8(result); } /* * dexp - returns the exponential function of arg1 */ Datum dexp(PG_FUNCTION_ARGS) { float8 arg1 = PG_GETARG_FLOAT8(0); float8 result; errno = 0; result = exp(arg1); if (errno == ERANGE && result != 0 && !isinf(result)) result = get_float8_infinity(); - check_float8_val(result, isinf(arg1), false); + CHECKFLOATVAL(result, isinf(arg1), false); PG_RETURN_FLOAT8(result); } /* * dlog1 - returns the natural logarithm of arg1 */ Datum dlog1(PG_FUNCTION_ARGS) { @@ -1573,21 +1573,21 @@ dlog1(PG_FUNCTION_ARGS) ereport(ERROR, (errcode(ERRCODE_INVALID_ARGUMENT_FOR_LOG), errmsg("cannot take logarithm of zero"))); if (arg1 < 0) ereport(ERROR, (errcode(ERRCODE_INVALID_ARGUMENT_FOR_LOG), errmsg("cannot take logarithm of a negative number"))); result = log(arg1); - check_float8_val(result, isinf(arg1), arg1 == 1); + CHECKFLOATVAL(result, isinf(arg1), arg1 == 1); PG_RETURN_FLOAT8(result); } /* * dlog10 - returns the base 10 logarithm of arg1 */ Datum dlog10(PG_FUNCTION_ARGS) { @@ -1603,21 +1603,21 @@ dlog10(PG_FUNCTION_ARGS) ereport(ERROR, (errcode(ERRCODE_INVALID_ARGUMENT_FOR_LOG), errmsg("cannot take logarithm of zero"))); if (arg1 < 0) ereport(ERROR, (errcode(ERRCODE_INVALID_ARGUMENT_FOR_LOG), errmsg("cannot take logarithm of a negative number"))); result = log10(arg1);
Re: ALTER tbl rewrite loses CLUSTER ON index (consider moving indisclustered to pg_class)
On Thu, Feb 06, 2020 at 02:24:47PM -0300, Alvaro Herrera wrote: > On 2020-Feb-06, Justin Pryzby wrote: > > > I wondered if it wouldn't be better if CLUSTER ON was stored in pg_class as > > the > > Oid of a clustered index, rather than a boolean in pg_index. > > Maybe. Do you want to try a patch? I think the attached is 80% complete (I didn't touch pg_dump). One objection to this change would be that all relations (including indices) end up with relclustered fields, and pg_index already has a number of bools, so it's not like this one bool is wasting a byte. I think relisclustered was a's clever way of avoiding that overhead (c0ad5953). So I would be -0.5 on moving it to pg_class.. But I think 0001 and 0002 are worthy. Maybe the test in 0002 should live somewhere else. >From 7eea0a17e495fe13379ffd589b551f2f145f5672 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 6 Feb 2020 21:48:13 -0600 Subject: [PATCH v1 1/3] Update comment obsolete since b9b8831a --- src/backend/commands/cluster.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c index e9d7a7f..3adcbeb 100644 --- a/src/backend/commands/cluster.c +++ b/src/backend/commands/cluster.c @@ -1539,9 +1539,9 @@ get_tables_to_cluster(MemoryContext cluster_context) /* * Get all indexes that have indisclustered set and are owned by - * appropriate user. System relations or nailed-in relations cannot ever - * have indisclustered set, because CLUSTER will refuse to set it when - * called with one of them as argument. + * appropriate user. Shared relations cannot ever have indisclustered + * set, because CLUSTER will refuse to set it when called with one as + * an argument. */ indRelation = table_open(IndexRelationId, AccessShareLock); ScanKeyInit(&entry, -- 2.7.4 >From 4777be522a7aa8b8c77b13f765cbd02043438f2a Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Fri, 7 Feb 2020 08:12:50 -0600 Subject: [PATCH v1 2/3] Give developer a helpful kick in the pants if they change natts in one place but not another --- src/backend/bootstrap/bootstrap.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c index bfc629c..d5e1888 100644 --- a/src/backend/bootstrap/bootstrap.c +++ b/src/backend/bootstrap/bootstrap.c @@ -25,7 +25,9 @@ #include "access/xlog_internal.h" #include "bootstrap/bootstrap.h" #include "catalog/index.h" +#include "catalog/pg_class.h" #include "catalog/pg_collation.h" +#include "catalog/pg_proc.h" #include "catalog/pg_type.h" #include "common/link-canary.h" #include "libpq/pqsignal.h" @@ -49,6 +51,7 @@ #include "utils/ps_status.h" #include "utils/rel.h" #include "utils/relmapper.h" +#include "utils/syscache.h" uint32 bootstrap_data_checksum_version = 0; /* No checksum */ @@ -602,6 +605,26 @@ boot_openrel(char *relname) TableScanDesc scan; HeapTuple tup; + /* Check that pg_class data is consistent now, rather than failing obscurely later */ + struct { Oid oid; int natts; } + checknatts[] = { + {RelationRelationId, Natts_pg_class,}, + {TypeRelationId, Natts_pg_type,}, + {AttributeRelationId, Natts_pg_attribute,}, + {ProcedureRelationId, Natts_pg_proc,}, + }; + + for (int i=0; irelnatts); + ReleaseSysCache(tuple); + } + if (strlen(relname) >= NAMEDATALEN) relname[NAMEDATALEN - 1] = '\0'; -- 2.7.4 >From ed886f8202486dea8069b719d35a5d0db7f3277c Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 6 Feb 2020 12:56:34 -0600 Subject: [PATCH v1 3/3] Make cluster a property of table in 
pg_index.. ..rather than of indexes in pg_index. The only issue with this is that it makes pg_class larger, and the new column applies not only to tables, but to indices. --- doc/src/sgml/catalogs.sgml | 14 +-- src/backend/catalog/heap.c | 1 + src/backend/catalog/index.c| 6 -- src/backend/commands/cluster.c | 172 + src/backend/commands/tablecmds.c | 5 +- src/backend/utils/cache/relcache.c | 1 - src/bin/psql/describe.c| 4 +- src/include/catalog/pg_class.dat | 2 +- src/include/catalog/pg_class.h | 3 + src/include/catalog/pg_index.h | 1 - 10 files changed, 93 insertions(+), 116 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index a10b665..8efeaff 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -1752,6 +1752,13 @@ SCRAM-SHA-256$:&l + relclustered + oid + + The OID of the index last clustered, or zero + + + relpages int4 @@ -3808,13 +3815,6 @@ SCRAM-SHA-256$ :&l - indisclustered - bool - - If true, the table was last clustered on this index - - - indisvalid bool diff --git a/src/back
Re: Is custom MemoryContext prohibited?
On Wed, Feb 5, 2020 at 10:23 PM Andres Freund wrote: > I still don't get what reason there is to not use T_MemoryContext as the > magic number, instead of something randomly new. It's really not > problematic to expose those numerical values. And it means that the > first bit of visual inspection is going to be the same as it always has > been, and the same as it works for most other types one regularly > inspects in postgres. Without trying to say that my thought process is necessarily correct, I can explain my thought process. Many years ago, I had Tom slap down a patch of mine for making something participate in the Node system unnecessarily. He pointed out, then and at other times, that there was no reason for everything to be part of the giant enum just because many things need to be. At the time, I was a bit perplexed by his feedback, but over time I've come to see the point. We've got lots of "enum" fields all over the backend whose purpose it is to decide whether a particular object is of one sub-type or another. We've also got NodeTag, which is that same thing at a very large scale. I used to think that the reason why we had jammed everything into NodeTag was just programming laziness or some kind of desire for consistency, but now I think that the real point is that making something a node allows it to participate in readfuncs.c, outfuncs.c, copyfuncs.c, equalfuncs.c; so that a complex data structure made entirely of nodes can be easily copied, serialized, etc. The MemoryContext stuff participates in none of that machinery, and it would be difficult to make it do so, and therefore does not really need to be part of the Node system at all. The fact that it *is* a part of the node system is just a historical accident, or so I think. Sure, it's not an inconvenient thing to see a NodeTag on random stuff that you're inspecting with a debugger, but if we took that argument to its logical conclusion we would, I think, end up needing to add node tags to a lot of stuff that doesn't have them now - like TupleTableSlots, for example. Also, as a general rule, every Node of a given type is expected to have the same structure, which wouldn't be true here, because there are multiple types of memory contexts that can exist, and T_MemoryContext would identify them all. It's true that there are some other weird exceptions, but it doesn't seem like a great idea to create more. Between those concerns, and those I wrote about in my last post, it seemed to me that it made more sense to try to break the dependency between palloc.h and nodes.h rather than to embrace it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
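For concreteness, the "magic number at the beginning of every type of node in lieu of an enum" alternative weighed above could look something like the following (purely a hypothetical sketch, not anything in the tree or in the patches under discussion):

#include <stdint.h>

/* arbitrary but recognizable constant ("MCTX" in ASCII) -- illustrative only */
#define DEMO_MEMORY_CONTEXT_MAGIC 0x4D435458u

typedef struct DemoMemoryContextData
{
    uint32      magic;          /* always DEMO_MEMORY_CONTEXT_MAGIC */
    const char *name;
    /* allocator-specific fields would follow */
} DemoMemoryContextData;

typedef DemoMemoryContextData *DemoMemoryContext;

/* what palloc-style code would assert instead of a NodeTag-based IsA() check */
static inline int
DemoMemoryContextIsValid(DemoMemoryContext cxt)
{
    return cxt != NULL && cxt->magic == DEMO_MEMORY_CONTEXT_MAGIC;
}

The trade-off argued over in this thread is that such a constant identifies memory contexts without pulling in nodes.h, while a debugger no longer shows the familiar NodeTag value for them.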
Re: In PG12, query with float calculations is slower than PG11
> Fwiw, also tried the patch that Kuroda-san had posted yesterday.

I ran the same test case too:

clang version 7.0.0: HEAD 2548.119 ms, with patch 2320.974 ms
clang version 8.0.0: HEAD 2431.766 ms, with patch 2419.439 ms
clang version 9.0.0: HEAD 2477.493 ms, with patch 2365.509 ms
gcc version 7.4.0:   HEAD 2451.261 ms, with patch 2343.393 ms
gcc version 8.3.0:   HEAD 2540.626 ms, with patch 2299.653 ms
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)
On Fri, Feb 7, 2020 at 10:09 PM Fujii Masao wrote: > Pushed! Thanks! Thanks Fujii. -- Tatsuhito Kasahara kasahara.tatsuhito _at_ gmail.com
Re: Is custom MemoryContext prohibited?
On Wed, Feb 5, 2020 at 10:09 PM Andres Freund wrote: > I wasn't advocating for making plannodes.h etc frontend usable. I think > that's a fairly different discussion than making enum NodeTag, > pg_list.h, memutils.h available. I don't see them having access to the > numerical value of node tag for backend structs as something actually > problematic (I'm pretty sure you can do that today already if you really > want to - but why would you?). > > I don't buy that having a separate magic number for various types that > we may want to use both frontend and backend is better than largely just > having one set of such magic type identifiers. To be honest, and I realize this is probably going to blow your mind and/or make you think that I'm completely insane, one concern that I have here is that I have seen multiple people fail to understand that the frontend and backend are, ah, not the same process. And they try to write code in frontend environments that makes no sense whatsoever there. The fact that nodes.h could hypothetically be included in frontend code doesn't really contribute to confusion in this area, but I'm concerned that including it in every file might, because it means that a whole lot of backend-only stuff suddenly becomes visible in any code that anyone writes anywhere. And as things stand that would be the effect of adding #include "utils/palloc.h" to "postgres_fe.h". Perhaps I'm worrying too much. On a broader level, I am not convinced that having one "enum" to rule them all is a good design. If we go that direction, then it means that frontend code that wants to add its own node types (and why shouldn't it want to do that?) would have to have them be visible to the backend and to all other frontend processes. That doesn't seem like a disaster, but I don't think it's great. I also don't really like the fact that we have one central registry of node types that has to be edited to add more node types, because it means that extensions are not free to do so. I know we're some distance from allowing any real extensibility around new node types and perhaps we never will, but on principle a centralized registry sucks, and I'd prefer a more decentralized solution if we could find one that would be acceptable. I don't know what that would be, though. Even though I'm not as trenchant about debuggability as you and Tom, having a magic number at the beginning of every type of node in lieu of an enum would drive me nuts. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
control max length of parameter values logged
Hello

Patch ba79cb5 (for full discussion see [1]) introduces a feature to log bind parameter values on error, which greatly helps to reproduce errors artificially when having only the server log -- thanks to everyone who reviewed and improved it! However, it cuts the values, as some of the reviewers were worried the log could fill up too quickly. This applies both to the new case of logging parameter values and to the existing ones due to log_min_duration_statement or log_statement. This is a backwards-incompatible change, and also ruins the idea of reproducible errors -- sorry Tom, I failed to second this idea [2] in time, before the change was pushed. I personally don't think that we necessarily need to cut the values; we can rely on the users being careful when using this feature -- in the same way we trusted them to use the similarly dangerous log_min_duration_statement and especially log_statement for ages. At least it's better than having no option to disable it. Alvaro's opinion was different [3]. What do you think of adding a parameter to limit the max logged parameter length? See patch attached. Best, Alex [1] https://postgr.es/m/0146a67b-a22a-0519-9082-bc29756b9...@imap.cc [2] https://postgr.es/m/11425.1575927321%40sss.pgh.pa.us [3] https://postgr.es/m/20191209200531.GA19848@alvherre.pgsql diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index c1128f89ec..0f40246c2d 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -6645,6 +6645,28 @@ log_line_prefix = '%m [%p] %q%u@%d/%a ' + + log_parameter_max_length (integer) + + log_parameter_max_length configuration parameter + + + + +Controls whether to log parameters in full or cut them to a certain length. +Bind parameters can appear in the log as a result of +, + or + +settings. +Zero (the default) disables cutting. +If this value is specified without units, it is taken as bytes. +Due to multibyte characters and the ellipsis, truncated values may be slightly shorter. +Only superusers can change this setting. + + + + log_statement (enum) diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 0a6f80963b..acc31899d6 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -1839,13 +1839,22 @@ exec_bind_message(StringInfo input_message) if (knownTextValues == NULL) knownTextValues = palloc0(numParams * sizeof(char *)); - /* - * Note: must copy at least two more full characters - * than BuildParamLogString wants to see; otherwise it - * might fail to include the ellipsis. - */ - knownTextValues[paramno] = - pnstrdup(pstring, 64 + 2 * MAX_MULTIBYTE_CHAR_LEN); + + if (log_parameter_max_length != 0) + { + /* + * Note: must copy at least two more full characters + * than BuildParamLogString wants to see; + * otherwise it might fail to include the ellipsis.
+ */ + knownTextValues[paramno] = +pnstrdup(pstring, + log_parameter_max_length + + 2 * MAX_MULTIBYTE_CHAR_LEN); + } + else + knownTextValues[paramno] = pstrdup(pstring); + MemoryContextSwitchTo(oldcxt); } if (pstring != pbuf.data) @@ -1908,7 +1917,9 @@ exec_bind_message(StringInfo input_message) */ if (log_parameters_on_error) params->paramValuesStr = -BuildParamLogString(params, knownTextValues, 64); +BuildParamLogString(params, + knownTextValues, + log_parameter_max_length); } else params = NULL; @@ -2397,7 +2408,7 @@ errdetail_params(ParamListInfo params) { char *str; - str = BuildParamLogString(params, NULL, 0); + str = BuildParamLogString(params, NULL, log_parameter_max_length); if (str && str[0] != '\0') errdetail("parameters: %s", str); } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 8228e1f390..f27efc6c24 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -543,6 +543,7 @@ int log_min_messages = WARNING; int client_min_messages = NOTICE; int log_min_duration_sample = -1; int log_min_duration_statement = -1; +int log_parameter_max_length = 0; int log_temp_files = -1; double log_statement_sample_rate = 1.0; double log_xact_sample_rate = 0; @@ -2834,6 +2835,17 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"log_parameter_max_length", PGC_SUSET, LOGGING_WHAT, + gettext_noop("When logging a parameter value, only log first N bytes."), + gettext_noop("Zero to print values in full."), + GUC_UNIT_BYTE + }, + &log_parameter_max_length, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + { {"bgwriter_delay", PGC_SIGHUP, RESOURCES_BGWRITER, gettext_noop("Background writer sleep time between rounds."), diff -
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)
On 2020/02/07 17:28, Kasahara Tatsuhito wrote:
> Hi,
>
> On Fri, Feb 7, 2020 at 5:02 PM Fujii Masao wrote:
>> BTW, commit 147e3722f7 causing the issue changed currtid_byreloid()
>> and currtid_byrelname() so that they also call table_beginscan().
>> I'm not sure what those functions are, but probably we should fix them
>> so that table_beginscan_tid() is called instead. Thought?
>
> +1, sorry, I overlooked it.
> Both functions are used to check whether a tid is valid or not for a
> given relation name (or OID), and both perform a tid scan internally.
> So these functions should call table_beginscan_tid().
> Perhaps it's unnecessary, but I will attach a patch.

Pushed! Thanks!

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group Research and Development Headquarters
Re: [HACKERS] advanced partition matching algorithm for partition-wise join
On Thu, Feb 6, 2020 at 3:55 AM Mark Dilger wrote: > The patches apply and pass all tests. A review of the patch vs. master looks > reasonable. Thanks for the review! > The partition_join.sql test has multiple levels of partitioning, but when > your patch extends that test with “advanced partition-wise join”, none of the > tables for the new section have multiple levels. I spent a little while > reviewing the code and inventing multiple level partitioning tests for > advanced partition-wise join and did not encounter any problems. I don’t > care whether you use this particular example, but do you want to have > multiple level partitioning in the new test section? Yes, I do. > CREATE TABLE alpha (a double precision, b double precision) PARTITION BY > RANGE (a); > CREATE TABLE alpha_neg PARTITION OF alpha FOR VALUES FROM ('-Infinity') TO > (0) PARTITION BY RANGE (b); > CREATE TABLE alpha_pos PARTITION OF alpha FOR VALUES FROM (0) TO ('Infinity') > PARTITION BY RANGE (b); > CREATE TABLE alpha_nan PARTITION OF alpha FOR VALUES FROM ('Infinity') TO > ('NaN'); > CREATE TABLE alpha_neg_neg PARTITION OF alpha_neg FOR VALUES FROM > ('-Infinity') TO (0); > CREATE TABLE alpha_neg_pos PARTITION OF alpha_neg FOR VALUES FROM (0) TO > ('Infinity'); > CREATE TABLE alpha_neg_nan PARTITION OF alpha_neg FOR VALUES FROM > ('Infinity') TO ('NaN'); > CREATE TABLE alpha_pos_neg PARTITION OF alpha_pos FOR VALUES FROM > ('-Infinity') TO (0); > CREATE TABLE alpha_pos_pos PARTITION OF alpha_pos FOR VALUES FROM (0) TO > ('Infinity'); > CREATE TABLE alpha_pos_nan PARTITION OF alpha_pos FOR VALUES FROM > ('Infinity') TO ('NaN'); > INSERT INTO alpha (a, b) > (SELECT * FROM > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), > ('Infinity'::float8)) a, > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), > ('Infinity'::float8)) b > ); > ANALYZE alpha; > ANALYZE alpha_neg; > ANALYZE alpha_pos; > ANALYZE alpha_nan; > ANALYZE alpha_neg_neg; > ANALYZE alpha_neg_pos; > ANALYZE alpha_neg_nan; > ANALYZE alpha_pos_neg; > ANALYZE alpha_pos_pos; > ANALYZE alpha_pos_nan; > CREATE TABLE beta (a double precision, b double precision) PARTITION BY RANGE > (a, b); > CREATE TABLE beta_lo PARTITION OF beta FOR VALUES FROM (-5, -5) TO (0, 0); > CREATE TABLE beta_me PARTITION OF beta FOR VALUES FROM (0, 0) TO (0, 5); > CREATE TABLE beta_hi PARTITION OF beta FOR VALUES FROM (0, 5) TO (5, 5); > INSERT INTO beta (a, b) > (SELECT * FROM > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) a, > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) b > ); > ANALYZE beta; > ANALYZE beta_lo; > ANALYZE beta_me; > ANALYZE beta_hi; > EXPLAIN SELECT * FROM alpha INNER JOIN beta ON (alpha.a = beta.a AND alpha.b > = beta.b) WHERE alpha.a = 1 AND beta.b = 1; > QUERY PLAN > --- > Nested Loop (cost=0.00..2.11 rows=1 width=32) >-> Seq Scan on alpha_pos_pos alpha (cost=0.00..1.06 rows=1 width=16) > Filter: ((b = '1'::double precision) AND (a = '1'::double precision)) >-> Seq Scan on beta_hi beta (cost=0.00..1.04 rows=1 width=16) > Filter: ((b = '1'::double precision) AND (a = '1'::double precision)) > (5 rows) Hmm, I'm not sure this is a good test case for that, because this result would be due to partition pruning applied to each side of the join before considering partition-wise join; you could get the same result even with enable_partitionwise_join=off. I think it's important that the partition-wise join logic doesn't break this query, though. Best regards, Etsuro Fujita
Re: [HACKERS] advanced partition matching algorithm for partition-wise join
On Thu, Feb 6, 2020 at 2:03 AM Mark Dilger wrote: > > On Feb 5, 2020, at 4:51 AM, Etsuro Fujita wrote: > >> On Tue, Jan 28, 2020 at 1:39 PM Mark Dilger > >> wrote: > >>> I have added tests checking correctness and showing some partition > >>> pruning limitations. Find three patches, attached. > >>> > >>> The v31-0001-… patch merely applies your patches as a starting point for > >>> the next two patches. It is your work, not mine. > >>> > >>> The v31-0002-… patch adds more regression tests for range partitioning. > >>> The commit message contains my comments about that. > >>> > >>> The v31-0003-… patch adds more regression tests for list partitioning, > >>> and again, the commit message contains my comments about that. > > I tested the new test patches. The patches are applied to the > > partition_join.sql regression test cleanly, but the new tests failed > > in my environment (even with make check LANG=C). I think I should set > > the right locale for the testing. Which one did you use? > > I did not set a locale in the tests. My environment has: > > LANG="en_US.UTF-8" > LC_COLLATE="en_US.UTF-8" > LC_CTYPE="en_US.UTF-8" > LC_MESSAGES="en_US.UTF-8" > LC_MONETARY="en_US.UTF-8" > LC_NUMERIC="en_US.UTF-8" > LC_TIME="en_US.UTF-8" > LC_ALL= Thanks for the information! > > Maybe I'm > > missing something, but let me comment on the new tests. This is the > > one proposed in the v31-0002 patch: > > > > +EXPLAIN (COSTS OFF) > > +SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) > > WHERE t1.a IN ('äbç', 'ὀδυσσεύς'); > > +QUERY PLAN > > +-- > > + Hash Join > > + Hash Cond: (t2.a = t1.a) > > + -> Append > > + -> Seq Scan on beta_a t2_1 > > + -> Seq Scan on beta_b t2_2 > > + -> Seq Scan on beta_c t2_3 > > + -> Seq Scan on beta_d t2_4 > > + -> Seq Scan on beta_e t2_5 > > + -> Seq Scan on beta_f t2_6 > > + -> Seq Scan on beta_g t2_7 > > + -> Seq Scan on beta_h t2_8 > > + -> Seq Scan on beta_default t2_9 > > + -> Hash > > + -> Append > > + -> Seq Scan on alpha_e t1_1 > > + Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[])) > > + -> Seq Scan on alpha_default t1_2 > > + Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[])) > > +(18 rows) > > > > The commit message says: > > > >When joining with > > > >alpha.a = beta.a and alpha.a IN ('äbç', 'ὀδυσσεύς') > > > >the planner does the right thing for one side of the query, but > >hits all partitions for the other side, which it doesn't need to > >do. > > > > Yeah, I agree that the resulting plan is not efficient. The reason > > for that would be that the planner doesn't generate a qual on the > > outer side matching the ScalarArrayOpExpr qual "a = ANY > > ('{äbç,ὀδυσσεύς}'::text[])" on the inner side, which I think would be > > a restriction caused by the equivalence machinery not by the > > partitionwise join logic IIUC. > > It’s fine if this is beyond the scope of the patch. > > > I think the critique would be useful, > > so I don't object to adding this test case, but the critique would be > > more about query planning that is actually not related to the > > partitionwise join logic, so I'm not sure that the partition_join.sql > > regression test is the best place to add it. I guess that there would > > be much better places than partition_join.sql. > > You don’t need to add the test anywhere. It’s enough for me that you looked > at it and considered whether the partition-wise join patch should do anything > differently in this case. Again, it sounds like this is beyond the scope of > the patch. 
OK > > (This is nitpicking; > > but another thing I noticed about this test case is that the join > > query contains only a single join condition "t1.a = t2.a", but the > > queried tables alpha and beta are range-partitioned by multiple > > columns a and b, so the query should have a join condition for each of > > the columns like "t1.a = t2.a AND t1.b = t2.b" if adding this as a > > test case for partitionwise join.) > > Well, it is important that partition-wise join does not break such queries. > I added the column ‘b’ to the partitioning logic to verify that did not > confuse the code in your patch. OK, thanks for the testing! Best regards, Etsuro Fujita
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Tue, Feb 4, 2020 at 2:40 PM Amit Kapila wrote: > > I don't think we can just back-patch that part of code as it is linked > to the way we are maintaining a cache (~8MB) for frequently allocated > objects. See the comments around the definition of > max_cached_tuplebufs. But probably, we can do something once we reach > such a limit, basically, once we know that we have already allocated > max_cached_tuplebufs number of tuples of size MaxHeapTupleSize, we > don't need to allocate more of that size. Does this make sense? > Yeah, this makes sense. I've attached a patch that implements the same. It solves the problem reported earlier. This solution will at least slow down the process of going OOM even for very small sized tuples. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com 0001-Restrict-memory-allocation-in-reorderbuffer-context.patch Description: Binary data
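[Editor's illustration] For readers without the patch at hand, the idea described above presumably amounts to a guard along the following lines inside ReorderBufferGetTupleBuf(); this is a sketch of the approach, not the attached patch itself:

/*
 * Only round small allocations up to MaxHeapTupleSize while the cache of
 * reusable tuple buffers can still grow; once nr_cached_tuplebufs has
 * reached max_cached_tuplebufs, allocate just what the tuple needs so
 * that small tuples no longer consume ~8 kB each.
 */
alloc_len = tuple_len + SizeofHeapTupleHeader;
if (alloc_len < MaxHeapTupleSize &&
    rb->nr_cached_tuplebufs < max_cached_tuplebufs)
    alloc_len = MaxHeapTupleSize;   /* eligible for caching on release */

tuple = (ReorderBufferTupleBuf *)
    MemoryContextAlloc(rb->context,
                       sizeof(ReorderBufferTupleBuf) + alloc_len);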
Re: Internal key management system
On Fri, 7 Feb 2020 at 11:36, Andres Freund wrote: > > Hi, > > On 2020-02-07 11:18:29 +0900, Masahiko Sawada wrote: > > Another idea we discussed is to internally integrate pgcrypto with the > > key management system. > > Perhaps this has already been discussed (I only briefly looked): I'd > strongly advise against having any new infrastructure depend on > pgcrypto. Its code quality imo is well below our standards and contains > serious red flags like very outdated copies of cryptography algorithm > implementations. I think we should consider deprecating and removing > it, not expanding its use. It certainly shouldn't be involved in any > potential disk encryption system at a later stage. Thank you for the advice. Yeah, I'm not going to use pgcrypto for transparent data encryption. The KMS patch includes the new basic infrastructure for cryptographic functions (mainly AES-CBC). I'm thinking we can expand that infrastructure so that we can also use it for TDE purposes by supporting new cryptographic functions such as AES-CTR. Anyway, I agree to not have it depend on pgcrypto. Regards, -- Masahiko Sawada  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: typedef SegmentNumber
On Fri, Feb 07, 2020 at 01:00:00PM +1300, Thomas Munro wrote: > Hello, > > Here's a rebase of a refactoring patch that got lost behind a filing > cabinet on another thread even though there seemed to be some > agreement that we probably want something like this[1]. It introduces > a new type SegmentNumber, instead of using BlockNumber to represent > segment numbers. +1, and looks good to me!
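[Editor's illustration] For anyone skimming the thread, the core of the refactoring is presumably just a dedicated typedef; the InvalidSegmentNumber sentinel below is an assumption of this sketch, not necessarily part of the patch:

/* A distinct type for segment numbers, instead of overloading BlockNumber. */
typedef uint32 SegmentNumber;

/* hypothetical sentinel, mirroring InvalidBlockNumber */
#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)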
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila wrote: > > On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar wrote: > > > > Fixed in the latest version sent upthread. > > > > Okay, thanks. I haven't looked at the latest version of patch series > as I was reviewing the previous version and I think all of these > comments are in the patch which is not modified. Here are my > comments: > > I think we don't need to maintain > v8-0007-Support-logical_decoding_work_mem-set-from-create as per > discussion in one of the above emails [1] as its usage is not clear. > > v8-0008-Add-support-for-streaming-to-built-in-replication > 1. > - information. The allowed options are slot_name and > - synchronous_commit > + information. The allowed options are slot_name, > + synchronous_commit, work_mem > + and streaming. > > As per the discussion above [1], I don't think we need work_mem here. > You might want to remove the other usage from the patch as well. After putting more thought into this, it appears that there could be some use cases for setting work_mem from the subscription. Assume a case where data are coming from two different origins, and based on the origin ids, different slots might collect different types of changes; so isn't it good to have a different work_mem for different slots? I am not saying that the current way of implementing it is the best one, but we can improve it. First, we need to decide whether we have a use case for this or not. Please let me know your thoughts on the same. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Thu, Feb 06, 2020 at 02:59:09PM -0500, Robert Haas wrote: > On Wed, Feb 5, 2020 at 9:32 AM Julien Rouhaud wrote: > > There's also the possibility to reserve 1 bit of the hash to know if > > this is a utility command or not, although I don't recall right now > > all the possible issues with utility commands and some special > > handling of them. I'll work on it before the next commitfest. > > FWIW, I don't really see why it would be bad to have 0 mean that > "there's no query ID for some reason" without caring whether that's > because the current statement is a utility statement or because > there's no statement in progress at all or whatever else. The user > probably doesn't need our help to distinguish between "no statement" > and "utility statement", right? Sure, but if we don't fix that it means that we also won't expose any queryid for utility statement, even if pg_stat_statements is configured to track those (with a very poor queryid handling, but still). While looking at this again, I realized that pg_stat_statements doesn't compute a queryid during the post parse analysis hook just to make sure that no query identifier will be set during executorStart and the rest of executor functions. AFAICT, that can't happen anyway since pg_plan_queries() will discard any computed queryid for utility statements. This seems to be an oversight due to original pg_stat_statements implementation, so I fixed this. Then, as processUtility is called between parse analysis and executor, I think that we can simply work around this by computing utility statements query identifier during parse analysis, removing it in pgss_ProcessUtility and keeping a copy of it for the pgss_store calls in that function, as done in the attached v5. This fixes everything except EXECUTE statements, which has to get the underlying query's queryid. The problem is that EXECUTE won't get through parse analysis, so while it's correctly handled for execution and pgss_store, it's not being exposed in pg_stat_activity and log_line_prefix. To fix it, I added an extra call to pgstat_report_queryid in executorStart. As this function is a no-op if a queryid is already exposed, this shouldn't cause any harm and fix any other cases of query execution that don't go through parse analysis. Finally, DEALLOCATE is entirely ignored by pg_stat_statements, so those statements will always be reported with a NULL/0 queryid, but this is consistent as it's also not present in pg_stat_statements() SRF. diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 6f82a671ee..2da810ade6 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -112,6 +112,14 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100; #define JUMBLE_SIZE1024/* query serialization buffer size */ +/* + * Utility statements that pgss_ProcessUtility and pgss_post_parse_analyze + * ignores. 
+ */ +#define PGSS_HANDLED_UTILITY(n)(!IsA(n, ExecuteStmt) && \ + !IsA(n, PrepareStmt) && \ + !IsA(n, DeallocateStmt)) + /* * Extension version number, for supporting older extension versions' objects */ @@ -308,7 +316,8 @@ static void pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString, ProcessUtilityContext context, ParamListInfo params, QueryEnvironment *queryEnv, DestReceiver *dest, char *completionTag); -static uint64 pgss_hash_string(const char *str, int len); +static const char *pgss_clean_querytext(const char *query, int *location, int *len); +static uint64 pgss_compute_utility_queryid(const char *query, int query_len); static void pgss_store(const char *query, uint64 queryId, int query_location, int query_len, double total_time, uint64 rows, @@ -792,16 +801,34 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query) return; /* -* Utility statements get queryId zero. We do this even in cases where -* the statement contains an optimizable statement for which a queryId -* could be derived (such as EXPLAIN or DECLARE CURSOR). For such cases, -* runtime control will first go through ProcessUtility and then the -* executor, and we don't want the executor hooks to do anything, since we -* are already measuring the statement's costs at the utility level. +* We compute a queryId now so that it can get exported in out +* PgBackendStatus. pgss_ProcessUtility will later discard it to
Fix comment for max_cached_tuplebufs definition
Hello Hackers, While working on some issue in logical decoding, I found some inconsistencies in the comment for defining max_cached_tuplebufs in reorderbuffer.c. It only exists till PG10 because after that the definition got removed by the generational memory allocator patch. The variable is defined as follows in reorderbuffer.c: static const Size max_cached_tuplebufs = 4096 * 2; /* ~8MB */ And it gets compared with rb->nr_cached_tuplebufs in ReorderBufferReturnTupleBuf as follows: if (tuple->alloc_tuple_size == MaxHeapTupleSize && rb->nr_cached_tuplebufs < max_cached_tuplebufs) { rb->nr_cached_tuplebufs++; } So, what this variable actually tracks is 4096 * 2 times MaxHeapTupleSize amount of memory which is approximately 64MB. I've attached a patch to modify the comment. But, I'm not sure whether the intention was to keep 8MB cache only. In that case, I can come up with another patch. Thoughts? -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com 0001-Fix-comment-for-max_cached_tuplebufs-definition.patch Description: Binary data
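[Editor's illustration] To spell out the arithmetic behind the confusion, here is a sketch of what a corrected comment could say (illustrative wording only; the attached patch may phrase it differently):

/*
 * Cap on the number of cached tuple buffers, not on cached bytes:
 * 4096 * 2 = 8192 buffers, each kept at MaxHeapTupleSize (~8 kB with
 * the default 8 kB block size), i.e. roughly 64 MB in total -- not the
 * ~8 MB the old comment implied.
 */
static const Size max_cached_tuplebufs = 4096 * 2;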
Re: [HACKERS] WIP: Aggregation push-down
Richard Guo wrote: > Hi, > > I've been looking at the 'make_join_rel()' part of the patch, and I'm > aware that, if we are joining A and B, a 'grouped join rel (AB)' would > be created besides the 'plain join rel (AB)', and paths are added by 1) > applying partial aggregate to the paths of the 'plain join rel (AB)', or > 2) joining grouped A to plain B or joining plain A to grouped B. > > This is a smart idea. One issue I can see is that some logic would have > to be repeated several times. For example, the validity check for the > same proposed join would be performed at most three times by > join_is_legal(). > > I'm thinking of another idea that, instead of using a separate > RelOptInfo for the grouped rel, we add in RelOptInfo a > 'grouped_pathlist' for the grouped paths, just like how we implement > parallel query via adding 'partial_pathlist'. > > For base rel, if we decide it can produce grouped paths, we create the > grouped paths by applying partial aggregation to the paths in 'pathlist' > and add them into 'grouped_pathlist'. > > For join rel (AB), we can create the grouped paths for it by: > 1) joining a grouped path from 'A->grouped_pathlist' to a plain path > from 'B->pathlist', or vice versa; > 2) applying partial aggregation to the paths in '(AB)->pathlist'. > > This is basically the same idea, just moves the grouped paths from the > grouped rel to a 'grouped_pathlist'. With it we should not need to make > any code changes to 'make_join_rel()'. Most code changes would happen in > 'add_paths_to_joinrel()'. > > Will this idea work? Is it better or worse? I tried such approach in an earlier version of the patch [1], and I think the reason also was to avoid repeated calls of functions like join_is_legal(). However there were objections against such approach, e.g. [2], and I admit that it was more invasive than what the current patch version does. Perhaps we can cache the result of join_is_legal() that we get for the plain relation, and use it for the group relation. I'll consider that. Thanks. [1] https://www.postgresql.org/message-id/18007.1513957437%40localhost [2] https://www.postgresql.org/message-id/CA%2BTgmob8og%2B9HzMg1vM%2B3LwDm2f_bHUi9%2Bg1bqLDTjqpw5s%2BnQ%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: In PG12, query with float calculations is slower than PG11
Fwiw, also tried the patch that Kuroda-san had posted yesterday. On Fri, Feb 7, 2020 at 5:17 PM Amit Langote wrote: > Latency and profiling results: > > gcc 8 (gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)) > > > 11.6 > > latency average = 463.968 ms > > 40.62% postgres postgres [.] ExecInterpExpr > 9.74% postgres postgres [.] float8_accum > 6.12% postgres libc-2.17.so [.] __isinf > 5.96% postgres postgres [.] float8mul > 5.33% postgres postgres [.] dsqrt > 3.90% postgres postgres [.] ftod > 3.53% postgres postgres [.] Float8GetDatum > 2.34% postgres postgres [.] DatumGetFloat8 > 2.15% postgres postgres [.] AggCheckCallContext > 2.03% postgres postgres [.] slot_deform_tuple > 1.95% postgres libm-2.17.so [.] __sqrt > 1.19% postgres postgres [.] check_float8_array > > HEAD > > latency average = 549.071 ms > > 31.74% postgres postgres [.] ExecInterpExpr > 11.02% postgres libc-2.17.so [.] __isinf > 10.58% postgres postgres [.] float8_accum > 4.84% postgres postgres [.] check_float8_val > 4.66% postgres postgres [.] dsqrt > 3.91% postgres postgres [.] float8mul > 3.56% postgres postgres [.] ftod > 3.26% postgres postgres [.] Float8GetDatum > 2.91% postgres postgres [.] float8_mul > 2.30% postgres postgres [.] DatumGetFloat8 > 2.19% postgres postgres [.] slot_deform_heap_tuple > 1.81% postgres postgres [.] AggCheckCallContext > 1.31% postgres libm-2.17.so [.] __sqrt > 1.25% postgres postgres [.] check_float8_array > > HEAD + patch > > latency average = 546.624 ms > > 33.51% postgres postgres [.] ExecInterpExpr > 10.35% postgres postgres [.] float8_accum > 10.06% postgres libc-2.17.so [.] __isinf > 4.58% postgres postgres [.] dsqrt > 4.14% postgres postgres [.] check_float8_val > 4.03% postgres postgres [.] ftod > 3.54% postgres postgres [.] float8mul > 2.96% postgres postgres [.] Float8GetDatum > 2.38% postgres postgres [.] slot_deform_heap_tuple > 2.23% postgres postgres [.] DatumGetFloat8 > 2.09% postgres postgres [.] float8_mul > 1.88% postgres postgres [.] AggCheckCallContext > 1.65% postgres libm-2.17.so [.] __sqrt > 1.22% postgres postgres [.] check_float8_array HEAD + Kuroda-san's patch (compiled with gcc 8) latency average = 484.604 ms 37.41% postgres postgres[.] ExecInterpExpr 10.83% postgres postgres[.] float8_accum 5.62% postgres postgres[.] dsqrt 4.23% postgres libc-2.17.so[.] __isinf 4.05% postgres postgres[.] float8mul 3.85% postgres postgres[.] ftod 3.18% postgres postgres[.] Float8GetDatum 2.81% postgres postgres[.] slot_deform_heap_tuple 2.63% postgres postgres[.] DatumGetFloat8 2.46% postgres postgres[.] float8_mul 1.91% postgres libm-2.17.so[.] __sqrt > clang-7 (clang version 7.0.1 (tags/RELEASE_701/final)) > = > > 11.6 > > latency average = 419.014 ms > > 47.57% postgres postgres [.] ExecInterpExpr > 7.99% postgres postgres [.] float8_accum > 5.96% postgres postgres [.] dsqrt > 4.88% postgres postgres [.] float8mul > 4.23% postgres postgres [.] ftod > 3.30% postgres postgres [.] slot_deform_tuple > 3.19% postgres postgres [.] DatumGetFloat8 > 1.92% postgres libm-2.17.so [.] __sqrt > 1.72% postgres postgres [.] check_float8_array > > HEAD > > latency average = 452.958 ms > > 40.55% postgres postgres [.] ExecInterpExpr > 10.61% postgres postgres [.] float8_accum > 4.58% postgres postgres [.] dsqrt > 3.59% postgres postgres [.] slot_deform_heap_tuple > 3.54% postgres postgres [.] check_float8_val > 3.48% postgres postgres [.] ftod > 3.42% postgres postgres [.] float8mul > 3.22% postgres postgres [.] DatumGetFloat8 > 2.69% postgres postgres [.] 
Float8GetDatum > 2.46% postgres postgres [.] float8_mul > 2.29% postgres libm-2.17.so [.] __sqrt > 1.47% postgres postgres [.] check_float8_array > > HEAD + patch > > latency average = 452.533 ms > > 41.05% postgres postgres [.] ExecInterpExpr > 10.15%
Re: Assumptions about the number of parallel workers
Andres Freund wrote: > Hi, > > On 2020-02-05 10:50:05 +0100, Antonin Houska wrote: > > I can't figure out why ExecGather/ExecGatherMerge do check whether > > num_workers > > is non-zero. I think the code would be a bit clearer if these tests were > > replaced with Assert() statements, as the attached patch does. > > It's probably related to force_parallel_mode. With that we'll IIRC > generate gather nodes even if num_workers == 0. Those Gather nodes still have non-zero num_workers, see this part of standard_planner: if (force_parallel_mode != FORCE_PARALLEL_OFF && top_plan->parallel_safe) { ... gather->num_workers = 1; gather->single_copy = true; Also, if num_workers were zero for any reason, my patch would probably make the regression tests fail. However, I haven't noticed any assertion failure. -- Antonin Houska Web: https://www.cybertec-postgresql.com
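[Editor's illustration] Roughly, the change being proposed for ExecGather/ExecGatherMerge would look like the following; this is a sketch of the idea, not the actual patch text:

/* current coding: quietly skip launching workers if none were planned */
if (gather->num_workers > 0 && estate->es_use_parallel_mode)
{
    /* ... register and launch parallel workers ... */
}

/* proposed: treat a planned Gather with zero workers as a planner bug */
Assert(gather->num_workers > 0);
if (estate->es_use_parallel_mode)
{
    /* ... register and launch parallel workers ... */
}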
Re: ALTER tbl rewrite loses CLUSTER ON index
On Fri, Feb 7, 2020 at 2:24 AM Alvaro Herrera wrote: > On 2020-Feb-06, Justin Pryzby wrote: > > I wondered if it wouldn't be better if CLUSTER ON was stored in pg_class as > > the > > Oid of a clustered index, rather than a boolean in pg_index. > > Maybe. Do you want to try a patch? +1 Thanks, Amit
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)
At Fri, 7 Feb 2020 17:01:59 +0900, Fujii Masao wrote in > > > On 2020/02/07 15:07, Kasahara Tatsuhito wrote: > > Hi, > > On Fri, Feb 7, 2020 at 1:27 PM Kyotaro Horiguchi > > wrote: > >> It seems that nkeys and key are useless. Since every table_beginscan_* > >> functions have distinct parameter sets, don't we remove them from > >> table_beginscan_tid? > > Yeah, actually, when calling table_beginscan_tid(), nkeys is set to 0 > > and * key is set to NULL, > > so these are not used at the moment. > > I removed unnecessary arguments from table_beginscan_tid(). > > Attache the v5 patch. > > Thanks for updating the patch! The patch looks good to me. > So barring any objection, I will push it and back-patch to v12 *soon* > so that the upcoming minor version can contain it. > > BTW, commit 147e3722f7 causing the issue changed currtid_byreloid() > and currtid_byrelname() so that they also call table_beginscan(). > I'm not sure what those functions are, but probably we should fix them > so that table_beginscan_tid() is called instead. Thought? At least they don't seem to need table_beginscan(), to me. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)
Hi, On Fri, Feb 7, 2020 at 5:02 PM Fujii Masao wrote: > BTW, commit 147e3722f7 causing the issue changed currtid_byreloid() > and currtid_byrelname() so that they also call table_beginscan(). > I'm not sure what those functions are, but probably we should fix them > so that table_beginscan_tid() is called instead. Thought? +1, sorry, I overlooked it. Both functions are used to check whether a tid is valid or not for a given relation name (or OID), and both perform a tid scan internally. So these functions should call table_beginscan_tid(). Perhaps it's unnecessary, but I will attach a patch. Best regards, -- Tatsuhito Kasahara kasahara.tatsuhito _at_ gmail.com fix_tidscan_issues_v6.patch Description: Binary data
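[Editor's illustration] Presumably the attached fix boils down to switching the scan start call in those two functions, something like the sketch below; the v6 patch is authoritative:

/* in currtid_byreloid() / currtid_byrelname(), sketch only */

/* before: starts a plain sequential scan, bumping seq_scan stats */
scan = table_beginscan(rel, snapshot, 0, NULL);

/* after: starts a tid scan, matching what the functions actually do */
scan = table_beginscan_tid(rel, snapshot);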
Re: setLastTid() and currtid()
On Fri, Apr 12, 2019 at 1:44 PM Michael Paquier wrote: > > On Thu, Apr 11, 2019 at 02:06:13PM -0400, Tom Lane wrote: > > Yeah, if we could simplify the tableam API, that would be sufficient > > reason to remove the stuff in v12, IMO. But if it is outside of that > > API, I'd counsel waiting till v13. > > Yes, I agree that simplifying the table AM interface would be a reason > fine enough to delete this code for v12. If not, v13 sounds better at > this stage. Now that we are in the development cycle for v13, is it time to rip these functions out? Regards, -- Fujii Masao
Re: In PG12, query with float calculations is slower than PG11
On Fri, Feb 7, 2020 at 4:54 PM Andres Freund wrote: > On February 6, 2020 11:42:30 PM PST, keisuke kuroda > wrote: > >Hi, > > > >I have been testing with newer compiler (clang-7) > >and result is a bit different at least with clang-7. > >Compiling PG 12.1 (even without patch) with clang-7 > >results in __isinf() no longer being a bottleneck, > >that is, you don't see it in profiler at all. > > I don't think that's necessarily the right conclusion. What's quite possibly > happening is that you do not see the external isinf function anymore, because > it is implemented as an intrinsic, but that there still are more > computations being done. Due to the changed order of the isinf checks. You'd > have to compare with 11 using the same compiler. I did some tests using two relatively recent compilers: gcc 8 and clang-7 and here are the results: Setup: create table realtest (a real, b real, c real, d real, e real); insert into realtest select i, i, i, i, i from generate_series(1, 100) i; Test query: /tmp/query.sql select avg(2*dsqrt(a)), avg(2*dsqrt(b)), avg(2*dsqrt(c)), avg(2*dsqrt(d)), avg(2*dsqrt(e)) from realtest; pgbench -n -T 60 -f /tmp/query.sql Latency and profiling results: gcc 8 (gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)) 11.6 latency average = 463.968 ms 40.62% postgres postgres [.] ExecInterpExpr 9.74% postgres postgres [.] float8_accum 6.12% postgres libc-2.17.so [.] __isinf 5.96% postgres postgres [.] float8mul 5.33% postgres postgres [.] dsqrt 3.90% postgres postgres [.] ftod 3.53% postgres postgres [.] Float8GetDatum 2.34% postgres postgres [.] DatumGetFloat8 2.15% postgres postgres [.] AggCheckCallContext 2.03% postgres postgres [.] slot_deform_tuple 1.95% postgres libm-2.17.so [.] __sqrt 1.19% postgres postgres [.] check_float8_array HEAD latency average = 549.071 ms 31.74% postgres postgres [.] ExecInterpExpr 11.02% postgres libc-2.17.so [.] __isinf 10.58% postgres postgres [.] float8_accum 4.84% postgres postgres [.] check_float8_val 4.66% postgres postgres [.] dsqrt 3.91% postgres postgres [.] float8mul 3.56% postgres postgres [.] ftod 3.26% postgres postgres [.] Float8GetDatum 2.91% postgres postgres [.] float8_mul 2.30% postgres postgres [.] DatumGetFloat8 2.19% postgres postgres [.] slot_deform_heap_tuple 1.81% postgres postgres [.] AggCheckCallContext 1.31% postgres libm-2.17.so [.] __sqrt 1.25% postgres postgres [.] check_float8_array HEAD + patch latency average = 546.624 ms 33.51% postgres postgres [.] ExecInterpExpr 10.35% postgres postgres [.] float8_accum 10.06% postgres libc-2.17.so [.] __isinf 4.58% postgres postgres [.] dsqrt 4.14% postgres postgres [.] check_float8_val 4.03% postgres postgres [.] ftod 3.54% postgres postgres [.] float8mul 2.96% postgres postgres [.] Float8GetDatum 2.38% postgres postgres [.] slot_deform_heap_tuple 2.23% postgres postgres [.] DatumGetFloat8 2.09% postgres postgres [.] float8_mul 1.88% postgres postgres [.] AggCheckCallContext 1.65% postgres libm-2.17.so [.] __sqrt 1.22% postgres postgres [.] check_float8_array clang-7 (clang version 7.0.1 (tags/RELEASE_701/final)) = 11.6 latency average = 419.014 ms 47.57% postgres postgres [.] ExecInterpExpr 7.99% postgres postgres [.] float8_accum 5.96% postgres postgres [.] dsqrt 4.88% postgres postgres [.] float8mul 4.23% postgres postgres [.] ftod 3.30% postgres postgres [.] slot_deform_tuple 3.19% postgres postgres [.] DatumGetFloat8 1.92% postgres libm-2.17.so [.] __sqrt 1.72% postgres postgres [.] check_float8_array HEAD latency average = 452.958 ms 40.55% postgres postgres [.] 
ExecInterpExpr 10.61% postgres postgres [.] float8_accum 4.58% postgres postgres [.] dsqrt 3.59% postgres postgres [.] slot_deform_heap_tuple 3.54% postgres postgres [.] check_float8_val 3.48% postgres postgres [.] ftod 3.42% postgres postgres [.] float8mul 3.22% postgres postgres [.] DatumGetFloat8 2.69% postgres postgres [.] Float8GetDatum 2.46% postgres postgres [.] float8_mul 2.29% postgres libm-2.17.so [.] __sqrt 1.4
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)
On 2020/02/07 15:07, Kasahara Tatsuhito wrote: Hi, On Fri, Feb 7, 2020 at 1:27 PM Kyotaro Horiguchi wrote: It seems that nkeys and key are useless. Since every table_beginscan_* functions have distinct parameter sets, don't we remove them from table_beginscan_tid? Yeah, actually, when calling table_beginscan_tid(), nkeys is set to 0 and * key is set to NULL, so these are not used at the moment. I removed unnecessary arguments from table_beginscan_tid(). Attache the v5 patch. Thanks for updating the patch! The patch looks good to me. So barring any objection, I will push it and back-patch to v12 *soon* so that the upcoming minor version can contain it. BTW, commit 147e3722f7 causing the issue changed currtid_byreloid() and currtid_byrelname() so that they also call table_beginscan(). I'm not sure what those functions are, but probably we should fix them so that table_beginscan_tid() is called instead. Thought? Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters