Re: [HACKERS] Refactor handling of database attributes between pg_dump and pg_dumpall
On Fri, Oct 6, 2017 at 12:29 AM, Robert Haas wrote:
> On Wed, Oct 4, 2017 at 3:40 AM, Haribabu Kommi wrote:
> > There are some differences in handling database objects
> > between pg_dump and pg_dumpall. To retain both pg_dump
> > and pg_dumpall behavior even after refactoring, this option
> > is added. Currently this option is used mainly for three
> > purposes.
> >
> > 1. Don't print unnecessary CREATE DATABASE options like
> > ENCODING, LC_COLLATE and LC_CTYPE if the
> > default encoding is the same as the above values.
> >
> > The above behavior is as per pg_dumpall, but it can be
> > changed to print them irrespective of the default encoding.
> >
> > 2. Do not dump the postgres and template0 databases.
> >
> > 3. Set default_transaction_read_only = off.
>
> To me it seems that a refactoring which requires pg_dump to behave
> differently in small ways like this based on whether it is being
> called by pg_dumpall or not is probably not a good refactoring. And I
> don't see why the proposal from Tom that started this thread would
> require such a thing to be true.

Before the refactoring, pg_dumpall doesn't print "CREATE DATABASE" commands for either template1 or the postgres database, whereas pg_dump dumps the CREATE DATABASE commands when the --create option is given. To keep the behavior of all the database attributes the same in the dumps produced by both pg_dump and pg_dumpall, the code is unified and moved into pg_dump. But to retain the pg_dumpall behavior of not dumping the "CREATE DATABASE" commands, a new option is added to pg_dump that skips dumping them. The option is now named "--skip-create-default-db"; a normal user can also use it when dumping the postgres database to keep CREATE DATABASE commands out of the dump.

> From your list, I would say that (1) and (3) seem like behaviors that
> we either want or do not want. Whether pg_dump or pg_dumpall is
> involved seems irrelevant.
> (2) seems like it might need some special
> handling, but that could be handled in pg_dumpall by just not calling
> pg_dump at all for those databases.

I didn't find any better way other than creating a new option to keep the default databases' CREATE DATABASE commands out of the dump, so I renamed the old option to a better name and changed the behavior so that normal users can use it as well.

Updated patch attached.

Regards,
Hari Babu
Fujitsu Australia

pg_dump-and-pg_dumpall-database-handling-refactoring_v8.patch
Description: Binary data

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A design for amcheck heapam verification
On Thu, Oct 5, 2017 at 7:00 PM, Peter Geoghegan wrote:
> v3 of the patch series, attached, does it that way -- it adds a
> bloom_create(). The new bloom_create() function still allocates its
> own memory, but does so while using a FLEXIBLE_ARRAY_MEMBER. A
> separate bloom_init() function (that works with dynamic shared memory)
> could easily be added later, for the benefit of parallel hash join.

Since Peter E's work on making the documentation sgml files more XML-like has broken the v3 patch doc build, attached is v4, which fixes this bit rot. It also has a few small tweaks here and there to the docs. Nothing worth noting specifically, really -- I just don't like to leave my patches with bit rot for long. (Hat-tip to Thomas Munro for making this easy to detect with his new CF continuous integration tooling.)

I should point out that I shipped virtually the same code yesterday, as v1.1 of the Github version of amcheck (also known as amcheck_next). Early adopters will be able to use this new "heapallindexed" functionality in the next few days, once packages become available for the apt and yum community repos. Just as before, the Github version will work on versions of Postgres >= 9.4.

This seems like good timing on my part, because we know that this new "heapallindexed" verification will detect the "freeze the dead" bugs that the next point release is set to have fixes for -- that is actually kind of how one of the bugs was found [1]. We may even want to advertise the availability of this check within amcheck_next, in the release notes for the next Postgres point release.

[1] https://www.postgresql.org/message-id/cah2-wznm4rcrhfaiwkpwtpew2bxdtgrozk7jwwgucxeh3d1...@mail.gmail.com

--
Peter Geoghegan

From 7906c7391a9f52d334c2cbc7d3e245ff014629f2 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH 2/2] Add amcheck verification of indexes against heap.
Add a new, optional capability to bt_index_check() and
bt_index_parent_check(): callers can check that each heap tuple that
ought to have an index entry does in fact have one. This happens at the
end of the existing verification checks.

This is implemented by using a Bloom filter data structure. The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX). The
Bloom filter is populated during the initial index verification scan.
---
 contrib/amcheck/Makefile                 |   2 +-
 contrib/amcheck/amcheck--1.0--1.1.sql    |  28 +++
 contrib/amcheck/amcheck.control          |   2 +-
 contrib/amcheck/expected/check_btree.out |  14 +-
 contrib/amcheck/sql/check_btree.sql      |   9 +-
 contrib/amcheck/verify_nbtree.c          | 298 ---
 doc/src/sgml/amcheck.sgml                | 173 ++
 src/include/utils/snapmgr.h              |   2 +-
 8 files changed, 454 insertions(+), 74 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big = amcheck
 OBJS = verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree

diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 000..e6cca0a
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file.
\quit
+
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+    heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+DROP FUNCTION bt_index_parent_check(regclass);
+CREATE FUNCTION bt_index_parent_check(index regclass,
+    heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want these to be available to public
+REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC;
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC;

diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 05e2861..4690484 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
 # amcheck extension
 comment = 'functions for verifying relation integrity'
-default_version = '1.0'
+default_version = '1.1'
 module_pathname = '$libdir/amcheck'
Re: [HACKERS] A GUC to prevent leader processes from running subplans?
On Tue, Oct 17, 2017 at 7:27 AM, Thomas Munro wrote:
> While testing parallelism work I've wanted to be able to prevent
> gather nodes from running the plan in the leader process, and I've
> heard others say the same. One way would be to add a GUC
> "multiplex_gather", like in the attached patch. If you set it to off,
> Gather and Gather Merge won't run the subplan unless they have to
> because no workers could be launched. I thought about adding a new
> value for force_parallel_mode instead, but someone mentioned they
> might want to do this on a production system too and
> force_parallel_mode is not really for end users. Better ideas?

I don't think overloading force_parallel_mode is a good idea, but having some other GUC for this seems OK to me. Not sure I like the name multiplex_gather, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[HACKERS] Useless(?) asymmetry in parse_func.c
While running down loose ends in my domains-over-composite patch, I wondered why parse_func.c's FuncNameAsType() excludes composite types as possible type names. I could set up the patch to treat composite domains the same way, or not, but it wasn't obvious what to do because the point of the exclusion wasn't clear. Some excavation in our git history shows that the exclusion was added here:

commit 990eb8552e69a492840d46b58ceb630a8a295e54
Author: Tom Lane
Date: Wed Dec 12 03:28:49 2001 +

    Don't accept names of complex types (ie, relation types) as being
    requests for implicit trivial coercions. Prevents sillinesses like
    this one:

    regression=# select x.int8_tbl.q1 from int8_tbl x;
    ERROR: fmgr_info: function 270997776: cache lookup failed

I could not find any contemporaneous discussion in the list archives, so I'm guessing that I tripped over this corner case and did not think it was worth spending any great amount of effort on. But in hindsight the fix sure looks like a band-aid that's covering up a deeper problem. Whatever that problem was, it's gone now --- if I dike out the check then the parser correctly concludes that coercing int8_tbl to int8_tbl is a no-op. You need more parens nowadays, but either of these work:

    select (x.int8_tbl).q1 from int8_tbl x;
    select (int8_tbl(x)).q1 from int8_tbl x;

There might still be an argument for rejecting the case on the grounds that it's confusing or likely to be user error, but I'm not sure. It seems weirdly asymmetric that we allow typename(x) to be a coercion request unless typename is a composite type. And that exception is not documented in any of the places that talk about this syntax, e.g. section 4.2.9, section 10.3 rule 3, or the CREATE CAST man page.

So I'm inclined to remove the exception for composite types, effectively reverting 990eb8552. Alternatively, if we keep it, the reasoning for having it needs to be documented.

Thoughts?
regards, tom lane
Re: [HACKERS] pgbench - allow to store select results into variables
> Here is a v12.

Here is a v13, which is just a rebase after the documentation xml-ization.

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index e509e6c..44e8896 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -818,6 +818,51 @@ pgbench options d
+
+ \cset [prefix] or
+ \gset [prefix]
+
+ These commands may be used to end SQL queries, replacing a semicolon.
+ \cset replaces an embedded semicolon (\;) within
+ a compound SQL command, and \gset replaces a final
+ (;) semicolon which ends the SQL command.
+
+ When these commands are used, the preceding SQL query is expected to
+ return one row, the columns of which are stored into variables named after
+ column names, and prefixed with prefix if provided.
+
+ The following example puts the final account balance from the first query
+ into variable abalance, and fills variables
+ one, two and
+ p_three with integers from a compound query.
+
+UPDATE pgbench_accounts
+  SET abalance = abalance + :delta
+  WHERE aid = :aid
+  RETURNING abalance \gset
+-- compound of two queries
+SELECT 1 AS one, 2 AS two \cset
+SELECT 3 AS three \gset p_
+
+ \cset and \gset commands do not work when
+ empty SQL queries appear within a compound SQL command.
+
 \set varname expression

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 5d8a01c..37ed07b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -375,11 +375,14 @@ static const char *QUERYMODE[] = {"simple", "extended", "prepared"};
 
 typedef struct
 {
-	char	   *line;			/* text of command line */
+	char	   *line;			/* first line for short display */
+	char	   *lines;			/* full multi-line text of command */
 	int			command_num;	/* unique index of this Command struct */
 	int			type;			/* command type (SQL_COMMAND or META_COMMAND) */
 	int			argc;			/* number of command words */
 	char	   *argv[MAX_ARGS];	/* command word list */
+	int			compound;		/* last compound command (number of \;) */
+	char	  **gset;			/* per-compound command prefix */
 	PgBenchExpr *expr;			/* parsed expression, if needed */
 	SimpleStats stats;			/* time spent in this command */
 } Command;

@@ -1254,6 +1257,104 @@ getQueryParams(CState *st, const Command *command, const char **params)
 		params[i] = getVariable(st, command->argv[i + 1]);
 }
 
+/* read all responses from backend */
+static bool
+read_response(CState *st, char **gset)
+{
+	PGresult   *res;
+	int			compound = 0;
+
+	while ((res = PQgetResult(st->con)) != NULL)
+	{
+		switch (PQresultStatus(res))
+		{
+			case PGRES_COMMAND_OK:	/* non-SELECT commands */
+			case PGRES_EMPTY_QUERY:	/* may be used for testing no-op overhead */
+				if (gset[compound] != NULL)
+				{
+					fprintf(stderr,
+							"client %d file %d command %d compound %d: "
+							"\\gset expects a row\n",
+							st->id, st->use_file, st->command, compound);
+					st->ecnt++;
+					return false;
+				}
+				break;			/* OK */
+
+			case PGRES_TUPLES_OK:
+				if (gset[compound] != NULL)
+				{
+					/* store result into variables */
+					int			ntuples = PQntuples(res),
+								nfields = PQnfields(res),
+								f;
+
+					if (ntuples != 1)
+					{
+						fprintf(stderr,
+								"client %d file %d command %d compound %d: "
+								"expecting one row, got %d\n",
+								st->id, st->use_file, st->command, compound, ntuples);
+						st->ecnt++;
+						PQclear(res);
+						discard_response(st);
+						return false;
+					}
+
+					for (f = 0; f < nfields; f++)
+					{
+						char	   *varname = PQfname(res, f);
+
+						if (*gset[compound] != '\0')
+							varname = psprintf("%s%s", gset[compound], varname);
+
+						/* store result as a string */
+						if (!putVariable(st, "gset", varname,
+										 PQgetvalue(res, 0, f)))
+						{
+							/* internal error, should it rather abort? */
+							fprintf(stderr,
+									"client %d file %d command %d compound %d: "
+									"error storing into var %s\n",
+									st->id, st->use_file, st->command, compound,
+									varname);
+							st->ecnt++;
+							PQclear(res);
+							discard_response(st);
+							return false;
+						}
+
+						if (*gset[compound] != '\0')
+							free(varname);
+					}
+				}
+				break;			/* OK */
+
+			default:
+				/* everything else is unexpected, so probably an error */
+				fprintf(stderr,
+						"client %d file %d aborted in command %d compound %d: %s",
+						st->id, st->use_file, st->command, compound,
+						PQerrorMessage(st->con));
+				st->ecnt++;
+				PQclear(res);
+				discard_response(st);
+				return false;
+		}
+
+		PQclear(res);
+		compound += 1;
+	}
+
+	if (compound == 0)
+	{
+		fprintf(stderr, "client %d command %d: no results\n", st->id, st->command);
+		st->ecnt++;
+		return false;
+	}
+
+	return
Re: [HACKERS] pgbench more operators & functions
> Here is a v13. No code changes, but TAP tests added to keep pgbench
> coverage green.

Here is a v14, which is just a rebase after the documentation xml-ization.

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index e509e6c..1f55967 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -827,14 +827,32 @@ pgbench options d
 Sets variable varname to a value calculated from expression.
- The expression may contain integer constants such as 5432,
+ The expression may contain the NULL constant,
+ boolean constants TRUE and FALSE,
+ integer constants such as 5432,
  double constants such as 3.14159,
  references to variables :variablename,
- unary operators (+, -) and binary operators
- (+, -, *, /, %) with their usual precedence and associativity,
- function calls, and parentheses.
+ operators with their usual SQL precedence and associativity,
+ function calls,
+ SQL CASE generic conditional expressions and parentheses.
+
+ Functions and most operators return NULL on NULL input.
+
+ For conditional purposes, non-zero numerical values are
+ TRUE, zero numerical values and NULL are FALSE.
+
+ When no final ELSE clause is provided to a CASE,
+ the default value is NULL.

@@ -843,6 +861,7 @@
 \set ntellers 10 * :scale
 \set aid (1021 * random(1, 10 * :scale)) % (10 * :scale) + 1
+\set divx CASE WHEN :x <> 0 THEN :y/:x ELSE NULL END

@@ -919,6 +938,177 @@ pgbench options d
+
+ Built-In Operators
+
+ The arithmetic, bitwise, comparison and logical operators listed
+ below are built into pgbench and may be used in
+ expressions appearing in \set.
+
+ pgbench Operators by increasing precedence
+
+ Operator | Description | Example | Result
+ OR | logical or | 5 or 0 | TRUE
+ AND | logical and | 3 and 0 | FALSE
+ NOT | logical not | not false | TRUE
+ IS [NOT] (NULL|TRUE|FALSE) | value tests | 1 is null | FALSE
+ ISNULL|NOTNULL | null tests | 1 notnull | TRUE
+ = | is equal | 5 = 4 | FALSE
+ <> | is not equal | 5 <> 4 | TRUE
+ != | is not equal | 5 != 5 | FALSE
+ < | lower than | 5 < 4 | FALSE
+ <= | lower or equal | 5 <= 4 | FALSE
+ > | greater than | 5 > 4 | TRUE
+ >= | greater or equal | 5 >= 4 | TRUE
+ | | integer bitwise OR | 1 | 2 | 3
+ # | integer bitwise XOR | 1 # 3 | 2
+ & | integer bitwise AND | 1 & 3 | 1
+ ~ | integer bitwise NOT | ~ 1 | -2
+ << | integer bitwise shift left | 1 << 2 | 4
+ >> | integer bitwise shift right | 8 >> 2 | 2
+ + | addition | 5 + 4 | 9
+ - | subtraction | 3 - 2.0 | 1.0
+ * | multiplication | 5 * 4 | 20
+ / | division (integer truncates the results) | 5 / 3 | 1
+ % | modulo | 3 % 2 | 1
+ - | opposite | - 2.0 | -2.0

+ Built-In Functions

@@ -965,6 +1155,13 @@ pgbench options d
 5432.0
+ exp(x) | double | exponential | exp(1.0) | 2.718281828459045
+ greatest(a [,
] ) | double if any a is double, else integer | largest value among arguments

@@ -986,6 +1183,20 @@ pgbench options d
 2.1
+ ln(x) | double | natural logarithm | ln(2.718281828459045) | 1.0
+ mod(i, j) | integer | modulo | mod(54, 32) | 22
 pi() | double | value of the constant PI

diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index b3a2d9b..770be98 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -19,13 +19,17 @@ PgBenchExpr *expr_parse_result;
 
 static PgBenchExprList *make_elist(PgBenchExpr *exp, PgBenchExprList *list);
+static PgBenchExpr *make_null_constant(void);
+static PgBenchExpr *make_boolean_constant(bool bval);
 static PgBenchExpr *make_integer_constant(int64 ival);
 static PgBenchExpr *make_double_constant(double
Re: [HACKERS] pgbench: Skipping the creating primary keys after initialization
>> The patch needs a rebase in the documentation because of the
>> xml-ization of the sgml doc.
>
> Thank you for the notification! Attached rebased patch.

Ok. The new version works for me.

--
Fabien.
Re: [HACKERS] [POC] Faster processing at Gather node
On Wed, Oct 18, 2017 at 4:30 PM, Andres Freund wrote:
> Hm. I'm a bit confused/surprised by that. What'd be the worst that can
> happen if we don't immediately detect that the other side is detached?
> At least if we only do so in the non-blocking paths, the worst that
> seems that could happen is that we read/write at most one superfluous
> queue entry, rather than reporting an error? Or is the concern that
> detaching might be detected *too early*, before reading the last entry
> from a queue?

Detaching too early is definitely a problem. A worker is allowed to start up, write all of its results into the queue, and then detach without waiting for the leader to read those results. (Reading messages that weren't really written would be a problem too, of course.)

> I'm not convinced by this. Imo the multiplying largely comes from
> superfluous actions, like the per-entry SetLatch calls here, rather than
> per batch.
>
> After all I'd benchmarked this *after* an experimental conversion of
> shm_mq to not use spinlocks - so there's actually no external barrier
> providing these guarantees that could be combined with SetLatch()'s
> barrier.

OK.

>> All that having been said, a batch variant for reading tuples in bulk
>> might make sense. I think when I originally wrote this code I was
>> thinking that one process might be filling the queue while another
>> process was draining it, and that it might therefore be important to
>> free up space as early as possible. But maybe that's not a very good
>> intuition.
>
> I think that's a sensible goal, but I don't think that has to mean that
> the draining has to happen after every entry. If you'd e.g. have a
> shm_mq_receivev() with 16 iovecs, that'd commonly be only part of a
> single tqueue mq (unless your tuples are > 4k). I'll note that afaict
> shm_mq_sendv() already batches its SetLatch() calls.

But that's a little different -- shm_mq_sendv() sends one message collected from multiple places.
There's no more reason for it to wake up the receiver before the whole message is written than there would be for shm_mq_send(); it's the same problem.

> I think one important thing a batched drain can avoid is that a worker
> awakes to just put one new tuple into the queue and then sleep
> again. That's kinda expensive.

Yes. Or - part of a tuple, which is worse.

>>> b) Use shm_mq_sendv in tqueue.c by batching up insertions into the
>>> queue whenever it's not empty when a tuple is ready.
>>
>> Batching them with what? I hate to postpone sending tuples we've got;
>> that sounds nice in the case where we're sending tons of tuples at
>> high speed, but there might be other cases where it makes the leader
>> wait.
>
> Yea, that'd need some smarts. How about doing something like batching up
> locally as long as the queue contains less than one average sized batch?

I don't follow.

> No, I don't think that's necessarily true. If that worker's queue is full
> every time, then yes. But I think a common scenario is that the
> "current" worker only has a small portion of the queue filled. Draining
> that immediately just leads to increased cacheline bouncing.

Hmm, OK.

> I'd not previously thought about this, but won't staying sticky to the
> current worker potentially cause pins on individual tuples be held for a
> potentially long time by workers not making any progress?

Yes.

>>> It might also be a good idea to use a more directed form of wakeups,
>>> e.g. using a per-worker latch + a wait event set, to avoid iterating
>>> over all workers.
>>
>> I don't follow.
>
> Well, right now we're busily checking each worker's queue. That's fine
> with a handful of workers, but starts to become not that cheap pretty
> soon afterwards. In the surely common case where the workers are the
> bottleneck (because that's when parallelism is worthwhile), we'll check
> each worker's queue once one of them returned a single tuple.
> It'd not be a stupid idea to have a per-worker latch and wait for the
> latches of all workers at once. Then targetedly drain the queues of the
> workers that WaitEventSetWait(nevents = nworkers) signalled as ready.

Hmm, interesting. But we can't completely ignore the process latch either, since among other things a worker erroring out and propagating the error to the leader relies on that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] What is the point of setrefs.c's is_converted_whole_row_reference?
On Thu, Oct 19, 2017 at 2:56 PM, Tom Lane wrote:
> AFAICS, setrefs.c's special treatment of "converted whole row references"
> is completely pointless. Why aren't they just treated by the regular
> "non var" code paths, thus saving code space and cycles?

Here's what one of Ashutosh's commit messages for one of the patches in the stack said:

===
set_join_references() turns off outer side's has_non_vars to handle
expressions involving nullable side. Hence we can not use has_non_vars
to handle ConvertRowtypeExprs. Instead the patch adds
has_conv_whole_rows, which is set when there exist one or more of
ConvertRowtypeExprs described above.
===

I think that's referring to this comment:

 * Now we need to fix up the targetlist and qpqual, which are logically
 * above the join. This means they should not re-use any input expression
 * that was computed in the nullable side of an outer join. Vars and
 * PlaceHolderVars are fine, so we can implement this restriction just by
 * clearing has_non_vars in the indexed_tlist structs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] 64-bit queryId?
On Fri, Oct 20, 2017 at 3:44 PM, Robert Haas wrote:
> On Thu, Oct 19, 2017 at 2:11 AM, Julien Rouhaud wrote:
>> I agree, and I'm perfectly fine with adding a comment around pgssHashKey.
>>
>> PFA a patch to warn about the danger.
>
> Committed a somewhat different version of this - hope you are OK with
> the result.

That's much better than what I proposed. Thanks a lot!
Re: [HACKERS] 64-bit queryId?
On Thu, Oct 19, 2017 at 2:11 AM, Julien Rouhaud wrote:
> I agree, and I'm perfectly fine with adding a comment around pgssHashKey.
>
> PFA a patch to warn about the danger.

Committed a somewhat different version of this - hope you are OK with the result.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] PATCH: enabling parallel execution for cursors explicitly (experimental)
On Tue, Oct 17, 2017 at 12:06 PM, Tomas Vondra wrote:
> In general, it may be acceptable to rely on the elog() checks - which is
> pretty much what the FETCH+INSERT+SHARE example I shared in the first
> message of this thread does. I agree it's not particularly convenient,
> and perhaps we should replace it with checks while planning the queries.

Those checks are very much not comprehensive, though. For example, consider commit e9baa5e9fa147e00a2466ab2c40eb99c8a700824, which allowed heap_insert() in parallel mode. Here's the comment:

 /*
+ * Parallel operations are required to be strictly read-only in a parallel
+ * worker. Parallel inserts are not safe even in the leader in the
+ * general case, because group locking means that heavyweight locks for
+ * relation extension or GIN page locks will not conflict between members
+ * of a lock group, but we don't prohibit that case here because there are
+ * useful special cases that we can safely allow, such as CREATE TABLE AS.
+ */
+	if (IsParallelWorker())
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+				 errmsg("cannot insert tuples in a parallel worker")));

Now, as things stand, there's no way for this code to be reached except in the case where the inserts are targeting a new table, or at least I hope there isn't, because that would be a bug. But if we applied your patch then it could be easily done: just start a parallel cursor and then run an INSERT command. It might take a little work to gin up a test case that shows this really crashing and burning, but I'm very confident that it's possible. I wouldn't have gone to the trouble of installing checks to prevent inserts from running in parallel mode if inserts were safe in parallel mode.

>> Also, I doubt this guarantees that we won't try to call
>> parallel-unsafe functions while parallel mode is active, so things
>> will just blow up in whatever way they do, maybe crashing the server
>> or rendering the database inconsistent or whatever.
>
> Probably.
> What happens right now is that declaring the cursor switches
> the whole transaction into parallel mode (EnterParallelMode), so if you
> do FETCH + INSERT + FETCH that will fail with elog(ERROR).

But, again, only in the limited cases that the elog() checks catch. A C function can be parallel-unsafe without doing anything that hits the elog() checks; there is no way for the system to protect itself against arbitrary C code. The elog() checks are intended to catch the common or likely ways of breaking the world, but arbitrary C code can work around those checks -- if nothing else, duplicate one of the functions that has an elog() in it, remove the elog(), and then call it. You just did something parallel-unsafe, unchecked! That's an artificial example, of course, which is not likely to occur in practice, but I'm pretty confident that there are non-artificial examples also.

I think this is a more or less classic kind of programming problem - you're proposing to take a set of checks that were intended to ensure safety under a limited set of circumstances and generalize them to a much broader context than the one for which they were designed. They will not be adequate to those circumstances, and a considerable amount of analysis will be needed to figure out where the gaps are. If somebody wants to do that analysis, I'm willing to review it and try to spot any holes, but I don't really want to spend the time to go do the whole analysis myself.

The main points I want to make clearly understood are that the current design relies on (1) functions being labeled correctly and (2) other dangerous code paths being unreachable because there's nothing that runs between EnterParallelMode and ExitParallelMode which could invoke them, except by calling a mislabeled function. Your patch expands the vulnerability surface from "executor code that can be reached without calling a mislabeled function" to "any code that can be reached by typing an SQL command".
Just rejecting any queries that are parallel-unsafe probably closes a good chunk of the holes, but that still leaves a lot of code that's never been run in parallel mode before potentially now running in parallel mode - e.g. any DDL command you happen to type, transaction control commands, code that only runs when the server is idle like idle_in_transaction_timeout, cursor operations. A lot of that stuff is probably fine, but it's got to be thought through. Error handling might be a problem, too: what happens if a parallel worker is killed while the query is suspended? I suspect that doesn't work very nicely at all.

One way to get some ideas about where the problem lies would be to write a test patch that keeps parallel mode active at all times except when running a query that contains something parallel-unsafe. Then run the regression tests and see if anything blows up. That's not
Re: [HACKERS] More stats about skipped vacuums
On Tue, Oct 10, 2017 at 7:26 PM, Kyotaro HORIGUCHI wrote:
> Hello.
>
> Once in a while I am asked about table bloat. In most cases the
> cause is long-lasting transactions, and vacuum canceling in some
> cases. Whatever the case, users don't have enough clues to why
> they have bloated tables.
>
> At the top of the annoyances list for users would be that they
> cannot know whether autovacuum decided that a table needs vacuum
> or not. I suppose that it could be shown in pg_stat_*_tables.
>
>   n_mod_since_analyze | 2
> + vacuum_required     | true
>   last_vacuum         | 2017-10-10 17:21:54.380805+09
>
> If vacuum_required remains true for a certain time, it means that
> vacuuming stopped halfway or someone is killing it repeatedly.
> That status could be shown in the same view.

Because the table statistics are updated at the end of the vacuum, I think that autovacuum will process the table in the next cycle if it has stopped halfway or has been killed. So you mean that vacuum_required is for users who want to reclaim garbage without waiting for the autovacuum retry?

>   n_mod_since_analyze    | 2
> + vacuum_required        | true
>   last_vacuum            | 2017-10-10 17:21:54.380805+09
>   last_autovacuum        | 2017-10-10 17:21:54.380805+09
> + last_autovacuum_status | Killed by lock conflict
>
> Where the "Killed by lock conflict" would be one of the following.
>
> - Completed (oldest xmin = 8023)
> - May not be fully truncated (yielded at 1324 of 6447 expected)
> - Truncation skipped
> - Skipped by lock failure
> - Killed by lock conflict
>
> If we want a more formal expression, we can show the values in the
> following shape. And adding some more values could be useful.
>
>   n_mod_since_analyze          | 2
> + vacuum_required              | true
> + last_vacuum_oldest_xid       | 8023
> + last_vacuum_left_to_truncate | 5123
> + last_vacuum_truncated        | 387
>   last_vacuum                  | 2017-10-10 17:21:54.380805+09
>   last_autovacuum              | 2017-10-10 17:21:54.380805+09
> + last_autovacuum_status       | Killed by lock conflict
> ...
> autovacuum_count | 128 > + incomplete_autovacuum_count | 53 > > # The last one might be needless.. I'm not sure that the above information will help users or DBAs, but personally I sometimes want to have the number of index scans of the last autovacuum in the pg_stat_user_tables view. That value indicates how efficiently the vacuum performed and would be a signal for the user to increase the setting of autovacuum_work_mem. > Where the "Killed by lock conflict" is one of the following: > >- Completed >- Truncation skipped >- Partially truncated >- Skipped >- Killed by lock conflict > > This seems enough to find the cause of table bloat. The same > discussion could be applied to analyze, but that might be > another issue. > > There may be a better way to indicate the vacuum soundness. Any > opinions and suggestions are welcome. > > I'm going to make a patch to do the 'formal' one for the time > being. > Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
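[Editor's note: for readers following the proposal, the condition a `vacuum_required` column would expose is essentially autovacuum's existing trigger test. A minimal sketch follows; the defaults mirror the documented `autovacuum_vacuum_threshold` (50) and `autovacuum_vacuum_scale_factor` (0.2), but the function name is illustrative, not PostgreSQL's.]

```python
def vacuum_required(n_dead_tup, reltuples,
                    threshold=50, scale_factor=0.2):
    """Approximate autovacuum's trigger test: a table qualifies for
    vacuuming once its dead tuples exceed
    threshold + scale_factor * reltuples.  Defaults mirror
    autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor."""
    return n_dead_tup > threshold + scale_factor * reltuples
```

With the defaults, a 1,000-tuple table qualifies once it accumulates more than 250 dead tuples.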
Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization
> When we declare a cursor for a select on the mentioned big table, it takes > multiple seconds and a big temp file is created which to me seems like the > materialization took place immediately. Since you mentioned that Postgres already postpones materialization until commit, we checked again and you were right. When we checked manually, we had executed a "declare" statement without opening a transaction block first, which causes instant materialization. When a transaction is opened, it is in fact postponed. Unfortunately we do not use cursors but portals, so we can (re-)use prepared statements in multiple instances, and there is no "hold" feature for portals, so we cannot benefit from the lazy "with hold" of the cursor.
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
Hello, On 2017-10-19 19:46, Andres Freund wrote: On 2017-10-19 14:36:56 +0300, Sokolov Yura wrote: > > + init_local_spin_delay(); > > The way you moved this around has the disadvantage that we now do this - > a number of writes - even in the very common case where the lwlock can > be acquired directly. Excuse me, I don't quite understand. Is your complaint about init_local_spin_delay being placed here? Yes. I could place it before perform_spin_delay under `if (!spin_inited)` if you think it is absolutely necessary. Placing it anywhere else will complicate the code. Or is your complaint about setting `mask` and `add`? That seems right. In both cases, I think the simpler version should be accepted first. It acts as the algorithm's definition. And it already gives measurable improvement. Well, in scalability. I'm less sure about uncontended performance. > > + * We intentionally do not call finish_spin_delay here, because > > the loop > > + * above usually finishes by queuing into the wait list on > > contention, and > > + * doesn't reach spins_per_delay and thereby doesn't sleep inside of > > + * perform_spin_delay. Also, different LWLocks have very different > > + * contention patterns, and it is wrong to update spin-lock > > statistics based > > + * on LWLock contention. > > + */ > > Huh? This seems entirely unconvincing. Without adjusting this here we'll > just spin the same way every iteration. Especially for the case where > somebody else holds LW_FLAG_LOCKED that's quite bad. LWLocks are very different. Some of them are always short-term (BufferLock), others are always locked for a long time. That seems not particularly relevant. The same is true for spinlocks. The relevant question isn't how long the lwlock is held, it's how long LW_FLAG_LOCKED is held - and that should only depend on contention (i.e. bus speed, number of times put to sleep while holding the lock, etc), not on how long the lock is held. 
I've tried to place this delay into the lock itself (it has 2 free bytes), but this attempt performed worse. That seems unsurprising - there's a *lot* of locks, and we'd have to tune all of them. Additionally there's a bunch of platforms where we do *not* have free bytes (consider the embedding in BufferTag). Now I understand that delays should be stored in an array indexed by tranche. But I have no time to test this idea. And I doubt it will give radically better results (i.e. > 5%), so I think it is better to accept the patch in this way, and then experiment with per-tranche delay. I don't think tranches have any decent predictive power here. Look at "Make acquiring LWLock look more like spinlock". For the first `skip_wait_list` iterations there is no attempt to queue self into the wait list. That gives most of the improvement. (Without it there is just no degradation on a high number of clients, but a little decline with few clients, because the current algorithm is also a little `spinny` (it attempts to acquire the lock again after queueing into the wait list).) `skip_wait_list` depends very much on the tranche. And a per-tranche `skip_wait_list` should be calculated based on the ability to acquire the lock without any sleep, i.e. without both queuing itself into the wait list and falling into `pg_usleep` inside `perform_spin_delay`. I suppose it is obvious that `spins_per_delay` has to be proportional to `skip_wait_list` (it should be at least greater than `skip_wait_list`). Therefore `spins_per_delay` should also be runtime-dependent on the lock's tranche. Am I wrong? Without that per-tranche auto-tuning, it is better not to touch `spins_per_delay` inside of `LWLockAttemptLockOrQueue`, I think. With regards, -- Sokolov Yura aka funny_falcon Postgres Professional: https://postgrespro.ru The Russian Postgres Company
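[Editor's note: for readers without s_lock.c at hand, the `spins_per_delay` feedback being debated works roughly as sketched below. This is an illustrative Python rendering, not the actual C implementation; the constants and class shape are assumptions for the sketch.]

```python
import time

MIN_SPINS_PER_DELAY = 10    # illustrative bounds, not the real tuning values
MAX_SPINS_PER_DELAY = 1000

class SpinDelayStatus:
    def __init__(self, spins_per_delay=100):
        self.spins = 0
        self.delays = 0
        self.spins_per_delay = spins_per_delay

    def perform_spin_delay(self):
        """Called on each failed acquisition attempt: spin cheaply until
        the budget is exhausted, then sleep and reset the spin counter."""
        self.spins += 1
        if self.spins >= self.spins_per_delay:
            time.sleep(0.001)       # stands in for pg_usleep
            self.spins = 0
            self.delays += 1

    def finish_spin_delay(self):
        """Feedback on success: if we never slept, spinning is paying off,
        so allow more spins next time; if we slept, spin less."""
        if self.delays == 0:
            self.spins_per_delay = min(self.spins_per_delay + 100,
                                       MAX_SPINS_PER_DELAY)
        else:
            self.spins_per_delay = max(self.spins_per_delay - 1,
                                       MIN_SPINS_PER_DELAY)
```

Skipping `finish_spin_delay`, as the patch does, means this feedback never runs on the LWLock path - which is the objection above, while the reply argues the statistic would be polluted by unrelated contention patterns anyway.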
Re: [HACKERS] asynchronous execution
Hello. A fully-asynchronous executor requires every node to be stateful and suspendable at the point where it requests the next tuples from the nodes underneath. I tried a pure push-based executor but failed. After the miserable patch upthread, I finally managed to make executor nodes suspendable using computational jumps and got rid of recursive calls in the executor. But it runs about 10x slower for the simple SeqScan case. (pgbench ran with 9% degradation.) That doesn't seem recoverable by simple improvements, so I gave up on it. Then I returned to single-level asynchrony, in other words, the simple case with async-aware nodes just above async-capable nodes. The motive for using the framework in the previous patch was that we had degradation on the sync (or normal) code paths from polluting ExecProcNode with async stuff, and, as Tom suggested, the node->ExecProcNode trick can isolate the async code path. The attached PoC patch theoretically has no impact on the normal code paths and only brings gains in async cases. (The additional members in PlanState do cause some degradation, though, seemingly from alignment.) But I haven't had a stable enough result from the performance test. Different builds from the same source code give apparently different results... Anyway, I'll show the best of several runs here.

                           original(ms)  patched(ms)  gain(%)
A: simple table scan          9714.70      9656.73       0.6
B: local partitioning         4119.44      4131.10      -0.3
C: single remote table        9484.86      9141.89       3.7
D: sharding (single con)      7114.34      6751.21       5.1
E: sharding (multi con)       7166.56      1827.93      74.5

A and B are degradation checks, which are expected to show no degradation. C is the gain only from postgres_fdw's command pre-sending on a remote table. D is the gain of sharding on a single connection. The number of partitions/shards is 4. E is the gain using a dedicated connection per shard. 
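[Editor's note: the node->ExecProcNode trick referred to above can be illustrated in miniature. Dispatch goes through a per-node function pointer, so async bookkeeping is paid only by nodes that install a wrapper, and the plain synchronous path is untouched. This is a hedged sketch with hypothetical names; the real executor is C, not Python.]

```python
class PlanState:
    """Toy plan node: exec_proc_node dispatches through a per-node
    callable, mirroring the node->ExecProcNode function pointer."""

    def __init__(self, exec_fn, async_capable=False):
        self._exec = exec_fn
        if async_capable:
            # Only async-capable nodes pay for the wrapper.
            inner = exec_fn
            self._exec = lambda: self._async_wrapper(inner)

    def _async_wrapper(self, inner):
        # Async bookkeeping (request pre-sending, wait-event handling)
        # would go here; the sync path never reaches this code.
        return inner()

    def exec_proc_node(self):
        return self._exec()
```

A synchronous node thus runs its original function with no extra branch in the hot path, which is why the approach "theoretically has no impact on the normal code paths."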
regards, -- Kyotaro Horiguchi NTT Open Source Software Center >From fc424c16e124934581a184fcadaed1e05f7672c8 Mon Sep 17 00:00:00 2001 From: Kyotaro HoriguchiDate: Mon, 22 May 2017 12:42:58 +0900 Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner WaitEventSet needs to be released using resource owner for a certain case. This change adds WaitEventSet reowner and allow the creator of a WaitEventSet to specify a resource owner. --- src/backend/libpq/pqcomm.c| 2 +- src/backend/storage/ipc/latch.c | 18 ++- src/backend/storage/lmgr/condition_variable.c | 2 +- src/backend/utils/resowner/resowner.c | 68 +++ src/include/storage/latch.h | 4 +- src/include/utils/resowner_private.h | 8 6 files changed, 97 insertions(+), 5 deletions(-) diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index 754154b..d459f32 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -220,7 +220,7 @@ pq_init(void) (errmsg("could not set socket to nonblocking mode: %m"))); #endif - FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3); + FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3); AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL, NULL); AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL); diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 4eb6e83..e6fc3dd 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -51,6 +51,7 @@ #include "storage/latch.h" #include "storage/pmsignal.h" #include "storage/shmem.h" +#include "utils/resowner_private.h" /* * Select the fd readiness primitive to use. Normally the "most modern" @@ -77,6 +78,8 @@ struct WaitEventSet int nevents; /* number of registered events */ int nevents_space; /* maximum number of events in this set */ + ResourceOwner resowner; /* Resource owner */ + /* * Array, of nevents_space length, storing the definition of events this * set is waiting for. 
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, int ret = 0; int rc; WaitEvent event; - WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3); + WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3); if (wakeEvents & WL_TIMEOUT) Assert(timeout >= 0); @@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch) * WaitEventSetWait(). */ WaitEventSet * -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) { WaitEventSet *set; char *data; Size sz = 0; + if (res) + ResourceOwnerEnlargeWESs(res); + /* * Use MAXALIGN size/alignment to guarantee that later uses of memory are * aligned correctly. E.g. epoll_event might need 8 byte alignment on some @@ -591,6 +597,11 @@
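[Editor's note: summarizing the patch above without the diff - a WaitEventSet may now register with a resource owner at creation, so that releasing the owner (e.g. at transaction abort) frees any set that was not freed explicitly. A minimal Python sketch of that ownership pattern follows, with hypothetical names standing in for ResourceOwnerEnlargeWESs and friends.]

```python
class ResourceOwner:
    """Tracks resources so they can be released in bulk on cleanup."""

    def __init__(self):
        self._resources = []

    def remember(self, resource):
        self._resources.append(resource)

    def forget(self, resource):
        self._resources.remove(resource)

    def release(self):
        # Free anything the creator forgot to free explicitly.
        for r in list(self._resources):
            r.free()

class WaitEventSet:
    """Optionally registers with an owner at creation, as in the patch,
    where CreateWaitEventSet gained a ResourceOwner argument."""

    def __init__(self, owner=None):
        self.freed = False
        self._owner = owner
        if owner is not None:
            owner.remember(self)

    def free(self):
        if self._owner is not None:
            self._owner.forget(self)
        self.freed = True
```

Passing `owner=None` matches the existing call sites in the diff, which pass NULL and keep manual lifetime management.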
Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization
> > The calculations inside the loop are written in some dynamic high-level > > language and cannot easily be translated into SQL. > > You don't really have to --- PG supports functions written in non-SQL > languages. Not sure if your problem is big enough to justify developing > a new PL interface for $random-4GL-language, but that might be something > to consider. Surely it would be a possibility to loop in SQL and call the other language for each row, but I am not sure Postgres would be too happy if the callee waits for user input and/or wants to do some SQL update operations. > But, to get back to the original point: exactly what "sizeable performance > impact of declaring a cursor "with hold"" do you think there is? It > shouldn't be noticeably more expensive than just selecting the rows would > be. Especially not for a number of rows that wouldn't make the > above-depicted application structure completely untenable. And for sure > I'm not understanding why you think that materializing the result on the > client side instead would be better. For small tables materialization is a non-issue, but we have a large table where a single select statement over all the rows causes Postgres to create 8GB temp files, staying busy for multiple seconds, which is very noticeable. > > Naturally I am now wondering why the postgres cursor/portal is not also > > employing the same trick (at least as an option): Postpone > > materialization of "with hold" cursors until it is required (like a > > commit operation is dispatched). > > We already do that, and have done it since the feature was invented, > AFAIR. When we declare a cursor for a select on the mentioned big table, it takes multiple seconds and a big temp file is created, which to me seems like the materialization took place immediately. 
> FWIW, the primary reason for materializing the cursor contents at commit, > rather than just holding onto the active query as you seem to think would > be better, is that materializing allows us to release locks on the > underlying table(s). If we kept the active query we'd have to keep those > locks, as well as the query's active snapshot, thus certainly blocking > VACUUM cleanup, and possibly blocking subsequent DDL. Of course, providing such a long-lived cursor would be more costly in terms of the resources you describe, but currently instant materialization at the opening of the cursor is even more expensive. > The approach of using a separate connection to read the cursor suffers > from exactly those same problems. Postgres isn't that happy with very > long-lived transactions (neither is any other RDBMS I'm familiar with). > So really I think that materializing the cursor right away and then > doing your application calculations in whatever chunk size seems > convenient is probably your best bet here. In fact we are fetching cursor results in bigger chunks to avoid communication overhead, and many cursors reach end of scan immediately, but there are cursors on big tables, and fetching all of their records immediately is very expensive. Meanwhile I have implemented a "lazy hold" in my database abstraction layer which pulls the records in at commit, which is an okay-ish trade-off. Better yet, but with the disadvantages you outlined, would be a "with hold" cursor that holds onto the locks and the snapshot, which would avoid materialization entirely (and which can be emulated with a second database connection).
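[Editor's note: the chunked fetching described above ("fetching cursor results in bigger chunks") is simple to express against any DB-API style driver. A sketch, with an arbitrary example chunk size; the function name is illustrative.]

```python
def fetch_in_chunks(cursor, chunk_size=1000):
    """Pull rows from a DB-API cursor in fixed-size batches: larger than
    one row (to limit round trips) but far smaller than the full result
    (to avoid materializing everything at once)."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            return
        yield from rows
```

The caller iterates the generator as if over the full result set, while the driver only ever holds `chunk_size` rows client-side.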
Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization
> > The calculations inside the loop are written in some dynamic high-level > > language and cannot easily be translated into SQL. > > > > Can you not simply create a second connection to perform the updates? That would be possible, but I can see some problems:

loop {
    update table1;
    select something from table1;
    update table2 based on updated table1;
    commit;
}

If we do all the "update" statements in their own transaction, the select statements would not be able to see the changes. What we also tried was to give every loop its own connection, but that did not scale too well.
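[Editor's note: the visibility problem described above is easy to reproduce with any two connections. A small sketch using sqlite3 as a convenient stand-in for PostgreSQL: until the writing connection commits, the reading connection does not see its update, which is exactly why moving the updates to a second connection breaks the loop's selects.]

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file: `writer` updates inside an
# open transaction, `reader` plays the role of the second connection.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path)
reader = sqlite3.connect(path)

writer.execute("CREATE TABLE t (v INTEGER)")
writer.execute("INSERT INTO t VALUES (1)")
writer.commit()

writer.execute("UPDATE t SET v = 2")   # transaction still open on `writer`
before = reader.execute("SELECT v FROM t").fetchone()[0]
writer.commit()
after = reader.execute("SELECT v FROM t").fetchone()[0]
```

Here `before` is still 1 and only `after` is 2: the selects on the second connection lag the uncommitted updates, matching the problem described.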
Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization
> I don't know quite how to put this, but it's not clear to me that the > difficulties in this situation are things PostgreSQL could resolve > even with much larger development resources than are currently > available. There does not seem to exist a cursor/portal/pointer semantic that can survive unrelated changes to the database inside a single connection (and is not super expensive like WITH HOLD). To some extent, similar behavior can be simulated by using a second connection. I assume most people avoid this situation entirely by changing their implementation (for example, to a more dynamic UPDATE statement like you suggested). > If you're updating what are perforce small batches of records in the > UI, there's excellent reason to pull only those batches, mark them as > being "in process," process them, then update the marked ones as > "done" or whatever other states they can get to. > > As to "crazy complicated calculations," this is what active databases > are all about. SQL is Turing complete, so you really can do it. Of course, all the things we do *could* be done in SQL itself, which would be the best solution, but there is a huge legacy code base in 4GL that one cannot automatically translate into semantically equivalent SQL statements. During such a loop, user input can also be requested, for example. > Would you want something that compiles from the user inputs to SQL? > Might that have a more general utility? Well, like I said, a 4GL-to-SQL conversion would be desirable but would require a lot of effort. Thus one wants to mix the languages: currently one would loop in 4GL, holding a SQL cursor/portal, and do some work (which might include SQL update statements). One could also imagine looping in SQL and calling the 4GL runtime for each result row to do the computation. I am not sure that is ideal if such an operation waits on user input. 
Also, one would need to analyze every loop looking for update statements and then automatically restructure them into update statements with dependencies.