Re: poc - possibility to write window function in PL languages
Hi

On Sat, Jan 16, 2021 at 0:09, Tom Lane wrote:
> Pavel Stehule writes:
> > [ plpgsql-window-functions-20210104.patch.gz ]
>
> I spent some time looking at this patch. It would certainly be
> appealing to have some ability to write custom window functions
> without descending into C; but I'm not very happy about the details.
>
> I'm okay with the idea of having a special variable of a new pseudotype.
> That's not exactly pretty, but it descends directly from how we handle
> the arguments of trigger functions, so at least there's precedent.
> What's bugging me though is the "typedvalue" stuff. That seems like a
> conceptual mess, a performance loss, and a permanent maintenance time
> sink. To avoid performance complaints, eventually this hard-wired set
> of conversions would have to bloom to cover every built-in cast, and
> as for extension types, you're just out of luck.

I invented typed values with an idea of larger usability. With this type we can implement dynamic iteration over records better than now, when the fields of records have to be cast to text or json before the operation. With this type I can hold a typed value for a longer time and do something like:

DECLARE var typedvalue;
  var := fx(..);
  IF var IS OF integer THEN
    var_int := CAST(var AS int);
  ELSEIF var IS OF date THEN
    var_date := CAST(var AS date);
  ELSE
    var_text := CAST(var AS text);
  END;

Sometimes (when you process some external data) this late (lazy) cast can be better and allows you to use typed values. When I read external data, sometimes I don't know the types of these data before reading. I would like to inject a possibility of more dynamic work with values and variables (but still cleanly and safely). It should be safer and faster than now, when people have to use the "text" type.

But I understand and I agree with your objections. Probably a lot of people would use this type badly.

> One way to avoid that would be to declare the argument-fetching
> functions as polymorphics with a dummy argument that just provides
> the expected result type. So users would write something like
>
>   create function pl_lag(x numeric)
>   ...
>   v := get_input_value_in_partition(windowobject, x, 1, -1,
>                                     'seek_current', false);
>
> where the argument-fetching function is declared
>
>   get_input_value_in_partition(windowobject, anyelement, int, ...)
>   returns anyelement
>
> and internally it could verify that the n'th window function argument
> matches the type of its second argument. While this could be made
> to work, it's kind of unsatisfying because the argument number "1" is
> so obviously redundant with the reference to "x". Ideally one should
> only have to write "x". I don't quite see how to make that work,
> but maybe there's a way?
>
> On the whole though, I think your original idea of bespoke plpgsql
> syntax is better, ie let's write something like
>
>   GET WINDOW VALUE v := x AT PARTITION CURRENT(-1);
>
> and hide all the mechanism behind that. The reference to "x" is enough
> to provide the argument number and type, and the window object doesn't
> have to be explicitly visible at all.

Yes, this syntax looks good.

The second question is how to work with the partition context value. There should be only one such value per function, but it can be of any type. In this case we cannot use GET statements. I had an idea of enhancing the declaration, something like:

DECLARE pcx PARTITION CONTEXT (int);  -- read partition context
BEGIN
  pcx := 10;  -- set partition context

What do you think about it?
Regards

Pavel

> Yeah, this will mean that anybody who wants to provide equivalent
> functionality in some other PL will have to do more work. But it's
> not like it was going to be zero effort for them before. Furthermore,
> it's not clear to me that other PLs would want to adopt your current
> design anyway. For example, I bet PL/R would like to somehow make
> window arguments map into vectors on the R side, but there's no chance
> of that with this SQL layer in between.
>
> regards, tom lane
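As a concrete illustration for readers skimming the thread: a hypothetical PL/pgSQL lag()-style window function written with the GET WINDOW VALUE syntax proposed above. None of this runs on current servers; the WINDOW marker on a plpgsql function and the GET WINDOW VALUE statement are part of the proposal, and the body is only a sketch.

create function pl_lag(x numeric)
returns numeric
language plpgsql window as $$
declare
  v numeric;
begin
  -- fetch argument "x" from the previous row of the current partition,
  -- per the proposed statement discussed above (hypothetical syntax)
  GET WINDOW VALUE v := x AT PARTITION CURRENT(-1);
  return v;
end;
$$;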
[bug?] EXPLAIN outputs 0 for rows and width in cost estimate for update nodes
Hello,

While investigating problems with parallel DML on another thread, I encountered a fishy behavior of EXPLAIN on HEAD. Is this a bug?

As shown below, the rows and width values of the Update node are 0. These were 1 and 10 respectively in versions 9.4.26 and 10.12 at hand.

postgres=# create table a (c int);
CREATE TABLE
postgres=# insert into a values(1);
INSERT 0 1
postgres=# analyze a;
ANALYZE
postgres=# begin;
BEGIN
postgres=*# explain analyze update a set c=2;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Update on a  (cost=0.00..1.01 rows=0 width=0) (actual time=0.189..0.191 rows=0 loops=1)
   ->  Seq Scan on a  (cost=0.00..1.01 rows=1 width=10) (actual time=0.076..0.079 rows=1 loops=1)
 Planning Time: 0.688 ms
 Execution Time: 0.494 ms
(4 rows)

With RETURNING, the values are not 0, as follows.

postgres=*# rollback;
ROLLBACK
postgres=# begin;
BEGIN
postgres=# explain analyze update a set c=2 returning *;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Update on a  (cost=0.00..1.01 rows=1 width=10) (actual time=0.271..0.278 rows=1 loops=1)
   ->  Seq Scan on a  (cost=0.00..1.01 rows=1 width=10) (actual time=0.080..0.082 rows=1 loops=1)
 Planning Time: 0.308 ms
 Execution Time: 0.392 ms
(4 rows)

The above holds true for Insert and Delete nodes as well. In the manual, they are not 0.

https://www.postgresql.org/docs/devel/using-explain.html
--
EXPLAIN ANALYZE UPDATE tenk1 SET hundred = hundred + 1 WHERE unique1 < 100;

                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Update on tenk1  (cost=5.07..229.46 rows=101 width=250) (actual time=14.628..14.628 rows=0 loops=1)
   ->  Bitmap Heap Scan on tenk1  (cost=5.07..229.46 rows=101 width=250) (actual time=0.101..0.439 rows=100 loops=1)
...
--

This behavior may possibly be considered intended, on the grounds that Update/Insert/Delete nodes don't output rows without RETURNING. Is this a bug or correct behavior?

Regards
Takayuki Tsunakawa
Re: [bug?] EXPLAIN outputs 0 for rows and width in cost estimate for update nodes
On Wed, Jan 20, 2021 at 9:12 PM tsunakawa.ta...@fujitsu.com wrote:
> This behavior may possibly be considered as an intended behavior for the
> reason that Update/Insert/Delete nodes don't output rows without RETURNING.
> Is this a bug or a correct behavior?

Hi Tsunakawa-san,

This was a change made deliberately. Do you see a problem?

commit f0f13a3a08b2757997410f3a1c38bdc22973c525
Author: Thomas Munro
Date:   Mon Oct 12 20:41:16 2020 +1300

    Fix estimates for ModifyTable paths without RETURNING.

    In the past, we always estimated that a ModifyTable node would emit
    the same number of rows as its subpaths.  Without a RETURNING clause,
    the correct estimate is zero.  Fix, in preparation for a proposed
    parallel write patch that is sensitive to that number.

    A remaining problem is that for RETURNING queries, the estimated width
    is based on subpath output rather than the RETURNING tlist.

    Reviewed-by: Greg Nancarrow
    Discussion: https://postgr.es/m/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV%3DqpFJrR3AcrTS3g%40mail.gmail.com
RE: [bug?] EXPLAIN outputs 0 for rows and width in cost estimate for update nodes
Hi Thomas-san,

From: Thomas Munro
> This was a change made deliberately. Do you see a problem?

Thank you, I was surprised at your very quick response. I just wanted to confirm that I can believe the EXPLAIN output. Then the remaining problem is the sample output in the manual. The fix is attached.

Regards
Takayuki Tsunakawa

Attachment: 0001-Fix-sample-output-of-EXPLAIN-ANALYZE.patch
Re: Wrong usage of RelationNeedsWAL
At Tue, 19 Jan 2021 01:31:52 -0800, Noah Misch wrote in
> On Tue, Jan 19, 2021 at 01:48:31PM +0900, Kyotaro Horiguchi wrote:
> > I understand that you are suggesting that at least
> > TransactionIdLimitedForOldSnapshots should follow not only relation
> > persistence but RelationNeedsWAL, based on the discussion on the
> > check on LSN of TestForOldSnapshot().
> >
> > I still don't think that LSN in the WAL-skipping-created relfilenode
> > harm. ALTER TABLE SET TABLESPACE always makes a dead copy of every
> > block (except checksum) including page LSN regardless of wal_level. In
> > the scenario above, the last select at H correctly reads page LSN set
> > by E then copied by G, which is larger than the snapshot LSN at B. So
> > doesn't go the fast-return path before actual check performed by
> > RelationAllowsEarlyPruning.
>
> I agree the above procedure doesn't have a problem under your patch versions.
> See https://postgr.es/m/20210116043816.ga1644...@rfd.leadboat.com for a
> different test case. In more detail:
>
> A> setup: CREATE TABLE t AS SELECT ...;
> B> xact1: BEGIN; DELETE FROM t;
> C> xact2: BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; -- take snapshot
> D> xact1: COMMIT;
> (plenty of time passes, rows of t are now eligible for early pruning)
> E> xact3: BEGIN;
> F> xact3: ALTER TABLE t SET TABLESPACE something; -- start skipping WAL
> G> xact3: SELECT count(*) FROM t; -- early pruning w/o WAL, w/o LSN changes
> H> xact3: COMMIT;
> J> xact2: SELECT count(*) FROM t; -- no error, wanted "snapshot too old"
>
> I expect that shows no bug for git master, but I expect it does show a bug

I didn't see "snapshot too old" at J with the patch, but at the same time the SELECT at G didn't prune tuples in almost all of the trials. (The last SELECT returns the correct answer.) I saw it get pruned only once among many trials, but I'm not sure how I got into that situation and haven't succeeded in reproducing it.

The reason that the SELECT at G doesn't prune is that limited_xmin in heap_page_prune_opt at G is limited up to the oldest xid among active sessions. In this case the xmin of session 2 is the same as the xid of session 1, so the xmin of session 3 at G is the same as the xmin of session 2, which is the same as the xid of session 1. So pruning doesn't happen. However, that is a very fragile basis for asserting that pruning won't happen. In any case, it seems genuinely dangerous to cause pruning on a WAL-skipped relation.

> with your patch versions. Could you try implementing both test procedures in
> src/test/modules/snapshot_too_old? There's no need to make the test use
> wal_level=minimal explicitly; it's sufficient to catch these bugs when
> wal_level=minimal is in the TEMP_CONFIG file.

In the attached, TestForOldSnapshot() considers XLogIsNeeded(), instead of moving the condition on LSN to TestForOldSnapshot_impl, for performance.

I'll add the test part in the next version.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

>From 5976674e48fdce3c6e911f0ae63485d92941bfd8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Mon, 18 Jan 2021 14:47:21 +0900
Subject: [PATCH v4] Do not use RelationNeedsWAL to identify relation
 persistence

RelationNeedsWAL() may return false for a permanent relation when
wal_level=minimal and the relation is created or truncated in the
current transaction. Directly examine relpersistence instead of using
the function to know relation persistence.
---
 src/backend/catalog/pg_publication.c |  2 +-
 src/backend/optimizer/util/plancat.c |  3 ++-
 src/include/storage/bufmgr.h         |  8 +++-
 src/include/utils/rel.h              |  2 +-
 src/include/utils/snapmgr.h          |  2 +-
 5 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 5f8e1c64e1..84d2efcfd2 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -67,7 +67,7 @@ check_publication_add_relation(Relation targetrel)
 				 errdetail("System tables cannot be added to publications.")));
 
 	/* UNLOGGED and TEMP relations cannot be part of publication. */
-	if (!RelationNeedsWAL(targetrel))
+	if (targetrel->rd_rel->relpersistence != RELPERSISTENCE_PERMANENT)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("table \"%s\" cannot be replicated",
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index da322b453e..177e6e336a 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -126,7 +126,8 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 	relation = table_open(relationObjectId, NoLock);
 
 	/* Temporary and unlogged relations are inaccessible during recovery. */
-	if (!RelationNeedsWAL(relation) && RecoveryInProgress())
+	if (relation->rd_rel->relpersistence != RELPERSISTENCE_PERMANENT &&
+		RecoveryInProgress())
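For convenience, here is the A-J procedure quoted above transcribed as SQL in one place, with the separate sessions marked in comments. This is only a sketch: it assumes a cluster configured with old_snapshot_threshold > 0 and wal_level = minimal, the setup query and the target tablespace are illustrative, and the steps must be run interleaved across three sessions as marked.

-- setup (step A); the original elides the query, any row-producing SELECT works
CREATE TABLE t AS SELECT g FROM generate_series(1, 1000) AS g;

-- session 1:
BEGIN;
DELETE FROM t;                             -- step B
-- session 2:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT 1;                                  -- step C: take the snapshot
-- session 1:
COMMIT;                                    -- step D
-- (plenty of time passes; rows of t become eligible for early pruning)
-- session 3:
BEGIN;                                     -- step E
ALTER TABLE t SET TABLESPACE pg_default;   -- step F: starts skipping WAL under wal_level=minimal (tablespace name illustrative)
SELECT count(*) FROM t;                    -- step G: early pruning w/o WAL, w/o LSN changes
COMMIT;                                    -- step H
-- session 2:
SELECT count(*) FROM t;                    -- step J: "snapshot too old" is wanted here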
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On Wed, Jan 20, 2021 at 11:53 AM Fujii Masao wrote:
> So, furthermore, we can use hash_search() to find the target cached
> connection, instead of using hash_seq_search(), when the server name
> is given. This would simplify the code a bit more? Of course,
> hash_seq_search() is necessary when closing all the connections, though.

Note that the cache entry key is the user mapping oid, and to use hash_search() we need that oid. In postgres_fdw_disconnect we can get the server oid, and we can also get the user mapping using GetUserMapping(), but that requires GetUserId()/CurrentUserId as an input; I'm not sure whether we would have problems if CurrentUserId changes when the current user of the session changes. Also, a user mapping may have been dropped while its connection still exists because it's in use; in that case GetUserMapping() fails with a cache lookup error. And yes, disconnecting all connections requires hash_seq_search().

Keeping the above in mind, I feel we can do hash_seq_search(), as we do currently, even when the server name is given as input. This way, we don't need to bother much about the above points.

Thoughts?

> + * 2) If no input argument is provided, then it tries to disconnect all the
> + *    connections.
>
> I'm concerned that users can easily forget to specify the argument and
> accidentally discard all the connections. So, IMO, to alleviate this situation,
> what about changing the function name (only when closing all the connections)
> to something like postgres_fdw_disconnect_all(), like we have
> pg_advisory_unlock_all() against pg_advisory_unlock()?

+1. We will have two functions: postgres_fdw_disconnect(server name) and postgres_fdw_disconnect_all().

> +        if (result)
> +        {
> +            /* We closed at least one connection, others are in use. */
> +            ereport(WARNING,
> +                    (errmsg("cannot close all connections because some of them are still in use")));
> +        }
>
> Sorry if this was already discussed upthread. Isn't it more helpful to
> emit a warning for every connection that fails to be closed? For example,
>
> WARNING:  cannot close connection for server "loopback1" because it is still in use
> WARNING:  cannot close connection for server "loopback2" because it is still in use
> WARNING:  cannot close connection for server "loopback3" because it is still in use
> ...
>
> This enables us to identify easily which server connections cannot be
> closed for now.

+1. Looks like pg_advisory_unlock is doing that. Given the fact that still-in-use connections are possible only in explicit transactions, we might not have many still-in-use connections in real-world use cases, so I'm okay with changing it that way.

I will address all these comments and post an updated patch set soon.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
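For readers skimming the thread, a quick sketch of how the two functions settled on above would be used once the patch is in (server names follow the thread's loopback examples; exact signatures were still under review at this point):

-- close the cached connection to one foreign server, if it is not in use
SELECT postgres_fdw_disconnect('loopback1');

-- close every cached connection that is not currently in use
SELECT postgres_fdw_disconnect_all();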
Re: [bug?] EXPLAIN outputs 0 for rows and width in cost estimate for update nodes
On Wed, 2021-01-20 at 08:35 +, tsunakawa.ta...@fujitsu.com wrote: > > This was a change made deliberately. Do you see a problem? > > Thank you, I was surprised at your very quick response. > I just wanted to confirm I can believe EXPLAIN output. > Then the problem is the sample output in the manual. > The fix is attached. +1. That was obviously an oversight. Yours, Laurenz Albe
Re: Rethinking plpgsql's assignment implementation
On Tue, Jan 19, 2021 at 19:21, Pavel Stehule wrote:
> Hi
>
> Now I am testing the jsonb subscripting feature, and I found one issue
> that is not supported by the parser.
>
> When the target is a scalar, then all is ok. But we can have a plpgsql
> array of jsonb values.
>
> postgres=# do $$
> declare j jsonb[];
> begin
>   j[1] = '{"b":"Ahoj"}';
>   raise notice '%', j;
>   raise notice '%', (j[1])['b'];
> end
> $$;
> NOTICE:  {"{\"b\": \"Ahoj\"}"}
> NOTICE:  "Ahoj"
> DO
>
> Parentheses work well in expressions, but are not supported on the left
> side of an assignment.
>
> postgres=# do $$
> declare j jsonb[];
> begin
>   (j[1])['b'] = '"Ahoj"';
>   raise notice '%', j;
>   raise notice '%', j[1]['b'];
> end
> $$;
> ERROR:  syntax error at or near "("
> LINE 4: (j[1])['b'] = '"Ahoj"';
>         ^

Assignment for nested composite types works better, although there is some inconsistency there too:

create type t_inner as (x int, y int);
create type t_outer as (a t_inner, b t_inner);

do $$
declare v t_outer;
begin
  v.a.x := 10;  -- parentheses not allowed here, but not required
  raise notice '%', v;
  raise notice '%', (v).a.x;  -- parentheses are required here
end;
$$;

Regards

Pavel

> Regards
>
> Pavel
Re: should INSERT SELECT use a BulkInsertState?
On Sat, Dec 05, 2020 at 01:59:41PM -0600, Justin Pryzby wrote:
> On Thu, Dec 03, 2020 at 10:59:34AM +0530, Bharath Rupireddy wrote:
> > On Wed, Dec 2, 2020 at 10:24 PM Justin Pryzby wrote:
> > >
> > > One loose end in this patch is how to check for volatile default expressions.
> >
> > I think we should be doing all the necessary checks in the planner and
> > have a flag in the planned stmt to indicate whether to go with multi
> > insert or not. For the required checks, we can have a look at how the
> > existing COPY decides to go with either CIM_MULTI or CIM_SINGLE.
>
> Yes, you can see that I've copied the checks from copy.
> Like copy, some checks are done once, in ExecInitModifyTable, outside of the
> ExecModifyTable "loop".
>
> This squishes some commits together.
> And uses bistate for ON CONFLICT.
> And attempts to use memory context for tuple size.

Rebased on 9dc718bdf2b1a574481a45624d42b674332e2903

I guess my patch should/may be subsumed by this other one - I'm fine with that.
https://commitfest.postgresql.org/31/2871/

Note that my interest here is just in bistate, to avoid leaving behind many dirty buffers, not improved performance of COPY.

--
Justin

>From 65fd9d9352634e4d0ad49685a988c5e4e157b9d8 Mon Sep 17 00:00:00 2001
From: Justin Pryzby
Date: Fri, 8 May 2020 02:17:32 -0500
Subject: [PATCH v9 1/4] INSERT SELECT to use BulkInsertState and multi_insert

Renames structures;
Move MultipleInsert functions from copyfrom.c to (tentatively) nodeModifyTable.h;
Move into MultiInsertInfo: transition_capture and cur_lineno (via cstate->miinfo);
Dynamically switch to multi-insert mode based on the number of insertions.

This is intended to accomodate 1) the original use case of INSERT using a
small ring buffer to avoid leaving behind dirty buffers; and, 2) Automatically
using multi-inserts for batch operations; 3) allow the old behavior of leaving
behind dirty buffers, which might allow INSERT to run more quickly, at the
cost of leaving behind many dirty buffers which other backends may have to
write out.

XXX: for (1), the bulk-insert state is used even if not multi-insert,
including for a VALUES.

TODO: use cstate->miinfo.cur_lineno++ instead of mtstate->miinfo->ntuples
---
 src/backend/commands/copyfrom.c          | 394 +--
 src/backend/commands/copyfromparse.c     |  10 +-
 src/backend/executor/execMain.c          |   2 +-
 src/backend/executor/execPartition.c     |   2 +-
 src/backend/executor/nodeModifyTable.c   | 196 +--
 src/backend/utils/misc/guc.c             |  10 +
 src/include/commands/copyfrom_internal.h |   5 +-
 src/include/executor/nodeModifyTable.h   | 371 +
 src/include/nodes/execnodes.h            |  16 +-
 src/test/regress/expected/insert.out     |  43 +++
 src/test/regress/sql/insert.sql          |  20 ++
 src/tools/pgindent/typedefs.list         |   4 +-
 12 files changed, 658 insertions(+), 415 deletions(-)

diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index c39cc736ed..8221a2c5d3 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -46,54 +46,6 @@
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
 
-/*
- * No more than this many tuples per CopyMultiInsertBuffer
- *
- * Caution: Don't make this too big, as we could end up with this many
- * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
- * multiInsertBuffers list.  Increasing this can cause quadratic growth in
- * memory requirements during copies into partitioned tables with a large
- * number of partitions.
- */
-#define MAX_BUFFERED_TUPLES		1000
-
-/*
- * Flush buffers if there are >= this many bytes, as counted by the input
- * size, of tuples stored.
- */
-#define MAX_BUFFERED_BYTES		65535
-
-/* Trim the list of buffers back down to this number after flushing */
-#define MAX_PARTITION_BUFFERS	32
-
-/* Stores multi-insert data related to a single relation in CopyFrom. */
-typedef struct CopyMultiInsertBuffer
-{
-	TupleTableSlot *slots[MAX_BUFFERED_TUPLES]; /* Array to store tuples */
-	ResultRelInfo *resultRelInfo;	/* ResultRelInfo for 'relid' */
-	BulkInsertState bistate;	/* BulkInsertState for this rel */
-	int			nused;			/* number of 'slots' containing tuples */
-	uint64		linenos[MAX_BUFFERED_TUPLES];	/* Line # of tuple in copy
-												 * stream */
-} CopyMultiInsertBuffer;
-
-/*
- * Stores one or many CopyMultiInsertBuffers and details about the size and
- * number of tuples which are stored in them.  This allows multiple buffers to
- * exist at once when COPYing into a partitioned table.
- */
-typedef struct CopyMultiInsertInfo
-{
-	List	   *multiInsertBuffers; /* List of tracked CopyMultiInsertBuffers */
-	int			bufferedTuples; /* number of tuples buffered over all buffers */
-	int			bufferedBytes;	/* number of bytes from all buffered tuples */
-	CopyFromState cstate;		/* Copy state for this CopyMultiInsertInfo */
-	EState	   *estate;			/* Executor state used for COPY */
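As a reference point for what the patch above is aiming at, a trivial example (not taken from the patch's regression tests; the table and data are made up) of an INSERT ... SELECT that the patch would route through a BulkInsertState/multi-insert path instead of dirtying shared buffers one tuple at a time:

CREATE TABLE dst (a int, b text);

-- a bulk load written as INSERT ... SELECT rather than COPY; with the patch,
-- enough buffered rows switch execution into multi-insert mode
INSERT INTO dst
SELECT g, 'row ' || g
FROM generate_series(1, 1000000) AS g;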
Re: pg_stat_statements oddity with track = all
Hi,

On Tue, Jan 19, 2021 at 4:55 PM Masahiro Ikeda wrote:
>
> Thanks for making the patch to add "toplevel" column in
> pg_stat_statements.
> This is a review comment.

Thanks a lot for the thorough review!

> I tested the "update" command can work.
> postgres=# ALTER EXTENSION pg_stat_statements UPDATE TO '1.10';
>
> Although the "toplevel" column of all queries which already stored is
> 'false', we have to decide the default. I think 'false' is ok.

Mmm, I'm not sure that I understand this result. The "toplevel" value is tracked by the C code loaded at startup, so it should have a correct value even if you used to have the extension in a previous version; it's just that you can't access the toplevel field before doing the ALTER EXTENSION pg_stat_statements UPDATE. There's also a change in the magic number, so you can't use the stored stats file from a version before this patch.

I also fail to reproduce this behavior. I did the following:

- start postgres with pg_stat_statements v1.10 (so with this patch) in shared_preload_libraries
- CREATE EXTENSION pg_stat_statements VERSION '1.9';
- execute a few queries
- ALTER EXTENSION pg_stat_statements UPDATE;
- SELECT * FROM pg_stat_statements reports the query with toplevel set to TRUE

Can you share a way to reproduce your findings?

> 2. usability review
>
> [...]
> Although one concern is whether only top-level is enough or not,
> I couldn't come up with any use-case to use nested level, so I think
> it's ok.

I agree, I don't see how tracking statistics per nesting level would really help. The only additional use case I see would be tracking trigger/FK query execution, but the nesting level won't help with that.

> 3. coding review
> =================
> [...]
> B. add an argument to pg_stat_statements_reset().
>
> Now, pg_stat_statements supports a flexible reset feature.
>
> pg_stat_statements_reset(userid Oid, dbid Oid, queryid bigint)
>
> Although I wondered whether we need to add a top-level flag to the arguments,
> I couldn't come up with any use-case to reset only top-level queries or
> not top-level queries.

Ah, I didn't think of the reset function. I also fail to see a reasonable use case for providing such a switch in the reset function.

> 4. others
> ==========
>
> These comments are not related to this patch.
>
> A. about the topic of commitfests
>
> Since I think this feature is for monitoring,
> it's better to change the topic from "System Administration"
> to "Monitoring & Control".

I agree, thanks for the change!

> B. check logic whether a query is top-level or not in pg_stat_kcache
>
> This patch uses only exec_nested_level to check whether a query is
> top-level or not.
> The reason is nested_level is less useful and I agree.

Do you mean plan_nest_level is less useful?

> But, pg_stat_kcache uses plan_nested_level too.
> Although the check result is the same, it's better to change it
> corresponding to this patch after it's committed.

I agree, we should be consistent here. I'll take care of the needed changes if and when this patch is committed!
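To make the column under review concrete, a small query one could run once the patched extension is installed; the toplevel column comes from the patch, everything else is ordinary pg_stat_statements usage:

-- statements issued directly by clients versus statements executed underneath
-- another statement (for example from inside a PL/pgSQL function)
SELECT queryid, calls, toplevel, left(query, 40) AS query
FROM pg_stat_statements
ORDER BY toplevel DESC, calls DESC
LIMIT 10;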
Re: [bug?] EXPLAIN outputs 0 for rows and width in cost estimate for update nodes
On Wed, Jan 20, 2021 at 9:42 PM Laurenz Albe wrote: > On Wed, 2021-01-20 at 08:35 +, tsunakawa.ta...@fujitsu.com wrote: > > > This was a change made deliberately. Do you see a problem? > > > > Thank you, I was surprised at your very quick response. > > I just wanted to confirm I can believe EXPLAIN output. > > Then the problem is the sample output in the manual. > > The fix is attached. > > +1. That was obviously an oversight. Pushed. Thanks.
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2021/01/20 17:41, Bharath Rupireddy wrote: On Wed, Jan 20, 2021 at 11:53 AM Fujii Masao wrote: So, furthermore, we can use hash_search() to find the target cached connection, instead of using hash_seq_search(), when the server name is given. This would simplify the code a bit more? Of course, hash_seq_search() is necessary when closing all the connections, though. Note that the cache entry key is user mapping oid and to use hash_search() we need the user mapping oid. But in postgres_fdw_disconnect we can get server oid and we can also get user mapping id using GetUserMapping, but it requires GetUserId()/CurrentUserId as an input, I doubt we will have problems if CurrentUserId is changed somehow with the change of current user in the session. And user mapping may be dropped but still the connection can exist if it's in use, in that case GetUserMapping fails in cache lookup. And yes, disconnecting all connections requires hash_seq_search(). Keeping above in mind, I feel we can do hash_seq_search(), as we do currently, even when the server name is given as input. This way, we don't need to bother much on the above points. Thoughts? Thanks for explaining this! You're right. I'd withdraw my suggestion. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Parallel INSERT (INTO ... SELECT ...)
On Fri, Jan 8, 2021 at 8:16 PM Amit Kapila wrote: > > > - if (pcxt->nworkers_launched > 0) > + if (pcxt->nworkers_launched > 0 && !(isParallelModifyLeader && > !isParallelModifyWithReturning)) > { > > I think this check could be simplified to if (pcxt->nworkers_launched > > 0 && isParallelModifyWithReturning) or something like that. > Not quite. The existing check is correct, because it needs to account for existing Parallel SELECT functionality (not just new Parallel INSERT). But I will re-write the test as an equivalent expression, so it's hopefully more readable (taking into account Antonin's suggested variable-name changes): if (pcxt->nworkers_launched > 0 && (!isModify || isModifyWithReturning)) > > > > @@ -252,6 +252,7 @@ set_plan_references(PlannerInfo *root, Plan *plan) > > PlannerGlobal *glob = root->glob; > > int rtoffset = list_length(glob->finalrtable); > > ListCell *lc; > > + Plan *finalPlan; > > > > /* > > * Add all the query's RTEs to the flattened rangetable. The live > > ones > > @@ -302,7 +303,17 @@ set_plan_references(PlannerInfo *root, Plan *plan) > > } > > > > /* Now fix the Plan tree */ > > - return set_plan_refs(root, plan, rtoffset); > > + finalPlan = set_plan_refs(root, plan, rtoffset); > > + if (finalPlan != NULL && IsA(finalPlan, Gather)) > > + { > > + Plan *subplan = outerPlan(finalPlan); > > + > > + if (IsA(subplan, ModifyTable) && castNode(ModifyTable, > > subplan)->returningLists != NULL) > > + { > > + finalPlan->targetlist = > > copyObject(subplan->targetlist); > > + } > > + } > > + return finalPlan; > > } > > > > I'm not sure if the problem of missing targetlist should be handled here > > (BTW, > > NIL is the constant for an empty list, not NULL). Obviously this is a > > consequence of the fact that the ModifyTable node has no regular targetlist. > > > > I think it is better to add comments along with this change. In this > form, this looks quite hacky to me. > The targetlist on the ModifyTable node has been setup correctly, but it hasn't been propagated to the Gather node. Of course, I have previously tried to elegantly fix this, but struck various problems, using different approaches. Perhaps this code could just be moved into set_plan_refs(). For the moment, I'll just keep the current code, but I'll add a FIXME comment for this. I'll investigate Antonin's suggestions as a lower-priority side-task. Regards, Greg Nancarrow Fujitsu Australia
Re: a misbehavior of partition row movement (?)
On Wed, Jan 20, 2021 at 4:13 PM Peter Eisentraut wrote:
> On 2021-01-08 09:54, Amit Langote wrote:
> >>> I don't quite recall if the decision to implement it like this was
> >>> based on assuming that this is what users would like to see happen in
> >>> this case or the perceived difficulty of implementing it the other way
> >>> around, that is, of firing AFTER UPDATE triggers in this case.
> >> I tried to look that up, but I couldn't find any discussion about this. Do
> >> you have any ideas in which thread that was handled?
> > It was discussed here:
> >
> > https://www.postgresql.org/message-id/flat/CAJ3gD9do9o2ccQ7j7%2BtSgiE1REY65XRiMb%3DyJO3u3QhyP8EEPQ%40mail.gmail.com
> >
> > It's a huge discussion, so you'll have to ctrl+f "trigger" to spot
> > relevant emails.  You might notice that the developers who
> > participated in that discussion gave various opinions and what we have
> > today got there as a result of a majority of them voting for the
> > current approach.  Someone also said this during the discussion:
> > "Regarding the trigger issue, I can't claim to have a terribly strong
> > opinion on this.  I think that practically anything we do here might
> > upset somebody, but probably any halfway-reasonable thing we choose to
> > do will be OK for most people."  So what we've got is that
> > "halfway-reasonable" thing, YMMV. :)
>
> Could you summarize here what you are trying to do with respect to what
> was decided before?  I'm a bit confused, looking through the patches you
> have posted.  The first patch you posted hard-coded FK trigger OIDs
> specifically, other patches talk about foreign key triggers in general
> or special case internal triggers or talk about all triggers.

The original problem statement is this: the way we generally fire row-level triggers of a partitioned table can lead to some unexpected behaviors of the foreign keys pointing to that partitioned table during its cross-partition updates.

Let's start with an example that shows how triggers are fired during a cross-partition update:

create table p (a numeric primary key) partition by list (a);
create table p1 partition of p for values in (1);
create table p2 partition of p for values in (2);

create or replace function report_trig_details() returns trigger as $$
begin
  raise notice '% % on %', tg_when, tg_op, tg_relname;
  if tg_op = 'DELETE' then
    return old;
  end if;
  return new;
end;
$$ language plpgsql;

create trigger trig1 before update on p for each row execute function report_trig_details();
create trigger trig2 after update on p for each row execute function report_trig_details();
create trigger trig3 before delete on p for each row execute function report_trig_details();
create trigger trig4 after delete on p for each row execute function report_trig_details();
create trigger trig5 before insert on p for each row execute function report_trig_details();
create trigger trig6 after insert on p for each row execute function report_trig_details();

insert into p values (1);

update p set a = 2;
NOTICE:  BEFORE UPDATE on p1
NOTICE:  BEFORE DELETE on p1
NOTICE:  BEFORE INSERT on p2
NOTICE:  AFTER DELETE on p1
NOTICE:  AFTER INSERT on p2
UPDATE 1

(AR update triggers are not fired.)

For contrast, here is an intra-partition update:

update p set a = a;
NOTICE:  BEFORE UPDATE on p2
NOTICE:  AFTER UPDATE on p2
UPDATE 1

Now, the trigger machinery makes no distinction between user-defined and internal triggers, which has implications for the foreign key enforcing triggers on partitions.

Consider the following example:

create table q (a bigint references p);
insert into q values (2);

update p set a = 1;
NOTICE:  BEFORE UPDATE on p2
NOTICE:  BEFORE DELETE on p2
NOTICE:  BEFORE INSERT on p1
ERROR:  update or delete on table "p2" violates foreign key constraint "q_a_fkey2" on table "q"
DETAIL:  Key (a)=(2) is still referenced from table "q".

So the RI delete trigger (NOT update) on p2 prevents the DELETE that occurs as part of the row movement. One might make the updates cascade and expect that to prevent the error:

drop table q;
create table q (a bigint references p on update cascade);
insert into q values (2);

update p set a = 1;
NOTICE:  BEFORE UPDATE on p2
NOTICE:  BEFORE DELETE on p2
NOTICE:  BEFORE INSERT on p1
ERROR:  update or delete on table "p2" violates foreign key constraint "q_a_fkey2" on table "q"
DETAIL:  Key (a)=(2) is still referenced from table "q".

No luck, because again it's the RI delete trigger on p2 that gets fired. If you make deletes cascade too, an even worse thing happens:

drop table q;
create table q (a bigint references p on update cascade on delete cascade);
insert into q values (2);

update p set a = 1;
NOTICE:  BEFORE UPDATE on p2
NOTICE:  BEFORE DELETE on p2
NOTICE:  BEFORE INSERT on p1
NOTICE:  AFTER DELETE on p2
NOTICE:  AFTER INSERT on p1
UPDATE 1

select * from q;
 a
---
(0 rows)

The RI delete trigger deleted 2 from q, whereas the expected result would've been for q to be updated to change 2
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On Wed, Jan 20, 2021 at 3:24 PM Fujii Masao wrote:
> > Keeping above in mind, I feel we can do hash_seq_search(), as we do
> > currently, even when the server name is given as input. This way, we
> > don't need to bother much on the above points.
> >
> > Thoughts?
>
> Thanks for explaining this! You're right. I'd withdraw my suggestion.

Attaching the v14 patch set with the review comments addressed. Please review it further.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachments:
- v14-0001-postgres_fdw-function-to-discard-cached-connecti.patch
- v14-0002-postgres_fdw-add-keep_connections-GUC-to-not-cac.patch
- v14-0003-postgres_fdw-server-level-option-keep_connection.patch
Re: pg_stat_statements oddity with track = all
On 2021-01-20 18:14, Julien Rouhaud wrote: On Tue, Jan 19, 2021 at 4:55 PM Masahiro Ikeda wrote: I tested the "update" command can work. postgres=# ALTER EXTENSION pg_stat_statements UPDATE TO '1.10'; Although the "toplevel" column of all queries which already stored is 'false', we have to decide the default. I think 'false' is ok. Mmm, I'm not sure that I understand this result. The "toplevel" value is tracked by the C code loaded at startup, so it should have a correct value even if you used to have the extension in a previous version, it's just that you can't access the toplevel field before doing the ALTER EXTENSION pg_stat_statements UPDATE. There's also a change in the magic number, so you can't use the stored stat file from a version before this patch. I also fail to reproduce this behavior. I did the following: - start postgres with pg_stat_statements v1.10 (so with this patch) in shared_preload_libraries - CREATE EXTENSION pg_stat_statements VERSION '1.9'; - execute a few queries - ALTER EXTENSION pg_stat_statements UPDATE; - SELECT * FROM pg_stat_statements reports the query with toplvel to TRUE Can you share a way to reproduce your findings? Sorry for making you confused. I can't reproduce although I tried now. I think my procedure was something wrong. So, please ignore this comment, sorry about that. B. check logic whether a query is top-level or not in pg_stat_kcache This patch uses the only exec_nested_level to check whether a query is top-level or not. The reason is nested_level is less useful and I agree. Do you mean plan_nest_level is less useful? I think so. Anyway, it's important to correspond core's implementation because pg_stat_statements and pg_stat_kcache are used at the same time. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On Mon, Jan 18, 2021 at 02:37:57AM -0600, Justin Pryzby wrote:
> Attached. I will re-review these myself tomorrow.

I have begun looking at 0001 and 0002...

+/*
+ * This is mostly duplicating ATExecSetTableSpaceNoStorage,
+ * which should maybe be factored out to a library function.
+ */

Wouldn't it be better to do first the refactoring of 0002 and then 0001, so as REINDEX can use the new routine, instead of putting that into a comment?

+      This specifies that indexes will be rebuilt on a new tablespace.
+      Cannot be used with "mapped" relations. If SCHEMA,
+      DATABASE or SYSTEM is specified, then
+      all unsuitable relations will be skipped and a single WARNING
+      will be generated.

What is an unsuitable relation? How can the end user know that?

This is missing ACL checks when moving the index into a new location, so this requires some pg_tablespace_aclcheck() calls, and the other patches share the same issue.

+   else if (partkind == RELKIND_PARTITIONED_TABLE)
+   {
+       Relation    rel = table_open(partoid, ShareLock);
+       List       *indexIds = RelationGetIndexList(rel);
+       ListCell   *lc;
+
+       table_close(rel, NoLock);
+       foreach (lc, indexIds)
+       {
+           Oid         indexid = lfirst_oid(lc);
+           (void) set_rel_tablespace(indexid, params->tablespaceOid);
+       }
+   }

This is really a good question. ReindexPartitions() would trigger one transaction per leaf to work on. Changing the tablespace of the partitioned table(s) before doing any work has the advantage of telling any new partition to use the new tablespace. Now, I see a struggling point here: what should we do if the processing fails in the middle of the move, leaving a portion of the leaves in the previous tablespace? On a follow-up reindex with the same command, should the command force a reindex even on the partitions that have been moved? Or could there be a point in skipping the partitions that are already on the new tablespace and only processing the ones on the previous tablespace? It seems to me that the first scenario makes the most sense, as currently a REINDEX works on all the relations defined, though there could be use cases for the second one. This should be documented, I think.

There are no tests for partitioned tables, aka we'd want to make sure that the new partitioned index is on the correct tablespace, as well as all its leaves. It may be better to have at least two levels of partitioned tables, as well as a partitioned table with no leaves in the cases dealt with.

+    *
+    * Even if a table's indexes were moved to a new tablespace, the index
+    * on its toast table is not normally moved.
+    */

Still, REINDEX (TABLESPACE) TABLE should move all of them to be consistent with ALTER TABLE SET TABLESPACE, but that's not the case with this code, no? This requires proper test coverage, but there is nothing of the kind in this patch.
--
Michael
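For readers unfamiliar with the patch set being reviewed here, it adds a TABLESPACE option to REINDEX. A hedged sketch of the intended usage, following the syntax discussed in this thread (object and tablespace names are made up, and the grammar may still change):

-- rebuild one table's indexes, writing the rebuilt indexes into a new tablespace
REINDEX (TABLESPACE new_ts, VERBOSE) TABLE my_table;

-- schema-wide variant; per the review above, unsuitable (e.g. mapped) relations
-- are expected to be skipped with a WARNING
REINDEX (TABLESPACE new_ts) SCHEMA my_schema;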
Re: Stronger safeguard for archive recovery not to miss data
On Wed, 2021-01-20 at 13:10 +0900, Fujii Masao wrote:
> +        errhint("Run recovery again from a new base backup taken after setting wal_level higher than minimal")));
>
> Isn't it impossible to do this in normal archive recovery case? In that case,
> since the server crashed and the database got corrupted, probably
> we cannot take a new base backup.
>
> Originally even when users accidentally set wal_level to minimal, they could
> start the server from the backup by disabling hot_standby and salvage the data.
> But with the patch, we lost the way to do that. Right? I was wondering that
> WARNING was used intentionally there for that case.

I would argue that if you set your "wal_level" to minimal, do some work, reset it to "replica" and recover past that, two things could happen:

1. Since "archive_mode" was off, you are missing some WAL segments and cannot recover past that point anyway.

2. You are missing some relevant WAL records, and your recovered database is corrupted. This is very likely, because you probably switched to "minimal" with the intention of generating less WAL.

Now I see your point that a corrupted database may be better than no database at all, but wouldn't you agree that a warning in the log (that nobody reads) is too little information? Normally we don't take such a relaxed attitude towards data corruption.

Perhaps there could be a GUC "recovery_allow_data_corruption" to override this check, but I'd say that a warning is too little.

>     if (ControlFile->wal_level < WAL_LEVEL_REPLICA)
>         ereport(ERROR,
>                 (errmsg("hot standby is not possible because wal_level was not set to \"replica\" or higher on the primary server"),
>                  errhint("Either set wal_level to \"replica\" on the primary, or turn off hot_standby here.")));
>
> With the patch, we never reach the above code?

Right, that would have to go. I didn't notice that.

Yours,
Laurenz Albe
Re: TOAST condition for column size
On 2021-01-19 19:32, Amit Kapila wrote:
> On Mon, Jan 18, 2021 at 7:53 PM torikoshia wrote:
>
> Because no benefit is to be expected by compressing it. The size will
> be mostly the same. Also, even if we somehow try to fit this data via
> toast, I think reading speed will be slower because for all such
> columns an extra fetch from toast would be required. Another thing is
> you or others can still face the same problem with 17-byte column
> data. I don't think this is the right way to fix it. I don't have many
> good ideas, but I think you can try (a) increasing the block size during
> configure, (b) reducing the number of columns, (c) creating char columns
> of somewhat bigger size, say greater than 24 bytes, to accommodate your
> case. I know none of these are good workarounds, but at this moment I
> can't think of better alternatives.

Thanks for your explanation and the workarounds!

On 2021-01-20 00:40, Tom Lane wrote:
> Dilip Kumar writes:
> > On Tue, 19 Jan 2021 at 6:28 PM, Amit Kapila wrote:
> > > Won't it be safe because we don't align individual attrs of type
> > > varchar where length is less than equal to 127?
> > Yeah right, I just missed that point.
>
> Yeah, the minimum on biggest_size has nothing to do with alignment
> decisions. It's just a filter to decide whether it's worth trying to
> toast anything.
>
> Having said that, I'm pretty skeptical of this patch: I think its most
> likely real-world effect is going to be to waste cycles (and create
> TOAST-table bloat) on the way to failing anyway. I do not think that
> toasting a 20-byte field down to 18 bytes is likely to be a productive
> thing to do in typical situations. The given example looks like a
> cherry-picked edge case rather than a useful case to worry about.

I agree with you; it seems to only help when there are many columns with 19 ~ 23 bytes of data, and that's not a normal case. I'm not sure, but a rare exception might be some geographic data. That's the situation in which I heard the problem happened.

Regards,

--
Atsushi Torikoshi
Re: Added schema level support for publication.
Hi Vignesh,

> I have handled the above scenario (drop schema should automatically
> remove the schema entry from publication schema relation) and the
> addition of tests in the new v2 patch attached.
> Thoughts?

Please see some initial comments:

1. I think there should be more tests to show that the schema data is actually replicated to the subscriber. Currently, I am not seeing the data being replicated when I use FOR SCHEMA.

2. How does replication behave when a table is added to or removed from a subscribed schema using ALTER TABLE SET SCHEMA?

3. Can we have a default schema like public or the current schema that gets replicated in case the user didn't specify one? This can be handy for replicating the current schema's tables.

4.
+   The fourth, fifth and sixth variants change which schemas are part of the
+   publication. The SET TABLE clause will replace the list
+   of schemas in the publication with the specified one. The ADD

There is a typo above: s/SET TABLE/SET SCHEMA

Thank you,
Rahila Syed
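For context, a hedged sketch of the commands this patch is meant to enable, based on the FOR SCHEMA syntax discussed above (the feature is still under development, so the exact grammar may change; schema and publication names are illustrative):

-- publish every table in one schema
CREATE PUBLICATION sales_pub FOR SCHEMA sales;

-- adjust which schemas are part of the publication later on
ALTER PUBLICATION sales_pub ADD SCHEMA archive;
ALTER PUBLICATION sales_pub SET SCHEMA sales;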
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2021/01/20 19:17, Bharath Rupireddy wrote:
> On Wed, Jan 20, 2021 at 3:24 PM Fujii Masao wrote:
> > > Keeping above in mind, I feel we can do hash_seq_search(), as we do
> > > currently, even when the server name is given as input. This way, we
> > > don't need to bother much on the above points.
> > >
> > > Thoughts?
> >
> > Thanks for explaining this! You're right. I'd withdraw my suggestion.
>
> Attaching v14 patch set with review comments addressed. Please review it further.

Thanks for updating the patch!

+ * It checks if the cache has a connection for the given foreign server that's
+ * not being used within current transaction, then returns true. If the
+ * connection is in use, then it emits a warning and returns false.

The comment also should mention the case where no open connection for the given server is found? What about rewriting this to the following?

-------------------------------------------------------------------------
If the cached connection for the given foreign server is found and has not been used within current transaction yet, close the connection and return true. Even when it's found, if it's already used, keep the connection, emit a warning and return false. If it's not found, return false.
-------------------------------------------------------------------------

+ * It returns true, if it closes at least one connection, otherwise false.
+ *
+ * It returns false, if the cache doesn't exit.

The above second comment looks redundant.

+       if (ConnectionHash)
+               result = disconnect_cached_connections(0, true);

Isn't it smarter to make disconnect_cached_connections() check ConnectionHash and return false if it's NULL? If we do that, we can simplify the code of postgres_fdw_disconnect() and _all().

+ * current transaction are disconnected. Otherwise, the unused entries with the
+ * given hashvalue are disconnected.

In the above second comment, a singular form should be used instead? Because there must be no multiple entries with the given hashvalue.

+               server = GetForeignServer(entry->serverid);

This seems to cause an error "cache lookup failed" if postgres_fdw_disconnect_all() is called when there is a connection in use but its server is dropped. To avoid this error, GetForeignServerExtended() with FSV_MISSING_OK should be used instead, like postgres_fdw_get_connections() does?

+               if (entry->server_hashvalue == hashvalue &&
+                       (entry->xact_depth > 0 || result))
+               {
+                       hash_seq_term(&scan);
+                       break;

entry->server_hashvalue can be 0? If yes, since postgres_fdw_disconnect_all() specifies 0 as hashvalue, ISTM that the above condition can be true unexpectedly. Can we replace this condition with just "if (!all)"?

+-- Closes loopback connection, returns true and issues a warning as loopback2
+-- connection is still in use and can not be closed.
+SELECT * FROM postgres_fdw_disconnect_all();
+WARNING:  cannot close connection for server "loopback2" because it is still in use
+ postgres_fdw_disconnect_all
+-----------------------------
+ t
+(1 row)

After the above test, isn't it better to call postgres_fdw_get_connections() to check that loopback is not output?

+WARNING:  cannot close connection for server "loopback" because it is still in use
+WARNING:  cannot close connection for server "loopback2" because it is still in use

Just in case, please let me confirm that the order of these warning messages is always stable?

+
+postgres_fdw_disconnect(IN servername text) returns boolean

I think that "IN" of "IN servername text" is not necessary. I'd like to replace "servername" with "server_name" because postgres_fdw_get_connections() uses "server_name" as the output column name.

+
+
+  When called in local session with foreign server name as input, it
+  discards the unused open connection previously made to the foreign server
+  and returns true.

"unused open connection" sounds confusing to me. What about the following?

-------------------------------------------------------------------------
This function discards the open connection that postgres_fdw established from the local session to the foreign server with the given name if it's not used in the current local transaction yet, and then returns true. If it's already used, the function doesn't discard the connection, emits a warning and then returns false. If there is no open connection to the given foreign server, false is returned. If no foreign server with the given name is found, an error is emitted.

Example usage of the function:
-------------------------------------------------------------------------

+postgres=# SELECT * FROM postgres_fdw_disconnect('loopback1');

"SELECT postgres_fdw_disconnect('loopback1')" is more common?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: Deleting older versions in unique indexes to avoid page splits
On Wed, Jan 20, 2021 at 10:58 AM Peter Geoghegan wrote:
>
> On Tue, Jan 19, 2021 at 7:54 PM Amit Kapila wrote:
> > The worst cases could be (a) when there is just one such duplicate
> > (indexval logically unchanged) on the page and that happens to be the
> > last item and others are new insertions, (b) same as (a) and along
> > with it lets say there is an open transaction due to which we can't
> > remove even that duplicate version. Have we checked the overhead or
> > results by simulating such workloads?
>
> There is no such thing as a workload that has page splits caused by
> non-HOT updaters, but almost no actual version churn from the same
> non-HOT updaters. It's possible that a small number of individual page
> splits will work out like that, of course, but they'll be extremely
> rare, and impossible to see in any kind of consistent way.
>
> That just leaves long running transactions. Of course it's true that
> eventually a long-running transaction will make it impossible to
> perform any cleanup, for the usual reasons. And at that point this
> mechanism is bound to fail (which costs additional cycles -- the
> wasted access to a single heap page, some CPU cycles). But it's still
> a bargain to try. Even with a long running transactions there will be
> a great many bottom-up deletion passes that still succeed earlier on
> (because at least some of the dups are deletable, and we can still
> delete those that became garbage right before the long running
> snapshot was acquired).

How many of the dups are deletable till there is an open long-running transaction in the system before the transaction that has performed an update? I tried a simple test to check this.

create table t1(c1 int, c2 int, c3 char(10));
create index idx_t1 on t1(c1);
create index idx_t2 on t1(c2);

insert into t1 values(generate_series(1,5000),1,'aa');
update t1 set c2 = 2;

The update will try to remove the tuples via the bottom-up cleanup mechanism for index 'idx_t1' and won't be able to remove any tuple because the duplicates are from the same transaction.

> Victor independently came up with a benchmark that ran over several
> hours, with cleanup consistently held back by ~5 minutes by a long
> running transaction:

AFAICS, the long-running transaction used in the test is below:

SELECT abalance, pg_sleep(300) FROM pgbench_accounts
WHERE mtime > now() - INTERVAL '15min' ORDER BY aid LIMIT 1;

This shouldn't generate a transaction id, so would it be sufficient to hold back the clean-up?

> https://www.postgresql.org/message-id/cagnebogatzs1mwmvx8fzzhmxzudecb10anvwwhctxtibpg3...@mail.gmail.com
>
> This was actually one of the most favorable cases of all for the patch
> -- the patch prevented logically unchanged indexes from growing (this
> is a mix of inserts, updates, and deletes, not just updates, so it was
> less than perfect -- we did see the indexes grow by a half of one
> percent). Whereas without the patch each of the same 3 indexes grew by
> 30% - 60%.
>
> So yes, I did think about long running transactions, and no, the
> possibility of wasting one heap block access here and there when the
> database is melting down anyway doesn't seem like a big deal to me.

First, it is not clear to me if that has properly simulated the long-running case, but even if it has, what I intended to say was to have an open long-running transaction possibly for the entire duration of the test. If we do that, we will come to know whether there is any overhead and, if so, how much.

> > I feel unlike LP_DEAD optimization this new bottom-up scheme can cost
> > us extra CPU and I/O because there seems to be not much consideration
> > given to the fact that we might not be able to delete any item (or
> > very few) due to long-standing open transactions except that we limit
> > ourselves when we are not able to remove even one tuple from any
> > particular heap page.
>
> There was plenty of consideration given to that. It was literally
> central to the design, and something I poured over at length. Why
> don't you go read some of that now? Or, why don't you demonstrate an
> actual regression using a tool like pgbench?
>
> I do not appreciate being accused of having acted carelessly. You
> don't have a single shred of evidence.

I think you have done a good job and I am just trying to see if there are any loose ends which we can tighten up. Anyway, here are results from some simple performance tests:

Test with 2 un-modified indexes
===============================
create table t1(c1 int, c2 int, c3 int, c4 char(10));
create index idx_t1 on t1(c1);
create index idx_t2 on t1(c2);
create index idx_t3 on t1(c3);

insert into t1 values(generate_series(1,500), 1, 10, 'aa');
update t1 set c2 = 2;

Without nbtree mod (without commit d168b66682)
==============================================
postgres=# update t1 set c2 = 2;
UPDATE 500
Time: 46533.530 ms (00:46.534)

With HEAD
=========
postgres=# update t1 set c2 = 2;
UPDATE 500
Time: 525
Re: Deleting older versions in unique indexes to avoid page splits
On Wed, Jan 20, 2021 at 7:03 PM Amit Kapila wrote: > > On Wed, Jan 20, 2021 at 10:58 AM Peter Geoghegan wrote: > > > > On Tue, Jan 19, 2021 at 7:54 PM Amit Kapila wrote: > > > The worst cases could be (a) when there is just one such duplicate > > > (indexval logically unchanged) on the page and that happens to be the > > > last item and others are new insertions, (b) same as (a) and along > > > with it lets say there is an open transaction due to which we can't > > > remove even that duplicate version. Have we checked the overhead or > > > results by simulating such workloads? > > > > There is no such thing as a workload that has page splits caused by > > non-HOT updaters, but almost no actual version churn from the same > > non-HOT updaters. It's possible that a small number of individual page > > splits will work out like that, of course, but they'll be extremely > > rare, and impossible to see in any kind of consistent way. > > > > That just leaves long running transactions. Of course it's true that > > eventually a long-running transaction will make it impossible to > > perform any cleanup, for the usual reasons. And at that point this > > mechanism is bound to fail (which costs additional cycles -- the > > wasted access to a single heap page, some CPU cycles). But it's still > > a bargain to try. Even with a long running transactions there will be > > a great many bottom-up deletion passes that still succeed earlier on > > (because at least some of the dups are deletable, and we can still > > delete those that became garbage right before the long running > > snapshot was acquired). > > > > How many ... > Typo. /many/any -- With Regards, Amit Kapila.
Re: Printing backtrace of postgres processes
On Wed, Jan 20, 2021 at 2:52 AM Tom Lane wrote: > > Robert Haas writes: > > On Tue, Jan 19, 2021 at 12:50 PM Tom Lane wrote: > >> I think it's got security hazards as well. If we restricted the > >> feature to cause a trace of only one process at a time, and required > >> that process to be logged in as the same database user as the > >> requestor (or at least someone with the privs of that user, such > >> as a superuser), that'd be less of an issue. > > > I am not sure I see a security hazard here. I think the checks for > > this should have the same structure as pg_terminate_backend() and > > pg_cancel_backend(); whatever is required there should be required > > here, too, but not more, unless we have a real clear reason for such > > an inconsistency. > > Yeah, agreed. The "broadcast" option seems inconsistent with doing > things that way, but I don't have a problem with being able to send > a trace signal to the same processes you could terminate. > The current patch supports both getting the trace for all the processes of that instance and getting the trace for a particular process, I'm understanding the concern here with broadcasting to all the connected backends, I will remove the functionality to get trace for all the processes. And I will also change the trace of a single process in such a way that the user can get the trace of only the processes which that user could terminate. > >> I share your estimate that there'll be small-fraction-of-a-percent > >> hazards that could still add up to dangerous instability if people > >> go wild with this. > > > Right. I was more concerned about whether we could, for example, crash > > while inside the function that generates the backtrace, on some > > platforms or in some scenarios. That would be super-sad. I assume that > > the people who wrote the code tried to make sure that didn't happen > > but I don't really know what's reasonable to expect. > > Recursion is scary, but it should (I think) not be possible if this > is driven off CHECK_FOR_INTERRUPTS. There will certainly be none > of those in libbacktrace. > > One point here is that it might be a good idea to suppress elog.c's > calls to functions in the error context stack. As we saw in another > connection recently, allowing that to happen makes for a *very* > large increase in the footprint of code that you are expecting to > work at any random CHECK_FOR_INTERRUPTS call site. > > BTW, it also looks like the patch is doing nothing to prevent the > backtrace from being sent to the connected client. I'm not sure > what I think about whether it'd be okay from a security standpoint > to do that on the connection that requested the trace, but I sure > as heck don't want it to happen on connections that didn't. Also, > whatever you think about security concerns, it seems like there'd be > pretty substantial reentrancy hazards if the interrupt occurs > anywhere in code dealing with normal client communication. > > Maybe, given all of these things, we should forget using elog at > all and just emit the trace with fprintf(stderr). That seems like > it would decrease the odds of trouble by about an order of magnitude. > It would make it more painful to collect the trace in some logging > setups, of course. I would prefer if the trace appears in the log file rather on the console, as you rightly pointed out it will be difficult to collect the trace of fprint(stderr). Let me know if I misunderstood. I think you are concerned about the problem where elog logs the trace to the client also. 
Can we use the LOG_SERVER_ONLY log level, which would prevent it from being logged to the client? That way we can keep the elog implementation itself. Thoughts? Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
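For illustration, a minimal sketch of what the LOG_SERVER_ONLY route could look like (this is only a sketch, not the actual patch; the helper name is made up, while errbacktrace() and errhidestmt() are existing elog.c facilities):

#include "postgres.h"
#include "miscadmin.h"

/*
 * Hypothetical helper, invoked from the CHECK_FOR_INTERRUPTS-driven
 * handler.  LOG_SERVER_ONLY keeps elog.c's formatting and routing but
 * never sends the message to the connected client; errbacktrace()
 * attaches the backtrace and errhidestmt() keeps the current statement
 * out of the log line.
 */
static void
emit_backtrace_to_server_log(void)
{
	ereport(LOG_SERVER_ONLY,
			(errmsg("backtrace of PID %d", MyProcPid),
			 errbacktrace(),
			 errhidestmt(true)));
}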
Re: Deleting older versions in unique indexes to avoid page splits
On Wed, Jan 20, 2021 at 10:50 AM Andres Freund wrote: > > Hi, > > On 2021-01-20 09:24:35 +0530, Amit Kapila wrote: > > I feel extending the deletion mechanism based on the number of LP_DEAD > > items sounds more favorable than giving preference to duplicate > > items. Sure, it will give equally good or better results if there are > > no long-standing open transactions. > > There's a lot of workloads that never set LP_DEAD because all scans are > bitmap index scans. And there's no obvious way to address that. So I > don't think it's wise to purely rely on LP_DEAD. > Right, I understand this point. The point I was trying to make was that in this new technique we might not be able to delete any tuple (or maybe very few) if there are long-running open transactions in the system and still incur a CPU and I/O cost. I am completely in favor of this technique and patch, so don't get me wrong. As mentioned in my reply to Peter, I am just trying to see if there are more ways we can use this optimization and reduce the chances of regression (if there is any). -- With Regards, Amit Kapila.
Re: Parallel INSERT (INTO ... SELECT ...)
On Wed, Jan 20, 2021 at 3:27 PM Greg Nancarrow wrote: > > On Fri, Jan 8, 2021 at 8:16 PM Amit Kapila wrote: > > > > > > I'm not sure if the problem of missing targetlist should be handled here > > > (BTW, > > > NIL is the constant for an empty list, not NULL). Obviously this is a > > > consequence of the fact that the ModifyTable node has no regular > > > targetlist. > > > > > > > I think it is better to add comments along with this change. In this > > form, this looks quite hacky to me. > > > > The targetlist on the ModifyTable node has been setup correctly, but > it hasn't been propagated to the Gather node. > Of course, I have previously tried to elegantly fix this, but struck > various problems, using different approaches. > Perhaps this code could just be moved into set_plan_refs(). > For the moment, I'll just keep the current code, but I'll add a FIXME > comment for this. > I'll investigate Antonin's suggestions as a lower-priority side-task. > +1, that sounds like a good approach because anyway this has to be dealt for the second patch and at this point, we should make the first patch ready. -- With Regards, Amit Kapila.
Re: Discarding DISCARD ALL
I hope to do further review of the patch later this week, but I wanted to at least comment on this piece: On Wed, Jan 20, 2021 at 2:48 AM Peter Eisentraut wrote: > > On 2020-12-23 15:33, Simon Riggs wrote: > > Poolers such as pgbouncer would then be able to connect transaction > > mode pools by setting transaction_cleanup=on at time of connection, > > avoiding any need to issue a server_reset_query, removing the DISCARD > > ALL command from the normal execution path, while still achieving the > > same thing. > > PgBouncer does not send DISCARD ALL in transaction mode. There is a > separate setting to do that, but it's not the default, and it's more of > a workaround for bad client code. So I don't know if this feature would > be of much use for PgBouncer. Other connection poolers might have other > opinions. Yes, to have server_reset_query apply in transaction pooling mode you have to additionally configure pgbouncer with server_reset_query_always enabled. I'd mildly take issue with "a workaround for bad client code". Yes, clients in transaction pooling mode shouldn't issue (for example) `SET ...`, but there's no way I'm aware of in Postgres to prevent session-specific items like those GUCs from being set by a given user, so I view it more like a safeguard than a workaround. In our setup we have server_reset_query_always=1 as such a safeguard, because it's too easy for application code to update, for example, statement_timeout to disastrous results. But we also work to make sure those don't happen (or get cleaned up if they happen to slip in). An alternative approach that occurred to me while typing this reply: a setting in Postgres that would disallow setting session level GUCs (i.e., enforce `SET LOCAL` transaction level usage instead) would remove a large chunk of our need to set server_reset_query_always=1 (and more interestingly it'd highlight when broken code gets pushed). But even with that, I see some value in the proposed setting since there is additional session state beyond GUCs. James
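To make the SET vs. SET LOCAL distinction concrete, here is a minimal illustration of the hazard described above (statement_timeout is just an example GUC):

-- Session-level SET survives COMMIT, so under transaction pooling it leaks
-- onto whichever client next reuses this server connection.
BEGIN;
SET statement_timeout = 0;
COMMIT;

-- SET LOCAL reverts at COMMIT/ROLLBACK, so it stays confined to this
-- transaction even when the server connection is shared.
BEGIN;
SET LOCAL statement_timeout = '30s';
COMMIT;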
Re: Discarding DISCARD ALL
On Wed, 20 Jan 2021 at 14:21, James Coleman wrote: > An alternative approach that occurred to me while typing this reply: a > setting in Postgres that would disallow setting session level GUCs > (i.e., enforce `SET LOCAL` transaction level usage instead) would > remove a large chunk of our need to set server_reset_query_always=1 > (and more interestingly it'd highlight when broken code gets pushed). > But even with that, I see some value in the proposed setting since > there is additional session state beyond GUCs. With transaction_cleanup=on we could force all SETs to be SET LOCAL. The point is that if we declare ahead of time that the transaction will be reset then we can act differently and more easily for various circumstances, for SETs, for Temp tables and others. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: [HACKERS] [PATCH] Generic type subscripting
On Tue Jan 19, 2021 at 1:42 PM EST, Pavel Stehule wrote: > Hi > > I found minor issues. > > Doc - missing tag > > and three whitespaces issues > > see attached patch > > Following sentence is hard to read due long nested example > > If the > + path contradicts structure of modified jsonb for any > individual > + value (e.g. path val['a']['b']['c'] assumes keys > + 'a' and 'b' have object values > + assigned to them, but if val['a'] or > + val['b'] is null, a string, or a number, then the > path > + contradicts with the existing structure), an error is raised even if > other > + values do conform. > > It can be divided into two sentences - predicate, and example. > > Regards > > Pavel Here's a full editing pass on the documentation, with v45 and Pavel's doc-whitespaces-fix.patch applied. I also corrected a typo in one of the added hints. From 086a34ca860e8513484d829db1cb3f0c17c4ec1e Mon Sep 17 00:00:00 2001 From: Dian M Fay Date: Tue, 19 Jan 2021 23:44:23 -0500 Subject: [PATCH] Revise jsonb subscripting documentation --- doc/src/sgml/json.sgml | 101 +--- src/backend/utils/adt/jsonbsubs.c | 2 +- src/test/regress/expected/jsonb.out | 2 +- 3 files changed, 49 insertions(+), 56 deletions(-) diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml index 4e19fe4fb8..eb3952193a 100644 --- a/doc/src/sgml/json.sgml +++ b/doc/src/sgml/json.sgml @@ -605,97 +605,90 @@ SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"tags": ["qu jsonb Subscripting - jsonb data type supports array-style subscripting expressions - to extract or update particular elements. It's possible to use multiple - subscripting expressions to extract nested values. In this case, a chain of - subscripting expressions follows the same rules as the - path argument in jsonb_set function, - e.g. in case of arrays it is a 0-based operation or that negative integers - that appear in path count from the end of JSON arrays. - The result of subscripting expressions is always jsonb data type. + The jsonb data type supports array-style subscripting expressions + to extract and modify elements. Nested values can be indicated by chaining + subscripting expressions, following the same rules as the path + argument in the jsonb_set function. If a jsonb + value is an array, numeric subscripts start at zero, and negative integers count + backwards from the last element of the array. Slice expressions are not supported. + The result of a subscripting expression is always of the jsonb data type. UPDATE statements may use subscripting in the - SET clause to modify jsonb values. Every - affected value must conform to the path defined by the subscript(s). If the - path contradicts structure of modified jsonb for any individual - value (e.g. path val['a']['b']['c'] assumes keys - 'a' and 'b' have object values - assigned to them, but if val['a'] or - val['b'] is null, a string, or a number, then the path - contradicts with the existing structure), an error is raised even if other - values do conform. + SET clause to modify jsonb values. Object + values being traversed must exist as specified by the subscript path. For + instance, the path val['a']['b']['c'] assumes that + val, val['a'], and val['a']['b'] + are all objects in every record being updated (val['a']['b'] + may or may not contain a field named c, as long as it's an + object). If any individual val, val['a'], + or val['a']['b'] is a non-object such as a string, a number, + or NULL, an error is raised even if other values do conform. 
+ Array values are not subject to this restriction, as detailed below. - + An example of subscripting syntax: + --- Extract value by key +-- Extract object value by key SELECT ('{"a": 1}'::jsonb)['a']; --- Extract nested value by key path +-- Extract nested object value by key path SELECT ('{"a": {"b": {"c": 1}}}'::jsonb)['a']['b']['c']; --- Extract element by index +-- Extract array element by index SELECT ('[1, "2", null]'::jsonb)[1]; --- Update value by key, note the single quotes - the assigned value --- needs to be of jsonb type as well +-- Update object value by key. Note the quotes around '1': the assigned +-- value must be of the jsonb type as well UPDATE table_name SET jsonb_field['key'] = '1'; --- This will raise an error if jsonb_field is {"a": 1} +-- This will raise an error if any record's jsonb_field['a']['b'] is something +-- other than an object. For example, the value {"a": 1} has no 'b' key. UPDATE table_name SET jsonb_field['a']['b']['c'] = '1'; --- Select records using where clause with subscripting. Since the result of --- subscripting is jsonb and we basically want to compare two jsonb objects, we --- need to put the value in double quotes to be able to convert it to jsonb. +-- Filter records using a WHERE clause with subscripting. Since the result of +-- subscripting
Re: Wrong usage of RelationNeedsWAL
At Wed, 20 Jan 2021 17:34:44 +0900 (JST), Kyotaro Horiguchi wrote in > Anyway, it seems actually dangerous that cause pruning on wal-skipped > relation. > > > with your patch versions. Could you try implementing both test procedures > > in > > src/test/modules/snapshot_too_old? There's no need to make the test use > > wal_level=minimal explicitly; it's sufficient to catch these bugs when > > wal_level=minimal is in the TEMP_CONFIG file. > > In the attached, TestForOldSnapshot() considers XLogIsNeeded(), > instead of moving the condition on LSN to TestForOldSnapshot_impl for > performance. > > I'll add the test part in the next version. This is it. However, with the previous patch, two existing tests sto_using_cursor and sto_using_select behaves differently from the master. That change is coming from the omission of actual LSN check in TestForOldSnapshot while wal_level=minimal. So we have no choice other than actually updating page LSN. In the scenario under discussion there are two places we need to do that. one is heap_page_prune, and the other is RelationCopyStorge. As a PoC, I used gistXLogAssignLSN() as is for thie purpose. See the attached third file. If it is the right direction, I will move XLOG_GIST_ASSIGN_LSN to XLOG_ASSIGN_LSN and move gistXLogAssignLSN() to XLogAdvanceLSN() or XLogNop() or such. With the third patch, the test succedds both wal_level = replica and minimal. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 332ac10844befb8a9c00df9f68993eb3696f0a85 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Wed, 20 Jan 2021 20:57:03 +0900 Subject: [PATCH v5 1/3] Test for snapshot too old and wal_level=minimal --- src/test/modules/snapshot_too_old/Makefile| 12 - .../expected/sto_wal_optimized.out| 24 ++ .../input_sto/sto_wal_optimized.spec | 44 +++ src/test/modules/snapshot_too_old/sto.conf| 3 ++ 4 files changed, 82 insertions(+), 1 deletion(-) create mode 100644 src/test/modules/snapshot_too_old/expected/sto_wal_optimized.out create mode 100644 src/test/modules/snapshot_too_old/input_sto/sto_wal_optimized.spec diff --git a/src/test/modules/snapshot_too_old/Makefile b/src/test/modules/snapshot_too_old/Makefile index dfb4537f63..1e0b0b6efc 100644 --- a/src/test/modules/snapshot_too_old/Makefile +++ b/src/test/modules/snapshot_too_old/Makefile @@ -4,7 +4,7 @@ # we have to clean those result files explicitly EXTRA_CLEAN = $(pg_regress_clean_files) -ISOLATION = sto_using_cursor sto_using_select sto_using_hash_index +ISOLATION = sto_using_cursor sto_using_select sto_using_hash_index sto_wal_optimized ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/snapshot_too_old/sto.conf # Disabled because these tests require "old_snapshot_threshold" >= 0, which @@ -22,6 +22,16 @@ include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif +.PHONY: tablespace-setup specfile-setup +tablespace-setup: + rm -rf ./testtablespace + mkdir ./testtablespace + +specfile-setup: input_sto/*.spec + sed 's!@srcdir@!$(realpath $(top_srcdir))!g' $? > specs/$(notdir $?) + +check: tablespace-setup specfile-setup + # But it can nonetheless be very helpful to run tests on preexisting # installation, allow to do so, but only if requested explicitly. 
installcheck-force: diff --git a/src/test/modules/snapshot_too_old/expected/sto_wal_optimized.out b/src/test/modules/snapshot_too_old/expected/sto_wal_optimized.out new file mode 100644 index 00..4038d0a713 --- /dev/null +++ b/src/test/modules/snapshot_too_old/expected/sto_wal_optimized.out @@ -0,0 +1,24 @@ +Parsed test spec with 3 sessions + +starting permutation: s1a1 s1a2 s2a1 s2a2 s1b1 a3-0 a3-1 s3-2 s3-3 s3-4 s2b1 +step s1a1: BEGIN; +step s1a2: DELETE FROM t; +step s2a1: BEGIN ISOLATION LEVEL REPEATABLE READ; +step s2a2: SELECT 1; +?column? + +1 +step s1b1: COMMIT; +step a3-0: SELECT setting, pg_sleep(6) FROM pg_settings WHERE name = 'old_snapshot_threshold'; +settingpg_sleep + +0 +step a3-1: BEGIN; +step s3-2: ALTER TABLE t SET TABLESPACE tsp1; +step s3-3: SELECT count(*) FROM t; +count + +0 +step s3-4: COMMIT; +step s2b1: SELECT count(*) FROM t; +ERROR: snapshot too old diff --git a/src/test/modules/snapshot_too_old/input_sto/sto_wal_optimized.spec b/src/test/modules/snapshot_too_old/input_sto/sto_wal_optimized.spec new file mode 100644 index 00..02adb50581 --- /dev/null +++ b/src/test/modules/snapshot_too_old/input_sto/sto_wal_optimized.spec @@ -0,0 +1,44 @@ +# This test provokes a "snapshot too old" error +# +# The sleep is needed because with a threshold of zero a statement could error +# on changes it made. With more normal settings no external delay is needed, +# but we don't want these tests to run long enough to see that, since +# granularity is in minutes. +# +# Since results depend on the valu
Re: pg_stat_statements oddity with track = all
On Wed, Jan 20, 2021 at 6:15 PM Julien Rouhaud wrote: > > Hi, > > On Tue, Jan 19, 2021 at 4:55 PM Masahiro Ikeda > wrote: > > > > Thanks for making the patch to add "toplevel" column in > > pg_stat_statements. > > This is a review comment. > > Thanks a lot for the thorough review! > > > I tested the "update" command can work. > > postgres=# ALTER EXTENSION pg_stat_statements UPDATE TO '1.10'; > > > > Although the "toplevel" column of all queries which already stored is > > 'false', > > we have to decide the default. I think 'false' is ok. > > Mmm, I'm not sure that I understand this result. The "toplevel" value > is tracked by the C code loaded at startup, so it should have a > correct value even if you used to have the extension in a previous > version, it's just that you can't access the toplevel field before > doing the ALTER EXTENSION pg_stat_statements UPDATE. There's also a > change in the magic number, so you can't use the stored stat file from > a version before this patch. > > I also fail to reproduce this behavior. I did the following: > > - start postgres with pg_stat_statements v1.10 (so with this patch) in > shared_preload_libraries > - CREATE EXTENSION pg_stat_statements VERSION '1.9'; > - execute a few queries > - ALTER EXTENSION pg_stat_statements UPDATE; > - SELECT * FROM pg_stat_statements reports the query with toplvel to TRUE > > Can you share a way to reproduce your findings? > > > 2. usability review > > > > [...] > > Although one concern is whether only top-level is enough or not, > > I couldn't come up with any use-case to use nested level, so I think > > it's ok. > > I agree, I don't see how tracking statistics per nesting level would > really help. The only additional use case I see would tracking > triggers/FK query execution, but the nesting level won't help with > that. > > > 3. coding review > > = > > [...] > > B. add a argument of the pg_stat_statements_reset(). > > > > Now, pg_stat_statements supports a flexible reset feature. > > > pg_stat_statements_reset(userid Oid, dbid Oid, queryid bigint) > > > > Although I wondered whether we need to add a top-level flag to the > > arguments, > > I couldn't come up with any use-case to reset only top-level queries or > > not top-level queries. > > Ah, I didn't think of the reset function. I also fail to see a > reasonable use case to provide a switch for the reset function. > > > 4. others > > == > > > > These comments are not related to this patch. > > > > A. about the topic of commitfests > > > > Since I think this feature is for monitoring, > > it's better to change the topic from "System Administration" > > to "Monitoring & Control". > > I agree, thanks for the change! I've changed the topic accordingly. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
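For reference, Julien's reproduction steps above amount to roughly the following, assuming the patched pg_stat_statements (v1.10) is already loaded via shared_preload_libraries:

CREATE EXTENSION pg_stat_statements VERSION '1.9';
SELECT count(*) FROM pg_class;              -- execute a few queries
ALTER EXTENSION pg_stat_statements UPDATE;  -- now at 1.10
-- per Julien, toplevel is expected to report true for these queries
SELECT query, calls, toplevel FROM pg_stat_statements;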
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-Jan-20, Michael Paquier wrote: > +/* > + * This is mostly duplicating ATExecSetTableSpaceNoStorage, > + * which should maybe be factored out to a library function. > + */ > Wouldn't it be better to do first the refactoring of 0002 and then > 0001 so as REINDEX can use the new routine, instead of putting that > into a comment? I think merging 0001 and 0002 into a single commit is a reasonable approach. I don't oppose an initial refactoring commit if you want to do that, but it doesn't seem necessary. -- Álvaro Herrera39°49'30"S 73°17'W
Re: Discarding DISCARD ALL
On Wed, Jan 20, 2021 at 9:58 AM Simon Riggs wrote: > > On Wed, 20 Jan 2021 at 14:21, James Coleman wrote: > > > An alternative approach that occurred to me while typing this reply: a > > setting in Postgres that would disallow setting session level GUCs > > (i.e., enforce `SET LOCAL` transaction level usage instead) would > > remove a large chunk of our need to set server_reset_query_always=1 > > (and more interestingly it'd highlight when broken code gets pushed). > > But even with that, I see some value in the proposed setting since > > there is additional session state beyond GUCs. > > With transaction_cleanup=on we could force all SETs to be SET LOCAL. > > The point is that if we declare ahead of time that the transaction > will be reset then we can act differently and more easily for various > circumstances, for SETs, for Temp tables and others. Right, I agree it's independently useful. My "alternative" is a subset of that functionality and doesn't cover as many cases. James
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-Jan-20, Alvaro Herrera wrote: > On 2021-Jan-20, Michael Paquier wrote: > > > +/* > > + * This is mostly duplicating ATExecSetTableSpaceNoStorage, > > + * which should maybe be factored out to a library function. > > + */ > > Wouldn't it be better to do first the refactoring of 0002 and then > > 0001 so as REINDEX can use the new routine, instead of putting that > > into a comment? > > I think merging 0001 and 0002 into a single commit is a reasonable > approach. ... except it doesn't make a lot of sense to have set_rel_tablespace in either indexcmds.c or index.c. I think tablecmds.c is a better place for it. (I would have thought catalog/storage.c, but that one's not the right abstraction level it seems.) But surely ATExecSetTableSpaceNoStorage should be using this new routine. (I first thought 0002 was doing that, since that commit is calling itself a "refactoring", but now that I look closer, it's not.) -- Álvaro Herrera39°49'30"S 73°17'W "On the other flipper, one wrong move and we're Fatal Exceptions" (T.U.X.: Term Unit X - http://www.thelinuxreview.com/TUX/)
Re: [HACKERS] [PATCH] Generic type subscripting
> On Tue Jan 19, 2021 at 1:42 PM EST, Pavel Stehule wrote: > > I found minor issues. > > Doc - missing tag > > and three whitespaces issues > > see attached patch Thanks, I need to remember to not skipp doc building for testing process even for such small changes. Hope now I didn't forget anything. > On Wed, Jan 20, 2021 at 09:58:43AM -0500, Dian M Fay wrote: > Here's a full editing pass on the documentation, with v45 and Pavel's > doc-whitespaces-fix.patch applied. I also corrected a typo in one of the > added hints. Great! I've applied almost all of it, except: + A jsonb value will accept assignments to nonexistent subscript + paths as long as the nonexistent elements being traversed are all arrays. Maybe I've misunderstood the intention, but there is no requirement about arrays for creating such an empty path. I've formulated it as: + A jsonb value will accept assignments to nonexistent subscript + paths as long as the last existing path key is an object or an array. >From a4037c651a0cfd2448f38b6c8c932b5a09136b0a Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Fri, 18 Dec 2020 17:19:51 +0100 Subject: [PATCH v46 1/3] Subscripting for jsonb Subscripting implementation for jsonb. It does not support slices, does not have a limit for number of subscripts and for assignment expects a replace value to be of jsonb type. There is also one functional difference in assignment via subscripting from jsonb_set, when an original jsonb container is NULL, subscripting replaces it with an empty jsonb and proceed with assignment. For the sake of code reuse, some parts of jsonb functionality were rearranged to allow use the same functions for jsonb_set and assign subscripting operation. The original idea belongs to Oleg Bartunov. Reviewed-by: Tom Lane, Arthur Zakirov, Pavel Stehule, Dian M Fay --- doc/src/sgml/json.sgml | 51 src/backend/utils/adt/Makefile | 1 + src/backend/utils/adt/jsonb_util.c | 76 - src/backend/utils/adt/jsonbsubs.c | 413 src/backend/utils/adt/jsonfuncs.c | 180 ++-- src/include/catalog/pg_proc.dat | 4 + src/include/catalog/pg_type.dat | 3 +- src/include/utils/jsonb.h | 6 +- src/test/regress/expected/jsonb.out | 272 +- src/test/regress/sql/jsonb.sql | 84 +- 10 files changed, 985 insertions(+), 105 deletions(-) create mode 100644 src/backend/utils/adt/jsonbsubs.c diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml index 5b9a5557a4..3ace5e444b 100644 --- a/doc/src/sgml/json.sgml +++ b/doc/src/sgml/json.sgml @@ -602,6 +602,57 @@ SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"tags": ["qu + + jsonb Subscripting + + The jsonb data type supports array-style subscripting expressions + to extract and modify elements. Nested values can be indicated by chaining + subscripting expressions, following the same rules as the path + argument in the jsonb_set function. If a jsonb + value is an array, numeric subscripts start at zero, and negative integers count + backwards from the last element of the array. Slice expressions are not supported. + The result of a subscripting expression is always of the jsonb data type. + + + + An example of subscripting syntax: + + +-- Extract object value by key +SELECT ('{"a": 1}'::jsonb)['a']; + +-- Extract nested object value by key path +SELECT ('{"a": {"b": {"c": 1}}}'::jsonb)['a']['b']['c']; + +-- Extract array element by index +SELECT ('[1, "2", null]'::jsonb)[1]; + +-- Update object value by key. 
Note the quotes around '1': the assigned +-- value must be of the jsonb type as well +UPDATE table_name SET jsonb_field['key'] = '1'; + +-- Filter records using a WHERE clause with subscripting. Since the result of +-- subscripting is jsonb, the value we compare it against must also be jsonb. +-- The double quotes make "value" also a valid jsonb string. +SELECT * FROM table_name WHERE jsonb_field['key'] = '"value"'; + + + jsonb assignment via subscripting handles a few edge cases + differently from jsonb_set. When a source jsonb + is NULL, assignment via subscripting will proceed as if + it was an empty JSON object: + + +-- Where jsonb_field was NULL, it is now {"a": 1} +UPDATE table_name SET jsonb_field['a'] = '1'; + +-- Where jsonb_field was NULL, it is now [1] +UPDATE table_name SET jsonb_field[0] = '1'; + + + + + Transforms diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile index 82732146d3..279ff15ade 100644 --- a/src/backend/utils/adt/Makefile +++ b/src/backend/utils/adt/Makefile @@ -50,6 +50,7 @@ OBJS = \ jsonb_op.o \ jsonb_util.o \ jsonfuncs.o \ + jsonbsubs.o \ jsonpath.o \ jsonpath_exec.o \ jsonpath_gram.o \ diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c index 4eeffa1424..41a1c1f9bb 100644 --- a/src/backend/utils/adt/jsonb_util.c +++ b/src/backend/uti
Getting column names/types from select query?
Hi all, I am interested in figuring out how to get the names and types of the columns from an arbitrary query. Essentially, I want to be able to take a query like: CREATE TABLE foo( bar bigserial, baz varchar(256) ); SELECT * FROM foo WHERE bar = 42; and figure out programmatically that the select will return a column "bar" of type bigserial, and a column "baz" of type varchar(256). I would like this to work for more complex queries as well (joins, CTEs, etc). I've found https://wiki.postgresql.org/wiki/Query_Parsing, which talks about related ways to hook into postgres, but that seems to only talk about the parse tree — a lot more detail and processing seems to be required in order to figure out the output types. It seems like there should be somewhere I can hook into in postgres that will get me this information, but I have no familiarity with the codebase, so I don't know the best way to get this. How would you recommend that I approach this? I'm comfortable patching postgres if needed, although if there's a solution that doesn't require that, I'd prefer that. Thanks, :w
Re: [HACKERS] [PATCH] Generic type subscripting
On Wed Jan 20, 2021 at 11:22 AM EST, Dmitry Dolgov wrote: > > On Tue Jan 19, 2021 at 1:42 PM EST, Pavel Stehule wrote: > > > > I found minor issues. > > > > Doc - missing tag > > > > and three whitespaces issues > > > > see attached patch > > Thanks, I need to remember to not skipp doc building for testing process > even for such small changes. Hope now I didn't forget anything. > > > On Wed, Jan 20, 2021 at 09:58:43AM -0500, Dian M Fay wrote: > > > Here's a full editing pass on the documentation, with v45 and Pavel's > > doc-whitespaces-fix.patch applied. I also corrected a typo in one of the > > added hints. > > Great! I've applied almost all of it, except: > > + A jsonb value will accept assignments to nonexistent > subscript > + paths as long as the nonexistent elements being traversed are all > arrays. > > Maybe I've misunderstood the intention, but there is no requirement > about arrays for creating such an empty path. I've formulated it as: > > + A jsonb value will accept assignments to nonexistent > subscript > + paths as long as the last existing path key is an object or an array. My intention there was to highlight the difference between: * SET obj['a']['b']['c'] = '"newvalue"' * SET arr[0][0][3] = '"newvalue"' obj has to conform to {"a": {"b": {...}}} in order to receive the assignment of the nested c. If it doesn't, that's the error case we discussed earlier. But arr can be null, [], and so on, and any missing structure [[[null, null, null, "newvalue"]]] will be created. Take 2: A jsonb value will accept assignments to nonexistent subscript paths as long as object key subscripts can be traversed as described above. The final subscript is not traversed and, if it describes a missing object key, will be created. Nested arrays will always be created and NULL-padded according to the path until the value can be placed appropriately.
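Putting the two cases side by side as statements may help (a sketch only; the table and column names are made up):

-- Assume a table t(obj jsonb, arr jsonb).

-- Object case: every key traversed before the last must already lead to an
-- object, i.e. obj must conform to {"a": {"b": {...}}}, otherwise an error
-- is raised.
UPDATE t SET obj['a']['b']['c'] = '"newvalue"';

-- Array case: missing structure is created and NULL-padded, so a NULL or
-- empty arr ends up as [[[null, null, null, "newvalue"]]].
UPDATE t SET arr[0][0][3] = '"newvalue"';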
Re: Getting column names/types from select query?
"Wesley Aptekar-Cassels" writes: > I am interested in figuring out how to get the names and types of the > columns from an arbitrary query. Where do you need this information? Usually the easiest way is to prepare (plan) the query and then extract metadata, for instance PQprepare and PQdescribePrepared if using libpq. regards, tom lane
Re: Odd, intermittent failure in contrib/pageinspect
Michael Paquier writes: > On Tue, Jan 19, 2021 at 05:03:49PM -0500, Tom Lane wrote: >> In short I propose the attached patch, which also gets rid of >> that duplicate query. > Agreed, +1. Pushed, thanks for looking at it. regards, tom lane
strange error reporting
I just made the mistake of trying to run pgbench without first running createdb and got this: pgbench: error: connection to database "" failed: could not connect to socket "/tmp/.s.PGSQL.5432": FATAL: database "rhaas" does not exist This looks pretty bogus because (1) I was not attempting to connect to a database whose name is the empty string and (2) saying that it couldn't connect to the socket is wrong, else it would not also be showing a server message. I haven't investigated why this is happening; apologies if this is a known issue. -- Robert Haas EDB: http://www.enterprisedb.com
Jsonpath ** vs lax mode
Hi! We have a bug report which says that jsonpath ** operator behaves strangely in the lax mode [1]. Naturally, the result of this query looks counter-intuitive. # select jsonb_path_query_array('[{"a": 1, "b": [{"a": 2}]}]', 'lax $.**.a'); jsonb_path_query_array [1, 1, 2, 2] (1 row) But actually, everything works as designed. ** operator reports both objects and wrapping arrays, while object key accessor automatically unwraps arrays. # select x, jsonb_path_query_array(x, '$.a') from jsonb_path_query('[{"a": 1, "b": [{"a": 2}]}]', 'lax $.**') x; x | jsonb_path_query_array -+ [{"a": 1, "b": [{"a": 2}]}] | [1] {"a": 1, "b": [{"a": 2}]} | [1] 1 | [] [{"a": 2}] | [2] {"a": 2}| [2] 2 | [] (6 rows) At first sight, we may just say that lax mode just sucks and counter-intuitive results are expected. But at the second sight, the lax mode is used by default and current behavior may look too surprising. My proposal is to make everything after the ** operator use strict mode (patch attached). I think this shouldn't be backpatched, just applied to the v14. Other suggestions? Links 1. https://www.postgresql.org/message-id/16828-2b0229babfad2d8c%40postgresql.org -- Regards, Alexander Korotkov jsonpath_double_star_strict_mode.patch Description: Binary data
Re: strange error reporting
Robert Haas writes: > I just made the mistake of trying to run pgbench without first running > createdb and got this: > pgbench: error: connection to database "" failed: could not connect to > socket "/tmp/.s.PGSQL.5432": FATAL: database "rhaas" does not exist > This looks pretty bogus because (1) I was not attempting to connect to > a database whose name is the empty string and (2) saying that it > couldn't connect to the socket is wrong, else it would not also be > showing a server message. I'm not sure about the empty DB name in the first part (presumably that's from pgbench, so what was your pgbench command exactly?). But the 'could not connect to socket' part is a consequence of my recent fiddling with libpq's connection failure reporting, see 52a10224e. We could discuss exactly how that ought to be spelled, but the idea is to consistently identify the host that we were trying to connect to. If you have a multi-host connection string, it's conceivable that "rhaas" exists on some of those hosts and not others, so I do not think the info is irrelevant. Just looking at this, I wonder if we ought to drop pgbench's contribution to the message entirely; it seems like libpq's message is now fairly freestanding. regards, tom lane
Re: Phrase search vs. multi-lexeme tokens
On Thu, Jan 7, 2021 at 6:36 AM Alexander Korotkov wrote: > > > I read your patch over quickly and it seems like a reasonable > > approach (but sadly underdocumented). Can we extend the idea > > to fix the to_tsquery case? > > Sure, I'll provide a revised patch. The next version of the patch is attached. Now, it just makes to_tsquery() and websearch_to_tsquery() use phrase operator to connect multiple lexemes of the same tsquery token. I leave plainto_tsquery() aside because it considers the whole argument as a single token. Changing it would make it an equivalent of phraseto_tsquery(). -- Regards, Alexander Korotkov tsquery_phrase_fix.path Description: Binary data
Re: strange error reporting
On Wed, Jan 20, 2021 at 12:19 PM Tom Lane wrote: > Robert Haas writes: > > I just made the mistake of trying to run pgbench without first running > > createdb and got this: > > > pgbench: error: connection to database "" failed: could not connect to > > socket "/tmp/.s.PGSQL.5432": FATAL: database "rhaas" does not exist > > > This looks pretty bogus because (1) I was not attempting to connect to > > a database whose name is the empty string and (2) saying that it > > couldn't connect to the socket is wrong, else it would not also be > > showing a server message. > > I'm not sure about the empty DB name in the first part (presumably > that's from pgbench, so what was your pgbench command exactly?). I think it was just 'pgbench -i 40'. For sure, I didn't specify a database name. > But the 'could not connect to socket' part is a consequence of my > recent fiddling with libpq's connection failure reporting, see > 52a10224e. We could discuss exactly how that ought to be spelled, > but the idea is to consistently identify the host that we were trying > to connect to. If you have a multi-host connection string, it's > conceivable that "rhaas" exists on some of those hosts and not others, > so I do not think the info is irrelevant. I'm not saying that which socket I used is totally irrelevant although in most cases it's going to be a lot of detail. I'm just saying that, at least for me, when you say you can't connect to a socket, I at least think about the return value of connect(2), which was clearly 0 here. > Just looking at this, I wonder if we ought to drop pgbench's > contribution to the message entirely; it seems like libpq's > message is now fairly freestanding. Maybe it would be better if it said: connection to database at socket "/tmp/.s.PGSQL.5432" failed: FATAL: database "rhaas" does not exist -- Robert Haas EDB: http://www.enterprisedb.com
Re: Getting column names/types from select query?
> Where do you need this information? I'm writing some code that takes a given query, and generates type-safe bindings for it, so people can write SQL queries and get structs (or vectors of structs) out the other end. So I'm pretty flexible about where I get it, given that it'll be part of my build/codegen process. I hadn't seen libpq yet, I'll look into that — thanks!
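For what it's worth, Tom's PQprepare/PQdescribePrepared suggestion would look roughly like this in libpq (a minimal sketch with no error handling; the connection string and query are placeholders):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;

	/* Plan the query without executing it... */
	res = PQprepare(conn, "stmt", "SELECT * FROM foo WHERE bar = 42", 0, NULL);
	PQclear(res);

	/* ...then ask the server to describe the result columns. */
	res = PQdescribePrepared(conn, "stmt");
	for (int i = 0; i < PQnfields(res); i++)
		printf("column %s: type oid %u, typmod %d\n",
			   PQfname(res, i), PQftype(res, i), PQfmod(res, i));

	PQclear(res);
	PQfinish(conn);
	return 0;
}

The type OID and typmod pair can then be turned into a familiar name such as character varying(256) with the SQL function format_type().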
Re: strange error reporting
Robert Haas writes: > On Wed, Jan 20, 2021 at 12:19 PM Tom Lane wrote: >> But the 'could not connect to socket' part is a consequence of my >> recent fiddling with libpq's connection failure reporting, see >> 52a10224e. We could discuss exactly how that ought to be spelled, >> but the idea is to consistently identify the host that we were trying >> to connect to. If you have a multi-host connection string, it's >> conceivable that "rhaas" exists on some of those hosts and not others, >> so I do not think the info is irrelevant. > I'm not saying that which socket I used is totally irrelevant although > in most cases it's going to be a lot of detail. I'm just saying that, > at least for me, when you say you can't connect to a socket, I at > least think about the return value of connect(2), which was clearly 0 > here. Fair. One possibility, which'd take a few more cycles in libpq but likely not anything significant, is to replace "could not connect to ..." with "while connecting to ..." once we're past the connect() per se. > Maybe it would be better if it said: > connection to database at socket "/tmp/.s.PGSQL.5432" failed: FATAL: > database "rhaas" does not exist I'd be inclined to spell it "connection to server at ... failed", but that sort of wording is surely also possible. regards, tom lane
Re: strange error reporting
On Wed, Jan 20, 2021 at 12:47 PM Tom Lane wrote: > Fair. One possibility, which'd take a few more cycles in libpq but > likely not anything significant, is to replace "could not connect to ..." > with "while connecting to ..." once we're past the connect() per se. Yeah. I think this is kind of a client-side version of errcontext(), except we don't really have that context formally, so we're trying to figure out how to fake it in specific cases. > > Maybe it would be better if it said: > > > connection to database at socket "/tmp/.s.PGSQL.5432" failed: FATAL: > > database "rhaas" does not exist > > I'd be inclined to spell it "connection to server at ... failed", > but that sort of wording is surely also possible. "connection to server" rather than "connection to database" works for me; in fact, I think I like it slightly better. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-20 18:54, Alvaro Herrera wrote: On 2021-Jan-20, Alvaro Herrera wrote: On 2021-Jan-20, Michael Paquier wrote: > +/* > + * This is mostly duplicating ATExecSetTableSpaceNoStorage, > + * which should maybe be factored out to a library function. > + */ > Wouldn't it be better to do first the refactoring of 0002 and then > 0001 so as REINDEX can use the new routine, instead of putting that > into a comment? I think merging 0001 and 0002 into a single commit is a reasonable approach. ... except it doesn't make a lot of sense to have set_rel_tablespace in either indexcmds.c or index.c. I think tablecmds.c is a better place for it. (I would have thought catalog/storage.c, but that one's not the right abstraction level it seems.) I did a refactoring of ATExecSetTableSpaceNoStorage() in 0001. The new function SetRelTablespace() is placed in tablecmds.c, and the following 0002 makes use of it. Is that close to what you and Michael suggested? But surely ATExecSetTableSpaceNoStorage should be using this new routine. (I first thought 0002 was doing that, since that commit is calling itself a "refactoring", but now that I look closer, it's not.) Yeah, this 'refactoring' initially referred to the refactoring of what Justin added to one of the previous versions of 0001. It was meant to be merged with 0001 once agreed, but we got distracted by other stuff. I have not yet addressed Michael's concerns regarding reindex of partitions. I am going to look closer at it tomorrow. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-20 21:08, Alexey Kondratov wrote: On 2021-01-20 18:54, Alvaro Herrera wrote: On 2021-Jan-20, Alvaro Herrera wrote: On 2021-Jan-20, Michael Paquier wrote: > +/* > + * This is mostly duplicating ATExecSetTableSpaceNoStorage, > + * which should maybe be factored out to a library function. > + */ > Wouldn't it be better to do first the refactoring of 0002 and then > 0001 so as REINDEX can use the new routine, instead of putting that > into a comment? I think merging 0001 and 0002 into a single commit is a reasonable approach. ... except it doesn't make a lot of sense to have set_rel_tablespace in either indexcmds.c or index.c. I think tablecmds.c is a better place for it. (I would have thought catalog/storage.c, but that one's not the right abstraction level it seems.) I did a refactoring of ATExecSetTableSpaceNoStorage() in the 0001. New function SetRelTablesapce() is placed into the tablecmds.c. Following 0002 gets use of it. Is it close to what you and Michael suggested? Ugh, forgot to attach the patches. Here they are. -- AlexeyFrom 2c3876f99bc8ebdd07c532619992e7ec3093e50a Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 23 Mar 2020 21:10:29 +0300 Subject: [PATCH v2 2/2] Allow REINDEX to change tablespace REINDEX already does full relation rewrite, this patch adds a possibility to specify a new tablespace where new relfilenode will be created. --- doc/src/sgml/ref/reindex.sgml | 22 + src/backend/catalog/index.c | 72 ++- src/backend/commands/indexcmds.c | 68 ++- src/bin/psql/tab-complete.c | 4 +- src/include/catalog/index.h | 2 + src/test/regress/input/tablespace.source | 53 +++ src/test/regress/output/tablespace.source | 102 ++ 7 files changed, 318 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 627b36300c..4f84060c4d 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN CONCURRENTLY [ boolean ] VERBOSE [ boolean ] +TABLESPACE new_tablespace @@ -187,6 +188,19 @@ REINDEX [ ( option [, ...] ) ] { IN + +TABLESPACE + + + This specifies that indexes will be rebuilt on a new tablespace. + Cannot be used with "mapped" relations. If SCHEMA, + DATABASE or SYSTEM is specified, then + all unsuitable relations will be skipped and a single WARNING + will be generated. + + + + VERBOSE @@ -210,6 +224,14 @@ REINDEX [ ( option [, ...] ) ] { IN + +new_tablespace + + + The tablespace where indexes will be rebuilt. + + + diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index b8cd35e995..ed98b17483 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -57,6 +57,7 @@ #include "commands/event_trigger.h" #include "commands/progress.h" #include "commands/tablecmds.h" +#include "commands/tablespace.h" #include "commands/trigger.h" #include "executor/executor.h" #include "miscadmin.h" @@ -1394,9 +1395,13 @@ index_update_collation_versions(Oid relid, Oid coll) * Create concurrently an index based on the definition of the one provided by * caller. The index is inserted into catalogs and needs to be built later * on. This is called during concurrent reindex processing. + * + * "tablespaceOid" is the new tablespace to use for this index. If + * InvalidOid, use the tablespace in-use instead. 
*/ Oid -index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, const char *newName) +index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, + Oid tablespaceOid, const char *newName) { Relation indexRelation; IndexInfo *oldInfo, @@ -1526,7 +1531,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, const char newInfo, indexColNames, indexRelation->rd_rel->relam, - indexRelation->rd_rel->reltablespace, + OidIsValid(tablespaceOid) ? +tablespaceOid : indexRelation->rd_rel->reltablespace, indexRelation->rd_indcollation, indclass->values, indcoloptions->values, @@ -3591,6 +3597,8 @@ IndexGetRelation(Oid indexId, bool missing_ok) /* * reindex_index - This routine is used to recreate a single index + * + * See comments of reindex_relation() for details about "tablespaceOid". */ void reindex_index(Oid indexId, bool skip_constraint_checks, char persistence, @@ -3603,6 +3611,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence, volatile bool skipped_constraint = false; PGRUsage ru0; bool progress = ((params->options & REINDEXOPT_REPORT_PROGRESS) != 0); + bool set_tablespace = OidIsValid(params->tablespaceOid); pg_rusage_init(&ru0);
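For context, the synopsis in the documentation diff above implies usage along these lines once the patch is applied (a sketch only; the table and tablespace names are made up):

-- Rebuild a table's indexes, placing the new index files in another tablespace.
REINDEX (VERBOSE, TABLESPACE fast_ssd) TABLE my_table;

-- The same for a whole schema; per the added documentation, unsuitable
-- (e.g. mapped) relations are skipped with a single WARNING.
REINDEX (TABLESPACE fast_ssd) SCHEMA public;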
Re: Jsonpath ** vs lax mode
On 2021-Jan-20, Alexander Korotkov wrote: > My proposal is to make everything after the ** operator use strict mode > (patch attached). I think this shouldn't be backpatched, just applied to > the v14. Other suggestions? I think changing the mode midway through the operation is strange. What do you think of requiring strict mode whenever ** is used? That is, if ** is used and the mode is lax, an error is thrown. Thanks -- Álvaro Herrera Valdivia, Chile
Re: strange error reporting
On 2021-Jan-20, Robert Haas wrote: > On Wed, Jan 20, 2021 at 12:19 PM Tom Lane wrote: > > Robert Haas writes: > > > I just made the mistake of trying to run pgbench without first running > > > createdb and got this: > > > > > pgbench: error: connection to database "" failed: could not connect to > > > socket "/tmp/.s.PGSQL.5432": FATAL: database "rhaas" does not exist > > > > > This looks pretty bogus because (1) I was not attempting to connect to > > > a database whose name is the empty string [...] > > > > I'm not sure about the empty DB name in the first part (presumably > > that's from pgbench, so what was your pgbench command exactly?). > > I think it was just 'pgbench -i 40'. For sure, I didn't specify a database > name. That's because pgbench reports its dbname input argument, but since you didn't specify anything, PQconnectdbParams() falls back to libpq's default behavior. I think we'd have to use PQdb() instead. -- Álvaro Herrera Valdivia, Chile
Re: strange error reporting
On Wed, Jan 20, 2021 at 1:25 PM Alvaro Herrera wrote: > That's because pgbench reports the input argument dbname, but since you > didn't specify anything, then PQconnectdbParams() uses the libpq > behavior. I think we'd have to use PQdb() instead. I figured it was something like that. I don't know whether the right thing is to use something like PQdb() to get the correct database name, or whether we should go with Tom's suggestion and omit that detail altogether, but I think showing the empty string when the user relied on the default is too confusing. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-Jan-20, Alexey Kondratov wrote: > On 2021-01-20 21:08, Alexey Kondratov wrote: > > > > I did a refactoring of ATExecSetTableSpaceNoStorage() in the 0001. New > > function SetRelTablesapce() is placed into the tablecmds.c. Following > > 0002 gets use of it. Is it close to what you and Michael suggested? > > Ugh, forgot to attach the patches. Here they are. Yeah, looks reasonable. > + /* No work if no change in tablespace. */ > + oldTablespaceOid = rd_rel->reltablespace; > + if (tablespaceOid != oldTablespaceOid || > + (tablespaceOid == MyDatabaseTableSpace && > OidIsValid(oldTablespaceOid))) > + { > + /* Update the pg_class row. */ > + rd_rel->reltablespace = (tablespaceOid == MyDatabaseTableSpace) > ? > + InvalidOid : tablespaceOid; > + CatalogTupleUpdate(pg_class, &tuple->t_self, tuple); > + > + changed = true; > + } > + > + if (changed) > + /* Record dependency on tablespace */ > + changeDependencyOnTablespace(RelationRelationId, > + > reloid, rd_rel->reltablespace); Why have a separate "if (changed)" block here instead of merging with the above? -- Álvaro Herrera39°49'30"S 73°17'W
Re: strange error reporting
On 2021-Jan-20, Robert Haas wrote: > On Wed, Jan 20, 2021 at 1:25 PM Alvaro Herrera > wrote: > > That's because pgbench reports the input argument dbname, but since you > > didn't specify anything, then PQconnectdbParams() uses the libpq > > behavior. I think we'd have to use PQdb() instead. > > I figured it was something like that. I don't know whether the right > thing is to use something like PQdb() to get the correct database > name, or whether we should go with Tom's suggestion and omit that > detail altogether, but I think showing the empty string when the user > relied on the default is too confusing. Well, the patch seems small enough, and I don't think it'll be in any way helpful to omit that detail. -- Álvaro Herrera39°49'30"S 73°17'W "Having your biases confirmed independently is how scientific progress is made, and hence made our great society what it is today" (Mary Gardiner) diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index f7da3e1f62..30fde9e9ec 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -1226,7 +1226,7 @@ doConnect(void) if (PQstatus(conn) == CONNECTION_BAD) { pg_log_error("connection to database \"%s\" failed: %s", - dbName, PQerrorMessage(conn)); + PQdb(conn), PQerrorMessage(conn)); PQfinish(conn); return NULL; }
Re: Deleting older versions in unique indexes to avoid page splits
On Wed, Jan 20, 2021 at 5:33 AM Amit Kapila wrote: > > Victor independently came up with a benchmark that ran over several > > hours, with cleanup consistently held back by ~5 minutes by a long > > running transaction: > > > > AFAICS, the long-running transaction used in the test is below: > SELECT abalance, pg_sleep(300) FROM pgbench_accounts WHERE mtime > > now() - INTERVAL '15min' ORDER BY aid LIMIT 1; > > This shouldn't generate a transaction id so would it be sufficient to > hold back the clean-up? It will hold back clean-up because it holds open a snapshot. Whether or not the long running transaction has been allocated a true XID (not just a VXID) is irrelevant. Victor's test case is perfectly valid. In general there are significant benefits for cases with long-running transactions, which will be quite apparent if you do something simple like run pgbench (a script with non-HOT updates) while a REPEATABLE READ transaction runs in psql (and has actually acquired a snapshot by running a simple query -- the xact snapshot is acquired lazily). I understand that this will be surprising if you believe that the problem in these cases is strictly that there are too many "recently dead" versions that still need to be stored (i.e. versions that cleanup simply isn't able to remove, given the xmin horizon invariant). But it's now clear that that's not what really happens in most cases with a long running transaction. What we actually see is that local page-level inefficiencies in cleanup were (and perhaps to some degree still are) a much bigger problem than the inherent inability of cleanup to remove even one or two tuples. This is at least true until the bloat problem becomes a full-blown disaster (because cleanup really is inherently restricted by the global xmin horizon, and therefore hopelessly behind). In reality there are seldom that many individual logical rows that get updated more than a few times in (say) any given one hour period. This is true even with skewed updates -- the skew is almost never visible at the level of an individual leaf page. The behavior we see now that we have bottom-up index deletion is much closer to the true optimal behavior for the general approach Postgres takes to cleanup of garbage tuples, since it tends to make the number of versions for any given logical row as low as possible (within the confines of the global xmin horizon limitations for cleanup). Of course it would also be helpful to have something like zheap -- some mechanism that can store "recently dead" versions some place where they at least don't bloat up the main relation structures. But that's only one part of the big picture for Postgres MVCC garbage. We should not store garbage tuples (i.e. those that are dead rather than just recently dead) *anywhere*. > First, it is not clear to me if that has properly simulated the > long-running test but even if it is what I intend to say was to have > an open long-running transaction possibly for the entire duration of > the test? If we do that, we will come to know if there is any overhead > and if so how much? I am confident that you are completely wrong about regressing cases with long-running transactions, except perhaps in some very narrow sense that is of little practical relevance. 
Victor's test case did result in a small loss of throughput, for example, but that's a small price to pay to not have your indexes explode (note also that most of the indexes weren't even used for reads, so in the real world it would probably also improve throughput even in the short term). FWIW the small drop in TPS probably had little to do with the cost of visiting the heap for visibility information. Workloads must be made to "live within their means". You can call that a regression if you like, but I think that almost anybody else would take issue with that characterization. Slowing down non-HOT updaters in these extreme cases may actually be a good thing, even when bottom-up deletion finally becomes ineffective. It can be thought of as backpressure. I am not worried about slowing down something that is already hopelessly inefficient and unsustainable. I'd go even further than that, in fact -- I now wonder if we should *deliberately* slow them down some more! > Test with 2 un-modified indexes > === > create table t1(c1 int, c2 int, c3 int, c4 char(10)); > create index idx_t1 on t1(c1); > create index idx_t2 on t1(c2); > create index idx_t3 on t1(c3); > > insert into t1 values(generate_series(1,500), 1, 10, 'aa'); > update t1 set c2 = 2; > postgres=# update t1 set c2 = 2; > UPDATE 500 > Time: 46533.530 ms (00:46.534) > > With HEAD > == > postgres=# update t1 set c2 = 2; > UPDATE 500 > Time: 52529.839 ms (00:52.530) > > I have dropped and recreated the table after each update in the test. Good thing that you remembered to drop and recreate the table, since otherwise bottom-up index deletion would look really good! Besides,
Re: strange error reporting
Alvaro Herrera writes: > On 2021-Jan-20, Robert Haas wrote: >> I figured it was something like that. I don't know whether the right >> thing is to use something like PQdb() to get the correct database >> name, or whether we should go with Tom's suggestion and omit that >> detail altogether, but I think showing the empty string when the user >> relied on the default is too confusing. > Well, the patch seems small enough, and I don't think it'll be in any > way helpful to omit that detail. I'm +1 for applying and back-patching that. I still think we might want to just drop the phrase altogether in HEAD, but we wouldn't do that in the back branches, and the message is surely misleading as-is. regards, tom lane
catalogs.sgml documentation ambiguity
Some catalog tables have references to pg_attribute.attnum. In the documentation, it only says "(references pg_attribute.attnum)" but not which oid column to include in the two-column "foreign key". This would not be a problem if there would only be one reference to pg_class.oid, but some catalog tables have multiple columns that references pg_class.oid. For instance, pg_constraint has two columns (conkey, confkey) referencing pg_attribute, and three columns (conrelid, conindid, confrelid) referencing pg_class. A user might wonder: - Which one of these three columns should be used in combination with the conkey/confkey elements to join pg_attribute? If we would have array foreign key support, I would guess the "foreign keys" should be: FOREIGN KEY (confrelid, EACH ELEMENT OF confkey) REFERENCES pg_catalog.pg_attribute (attrelid, attnum) FOREIGN KEY (conrelid, EACH ELEMENT OF conkey) REFERENCES pg_catalog.pg_attribute (attrelid, attnum) It's of course harder to guess for a machine though, which would need a separate human-produced lookup-table. Could it be meaningful to clarify these multi-key relations in the documentation? As a bonus, machines could then parse the information out of catalogs.sgml. Here is a list of catalogs referencing pg_attribute and with multiple pg_class references: table_name | array_agg --+--- pg_constraint| {confrelid,conindid,conrelid} pg_index | {indexrelid,indrelid} pg_partitioned_table | {partdefid,partrelid} pg_trigger | {tgconstrindid,tgconstrrelid,tgrelid} (4 rows) Produced using query: SELECT b.table_name, array_agg(DISTINCT b.column_name) FROM pit.oid_joins AS a JOIN pit.oid_joins AS b ON b.table_name = a.table_name WHERE a.ref_table_name = 'pg_attribute' AND b.ref_table_name = 'pg_class' GROUP BY b.table_name HAVING cardinality(array_agg(DISTINCT b.column_name)) > 1 ;
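As a concrete illustration, the implicit two-column join for pg_constraint.conkey would look like this (the relation in the WHERE clause is a made-up example):

SELECT c.conname, a.attname
FROM pg_constraint AS c
JOIN pg_attribute AS a
  ON a.attrelid = c.conrelid
 AND a.attnum = ANY (c.conkey)
WHERE c.conrelid = 'some_table'::regclass;

The confkey elements would join the same way, but through confrelid rather than conrelid.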
Re: Allow matching whole DN from a client certificate
On Mon, 2021-01-18 at 11:23 +0100, Daniel Gustafsson wrote: > + /* use commas instead of slashes */ > + X509_NAME_print_ex(bio, x509name, 0, XN_FLAG_SEP_COMMA_PLUS); > I don't have strong opinions on whether we shold use slashes or commas, but I > think it needs to be documented that commas are required since slashes is the > more common way to print a DN. There's also a XN_FLAG_RFC2253 flag, which claims to print in an RFC- compatible escape system. (It includes XN_FLAG_SEP_COMMA_PLUS.) I think NSS uses a similar RFC-based escape scheme, but I haven't checked to see whether it's fully compatible. I think you'll want to be careful to specify the format as much as possible, both to make sure that other backend TLS implementations can actually use the same escaping system and to ensure that user regexes don't suddenly start matching different things at some point in the future. As a cautionary tale, nginx is stuck with two versions of their similar feature (ssl_client_s_dn_legacy vs ssl_client_s_dn) after their switch to an RFC-compatible system [1]. Even when using RFC 2253 (now 4514) escaping, there are things left to the implementation, such as the order of AVAs inside multivalued RDNs. The RFC warns not to consider these representations to be canonical forms, so it's possible that switching (or upgrading) the TLS library in use could force users to change their regexes at some point in the future. I'm going to test this patch with some UTF-8 DNs later today; I'll share my findings. I'm also going to read up on [2] a bit more. --Jacob [1] https://serverfault.com/questions/829754/why-did-the-format-of-nginx-ssl-client-i-dn-suddenly-change [2] https://frasertweedale.github.io/blog-redhat/posts/2019-05-28-a-dn-is-not-a-string.html
Re: [HACKERS] [PATCH] Generic type subscripting
> On Wed, Jan 20, 2021 at 11:34:16AM -0500, Dian M Fay wrote: > > Thanks, I need to remember to not skipp doc building for testing process > > even for such small changes. Hope now I didn't forget anything. > > > > > On Wed, Jan 20, 2021 at 09:58:43AM -0500, Dian M Fay wrote: > > > > > Here's a full editing pass on the documentation, with v45 and Pavel's > > > doc-whitespaces-fix.patch applied. I also corrected a typo in one of the > > > added hints. > > > > Great! I've applied almost all of it, except: > > > > + A jsonb value will accept assignments to nonexistent > > subscript > > + paths as long as the nonexistent elements being traversed are all > > arrays. > > > > Maybe I've misunderstood the intention, but there is no requirement > > about arrays for creating such an empty path. I've formulated it as: > > > > + A jsonb value will accept assignments to nonexistent > > subscript > > + paths as long as the last existing path key is an object or an array. > > My intention there was to highlight the difference between: > > * SET obj['a']['b']['c'] = '"newvalue"' > * SET arr[0][0][3] = '"newvalue"' > > obj has to conform to {"a": {"b": {...}}} in order to receive the > assignment of the nested c. If it doesn't, that's the error case we > discussed earlier. But arr can be null, [], and so on, and any missing > structure [[[null, null, null, "newvalue"]]] will be created. If arr is 'null', or any other scalar value, such subscripting will work only one level deep because they represented internally as an array of one element. If arr is '[]' the path will comply by definition. So it's essentially the same as for objects with no particular difference. If such a quirk about scalars being treated like arrays is bothering, we could also bend it in this case as well (see the attached version). >From a4037c651a0cfd2448f38b6c8c932b5a09136b0a Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Fri, 18 Dec 2020 17:19:51 +0100 Subject: [PATCH v48 1/3] Subscripting for jsonb Subscripting implementation for jsonb. It does not support slices, does not have a limit for number of subscripts and for assignment expects a replace value to be of jsonb type. There is also one functional difference in assignment via subscripting from jsonb_set, when an original jsonb container is NULL, subscripting replaces it with an empty jsonb and proceed with assignment. For the sake of code reuse, some parts of jsonb functionality were rearranged to allow use the same functions for jsonb_set and assign subscripting operation. The original idea belongs to Oleg Bartunov. Reviewed-by: Tom Lane, Arthur Zakirov, Pavel Stehule, Dian M Fay --- doc/src/sgml/json.sgml | 51 src/backend/utils/adt/Makefile | 1 + src/backend/utils/adt/jsonb_util.c | 76 - src/backend/utils/adt/jsonbsubs.c | 413 src/backend/utils/adt/jsonfuncs.c | 180 ++-- src/include/catalog/pg_proc.dat | 4 + src/include/catalog/pg_type.dat | 3 +- src/include/utils/jsonb.h | 6 +- src/test/regress/expected/jsonb.out | 272 +- src/test/regress/sql/jsonb.sql | 84 +- 10 files changed, 985 insertions(+), 105 deletions(-) create mode 100644 src/backend/utils/adt/jsonbsubs.c diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml index 5b9a5557a4..3ace5e444b 100644 --- a/doc/src/sgml/json.sgml +++ b/doc/src/sgml/json.sgml @@ -602,6 +602,57 @@ SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"tags": ["qu + + jsonb Subscripting + + The jsonb data type supports array-style subscripting expressions + to extract and modify elements. 
Nested values can be indicated by chaining + subscripting expressions, following the same rules as the path + argument in the jsonb_set function. If a jsonb + value is an array, numeric subscripts start at zero, and negative integers count + backwards from the last element of the array. Slice expressions are not supported. + The result of a subscripting expression is always of the jsonb data type. + + + + An example of subscripting syntax: + + +-- Extract object value by key +SELECT ('{"a": 1}'::jsonb)['a']; + +-- Extract nested object value by key path +SELECT ('{"a": {"b": {"c": 1}}}'::jsonb)['a']['b']['c']; + +-- Extract array element by index +SELECT ('[1, "2", null]'::jsonb)[1]; + +-- Update object value by key. Note the quotes around '1': the assigned +-- value must be of the jsonb type as well +UPDATE table_name SET jsonb_field['key'] = '1'; + +-- Filter records using a WHERE clause with subscripting. Since the result of +-- subscripting is jsonb, the value we compare it against must also be jsonb. +-- The double quotes make "value" also a valid jsonb string. +SELECT * FROM table_name WHERE jsonb_field['key'] = '"value"'; + + + jsonb assignment via subscripting handles a few edge cases + differently from jsonb_set. When a source jsonb + is NULL, assig
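A small sketch of the NULL-source behavior described above, under the patch's semantics as discussed in this thread (the table t is hypothetical and jdoc starts out NULL):

-- Assuming a table t(id int, jdoc jsonb) where jdoc IS NULL for id = 1:
UPDATE t SET jdoc['a'][0] = '"x"' WHERE id = 1;
-- Per the patch, the NULL source is replaced by an empty jsonb and the
-- missing structure along the path is created, giving roughly {"a": ["x"]}.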
Re: Calculation of relids (pull_varnos result) for PlaceHolderVars
I wrote: > ... > 2. pull_varnos() is not passed the planner "root" data structure, > so it can't get at the PlaceHolderInfo list. We can change its > API of course, but that propagates to dozens of places. > ... > The 0001 patch attached goes ahead and makes those API changes. > I think this is perfectly reasonable to do in HEAD, but it most > likely is an unacceptable API/ABI break for the back branches. > ... > A third way is to preserve the existing pull_varnos() API in > the back branches, changing all the internal calls to use a > new function that has the additional "root" parameter. This > seems feasible but I've not attempted to code it yet. Here's a proposed fix that does it like that. The 0001 patch is the same as before, and then 0002 is a delta to be applied only in the back branches. What I did there was install a layer of macros in the relevant .c files that cause calls that look like the HEAD versions to be redirected to the "xxx_new" functions. The idea is to keep the actual code in sync with HEAD, for readability and to minimize back-patching pain. It could be argued that this is too cute and the internal references should just go to the "new" functions in the back branches. I did not bother to preserve ABI for these two functions: indexcol_is_bool_constant_for_query() build_implied_join_equality() because I judged it highly unlikely that any extensions are calling them. If anybody thinks differently, we could hack those in the same way. regards, tom lane diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 2f2d4d171c..48c2f23ae0 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -5709,7 +5709,8 @@ foreign_grouping_ok(PlannerInfo *root, RelOptInfo *grouped_rel, * RestrictInfos, so we must make our own. */ Assert(!IsA(expr, RestrictInfo)); - rinfo = make_restrictinfo(expr, + rinfo = make_restrictinfo(root, + expr, true, false, false, diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c index 4f5b870d1b..d263ecf082 100644 --- a/src/backend/optimizer/path/clausesel.c +++ b/src/backend/optimizer/path/clausesel.c @@ -227,7 +227,7 @@ clauselist_selectivity_ext(PlannerInfo *root, } else { -ok = (NumRelids(clause) == 1) && +ok = (NumRelids(root, clause) == 1) && (is_pseudo_constant_clause(lsecond(expr->args)) || (varonleft = false, is_pseudo_constant_clause(linitial(expr->args; @@ -609,7 +609,7 @@ bms_is_subset_singleton(const Bitmapset *s, int x) * restriction or join estimator. Subroutine for clause_selectivity(). */ static inline bool -treat_as_join_clause(Node *clause, RestrictInfo *rinfo, +treat_as_join_clause(PlannerInfo *root, Node *clause, RestrictInfo *rinfo, int varRelid, SpecialJoinInfo *sjinfo) { if (varRelid != 0) @@ -643,7 +643,7 @@ treat_as_join_clause(Node *clause, RestrictInfo *rinfo, if (rinfo) return (bms_membership(rinfo->clause_relids) == BMS_MULTIPLE); else - return (NumRelids(clause) > 1); + return (NumRelids(root, clause) > 1); } } @@ -860,7 +860,7 @@ clause_selectivity_ext(PlannerInfo *root, OpExpr *opclause = (OpExpr *) clause; Oid opno = opclause->opno; - if (treat_as_join_clause(clause, rinfo, varRelid, sjinfo)) + if (treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo)) { /* Estimate selectivity for a join clause. 
*/ s1 = join_selectivity(root, opno, @@ -896,7 +896,7 @@ clause_selectivity_ext(PlannerInfo *root, funcclause->funcid, funcclause->args, funcclause->inputcollid, - treat_as_join_clause(clause, rinfo, + treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo), varRelid, jointype, @@ -907,7 +907,7 @@ clause_selectivity_ext(PlannerInfo *root, /* Use node specific selectivity calculation function */ s1 = scalararraysel(root, (ScalarArrayOpExpr *) clause, - treat_as_join_clause(clause, rinfo, + treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo), varRelid, jointype, diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index 380336518f..aab06c7d21 100644 --- a/src/backend/optimizer/path/costsize.c +++ b/src/backend/optimizer/path/costsize.c @@ -1858,7 +1858,7 @@ cost_incremental_sort(Path *path, * Check if the expression contains Var with "varno 0" so that we * don't call estimate_num_groups in that case. */ - if (bms_is_member(0, pull_varnos((Node *) member->em_expr))) + if (bms_is_member(0, pull_varnos(root, (Node *) member->em_expr))) { unknown_varno = true; break; diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index ab6eaaead1..0188c1e9a1 100644 --- a/s
Re: strange error reporting
On Wed, Jan 20, 2021 at 1:54 PM Tom Lane wrote: > Alvaro Herrera writes: > > On 2021-Jan-20, Robert Haas wrote: > >> I figured it was something like that. I don't know whether the right > >> thing is to use something like PQdb() to get the correct database > >> name, or whether we should go with Tom's suggestion and omit that > >> detail altogether, but I think showing the empty string when the user > >> relied on the default is too confusing. > > > Well, the patch seems small enough, and I don't think it'll be in any > > way helpful to omit that detail. > > I'm +1 for applying and back-patching that. I still think we might > want to just drop the phrase altogether in HEAD, but we wouldn't do > that in the back branches, and the message is surely misleading as-is. Sure, that makes sense. -- Robert Haas EDB: http://www.enterprisedb.com
Re: poc - possibility to write window function in PL languages
Pavel Stehule writes: > The second question is work with partition context value. This should be > only one value, and of only one but of any type per function. In this case > we cannot use GET statements. I had an idea of enhancing declaration. Some > like > DECLARE > pcx PARTITION CONTEXT (int); -- read partition context > BEGIN > pcx := 10; -- set partition context > What do you think about it? Uh, what? I don't understand what this "partition context" is. regards, tom lane
Re: poc - possibility to write window function in PL languages
st 20. 1. 2021 v 21:07 odesílatel Tom Lane napsal: > Pavel Stehule writes: > > The second question is work with partition context value. This should be > > only one value, and of only one but of any type per function. In this > case > > we cannot use GET statements. I had an idea of enhancing declaration. > Some > > like > > > DECLARE > > pcx PARTITION CONTEXT (int); -- read partition context > > BEGIN > > pcx := 10; -- set partition context > > > What do you think about it? > > Uh, what? I don't understand what this "partition context" is. > It was my name for an access to window partition local memory - WinGetPartitionLocalMemory We need some interface for this cache Regards Pavel > > regards, tom lane >
Re: poc - possibility to write window function in PL languages
Pavel Stehule writes: > st 20. 1. 2021 v 21:07 odesílatel Tom Lane napsal: >> Uh, what? I don't understand what this "partition context" is. > It was my name for an access to window partition local memory - > WinGetPartitionLocalMemory Ah. > We need some interface for this cache I'm not convinced we need to expose that, or that it'd be very satisfactory to plpgsql users if we did. The fact that it's fixed-size and initializes to zeroes are both things that are okay for C programmers but might be awkward to deal with in plpgsql code. At the very least it would greatly constrain what data types you could usefully store. So I'd be inclined to leave that out, at least for the first version. regards, tom lane
Re: poc - possibility to write window function in PL languages
st 20. 1. 2021 v 21:32 odesílatel Tom Lane napsal:
> Pavel Stehule writes:
> > st 20. 1. 2021 v 21:07 odesílatel Tom Lane napsal:
> >> Uh, what? I don't understand what this "partition context" is.
>
> > It was my name for an access to window partition local memory -
> > WinGetPartitionLocalMemory
>
> Ah.
>
> > We need some interface for this cache
>
> I'm not convinced we need to expose that, or that it'd be very
> satisfactory to plpgsql users if we did. The fact that it's fixed-size
> and initializes to zeroes are both things that are okay for C programmers
> but might be awkward to deal with in plpgsql code. At the very least it
> would greatly constrain what data types you could usefully store.
>
> So I'd be inclined to leave that out, at least for the first version.

I think this functionality is relatively important. If somebody tries to implement their own window function, they usually start with some variation of row_number(). We could support only fixed-length types to begin with.

Regards

Pavel

> regards, tom lane
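To make that concrete, here is a rough sketch of a row_number()-style function written with the syntax proposed in this thread (the PARTITION CONTEXT declaration); none of this is committed syntax, so the details are only an assumption of how it might look:

CREATE FUNCTION pl_row_number() RETURNS bigint
LANGUAGE plpgsql WINDOW
AS $$
DECLARE
    cnt PARTITION CONTEXT (bigint);  -- per-partition memory, zero-initialized
BEGIN
    cnt := cnt + 1;  -- a fixed-length type, per the restriction mentioned above
    RETURN cnt;
END;
$$;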
Re: Deleting older versions in unique indexes to avoid page splits
On Wed, Jan 20, 2021 at 10:53 AM Peter Geoghegan wrote:
> This patch is unusual in that you really need to think about emergent
> behaviors to understand it. That is certainly a difficult thing to do,
> and it's understandable that even an expert might not grok it without
> considering it carefully.

I happened to stumble upon a recent blog post that seems like a light, approachable introduction to some of the key concepts here:

https://jessitron.com/2021/01/18/when-costs-are-nonlinear-keep-it-small/

Bottom-up index deletion enhances a complex system whose maintenance costs are *dramatically* nonlinear, at least in many important cases. If you apply linear thinking to such a system then you'll probably end up with a bad design. The system as a whole is made efficient by making sure that we're lazy when that makes sense, while also making sure that we're eager when that makes sense. So it almost *has* to be structured as a bottom-up, reactive mechanism -- no other approach is able to ramp up or down in exactly the right way. Talking about small cost differences (things that can easily be empirically measured, perhaps with a microbenchmark) is almost irrelevant to the big picture. It's even irrelevant to the "medium picture".

What's more, it's basically a mistake to think of heap page accesses that don't yield any deletable index tuples as wasted effort (even though that's how I describe them myself!). Here's why: we have to access the heap page to learn that it has nothing for us in the first place! If we somehow knew ahead of time that some useless-to-us heap block was useless, then the whole system wouldn't be bottom-up (by definition). In other words, failing to get any index tuple deletes from an entire heap page *is itself a form of feedback* at the local level -- it guides the entire system's behavior over time. Why should we expect to get that information at zero cost?

This is somehow both simple and complicated, which creates huge potential for miscommunications. I tried to describe this in various ways at various points. Perhaps I could have done better with that.

-- 
Peter Geoghegan
Re: [PATCH 1/1] Fix detection of pwritev support for OSX.
On Tue, Jan 19, 2021 at 6:37 PM Tom Lane wrote: > > James Hilliard writes: > > Actually, this looks path looks wrong in general, the value for > > "xcrun --sdk macosx --show-sdk-path" should take precedence over > > "xcrun --show-sdk-path" as the latter may be used for IOS potentially. > > What is "potentially"? Well I'm not sure the SDK parameter always defaults to macos although I guess it probably does as I couldn't figure out a way to change it: $ xcodebuild -showsdks iOS SDKs: iOS 14.3 -sdk iphoneos14.3 iOS Simulator SDKs: Simulator - iOS 14.3 -sdk iphonesimulator14.3 macOS SDKs: DriverKit 20.2-sdk driverkit.macosx20.2 macOS 11.1-sdk macosx11.1 tvOS SDKs: tvOS 14.3 -sdk appletvos14.3 tvOS Simulator SDKs: Simulator - tvOS 14.3 -sdk appletvsimulator14.3 watchOS SDKs: watchOS 7.2 -sdk watchos7.2 watchOS Simulator SDKs: Simulator - watchOS 7.2 -sdk watchsimulator7.2 > I've found no direct means to control the > SDK path at all, but so far it appears that "xcrun --show-sdk-path" > agrees with the compiler's default -isysroot path as seen in the > compiler's -v output. I suspect that this isn't coincidental, > but reflects xcrun actually being used in the compiler launch > process. If it were to flip over to using a IOS SDK, that would > mean that bare "cc" would generate nonfunctional executables, > which just about any onlooker would agree is broken. So there's some more weirdness involved here, whether or not you have the command line install seems to affect the output of the "xcrun --show-sdk-path" command, but not the "xcrun --sdk macosx --show-sdk-path" command. This is what I get without the command line tools: $ xcrun --show-sdk-path /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk $ xcrun --sdk macosx --show-sdk-path /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk this last one is just a symlink to the other path. With command line tools this is different however: $ xcrun --show-sdk-path /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk $ xcrun --sdk macosx --show-sdk-path /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk Note that the /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk is different from the normal SDK and doesn't seem to be able to generate binaries that target a 11.0 deployment target on my 10.15 system, however I am unsure if this behavior can be relied upon. So in terms of what works best, the newer normal SDK has the most flexibility as it can produce both 10.15 target binaries and 11.0 target binaries depending on the MACOSX_DEPLOYMENT_TARGET while the command line tools SDK can only produce 10.15 target binaries it would appear. Note that with my patch the binaries will always be compatible with the host system by default, even if the SDK is capable of producing binaries that are incompatible so building postgres works with and without the command line tools SDK. So I think "xcrun --sdk macosx --show-sdk-path" is probably preferable but either should work as long as we can properly detect deployment target symbol availability, regardless this SDK sysroot selection issue is effectively an entirely different issue from the feature detection not properly respecting the configured deployment target. > > I'm really not excited about trying to make the build work with > a non-native SDK as you are proposing. 
I think that's just going > to lead to a continuing stream of problems, because of Apple's > opinions about how cross-version compatibility should work. Well the minimum required target version is pretty much strictly based on MACOSX_DEPLOYMENT_TARGET so our feature detection still needs to use that, otherwise cross target compilation for newer or older targets will not work correctly. >From my understanding the reason AC_REPLACE_FUNCS does not throw an error for deployment target incompatible functions is that it only checks if the function exists and not if it is actually useable, this is why I had to add an explicit AC_LANG_PROGRAM compile test to properly trigger a compile failure if the function is not usable for a particular deployment target version, merely checking if the function exists in the header is not sufficient. > It also seems like unnecessary complexity, because there is always > (AFAICS) a native SDK version available. We just need to find it. Best I can tell this is not true, it is some(most?) of the time but it's not something we can rely upon as systems may only contain a newer SDK, but this newer SDK is still capable of producing binaries that can run on the build host system so this shouldn't be an issue as long as we can do target feature detection properly. > > regards, tom lane
Re: list of extended statistics on psql
On 1/20/21 7:41 AM, Tatsuro Yamada wrote:

Hi Tomas,

On 2021/01/20 11:35, Tatsuro Yamada wrote:

Apologies for all the extra work - I haven't realized this flaw when pushing for showing more stuff :-(

Don't worry about it. We didn't notice the problem even when viewed by multiple people on -hackers. Let's keep moving forward. :-D

I'll send a patch including a regression test on the next patch.

I created patches and my test results on PG10, 11, 12, and 14 are fine.

0001:
- Fix query to use pg_statistic_ext only
- Replace statuses "required" and "built" with "defined"
- Remove the size columns
- Fix document
- Add schema name as a filter condition on the query

0002:
- Fix all results of \dX
- Add new testcase by non-superuser

Please find attached files. :-D

Thanks, I've pushed this. I had to tweak the regression tests a bit, for two reasons:

1) to change user in regression tests, don't use \connect, but SET ROLE and RESET ROLE

2) roles in regression tests should use names with regress_ prefix

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
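As a quick usage sketch of the committed feature (object names here are made up): extended statistics are created as usual and then listed with the new psql command.

-- Define extended statistics on a pair of correlated columns.
CREATE TABLE t1 (a int, b int);
CREATE STATISTICS stts_1 (ndistinct, dependencies) ON a, b FROM t1;
ANALYZE t1;
-- In psql, \dX now lists the extended statistics objects, showing their
-- definition and which statistics kinds (ndistinct, dependencies, MCV)
-- are defined.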
Re: pg_upgrade fails with non-standard ACL
On 03.01.2021 14:29, Noah Misch wrote: On Thu, Jun 11, 2020 at 07:58:43PM +0300, Anastasia Lubennikova wrote: On 08.06.2020 19:31, Alvaro Herrera wrote: I'm thinking what's a good way to have a test that's committable. Maybe it would work to add a TAP test to pg_upgrade that runs initdb, does a few GRANTS as per your attachment, then runs pg_upgrade? Currently the pg_upgrade tests are not TAP ... we'd have to revive https://postgr.es/m/20180126080026.gi17...@paquier.xyz (Some progress was already made on the buildfarm front to permit this) I would be glad to add some test, but it seems to me that the infrastructure changes for cross-version pg_upgrade test is much more complicated task than this modest bugfix. Agreed. Thank you for the review. New version of the patch is attached, though I haven't tested it properly yet. I am planning to do in a couple of days. --- a/src/bin/pg_upgrade/check.c +++ b/src/bin/pg_upgrade/check.c +static void +check_for_changed_signatures(void) +{ + snprintf(output_path, sizeof(output_path), "revoke_objects.sql"); If an upgrade deleted function pg_last_wal_replay_lsn, upgrading a database in which "REVOKE ALL ON FUNCTION pg_last_wal_replay_lsn FROM public" happened requires a GRANT. Can you use a file name that will still make sense if someone enhances pg_upgrade to generate such GRANT statements? I changed the name to 'fix_system_objects_ACL.sql'. Does it look good to you? + if (script == NULL && (script = fopen_priv(output_path, "w")) == NULL) + pg_fatal("could not open file \"%s\": %s\n", +output_path, strerror(errno)); Use %m instead of passing sterror(errno) to %s. Done. + } + + /* Handle columns separately */ + if (strstr(aclinfo->obj_type, "column") != NULL) + { + char *pdot = last_dot_location(aclinfo->obj_ident); I'd write strrchr(aclinfo->obj_ident, '.') instead of introducing last_dot_location(). This assumes column names don't contain '.'; how difficult would it be to remove that assumption? If it's difficult, a cheap approach would be to ignore any entry containing too many dots. We're not likely to create such column names at initdb time, but v13 does allow: ALTER VIEW pg_locks RENAME COLUMN objid TO "a.b"; GRANT SELECT ("a.b") ON pg_locks TO backup; After which revoke_objects.sql has: psql:./revoke_objects.sql:9: ERROR: syntax error at or near "") ON pg_catalog.pg_locks."" LINE 1: REVOKE ALL (b") ON pg_catalog.pg_locks."a FROM "backup"; While that ALTER VIEW probably should have required allow_system_table_mods, we need to assume such databases exist. I've fixed it by using pg_identify_object_as_address(). Now we don't have to parse obj_identity. Overall, this patch predicts a subset of cases where pg_dump will emit a failing GRANT or REVOKE that targets a pg_catalog object. Can you write a code comment stating what it does and does not detect? I think it's okay to not predict every failure, but let's record the intended coverage. Here are a few examples I know; there may be more. The above query only finds GRANTs to non-pinned roles. pg_dump dumps the following, but the above query doesn't find them: REVOKE ALL ON FUNCTION pg_last_wal_replay_lsn FROM public; GRANT EXECUTE ON FUNCTION pg_reload_conf() TO pg_signal_backend; The above query should test refclassid. This function should not run queries against servers older than 9.6, because pg_dump emits GRANT/REVOKE for pg_catalog objects only when targeting 9.6 or later. An upgrade from 8.4 to master is failing on the above query: Fixed. 
The patch adds many lines wider than 78 columns, this being one example. In general, break such lines. (Don't break string literal arguments of ereport(), though.) Fixed. nm -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c index 43fc297eb6..429e4468f2 100644 --- a/src/bin/pg_upgrade/check.c +++ b/src/bin/pg_upgrade/check.c @@ -16,6 +16,7 @@ static void check_new_cluster_is_empty(void); static void check_databases_are_compatible(void); +static void check_for_changed_signatures(void); static void check_locale_and_encoding(DbInfo *olddb, DbInfo *newdb); static bool equivalent_locale(int category, const char *loca, const char *locb); static void check_is_install_user(ClusterInfo *cluster); @@ -151,6 +152,8 @@ check_and_dump_old_cluster(bool live_check) if (GET_MAJOR_VERSION(old_cluster.major_version) <= 804) new_9_0_populate_pg_largeobject_metadata(&old_cluster, true); + get_non_default_acl_infos(&old_cluster); + /* * While not a check option, we do this now because this
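For context, the kind of setup this check targets looks roughly like the following (the role name is made up). On the old cluster, say 9.6:

REVOKE EXECUTE ON FUNCTION pg_catalog.pg_last_xlog_replay_location() FROM public;
GRANT EXECUTE ON FUNCTION pg_catalog.pg_last_xlog_replay_location() TO backup_role;

In PostgreSQL 10 that function was renamed to pg_last_wal_replay_lsn(), so the GRANT/REVOKE emitted by pg_dump for it would fail during pg_upgrade; a check like the one in this patch reports such objects and writes the REVOKEs needed beforehand into a script.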
Re: ResourceOwner refactoring
On 19/01/2021 11:09, Heikki Linnakangas wrote: On 18/01/2021 18:11, Robert Haas wrote: On Mon, Jan 18, 2021 at 11:11 AM Robert Haas wrote: On Mon, Jan 18, 2021 at 10:19 AM Heikki Linnakangas wrote: On 18/01/2021 16:34, Alvaro Herrera wrote: So according to your performance benchmark, we're willing to accept a 30% performance loss on an allegedly common operation -- numkeep=0 numsnaps=10 becomes 49.8ns from 37.6ns. That seems a bit shocking. Maybe you can claim that these operations aren't exactly hot spots, and so the fact that we remain in the same power-of-ten is sufficient. Is that the argument? That's right. The fast path is fast, and that's important. The slow path becomes 30% slower, but that's acceptable. I don't know whether a 30% slowdown will hurt anybody, but it seems like kind of a lot, and I'm not sure I understand what corresponding benefit we're getting. The benefit is to make it easy for extensions to use resource owners to track whatever resources they need to track. And arguably, the new mechanism is nicer for built-in code, too. I'll see if I can optimize the slow paths, to make it more palatable. Ok, here's a new set of patches, and new test results. I replaced the hash function with a cheaper one. I also added the missing wrappers that Alvaro and Hayato Kuroda commented on, and fixed the typos that Michael Paquier pointed out. In the test script, I increased the number of iterations used in the tests, to make them run longer and produce more stable results. There is still a fair amount of jitter in the results, so take any particular number with a grain of salt, but the overall trend is repeatable. The results now look like this: Unpatched - postgres=# \i snaptest.sql numkeep | numsnaps | lifo_time_ns | fifo_time_ns -+--+--+-- 0 |1 | 32.7 | 32.3 0 |5 | 33.0 | 32.9 0 | 10 | 34.1 | 35.5 0 | 60 | 32.5 | 38.3 0 | 70 | 47.6 | 48.9 0 | 100 | 47.9 | 49.3 0 | 1000 | 52.9 | 52.7 0 |1 | 61.7 | 62.4 9 | 10 | 38.4 | 37.6 9 | 100 | 42.3 | 42.3 9 | 1000 | 43.9 | 45.0 9 |1 | 62.2 | 62.5 65 | 70 | 42.4 | 42.9 65 | 100 | 43.2 | 43.9 65 | 1000 | 44.0 | 45.1 65 |1 | 62.4 | 62.6 (16 rows) Patched --- postgres=# \i snaptest.sql numkeep | numsnaps | lifo_time_ns | fifo_time_ns -+--+--+-- 0 |1 | 32.8 | 34.2 0 |5 | 33.8 | 32.2 0 | 10 | 37.2 | 40.2 0 | 60 | 40.2 | 45.1 0 | 70 | 40.9 | 48.4 0 | 100 | 41.9 | 45.2 0 | 1000 | 49.0 | 55.6 0 |1 | 62.4 | 67.2 9 | 10 | 33.6 | 33.0 9 | 100 | 39.6 | 39.7 9 | 1000 | 42.2 | 45.0 9 |1 | 60.7 | 63.9 65 | 70 | 32.5 | 33.6 65 | 100 | 37.5 | 39.7 65 | 1000 | 42.3 | 46.3 65 |1 | 61.9 | 64.8 (16 rows) For easier side-by-side comparison, here are the same numbers with the patched and unpatched results side by side, and their ratio (ratio < 1 means the patched version is faster): LIFO tests: numkeep | numsnaps | unpatched | patched | ratio -+--+---+-+--- 0 |1 | 32.7 |32.8 | 1.00 0 |5 | 33.0 |33.8 | 1.02 0 | 10 | 34.1 |37.2 | 1.09 0 | 60 | 32.5 |40.2 | 1.24 0 | 70 | 47.6 |40.9 | 0.86 0 | 100 | 47.9 |41.9 | 0.87 0 | 1000 | 52.9 |49.0 | 0.93 0 |1 | 61.7 |62.4 | 1.01 9 | 10 | 38.4 |33.6 | 0.88 9 | 100 | 42.3 |39.6 | 0.94 9 | 1000 | 43.9 |42.2 | 0.96 9 |1 | 62.2 |60.7 | 0.98 65 | 70 | 42.4 |32.5 | 0.77 65 | 100 | 43.2 |37.5 | 0.87 65 | 1000 | 44.0 |42.3 | 0.96 65 |1 | 62.4 |61.9 | 0.99 (16 rows) FIFO tests: numkeep | numsnaps | unpatched | patched | ratio -+--+---+-+--- 0 |1 | 32.3 |34.2 | 1.06 0 |5 | 32.9 |32.2 | 0.98 0 | 10 | 35.5 |40.2 | 1.13
Re: POC: postgres_fdw insert batching
OK, pushed after a little bit of additional polishing (mostly comments). Thanks everyone! -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
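For anyone wanting to try the committed feature, a minimal usage sketch (the server and table names are made up; batch_size is the option this patch adds):

-- Enable batched inserts for a whole foreign server, or per foreign table.
ALTER SERVER loopback OPTIONS (ADD batch_size '10');
ALTER FOREIGN TABLE ft1 OPTIONS (ADD batch_size '100');

-- Inserts into ft1 are now shipped to the remote server in batches of up to
-- 100 rows at a time instead of a single row per round trip.
INSERT INTO ft1 SELECT i, i FROM generate_series(1, 1000) AS i;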
Re: [PATCH 1/1] Fix detection of pwritev support for OSX.
James Hilliard writes: > On Tue, Jan 19, 2021 at 6:37 PM Tom Lane wrote: >> I've found no direct means to control the >> SDK path at all, but so far it appears that "xcrun --show-sdk-path" >> agrees with the compiler's default -isysroot path as seen in the >> compiler's -v output. I suspect that this isn't coincidental, >> but reflects xcrun actually being used in the compiler launch >> process. If it were to flip over to using a IOS SDK, that would >> mean that bare "cc" would generate nonfunctional executables, >> which just about any onlooker would agree is broken. > So there's some more weirdness involved here, whether or not you > have the command line install seems to affect the output of the > "xcrun --show-sdk-path" command, but not the > "xcrun --sdk macosx --show-sdk-path" command. Yeah, that's what we discovered in the other thread. It seems that with "--sdk macosx" you'll always get a pointer to the (solitary) SDK under /Applications/Xcode.app, but with the short "xcrun --show-sdk-path" command you might get either that or a pointer to something under /Library/Developer/CommandLineTools. I now believe what is actually happening with the short command is that it's iterating through the available SDKs (according to some not very clear search path) and picking the first one it finds that matches the host system version. That matches the ktrace evidence that shows it reading the SDKSettings.plist file in each SDK directory. The fact that it can seize on either an actual directory or an equivalent symlink might be due to chance ordering of directory entries. (It'd be interesting to see "ls -f" output for your /Library/Developer/CommandLineTools/SDKs directory ... though if you've been experimenting with deinstall/reinstall, there's no reason to suppose the entry order is still the same.) I'm not sure that the case of not having the "command line tools" installed is interesting for our purposes. AFAIK you have to have that in order to have access to required tools like bison and gmake. (That reminds me, I was intending to add something to our docs about how-to-build-from-source to say that you need to install those.) > Note that with my patch the binaries will always be compatible with the > host system by default, even if the SDK is capable of producing binaries > that are incompatible so building postgres works with and without the > command line tools SDK. Yeah. I don't see that as a benefit actually. Adding the -no_weak_imports linker switch (or the other one you're suggesting) means that you *cannot* cross-compile for a newer macOS version, even if you set PG_SYSROOT and/or MACOSX_DEPLOYMENT_TARGET with the intention of doing so. You'll still get a build that reflects the set of kernel calls available on the host system. Admittedly, this is a case that's not likely to be of interest to very many people, but I don't see why a method with that restriction is superior to picking a default SDK that matches the host system (and can be overridden). > So I think "xcrun --sdk macosx --show-sdk-path" is probably preferable > but either should work as long as we can properly detect deployment > target symbol availability, regardless this SDK sysroot selection issue is > effectively an entirely different issue from the feature detection not > properly > respecting the configured deployment target. No, I think it's pretty much equivalent. If we pick the right SDK then we'll get the build we want. regards, tom lane
Re: Allow matching whole DN from a client certificate
On Wed, 2021-01-20 at 19:07 +, Jacob Champion wrote:
> I think you'll want to be careful to specify the format as much as
> possible, both to make sure that other backend TLS implementations can
> actually use the same escaping system and to ensure that user regexes
> don't suddenly start matching different things at some point in the
> future.

Along those lines: the current implementation doesn't escape commas in fields, which means you can inject them to force a bad regex match. For instance, when using the usermap that's in the patch:

    dn   "/^.*OU=Testing,.*$"   username

if I create a certificate with the Organizational Unit name "Testing, or something", then that will also match.

Switching to RFC 2253/4514 quoting fixes comma injection (and reverses the order of the RDNs, which requires a change to the regex patterns). But I think that the regex as supplied still isn't strong enough to prevent problems. For example, the equals sign isn't a delimiter and therefore isn't quoted. So if I get my CA to sign a certificate with some arbitrary field value of "HEY YOU=Testing", then that will also match the above usermap. You'd need to write the regex with extreme attention to detail and a full understanding of the escaping scheme to get around that -- assuming that the scheme is generally parsable with regexes to begin with.

> I'm going to test this patch with some UTF-8 DNs later today; I'll share my
> findings.

UTF-8 has the opposite issue; it's escaped in a way that makes it unusable in a regex match. For example, say I have a (simple for the sake of example, but broken as noted above) usermap of

    dn   "/^CN=([^,]*).*$"   \1

which is supposed to emulate the functionality of the "clientname=CN" mode, and two users named "postgres" and "οδυσσέας". The "CN=postgres" user will work just fine, but the UTF-8 CN of "οδυσσέας" will be escaped into "\U03BF\U03B4\U03C5\U03C3\U03C3\U03AD\U03B1\U03C2" and fail to match the internal user. (I'm not seeing an RFC describe the "\U" escaping scheme; maybe it's OpenSSL-specific?)

--Jacob
Re: [PATCH 1/1] Fix detection of pwritev support for OSX.
On Wed, Jan 20, 2021 at 4:07 PM Tom Lane wrote: > > James Hilliard writes: > > On Tue, Jan 19, 2021 at 6:37 PM Tom Lane wrote: > >> I've found no direct means to control the > >> SDK path at all, but so far it appears that "xcrun --show-sdk-path" > >> agrees with the compiler's default -isysroot path as seen in the > >> compiler's -v output. I suspect that this isn't coincidental, > >> but reflects xcrun actually being used in the compiler launch > >> process. If it were to flip over to using a IOS SDK, that would > >> mean that bare "cc" would generate nonfunctional executables, > >> which just about any onlooker would agree is broken. > > > So there's some more weirdness involved here, whether or not you > > have the command line install seems to affect the output of the > > "xcrun --show-sdk-path" command, but not the > > "xcrun --sdk macosx --show-sdk-path" command. > > Yeah, that's what we discovered in the other thread. It seems that > with "--sdk macosx" you'll always get a pointer to the (solitary) > SDK under /Applications/Xcode.app, but with the short "xcrun > --show-sdk-path" command you might get either that or a pointer to > something under /Library/Developer/CommandLineTools. > > I now believe what is actually happening with the short command is > that it's iterating through the available SDKs (according to some not > very clear search path) and picking the first one it finds that > matches the host system version. That matches the ktrace evidence > that shows it reading the SDKSettings.plist file in each SDK > directory. The fact that it can seize on either an actual directory > or an equivalent symlink might be due to chance ordering of directory > entries. (It'd be interesting to see "ls -f" output for your > /Library/Developer/CommandLineTools/SDKs directory ... though if Well at the moment I completely deleted that directory...and the build works fine with my patch still. > you've been experimenting with deinstall/reinstall, there's no > reason to suppose the entry order is still the same.) > > I'm not sure that the case of not having the "command line tools" > installed is interesting for our purposes. AFAIK you have to have > that in order to have access to required tools like bison and gmake. > (That reminds me, I was intending to add something to our docs > about how-to-build-from-source to say that you need to install those.) Yeah, not 100% sure but I was able to build just fine after deleting my command line tools. I think it just switched to using the normal SDK toolchain, I guess that's the fallback logic doing that. It would be pretty annoying to have to install an outdated SDK just to build postgres for no other reason than the autoconf feature detection being broken. > > > Note that with my patch the binaries will always be compatible with the > > host system by default, even if the SDK is capable of producing binaries > > that are incompatible so building postgres works with and without the > > command line tools SDK. > > Yeah. I don't see that as a benefit actually. Adding the > -no_weak_imports linker switch (or the other one you're suggesting) > means that you *cannot* cross-compile for a newer macOS version, > even if you set PG_SYSROOT and/or MACOSX_DEPLOYMENT_TARGET with the > intention of doing so. Best I can tell this isn't true, I was able to cross compile for a newer MACOSX_DEPLOYMENT_TARGET than my build host just fine. The binary fails with a "Symbol not found: _pwritev" error when I try to run it on the system that built it. 
In regards to the -no_weak_imports switch...that is something different from my understanding as it just strips the weak imports forcing the fallback code paths to be taken instead, essentially functioning as if the weak symbols are never available. It's largely separate from the deployment target from my understanding as weak symbols are feature that lets you use newer syscalls while still providing backwards compatible fallbacks for older systems. > You'll still get a build that reflects the set > of kernel calls available on the host system. Admittedly, this is a > case that's not likely to be of interest to very many people, but > I don't see why a method with that restriction is superior to picking > a default SDK that matches the host system (and can be overridden). But to fix the build when using a newer SDK overriding the SDK location does not help, you would have to override the broken feature detection. > > > So I think "xcrun --sdk macosx --show-sdk-path" is probably preferable > > but either should work as long as we can properly detect deployment > > target symbol availability, regardless this SDK sysroot selection issue is > > effectively an entirely different issue from the feature detection not > > properly > > respecting the configured deployment target. > > No, I think it's pretty much equivalent. If we pick the right SDK > then we'll get the build we want. Generally any recent SDK installed should
Re: POC: postgres_fdw insert batching
Hmm, seems that florican doesn't like this :-( https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-20%2023%3A08%3A15 It's a i386 machine running FreeBSD, so not sure what exactly it's picky about. But when I tried running this under valgrind, I get some strange failures in the new chunk in ExecInitModifyTable: /* * Determine if the FDW supports batch insert and determine the batch * size (a FDW may support batching, but it may be disabled for the * server/table). */ if (!resultRelInfo->ri_usesFdwDirectModify && operation == CMD_INSERT && resultRelInfo->ri_FdwRoutine != NULL && resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize && resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert) resultRelInfo->ri_BatchSize = resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo); else resultRelInfo->ri_BatchSize = 1; Assert(resultRelInfo->ri_BatchSize >= 1); It seems as if the resultRelInfo is not initialized, or something like that. I wouldn't be surprised if the 32-bit machine was pickier and failing because of that. A sample of the valgrind log is attached. It's pretty much just repetitions of these three reports. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company ==186005== Invalid read of size 1 ==186005==at 0x759C12: ExecInitModifyTable (nodeModifyTable.c:2801) ==186005==by 0x720B35: ExecInitNode (execProcnode.c:174) ==186005==by 0x716F99: InitPlan (execMain.c:939) ==186005==by 0x715F47: standard_ExecutorStart (execMain.c:266) ==186005==by 0x715CD3: ExecutorStart (execMain.c:148) ==186005==by 0x9495B1: ProcessQuery (pquery.c:155) ==186005==by 0x94AF24: PortalRunMulti (pquery.c:1267) ==186005==by 0x94A4C8: PortalRun (pquery.c:779) ==186005==by 0x94438E: exec_simple_query (postgres.c:1240) ==186005==by 0x9485F5: PostgresMain (postgres.c:4394) ==186005==by 0x88D1D4: BackendRun (postmaster.c:4484) ==186005==by 0x88CB53: BackendStartup (postmaster.c:4206) ==186005==by 0x889205: ServerLoop (postmaster.c:1730) ==186005==by 0x888AD0: PostmasterMain (postmaster.c:1402) ==186005==by 0x78D466: main (main.c:209) ==186005== Address 0xe594f78 is 1,864 bytes inside a block of size 8,192 alloc'd ==186005==at 0x483A809: malloc (vg_replace_malloc.c:307) ==186005==by 0xAFEFE6: AllocSetContextCreateInternal (aset.c:468) ==186005==by 0x7298F9: CreateExprContextInternal (execUtils.c:252) ==186005==by 0x7299DD: CreateExprContext (execUtils.c:302) ==186005==by 0x729C5E: ExecAssignExprContext (execUtils.c:481) ==186005==by 0x75D24D: ExecInitSeqScan (nodeSeqscan.c:147) ==186005==by 0x720BEF: ExecInitNode (execProcnode.c:207) ==186005==by 0x716F99: InitPlan (execMain.c:939) ==186005==by 0x715F47: standard_ExecutorStart (execMain.c:266) ==186005==by 0x715CD3: ExecutorStart (execMain.c:148) ==186005==by 0x949E64: PortalStart (pquery.c:505) ==186005==by 0x9442A2: exec_simple_query (postgres.c:1201) ==186005==by 0x9485F5: PostgresMain (postgres.c:4394) ==186005==by 0x88D1D4: BackendRun (postmaster.c:4484) ==186005==by 0x88CB53: BackendStartup (postmaster.c:4206) ==186005==by 0x889205: ServerLoop (postmaster.c:1730) ==186005==by 0x888AD0: PostmasterMain (postmaster.c:1402) ==186005==by 0x78D466: main (main.c:209) ==186005== { Memcheck:Addr1 fun:ExecInitModifyTable fun:ExecInitNode fun:InitPlan fun:standard_ExecutorStart fun:ExecutorStart fun:ProcessQuery fun:PortalRunMulti fun:PortalRun fun:exec_simple_query fun:PostgresMain fun:BackendRun fun:BackendStartup fun:ServerLoop fun:PostmasterMain fun:main } ==186005== Invalid write of size 
4 ==186005==at 0x759C74: ExecInitModifyTable (nodeModifyTable.c:2809) ==186005==by 0x720B35: ExecInitNode (execProcnode.c:174) ==186005==by 0x716F99: InitPlan (execMain.c:939) ==186005==by 0x715F47: standard_ExecutorStart (execMain.c:266) ==186005==by 0x715CD3: ExecutorStart (execMain.c:148) ==186005==by 0x9495B1: ProcessQuery (pquery.c:155) ==186005==by 0x94AF24: PortalRunMulti (pquery.c:1267) ==186005==by 0x94A4C8: PortalRun (pquery.c:779) ==186005==by 0x94438E: exec_simple_query (postgres.c:1240) ==186005==by 0x9485F5: PostgresMain (postgres.c:4394) ==186005==by 0x88D1D4: BackendRun (postmaster.c:4484) ==186005==by 0x88CB53: BackendStartup (postmaster.c:4206) ==186005==by 0x889205: ServerLoop (postmaster.c:1730) ==186005==by 0x888AD0: PostmasterMain (postmaster.c:1402) ==186005==by 0x78D466: main (main.c:209) ==186005== Address 0xe594f80 is 1,872 bytes inside a block of size 8,192 alloc'd ==186005==at 0x483A809: malloc (vg_replace_malloc.c:307) ==186005==by 0xAFEFE6: AllocSetContextCreateInternal (aset.c:468) ==186005==by 0x7298F9: CreateExprContextInternal (execUtils.c:252) ==186005==by
Re: list of extended statistics on psql
Hi Tomas and hackers,

On 2021/01/21 7:00, Tomas Vondra wrote:

I created patches and my test results on PG10, 11, 12, and 14 are fine.

0001:
- Fix query to use pg_statistic_ext only
- Replace statuses "required" and "built" with "defined"
- Remove the size columns
- Fix document
- Add schema name as a filter condition on the query

0002:
- Fix all results of \dX
- Add new testcase by non-superuser

Please find attached files. :-D

Thanks, I've pushed this. I had to tweak the regression tests a bit, for two reasons:

1) to change user in regression tests, don't use \connect, but SET ROLE and RESET ROLE

2) roles in regression tests should use names with regress_ prefix

Thanks for reviewing many times and committing the feature! I understood 1) and 2), and I'll keep them in mind for the next patch I develop.

Also, if possible, could you add Justin to the commit message as a reviewer? I revised the documentation partly based on his comments.

Finally, as extended statistics get used more widely, this feature becomes more useful. I hope it helps DBAs. :-D

Thanks,
Tatsuro Yamada
Re: POC: postgres_fdw insert batching
Tomas Vondra writes: > OK, pushed after a little bit of additional polishing (mostly comments). > Thanks everyone! florican reports this is seriously broken on 32-bit hardware: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-20%2023%3A08%3A15 First guess is incorrect memory-allocation computations ... regards, tom lane
Re: POC: postgres_fdw insert batching
Hi, The assignment to resultRelInfo is done when junk_filter_needed is true: if (junk_filter_needed) { resultRelInfo = mtstate->resultRelInfo; Should the code for determining batch size access mtstate->resultRelInfo directly ? diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c index 9c36860704..a6a814454d 100644 --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -2798,17 +2798,17 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags) * size (a FDW may support batching, but it may be disabled for the * server/table). */ -if (!resultRelInfo->ri_usesFdwDirectModify && +if (!mtstate->resultRelInfo->ri_usesFdwDirectModify && operation == CMD_INSERT && -resultRelInfo->ri_FdwRoutine != NULL && -resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize && -resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert) -resultRelInfo->ri_BatchSize = - resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo); +mtstate->resultRelInfo->ri_FdwRoutine != NULL && +mtstate->resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize && +mtstate->resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert) +mtstate->resultRelInfo->ri_BatchSize = + mtstate->resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(mtstate->resultRelInfo); else -resultRelInfo->ri_BatchSize = 1; +mtstate->resultRelInfo->ri_BatchSize = 1; -Assert(resultRelInfo->ri_BatchSize >= 1); +Assert(mtstate->resultRelInfo->ri_BatchSize >= 1); /* * Lastly, if this is not the primary (canSetTag) ModifyTable node, add it Cheers On Wed, Jan 20, 2021 at 3:52 PM Tomas Vondra wrote: > Hmm, seems that florican doesn't like this :-( > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-20%2023%3A08%3A15 > > It's a i386 machine running FreeBSD, so not sure what exactly it's picky > about. But when I tried running this under valgrind, I get some strange > failures in the new chunk in ExecInitModifyTable: > >/* > * Determine if the FDW supports batch insert and determine the batch > * size (a FDW may support batching, but it may be disabled for the > * server/table). > */ >if (!resultRelInfo->ri_usesFdwDirectModify && >operation == CMD_INSERT && >resultRelInfo->ri_FdwRoutine != NULL && >resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize && >resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert) >resultRelInfo->ri_BatchSize = > > resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo); >else >resultRelInfo->ri_BatchSize = 1; > >Assert(resultRelInfo->ri_BatchSize >= 1); > > It seems as if the resultRelInfo is not initialized, or something like > that. I wouldn't be surprised if the 32-bit machine was pickier and > failing because of that. > > A sample of the valgrind log is attached. It's pretty much just > repetitions of these three reports. > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
Heap's backwards scan scans the incorrect pages with heap_setscanlimits()
Hackers, It looks like both heapgettup() and heapgettup_pagemode() are coded incorrectly when setting the page to start the scan on for a backwards scan when heap_setscanlimits() has been used. It looks like the code was not updated during 7516f5259. The current code is: /* start from last page of the scan */ if (scan->rs_startblock > 0) page = scan->rs_startblock - 1; else page = scan->rs_nblocks - 1; Where rs_startblock is either the sync scan start location, or the start page set by heap_setscanlimits(). rs_nblocks is the number of blocks in the relation. Let's say we have a 100 block relation and we want to scan blocks 10 to 30 in a backwards order. We'll do heap_setscanlimits(scan, 10, 21); to indicate that we want to scan 21 blocks starting at page 10 and finishing after scanning page 30. What the code above does is wrong. Since rs_startblock is > 0 we'll execute: page = scan->rs_nblocks - 1; i.e. 99. then proceed to scan blocks all blocks down to 78. Oops. Not quite the 10 to 30 that we asked for. Now, it does not appear that there are any live bugs here, in core at least. The only usage I see of heap_setscanlimits() is in heapam_index_build_range_scan() to which I see the scan is a forward scan. I only noticed the bug as I'm in the middle of fixing up [1] to implement backwards TID Range scans. Proposed patch attached. Since this is not a live bug, is it worth a backpatch? I guess some extensions could suffer from this, I'm just not sure how likely that is as if anyone is doing backwards scanning with a setscanlimits set, then they'd surely have noticed this already!? David [1] https://postgr.es/m/CAMyN-kB-nFTkF=va_jpwfno08s0d-yk0f741s2b7ldmyai8...@mail.gmail.com diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index faffbb1865..ddd214b7af 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -603,11 +603,15 @@ heapgettup(HeapScanDesc scan, * forward scanners. */ scan->rs_base.rs_flags &= ~SO_ALLOW_SYNC; - /* start from last page of the scan */ - if (scan->rs_startblock > 0) - page = scan->rs_startblock - 1; + + /* +* Start from last page of the scan. Ensure we take into account +* rs_numblocks if it's been adjusted by heap_setscanlimits(). +*/ + if (scan->rs_numblocks != InvalidBlockNumber) + page = (scan->rs_startblock + scan->rs_numblocks - 1) % scan->rs_nblocks; else - page = scan->rs_nblocks - 1; + page = (scan->rs_startblock + scan->rs_nblocks - 1) % scan->rs_nblocks; heapgetpage((TableScanDesc) scan, page); } else @@ -918,11 +922,15 @@ heapgettup_pagemode(HeapScanDesc scan, * forward scanners. */ scan->rs_base.rs_flags &= ~SO_ALLOW_SYNC; - /* start from last page of the scan */ - if (scan->rs_startblock > 0) - page = scan->rs_startblock - 1; + + /* +* Start from last page of the scan. Ensure we take into account +* rs_numblocks if it's been adjusted by heap_setscanlimits(). +*/ + if (scan->rs_numblocks != InvalidBlockNumber) + page = (scan->rs_startblock + scan->rs_numblocks - 1) % scan->rs_nblocks; else - page = scan->rs_nblocks - 1; + page = (scan->rs_startblock + scan->rs_nblocks - 1) % scan->rs_nblocks; heapgetpage((TableScanDesc) scan, page); } else
Re: POC: postgres_fdw insert batching
On 1/21/21 12:52 AM, Tomas Vondra wrote: Hmm, seems that florican doesn't like this :-( https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-20%2023%3A08%3A15 It's a i386 machine running FreeBSD, so not sure what exactly it's picky about. But when I tried running this under valgrind, I get some strange failures in the new chunk in ExecInitModifyTable: /* * Determine if the FDW supports batch insert and determine the batch * size (a FDW may support batching, but it may be disabled for the * server/table). */ if (!resultRelInfo->ri_usesFdwDirectModify && operation == CMD_INSERT && resultRelInfo->ri_FdwRoutine != NULL && resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize && resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert) resultRelInfo->ri_BatchSize = resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo); else resultRelInfo->ri_BatchSize = 1; Assert(resultRelInfo->ri_BatchSize >= 1); It seems as if the resultRelInfo is not initialized, or something like that. I wouldn't be surprised if the 32-bit machine was pickier and failing because of that. A sample of the valgrind log is attached. It's pretty much just repetitions of these three reports. OK, it's definitely accessing uninitialized memory, because the resultRelInfo (on line 2801, i.e. the "if" condition) looks like this: (gdb) p resultRelInfo $1 = (ResultRelInfo *) 0xe595988 (gdb) p *resultRelInfo $2 = {type = 2139062142, ri_RangeTableIndex = 2139062143, ri_RelationDesc = 0x7f7f7f7f7f7f7f7f, ri_NumIndices = 2139062143, ri_IndexRelationDescs = 0x7f7f7f7f7f7f7f7f, ri_IndexRelationInfo = 0x7f7f7f7f7f7f7f7f, ri_TrigDesc = 0x7f7f7f7f7f7f7f7f, ri_TrigFunctions = 0x7f7f7f7f7f7f7f7f, ri_TrigWhenExprs = 0x7f7f7f7f7f7f7f7f, ri_TrigInstrument = 0x7f7f7f7f7f7f7f7f, ri_ReturningSlot = 0x7f7f7f7f7f7f7f7f, ri_TrigOldSlot = 0x7f7f7f7f7f7f7f7f, ri_TrigNewSlot = 0x7f7f7f7f7f7f7f7f, ri_FdwRoutine = 0x7f7f7f7f7f7f7f7f, ri_FdwState = 0x7f7f7f7f7f7f7f7f, ri_usesFdwDirectModify = 127, ri_NumSlots = 2139062143, ri_BatchSize = 2139062143, ri_Slots = 0x7f7f7f7f7f7f7f7f, ri_PlanSlots = 0x7f7f7f7f7f7f7f7f, ri_WithCheckOptions = 0x7f7f7f7f7f7f7f7f, ri_WithCheckOptionExprs = 0x7f7f7f7f7f7f7f7f, ri_ConstraintExprs = 0x7f7f7f7f7f7f7f7f, ri_GeneratedExprs = 0x7f7f7f7f7f7f7f7f, ri_NumGeneratedNeeded = 2139062143, ri_junkFilter = 0x7f7f7f7f7f7f7f7f, ri_returningList = 0x7f7f7f7f7f7f7f7f, ri_projectReturning = 0x7f7f7f7f7f7f7f7f, ri_onConflictArbiterIndexes = 0x7f7f7f7f7f7f7f7f, ri_onConflict = 0x7f7f7f7f7f7f7f7f, ri_PartitionCheckExpr = 0x7f7f7f7f7f7f7f7f, ri_PartitionRoot = 0x7f7f7f7f7f7f7f7f, ri_RootToPartitionMap = 0x8, ri_PartitionTupleSlot = 0x8, ri_ChildToRootMap = 0xe5952b0, ri_CopyMultiInsertBuffer = 0xe596740} (gdb) I may be wrong, but the most likely explanation seems to be this is due to the junk filter initialization, which simply moves past the end of the mtstate->resultRelInfo array. It kinda seems the GetForeignModifyBatchSize call should happen before that block. The attached patch fixes this for me (i.e. regression tests pass with no valgrind reports. Or did I get that wrong? 
-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c index 9c36860704..2ac4999dc8 100644 --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -2672,6 +2672,23 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags) } } + /* + * Determine if the FDW supports batch insert and determine the batch + * size (a FDW may support batching, but it may be disabled for the + * server/table). + */ + if (!resultRelInfo->ri_usesFdwDirectModify && + operation == CMD_INSERT && + resultRelInfo->ri_FdwRoutine != NULL && + resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize && + resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert) + resultRelInfo->ri_BatchSize = + resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo); + else + resultRelInfo->ri_BatchSize = 1; + + Assert(resultRelInfo->ri_BatchSize >= 1); + /* select first subplan */ mtstate->mt_whichplan = 0; subplan = (Plan *) linitial(node->plans); @@ -2793,23 +2810,6 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags) } } - /* - * Determine if the FDW supports batch insert and determine the batch - * size (a FDW may support batching, but it may be disabled for the - * server/table). - */ - if (!resultRelInfo->ri_usesFdwDirectModify && - operation == CMD_INSERT && - resultRelInfo->ri_FdwRoutine != NULL && - resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize && - resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert) - resultRelInfo->ri_BatchSize = - resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo
Re: Printing LSN made easy
At Wed, 20 Jan 2021 16:40:59 +0900, Michael Paquier wrote in > On Wed, Jan 20, 2021 at 07:25:37AM +0100, Peter Eisentraut wrote: > > It looks like we are not getting any consensus on this approach. One > > reduced version I would consider is just the second part, so you'd write > > something like > > > > snprintf(lsnchar, sizeof(lsnchar), "%X/%X", > > LSN_FORMAT_ARGS(lsn)); > > > > This would still reduce notational complexity quite a bit but avoid any > > funny business with the format strings. > > That seems reasonable to me. So +1. That seems in the good balance. +1, too. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
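A minimal sketch of the reduced proposal, under the assumption that the macro simply splits a 64-bit LSN into the two 32-bit halves that the existing "%X/%X" convention expects:

    /* assumed definition; the point is that the format string stays "%X/%X" */
    #define LSN_FORMAT_ARGS(lsn)  ((uint32) ((lsn) >> 32)), ((uint32) (lsn))

    char    lsnchar[64];

    snprintf(lsnchar, sizeof(lsnchar), "%X/%X", LSN_FORMAT_ARGS(lsn));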
Re: POC: postgres_fdw insert batching
On 1/21/21 12:59 AM, Tom Lane wrote: Tomas Vondra writes: OK, pushed after a little bit of additional polishing (mostly comments). Thanks everyone! florican reports this is seriously broken on 32-bit hardware: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-20%2023%3A08%3A15 First guess is incorrect memory-allocation computations ... I know, although it seems more like an access to uninitialized memory. I've already posted a patch that resolves that for me on 64-bit (per valgrind, I suppose it's the same issue). I'm working on reproducing it on 32-bit; hopefully it won't take long. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: POC: postgres_fdw insert batching
On 1/21/21 1:17 AM, Zhihong Yu wrote: Hi, The assignment to resultRelInfo is done when junk_filter_needed is true: if (junk_filter_needed) { resultRelInfo = mtstate->resultRelInfo; Should the code for determining batch size access mtstate->resultRelInfo directly ? IMO the issue is that code iterates over all plans and moves to the next for each one: resultRelInfo++; so it ends up pointing past the last element, hence the failures. So yeah, either the code needs to move before the loop (per my patch), or we need to access mtstate->resultRelInfo directly. I'm pretty amazed this did not crash during any of the many regression runs I did recently. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
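For clarity, the shape of the code being described is roughly this (simplified; names follow the snippets quoted above):

    /* junk-filter initialization walks the per-subplan result rels ... */
    resultRelInfo = mtstate->resultRelInfo;
    foreach(l, subplans)
    {
        /* ... set up the junk filter for this result rel ... */
        resultRelInfo++;
    }

    /*
     * ... so after the loop resultRelInfo points one element past the end
     * of the mtstate->resultRelInfo array, and the batch-size block must
     * not dereference it.
     */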
Re: POC: postgres_fdw insert batching
Hi, Tomas: In my opinion, my patch is a little better. Suppose one of the conditions in the if block changes in between the start of loop and the end of the loop: * Determine if the FDW supports batch insert and determine the batch * size (a FDW may support batching, but it may be disabled for the * server/table). My patch would reflect that change. I guess this was the reason the if / else block was placed there in the first place. Cheers On Wed, Jan 20, 2021 at 4:56 PM Tomas Vondra wrote: > > > On 1/21/21 1:17 AM, Zhihong Yu wrote: > > Hi, > > The assignment to resultRelInfo is done when junk_filter_needed is true: > > > > if (junk_filter_needed) > > { > > resultRelInfo = mtstate->resultRelInfo; > > > > Should the code for determining batch size access mtstate->resultRelInfo > > directly ? > > > > IMO the issue is that code iterates over all plans and moves to the next > for each one: > > resultRelInfo++; > > so it ends up pointing past the last element, hence the failures. So > yeah, either the code needs to move before the loop (per my patch), or > we need to access mtstate->resultRelInfo directly. > > I'm pretty amazed this did not crash during any of the many regression > runs I did recently. > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
Re: POC: postgres_fdw insert batching
On 1/21/21 2:02 AM, Zhihong Yu wrote: Hi, Tomas: In my opinion, my patch is a little better. Suppose one of the conditions in the if block changes in between the start of loop and the end of the loop: * Determine if the FDW supports batch insert and determine the batch * size (a FDW may support batching, but it may be disabled for the * server/table). My patch would reflect that change. I guess this was the reason the if / else block was placed there in the first place. But can it change? All the loop does is extracting junk attributes from the plans, it does not modify anything related to the batching. Or maybe I just don't understand what you mean. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: POC: postgres_fdw insert batching
Hi, Do we need to consider how this part of code inside ExecInitModifyTable() would evolve ? I think placing the compound condition toward the end of ExecInitModifyTable() is reasonable because it checks the latest information. Regards On Wed, Jan 20, 2021 at 5:11 PM Tomas Vondra wrote: > > > On 1/21/21 2:02 AM, Zhihong Yu wrote: > > Hi, Tomas: > > In my opinion, my patch is a little better. > > Suppose one of the conditions in the if block changes in between the > > start of loop and the end of the loop: > > > > * Determine if the FDW supports batch insert and determine the > batch > > * size (a FDW may support batching, but it may be disabled for the > > * server/table). > > > > My patch would reflect that change. I guess this was the reason the if / > > else block was placed there in the first place. > > > > But can it change? All the loop does is extracting junk attributes from > the plans, it does not modify anything related to the batching. Or maybe > I just don't understand what you mean. > > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
Re: POC: postgres_fdw insert batching
Tomas Vondra writes: > I may be wrong, but the most likely explanation seems to be this is due > to the junk filter initialization, which simply moves past the end of > the mtstate->resultRelInfo array. resultRelInfo is certainly pointing at garbage at that point. > It kinda seems the GetForeignModifyBatchSize call should happen before > that block. The attached patch fixes this for me (i.e. regression tests > pass with no valgrind reports. > Or did I get that wrong? Don't we need to initialize ri_BatchSize for *each* resultrelinfo, not merely the first one? That is, this new code needs to be somewhere inside a loop over the result rels. regards, tom lane
Re: POC: postgres_fdw insert batching
On Thu, Jan 21, 2021 at 9:56 AM Tomas Vondra wrote: > On 1/21/21 1:17 AM, Zhihong Yu wrote: > > Hi, > > The assignment to resultRelInfo is done when junk_filter_needed is true: > > > > if (junk_filter_needed) > > { > > resultRelInfo = mtstate->resultRelInfo; > > > > Should the code for determining batch size access mtstate->resultRelInfo > > directly ? > > > > IMO the issue is that code iterates over all plans and moves to the next > for each one: > > resultRelInfo++; > > so it ends up pointing past the last element, hence the failures. So > yeah, either the code needs to move before the loop (per my patch), or > we need to access mtstate->resultRelInfo directly. Accessing mtstate->resultRelInfo directly would do. The only constraint on where this block should be placed is that ri_projectReturning must be valid as of calling GetForeignModifyBatchSize(), as Tsunakawa-san pointed out upthread. So, after this block in ExecInitModifyTable: /* * Initialize RETURNING projections if needed. */ if (node->returningLists) { /* * Build a projection for each result rel. */ resultRelInfo = mtstate->resultRelInfo; foreach(l, node->returningLists) { List *rlist = (List *) lfirst(l); resultRelInfo->ri_returningList = rlist; resultRelInfo->ri_projectReturning = ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps, resultRelInfo->ri_RelationDesc->rd_att); resultRelInfo++; } } -- Amit Langote EDB: http://www.enterprisedb.com
Re: strange error reporting
Robert Haas writes: >>> Maybe it would be better if it said: >>> connection to database at socket "/tmp/.s.PGSQL.5432" failed: FATAL: >>> database "rhaas" does not exist >> I'd be inclined to spell it "connection to server at ... failed", >> but that sort of wording is surely also possible. > "connection to server" rather than "connection to database" works for > me; in fact, I think I like it slightly better. If I don't hear any other opinions, I'll change these messages to "connection to server at socket \"%s\" failed: " "connection to server at \"%s\" (%s), port %s failed: " (or maybe "server on socket"? "at" sounds right for the IP address case, but it feels a little off in the socket pathname case.) regards, tom lane
Re: Printing backtrace of postgres processes
On Wed, 20 Jan 2021 at 01:31, Robert Haas wrote: > On Sat, Jan 16, 2021 at 3:21 PM Tom Lane wrote: > > I'd argue that backtraces for those processes aren't really essential, > > and indeed that trying to make the syslogger report its own backtrace > > is damn dangerous. > > I agree. Ideally I'd like to be able to use the same mechanism > everywhere and include those processes too, but surely regular > backends and parallel workers are going to be the things that come up > most often. > > > (Personally, I think this whole patch fails the safety-vs-usefulness > > tradeoff, but I expect I'll get shouted down.) > > You and I are frequently on opposite sides of these kinds of > questions, but I think this is a closer call than many cases. I'm > convinced that it's useful, but I'm not sure whether it's safe. On the > usefulness side, backtraces are often the only way to troubleshoot > problems that occur on production systems. I wish we had better > logging and tracing tools instead of having to ask for this sort of > thing, but we don't. Agreed. In theory we should be able to do this sort of thing using external trace and diagnostic tools like perf, systemtap, etc. In practice, these tools tend to be quite version-sensitive and hard to get right without multiple rounds of back-and-forth to deal with specifics of the site's setup, installed debuginfo or lack thereof, specific tool versions, etc. It's quite common to have to fall back on attaching gdb with a breakpoint on a function in the export symbol table (so it works w/o debuginfo), saving a core, and then analysing the core on a separate system on which debuginfo is available for all the loaded modules. It's a major pain. The ability to get a basic bt from within Pg is strongly desirable. IIRC gdb's basic unwinder works without external debuginfo, if not especially well. libunwind produces much better results, but that didn't pass the extra-dependency bar when backtracing support was introduced to core postgres. On a side note, to help get better diagnostics I've also been meaning to look into building --enable-debug with -ggdb3 so we can embed macro info, and using dwz to deduplicate+compress the debuginfo so we can encourage people to install it by default on production. I also want to start exporting pointers to all the important data symbols for diagnostic use, even if we do so in a separate ELF section just for debug use.
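As a rough sketch of what the in-server mechanism amounts to, using the same libc facilities core postgres already uses for backtrace_functions (error handling trimmed, and the function name here is purely illustrative):

    #include <execinfo.h>

    static void
    emit_my_backtrace(void)
    {
        void   *frames[100];
        int     nframes = backtrace(frames, lengthof(frames));
        char  **symbols = backtrace_symbols(frames, nframes);

        if (symbols != NULL)
        {
            /* one log line per frame; addresses only, no debuginfo needed */
            for (int i = 0; i < nframes; i++)
                elog(LOG, "%s", symbols[i]);
            free(symbols);
        }
    }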
RE: POC: postgres_fdw insert batching
From: Zhihong Yu > Do we need to consider how this part of code inside ExecInitModifyTable() > would evolve ? > I think placing the compound condition toward the end of > ExecInitModifyTable() is reasonable because it checks the latest information. +1 for Zaihong-san's idea. But instead of rewriting every resultRelInfo to mtstate->resultRelInfo, which makes it a bit harder to read, I'd like to suggest just adding "resultRelInfo = mtstate->resultRelInfo;" immediately before the if block. Thanks a lot, all, for helping to solve the problem quickly! Regards Takayuki Tsunakawa
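In other words, the suggestion is just a one-line reset ahead of the existing block, something like the following sketch:

    /*
     * Re-point at the first (and, today, only) result rel immediately before
     * the batch-size block, so the block itself can stay where it is.
     */
    resultRelInfo = mtstate->resultRelInfo;
    /* ... existing "Determine if the FDW supports batch insert" block ... */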
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On Wed, Jan 20, 2021 at 03:34:39PM -0300, Alvaro Herrera wrote:
> On 2021-Jan-20, Alexey Kondratov wrote:
>> Ugh, forgot to attach the patches. Here they are.
>
> Yeah, looks reasonable.

Patch 0002 still has a whole set of issues as I pointed out a couple of
hours ago, but if we agree on 0001 as being useful if done independently,
I'd rather get that done first. This way or just merging both things in a
single commit is not a big deal seeing the amount of code, so I am fine
with any approach. It may be possible that 0001 requires more changes
depending on the work to-be-done for 0002 though?

>> +    /* No work if no change in tablespace. */
>> +    oldTablespaceOid = rd_rel->reltablespace;
>> +    if (tablespaceOid != oldTablespaceOid ||
>> +        (tablespaceOid == MyDatabaseTableSpace && OidIsValid(oldTablespaceOid)))
>> +    {
>> +        /* Update the pg_class row. */
>> +        rd_rel->reltablespace = (tablespaceOid == MyDatabaseTableSpace) ?
>> +            InvalidOid : tablespaceOid;
>> +        CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);
>> +
>> +        changed = true;
>> +    }
>> +
>> +    if (changed)
>> +        /* Record dependency on tablespace */
>> +        changeDependencyOnTablespace(RelationRelationId,
>> +                                     reloid, rd_rel->reltablespace);
>
> Why have a separate "if (changed)" block here instead of merging with
> the above?

Yep.

+    if (SetRelTablespace(reloid, newTableSpace))
+        /* Make sure the reltablespace change is visible */
+        CommandCounterIncrement();

At quick glance, I am wondering why you just don't do a CCI within
SetRelTablespace().
--
Michael
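For concreteness, folding the dependency update into the preceding branch, as Alvaro suggests, could look roughly like this (a sketch against the quoted patch, not a tested change):

    /* No work if no change in tablespace. */
    oldTablespaceOid = rd_rel->reltablespace;
    if (tablespaceOid != oldTablespaceOid ||
        (tablespaceOid == MyDatabaseTableSpace && OidIsValid(oldTablespaceOid)))
    {
        /* Update the pg_class row. */
        rd_rel->reltablespace = (tablespaceOid == MyDatabaseTableSpace) ?
            InvalidOid : tablespaceOid;
        CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);

        /* Record dependency on tablespace in the same branch. */
        changeDependencyOnTablespace(RelationRelationId,
                                     reloid, rd_rel->reltablespace);

        changed = true;
    }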
Re: POC: postgres_fdw insert batching
On 1/21/21 2:24 AM, Amit Langote wrote:
> On Thu, Jan 21, 2021 at 9:56 AM Tomas Vondra wrote:
>> On 1/21/21 1:17 AM, Zhihong Yu wrote:
>>> Hi,
>>> The assignment to resultRelInfo is done when junk_filter_needed is true:
>>>
>>>     if (junk_filter_needed)
>>>     {
>>>         resultRelInfo = mtstate->resultRelInfo;
>>>
>>> Should the code for determining batch size access mtstate->resultRelInfo
>>> directly ?
>>
>> IMO the issue is that code iterates over all plans and moves to the next
>> for each one:
>>
>>     resultRelInfo++;
>>
>> so it ends up pointing past the last element, hence the failures. So
>> yeah, either the code needs to move before the loop (per my patch), or
>> we need to access mtstate->resultRelInfo directly.
>
> Accessing mtstate->resultRelInfo directly would do. The only
> constraint on where this block should be placed is that
> ri_projectReturning must be valid as of calling
> GetForeignModifyBatchSize(), as Tsunakawa-san pointed out upthread.
> So, after this block in ExecInitModifyTable:
>
>     /*
>      * Initialize RETURNING projections if needed.
>      */
>     if (node->returningLists)
>     {
>         /*
>          * Build a projection for each result rel.
>          */
>         resultRelInfo = mtstate->resultRelInfo;
>         foreach(l, node->returningLists)
>         {
>             List       *rlist = (List *) lfirst(l);
>
>             resultRelInfo->ri_returningList = rlist;
>             resultRelInfo->ri_projectReturning =
>                 ExecBuildProjectionInfo(rlist, econtext, slot,
>                                         &mtstate->ps,
>                                         resultRelInfo->ri_RelationDesc->rd_att);
>             resultRelInfo++;
>         }
>     }

Right. But I think Tom is right this should initialize ri_BatchSize for
all the resultRelInfo elements, not just the first one. Per the attached
patch, which resolves the issue both on x86_64 and armv7l for me.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 9c36860704..10febcae8a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2798,17 +2798,23 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * size (a FDW may support batching, but it may be disabled for the
 	 * server/table).
 	 */
-	if (!resultRelInfo->ri_usesFdwDirectModify &&
-		operation == CMD_INSERT &&
-		resultRelInfo->ri_FdwRoutine != NULL &&
-		resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
-		resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert)
-		resultRelInfo->ri_BatchSize =
-			resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo);
-	else
-		resultRelInfo->ri_BatchSize = 1;
+	resultRelInfo = mtstate->resultRelInfo;
+	for (i = 0; i < nplans; i++)
+	{
+		if (!resultRelInfo->ri_usesFdwDirectModify &&
+			operation == CMD_INSERT &&
+			resultRelInfo->ri_FdwRoutine != NULL &&
+			resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
+			resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert)
+			resultRelInfo->ri_BatchSize =
+				resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo);
+		else
+			resultRelInfo->ri_BatchSize = 1;
+
+		Assert(resultRelInfo->ri_BatchSize >= 1);
 
-	Assert(resultRelInfo->ri_BatchSize >= 1);
+		resultRelInfo++;
+	}
 
 	/*
 	 * Lastly, if this is not the primary (canSetTag) ModifyTable node, add it
Re: Printing backtrace of postgres processes
On Wed, 20 Jan 2021 at 05:23, Tom Lane wrote:
> Recursion is scary, but it should (I think) not be possible if this
> is driven off CHECK_FOR_INTERRUPTS.  There will certainly be none
> of those in libbacktrace.

We can also hold interrupts for the call, and it might be wise to do so.

> One point here is that it might be a good idea to suppress elog.c's
> calls to functions in the error context stack.  As we saw in another
> connection recently, allowing that to happen makes for a *very*
> large increase in the footprint of code that you are expecting to
> work at any random CHECK_FOR_INTERRUPTS call site.

I strongly agree. Treat it as errhidecontext().

> BTW, it also looks like the patch is doing nothing to prevent the
> backtrace from being sent to the connected client.  I'm not sure
> what I think about whether it'd be okay from a security standpoint
> to do that on the connection that requested the trace, but I sure
> as heck don't want it to happen on connections that didn't.

I don't see a good reason to send a bt to a client. Even though these
backtraces won't be analysing debuginfo and populating args, locals, etc,
it should still just go to the server log.

> Maybe, given all of these things, we should forget using elog at
> all and just emit the trace with fprintf(stderr).

That causes quite a lot of pain with MemoryContextStats() already, as it's
frequently difficult to actually locate the output given the variations
that exist in customer logging configurations. Sometimes stderr goes to a
separate file or to journald. It's also much harder to locate the desired
output since there's no log_line_prefix. I have a WIP patch floating around
somewhere that tries to teach MemoryContextStats to write to the ereport
channel when not called during an actual out-of-memory situation for that
reason; an early version is somewhere in the archives. This is one of those
"ok in development, painful in production" situations. So I'm not a big fan
of pushing it to stderr, though I'd rather have that than not have the
ability at all.
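A sketch of the interrupt-holding idea; the trace-emitting helper name is invented for illustration, standing in for whatever the patch ends up calling:

    /*
     * If the trace is emitted from a CHECK_FOR_INTERRUPTS() call site,
     * holding interrupts for the duration prevents the backtrace code
     * from being re-entered.  emit_process_backtrace() is hypothetical.
     */
    HOLD_INTERRUPTS();
    emit_process_backtrace();
    RESUME_INTERRUPTS();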
Re: POC: postgres_fdw insert batching
On 1/21/21 2:22 AM, Tom Lane wrote: Tomas Vondra writes: I may be wrong, but the most likely explanation seems to be this is due to the junk filter initialization, which simply moves past the end of the mtstate->resultRelInfo array. resultRelInfo is certainly pointing at garbage at that point. Yup. It's pretty amazing the x86 machines seem to be mostly OK with it. It kinda seems the GetForeignModifyBatchSize call should happen before that block. The attached patch fixes this for me (i.e. regression tests pass with no valgrind reports. Or did I get that wrong? Don't we need to initialize ri_BatchSize for *each* resultrelinfo, not merely the first one? That is, this new code needs to be somewhere inside a loop over the result rels. Yeah, I think you're right. That's an embarrassing oversight :-( regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
RE: Parallel INSERT (INTO ... SELECT ...)
> > Thanks for the feedback. > Posting an updated set of patches. Changes are based on feedback, as detailed > below: Hi, It seems there are some previous comments [1][2] that were not addressed in the current patch. Just making sure they are not missed. [1] https://www.postgresql.org/message-id/77e1c06ffb2240838e5fc94ec8dcb7d3%40G08CNEXMBPEKD05.g08.fujitsu.local [2] https://www.postgresql.org/message-id/CAA4eK1LMmz58ej5BgVLJ8VsUGd%3D%2BKcaA8X%3DkStORhxpfpODOxg%40mail.gmail.com Best regards, houzj
Re: POC: postgres_fdw insert batching
Hi, Takayuki-san: My first name is Zhihong. You can call me Ted if you want to save some typing :-) Cheers On Wed, Jan 20, 2021 at 5:37 PM tsunakawa.ta...@fujitsu.com < tsunakawa.ta...@fujitsu.com> wrote: > From: Zhihong Yu > > > Do we need to consider how this part of code inside > ExecInitModifyTable() would evolve ? > > > > > I think placing the compound condition toward the end of > ExecInitModifyTable() is reasonable because it checks the latest > information. > > > > +1 for Zaihong-san's idea. But instead of rewriting every relsultRelInfo > to mtstate->resultRelInfo, which makes it a bit harder to read, I'd like to > suggest just adding "resultRelInfo = mtstate->resultRelInfo;" immediately > before the if block. > > > > Thanks a lot, all for helping to solve the problem quickly! > > > > > > Regards > > Takayuki Tsunakawa > > > >
RE: POC: postgres_fdw insert batching
From: Tomas Vondra > Right. But I think Tom is right this should initialize ri_BatchSize for all > the resultRelInfo elements, not just the first one. Per the attached patch, which > resolves the issue both on x86_64 and armv7l for me. I think your patch is perfect in the sense that it's ready for the future multi-target DML support. +1 Just for learning, could anyone tell me what this loop is for? I thought current Postgres DML supports a single target table, so it's enough to handle the first element of mtstate->resultRelInfo. In that sense, Amit-san and I agreed not to put the if block in the for loop yet. Regards Takayuki Tsunakawa
Re: POC: postgres_fdw insert batching
On Thu, Jan 21, 2021 at 10:42 AM Tomas Vondra wrote: > On 1/21/21 2:24 AM, Amit Langote wrote: > > On Thu, Jan 21, 2021 at 9:56 AM Tomas Vondra > > wrote: > >> On 1/21/21 1:17 AM, Zhihong Yu wrote: > >>> Hi, > >>> The assignment to resultRelInfo is done when junk_filter_needed is true: > >>> > >>> if (junk_filter_needed) > >>> { > >>> resultRelInfo = mtstate->resultRelInfo; > >>> > >>> Should the code for determining batch size access mtstate->resultRelInfo > >>> directly ? > >>> > >> > >> IMO the issue is that code iterates over all plans and moves to the next > >> for each one: > >> > >> resultRelInfo++; > >> > >> so it ends up pointing past the last element, hence the failures. So > >> yeah, either the code needs to move before the loop (per my patch), or > >> we need to access mtstate->resultRelInfo directly. > > > > Accessing mtstate->resultRelInfo directly would do. The only > > constraint on where this block should be placed is that > > ri_projectReturning must be valid as of calling > > GetForeignModifyBatchSize(), as Tsunakawa-san pointed out upthread. > > So, after this block in ExecInitModifyTable: > > > > /* > > * Initialize RETURNING projections if needed. > > */ > > if (node->returningLists) > > { > > > > /* > > * Build a projection for each result rel. > > */ > > resultRelInfo = mtstate->resultRelInfo; > > foreach(l, node->returningLists) > > { > > List *rlist = (List *) lfirst(l); > > > > resultRelInfo->ri_returningList = rlist; > > resultRelInfo->ri_projectReturning = > > ExecBuildProjectionInfo(rlist, econtext, slot, > > &mtstate->ps, > > > > resultRelInfo->ri_RelationDesc->rd_att); > > resultRelInfo++; > > } > > } > > > > Right. But I think Tom is right this should initialize ri_BatchSize for > all the resultRelInfo elements, not just the first one. Per the attached > patch, which resolves the issue both on x86_64 and armv7l for me. +1 in general. To avoid looping uselessly in the case of UPDATE/DELETE where batching can't be used today, I'd suggest putting if (operation == CMD_INSERT) around the loop. -- Amit Langote EDB: http://www.enterprisedb.com
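Folding that guard into the loop from the earlier patch might look like this (sketch only; variable names follow the ExecInitModifyTable code quoted above):

    /* batching only applies to INSERT, so skip the loop entirely otherwise */
    if (operation == CMD_INSERT)
    {
        resultRelInfo = mtstate->resultRelInfo;
        for (i = 0; i < nplans; i++)
        {
            if (!resultRelInfo->ri_usesFdwDirectModify &&
                resultRelInfo->ri_FdwRoutine != NULL &&
                resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
                resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert)
                resultRelInfo->ri_BatchSize =
                    resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo);
            else
                resultRelInfo->ri_BatchSize = 1;

            Assert(resultRelInfo->ri_BatchSize >= 1);
            resultRelInfo++;
        }
    }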
Re: Printing backtrace of postgres processes
Craig Ringer writes: > On Wed, 20 Jan 2021 at 05:23, Tom Lane wrote: >> BTW, it also looks like the patch is doing nothing to prevent the >> backtrace from being sent to the connected client. > I don't see a good reason to send a bt to a client. Even though these > backtraces won't be analysing debuginfo and populating args, locals, etc, > it should still just go to the server log. Yeah. That's easier than I was thinking, we just need to s/LOG/LOG_SERVER_ONLY/. >> Maybe, given all of these things, we should forget using elog at >> all and just emit the trace with fprintf(stderr). > That causes quite a lot of pain with MemoryContextStats() already True. Given the changes discussed in the last couple messages, I don't see any really killer reasons why we can't ship the trace through elog. We can always try that first, and back off to fprintf if we do find reasons why it's too unstable. regards, tom lane
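A sketch of what the elog path might then look like, assuming the formatted trace has already been collected into a StringInfo named buf (that name and the message wording are assumptions):

    /* server log only; suppress statement and error-context decoration */
    ereport(LOG_SERVER_ONLY,
            (errmsg_internal("backtrace of postgres process %d:\n%s",
                             MyProcPid, buf.data),
             errhidestmt(true),
             errhidecontext(true)));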