Incremental sorts and EXEC_FLAG_REWIND
Hi,

When initializing an incremental sort node, we have the following as of ExecInitIncrementalSort():

    /*
     * Incremental sort can't be used with either EXEC_FLAG_REWIND,
     * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only one of many sort
     * batches in the current sort state.
     */
    Assert((eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)) == 0);

While I don't quite follow why EXEC_FLAG_REWIND should be allowed here to begin with (because incremental sorts don't support rescans without parameter changes, right?), the comment and the assertion are telling a different story. And I can see that the child nodes of an IncrementalSort node use a set of eflags where these three are removed:

    /*
     * Initialize child nodes.
     *
     * We shield the child node from the need to support REWIND, BACKWARD, or
     * MARK/RESTORE.
     */
    eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);

I can also spot one case in the regression tests where we actually pass down a REWIND flag when initializing an IncrementalSort node (see incremental_sort.sql):

    -- We force the planner to choose a plan with incremental sort on the right side
    -- of a nested loop join node. That way we trigger the rescan code path.
    set local enable_hashjoin = off;
    set local enable_mergejoin = off;
    set local enable_material = off;
    set local enable_sort = off;
    explain (costs off) select * from t left join
      (select * from (select * from t order by a) v order by a, b) s
      on s.a = t.a where t.a in (1, 2);
    select * from t left join
      (select * from (select * from t order by a) v order by a, b) s
      on s.a = t.a where t.a in (1, 2);

Alexander, Tomas, any thoughts?
--
Michael
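For what it's worth, the flag arithmetic being discussed is easy to sanity-check outside the executor. The following is a minimal standalone sketch assuming illustrative flag values (the real constants live in PostgreSQL's executor.h; the function names here are made up):

```c
#include <assert.h>

/* Illustrative stand-ins for the executor flags discussed above; the
 * actual values are defined in PostgreSQL's executor.h. */
#define EXEC_FLAG_REWIND   0x0002
#define EXEC_FLAG_BACKWARD 0x0004
#define EXEC_FLAG_MARK     0x0008

/* The quoted child-initialization code shields children from all three
 * capabilities by masking the corresponding bits out. */
static int shield_child_eflags(int eflags)
{
    return eflags & ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
}

/* The quoted Assert only rejects two of the three flags, which is the
 * comment/assertion mismatch pointed out in the message. */
static int assert_would_reject(int eflags)
{
    return (eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)) != 0;
}
```

This makes the inconsistency concrete: a REWIND flag slips past the assertion, even though the comment claims it cannot be used.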
index paths and enable_indexscan
Hi, Maybe I am missing something obvious, but is it intentional that enable_indexscan is checked by cost_index(), that is, *after* creating an index path? I was expecting that if enable_indexscan is off, then no index paths would be generated to begin with, because I thought they are optional. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
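For context, the enable_* GUCs generally work by penalizing a path's cost rather than suppressing path generation, so the planner can still fall back on the path when nothing else is available. A toy sketch of that idea (the constant mirrors disable_cost in costsize.c; the function name is invented for illustration and is not the planner API):

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the disable_cost constant used by the planner's costing code;
 * everything else here is an illustrative stand-in. */
static const double disable_cost = 1.0e10;

/* Cost an index path: when enable_indexscan is off, the path is still
 * created, just made so expensive that it loses to any real alternative. */
static double toy_index_path_cost(bool enable_indexscan, double real_cost)
{
    double cost = real_cost;

    if (!enable_indexscan)
        cost += disable_cost;   /* discourage, but do not forbid */
    return cost;
}
```

This is why checking the GUC inside cost_index() still "works": a disabled index path survives, but only wins when it is the only way to produce the required output (e.g. to satisfy a unique constraint enforcement scan).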
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Apr 8, 2020 at 6:29 AM Tomas Vondra wrote: > > On Tue, Apr 07, 2020 at 12:17:44PM +0530, Amit Kapila wrote: > >On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra > > wrote: > > > >I think having something like we discussed or what you have in the > >patch won't be sufficient to clean the KnownAssignedXid array. The > >point is that we won't write a WAL for xid-subxid association for > >unlogged relations in the "Immediately WAL-log assignments" patch, > >however, the KnownAssignedXid would have both kinds of Xids as we > >autofill it with gaps (see RecordKnownAssignedTransactionIds). I > >think if my understanding is correct to make it work we might need > >major surgery in the code or have to maintain KnownAssignedXid array > >differently. > > Hmm, that's a good point. If I understand correctly, the issue is > that if we create new subxact, write something into an unlogged table, > and then create new subxact, the XID of the first subxact will be "known > assigned" but we won't know it's a subxact or to which parent xact it > belongs (because there will be no WAL records that could encode it). > Yeah, there could be multiple such missing subxacts. > I wonder if there's a simple solution (e.g. when creating the second > subxact we might notice the xid-subxid assignment was not logged, and > write some "dummy" WAL record). > That WAL record can have multiple xids. > But I admit it seems a bit ugly. > Yeah, I guess it could be tricky as well because while assembling some WAL record, we need to generate an additional dummy record or might need to add additional information to the current record being formed. I think the handling of such WAL records during hot-standby and in logical decoding could vary. During logical decoding, currently, we don't form an association for subtransactions if it doesn't have any changes (see ReorderBufferCommitChild) and now with this new type of record, I think we need to ensure that we don't form such association. 
I think after quite some changes, tweaks and a lot of testing, we might be able to remove XLOG_XACT_ASSIGNMENT, but I am not sure if it is worth doing along with this patch. It would have been good to do this if we were adding any visible overhead with this patch and/or it were easy to do. However, none of that seems to be true, so it might be better to write good comments in the code indicating what we would need to do to remove XLOG_XACT_ASSIGNMENT, so that if we feel it is important to do in the future we can do so. I am not against spending effort on this, but I don't see the urgency of doing it along with this patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
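To make the "dummy record with multiple xids" idea floated above concrete, here is a rough standalone sketch. Everything in it (struct layout, names, the way the gap of unlogged subxact XIDs is collected) is invented for illustration; it is not the actual WAL record format or the patch's design:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical "dummy" assignment record: when we notice that some
 * subxact XIDs were never WAL-logged (because they only touched
 * unlogged relations), emit one record carrying all missing XIDs. */
#define MAX_MISSING_SUBXIDS 64

typedef struct XactAssignmentRecord
{
    TransactionId top_xid;
    int           nsubxids;
    TransactionId subxids[MAX_MISSING_SUBXIDS];
} XactAssignmentRecord;

/* Collect every subxact XID strictly between the last logged XID and
 * the next XID about to be logged, so replay can associate them with
 * the top-level transaction.  Returns the number of XIDs collected. */
static int fill_missing_assignments(XactAssignmentRecord *rec,
                                    TransactionId top_xid,
                                    TransactionId last_logged,
                                    TransactionId next)
{
    rec->top_xid = top_xid;
    rec->nsubxids = 0;
    for (TransactionId xid = last_logged + 1; xid < next; xid++)
        rec->subxids[rec->nsubxids++] = xid;
    return rec->nsubxids;
}
```

The sketch shows why a single record suffices even when several consecutive subxacts went unlogged, which is the "that WAL record can have multiple xids" point above.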
Re: pgbench - test whether a variable exists
Bonjour Michaël,

> Hmm.  It seems to me that this stuff needs to be more careful with the
> function handling?  For example, all those cases fail but they directly
> pass down a variable that may not be defined, so shouldn't those
> results be undefined as well instead of failing:
>
> \set g double(:{?no_such_variable})
> \set g exp(:{?no_such_variable})
> \set g greatest(:{?no_such_variable}, :{?no_such_variable})
> \set g int(:{?no_such_variable})

I do not understand: all the above examples are type errors, as ":{?var}" is a boolean, which cannot be cast to double, be exponentiated, etc. So failing just seems appropriate. Maybe casting boolean to int should be allowed, though, as pg does it.

> It seems to me that there could be a point in having the result of any
> function become undefined if using at least one undefined argument (the
> point could be made as well that things like greatest just ignore
> undefined variables), so I was surprised to not see the logic more
> linked with ENODE_VARIABLE.

Hmmm… :var (ENODE_VARIABLE) replaces the variable with its value, and it fails if the variable is not defined, which is the intention; hence the point of having the ability to test whether a variable exists, and its new expression node type. It could replace it with NULL when not existing, but ISTM that a script may wish to distinguish NULL and undefined, and it departs from SQL, which just fails on an undefined alias or column or whatever.

If someone wants to go back to having a self-propagating NULL, they can simply

  \if not :{?var}
  \set var NULL
  \endif

or

  \set var CASE WHEN :{?var} THEN :var ELSE NULL END

to get it. Having some special UNDEF value looks unattractive, as it would depart from SQL even further.

> If your intention is to keep this behavior, it should at least be
> tested I guess.

My intention is to keep failing on type errors, but maybe I'm missing something of your point.

> Please note that this patch will have to wait until v14 opens for
> business for more.

Sure.
I put it in the July CF, and I do not expect it to be processed in the first CF it appears in.

--
Fabien.
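The distinction being debated, type error versus propagated undefinedness, can be sketched with a tiny tagged-value evaluator. This is purely illustrative (pgbench's real evaluator is structured differently; all names here are mine); it just shows the two semantics side by side:

```c
#include <assert.h>
#include <stdbool.h>

/* A toy value: ":{?var}" always yields a defined boolean, so passing it
 * to double() is a type error; an "undefined-propagating" semantics
 * would instead need a distinct UNDEFINED tag forwarded by functions. */
typedef enum { V_BOOLEAN, V_DOUBLE, V_UNDEFINED } VType;

typedef struct { VType type; double d; bool b; } Value;

typedef enum { EVAL_OK, EVAL_TYPE_ERROR, EVAL_UNDEFINED } EvalStatus;

/* double(x): fails on booleans, matching the behavior defended above;
 * the V_UNDEFINED branch shows what the propagating alternative adds. */
static EvalStatus eval_double(Value in, Value *out)
{
    if (in.type == V_UNDEFINED)
        return EVAL_UNDEFINED;      /* the propagating alternative */
    if (in.type != V_DOUBLE)
        return EVAL_TYPE_ERROR;     /* boolean cannot be cast to double */
    *out = in;
    return EVAL_OK;
}
```

Under the current design, ":{?no_such_variable}" always takes the V_BOOLEAN branch, so there is simply no path by which V_UNDEFINED could reach a function, which is why the reported failures are type errors rather than missed propagation.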
Re: doc review for v13
On Sun, Apr 12, 2020 at 04:35:45PM -0500, Justin Pryzby wrote:
> Added a few more.
> And rebased on top of dbc60c5593f26dc777a3be032bff4fb4eab1ddd1

Thanks for the patch set, I have applied the most obvious parts (more or less 1/3) to reduce the load. Here is a review of the rest.

> @@ -2829,7 +2829,6 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
>
>  	ExplainPropertyList("Sort Methods Used", methodNames, es);
>
> -	if (groupInfo->maxMemorySpaceUsed > 0)
>  	{
>  		long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
>  		const char *spaceTypeName;
> @@ -2846,7 +2845,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
>
>  		ExplainCloseGroup("Sort Spaces", memoryName.data, true, es);
>  	}
> -	if (groupInfo->maxDiskSpaceUsed > 0)
> +
>  	{
>  		long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
>  		const char *spaceTypeName;

If this can be reworked, it seems to me that more cleanup could be done.

> @@ -987,7 +987,7 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
>
>  	/*
>  	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
> -	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only one of many sort
> +	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only one of many sort
>  	 * batches in the current sort state.
>  	 */
>  	Assert((eflags & (EXEC_FLAG_BACKWARD |

The following is inconsistent with this comment block, and I guess that the missing word should be "have":

    Assert((eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)) == 0);

This is only a doc-related change though, so I'll start a different thread about that after looking more at it.

> @@ -1153,7 +1153,7 @@ ExecReScanIncrementalSort(IncrementalSortState *node)
>  	/*
>  	 * If we've set up either of the sort states yet, we need to reset them.
>  	 * We could end them and null out the pointers, but there's no reason to
> -	 * repay the setup cost, and because guard setting up pivot comparator
> +	 * repay the setup cost, and because guard setting up pivot comparator
>  	 * state similarly, doing so might actually cause a leak.
>  	 */
>  	if (node->fullsort_state != NULL)

I don't quite understand this comment either, but it seems to me that the last part should be a fully-separate sentence, aka "This guards against..".

> @@ -631,7 +631,7 @@ logicalrep_partition_open(LogicalRepRelMapEntry *root,
>  	/*
>  	 * If the partition's attributes don't match the root relation's, we'll
>  	 * need to make a new attrmap which maps partition attribute numbers to
> -	 * remoterel's, instead the original which maps root relation's attribute
> +	 * remoterel's, instead of the original which maps root relation's attribute
>  	 * numbers to remoterel's.
>  	 *
>  	 * Note that 'map' which comes from the tuple routing data structure

Okay, this is not really clear to start with. I think that I would rewrite that completely, as follows: "If the partition's attributes do not match the root relation's attributes, we cannot use the original attribute map which maps the root relation's attributes with remoterel's attributes. Instead, build a new attribute map which maps the partition's attributes with remoterel's attributes."

> +++ b/src/backend/storage/lmgr/proc.c
> @@ -1373,7 +1373,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
>  	else
>  		LWLockRelease(ProcArrayLock);
>
> -	/* prevent signal from being resent more than once */
> +	/* prevent signal from being re-sent more than once */
>  	allow_autovacuum_cancel = false;
>  }

Shouldn't that just be "sent more than two times"?

> @@ -1428,11 +1428,11 @@ tuplesort_updatemax(Tuplesortstate *state)
>  	}
>
>  	/*
> -	 * Sort evicts data to the disk when it didn't manage to fit those data to
> -	 * the main memory. This is why we assume space used on the disk to be
> +	 * Sort evicts data to the disk when it didn't manage to fit the data in
> +	 * main memory. This is why we assume space used on the disk to be

Both the original and the suggestion are wrong? It seems to me that it should be "this data" instead because it refers to the data evicted in the first part of the sentence.

>  	 * more important for tracking resource usage than space used in memory.
> -	 * Note that amount of space occupied by some tuple set on the disk might
> -	 * be less than amount of space occupied by the same tuple set in the
> +	 * Note that amount of spa
Re: Properly mark NULL returns in numeric aggregates
On Tue, 14 Apr 2020 at 06:14, Tom Lane wrote:
> Yeah, they're relying exactly on the assumption that nodeAgg is not
> going to try to copy a value declared "internal", and therefore they
> can be loosey-goosey about whether the value pointer is null or not.
> However, if you want to claim that that's wrong, you have to explain
> why it's okay for some other code to be accessing a value that's
> declared "internal".  I'd say that the meaning of that is precisely
> "keepa u hands off".
>
> In the case at hand, the current situation is that we only expect the
> values returned by these combine functions to be read by the associated
> final functions, which are on board with the null-pointer representation
> of an empty result.  Your argument is essentially that it should be
> possible to feed the values to the aggregate's associated serialization
> function as well.  But the core code never does that, so I'm not
> convinced that we should add it to the requirements; we'd be unable to
> test it.

Casting my mind back to when I originally wrote that code, I attempted to write it in such a way that it could one day be used for a 3-stage aggregation, e.g. Parallel Partial Aggregate -> Gather -> Combine Serial Aggregate on one node, then on some master node a Deserial Combine Finalize Aggregate. You're very right that we can't craft such a plan with today's master (we didn't even add a supporting enum for it in AggSplit). However, it does appear that there are extensions or forks out there which attempt to use the code in this way, so it would be good not to leave those people out in the cold regarding this.

For testing, can't we just have an Assert() in advance_transition_function that verifies isnull matches the nullability of the return value for INTERNAL-returning transfns? i.e., the attached.

I don't have a test case to hand that could cause this to fail, but it sounds like Jesse might.

David

assert_internal_transfns_properly_set_isnull.patch
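The shape of the proposed check is simple enough to show in isolation. This is a hedged sketch of the invariant the attached patch is described as asserting, written with invented names (the real code deals with Datum and FunctionCallInfo from fmgr.h):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's Datum; for "internal"-returning functions
 * the Datum is really a pointer in disguise. */
typedef uintptr_t Datum;

/* The invariant: for an internal-returning transition/combine function,
 * the isnull flag and the pointer-ness of the result must agree -- a
 * NULL pointer flagged non-null (the "loosey-goosey" case above) is
 * exactly what the proposed Assert would catch. */
static bool internal_result_consistent(Datum value, bool isnull)
{
    return isnull == (value == (Datum) 0);
}
```

An Assert built on this predicate costs nothing in production builds but would trip in assert-enabled regression runs whenever a combine function returns a bare NULL pointer without setting isnull.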
Re: Poll: are people okay with function/operator table redesign?
Hello Tom,

> Before I spend more time on this, I want to make sure that people are
> happy with this line of attack.

+1

I like it this way, because the structure is quite readable, which is the point.

My 0.02€:

Maybe column header "Example Result" should be simply "Result", because it is already on the same line as "Example" on its left, and "Example | Example Result" looks redundant.

Maybe the signature and description lines could be exchanged: I'm more interested in the description first, and the signature just above the example would make sense.

I'm wondering whether the function/operator name should be vertically centered in its cell? I'd leave it left-justified.

--
Fabien.
Re: variation of row_number with parallel
On Tue, Apr 14, 2020 at 9:39 AM Pavel Stehule wrote:
>
> út 14. 4. 2020 v 5:59 odesílatel Rajkumar Raghuwanshi <
> rajkumar.raghuwan...@enterprisedb.com> napsal:
>
>> Hi,
>>
>> I have observed row_number() is giving different results when query
>> executed in parallel. is this expected w.r.t parallel execution.
>>
>> [full test case, plans, and results trimmed; see the original message]
>
> there is no ORDER BY clause, so the order is not defined - a parallel
> hash join surely doesn't ensure any particular order.
> I think this behavior is expected.

thanks.
Re: variation of row_number with parallel
út 14. 4. 2020 v 5:59 odesílatel Rajkumar Raghuwanshi <
rajkumar.raghuwan...@enterprisedb.com> napsal:

> Hi,
>
> I have observed row_number() is giving different results when query
> executed in parallel. is this expected w.r.t parallel execution.
>
> [full test case and the serial/parallel plans trimmed; see the original
> message]
>
> postgres=# select e.c2, row_number() over () from tbl1 d, tbl2 e where
> d.c1=e.c3;
>  c2  | row_number
> -----+------------
>  100 | 1
>  100 | 2
>  100 | 3
> *200 | 4*
>  100 | 5
> (5 rows)

There is no ORDER BY clause, so the order is not defined - a parallel hash join surely doesn't ensure any particular order. I think this behavior is expected.

Regards

Pavel

> Thanks & Regards,
> Rajkumar Raghuwanshi
Re: Race condition in SyncRepGetSyncStandbysPriority
On Tue, 14 Apr 2020 at 10:34, Tom Lane wrote:
>
> Kyotaro Horiguchi writes:
> > At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane wrote in
> >> What I think we should do about this is, essentially, to get rid of
> >> SyncRepGetSyncStandbys.  Instead, let's have each walsender advertise
> >> whether *it* believes that it is a sync standby, based on its last
> >> evaluation of the relevant GUCs.  This would be a bool that it'd
> >> compute and set alongside sync_standby_priority.  (Hm, maybe we'd not
>
> > Mmm.. SyncRepGetStandbyPriority returns the "priority" that a
> > walsender thinks it is at, among synchronous_standby_names.  Then to
> > decide "I am a sync standby" we need to know how many walsenders with
> > higher priority are alive now.  SyncRepGetSyncStandbyPriority does the
> > judgment now and suffers from the inconsistency of priority values.
>
> Yeah.  After looking a bit closer, I think that the current definition
> of sync_standby_priority (that is, as the result of local evaluation
> of SyncRepGetStandbyPriority()) is OK.  The problem is what we're doing
> with it.  I suggest that what we should do in SyncRepGetSyncRecPtr()
> is make one sweep across the WalSnd array, collecting PID,
> sync_standby_priority, *and* the WAL pointers from each valid entry.
> Then examine that data and decide which WAL value we need, without
> assuming that the sync_standby_priority values are necessarily totally
> consistent.  But in any case we must examine each entry just once while
> holding its mutex, not go back to it later expecting it to still be the
> same.

Can we have a similar approach of sync_standby_defined for sync_standby_priority? That is, the checkpointer is responsible for changing sync_standby_priority of all walsenders on SIGHUP. That way, all walsenders can see a consistent view of sync_standby_priority. And when a walsender starts, it sets sync_standby_priority by itself. The logic to decide who's a sync standby doesn't change.
SyncRepGetSyncRecPtr() gets all walsenders having higher priority along with their WAL positions.

>
> Another thing that I'm finding interesting is that I now see this is
> not at all new code.  It doesn't look like SyncRepGetSyncStandbysPriority
> has changed much since 2016.  So how come we didn't detect this problem
> long ago?  I searched the buildfarm logs for assertion failures in
> syncrep.c, looking back one year, and here's what I found:
>
>    sysname   |    branch     |      snapshot       |     stage     | l
>  ------------+---------------+---------------------+---------------+---
>  nightjar    | REL_10_STABLE | 2019-08-13 23:04:41 | recoveryCheck | TRAP: FailedAssertion("!(((bool) 0))", File: "/pgbuild/root/REL_10_STABLE/pgsql.build/../pgsql/src/backend/replication/syncrep.c", Line: 940)
>  hoverfly    | REL9_6_STABLE | 2019-11-07 17:19:12 | recoveryCheck | TRAP: FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
>  hoverfly    | HEAD          | 2019-11-22 12:15:08 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
>  francolin   | HEAD          | 2020-01-16 23:10:06 | recoveryCheck | TRAP: FailedAssertion("false", File: "/home/andres/build/buildfarm-francolin/HEAD/pgsql.build/../pgsql/src/backend/replication/syncrep.c", Line: 951)
>  hoverfly    | REL_11_STABLE | 2020-02-29 01:34:55 | recoveryCheck | TRAP: FailedAssertion("!(0)", File: "syncrep.c", Line: 946)
>  hoverfly    | REL9_6_STABLE | 2020-03-26 13:51:15 | recoveryCheck | TRAP: FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
>  hoverfly    | REL9_6_STABLE | 2020-04-07 21:52:07 | recoveryCheck | TRAP: FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
>  curculio    | HEAD          | 2020-04-11 18:30:21 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
>  sidewinder  | HEAD          | 2020-04-11 18:45:39 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
>  curculio    | HEAD          | 2020-04-11 20:30:26 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
>  sidewinder  | HEAD          | 2020-04-11 21:45:48 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
>  sidewinder  | HEAD          | 2020-04-13 10:45:35 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
>  conchuela   | HEAD          | 2020-04-13 16:00:18 | recoveryCheck | TRAP: FailedAssertion("false", File: "/home/pgbf/buildroot/HEAD/pgsql.build/../pgsql/src/backend/replication/syncrep.c", Line: 951)
>  sidewinder  | HEAD          | 2020-04-13 18:45:34 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
> (14 rows)
>
> The line numbers vary in the
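Tom's "one sweep, then decide from the snapshot" suggestion can be sketched independently of the walsender code. The types and the selection rule below are simplified inventions for illustration (the real structures live in walsender_private.h), but they show the property that matters: after the single pass, no field of shared memory is consulted again:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Illustrative snapshot of one walsender slot; in the real code each
 * field would be read exactly once under the slot's spinlock during
 * the single sweep. */
typedef struct
{
    int        pid;        /* 0 means the slot is unused */
    int        priority;   /* 1 is the highest priority */
    XLogRecPtr flush;
} WalSndSnap;

/* Decide purely from the snapshot: return the smallest flush position
 * among the n best-priority live entries, or 0 if there are fewer than
 * n candidates.  No re-reads, so inconsistent priorities can at worst
 * give a conservative answer, never a crash. */
static XLogRecPtr min_flush_of_top_n(const WalSndSnap *snap, int count, int n)
{
    XLogRecPtr result = 0;
    int        found = 0;

    for (int prio = 1; found < n && prio <= count; prio++)
        for (int i = 0; i < count; i++)
            if (snap[i].pid != 0 && snap[i].priority == prio && found < n)
            {
                if (result == 0 || snap[i].flush < result)
                    result = snap[i].flush;
                found++;
            }
    return (found >= n) ? result : 0;
}
```

The buildfarm assertions quoted above fired precisely because the existing code revisits shared state after the first pass and expects it to be unchanged; working off a one-shot snapshot removes that assumption.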
variation of row_number with parallel
Hi, I have observed row_number() is giving different results when query executed in parallel. is this expected w.r.t parallel execution. CREATE TABLE tbl1 (c1 INT) partition by list (c1); CREATE TABLE tbl1_p1 partition of tbl1 FOR VALUES IN (10); CREATE TABLE tbl1_p2 partition of tbl1 FOR VALUES IN (20); CREATE TABLE tbl1_p3 partition of tbl1 FOR VALUES IN (30); CREATE TABLE tbl2 (c1 INT, c2 INT,c3 INT) partition by list (c1); CREATE TABLE tbl2_p1 partition of tbl2 FOR VALUES IN (1); CREATE TABLE tbl2_p2 partition of tbl2 FOR VALUES IN (2); CREATE TABLE tbl2_p3 partition of tbl2 FOR VALUES IN (3); CREATE TABLE tbl2_p4 partition of tbl2 FOR VALUES IN (4); CREATE TABLE tbl2_p5 partition of tbl2 FOR VALUES IN (5); INSERT INTO tbl1 VALUES (10),(20),(30); INSERT INTO tbl2 VALUES (1,100,20),(2,200,10),(3,100,20),(4,100,30),(5,100,10); postgres=# explain select e.c2, row_number() over () from tbl1 d, tbl2 e where d.c1=e.c3; QUERY PLAN --- WindowAgg (cost=1520.35..12287.73 rows=390150 width=12) -> Merge Join (cost=1520.35..7410.85 rows=390150 width=4) Merge Cond: (d.c1 = e.c3) -> Sort (cost=638.22..657.35 rows=7650 width=4) Sort Key: d.c1 -> Append (cost=0.00..144.75 rows=7650 width=4) -> Seq Scan on tbl1_p1 d_1 (cost=0.00..35.50 rows=2550 width=4) -> Seq Scan on tbl1_p2 d_2 (cost=0.00..35.50 rows=2550 width=4) -> Seq Scan on tbl1_p3 d_3 (cost=0.00..35.50 rows=2550 width=4) -> Sort (cost=882.13..907.63 rows=10200 width=8) Sort Key: e.c3 -> Append (cost=0.00..203.00 rows=10200 width=8) -> Seq Scan on tbl2_p1 e_1 (cost=0.00..30.40 rows=2040 width=8) -> Seq Scan on tbl2_p2 e_2 (cost=0.00..30.40 rows=2040 width=8) -> Seq Scan on tbl2_p3 e_3 (cost=0.00..30.40 rows=2040 width=8) -> Seq Scan on tbl2_p4 e_4 (cost=0.00..30.40 rows=2040 width=8) -> Seq Scan on tbl2_p5 e_5 (cost=0.00..30.40 rows=2040 width=8) (17 rows) postgres=# select e.c2, row_number() over () from tbl1 d, tbl2 e where d.c1=e.c3; c2 | row_number -+ *200 | 1* 100 | 2 100 | 3 100 | 4 100 | 5 (5 rows) postgres=# 
postgres=# set parallel_setup_cost = 0; SET postgres=# set parallel_tuple_cost = 0; SET postgres=# postgres=# explain select e.c2, row_number() over () from tbl1 d, tbl2 e where d.c1=e.c3; QUERY PLAN -- WindowAgg (cost=130.75..7521.21 rows=390150 width=12) -> Gather (cost=130.75..2644.34 rows=390150 width=4) Workers Planned: 2 -> Parallel Hash Join (cost=130.75..2644.34 rows=162562 width=4) Hash Cond: (e.c3 = d.c1) -> Parallel Append (cost=0.00..131.25 rows=4250 width=8) -> Parallel Seq Scan on tbl2_p1 e_1 (cost=0.00..22.00 rows=1200 width=8) -> Parallel Seq Scan on tbl2_p2 e_2 (cost=0.00..22.00 rows=1200 width=8) -> Parallel Seq Scan on tbl2_p3 e_3 (cost=0.00..22.00 rows=1200 width=8) -> Parallel Seq Scan on tbl2_p4 e_4 (cost=0.00..22.00 rows=1200 width=8) -> Parallel Seq Scan on tbl2_p5 e_5 (cost=0.00..22.00 rows=1200 width=8) -> Parallel Hash (cost=90.93..90.93 rows=3186 width=4) -> Parallel Append (cost=0.00..90.93 rows=3186 width=4) -> Parallel Seq Scan on tbl1_p1 d_1 (cost=0.00..25.00 rows=1500 width=4) -> Parallel Seq Scan on tbl1_p2 d_2 (cost=0.00..25.00 rows=1500 width=4) -> Parallel Seq Scan on tbl1_p3 d_3 (cost=0.00..25.00 rows=1500 width=4) (16 rows) postgres=# select e.c2, row_number() over () from tbl1 d, tbl2 e where d.c1=e.c3; c2 | row_number -+ 100 | 1 100 | 2 100 | 3 *200 | 4* 100 | 5 (5 rows) Thanks & Regards, Rajkumar Raghuwanshi
Re: WAL usage calculation patch
On Wed, Apr 8, 2020 at 8:36 AM Amit Kapila wrote:
>
> On Tue, Apr 7, 2020 at 3:30 PM Peter Eisentraut wrote:
> >
> > We also have existing cases for the other way:
> >
> > actual time=0.050..0.052
> > Buffers: shared hit=3 dirtied=1
> >
>
> Buffers case is not the same because 'shared' is used for 'hit',
> 'read', 'dirtied', etc.  However, I think it is arguable.
>
> > The cases mentioned by Justin are not formatted in a key=value format,
> > so it's not quite the same, but it also raises the question why they
> > are not.
> >
> > Let's figure out a way to consolidate this without making up a third
> > format.
> >
>
> Sure, I think my intention is to keep the format of WAL stats as close
> to Buffers stats as possible because both depict I/O and users would
> probably be interested to check/read both together.  There is a point
> to keep things in a format so that it is easier for someone to parse,
> but I guess as these are fixed 'words', it shouldn't be difficult
> either way and we should give more weightage to consistency.  Any
> suggestions?

Peter E, others, any suggestions on how to move forward? I think here we should follow the rule "follow the style of nearby code", which in this case would be to have one space after each field, as we would like it to be closer to the "Buffers" format. It would be good if we had a unified format among all EXPLAIN stuff, but we might not want to change the existing things, and even if we want to do that it might be a broader/bigger change and we should do it as a PG14 change. What do you think?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Vacuum o/p with (full 1, parallel 0) option throwing an error
On Tue, Apr 14, 2020 at 7:52 AM Michael Paquier wrote:
>
> Makes sense.  I have two comments.
>
>      ereport(ERROR,
>              (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> -             errmsg("cannot specify both FULL and PARALLEL options")));
> +             errmsg("VACUUM FULL cannot be performed in parallel")));
>
> Better to avoid a full sentence here [1]?  This should be a "cannot do
> foo" error.

I could see similar error messages in other places like below:

  CONCURRENTLY cannot be used when the materialized view is not populated
  CONCURRENTLY and WITH NO DATA options cannot be used together
  COPY delimiter cannot be newline or carriage return

Also, I am not sure if it violates the style we have used in the code. It seems the error message is short, succinct and conveys the required information.

> -VACUUM (PARALLEL 1) tmp; -- disables parallel vacuum option
> +VACUUM (PARALLEL 1) tmp; -- parallel vacuum disabled for temp tables
>  WARNING:  disabling parallel option of vacuum on "tmp" --- cannot vacuum temporary tables in parallel
> +VACUUM (PARALLEL 0, FULL TRUE) tmp; -- can specify parallel disabled (even though that's implied by FULL)
>
> To fully close the gap in the tests, I would also add a test for
> (PARALLEL 1, FULL false) where FULL is directly specified, even if that
> sounds like a nit.  That's fine to test even on a temporary table.

Okay, I will do this once we agree on the error message stuff.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: doc review for parallel vacuum
On Tue, Apr 14, 2020 at 2:54 AM Justin Pryzby wrote: > > Looks good. One more change: > > [-Even if-]{+If+} this option is specified with the ANALYZE > [-option-]{+option,+} > > Remove "even" and add comma. > Pushed after making this change. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Corruption during WAL replay
At Mon, 13 Apr 2020 18:53:26 +0900, Masahiko Sawada wrote in > On Mon, 13 Apr 2020 at 17:40, Andres Freund wrote: > > > > Hi, > > > > On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote: > > > On Sat, 11 Apr 2020 at 09:00, Teja Mupparti wrote: > > > /* > > > * We WAL-log the truncation before actually truncating, which means > > > * trouble if the truncation fails. If we then crash, the WAL replay > > > * likely isn't going to succeed in the truncation either, and cause a > > > * PANIC. It's tempting to put a critical section here, but that cure > > > * would be worse than the disease. It would turn a usually harmless > > > * failure to truncate, that might spell trouble at WAL replay, into a > > > * certain PANIC. > > > */ > > > > Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL > > log something and then not perform the action. This leads to the primary > > / replica getting out of sync, crash recovery potentially not completing > > (because of records referencing the should-be-truncated pages), ... That comment was introduced in 2008 by commit 3396000684, for 8.4, so it can be seen as an oversight made when log-shipping was introduced. The reason other operations like INSERTs (which extend the underlying file) are "safe" after an extension failure is that subsequent operations proceed in shared buffers as if the new page exists, and then try to extend the file again. So if we continued working after a truncation failure, we would need shared buffers to behave as if the truncated pages were already gone. But there is no room for another flag in the buffer header. For example, BM_DIRTY && !BM_VALID might be usable as the state meaning "this page should have been truncated but hasn't been yet", but I'm not sure. Anyway, I think the prognosis of a truncation failure is far more hopeless than that of an extension failure in most cases, and I doubt it's worth introducing such a complex feature only to cope with such a hopeless situation. In short, I think we should PANIC in that case. 
> > > As a second idea, I wonder if we can defer truncation until commit > > > time like smgrDoPendingDeletes mechanism. The sequence would be: > > > > This is mostly an issue during [auto]vacuum partially truncating the end > > of the file. We intentionally release the AEL regularly to allow other > > accesses to continue. > > > > For transactional truncations we don't go down this path (as we create a > > new relfilenode). > > > > > > > At RelationTruncate(), > > > 1. WAL logging. > > > 2. Remember buffers to be dropped. > > > > You definitely cannot do that, as explained above. > > Ah yes, you're right. > > So it seems to me currently what we can do for this issue would be to > enclose the truncation operation in a critical section. IIUC it's not > enough just to reverse the order of dropping buffers and physical file > truncation because it cannot solve the problem of inconsistency on the > standby. And as Horiguchi-san mentioned, there is no need to reverse > that order if we envelop the truncation operation by a critical > section because we can recover page changes during crash recovery. The > strategy of writing out all dirty buffers before dropping buffers, > proposed as (a) in [1], also seems not enough. Agreed. Since neither WAL-logging without performing the action nor performing the action without WAL-logging is acceptable, there's no other way than to work as if the truncation succeeded (and try again later) even if it actually failed. But that would be too complex. Just making it a critical section seems the right thing here. > [1] > https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de > Doing sync before truncation regards. -- Kyotaro Horiguchi NTT Open Source Software Center
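The ordering being agreed on above can be illustrated with a toy sketch. This is a standalone stand-in, not the backend's machinery: the counter mimics START_CRIT_SECTION/END_CRIT_SECTION, the PANIC is simulated by a flag, and all names are made up.

```c
#include <stdio.h>

/* Standalone sketch (not PostgreSQL code): once the truncation has been
 * WAL-logged, a failure to actually truncate must escalate to PANIC
 * rather than a plain ERROR, because the WAL already claims the action
 * happened.  The counter mimics START/END_CRIT_SECTION. */
static int crit_section_count = 0;
static int panicked = 0;

static void start_crit_section(void) { crit_section_count++; }
static void end_crit_section(void)   { crit_section_count--; }

/* In a critical section, any error is promoted to PANIC. */
static void report_error(const char *msg)
{
    if (crit_section_count > 0)
    {
        panicked = 1;           /* stand-in for a real PANIC/abort */
        fprintf(stderr, "PANIC: %s\n", msg);
    }
    else
        fprintf(stderr, "ERROR: %s\n", msg);
}

/* WAL-log first, then truncate, both inside the critical section.
 * Returns 1 if the failure was escalated to PANIC. */
static int truncate_relation(int wal_ok, int truncate_ok)
{
    panicked = 0;
    start_crit_section();
    if (!wal_ok)
        report_error("could not WAL-log truncation");
    else if (!truncate_ok)
        report_error("could not truncate relation");
    end_crit_section();
    return panicked;
}
```

With both steps inside the section, a truncation failure after the WAL record is written can no longer be survived silently, which is exactly the trade-off discussed above.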
Should program exit, When close() failed for O_RDONLY mode
Hi, I find that most of the code does not check the return value of close() when a file is opened for reading (O_RDONLY). But the code in "src/bin/pg_rewind/copy_fetch.c" does check the return value of close() after opening a file for reading (O_RDONLY), and on failure it calls pg_fatal, causing a premature exit. I think that when closing a read-only file fails, the program should not exit early; it should ensure that execution completes. Like below:
・src/bin/pg_rewind/copy_fetch.c
before
--
rewind_copy_file_range
{
    ...
    if (close(srcfd) != 0)
        pg_fatal("could not close file \"%s\": %m", srcpath);
}
--
after
--
rewind_copy_file_range
{
    ...
    close(srcfd);
}
--
Regards, -- Lin
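A middle ground between the two snippets above would be to warn instead of either exiting or ignoring the failure. This is a minimal sketch assuming a POSIX environment; the helper name is invented and this is not the actual pg_rewind code.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: for a descriptor opened with O_RDONLY there are
 * no buffered writes to lose, so a close() failure can be reported as a
 * warning instead of a fatal exit, letting the program finish its work. */
static int
close_readonly(int fd, const char *path)
{
    if (close(fd) != 0)
    {
        fprintf(stderr, "warning: could not close file \"%s\"\n", path);
        return -1;
    }
    return 0;
}
```

Note that for files opened for *writing*, a failing close() can signal lost data (e.g. EIO on some filesystems), so the fatal treatment remains appropriate there.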
Re: Vacuum o/p with (full 1, parallel 0) option throwing an error
On Mon, Apr 13, 2020 at 05:55:43PM +0530, Amit Kapila wrote: > On Mon, Apr 13, 2020 at 4:23 PM Masahiko Sawada > wrote: > I am not very sure about this. I don't think the current text is wrong > especially when you see the value we can specify there is described > as: "Specifies a non-negative integer value passed to the selected > option.". However, we can consider changing it if others also think > the proposed text or something like that is better than current text. FWIW, the current formulation in the docs looked fine to me. > Yeah, something on these lines would be a good idea. Note that, we are > already planning to slightly change this particular sentence in > another patch [1]. > > [1] - > https://www.postgresql.org/message-id/20200322021801.GB2563%40telsasoft.com Makes sense. I have two comments. ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("cannot specify both FULL and PARALLEL options"))); + errmsg("VACUUM FULL cannot be performed in parallel"))); Better to avoid a full sentence here [1]? This should be a "cannot do foo" error. -VACUUM (PARALLEL 1) tmp; -- disables parallel vacuum option +VACUUM (PARALLEL 1) tmp; -- parallel vacuum disabled for temp tables WARNING: disabling parallel option of vacuum on "tmp" --- cannot vacuum temporary tables in parallel +VACUUM (PARALLEL 0, FULL TRUE) tmp; -- can specify parallel disabled (even though that's implied by FULL) To fully close the gap in the tests, I would also add a test for (PARALLEL 1, FULL false) where FULL is directly specified, even if that sounds like a nit. That's fine to test even on a temporary table. [1]: https://www.postgresql.org/docs/devel/error-style-guide.html -- Michael signature.asc Description: PGP signature
Re: relcache leak warnings vs. errors
On Mon, Apr 13, 2020 at 04:22:26PM -0400, Tom Lane wrote: > Andres Freund writes: >> I'd much rather see this throw an assertion than the current >> behaviour. But I'm wondering if there's a chance we can throw an error >> in non-assert builds without adding too much complexity to the error >> paths. Could we perhaps throw the error a bit later during the commit >> processing? > > Any error post-commit is a semantic disaster. Yes, I can immediately think of two problems in the very recent history where this has bitten. > I guess that an assertion wouldn't be so awful, if people would rather > do it like that in debug builds. WARNING is useful mainly for tests where the output is checked, like the main regression test suite. Now that TAP scenarios get more and more complex, +1 on the addition of an assertion for a hard failure. I don't think either that's worth controlling with a developer GUC. -- Michael signature.asc Description: PGP signature
Re: weird hash plan cost, starting with pg10
On Mon, Apr 13, 2020 at 9:53 PM Tom Lane wrote: > Richard Guo writes: > > At first I was wondering if we need to check whether HashState.hashtable > > is not NULL in ExecShutdownHash() before we decide to allocate save > > space for HashState.hinstrument. And then I convinced myself that that's > > not necessary since HashState.hinstrument and HashState.hashtable cannot > > be both NULL there. > > Even if the hashtable is null at that point, creating an all-zeroes > hinstrument struct is harmless. > Correct. The only benefit we may get from checking if the hashtable is null is to avoid an unnecessary palloc0 for hinstrument. But that case cannot happen though. Thanks Richard
Re: 001_rep_changes.pl stalls
Noah Misch writes: > This seems to have made the following race condition easier to hit: > https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com > https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us Yeah, I just came to the same guess in the other thread. > While I don't think that indicates anything wrong with the fix for $SUBJECT, > creating more buildfarm noise is itself bad. I am inclined to revert the fix > after a week. Not immediately, in case it uncovers lower-probability bugs. > I'd then re-commit it after one of those threads fixes the other bug. Would > anyone like to argue for a revert earlier, later, or not at all? I don't think you should revert. Those failures are (just) often enough to be annoying but I do not think that a proper fix is very far away. regards, tom lane
Re: 001_rep_changes.pl stalls
On Sun, Apr 05, 2020 at 11:36:49PM -0700, Noah Misch wrote: > Executive summary: the "MyWalSnd->write < sentPtr" in WalSndWaitForWal() is > important for promptly updating pg_stat_replication. When caught up, we > should impose that logic before every sleep. The one-line fix is to sleep in > WalSndLoop() only when pq_is_send_pending(), not when caught up. This seems to have made the following race condition easier to hit: https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us Now it happened eight times in three days, all on BSD machines: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2020-04-11%2018%3A30%3A21 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-11%2018%3A45%3A39 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2020-04-11%2020%3A30%3A26 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-11%2021%3A45%3A48 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2010%3A45%3A35 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2020-04-13%2016%3A00%3A18 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2018%3A45%3A34 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2023%3A45%3A22 While I don't think that indicates anything wrong with the fix for $SUBJECT, creating more buildfarm noise is itself bad. I am inclined to revert the fix after a week. Not immediately, in case it uncovers lower-probability bugs. I'd then re-commit it after one of those threads fixes the other bug. Would anyone like to argue for a revert earlier, later, or not at all? 
There was a novel buildfarm failure, in 010_logical_decoding_timelines.pl: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-04-13%2008%3A35%3A05 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-04-13%2017%3A15%3A01 Most-relevant lines of the test script: $node_master->safe_psql('postgres', "INSERT INTO decoding(blah) VALUES ('afterbb');"); $node_master->safe_psql('postgres', 'CHECKPOINT'); $node_master->stop('immediate'); The failure suggested the INSERT was not replicated before the immediate stop. I can reproduce that consistently, before or after the fix for $SUBJECT, by modifying walsender to delay 0.2s before sending WAL: --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -65,2 +65,3 @@ #include "libpq/pqformat.h" +#include "libpq/pqsignal.h" #include "miscadmin.h" @@ -2781,2 +2782,5 @@ retry: + PG_SETMASK(&BlockSig); + pg_usleep(200 * 1000); + PG_SETMASK(&UnBlockSig); pq_putmessage_noblock('d', output_message.data, output_message.len); I will shortly push a fix adding a wait_for_catchup to the test. I don't know if/how fixing $SUBJECT made this 010_logical_decoding_timelines.pl race condition easier to hit.
Re: Race condition in SyncRepGetSyncStandbysPriority
Kyotaro Horiguchi writes: > At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane wrote in >> What I think we should do about this is, essentially, to get rid of >> SyncRepGetSyncStandbys. Instead, let's have each walsender advertise >> whether *it* believes that it is a sync standby, based on its last >> evaluation of the relevant GUCs. This would be a bool that it'd >> compute and set alongside sync_standby_priority. (Hm, maybe we'd not > Mmm.. SyncRepGetStandbyPriority returns the "priority" that a > walsender thinks it is at, among synchronous_standby_names. Then to > decide "I am a sync standby" we need to know how many walsenders with > higher priority are alive now. SyncRepGetSyncStandbyPriority does the > judgment now and suffers from the inconsistency of priority values. Yeah. After looking a bit closer, I think that the current definition of sync_standby_priority (that is, as the result of local evaluation of SyncRepGetStandbyPriority()) is OK. The problem is what we're doing with it. I suggest that what we should do in SyncRepGetSyncRecPtr() is make one sweep across the WalSnd array, collecting PID, sync_standby_priority, *and* the WAL pointers from each valid entry. Then examine that data and decide which WAL value we need, without assuming that the sync_standby_priority values are necessarily totally consistent. But in any case we must examine each entry just once while holding its mutex, not go back to it later expecting it to still be the same. Another thing that I'm finding interesting is that I now see this is not at all new code. It doesn't look like SyncRepGetSyncStandbysPriority has changed much since 2016. So how come we didn't detect this problem long ago? 
I searched the buildfarm logs for assertion failures in syncrep.c, looking back one year, and here's what I found:

  sysname   |    branch     |      snapshot       |     stage     | l
------------+---------------+---------------------+---------------+---
 nightjar   | REL_10_STABLE | 2019-08-13 23:04:41 | recoveryCheck | TRAP: FailedAssertion("!(((bool) 0))", File: "/pgbuild/root/REL_10_STABLE/pgsql.build/../pgsql/src/backend/replication/syncrep.c", Line: 940)
 hoverfly   | REL9_6_STABLE | 2019-11-07 17:19:12 | recoveryCheck | TRAP: FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
 hoverfly   | HEAD          | 2019-11-22 12:15:08 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
 francolin  | HEAD          | 2020-01-16 23:10:06 | recoveryCheck | TRAP: FailedAssertion("false", File: "/home/andres/build/buildfarm-francolin/HEAD/pgsql.build/../pgsql/src/backend/replication/syncrep.c", Line: 951)
 hoverfly   | REL_11_STABLE | 2020-02-29 01:34:55 | recoveryCheck | TRAP: FailedAssertion("!(0)", File: "syncrep.c", Line: 946)
 hoverfly   | REL9_6_STABLE | 2020-03-26 13:51:15 | recoveryCheck | TRAP: FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
 hoverfly   | REL9_6_STABLE | 2020-04-07 21:52:07 | recoveryCheck | TRAP: FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
 curculio   | HEAD          | 2020-04-11 18:30:21 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
 sidewinder | HEAD          | 2020-04-11 18:45:39 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
 curculio   | HEAD          | 2020-04-11 20:30:26 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
 sidewinder | HEAD          | 2020-04-11 21:45:48 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
 sidewinder | HEAD          | 2020-04-13 10:45:35 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
 conchuela  | HEAD          | 2020-04-13 16:00:18 | recoveryCheck | TRAP: FailedAssertion("false", File: "/home/pgbf/buildroot/HEAD/pgsql.build/../pgsql/src/backend/replication/syncrep.c", Line: 951)
 sidewinder | HEAD          | 2020-04-13 18:45:34 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
(14 rows)

The line numbers vary in the back branches, but all of these crashes are at that same Assert. So (a) yes, this does happen in the back branches, but (b) some fairly recent change has made it a whole lot more probable. Neither syncrep.c nor 007_sync_rep.pl have changed much in some time, so whatever the change was was indirect. Curious. Is it just timing? I'm giving the side-eye to Noah's recent changes 328c70997 and 421685812, but this isn't enough evidence to say definitely that that's what boosted the failure rate. regards, tom lane
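The single-sweep idea proposed earlier in this message (copy each slot's fields exactly once under its lock, then decide from the local copies) can be sketched in miniature. This is a standalone toy, not the real SyncRepGetSyncRecPtr: the types are stubs and a pthread mutex stands in for the per-walsender spinlock.

```c
#include <pthread.h>

/* Simplified stand-in for the walsender array (not PostgreSQL types). */
typedef struct
{
    pthread_mutex_t mutex;
    int             pid;        /* 0 means slot unused */
    int             priority;   /* 0 means not listed as a sync standby */
    long            flush_lsn;  /* stand-in for an XLogRecPtr */
} SlotStub;

typedef struct
{
    int  pid;
    int  priority;
    long flush_lsn;
} SlotSnap;

/* Snapshot every slot exactly once under its own lock, then pick the
 * live candidate with the best (lowest) priority from the copies, so a
 * later re-read can never observe the slot changing underneath us.
 * Returns the chosen index, or -1 if there is no candidate. */
static int
pick_sync_standby(SlotStub *slots, int n, SlotSnap *snap)
{
    int best = -1;
    int i;

    for (i = 0; i < n; i++)
    {
        pthread_mutex_lock(&slots[i].mutex);
        snap[i].pid = slots[i].pid;
        snap[i].priority = slots[i].priority;
        snap[i].flush_lsn = slots[i].flush_lsn;
        pthread_mutex_unlock(&slots[i].mutex);
    }
    for (i = 0; i < n; i++)
    {
        if (snap[i].pid == 0 || snap[i].priority <= 0)
            continue;
        if (best < 0 || snap[i].priority < snap[best].priority)
            best = i;
    }
    return best;
}
```

The key property is that the decision tolerates priorities that are mutually inconsistent across slots: whatever values were captured in the one sweep are the values reasoned about.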
Re: wrong relkind error messages
On Mon, Apr 13, 2020 at 11:13:15AM -0400, Tom Lane wrote: > Fixing this while avoiding your concern about proliferation of messages > seems a bit difficult though. The best I can do after a couple minutes' > thought is > > ERROR: cannot define statistics for relation "ti" > DETAIL: "ti" is an index, and this operation is not supported for that > kind of relation. > > which seems a little long and awkward. Another idea is > > ERROR: cannot define statistics for relation "ti" > DETAIL: This operation is not supported for indexes. > > which still leaves implicit that "ti" is an index, but probably that's > something the user can figure out. > > Maybe someone else can do better? "This operation is not supported for put_relkind_here \"%s\"."? I think that it is better to provide a relation name in the error message (even optionally a namespace). That's less to guess for the user. +int +errdetail_relkind(const char *relname, char relkind) +{ + switch (relkind) + { + case RELKIND_RELATION: + return errdetail("\"%s\" is a table.", relname); + case RELKIND_INDEX: It seems to me that we should optionally add the namespace in the error message, or just have a separate routine for that. I think that it would be useful in some cases (see for example the part about the statistics in the patch), still annoying in some others (instability in test output for temporary schemas for example) so there is a point for both in my view. - if (rel->rd_rel->relkind != RELKIND_VIEW && - rel->rd_rel->relkind != RELKIND_COMPOSITE_TYPE && - rel->rd_rel->relkind != RELKIND_FOREIGN_TABLE && - rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE) - { + if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind)) RelationDropStorage(rel); These should be applied separately in my opinion. Nice catch. 
- errmsg("\"%s\" is not a table, view, materialized view, sequence, or foreign table", -rv->relname))); + errmsg("cannot change schema of relation \"%s\"", +rv->relname), + (relkind == RELKIND_INDEX || relkind == RELKIND_PARTITIONED_INDEX ? errhint("Change the schema of the table instead.") : + (relkind == RELKIND_COMPOSITE_TYPE ? errhint("Use ALTER TYPE instead.") : 0; This is not great style either and reduces readability, so I would recommend splitting the errhint generation using a switch/case. + ereport(ERROR, + (errcode(ERRCODE_WRONG_OBJECT_TYPE), +errmsg("action cannot be performed on relation \"%s\"", + RelationGetRelationName(rel)), Echoing Robert upthread, "action" is not really useful for the user, and it seems to me that it should be reworked as "cannot perform foo on relation \"hoge\"". +errmsg("relation \"%s\" does not support comments", + RelationGetRelationName(relation)), This is not project-style as full sentences cannot be used in error messages, no? The former is not that good either, still, while this is getting touched... Say, "cannot use COMMENT on relation \"%s\""? Overall +1 on the patch by the way. Thanks for sending something to improve the situation. -- Michael signature.asc Description: PGP signature
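The switch/case shape suggested in the review above could look roughly like this. This is a standalone sketch with hard-coded relkind letters mirroring PostgreSQL's catalog codes, not the patched backend function.

```c
#include <stddef.h>

/* Illustrative only: picking the hint for "cannot change schema of
 * relation" with a switch instead of nested ternaries.  'i', 'I' and 'c'
 * mirror RELKIND_INDEX, RELKIND_PARTITIONED_INDEX and
 * RELKIND_COMPOSITE_TYPE respectively. */
static const char *
schema_change_hint(char relkind)
{
    switch (relkind)
    {
        case 'i':               /* RELKIND_INDEX */
        case 'I':               /* RELKIND_PARTITIONED_INDEX */
            return "Change the schema of the table instead.";
        case 'c':               /* RELKIND_COMPOSITE_TYPE */
            return "Use ALTER TYPE instead.";
        default:
            return NULL;        /* no hint for other relkinds */
    }
}
```

The caller would then emit the errhint only when a non-NULL string comes back, keeping each relkind's wording in one obvious place.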
Re: [patch] some PQExpBuffer are not destroyed in pg_dump
On Mon, Apr 13, 2020 at 04:51:06PM +0900, Masahiko Sawada wrote: > On Tue, 7 Apr 2020 at 11:42, Zhang, Jie wrote: >> In getDefaultACLs function, some PQExpBuffer are not destroyed > > Yes, it looks like an oversight. It's related to the commit > e2090d9d20d809 which is back-patched to 9.6. > > The patch looks good to me. Indeed. Any code path of pg_dump calling buildACLQueries() cleans up things, and I think that it is a better practice to properly clean up PQExpBuffer stuff even if there is always the argument that pg_dump is a tool running in a "short"-term context. So I will backpatch that unless there are any objections from others. The part I am actually rather amazed by here is that I don't recall seeing Coverity complaining about leaks after this commit. Perhaps it just got lost. -- Michael signature.asc Description: PGP signature
Re: pgbench - test whether a variable exists
On Mon, Apr 13, 2020 at 09:54:01AM +0200, Fabien COELHO wrote: > Attached a v4. I'm resurrecting this small patch, after "\aset" has been > added to pgbench (9d8ef988). Hmm. It seems to me that this stuff needs to be more careful with the function handling? For example, all those cases fail but they directly pass down a variable that may not be defined, so shouldn't those results be undefined as well instead of failing: \set g double(:{?no_such_variable}) \set g exp(:{?no_such_variable}) \set g greatest(:{?no_such_variable}, :{?no_such_variable}) \set g int(:{?no_such_variable}) It seems to me that there could be a point in having the result of any function become undefined when at least one argument is undefined (the point could be made as well that things like greatest just ignore conditioned variables), so I was surprised not to see the logic more linked with ENODE_VARIABLE. If your intention is to keep this behavior, it should at least be tested I guess. Please note that this patch will have to wait until v14 opens for business for more comments. :p -- Michael signature.asc Description: PGP signature
Re: pg_basebackup, manifests and backends older than ~12
On Mon, Apr 13, 2020 at 07:55:07PM -0400, Robert Haas wrote: > Oh, hmm. Maybe I'm getting confused with a previous version of the > patch that behaved differently. No problem. If you prefer keeping this part of the code, that's fine by me. If you think that the patch is suited as-is, including silencing the error forcing to use --no-manifest on server versions older than v13, I am fine to help out and apply it myself, but I am also fine if you wish to take care of it by yourself. -- Michael signature.asc Description: PGP signature
Re: pg_basebackup, manifests and backends older than ~12
On Mon, Apr 13, 2020 at 6:26 PM Michael Paquier wrote: > Well, the documentation tells me that as of protocol.sgml: > "For compatibility with previous releases, the default is > MANIFEST 'no'." > > The code also tells me that, in line with the docs: > static void > parse_basebackup_options(List *options, basebackup_options *opt) > [...] > MemSet(opt, 0, sizeof(*opt)); > opt->manifest = MANIFEST_OPTION_NO; > > And there is also a TAP test for that when passing down --no-manifest, > which should not create a backup manifest: > $node->command_ok( > [ > 'pg_basebackup', '-D', "$tempdir/backup2", '--no-manifest', > '--waldir', "$tempdir/xlog2" > ], > > So, it seems to me that it is fine to remove this block, as when > --no-manifest is used, then "manifest" gets set to false, and then it > does not matter if the MANIFEST clause is added or not as we'd just > rely on the default. Keeping the block would matter if you want to > make the code more robust to a change of the default value in the > BASE_BACKUP query though, and its logic is not incorrect either. So, > if you wish to keep it, that's fine by me, but it looks cleaner to me > to remove it and more consistent with the other options like MAX_RATE, > TABLESPACE_MAP, etc. Oh, hmm. Maybe I'm getting confused with a previous version of the patch that behaved differently. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Poll: are people okay with function/operator table redesign?
"Jonathan S. Katz" writes: > On 4/13/20 7:02 PM, Jonathan S. Katz wrote: >> Perhaps a counterproposal: We eliminate the content in the leftmost >> "function column, but leave that there to allow the function name / >> signature to span the full 3 columns. Then the rest of the info goes >> below. This will also compress the table height down a bit. > An attempt at a "POC" of what I'm describing (attached image). Hmm ... what is determining the width of the left-hand column? It doesn't seem to have any content, since the function entries are being spanned across the whole table. I think the main practical problem though is that it wouldn't work nicely for operators, since the key "name" you'd be looking for would not be at the left of the signature line. I suppose we don't necessarily have to have the same layout for operators as for functions, but it feels like it'd be jarringly inconsistent. regards, tom lane
Re: Poll: are people okay with function/operator table redesign?
On 2020-Apr-13, Jonathan S. Katz wrote: > On 4/13/20 7:02 PM, Jonathan S. Katz wrote: > > Perhaps a counterproposal: We eliminate the content in the leftmost > > "Function" column, but leave that there to allow the function name / > > signature to span the full 3 columns. Then the rest of the info goes > > below. This will also compress the table height down a bit. > > An attempt at a "POC" of what I'm describing (attached image). > > I'm not sure if I 100% like it, but it does reduce the amount of > information we're displaying but conveys all the details (and matches > what we have in the previous version). Ooh, this seems a nice idea -- the indentation seems to be sufficient to tell apart entries from each other. Your point about information reduction refers to the fact that we no longer keep the unadorned name but only the signature, right? That seems an improvement to me now that I look at it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Poll: are people okay with function/operator table redesign?
I wrote: > I don't think I like the <hr> version better than --- it adds > quite a bit of vertical space, more than I was expecting really. Actually, after staring more at HTML-hr.png, what's *really* bothering me about that rendering is that the lines made by <hr> are actually wider than the inter-table-cell lines. Surely we want the opposite relationship. Presumably that could be fixed with some css-level adjustments; and maybe the spacing could be tightened up a bit too? I do like having that visual separation, it just needs to be toned down compared to the table cell separators. Reproducing the effect in the PDF build remains an issue, too. regards, tom lane
Re: Poll: are people okay with function/operator table redesign?
On 4/13/20 7:02 PM, Jonathan S. Katz wrote: > On 4/13/20 6:51 PM, Tom Lane wrote: >> "Jonathan S. Katz" writes: >>> I think one thing that was throwing me off was having the function >>> signature before the description. I would recommend flipping them: have >>> the function description first, followed by signature, followed by >>> examples. I think that follows the natural flow more of what one is >>> doing when they look up the function. >> >> The trouble with that is it doesn't work very well when we have >> multiple similarly-named functions with different signatures. >> Consider what the two enum_range() entries in 9.33 will look like, >> for example. I think we need the signature to establish which function >> we're talking about. > > I get that, I just find I'm doing too much thinking looking at it. > > Perhaps a counterproposal: We eliminate the content in the leftmost > "Function" column, but leave that there to allow the function name / > signature to span the full 3 columns. Then the rest of the info goes > below. This will also compress the table height down a bit. An attempt at a "POC" of what I'm describing (attached image). I'm not sure if I 100% like it, but it does reduce the amount of information we're displaying but conveys all the details (and matches what we have in the previous version). The alignment could be adjusted if need be, too. Jonathan signature.asc Description: OpenPGP digital signature
Re: pg_basebackup, manifests and backends older than ~12
On 2020-Apr-13, Michael Paquier wrote: > On Mon, Apr 13, 2020 at 11:52:51AM +0900, Kyotaro Horiguchi wrote: > > Since I'm not sure about the work flow that contains taking a > > basebackup from a server of a different version, I'm not sure which is > > better between silently disabling and erroring out. However, it seems > > to me, the option for replication slot is a choice of the way the tool > > works which doesn't affect the result itself, but that for backup > > manifest is about what the resulting backup contains. Therefore I > > think it is better that pg_basebackup in PG13 should error out if the > > source server doesn't support backup manifest but --no-manifest is not > > specfied, and show how to accomplish their wants (, though I don't see > > the wants clearly). > > Not sure what Robert and other authors of the feature think about > that. What I am rather afraid of is somebody deciding to patch a > script aimed at working across multiple backend versions to add > unconditionally --no-manifest all the time, even for v13. That would > kill the purpose of encouraging the use of manifests. I agree, I think forcing users to specify --no-manifest when run on old servers will cause users to write bad scripts; I vote for silently disabling checksums. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Poll: are people okay with function/operator table redesign?
On 4/13/20 6:51 PM, Tom Lane wrote: > "Jonathan S. Katz" writes: >> I think one thing that was throwing me off was having the function >> signature before the description. I would recommend flipping them: have >> the function description first, followed by signature, followed by >> examples. I think that follows the natural flow more of what one is >> doing when they look up the function. > > The trouble with that is it doesn't work very well when we have > multiple similarly-named functions with different signatures. > Consider what the two enum_range() entries in 9.33 will look like, > for example. I think we need the signature to establish which function > we're talking about. I get that, I just find I'm doing too much thinking looking at it. Perhaps a counterproposal: We eliminate the content in the leftmost "Function" column, but leave that there to allow the function name / signature to span the full 3 columns. Then the rest of the info goes below. This will also compress the table height down a bit. >> There are probably some things we can do with shading on the pgweb side >> to make items more distinguishable, I don't think that would be too >> terrible to add. > > Per David's earlier comment, it seems like alternating backgrounds might > be feasible if we can get it down to one per function, as the > version I just posted has. Or a classname on the "" when a new function starts, or the like. Easy enough to get the CSS to work off of that :) Jonathan signature.asc Description: OpenPGP digital signature
Re: Poll: are people okay with function/operator table redesign?
"Jonathan S. Katz" writes: > I think one thing that was throwing me off was having the function > signature before the description. I would recommend flipping them: have > the function description first, followed by signature, followed by > examples. I think that follows the natural flow more of what one is > doing when they look up the function. The trouble with that is it doesn't work very well when we have multiple similarly-named functions with different signatures. Consider what the two enum_range() entries in 9.33 will look like, for example. I think we need the signature to establish which function we're talking about. > There are probably some things we can do with shading on the pgweb side > to make items more distinguishable, I don't think that would be too > terrible to add. Per David's earlier comment, it seems like alternating backgrounds might be feasible if we can get it down to one per function, as the version I just posted has. regards, tom lane
Re: Poll: are people okay with function/operator table redesign?
> > Yeah, back at the beginning of this exercise, Alvaro wondered aloud > if we should go to something other than tables altogether. I dunno > what that'd look like though. > It would probably look like our acronyms and glossary pages. Maybe the return example and return values get replaced with a programlisting?
Re: Poll: are people okay with function/operator table redesign?
> > Thinking out loud, it'd also be great if we could add in some anchors as > well, so perhaps in the future on the pgweb side we could add in some > discoverable links that other documentation has -- which in turn people > could click / link to others directly to the function name. > +1
Re: Poll: are people okay with function/operator table redesign?
On 4/13/20 1:13 PM, Tom Lane wrote: > As discussed in the thread at [1], I've been working on redesigning > the tables we use to present SQL functions and operators. The > first installment of that is now up; see tables 9.30 and 9.31 at > > https://www.postgresql.org/docs/devel/functions-datetime.html > > and table 9.33 at > > https://www.postgresql.org/docs/devel/functions-enum.html > > Before I spend more time on this, I want to make sure that people > are happy with this line of attack. Comparing these tables to > the way they look in v12, they clearly take more vertical space; > but at least to my eye they're less cluttered and more readable. > They definitely scale a lot better for cases where a long function > description is needed, or where we'd like to have more than one > example. Does anyone prefer the old way, or have a better idea? > > I know that the table headings are a bit weirdly laid out; hopefully > that can be resolved [2]. > [2] https://www.postgresql.org/message-id/6169.1586794603%40sss.pgh.pa.us When evaluating [2], I will admit at first I was very confused about the layout and wasn't exactly sure what you were saying was incorrect in that note. After fixing [2] on my local copy, I started to look at it again. For positives, I do think it's an improvement for readability on mobile. Flow/content aside, it was easier to read and follow what was going on and there was less side scrolling. I think one thing that was throwing me off was having the function signature before the description. I would recommend flipping them: have the function description first, followed by signature, followed by examples. I think that follows the natural flow more of what one is doing when they look up the function. I think that would also benefit larger tables too: instead of having to scroll up to understand how things are laid out, it'd follow said flow.
There are probably some things we can do with shading on the pgweb side to make items more distinguishable; I don't think that would be too terrible to add. Thinking out loud, it'd also be great if we could add in some anchors as well, so perhaps in the future on the pgweb side we could add in some discoverable links that other documentation has -- which in turn people could click / link to others directly to the function name. Anyway, change is hard. I'm warming up to it. Jonathan
Re: pg_basebackup, manifests and backends older than ~12
On Mon, Apr 13, 2020 at 11:13:06AM -0400, Robert Haas wrote: > I think that this patch is incorrect. I have no objection to > introducing MINIMUM_VERSION_FOR_MANIFESTS, but this is not OK: > > - else > - { > - if (serverMajor < 1300) > - manifest_clause = ""; > - else > - manifest_clause = "MANIFEST 'no'"; > - } > > It seems to me that this will break --no-manifest option on v13. Well, the documentation tells me that as of protocol.sgml: "For compatibility with previous releases, the default is MANIFEST 'no'." The code also tells me that, in line with the docs: static void parse_basebackup_options(List *options, basebackup_options *opt) [...] MemSet(opt, 0, sizeof(*opt)); opt->manifest = MANIFEST_OPTION_NO; And there is also a TAP test for that when passing down --no-manifest, which should not create a backup manifest: $node->command_ok( [ 'pg_basebackup', '-D', "$tempdir/backup2", '--no-manifest', '--waldir', "$tempdir/xlog2" ], So, it seems to me that it is fine to remove this block, as when --no-manifest is used, then "manifest" gets set to false, and then it does not matter if the MANIFEST clause is added or not as we'd just rely on the default. Keeping the block would matter if you want to make the code more robust to a change of the default value in the BASE_BACKUP query though, and its logic is not incorrect either. So, if you wish to keep it, that's fine by me, but it looks cleaner to me to remove it and more consistent with the other options like MAX_RATE, TABLESPACE_MAP, etc. -- Michael
Re: documenting the backup manifest file format
On 2020-Apr-13, Robert Haas wrote: > On Mon, Apr 13, 2020 at 3:34 PM Alvaro Herrera > wrote: > > Are these hex figures upper or lower case? No leading zeroes? This > > would normally not matter, but the toplevel checksum will care. > > Not really. You just feed the whole file except for the last line > through shasum and you get the answer. > > It so happens that the server generates lower-case, but > pg_verifybackup will accept either. > > Leading zeroes are not omitted. If the checksum's not the right > length, it ain't gonna work. If SHA is used, it's the same output you > would get from running shasum -a on the file, which is > certainly a fixed length. I assumed that this followed from the > statement that there are two characters per byte in the checksum, and > from the fact that no checksum algorithm I know about drops leading > zeroes in the output. Eh, apologies, I was completely unclear -- I was looking at the LSN fields when writing the above. So the leading zeroes and letter case comment refers to those in the LSN values. I agree that it doesn't matter as long as the same tool generates the json file and writes the checksum. > > Also, I see no mention of prettification-chars such as newlines or > > indentation. I suppose if I pass a manifest file through > > prettification (or Windows newline conversion), the checksum may > > break. > > It would indeed break. I'm not sure what you want me to say here, > though. If you're trying to parse a manifest, you shouldn't care about > how the whitespace is arranged. If you're trying to generate one, you > can arrange it any way you like, as long as you also include it in the > checksum. Yeah, I guess I'm just saying that it feels brittle to have a file format that's supposed to be good for data exchange and then make it itself depend on representation details such as the order that fields appear in, the letter case, or the format of newlines. Maybe this isn't really of concern, but it seemed strange. 
> > As for Last-Modification, I think the spec should indicate the exact > format that's used, because it'll also be critical for checksumming. > > Again, I don't think it really matters for checksumming, but it's > "YYYY-MM-DD HH:MM:SS TZ" format, where TZ is always GMT. I agree that whatever format you use will work as long as it isn't modified. I think strict ISO 8601 might be preferable (with the T in the middle and ending in Z instead of " GMT"). > > Why is the top-level checksum only allowed to be SHA-256, if the > > files can use up to SHA-512? Thanks for the discussion. I think you mostly want to make sure that the manifest is sensible (not corrupt) rather than defend against somebody maliciously giving you an attacking manifest (??). I incline to agree that any SHA-2 hash is going to serve that purpose and have no further comment to make. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
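To make the two renderings being compared concrete, here is a small Python sketch (purely illustrative; the timestamp value is made up) of the manifest-style format versus the strict ISO 8601 form Álvaro suggests. Either works for checksumming, as long as generator and verifier see the same bytes:

```python
from datetime import datetime, timezone

ts = datetime(2020, 4, 13, 19, 34, 46, tzinfo=timezone.utc)

# Format the manifest currently emits: space separator, "GMT" suffix.
manifest_style = ts.strftime("%Y-%m-%d %H:%M:%S GMT")

# Strict ISO 8601 as suggested: "T" separator, trailing "Z".
iso_style = ts.strftime("%Y-%m-%dT%H:%M:%SZ")

print(manifest_style)  # 2020-04-13 19:34:46 GMT
print(iso_style)       # 2020-04-13T19:34:46Z
```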
Re: Poll: are people okay with function/operator table redesign?
On Mon, Apr 13, 2020 at 1:57 PM Tom Lane wrote: > Actually ... if we did it like that, then it would be possible to treat > the signature + description + example(s) as one big table cell with line > breaks rather than row-separator bars. > That would help address the > inadequate-visual-separation-between-groups issue, but on the other hand > maybe we'd end up with too little visual separation between the elements > of a function description. > Speaking in terms of HTML if we use instead of we would get the best of both worlds. David J.
Re: doc review for parallel vacuum
On Mon, Apr 13, 2020 at 03:22:06PM +0530, Amit Kapila wrote: > On Mon, Apr 13, 2020 at 2:00 PM Justin Pryzby wrote: > > > > |Copy the index > > |bulk-deletion result returned from ambulkdelete and amvacuumcleanup to > > |the DSM segment if it's the first time [???] because they allocate locally > > |and it's possible that an index will be vacuumed by a different > > |vacuum process the next time." > > > > Is it correct to say: "..if it's the first iteration" and "different > > process on > > the next iteration" ? Or "cycle" ? > > > > "cycle" sounds better. I have changed the patch as per your latest > comments. Let me know what you think? Looks good. One more change: [-Even if-]{+If+} this option is specified with the ANALYZE [-option-]{+option,+} Remove "even" and add comma. Thanks, -- Justin
Re: Poll: are people okay with function/operator table redesign?
On Mon, Apr 13, 2020 at 1:41 PM Tom Lane wrote: > "David G. Johnston" writes: > > Can we lightly background color every other rowgroup (i.e., "greenbar")? > > If you know how to do that at all, let alone in a maintainable way (ie > one where inserting a new function doesn't require touching the entries > for the ones after), let's see it. > The nth-child({odd|even}) CSS Selector should provide the desired functionality, at least for HTML, but the structure will need to be modified so that there is some single element that represents a single rowgroup. I tried (not too hard) to key off of the presence of the "rowspan" attribute but that does not seem possible. https://www.w3schools.com/cssref/sel_nth-child.asp David J.
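As a sketch of what David describes — assuming the markup were restructured so that each logical function entry becomes its own single element (a hypothetical per-entry `<tbody>` here; the class name is made up) — the striping could look something like:

```css
/* Hypothetical structure: one <tbody> per function entry.
   nth-of-type avoids miscounting when a <thead> precedes the groups. */
table.func-table tbody:nth-of-type(even) {
    background-color: #eef5ee;  /* light "greenbar" shade */
}
```

This is maintainable in the sense Tom asks for: inserting a new function entry re-stripes automatically, with no per-entry markup changes.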
Re: Poll: are people okay with function/operator table redesign?
I wrote: > "David G. Johnston" writes: >> I don't think having a separate Result column helps. The additional >> horizontal whitespace distances all relevant context information (at least >> on a wide monitor). Having the example rows mirror the Signature row seems >> like an easier to consume choice. > Interesting idea. I'm afraid that it would not look so great in cases > where the example-plus-result overflows one line, which would inevitably > happen in PDF format. Still, maybe that would be rare enough to not be > a huge problem. In most places it'd be a win to not have to separately > allocate example and result space. Actually ... if we did it like that, then it would be possible to treat the signature + description + example(s) as one big table cell with line breaks rather than row-separator bars. That would help address the inadequate-visual-separation-between-groups issue, but on the other hand maybe we'd end up with too little visual separation between the elements of a function description. A quick google search turned up this suggestion about how to force line breaks in docbook table cells: http://www.sagehill.net/docbookxsl/LineBreaks.html which seems pretty hacky but it should work. Anyone know a better way? regards, tom lane
Re: documenting the backup manifest file format
On 4/13/20 4:14 PM, Robert Haas wrote: On Mon, Apr 13, 2020 at 3:34 PM Alvaro Herrera wrote: Also, I see no mention of prettification-chars such as newlines or indentation. I suppose if I pass a manifest file through prettification (or Windows newline conversion), the checksum may break. It would indeed break. I'm not sure what you want me to say here, though. If you're trying to parse a manifest, you shouldn't care about how the whitespace is arranged. If you're trying to generate one, you can arrange it any way you like, as long as you also include it in the checksum. pgBackRest ignores whitespace but this is a legacy of the way Perl calculated checksums, not an intentional feature. This worked well when the manifest was loaded as a whole, converted to JSON, and checksummed, but it is a major pain for the streaming code we now have in C. I guarantee that that our next manifest version will do a simple checksum of bytes as Robert has done in this feature. So, I'm +1 as implemented. Why is the top-level checksum only allowed to be SHA-256, if the files can use up to SHA-512? I agree that it's a little bit weird that you can have a stronger checksum for the files instead of the manifest itself, but I also wonder what the use case would be for using a stronger checksum on the manifest. David Steele argued that strong checksums on the files could be useful to software that wants to rifle through all the backups you've ever taken and find another copy of that file by looking for something with a matching checksum. CRC-32C wouldn't be strong enough for that, because eventually you could have enough files that you start to have collisions. The SHA algorithms output enough bits to make that quite unlikely. But this argument only makes sense for the files, not the manifest. Agreed. I think SHA-256 is *more* than enough to protect the manifest against corruption. That said, since the cost of SHA-256 vs. 
SHA-512 in the context of the manifest is negligible, we could just use the stronger algorithm to deflect a similar question going forward. That choice might not age well, but we could always say, well, we picked it because it was the strongest available at the time. Allowing a choice of which algorithm to use for the manifest checksum seems like it will just make verifying the file harder with no tangible benefit. Maybe just a comment in the docs about why SHA-256 was used would be fine. (Also, did we intentionally omit the dash in hash names, so "SHA-256" to make it SHA256? This will also be critical for checksumming the manifest itself.) I debated this with myself, settled on this spelling, and nobody complained until now. It could be changed, though. I didn't have any particular reason for choosing it except the feeling that people would probably prefer to type --manifest-checksum=sha256 rather than --manifest-checksum=sha-256. +1 for sha256 rather than sha-256. Regards, -- -David da...@pgmasters.net
Re: Poll: are people okay with function/operator table redesign?
"David G. Johnston" writes: > Can we lightly background color every other rowgroup (i.e., "greenbar")? If you know how to do that at all, let alone in a maintainable way (ie one where inserting a new function doesn't require touching the entries for the ones after), let's see it. I agree it'd be a nice solution, if we could make it work, but I don't see how. I'd been imagining instead that we could give a different background color to the first line of each group; which I don't know how to do but it at least seems plausible that a style could be attached to a . > I don't think having a separate Result column helps. The additional > horizontal whitespace distances all relevant context information (at least > on a wide monitor). Having the example rows mirror the Signature row seems > like an easier to consume choice. Interesting idea. I'm afraid that it would not look so great in cases where the example-plus-result overflows one line, which would inevitably happen in PDF format. Still, maybe that would be rare enough to not be a huge problem. In most places it'd be a win to not have to separately allocate example and result space. regards, tom lane
Re: Poll: are people okay with function/operator table redesign?
Andrew Dunstan writes: > One thing that did occur to me is that the function/operator name is > essentially redundant, as it's in the signature anyway. Not sure if that > helps us any though. Hm, you have a point there. However, if we drop the lefthand column then there really isn't any visual distinction between the row(s) associated with one function and those of the next. Unless we can find another fix for that aspect (as already discussed in this thread) I doubt it'd be an improvement. > Maybe we're just trying to shoehorn too much information into a single > table. Yeah, back at the beginning of this exercise, Alvaro wondered aloud if we should go to something other than tables altogether. I dunno what that'd look like though. regards, tom lane
Re: Poll: are people okay with function/operator table redesign?
On Mon, Apr 13, 2020 at 11:27 AM Tom Lane wrote: > Isaac Morland writes: > > > - I think there should be much more distinctive lines between the > different > > functions. As it is the fact that the table is groups of 3 lines doesn’t > > jump out at the eye. > > I don't know any easy way to do that. We do already have the grouping > visible in the first column... > Can we lightly background color every other rowgroup (i.e., "greenbar")? I don't think having a separate Result column helps. The additional horizontal whitespace distances all relevant context information (at least on a wide monitor). Having the example rows mirror the Signature row seems like an easier to consume choice. e.g., enum_first(null::rainbow) → red date '2001-09-28' + 7 → 2001-10-05 It also removes the left alignment in a fixed width column which draws unwanted visual attention. David J.
Re: Poll: are people okay with function/operator table redesign?
Robert Haas writes: > I just wonder if there's too much clutter here. Like, line 1: > date - interval → timestamp > OK, gotcha. Line 2: > Subtract an interval from a date > Well, is that really adding anything non-obvious? Yeah, back in the other thread I said >>> I decided to try converting the date/time operators table too, to >>> see how well this works for that. It's bulkier than before, but >>> also (I think) more precise. I realized that this table actually >>> had three examples already for float8 * interval, but it wasn't >>> at all obvious that they were the same operator. So that aspect >>> is a lot nicer here. On the other hand, it seems like the text >>> descriptions are only marginally useful here. I can imagine that >>> they would be useful in some other operator tables, such as >>> geometric operators, but I'm a bit tempted to leave them out >>> in this particular table. The format would adapt to that easily. I wouldn't be averse to dropping the text descriptions for operators in places where they seem obvious ... but who decides what is obvious? Indeed, we've gotten more than one complaint in the past that some of the geometric and JSON operators require a longer explanation than they've got. So one of the points here was to have a format that could adapt to that. But in this particular table I agree they're marginal. regards, tom lane
Re: relcache leak warnings vs. errors
Andres Freund writes: > On 2020-04-11 10:54:49 -0400, Tom Lane wrote: >> I guess you could make them PANICs, but it would be an option that nobody >> could possibly want to have enabled in anything resembling production. >> So I'm kind of -0.5 on making --enable-cassert do it automatically. >> Although I suppose that it's not really worse than other assertion >> failures. > I'd much rather see this throw an assertion than the current > behaviour. But I'm wondering if there's a chance we can throw an error > in non-assert builds without adding too much complexity to the error > paths. Could we perhaps throw the error a bit later during the commit > processing? Any error post-commit is a semantic disaster. I guess that an assertion wouldn't be so awful, if people would rather do it like that in debug builds. regards, tom lane
Re: Parallel copy
On Mon, Apr 13, 2020 at 4:16 PM Andres Freund wrote: > I don't think so. If only one process does the splitting, the > exclusively locked section is just popping off a bunch of offsets of the > ring. And that could fairly easily be done with atomic ops (since what > we need is basically a single producer multiple consumer queue, which > can be done lock free fairly easily ). Whereas in the case of each > process doing the splitting, the exclusively locked part is splitting > along lines - which takes considerably longer than just popping off a > few offsets. Hmm, that does seem believable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Poll: are people okay with function/operator table redesign?
On 4/13/20 1:13 PM, Tom Lane wrote: > As discussed in the thread at [1], I've been working on redesigning > the tables we use to present SQL functions and operators. The > first installment of that is now up; see tables 9.30 and 9.31 at > > https://www.postgresql.org/docs/devel/functions-datetime.html > > and table 9.33 at > > https://www.postgresql.org/docs/devel/functions-enum.html > > Before I spend more time on this, I want to make sure that people > are happy with this line of attack. Comparing these tables to > the way they look in v12, they clearly take more vertical space; > but at least to my eye they're less cluttered and more readable. > They definitely scale a lot better for cases where a long function > description is needed, or where we'd like to have more than one > example. Does anyone prefer the old way, or have a better idea? > > I know that the table headings are a bit weirdly laid out; hopefully > that can be resolved [2]. > > regards, tom lane > > [1] https://www.postgresql.org/message-id/flat/9326.1581457869%40sss.pgh.pa.us > [2] https://www.postgresql.org/message-id/6169.1586794603%40sss.pgh.pa.us > Gotta say I'm not a huge fan. I appreciate the effort, and I get the problem, but I'm not sure we have a net improvement here. One thing that did occur to me is that the function/operator name is essentially redundant, as it's in the signature anyway. Not sure if that helps us any though. Maybe we're just trying to shoehorn too much information into a single table. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Poll: are people okay with function/operator table redesign?
On Mon, Apr 13, 2020 at 2:47 PM Tom Lane wrote: > Another possibility, which'd only help in HTML, would be to render > some of the cells with a slightly different background color. > That's beyond my docbook/css skills, but it might be possible. I think some visual distinction would be really helpful, if we can get it. I just wonder if there's too much clutter here. Like, line 1: date - interval → timestamp OK, gotcha. Line 2: Subtract an interval from a date Well, is that really adding anything non-obvious? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Parallel copy
Hi, On 2020-04-13 14:13:46 -0400, Robert Haas wrote: > On Fri, Apr 10, 2020 at 2:26 PM Andres Freund wrote: > > > Still, it might be the case that having the process that is reading > > > the data also find the line endings is so fast that it makes no sense > > > to split those two tasks. After all, whoever just read the data must > > > have it in cache, and that helps a lot. > > > > Yea. And if it's not fast enough to split lines, then we have a problem > > regardless of which process does the splitting. > > Still, if the reader does the splitting, then you don't need as much > IPC, right? The shared memory data structure is just a ring of bytes, > and whoever reads from it is responsible for the rest. I don't think so. If only one process does the splitting, the exclusively locked section is just popping off a bunch of offsets of the ring. And that could fairly easily be done with atomic ops (since what we need is basically a single producer multiple consumer queue, which can be done lock free fairly easily ). Whereas in the case of each process doing the splitting, the exclusively locked part is splitting along lines - which takes considerably longer than just popping off a few offsets. Greetings, Andres Freund
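A rough single-file sketch of the shape Andres describes: one "splitter" finds line boundaries once, and consumers only pop (start, end) offsets, so the shared-state step stays tiny. This is illustrative Python relying on CPython's thread-safe `deque.popleft()`, where the real thing would use shared memory and atomic ops in C:

```python
import threading
from collections import deque

data = b"one\ntwo\nthree\nfour\nfive\n"

# "Splitter": scan once for newlines; only the offsets are shared.
offsets, start = deque(), 0
for i, byte in enumerate(data):
    if byte == 0x0A:  # '\n'
        offsets.append((start, i + 1))
        start = i + 1

results, results_lock = [], threading.Lock()

def consumer():
    while True:
        try:
            begin, end = offsets.popleft()  # the only shared-queue step
        except IndexError:
            return  # queue drained
        line = data[begin:end]  # actual line processing is private
        with results_lock:
            results.append(line)

workers = [threading.Thread(target=consumer) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(sorted(results))  # every line consumed exactly once, in some order
```

The contrast with the alternative design is that here the critical section is a constant-time pop, whereas having each worker split lines itself would put the O(chunk-length) newline scan inside the exclusive section.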
Re: documenting the backup manifest file format
On Mon, Apr 13, 2020 at 4:10 PM Andrew Dunstan wrote: > Seems ok. A tiny example, or an excerpt, might be nice. An empty database produces a manifest about 1200 lines long, so a full example seems like too much to include in the documentation. An excerpt could be included, I suppose. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: documenting the backup manifest file format
On Mon, Apr 13, 2020 at 3:34 PM Alvaro Herrera wrote: > Are these hex figures upper or lower case? No leading zeroes? This > would normally not matter, but the toplevel checksum will care. Not really. You just feed the whole file except for the last line through shasum and you get the answer. It so happens that the server generates lower-case, but pg_verifybackup will accept either. Leading zeroes are not omitted. If the checksum's not the right length, it ain't gonna work. If SHA is used, it's the same output you would get from running shasum -a on the file, which is certainly a fixed length. I assumed that this followed from the statement that there are two characters per byte in the checksum, and from the fact that no checksum algorithm I know about drops leading zeroes in the output. > Also, I > see no mention of prettification-chars such as newlines or indentation. > I suppose if I pass a manifest file through prettification (or Windows > newline conversion), the checksum may break. It would indeed break. I'm not sure what you want me to say here, though. If you're trying to parse a manifest, you shouldn't care about how the whitespace is arranged. If you're trying to generate one, you can arrange it any way you like, as long as you also include it in the checksum. > As for Last-Modification, I think the spec should indicate the exact > format that's used, because it'll also be critical for checksumming. Again, I don't think it really matters for checksumming, but it's "YYYY-MM-DD HH:MM:SS TZ" format, where TZ is always GMT. > Why is the top-level checksum only allowed to be SHA-256, if the files > can use up to SHA-512? If we allowed the top-level checksum to be changed to something else, then we'd probably want to indicate which kind of checksum is being used at the beginning of the file, so as to enable incremental parsing with checksum verification at the end.
pg_verifybackup doesn't currently do incremental parsing, but I'd like to add that sometime, if I get time to hash out the details. I think the use case for varying the checksum type of the manifest itself is much less than for varying it for the files. The big problem with checksumming the files is that it can be slow, because the files can be big. However, unless you have a truckload of empty files in the database, the manifest is going to be very small compared to the sizes of all the files, so it seemed harmless to use a stronger checksum algorithm for the manifest itself. Maybe someone with a ton of empty or nearly-empty relations will complain, but they can always use --no-manifest if they want. I agree that it's a little bit weird that you can have a stronger checksum for the files instead of the manifest itself, but I also wonder what the use case would be for using a stronger checksum on the manifest. David Steele argued that strong checksums on the files could be useful to software that wants to rifle through all the backups you've ever taken and find another copy of that file by looking for something with a matching checksum. CRC-32C wouldn't be strong enough for that, because eventually you could have enough files that you start to have collisions. The SHA algorithms output enough bits to make that quite unlikely. But this argument only makes sense for the files, not the manifest. Naturally, all this is arguable, though, and a good deal of arguing about it has been done, as you have probably noticed. I am still of the opinion that if somebody's goal is to use this facility for its intended purpose, which is to find out whether your backup got corrupted, any of these algorithms are fine, and are highly likely to tell you that you have a problem if, in fact, you do. In fact, I bet that even a checksum algorithm considerably stupider than anything I'd actually consider using would accomplish that goal in a high percentage of cases. 
But not everybody agrees with me, to the point where I am starting to wonder if I really understand how computers work. > (Also, did we intentionally omit the dash in > hash names, so "SHA-256" to make it SHA256? This will also be critical > for checksumming the manifest itself.) I debated this with myself, settled on this spelling, and nobody complained until now. It could be changed, though. I didn't have any particular reason for choosing it except the feeling that people would probably prefer to type --manifest-checksum=sha256 rather than --manifest-checksum=sha-256. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
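The scheme Robert describes ("feed the whole file except for the last line through shasum") can be sketched in a few lines of Python. The manifest content below is a made-up stand-in, not the server's actual output:

```python
import hashlib

# Toy manifest: the final line carries a checksum of everything before it.
covered = (
    b'{"PostgreSQL-Backup-Manifest-Version": 1,\n'
    b'"Files": [],\n'
)
digest = hashlib.sha256(covered).hexdigest()  # lower-case hex, 64 chars
manifest = covered + b'"Manifest-Checksum": "' + digest.encode() + b'"}\n'

# Verifier side: strip the last line, hash the rest, compare.
head, _, _last = manifest.rstrip(b"\n").rpartition(b"\n")
recomputed = hashlib.sha256(head + b"\n").hexdigest()
print(recomputed == digest)  # True
```

This also illustrates why any reformatting (pretty-printing, newline conversion) breaks verification: the checksum covers the literal bytes, not the parsed JSON value.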
Re: documenting the backup manifest file format
On 4/13/20 1:40 PM, Robert Haas wrote: > On Fri, Mar 27, 2020 at 4:32 PM Andres Freund wrote: >> I don't like having a file format that's intended to be used by external >> tools too that's undocumented except for code that assembles it in a >> piecemeal fashion. Do you mean in a follow-on patch this release, or >> later? I don't have a problem with the former. > Here is a patch for that. > Seems ok. A tiny example, or an excerpt, might be nice. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: relcache leak warnings vs. errors
Hi, On 2020-04-11 10:54:49 -0400, Tom Lane wrote: > Peter Eisentraut writes: > > How about a compile-time option to turn all the warnings in resowner.c > > into errors? This could be enabled automatically by --enable-cassert, > > similar to other defines that that option enables. > > [ itch... ] Those calls occur post-commit; throwing an error there > is really a mess, which is why it's only WARNING now. > I guess you could make them PANICs, but it would be an option that nobody > could possibly want to have enabled in anything resembling production. > So I'm kind of -0.5 on making --enable-cassert do it automatically. > Although I suppose that it's not really worse than other assertion > failures. I'd much rather see this throw an assertion than the current behaviour. But I'm wondering if there's a chance we can throw an error in non-assert builds without adding too much complexity to the error paths. Could we perhaps throw the error a bit later during the commit processing? Greetings, Andres Freund
Re: documenting the backup manifest file format
On Mon, Apr 13, 2020 at 2:28 PM Erik Rijkers wrote: > Can you double check this sentence? Seems strange to me but I don't > know why; it may well be that my english is not good enough. Maybe a > comma after 'required' makes reading easier? > > The timeline from which this range of WAL records will be required in > order to make use of this backup. The value is an integer. It sounds a little awkward to me, but not outright wrong. I'm not exactly sure how to rephrase it, though. Maybe just shorten it to "the timeline for this range of WAL records"? > One typo: > > 'when making using' should be > 'when making use' Right, thanks, fixed in my local copy. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Using Oracle SQL Client commands with PSQL 12.2 DB
PGSQLCommunities, We migrated Oracle 11.x Database to PostgreSQL 12.x Database on a RH Linux 7.x server. On a different RH Linux 7.x Server, I have Oracle Client installed. Since we have many scripts developed in Oracle SQL, is it possible for the PostgreSQL 12.x DB to process the Oracle Scripts? Are there utilities or drivers that could be installed on the PostgreSQL 12.x Database or server for processing the Oracle SQL client commands? We are trying to avoid updating our Oracle Client scripts on remote servers. Thanks Fred
Re: documenting the backup manifest file format
+ The LSN at which replay must begin on the indicated timeline in order to
+ make use of this backup. The LSN is stored in the format normally used
+ by PostgreSQL; that is, it is a string
+ consisting of two strings of hexademical characters, each with a length
+ of between 1 and 8, separated by a slash.

typo "hexademical"

Are these hex figures upper or lower case? No leading zeroes? This would normally not matter, but the toplevel checksum will care.

Also, I see no mention of prettification-chars such as newlines or indentation. I suppose if I pass a manifest file through prettification (or Windows newline conversion), the checksum may break.

As for Last-Modification, I think the spec should indicate the exact format that's used, because it'll also be critical for checksumming.

Why is the top-level checksum only allowed to be SHA-256, if the files can use up to SHA-512? (Also, did we intentionally omit the dash in hash names, so "SHA-256" to make it SHA256? This will also be critical for checksumming the manifest itself.)

-- 
Álvaro Herrera            https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Poll: are people okay with function/operator table redesign?
Alvaro Herrera writes: > One improvement (that I don't know is possible in docbook) would be to > have the inter-logical-row line be slightly thicker than the > intra-logical-row one. That'd make each entry visually more obvious. Yeah, I don't see any way to do that :-(. We could suppress the row lines entirely between the members of the logical group, but that'd almost surely look worse. (I tried to implement this to see, and couldn't get rowsep="0" in a to render the way I expected, so there may be toolchain bugs in the way of it anyway.) We could leave an entirely empty row between logical groups, but that would be really wasteful of vertical space. Another possibility, which'd only help in HTML, would be to render some of the cells with a slightly different background color. That's beyond my docbook/css skills, but it might be possible. regards, tom lane
Re: documenting the backup manifest file format
On 2020-04-13 20:08, Robert Haas wrote: [v2-0001-Document-the-backup-manifest-file-format.patch] Can you double check this sentence? Seems strange to me but I don't know why; it may well be that my english is not good enough. Maybe a comma after 'required' makes reading easier? The timeline from which this range of WAL records will be required in order to make use of this backup. The value is an integer. One typo: 'when making using' should be 'when making use' Erik Rijkers
Re: Poll: are people okay with function/operator table redesign?
Isaac Morland writes: > - showing the signature like this is interesting. For a moment I was > wondering why it doesn’t say, for example, "interval → interval → interval” > then I remembered this is Postgres, not Haskell. On the one hand, I like > putting the signature like this; on the other, I don’t like that the return > type is in a different place in each one. Could it be split into the same > two columns as the example(s); first column inputs, second column results? We tried that in an earlier iteration (see the referenced thread). It doesn't work very well because you end up having to allocate the max amount of space for any result type or example result on every line. Giving up the separate cell for return type is a lot of what makes this workable. > - another possibility for the parameters: list each one on a separate line, > together with default (if applicable). Maybe that would be excessively > tall, but it would sure make completely clear just exactly how many > parameters there are and never wrap (well, maybe on a phone, but we can > only do so much). Since so few built-in functions have default parameters, that's going to waste an awful lot of space in most cases. I actually ended up removing the explicit "default" clauses from make_interval (which is the only function with defaults that I dealt with so far) and instead explained that they all default to zero in the text description, because that took way less space. > - for the various current-time-related functions (age, current_time, etc.), > rather than saying “variable”, could it be the actual result with “now” > being taken to be a specific fixed time within the year in which the > documentation was generated? This would be really helpful for example with > being clear that current_time is only the time of day with no date. Yeah, I've been waffling about that. On the one hand, we regularly get docs complaints from people who say "I tried this example and I didn't get the claimed result". 
On the other hand you could figure that everybody should understand that current_timestamp won't work like that ... but the first such example in the table is age() for which that automatic understanding might not apply. The examples down in 9.9.4 use a specific time, which is looking pretty long in the tooth right now, and no one has complained --- but that's in a context where it's absolutely plain that every mentioned function is going to have a time-varying result. On the whole I'm kind of leaning to going back to using a specific time. But that's a detail that's not very relevant to the bigger picture here. (No, I'm not going to try to make it update every year; too much work for too little reward.) > - I think there should be much more distinctive lines between the different > functions. As it is the fact that the table is groups of 3 lines doesn’t > jump out at the eye. I don't know any easy way to do that. We do already have the grouping visible in the first column... regards, tom lane
Re: Parallel copy
On Fri, Apr 10, 2020 at 2:26 PM Andres Freund wrote: > > Still, it might be the case that having the process that is reading > > the data also find the line endings is so fast that it makes no sense > > to split those two tasks. After all, whoever just read the data must > > have it in cache, and that helps a lot. > > Yea. And if it's not fast enough to split lines, then we have a problem > regardless of which process does the splitting. Still, if the reader does the splitting, then you don't need as much IPC, right? The shared memory data structure is just a ring of bytes, and whoever reads from it is responsible for the rest. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Apr 13, 2020 at 6:34 PM Dilip Kumar wrote:
> Skipping 0003 for now.

Review comments from 0004-Gracefully-handle-*.patch

@@ -5490,6 +5523,14 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;

+	/*
+	 * We don't expect direct calls to heap_hot_search with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation
+		elog(ERROR, "unexpected heap_hot_search call during logical decoding");

The call is to heap_finish_speculative.

@@ -481,6 +482,19 @@ systable_getnext(SysScanDesc sysscan)
 	}
 }

+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));

s/transaction aborted/transaction aborted concurrently perhaps?

Also, can we move this check to the beginning of the function? If the condition fails, we can skip the sys scan.

Some of the checks look repetitive in the same file. Should we declare them as inline functions?

Review comments from 0005-Implement-streaming*.patch

+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
...
+#endif
+}

We can implement the same as follows:

#ifdef USE_ASSERT_CHECKING
static void
AssertChangeLsnOrder(ReorderBufferTXN *txn)
{
	dlist_iter	iter;
...
}
#else
#define AssertChangeLsnOrder(txn) ((void)true)
#endif

+ * if it is aborted we will report an specific error which we can ignore. We

s/an specific/a specific

+ * Set the last last of the stream as the final lsn before calling
+ * stream stop.

s/last last/last

PG_CATCH();
{
+	MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+	ErrorData  *errdata = CopyErrorData();

When we don't re-throw, the errdata should be freed by calling FreeErrorData(errdata), right?

+			/*
+			 * Set the last last of the stream as the final lsn before
+			 * calling stream stop.
+			 */
+			txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+
+			FlushErrorState();
+		}

stream_stop() can still throw some error, right? In that case, we should flush the error state before calling stream_stop().

+	/*
+	 * Remember the command ID and snapshot if transaction is streaming
+	 * otherwise free the snapshot if we have copied it.
+	 */
+	if (streaming)
+	{
+		txn->command_id = command_id;
+
+		/* Avoid copying if it's already copied. */
+		if (snapshot_now->copied)
+			txn->snapshot_now = snapshot_now;
+		else
+			txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+													  txn, command_id);
+	}
+	else if (snapshot_now->copied)
+		ReorderBufferFreeSnap(rb, snapshot_now);

Hmm, it seems this part needs an assumption that after copying the snapshot, no subsequent step can throw any error. If they do, then we can again create a copy of the snapshot in the catch block, which will leak some memory. Is my understanding correct?

+	}
+	else
+	{
+		ReorderBufferCleanupTXN(rb, txn);
+		PG_RE_THROW();
+	}

Shouldn't we switch back to the previously created error memory context before re-throwing?

+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id = FirstCommandId;

In the modified ReorderBufferCommit(), why is it necessary to declare the above two variables as volatile? There is no try-catch block here.

@@ -1946,6 +2284,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;

+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+

s/When/If

+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);

s/When/If. And, in this case, if we've not sent any data, why should we send the abort message (similar to the previous one)?

+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)

Should we put any assert (not necessarily here) to validate the above comment?

+	txn = ReorderBufferLargestTopTXN(rb);
+
+	/* we know there has to be one, because the size is not zero */
+	Assert(txn && !txn->toptxn);
+	Assert(txn->size > 0);
+	Assert(rb->size >= txn->size);

The same three assertions are already there in ReorderBufferLargestTopTXN().

+static bool
+ReorderBufferCanStr
Re: Properly mark NULL returns in numeric aggregates
Jesse Zhang writes: > On Fri, Apr 10, 2020 at 3:59 PM Tom Lane wrote: >> They can't be strict because the initial iteration needs to produce >> something from a null state and non-null input. nodeAgg's default >> behavior won't work for those because nodeAgg doesn't know how to >> copy a value of type "internal". > Ah, I think I get it. A copy must happen because the input is likely in > a shorter-lived memory context than the state, but nodeAgg's default > behavior of copying a by-value datum won't really copy the object > pointed to by the pointer wrapped in the datum of "internal" type, so we > defer to the combine function. Am I right? Then it follows kinda > naturally that those combine functions have been sloppy on arrival since > commit 11c8669c0cc . Yeah, they're relying exactly on the assumption that nodeAgg is not going to try to copy a value declared "internal", and therefore they can be loosey-goosey about whether the value pointer is null or not. However, if you want to claim that that's wrong, you have to explain why it's okay for some other code to be accessing a value that's declared "internal". I'd say that the meaning of that is precisely "keepa u hands off". In the case at hand, the current situation is that we only expect the values returned by these combine functions to be read by the associated final functions, which are on board with the null-pointer representation of an empty result. Your argument is essentially that it should be possible to feed the values to the aggregate's associated serialization function as well. But the core code never does that, so I'm not convinced that we should add it to the requirements; we'd be unable to test it. regards, tom lane
Re: documenting the backup manifest file format
On Mon, Apr 13, 2020 at 1:55 PM Justin Pryzby wrote: > typos: > manifes > hexademical (twice) Thanks. v2 attached. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company v2-0001-Document-the-backup-manifest-file-format.patch Description: Binary data
Re: Poll: are people okay with function/operator table redesign?
On 2020-Apr-13, Tom Lane wrote: > As discussed in the thread at [1], I've been working on redesigning > the tables we use to present SQL functions and operators. The > first installment of that is now up; see tables 9.30 and 9.31 at > > https://www.postgresql.org/docs/devel/functions-datetime.html > > and table 9.33 at > > https://www.postgresql.org/docs/devel/functions-enum.html > > Before I spend more time on this, I want to make sure that people > are happy with this line of attack. Comparing these tables to > the way they look in v12, they clearly take more vertical space; > but at least to my eye they're less cluttered and more readable. > They definitely scale a lot better for cases where a long function > description is needed, or where we'd like to have more than one > example. I am torn. On the one side, I think this new format is so much better than the old one that we should definitely use it for all tables. On the other side, I also think this format is slightly more complicated to read, so perhaps it would be sensible to keep using the old format for the simplest tables. One argument for the first of those positions is that if this new table layout is everywhere, it'll take less total time to get used to it. One improvement (that I don't know is possible in docbook) would be to have the inter-logical-row line be slightly thicker than the intra-logical-row one. That'd make each entry visually more obvious. I think you already mentioned the PDF issue that these multi-row entries are sometimes split across pages. I cannot believe docbook is so stupid not to have a solution to that problem, but I don't know what that solution would be. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Poll: are people okay with function/operator table redesign?
On Mon, 13 Apr 2020 at 13:13, Tom Lane wrote: > As discussed in the thread at [1], I've been working on redesigning > the tables we use to present SQL functions and operators. The > first installment of that is now up; see tables 9.30 and 9.31 at > > https://www.postgresql.org/docs/devel/functions-datetime.html > > and table 9.33 at > > https://www.postgresql.org/docs/devel/functions-enum.html > > Before I spend more time on this, I want to make sure that people > are happy with this line of attack. Comparing these tables to > the way they look in v12, they clearly take more vertical space; > but at least to my eye they're less cluttered and more readable. > They definitely scale a lot better for cases where a long function > description is needed, or where we'd like to have more than one > example. Does anyone prefer the old way, or have a better idea? > I honestly don’t know. My initial reaction is a combination of “that’s weird” and “that’s cool”. So a few comments, which shouldn’t be taken as indicating a definite preference: - showing the signature like this is interesting. For a moment I was wondering why it doesn’t say, for example, "interval → interval → interval” then I remembered this is Postgres, not Haskell. On the one hand, I like putting the signature like this; on the other, I don’t like that the return type is in a different place in each one. Could it be split into the same two columns as the example(s); first column inputs, second column results? - another possibility for the parameters: list each one on a separate line, together with default (if applicable). Maybe that would be excessively tall, but it would sure make completely clear just exactly how many parameters there are and never wrap (well, maybe on a phone, but we can only do so much). 
- for the various current-time-related functions (age, current_time, etc.), rather than saying “variable”, could it be the actual result with “now” being taken to be a specific fixed time within the year in which the documentation was generated? This would be really helpful for example with being clear that current_time is only the time of day with no date. - the specific fixed time should be something like (current year)-06-30 18:45:54. I’ve deliberately chosen all values to be outside of the range of values with smaller ranges. For example, the hour is >12, the limit of the month field. - I think there should be much more distinctive lines between the different functions. As it is the fact that the table is groups of 3 lines doesn’t jump out at the eye.
Re: documenting the backup manifest file format
On Mon, Apr 13, 2020 at 01:40:56PM -0400, Robert Haas wrote: > On Fri, Mar 27, 2020 at 4:32 PM Andres Freund wrote: > > I don't like having a file format that's intended to be used by external > > tools too that's undocumented except for code that assembles it in a > > piecemeal fashion. Do you mean in a follow-on patch this release, or > > later? I don't have a problem with the former. > > Here is a patch for that. typos: manifes hexademical (twice) -- Justin
Re: Poll: are people okay with function/operator table redesign?
On 2020-04-13 19:13, Tom Lane wrote: As discussed in the thread at [1], I've been working on redesigning the tables we use to present SQL functions and operators. The first installment of that is now up; see tables 9.30 and 9.31 at https://www.postgresql.org/docs/devel/functions-datetime.html and table 9.33 at https://www.postgresql.org/docs/devel/functions-enum.html Before I spend more time on this, I want to make sure that people are happy with this line of attack. Comparing these tables to the way they look in v12, they clearly take more vertical space; but at least to my eye they're less cluttered and more readable. They definitely scale a lot better for cases where a long function description is needed, or where we'd like to have more than one example. Does anyone prefer the old way, or have a better idea? +1 In the pdf it is a big improvement; and the html is better too.
documenting the backup manifest file format
On Fri, Mar 27, 2020 at 4:32 PM Andres Freund wrote: > I don't like having a file format that's intended to be used by external > tools too that's undocumented except for code that assembles it in a > piecemeal fashion. Do you mean in a follow-on patch this release, or > later? I don't have a problem with the former. Here is a patch for that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company v1-0001-Document-the-backup-manifest-file-format.patch Description: Binary data
Re: Poll: are people okay with function/operator table redesign?
On Mon, Apr 13, 2020 at 1:13 PM Tom Lane wrote: > As discussed in the thread at [1], I've been working on redesigning > the tables we use to present SQL functions and operators. The > first installment of that is now up; see tables 9.30 and 9.31 at > > https://www.postgresql.org/docs/devel/functions-datetime.html > > and table 9.33 at > > https://www.postgresql.org/docs/devel/functions-enum.html > > Before I spend more time on this, I want to make sure that people > are happy with this line of attack. Comparing these tables to > the way they look in v12, they clearly take more vertical space; > but at least to my eye they're less cluttered and more readable. > They definitely scale a lot better for cases where a long function > description is needed, or where we'd like to have more than one > example. Does anyone prefer the old way, or have a better idea? I find the new way quite hard to read. I prefer the old way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Properly mark NULL returns in numeric aggregates
On Fri, Apr 10, 2020 at 3:59 PM Tom Lane wrote: > > They can't be strict because the initial iteration needs to produce > something from a null state and non-null input. nodeAgg's default > behavior won't work for those because nodeAgg doesn't know how to > copy a value of type "internal". > > regards, tom lane Ah, I think I get it. A copy must happen because the input is likely in a shorter-lived memory context than the state, but nodeAgg's default behavior of copying a by-value datum won't really copy the object pointed to by the pointer wrapped in the datum of "internal" type, so we defer to the combine function. Am I right? Then it follows kinda naturally that those combine functions have been sloppy on arrival since commit 11c8669c0cc . Cheers, Jesse
Poll: are people okay with function/operator table redesign?
As discussed in the thread at [1], I've been working on redesigning the tables we use to present SQL functions and operators. The first installment of that is now up; see tables 9.30 and 9.31 at https://www.postgresql.org/docs/devel/functions-datetime.html and table 9.33 at https://www.postgresql.org/docs/devel/functions-enum.html Before I spend more time on this, I want to make sure that people are happy with this line of attack. Comparing these tables to the way they look in v12, they clearly take more vertical space; but at least to my eye they're less cluttered and more readable. They definitely scale a lot better for cases where a long function description is needed, or where we'd like to have more than one example. Does anyone prefer the old way, or have a better idea? I know that the table headings are a bit weirdly laid out; hopefully that can be resolved [2]. regards, tom lane [1] https://www.postgresql.org/message-id/flat/9326.1581457869%40sss.pgh.pa.us [2] https://www.postgresql.org/message-id/6169.1586794603%40sss.pgh.pa.us
Re: cleaning perl code
On 4/12/20 3:42 AM, Noah Misch wrote:
> On Sat, Apr 11, 2020 at 12:13:08PM -0400, Andrew Dunstan wrote:
>> --- a/src/tools/msvc/Project.pm
>> +++ b/src/tools/msvc/Project.pm
>> @@ -420,13 +420,10 @@ sub read_file
>>  {
>>  	my $filename = shift;
>>  	my $F;
>> -	my $t = $/;
>> -
>> -	undef $/;
>> +	local $/ = undef;
>>  	open($F, '<', $filename) || croak "Could not open file $filename\n";
>>  	my $txt = <$F>;
>>  	close($F);
>> -	$/ = $t;
>
> +1 for this and for the other three hunks like it. The resulting code is
> shorter and more robust, so this is a good one-time cleanup. It's not
> important to mandate this style going forward, so I wouldn't change
> perlcriticrc for this one.
>
>> --- a/src/tools/version_stamp.pl
>> +++ b/src/tools/version_stamp.pl
>> @@ -1,4 +1,4 @@
>> -#! /usr/bin/perl -w
>> +#! /usr/bin/perl
>>
>>  #
>>  # version_stamp.pl -- update version stamps throughout the source tree
>> @@ -21,6 +21,7 @@
>>  #
>>
>>  use strict;
>> +use warnings;
>
> This and the other "use warnings" additions look good. I'm assuming you'd
> change perlcriticrc like this:
>
> +[TestingAndDebugging::RequireUseWarnings]
> +severity = 5

OK, I've committed all that stuff. I think that takes care of the non-controversial part of what I proposed :-)

cheers

andrew

-- 
Andrew Dunstan            https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: wrong relkind error messages
On 4/13/20 11:13 AM, Tom Lane wrote:
> ERROR: cannot define statistics for relation "ti"
> DETAIL: This operation is not supported for indexes.
>
> which still leaves implicit that "ti" is an index, but probably that's
> something the user can figure out.

+1 for this. It's clear and succinct.

cheers

andrew

-- 
Andrew Dunstan            https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: backup manifests
On Sun, Apr 12, 2020 at 10:09 PM Fujii Masao wrote: > I found other minor issues. I think these are all correct fixes. Thanks for the post-commit review, and sorry for this mistakes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: wrong relkind error messages
Peter Eisentraut writes:
> I'm not a fan of error messages like
>     relation "%s" is not a table, foreign table, or materialized view

Agreed, they're not great.

> For example:
> -ERROR: relation "ti" is not a table, foreign table, or materialized view
> +ERROR: cannot define statistics for relation "ti"
> +DETAIL: "ti" is an index.

I see where you're going, and it seems like a generally-better idea, but I feel like this phrasing is omitting some critical background information that users don't necessarily have. At the very least it's not stating clearly that the failure is *because* ti is an index. More generally, the whole concept that statistics can only be defined for certain kinds of relations has disappeared from view. I fear that users who're less deeply into Postgres hacking than we are might not have that concept at all, or at least it might not come to mind immediately when they get this message.

Fixing this while avoiding your concern about proliferation of messages seems a bit difficult though. The best I can do after a couple minutes' thought is

    ERROR: cannot define statistics for relation "ti"
    DETAIL: "ti" is an index, and this operation is not supported for that kind of relation.

which seems a little long and awkward. Another idea is

    ERROR: cannot define statistics for relation "ti"
    DETAIL: This operation is not supported for indexes.

which still leaves implicit that "ti" is an index, but probably that's something the user can figure out.

Maybe someone else can do better?

			regards, tom lane
Re: pg_basebackup, manifests and backends older than ~12
On Sun, Apr 12, 2020 at 8:56 PM Michael Paquier wrote:
> On Sun, Apr 12, 2020 at 08:08:17AM +0900, Michael Paquier wrote:
> > Exactly. My point is exactly that. The current code would force
> > users maintaining scripts with pg_basebackup to use --no-manifest if
> > such a script runs with older versions of Postgres, but we should
> > encourage users not do to that because we want them to use manifests
> > with backend versions where they are supported.
>
> Please note that I have added an open item for this thread, and
> attached is a proposal of patch. While reading the code, I have
> noticed that the minimum version handling is not consistent with the
> other MINIMUM_VERSION_*, so I have added one for manifests.

I think that this patch is incorrect. I have no objection to introducing MINIMUM_VERSION_FOR_MANIFESTS, but this is not OK:

-	else
-	{
-		if (serverMajor < 1300)
-			manifest_clause = "";
-		else
-			manifest_clause = "MANIFEST 'no'";
-	}

It seems to me that this will break --no-manifest option on v13.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: wrong relkind error messages
On Mon, Apr 13, 2020 at 9:55 AM Peter Eisentraut wrote:
> Attached is another attempt to improve this.

Nice effort. Most of these seem like clear improvements, but some I don't like:

+				 errmsg("relation \"%s\" is of unsupported kind",
+						RelationGetRelationName(rel)),
+				 errdetail_relkind(RelationGetRelationName(rel), rel->rd_rel->relkind)));

It would help to work "pgstattuple" into the message somehow. "cannot use pgstattuple on relation \"%s\"", perhaps?

+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("action cannot be performed on relation \"%s\"",
+						RelationGetRelationName(rel)),

Super-vague.

+				 errmsg("cannot set relation options of relation \"%s\"",
+						RelationGetRelationName(rel)),

I suggest "cannot set options for relation \"%s\""; that is, use "for" instead of "of", and don't say "relation" twice.

+				 errmsg("cannot create trigger on relation \"%s\"",
+						RelationGetRelationName(rel)),

+				 errmsg("relation \"%s\" cannot have triggers",
+						RelationGetRelationName(rel)),

+				 errmsg("relation \"%s\" cannot have triggers",
+						rv->relname),

Maybe use the second wording for all three? And similarly for rules?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: pg_validatebackup -> pg_verifybackup?
On Sun, Apr 12, 2020 at 10:20 PM Shinoda, Noriyoshi (PN Japan A&PS Delivery) wrote: > Sorry this email is not a discussion about word selection. > Since part of the manual had left pg_validatebackup in commit > dbc60c5593f26dc777a3be032bff4fb4eab1ddd1. > I've attached a patch to fix this. Committed, thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: spin_delay() for ARM
On Sat, 11 Apr 2020 at 04:18, Tom Lane wrote: > > I wrote: > > A more useful test would be to directly experiment with contended > > spinlocks. As I recall, we had some test cases laying about when > > we were fooling with the spin delay stuff on Intel --- maybe > > resurrecting one of those would be useful? > > The last really significant performance testing we did in this area > seems to have been in this thread: > > https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com > > A relevant point from that is Haas' comment > > I think optimizing spinlocks for machines with only a few CPUs is > probably pointless. Based on what I've seen so far, spinlock > contention even at 16 CPUs is negligible pretty much no matter what > you do. Whether your implementation is fast or slow isn't going to > matter, because even an inefficient implementation will account for > only a negligible percentage of the total CPU time - much less than 1% > - as opposed to a 64-core machine, where it's not that hard to find > cases where spin-waits consume the *majority* of available CPU time > (recall previous discussion of lseek). Yeah, will check if I find some machines with large cores. > So I wonder whether this patch is getting ahead of the game. It does > seem that ARM systems with a couple dozen cores exist, but are they > common enough to optimize for yet? Can we even find *one* to test on > and verify that this is a win and not a loss? (Also, seeing that > there are so many different ARM vendors, results from just one > chipset might not be too trustworthy ...) Ok. Yes, it would be worth waiting to see if there are others in the community with ARM systems that have implemented YIELD. May be after that we might gain some confidence. I myself also hope that I will get one soon to test, but right now I have one that does not support it, so it will be just a no-op. 
-- Thanks, -Amit Khandekar Huawei Technologies
Re: spin_delay() for ARM
On Sat, 11 Apr 2020 at 00:47, Andres Freund wrote: > > Hi, > > On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote: > > On my Intel Xeon machine with 8 cores, I tried to test PAUSE also > > using a sample C program (attached spin.c). Here, many child processes > > (much more than CPUs) wait in a tight loop for a shared variable to > > become 0, while the parent process continuously increments a sequence > > number for a fixed amount of time, after which, it sets the shared > > variable to 0. The child's tight loop calls PAUSE in each iteration. > > What I hoped was that because of PAUSE in children, the parent process > > would get more share of the CPU, due to which, in a given time, the > > sequence number will reach a higher value. Also, I expected the CPU > > cycles spent by child processes to drop down, thanks to PAUSE. None of > > these happened. There was no change. > > > Possibly, this testcase is not right. Probably the process preemption > > occurs only within the set of hyperthreads attached to a single core. > > And in my testcase, the parent process is the only one who is ready to > > run. Still, I have anyway attached the program (spin.c) for archival; > > in case somebody with a YIELD-supporting ARM machine wants to use it > > to test YIELD. > > PAUSE doesn't operate on the level of the CPU scheduler. So the OS won't > just schedule another process - you won't see different CPU usage if you > measure it purely as the time running. Yeah, I thought that the OS scheduling would be an *indirect* consequence of the pause because of its slowing down the CPU, but looks like that does not happen. > You should be able to see a > difference if you measure with a profiler that shows you data from the > CPUs performance monitoring unit. Hmm, I had tried with perf and could see the pause itself consuming 5% cpu. But I haven't yet played with per-process figures. -- Thanks, -Amit Khandekar Huawei Technologies
Re: execExprInterp() questions / How to improve scalar array op expr eval?
I've read through all of the previous discussions related to stable subexpression caching, and I'm planning to send a summary email with all of those links in one place. But I also happened to stumble upon mention in the TODO of some email discussion way back in 2007 where Tom suggested [1] we should really try planning scalar array ops (particularly those with large IN lists) as `IN (VALUES ...)`. That actually would solve the specific case I'd had this problem with (seq scan on a large constant array IN expression). Ideally any query with forms like: select * from t where a in (1, 2,...) select * from t where a in ((select i from x)) would always be isomorphic in planning. But thinking about this overnight and scanning through things quickly this morning, I have a feeling that'd be 1.) a pretty significant undertaking, and 2.) likely to explode the number of plans considered. Also I don't know if there's a good place to slot that into planning. Do either of you happen to have any pointers into places that do similar kinds of rewrites I could look at? And in those cases do we normally always rewrite or do we consider both styles independently? I suppose _only_ handling the case where a `IN (VALUES ...)` replaces a seq scan with a scalar array op might be somewhat easier...but feels like it leaves a lot of holes. I'm still at the point where I'm trying to determine if any of the above (subexpression caching, saop optimization only on constants, re-planning as `IN (VALUES ...)`) is something reasonable enough relative to the amount of effort to be worth working on. James [1]: https://www.postgresql.org/message-id/19001.1178823208%40sss.pgh.pa.us
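As a hedged illustration of the "saop optimization only on constants" idea mentioned above: since the array is constant, it could be preprocessed once at executor startup so that each per-row probe costs O(log n) instead of a linear scan. The sketch below uses a sorted array plus bsearch; a hash table would serve equally well. None of these names come from an actual patch.

```c
#include <stdlib.h>

/* Comparator for int keys, usable by both qsort and bsearch. */
static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    return (x > y) - (x < y);
}

/* One-time setup: sort the constant IN-list (executor-init time). */
void
saop_setup(int *consts, size_t n)
{
    qsort(consts, n, sizeof(int), cmp_int);
}

/* Per-row probe: binary search instead of scanning the whole array. */
int
saop_probe(const int *consts, size_t n, int key)
{
    return bsearch(&key, consts, n, sizeof(int), cmp_int) != NULL;
}
```

This only helps the evaluation cost, of course; it does nothing for the planning-isomorphism question about `IN (VALUES ...)`.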
Re: where should I stick that backup?
On Sun, Apr 12, 2020 at 9:18 PM Stephen Frost wrote: > There's two different questions we're talking about here and I feel like > they're being conflated. To try and clarify: > > - Could you implement FDWs with shell scripts, and custom programs? I'm > pretty confident that the answer is yes, but the thrust of that > argument is primarily to show that you *can* implement just about > anything using a shell script "API", so just saying it's possible to > do doesn't make it necessarily a good solution. The FDW system is > complicated, and also good, because we made it so and because it's > possible to do more sophisticated things with a C API, but it could > have started out with shell scripts that just returned data in much > the same way that COPY PROGRAM works today. What matters is that > forward thinking to consider what you're going to want to do tomorrow, > not just thinking about how you can solve for the simple cases today > with a shell out to an existing command. > > - Does providing a C-library interface deter people from implementing > solutions that use that interface? Perhaps it does, but it doesn't > have nearly the dampening effect that is being portrayed here, and we > can see that pretty clearly from the FDW situation. Sure, not all of > those are good solutions, but lots and lots of archive command shell > scripts are also pretty terrible, and there *are* a few good solutions > out there, including the ones that we ourselves ship. At least when > it comes to FDWs, there's an option there for us to ship a *good* > answer ourselves for certain (and, in particular, the very very > common) use-cases. > > > - We're only talking about writing a handful of tar files, and that's > > in the context of a full-database backup, which is a much > > heavier-weight operation than a query. 
> > This is true for -Ft, but not -Fp, and I don't think there's enough > thought being put into this when it comes to parallelism and that you > don't want to be limited to one process per tablespace. > > > - There is not really any state that needs to be maintained across calls. > > As mentioned elsewhere, this isn't really true. These are fair points, and my thinking has been somewhat refined by this discussion, so let me try to clarify my (current) position a bit. I believe that there are two subtly different questions here. Question #1 is "Would it be useful to people to be able to pipe the tar files that they get from pg_basebackup into some other command rather than writing them to the filesystem, and should we give them the option to do so?" Question #2 is "Is piping the tar files that pg_basebackup would produce into some other program the best possible way of providing more flexibility about where backups get written?" I'm prepared to concede that the answer to question #2 is no. I had earlier assumed that establishing connections was pretty fast and that, even if not, there were solutions to that problem, like setting up an SSH tunnel in advance. Several people have said, well, no, establishing connections is a problem. As I acknowledged from the beginning, plain format backups are a problem. So I think a convincing argument has been made that a shell command won't meet everyone's needs, and a more complex API is required for some cases. But I still think the answer to question #1 is yes. I disagree entirely with any argument to the effect that because some users might do unsafe things with the option, we ought not to provide it. Practically speaking, it would work fine for many people even with no other changes, and if we add something like pgfile, which I'm willing to do, it would work for more people in more situations. It is a useful thing to have, full stop. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
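For what it's worth, the core of a pgfile-style helper could be as small as "write to a temporary name, fsync, rename into place". The sketch below only illustrates that pattern; the function name and the trimmed error handling are assumptions, not a proposed API.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Durably write a blob to `path`: write it under a temporary name, fsync,
 * then rename into place so a crash never leaves a half-written file at
 * the final name.  Returns 0 on success, -1 on failure.
 */
int
write_file_durably(const char *path, const void *data, size_t len)
{
    char        tmp[1024];
    int         fd;

    snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    if (rename(tmp, path) != 0)     /* atomic replacement */
        return -1;
    /* A real implementation would also fsync the containing directory. */
    return 0;
}
```

The point being that even this minimal primitive is something shell pipelines routinely get wrong (no fsync, no atomic rename), which is part of the argument for shipping a vetted tool.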
Lexer issues
Hi, I am experimenting with postgres and am wondering if there is any tutorial on how to properly add a new command to postgres. I want to add a new constraint on "CREATE ROLE" that takes an integer; its identifier is not a known keyword (reserved or unreserved) in postgres. Say we call it TestPatrick. In other words, I want to be able to write "CREATE ROLE X TestPatrick=10". I am having trouble getting postgres to recognize my new syntax. I have seen this video: https://www.youtube.com/watch?v=uSEXTcEiXGQ and was able to get my postgres to compile with the added word (modified gram.y, kwlist.h, gram.cpp etc based on the video). However, when I use my syntax in a client session, it still isn't recognized... Are there any specific lexer changes I need to make? I followed the example of CONNECTION LIMIT and tried to mimic it for CREATE ROLE. Best, Patrick
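One common gotcha worth checking (offered as a guess, since the modified files aren't shown): the kwlist.h entries must stay in strict alphabetical order, because the keyword lookup is a binary search over that list; an out-of-order entry is silently never found, so the scanner keeps returning the word as a plain IDENT and the grammar rejects it. Besides the kwlist.h line, the token also needs a %token declaration, a use in some grammar production, and (usually) membership in the unreserved_keyword production in gram.y. Roughly, for this era of the tree:

```c
/* src/include/parser/kwlist.h -- must remain in alphabetical order */
PG_KEYWORD("testpatrick", TESTPATRICK, UNRESERVED_KEYWORD)

/* src/backend/parser/gram.y */
%token <keyword> TESTPATRICK

unreserved_keyword:
      /* ...existing keywords, alphabetical... */
    | TESTPATRICK
      /* ... */
```

Note also that keywords are stored lowercase, so "TestPatrick" in the SQL text is folded to "testpatrick" before lookup.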
Re: where should I stick that backup?
On Sun, Apr 12, 2020 at 09:18:28PM -0400, Stephen Frost wrote: > * Bruce Momjian (br...@momjian.us) wrote: > > On Fri, Apr 10, 2020 at 10:54:10AM -0400, Stephen Frost wrote: > > > * Robert Haas (robertmh...@gmail.com) wrote: > > > > On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian wrote: > > > > > Good point, but if there are multiple APIs, it makes shell script > > > > > flexibility even more useful. > > > > > > > > This is really the key point for me. There are so many existing tools > > > > that store a file someplace that we really can't ever hope to support > > > > them all in core, or even to have well-written extensions that support > > > > them all available on PGXN or wherever. We need to integrate with the > > > > tools that other people have created, not try to reinvent them all in > > > > PostgreSQL. > > > > > > So, this goes to what I was just mentioning to Bruce independently- you > > > could have made the same argument about FDWs, but it just doesn't > > > actually hold any water. Sure, some of the FDWs aren't great, but > > > there's certainly no shortage of them, and the ones that are > > > particularly important (like postgres_fdw) are well written and in core. > > > > No, no one made that argument. It isn't clear how a shell script API > > would map to relational database queries. The point is how well the > > APIs match, and then if they are close, does it give us the flexibility > > we need. You can't just look at flexibility without an API match. > > If what we're talking about is the file_fdw, which certainly isn't very > complicated, it's not hard to see how you could use shell scripts for > it. What happens is that it starts to get harder and require custom > code when you want to do something more complex- which is very nearly > what we're talking about here too. Sure, for a simple 'bzip2' filter, a > shell script might be alright, but it's not going to cut it for the more > complex use-cases that users, today, expect solutions to. 
Well, file_fdw is the simplest FDW, and we might have been able to do that in shell script, but almost all the other FDWs couldn't, so we might as well choose a C API for FDWs and use the same one for file_fdw. It seems like basic engineering that you choose the closest API that meets most of your deployment requirements, and meets all of the required ones. > To that end, if we contemplate adding support for some cloud vendor's > storage, as an example, and discover that the command line tools for it > suck or don't meet our expectations, I'd expect us to either refuse to > support it, or to forgo using the command-line tools and instead > implement support for talking to the cloud storage interface directly, > if it works well. Do we choose a more inflexible API on a hypothetical risk? -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
wrong relkind error messages
I'm not a fan of error messages like relation "%s" is not a table, foreign table, or materialized view It doesn't tell me what's wrong, it only tells me what else could have worked. It's also tedious to maintain and the number of combinations grows over time. This was discussed many years ago in [0], with the same arguments, and there appeared to have been general agreement to change this, but then the thread stalled somehow on some technical details. Attached is another attempt to improve this. I have rewritten the primary error messages using the principle of "cannot do this with that" and then added a detail message to show what relkind the object has. For example: -ERROR: relation "ti" is not a table, foreign table, or materialized view +ERROR: cannot define statistics for relation "ti" +DETAIL: "ti" is an index. and -ERROR: "test_foreign_table" is not a table, materialized view, or TOAST table +ERROR: relation "test_foreign_table" does not have a visibility map +DETAIL: "test_foreign_table" is a foreign table. You can see more instances of this in the test diffs in the attached patch. In passing, I also changed a few places to use the RELKIND_HAS_STORAGE() macro. This is related because it allows writing more helpful error messages, such as in pgstatindex.c. One question on a detail arose: check_relation_relkind() in pg_visibility.c accepts RELKIND_RELATION, RELKIND_MATVIEW, and RELKIND_TOASTVALUE, but pgstatapprox.c only accepts RELKIND_RELATION and RELKIND_MATVIEW, even though they both look for a visibility map. Is that an intentional omission? If so, it should be commented better. 
[0]: https://www.postgresql.org/message-id/flat/AANLkTimR_sZ_wKd1cgqVG1PEvTvdr9j7zD%2B3_NPvfaa_%40mail.gmail.com -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 2e32dce36cd34e6d58c99067969f98de1bb1264b Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Mon, 13 Apr 2020 15:29:46 +0200 Subject: [PATCH] Improve error messages about mismatching relkind Most error messages about a relkind that was not supported or appropriate for the command were of the pattern "relation \"%s\" is not a table, foreign table, or materialized view" This style can become verbose and tedious to maintain. Moreover, it's not very helpful: If I'm trying to create a comment on a TOAST table, which is not supported, then the information that I could have created a comment on a materialized view is pointless. Instead, make the primary error message shorter, saying more directly that what was attempted is not supported or possible. Then, in the detail message, show what relkind the object is. To simplify that, add a new function errdetail_relkind() that does this. In passing, make use of RELKIND_HAS_STORAGE() where appropriate, instead of listing out the relkinds individually.
--- .../pg_visibility/expected/pg_visibility.out | 75 ++-- contrib/pg_visibility/pg_visibility.c | 5 +- contrib/pgstattuple/expected/pgstattuple.out | 18 ++-- contrib/pgstattuple/pgstatapprox.c| 5 +- contrib/pgstattuple/pgstatindex.c | 11 +-- src/backend/catalog/Makefile | 1 + src/backend/catalog/heap.c| 7 +- src/backend/catalog/pg_class.c| 51 +++ src/backend/catalog/toasting.c| 6 +- src/backend/commands/comment.c| 5 +- src/backend/commands/indexcmds.c | 16 +--- src/backend/commands/lockcmds.c | 5 +- src/backend/commands/seclabel.c | 5 +- src/backend/commands/sequence.c | 5 +- src/backend/commands/statscmds.c | 5 +- src/backend/commands/tablecmds.c | 88 --- src/backend/commands/trigger.c| 15 ++-- src/backend/parser/parse_utilcmd.c| 5 +- src/backend/postmaster/pgstat.c | 6 +- src/backend/rewrite/rewriteDefine.c | 8 +- src/backend/utils/adt/dbsize.c| 73 ++- src/include/catalog/pg_class.h| 1 + src/test/regress/expected/alter_table.out | 9 +- .../regress/expected/create_table_like.out| 3 +- src/test/regress/expected/foreign_data.out| 6 +- src/test/regress/expected/indexing.out| 3 +- src/test/regress/expected/sequence.out| 3 +- src/test/regress/expected/stats_ext.out | 12 ++- 28 files changed, 235 insertions(+), 217 deletions(-) create mode 100644 src/backend/catalog/pg_class.c diff --git a/contrib/pg_visibility/expected/pg_visibility.out b/contrib/pg_visibility/expected/pg_visibility.out index 2abc1b5107..01ff1361d4 100644 --- a/contrib/pg_visibility/expected/pg_visibility.out +++ b/contrib/pg_visibility/expected/pg_visibility.out @@ -41,66 +41,91 @@ ROLLBACK; create table test_partitioned (a int) partition by list (a)
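To make the shape concrete, here is a standalone sketch of the relkind-to-description mapping that a helper like the patch's errdetail_relkind() centralizes. The backend version would call errdetail() and use the RELKIND_* macros rather than character literals; this is illustration only, not the patch's code.

```c
/*
 * Map a pg_class.relkind code to a human-readable description, as used in
 * DETAIL lines like: "ti" is an index.
 * The relkind letters are the ones pg_class actually stores.
 */
const char *
relkind_description(char relkind)
{
    switch (relkind)
    {
        case 'r': return "a table";
        case 'i': return "an index";
        case 'S': return "a sequence";
        case 't': return "a TOAST table";
        case 'v': return "a view";
        case 'm': return "a materialized view";
        case 'c': return "a composite type";
        case 'f': return "a foreign table";
        case 'p': return "a partitioned table";
        case 'I': return "a partitioned index";
        default:  return "of unrecognized kind";
    }
}
```

Centralizing the mapping is what makes the "cannot do this with that" + DETAIL style cheap to maintain: new relkinds get added in one place instead of in every enumerating message.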
Re: weird hash plan cost, starting with pg10
Richard Guo writes: > At first I was wondering if we need to check whether HashState.hashtable > is not NULL in ExecShutdownHash() before we decide to allocate save > space for HashState.hinstrument. And then I convinced myself that that's > not necessary since HashState.hinstrument and HashState.hashtable cannot > be both NULL there. Even if the hashtable is null at that point, creating an all-zeroes hinstrument struct is harmless. regards, tom lane
Re: WAL usage calculation patch
On Mon, Apr 13, 2020 at 1:47 PM, Amit Kapila wrote: > On Mon, Apr 13, 2020 at 1:10 PM Julien Rouhaud wrote: > > > > On Mon, Apr 13, 2020 at 8:11 AM Amit Kapila > wrote: > > > > > > On Sat, Apr 11, 2020 at 6:55 PM Julien Rouhaud > wrote: > > > > > > > > On Fri, Apr 10, 2020 at 9:37 PM Julien Rouhaud > wrote: > > > > > > > > > I tried to take into account all that have been discussed, but I have > > > > to admit that I'm absolutely not sure of what was actually decided > > > > here. I went with those changes: > > > > > > > > - rename wal_num_fpw to wal_fpw for consistency, both in pgss view > > > > field name but also everywhere in the code > > > > - change comments to consistently mention "full page writes > generated" > > > > - changed pgss and explain documentation to mention "full page images > > > > generated", from Justin's patch on another thread > > > > > > > > > > I think it is better to use "full page writes" to be consistent with > > > other places. > > > > > > > - kept "amount" of WAL bytes > > > > > > > > > > Okay, but I would like to make another change suggested by Justin > > > which is to replace "count" with "number" at a few places. > > > > Ah sorry I missed this one. +1 it also sounds better. > > > > > I have made the above two changes in the attached. Let me know what > > > you think about attached? > > > > It all looks good to me! > > > > Pushed. > Thanks a lot Amit! >
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh wrote: > > On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar wrote: > > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh > > wrote: > > > > > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > > This looks wrong. We should change the name of this Macro or we can > > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments. > > > > I think this is in sync with below code (SizeOfXlogOrigin), SO doen't > > make much sense to add different terminology no? > > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char)) > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > > In that case, we can rename this, for example, SizeOfXLogTransactionId. Make sense. > > Some review comments from 0002-Issue-individual-*.path, > > +void > +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid, > + XLogRecPtr lsn, int nmsgs, > + SharedInvalidationMessage *msgs) > +{ > + MemoryContext oldcontext; > + ReorderBufferChange *change; > + > + /* XXX Should we even write invalidations without valid XID? */ > + if (xid == InvalidTransactionId) > + return; > + > + Assert(xid != InvalidTransactionId); > > It seems we don't call the function if xid is not valid. In fact, > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf) > } > case XLOG_XACT_ASSIGNMENT: > break; > + case XLOG_XACT_INVALIDATIONS: > + { > + TransactionId xid; > + xl_xact_invalidations *invals; > + > + xid = XLogRecGetXid(r); > + invals = (xl_xact_invalidations *) XLogRecGetData(r); > + > + if (!TransactionIdIsValid(xid)) > + break; > + > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, > + invals->nmsgs, invals->msgs); > > Why should we insert an WAL record for such cases? I think we can avoid this. I will analyze and send update in my next patch. 
> > + * When wal_level=logical, write invalidations into WAL at each command end > to > + * support the decoding of the in-progress transaction. As of now it was > + * enough to log invalidation only at commit because we are only decoding > the > + * transaction at the commit time. We only need to log the catalog cache > and > + * relcache invalidation. There can not be any active MVCC scan in logical > + * decoding so we don't need to log the snapshot invalidation. > The alignment is not right. Will fix. > /* > * CommandEndInvalidationMessages > - * Process queued-up invalidation messages at end of one command > - * in a transaction. > + * Process queued-up invalidation messages at end of one command > + * in a transaction. > Looks unnecessary changes. Will fix. > > * Note: > - * This should be called during CommandCounterIncrement(), > - * after we have advanced the command ID. > + * This should be called during CommandCounterIncrement(), > + * after we have advanced the command ID. > */ > Looks unnecessary changes. Will fix. > if (transInvalInfo == NULL) > - return; > + return; > Looks unnecessary changes. > > + /* prepare record */ > + memset(&xlrec, 0, sizeof(xlrec)); > We should use MinSizeOfXactInvalidations, no? Right. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
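To make the byte counts in the SizeOfXLogTransactionId discussion concrete, here is a standalone mock-up of the two macros being compared. The typedef widths are copied from PostgreSQL's headers for illustration; the extra char corresponds to the one-byte block-ID marker that precedes each optional payload in the record header scratch space.

```c
#include <stdint.h>

/* Widths as in PostgreSQL: TransactionId is 32-bit, RepOriginId 16-bit. */
typedef uint32_t TransactionId;
typedef uint16_t RepOriginId;

/* One byte for the XLR_BLOCK_ID_* marker, then the payload itself. */
#define SizeOfXlogOrigin        (sizeof(RepOriginId) + sizeof(char))
#define SizeOfXLogTransactionId (sizeof(TransactionId) + sizeof(char))
```

So the rename keeps the naming parallel with SizeOfXlogOrigin while making clear the macro covers marker plus xid, not just sizeof(TransactionId).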
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar wrote: > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh > wrote: > > > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > This looks wrong. We should change the name of this Macro or we can > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments. > > I think this is in sync with below code (SizeOfXlogOrigin), SO doen't > make much sense to add different terminology no? > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char)) > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > In that case, we can rename this, for example, SizeOfXLogTransactionId. Some review comments from 0002-Issue-individual-*.path, +void +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid, + XLogRecPtr lsn, int nmsgs, + SharedInvalidationMessage *msgs) +{ + MemoryContext oldcontext; + ReorderBufferChange *change; + + /* XXX Should we even write invalidations without valid XID? */ + if (xid == InvalidTransactionId) + return; + + Assert(xid != InvalidTransactionId); It seems we don't call the function if xid is not valid. In fact, @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) } case XLOG_XACT_ASSIGNMENT: break; + case XLOG_XACT_INVALIDATIONS: + { + TransactionId xid; + xl_xact_invalidations *invals; + + xid = XLogRecGetXid(r); + invals = (xl_xact_invalidations *) XLogRecGetData(r); + + if (!TransactionIdIsValid(xid)) + break; + + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, + invals->nmsgs, invals->msgs); Why should we insert an WAL record for such cases? + * When wal_level=logical, write invalidations into WAL at each command end to + * support the decoding of the in-progress transaction. As of now it was + * enough to log invalidation only at commit because we are only decoding the + * transaction at the commit time. We only need to log the catalog cache and + * relcache invalidation. 
There can not be any active MVCC scan in logical + * decoding so we don't need to log the snapshot invalidation. The alignment is not right. /* * CommandEndInvalidationMessages - * Process queued-up invalidation messages at end of one command - * in a transaction. + * Process queued-up invalidation messages at end of one command + * in a transaction. Looks unnecessary changes. * Note: - * This should be called during CommandCounterIncrement(), - * after we have advanced the command ID. + * This should be called during CommandCounterIncrement(), + * after we have advanced the command ID. */ Looks unnecessary changes. if (transInvalInfo == NULL) - return; + return; Looks unnecessary changes. + /* prepare record */ + memset(&xlrec, 0, sizeof(xlrec)); We should use MinSizeOfXactInvalidations, no? -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
Re: Vacuum o/p with (full 1, parallel 0) option throwing an error
On Mon, Apr 13, 2020 at 4:23 PM Masahiko Sawada wrote: > > On Mon, 13 Apr 2020 at 18:25, Amit Kapila wrote: > > > > On Fri, Apr 10, 2020 at 7:05 PM Justin Pryzby wrote: > > > > > > > > > No problem. I think I was trying to make my text similar to that from > > > 14a4f6f37. > > > > > > I realized that I didn't sq!uash my last patch, so it didn't include the > > > functional change (which is maybe what Robert was referring to). > > > > > > > I think it is better to add a new test for temporary table which has > > less data. We don't want to increase test timings to test the > > combination of options. I changed that in the attached patch. I will > > commit this tomorrow unless you or anyone else has any more comments. > > > > Thank you for updating the patch! > > I think we can update the documentation as well. Currently, the > documentation says "This option can't be used with the FULL option." > but we can say instead, for example, "VACUUM FULL can't use parallel > vacuum.". > I am not very sure about this. I don't think the current text is wrong especially when you see the value we can specify there is described as: "Specifies a non-negative integer value passed to the selected option.". However, we can consider changing it if others also think the proposed text or something like that is better than current text. > Also, I'm concerned that the documentation says that VACUUM FULL > cannot use parallel vacuum and we compute the parallel degree when > PARALLEL option is omitted, but the following command is accepted: > > postgres(1:55514)=# vacuum (full on) test; > VACUUM > > Instead, we can say: > > In plain VACUUM (without FULL), if the PARALLEL option is omitted, > then VACUUM decides the number of workers based on the number of > indexes that support parallel vacuum operation on the relation which > is further limited by max_parallel_maintenance_workers. > > (it just adds "In plain VACUUM (without FULL)" to the beginning of the > original sentence.) 
> Yeah, something on these lines would be a good idea. Note that, we are already planning to slightly change this particular sentence in another patch [1]. [1] - https://www.postgresql.org/message-id/20200322021801.GB2563%40telsasoft.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh wrote: > > On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar wrote: > > > > I have rebased the patch on the latest head. I haven't yet changed > > anything for xid assignment thing because it is not yet concluded. > > > Some review comments from 0001-Immediately-WAL-log-*.patch, > > +bool > +IsSubTransactionAssignmentPending(void) > +{ > + if (!XLogLogicalInfoActive()) > + return false; > + > + /* we need to be in a transaction state */ > + if (!IsTransactionState()) > + return false; > + > + /* it has to be a subtransaction */ > + if (!IsSubTransaction()) > + return false; > + > + /* the subtransaction has to have a XID assigned */ > + if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny())) > + return false; > + > + /* and it needs to have 'assigned' */ > + return !CurrentTransactionState->assigned; > + > +} > IMHO, it's important to reduce the complexity of this function since > it's been called for every WAL insertion. During the lifespan of a > transaction, any of these if conditions will only be evaluated if > previous conditions are true. So, we can maintain some state machine > to avoid multiple evaluation of a condition inside a transaction. But, > if the overhead is not much, it's not worth I guess. Yeah maybe, in some cases we can avoid checking multiple conditions by maintaining that state. But, that state will have to be at the transaction level. But, I am not sure how much worth it will be to add one extra condition to skip a few if checks and it will also add the code complexity. And, in some cases where logical decoding is not enabled, it may add one extra check? I mean first check the state and that will take you to the first if check. > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > This looks wrong. We should change the name of this Macro or we can > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments. 
I think this is in sync with below code (SizeOfXlogOrigin), SO doen't make much sense to add different terminology no? #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char)) +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > @@ -195,6 +197,10 @@ XLogResetInsertion(void) > { > int i; > > + /* reset the subxact assignment flag (if needed) */ > + if (curinsert_flags & XLOG_INCLUDE_XID) > + MarkSubTransactionAssigned(); > The comment looks contradictory. > > XLogSetRecordFlags(uint8 flags) > { > Assert(begininsert_called); > - curinsert_flags = flags; > + curinsert_flags |= flags; > } > I didn't understand why we need this change in this patch. I think it's changed so that below code can use it, but we have directly set the flag. I think I will change in the next version. @@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, scratch += sizeof(replorigin_session_origin); } + /* followed by toplevel XID, if not already included in previous record */ + if (IsSubTransactionAssignmentPending()) + { + TransactionId xid = GetTopTransactionIdIfAny(); + + /* update the flag (later used by XLogInsertRecord) */ + curinsert_flags |= XLOG_INCLUDE_XID; > > + txid = XLogRecGetTopXid(record); > + > + /* > + * If the toplevel_xid is valid, we need to assign the subxact to the > + * toplevel transaction. We need to do this for all records, hence we > + * do it before the switch. > + */ > s/toplevel_xid/toplevel xid or s/toplevel_xid/txid Okay, we can change > if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT && > - info != XLOG_XACT_ASSIGNMENT) > + !TransactionIdIsValid(r->toplevel_xid)) > Perhaps, XLogRecGetTopXid() can be used. ok -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: WAL usage calculation patch
On Mon, Apr 13, 2020 at 1:10 PM Julien Rouhaud wrote: > > On Mon, Apr 13, 2020 at 8:11 AM Amit Kapila wrote: > > > > On Sat, Apr 11, 2020 at 6:55 PM Julien Rouhaud wrote: > > > > > > On Fri, Apr 10, 2020 at 9:37 PM Julien Rouhaud wrote: > > > > > > > I tried to take into account all that have been discussed, but I have > > > to admit that I'm absolutely not sure of what was actually decided > > > here. I went with those changes: > > > > > > - rename wal_num_fpw to wal_fpw for consistency, both in pgss view > > > fiel name but also everywhere in the code > > > - change comments to consistently mention "full page writes generated" > > > - changed pgss and explain documentation to mention "full page images > > > generated", from Justin's patch on another thread > > > > > > > I think it is better to use "full page writes" to be consistent with > > other places. > > > > > - kept "amount" of WAL bytes > > > > > > > Okay, but I would like to make another change suggested by Justin > > which is to replace "count" with "number" at a few places. > > Ah sorry I missed this one. +1 it also sounds better. > > > I have made the above two changes in the attached. Let me know what > > you think about attached? > > It all looks good to me! > Pushed. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Vacuum o/p with (full 1, parallel 0) option throwing an error
On Mon, 13 Apr 2020 at 18:25, Amit Kapila wrote: > > On Fri, Apr 10, 2020 at 7:05 PM Justin Pryzby wrote: > > > > > > No problem. I think I was trying to make my text similar to that from > > 14a4f6f37. > > > > I realized that I didn't sq!uash my last patch, so it didn't include the > > functional change (which is maybe what Robert was referring to). > > > > I think it is better to add a new test for temporary table which has > less data. We don't want to increase test timings to test the > combination of options. I changed that in the attached patch. I will > commit this tomorrow unless you or anyone else has any more comments. > Thank you for updating the patch! I think we can update the documentation as well. Currently, the documentation says "This option can't be used with the FULL option." but we can say instead, for example, "VACUUM FULL can't use parallel vacuum.". Also, I'm concerned that the documentation says that VACUUM FULL cannot use parallel vacuum and we compute the parallel degree when PARALLEL option is omitted, but the following command is accepted: postgres(1:55514)=# vacuum (full on) test; VACUUM Instead, we can say: In plain VACUUM (without FULL), if the PARALLEL option is omitted, then VACUUM decides the number of workers based on the number of indexes that support parallel vacuum operation on the relation which is further limited by max_parallel_maintenance_workers. (it just adds "In plain VACUUM (without FULL)" to the beginning of the original sentence.) What do you think? Regards, -- Masahiko Sawadahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar wrote:
>
> I have rebased the patch on the latest head. I haven't yet changed
> anything for the xid assignment thing because it is not yet concluded.
>

Some review comments from 0001-Immediately-WAL-log-*.patch,

+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it needs to have 'assigned' */
+	return !CurrentTransactionState->assigned;
+
+}

IMHO, it's important to reduce the complexity of this function since
it's called for every WAL insertion. During the lifespan of a
transaction, any of these if conditions will only be evaluated if the
previous conditions are true. So, we could maintain some state machine
to avoid multiple evaluations of a condition inside a transaction.
But, if the overhead is not much, it's probably not worth it.

+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))

This looks wrong. We should change the name of this macro, or we can
add the 1 byte directly in HEADER_SCRATCH_SIZE with some comments.

@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;

+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();

The comment looks contradictory.

 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }

I didn't understand why we need this change in this patch.

+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel_xid is valid, we need to assign the subxact to the
+	 * toplevel transaction. We need to do this for all records, hence we
+	 * do it before the switch.
+	 */

s/toplevel_xid/toplevel xid or s/toplevel_xid/txid

 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(r->toplevel_xid))

Perhaps, XLogRecGetTopXid() can be used.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
Re: [Proposal] Global temporary tables
On 4/9/20 6:26 PM, 曾文旌 wrote:
> On 4/7/20 2:27 PM, 曾文旌 wrote:
>> Vacuum full GTT, cluster GTT is already supported in
>> global_temporary_table_v24-pg13.patch.
>
> Here, it is skipping GTT
>
> postgres=# \c foo
> You are now connected to database "foo" as user "tushar".
> foo=# create global temporary table g123( c1 int) ;
> CREATE TABLE
> foo=# \q
> [tushar@localhost bin]$ ./vacuumdb --full foo
> vacuumdb: vacuuming database "foo"
> WARNING:  skipping vacuum global temp table "g123" because storage is
> not initialized for current session
>
> The message was inappropriate at some point, so I removed it.

Thanks Wenjing. Please see if the below behavior is correct

X terminal -

postgres=# create global temp table foo1(n int);
CREATE TABLE
postgres=# insert into foo1 values (generate_series(1,10));
INSERT 0 10
postgres=# vacuum full ;
VACUUM

Y terminal -

[tushar@localhost bin]$ ./vacuumdb -f postgres
vacuumdb: vacuuming database "postgres"
WARNING:  global temp table oldest relfrozenxid 3276 is the oldest in the entire db
DETAIL:  The oldest relfrozenxid in pg_class is 3277
HINT:  If they differ greatly, please consider cleaning up the data in global temp table.
WARNING:  global temp table oldest relfrozenxid 3276 is the oldest in the entire db
DETAIL:  The oldest relfrozenxid in pg_class is 3277
HINT:  If they differ greatly, please consider cleaning up the data in global temp table.

--
regards, tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
Re: Corruption during WAL replay
On Mon, 13 Apr 2020 at 17:40, Andres Freund wrote:
>
> Hi,
>
> On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:
> > On Sat, 11 Apr 2020 at 09:00, Teja Mupparti wrote:
> > >
> > > Thanks Andres and Kyotaro for the quick review. I have fixed the typos
> > > and also included the critical section (emulated it with try-catch block
> > > since palloc()s are causing issues in the truncate code). This time I
> > > used git format-patch.
> > >
> >
> > I briefly looked at the latest patch but I'm not sure it's the right
> > thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
> > example, with the following code you changed, we will always end up
> > with emitting a PANIC "failed to truncate the relation" regardless of
> > the actual cause of the error.
> >
> > +	PG_CATCH();
> > +	{
> > +		ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
> > +				errmsg("failed to truncate the relation")));
> > +	}
> > +	PG_END_TRY();
> >
> > And the comments of RelationTruncate() mentions:
>
> I think that's just a workaround for mdtruncate not being usable in
> critical sections.
>
> > /*
> >  * We WAL-log the truncation before actually truncating, which means
> >  * trouble if the truncation fails. If we then crash, the WAL replay
> >  * likely isn't going to succeed in the truncation either, and cause a
> >  * PANIC. It's tempting to put a critical section here, but that cure
> >  * would be worse than the disease. It would turn a usually harmless
> >  * failure to truncate, that might spell trouble at WAL replay, into a
> >  * certain PANIC.
> >  */
>
> Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL
> log something and then not perform the action. This leads to the primary
> / replica getting out of sync, crash recovery potentially not completing
> (because of records referencing the should-be-truncated pages), ...
>
> > As a second idea, I wonder if we can defer truncation until commit
> > time like the smgrDoPendingDeletes mechanism. The sequence would be:
>
> This is mostly an issue during [auto]vacuum partially truncating the end
> of the file. We intentionally release the AEL regularly to allow other
> accesses to continue.
>
> For transactional truncations we don't go down this path (as we create a
> new relfilenode).
>
> > At RelationTruncate(),
> > 1. WAL logging.
> > 2. Remember buffers to be dropped.
>
> You definitely cannot do that, as explained above.

Ah yes, you're right. So it seems to me that what we can currently do
for this issue is to enclose the truncation operation in a critical
section. IIUC it's not enough just to reverse the order of dropping
buffers and physical file truncation, because that cannot solve the
problem of inconsistency on the standby. And as Horiguchi-san
mentioned, there is no need to reverse that order if we wrap the
truncation operation in a critical section, because we can recover
page changes during crash recovery. The strategy of writing out all
dirty buffers before dropping them, proposed as (a) in [1] (doing
sync before truncation), also seems not enough.

Regards,

[1] https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services