Re: bug in pageinspect's "tuple data" feature
On 21/11/2020 21:32, Alvaro Herrera wrote: If you have a sufficiently broken data page, pageinspect throws an error when trying to examine the page: ERROR: invalid memory alloc request size 18446744073709551451 This is pretty unhelpful; it would be better not to try to print the data instead of dying. With that, at least you can know where the problem is. This was introduced in d6061f83a166 (2015). Proposed patch to fix it (by having the code print a null "data" instead of dying) is attached. Null seems misleading. Maybe something like "invalid", or print a warning? - Heikki
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
> > Here is how I'm making 4 separate patches: > > 1. new function and its documentation. > > 2. GUC and its documentation. > > 3. server-level option and its documentation. > > 4. test cases for all of the above patches. > Hi, I'm attaching the patches here. Note that, though the code changes for this feature are small, I divided them up into separate patches to make review easy. v1-0001-postgres_fdw-function-to-discard-cached-connections.patch This patch adds a new function, postgres_fdw_disconnect(), defined as part of CREATE EXTENSION postgres_fdw. When called with a foreign server name, it discards the cached connections associated with that server; when called without any argument, it discards all the existing cached connections. v1-0002-postgres_fdw-add-keep_connections-GUC-to-not-cache-connections.patch This patch adds a new GUC, postgres_fdw.keep_connections, default on; when set to off, no remote connections are cached by the local session. v1-0003-postgres_fdw-server-level-option-keep_connection.patch This patch adds a new server-level option, keep_connection, default on; when set to off, the local session doesn't cache the connections associated with the foreign server. v1-0004-postgres_fdw-connection-cache-discard-tests-and-documentation.patch This patch adds the tests and documentation related to this feature. Please review the patches. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel plans and "union all" subquery
On Sun, Nov 22, 2020 at 11:51 PM Phil Florent wrote: > > > Hi, > > > I have a question about parallel plans. I also posted it on the general list > but perhaps it's a question for hackers. Here is my test case : > > > explain > select count(*) > from (select > n1 > from drop_me > union all > values(1)) ua; > > > QUERY PLAN > > Aggregate (cost=2934739.24..2934739.25 rows=1 width=8) > -> Append (cost=0.00..2059737.83 rows=7113 width=32) > -> Seq Scan on drop_me (cost=0.00..1009736.12 rows=7112 width=6) > -> Subquery Scan on "*SELECT* 2" (cost=0.00..0.02 rows=1 width=32) > -> Result (cost=0.00..0.01 rows=1 width=4) > JIT: > Functions: 4 > Options: Inlining true, Optimization true, Expressions true, Deforming true > > > No parallel plan, 2s6 > > > I read the documentation but I don't get the reason of the "noparallel" seq > scan of drop_me in the last case ? > Without debugging this, it looks to me that the UNION type resolution isn't working as well as it possibly could in this case, for the generation of a parallel plan. I found that with a minor tweak to your SQL, either for the table creation or query, it will produce a parallel plan. 
Noting that currently you're creating the drop_me table with a "numeric" column, you can either:

(1) Change the table creation
FROM: create unlogged table drop_me as select generate_series(1,7e7) n1;
TO: create unlogged table drop_me as select generate_series(1,7e7)::int n1;

OR

(2) Change the query
FROM: explain select count(*) from (select n1 from drop_me union all values(1)) ua;
TO: explain select count(*) from (select n1 from drop_me union all values(1::numeric)) ua;

QUERY PLAN
Finalize Aggregate (cost=821152.71..821152.72 rows=1 width=8)
  -> Gather (cost=821152.50..821152.71 rows=2 width=8)
     Workers Planned: 2
     -> Partial Aggregate (cost=820152.50..820152.51 rows=1 width=8)
        -> Parallel Append (cost=0.00..747235.71 rows=29166714 width=0)
           -> Result (cost=0.00..0.01 rows=1 width=0)
           -> Parallel Seq Scan on drop_me (cost=0.00..601402.13 rows=29166713 width=0)
(7 rows)

Regards, Greg Nancarrow Fujitsu Australia
Re: PoC/WIP: Extended statistics on expressions
On 11/23/20 3:26 AM, Justin Pryzby wrote: > On Sun, Nov 22, 2020 at 08:03:51PM +0100, Tomas Vondra wrote: >> attached is a significantly improved version of the patch, allowing >> defining extended statistics on expressions. This fixes most of the >> problems in the previous WIP version and AFAICS it does pass all >> regression tests (including under valgrind). There's a bunch of FIXMEs >> and a couple loose ends, but overall I think it's ready for reviews. > > I was looking at the previous patch, so now read this one instead, and attach > some proposed fixes. > > + * This matters especially for * expensive expressions, of course. > The point this was trying to make is that we evaluate the expressions only once, and use the results to build all extended statistics. Instead of leaving it up to every "build" to re-evaluate it. > + The expression can refer only to columns of the underlying table, but > + it can use all columns, not just the ones the statistics is defined > + on. > > I don't know what these are trying to say? > D'oh. That's bogus para, copied from the CREATE INDEX docs (where it talked about the index predicate, which is irrelevant here). > +errmsg("statistics expressions and > predicates can refer only to the table being indexed"))); > +* partial-index predicates. Create it in the per-index context to be > > I think these are copied and shouldn't mention "indexes" or "predicates". Or > should statistics support predicates, too ? > Right. Stupid copy-pasto. > Idea: if a user specifies no stakinds, and there's no expression specified, > then you automatically build everything except for expressional stats. But if > they specify only one statistics "column", it gives an error. If that's a > non-simple column reference, should that instead build *only* expressional > stats (possibly with a NOTICE, since the user might be intending to make MV > stats). 
> Right, that was the intention - but I messed up and it works only if you specify the "expressions" kind explicitly (and I also added the ERROR message to expected output by mistake). I agree we should handle this automatically, so that CREATE STATISTICS s ON (a+b) FROM t works and only creates statistics for the expression. > I think pg_stats_ext should allow inspecting the pg_statistic data in > pg_statistic_ext_data.stxdexprs. I guess array_agg() should be ordered by > something, so maybe it should use ORDINALITY (?) > I agree we should expose the expression statistics, but I'm not convinced we should do that in the pg_stats_ext view itself. The problem is that it's essentially a table nested in a table, with non-trivial structure, so I was thinking about adding a separate view exposing just this one part. Something like pg_stats_ext_expressions, with about the same structure as pg_stats, or something. > I hacked more on bootstrap.c so included that here. Thanks. As for the 0004-0007 patches: 0004 - Seems fine. IMHO not really "silly errors" but OK. 0005 - Mostly OK. The docs wording mostly comes from CREATE INDEX docs, though. The paragraph about "t1" is old, so if we want to reword it then maybe we should backpatch too. 0006 - Not sure. I think CreateStatistics can be fixed with less code, keeping it more like PG13 (good for backpatching). Not sure why it renames extended statistics to multi-variate statistics - we use "extended" everywhere. Not sure what's the point of the serialize_expr_stats changes; that code is mostly copy-paste from update_attstats. 0007 - I suspect this makes pg_stats_ext too complex to work with; IMHO we should move this to a separate view. Thanks for the review! I'll try to look more closely at those patches sometime next week, and merge most of it. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
RE: Disable WAL logging to speed up data loading
From: Osumi, Takamichi/大墨 昂道 > This time, I updated my patch to address comments below only. I forgot to mention this. I confirmed all review comments are reflected correctly. Regards Takayuki Tsunakawa
RE: Disable WAL logging to speed up data loading
From: Osumi, Takamichi/大墨 昂道 > > case TRANS_STMT_PREPARE: > > + if (wal_level == WAL_LEVEL_NONE) > > + ereport(ERROR, > > + errmsg("cannot execute PREPARE TRANSACTION when WAL logging is disabled")); > Usually, this safeguard is not needed for PREPARE TRANSACTION. > But when an unexpected crash happens, the user needs to recreate the cluster > from the backup and re-execute the operations that were executed on the crashed > server, if the user sets wal_level=none. > > While 2PC guarantees that after issuing PREPARE TRANSACTION the > contents of the transaction become *secure*, setting > wal_level=none makes the server never start up again if a database crash > occurs. > In other words, this safeguard of the new wal_level protects an important > aspect of PREPARE TRANSACTION's functionality, in my opinion. > > I don't think this is what it should be. > > Therefore, I want to prevent users from using PREPARE TRANSACTION at all under > wal_level=none, so that nobody thinks they can use PREPARE TRANSACTION > safely in that mode. > Does that make sense? PREPARE TRANSACTION is the same as COMMIT in that it persists transaction updates. A crash during wal_level = none loses both of them. So, I don't think PREPARE TRANSACTION needs special treatment. Does PREPARE TRANSACTION complete successfully? I remember Fujii-san saying it PANICked if --enable-asserts is used in configure. Regards Takayuki Tsunakawa
RE: [POC] Fast COPY FROM command for the table with foreign partitions
Hi Andrey-san, From: Tomas Vondra > I needed to look at this patch while working on something related, and I found it got broken by 6973533650c a couple days ago. So here's a fixed version, to keep cfbot happy. I haven't done any serious review yet. Could I or my colleague continue this patch in a few days? It looks like it has been stalled for over a month. Regards Takayuki Tsunakawa
RE: POC: postgres_fdw insert batching
From: Tomas Vondra > I don't think this is usable in practice, because a single session may be using multiple FDW servers, with different implementations, latency to the data nodes, etc. It's unlikely a single GUC value will be suitable for all of them. That makes sense. The row size varies from table to table, so the user may want to tune this option to reduce memory consumption. I think the attached patch (v4-0001-Add-bulk-insert-for-foreign-tables.patch) has reflected all your comments. I hope this will pass. Regards Takayuki Tsunakawa
Re: [PoC] Non-volatile WAL buffer
Hi, On 10/30/20 6:57 AM, Takashi Menjo wrote: > Hi Heikki, >> I had a new look at this thread today, trying to figure out where we are. > I'm a bit confused. >> One thing we have established: mmap()ing WAL files performs worse than the current method, if pg_wal is not on a persistent memory device. This is because the kernel faults in existing content of each page, even though we're overwriting everything. > Yes. In addition, after a certain page (in the sense of OS page) is msync()ed, another page fault will occur again when something is stored into that page. >> That's unfortunate. I was hoping that mmap() would be a good option even without persistent memory hardware. I wish we could tell the kernel to zero the pages instead of reading them from the file. Maybe clear the file with ftruncate() before mmapping it? > The area extended by ftruncate() appears as if it were zero-filled [1]. Please note that it merely "appears as if." It might not be actually zero-filled as data blocks on devices, so pre-allocating files should improve transaction performance. At least, on Linux 5.7 and ext4, it takes more time to store into a mapped file that was just open(O_CREAT)ed and ftruncate()d than into one that was already filled with actual data. > Does it really matter that it only appears zero-filled? I think Heikki's point was that maybe ftruncate() would prevent the kernel from faulting in the existing page content when we're overwriting it. Not sure I understand what the benchmark with ext4 was doing, exactly. How was that measured? Might be interesting to have some simple benchmarking tool to demonstrate this (I believe a small standalone tool written in C should do the trick). >> That should not be a problem with a real persistent memory device, however (or when emulating it with DRAM). With DAX, the storage is memory-mapped directly and there is no page cache, and no pre-faulting. 
> > Yes, with filesystem DAX, there is no page cache for file data. A > page fault still occurs but for each 2MiB DAX hugepage, so its > overhead decreases compared with 4KiB page fault. Such a DAX > hugepage fault is only applied to DAX-mapped files and is different > from a general transparent hugepage fault. > I don't follow - if there are page faults even when overwriting all the data, I'd say it's still an issue even with 2MB DAX pages. How big is the difference between 4kB and 2MB pages? Not sure I understand how is this different from general THP fault? >> Because of that, I'm baffled by what the >> v4-0002-Non-volatile-WAL-buffer.patch does. If I understand it >> correctly, it puts the WAL buffers in a separate file, which is >> stored on the NVRAM. Why? I realize that this is just a Proof of >> Concept, but I'm very much not interested in anything that requires >> the DBA to manage a second WAL location. Did you test the mmap() >> patches with persistent memory hardware? Did you compare that with >> the pmem patchset, on the same hardware? If there's a meaningful >> performance difference between the two, what's causing it? > Yes, this patchset puts the WAL buffers into the file specified by > "nvwal_path" in postgresql.conf. > > Why this patchset puts the buffers into the separated file, not > existing segment files in PGDATA/pg_wal, is because it reduces the > overhead due to system calls such as open(), mmap(), munmap(), and > close(). It open()s and mmap()s the file "nvwal_path" once, and keeps > that file mapped while running. On the other hand, as for the > patchset mmap()ing the segment files, a backend process should > munmap() and close() the current mapped file and open() and mmap() > the new one for each time the inserting location for that process > goes over segments. This causes the performance difference between > the two. > I kinda agree with Heikki here - having to manage yet another location for WAL data is rather inconvenient. 
We should aim not to make the life of DBAs unnecessarily difficult, IMO. I wonder how significant the syscall overhead is - can you share some numbers? I don't see any such results in this thread, so I'm not sure if it means losing 1% or 10% throughput. Also, maybe there are alternative ways to reduce the overhead? For example, we can increase the size of the WAL segment, and with 1GB segments we'd do 1/64 of the syscalls. Or maybe we could do some of this asynchronously - request a segment ahead, and let another process do the actual work etc. so that the running process does not wait. Do I understand correctly that the patch removes "regular" WAL buffers and instead writes the data into the non-volatile PMEM buffer, without writing that to the WAL segments at all (unless in archiving mode)? Firstly, I guess many (most?) instances will have to write the WAL segments anyway because of PITR/backups, so I'm not sure we can save much here. But more importantly - doesn't that mean the nvwal_siz
Re: [PoC] Non-volatile WAL buffer
Hi, These patches no longer apply :-( A rebased version would be nice. I've been interested in what performance improvements this might bring, so I've been running some extensive benchmarks on a machine with PMEM hardware. So let me share some interesting results. (I used a commit from early September, to make the patch apply cleanly.) Note: The hardware was provided by Intel, and they are interested in supporting the development and providing access to machines with PMEM to developers. So if you're interested in this patch & PMEM, but don't have access to suitable hardware, try contacting Steve Shaw, who's the person responsible for open source databases at Intel (he's also the author of HammerDB). The benchmarks were done on a machine with 2 x Xeon Platinum (24/48 cores), 128GB RAM, NVMe and PMEM SSDs. I did some basic pgbench tests with different scales (500, 5000, 15000) with and without these patches. I did some usual tuning (shared buffers, max_wal_size etc.), the most important changes being:

- maintenance_work_mem = 256MB
- max_connections = 200
- random_page_cost = 1.2
- shared_buffers = 16GB
- work_mem = 64MB
- checkpoint_completion_target = 0.9
- checkpoint_timeout = 20min
- max_wal_size = 96GB
- autovacuum_analyze_scale_factor = 0.1
- autovacuum_vacuum_insert_scale_factor = 0.05
- autovacuum_vacuum_scale_factor = 0.01
- vacuum_cost_limit = 1000

And on the patched version:

- nvwal_size = 128GB
- nvwal_path = … points to the PMEM DAX device …

The machine has multiple SSDs (all Optane-based, IIRC):

- NVMe SSD (Optane)
- PMEM in BTT mode
- PMEM in DAX mode

So I've tested all of them - the data was always on the NVMe device, and the WAL was placed on one of those devices. 
That means we have these four cases to compare:

- nvme - master with WAL on the NVMe SSD
- pmembtt - master with WAL on PMEM in BTT mode
- pmemdax - master with WAL on PMEM in DAX mode
- pmemdax-ntt - patched version with WAL on PMEM in DAX mode

The "nvme" is a bit disadvantaged as it places both data and WAL on the same device, so consider that while evaluating the results. But for the smaller data sets this should be fairly negligible, I believe. I'm not entirely sure whether the "pmemdax" (i.e. unpatched instance with WAL on PMEM DAX device) is actually safe, but I included it anyway to see what the difference is. Now let's look at results for the basic data sizes and client counts. I've also attached some charts to illustrate this. These numbers are tps averages from 3 runs, each about 30 minutes long.

1) scale 500 (fits into shared buffers)
---------------------------------------------------
  wal             1      16      32      64      96
  nvme         6321   73794  132687  185409  192228
  pmembtt      6248   60105   85272   82943   84124
  pmemdax      6686   86188  154850  105219  149224
  pmemdax-ntt  8062  104887  211722  231085  252593

The NVMe performs well (the single device is not an issue, as there should be very little non-WAL I/O). The PMEM/BTT has a clear bottleneck ~85k tps. It's interesting the PMEM/DAX performs much worse without the patch, and the drop at 64 clients. Not sure what that's about.

2) scale 5000 (fits into RAM)
---------------------------------------------------
  wal             1      16      32      64      96
  nvme         4804   43636   61443   79807   86414
  pmembtt      4203   28354   37562   41562   43684
  pmemdax      5580   62180   92361  112935  117261
  pmemdax-ntt  6325   79887  128259  141793  127224

The differences are more significant, compared to the small scale. The BTT seems to have a bottleneck around ~43k tps, the PMEM/DAX dominates.

3) scale 15000 (bigger than RAM)
---------------------------------------------------
  wal             1      16      32      64      96
  pmembtt      3638   20630   28985   32019   31303
  pmemdax      5164   48230   69822   85740   90452
  pmemdax-ntt  5382   62359   80038   83779   80191

I have not included the nvme results here, because the impact of placing both data and WAL on the same device was too significant IMHO. The remaining results seem nice. 
It's interesting the patched case is a bit slower than master. Not sure why. Overall, these results seem pretty nice, I guess. Of course, this does not say the current patch is the best way to implement this (or whether it's correct), but it does suggest supporting PMEM might bring sizeable performance boost. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [doc] improve tableoid description
2020年11月21日(土) 16:29 Peter Eisentraut : > > On 2020-10-19 14:28, Ian Lawrence Barwick wrote: > > On further reflection, I think trying to explain all that is going to > > end up as a > > mini-tutorial which is beyond the scope of the explanation of a column, so > > the existing reference to pg_class should be enough. > > > > Revised patch attached just mentioning partitioned tables. > > committed Thanks! -- EnterpriseDB: https://www.enterprisedb.com
Re: Bogus documentation for bogus geometric operators
Pavel Borisov writes: >> undocumented. Maybe instead of removing, change the text to be >> "Deprecated, use the equivalent XXX operator instead." Or we could >> add a footnote similar to what was there for a previous renaming: > The problem that this new <<| is equivalent to <^ only for points (To > recap: the source of a problem is the same name of <^ operator for points > and boxes with different meaning for these types). I don't think it's that hard to be clear; see proposed wording below. The other loose end is that I don't think we can take away the opclass entries for the old spellings, unless we're willing to visibly break people's queries by removing those operator names altogether. That doesn't seem like it'll fly when we haven't even deprecated the old names yet. So for now, we have to support both names in the opclasses. I extended the patch to do that. This version seems committable to me --- any thoughts? regards, tom lane diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 7d06b979eb..507bc1a668 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -10609,7 +10609,7 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple Is first object strictly below second? -Available for box, polygon, +Available for point, box, polygon, circle. @@ -10625,7 +10625,7 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple Is first object strictly above second? -Available for box, polygon, +Available for point, box, polygon, circle. @@ -10680,21 +10680,6 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple - - -point <^ point -boolean - - -Is first object strictly below second? -(This operator is misnamed; it should be <<|.) 
- - -point '(1,0)' <^ point '(1,1)' -t - - - box >^ box @@ -10709,21 +10694,6 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple - - -point >^ point -boolean - - -Is first object strictly above second? -(This operator is misnamed; it should be |>>.) - - -point '(1,1)' >^ point '(1,0)' -t - - - geometric_type ?# geometric_type @@ -10877,6 +10847,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple + + + Before PostgreSQL 14, the point + is strictly below/above comparison operators point + <<| point and point + |>> point were respectively + called <^ and >^. These + names are still available, but are deprecated and will eventually be + removed. + + + Geometric Functions diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml index 1bf5f09659..d1b6cc9a01 100644 --- a/doc/src/sgml/gist.sgml +++ b/doc/src/sgml/gist.sgml @@ -118,12 +118,12 @@ point_ops - >^ (point,point) + |>> (point,point) <-> (point,point) << (point,point) >> (point,point) - <^ (point,point) + <<| (point,point) ~= (point,point) <@ (point,box) <@ (point,polygon) diff --git a/doc/src/sgml/spgist.sgml b/doc/src/sgml/spgist.sgml index 68d09951d9..ea88ae45e5 100644 --- a/doc/src/sgml/spgist.sgml +++ b/doc/src/sgml/spgist.sgml @@ -76,7 +76,7 @@ box_ops << (box,box) - <-> (box,point) + <-> (box,point) &< (box,box) &> (box,box) @@ -92,12 +92,12 @@ kd_point_ops - >^ (point,point) + |>> (point,point) <-> (point,point) << (point,point) >> (point,point) - <^ (point,point) + <<| (point,point) ~= (point,point) <@ (point,box) @@ -132,16 +132,16 @@ <<| (polygon,polygon) &<| (polygon,polygon) |>> (polygon,polygon) - |&> (polygon,polygon) + |&> (polygon,polygon) quad_point_ops - >^ (point,point) + |>> (point,point) <-> (point,point) << (point,point) >> (point,point) - <^ (point,point) + <<| (point,point) ~= (point,point) <@ (point,box) @@ -159,7 +159,7 @@ &< (anyrange,anyrange) &> (anyrange,anyrange) -|- (anyrange,anyrange) - + text_ops = 
(text,text) diff --git a/src/backend/access/gist/gistproc.c b/src/backend/access/gist/gistproc.c index b03c4b55a0..784807c636 100644 --- a/src/backend/access/gist/gistproc.c +++ b/src/backend/access/gist/gistproc.c @@ -1341,8 +1341,18 @@ gist_point_consistent(PG_FUNCTION_ARGS) StrategyNumber strategy = (StrategyNumber) PG_GETARG_UINT16(2); bo
Re: Why does create_gather_merge_plan need make_sort?
On 11/22/20 10:31 PM, Tom Lane wrote: > Tomas Vondra writes: >> On 11/20/20 11:24 PM, James Coleman wrote: >>> While looking at another issue I noticed that create_gather_merge_plan >>> calls make_sort if the subplan isn't sufficiently sorted. In all of >>> the cases I've seen where a gather merge path (not plan) is created >>> the input path is expected to be properly sorted, so I was wondering >>> if anyone happened to know what case is being handled by the make_sort >>> call. Removing it doesn't seem to break any tests. > >> Yeah, I think you're right this is dead code, essentially. We're only >> ever calling create_gather_merge_path() with pathkeys matching the >> subpath. And it's like that on REL_12_STABLE too, i.e. before the >> incremental sort was introduced. > > It's probably there by analogy to the other callers of > prepare_sort_from_pathkeys, which all do at least a conditional > make_sort(). I'd be inclined to leave it there; it's cheap insurance > against somebody weakening the existing behavior. > But how do we know it's safe to actually do the sort there, e.g. in light of the volatility/parallel-safety issues discussed in other threads? I agree the check may be useful, but maybe we should just do elog(ERROR) instead. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Fix generate_useful_gather_paths for parallel unsafe pathkeys
On 11/21/20 2:55 AM, James Coleman wrote: > Over on the -bugs list we had a report [1] of a seg fault with > incremental sort. The short of the investigation there was that a > subplan wasn't being serialized since it wasn't parallel safe, and > that attempting to initialize that subplan resulted in the seg fault. > Tom pushed a change to raise an ERROR when this occurs so that at > least we don't crash the server. > > There was some small amount of discussion about this on a thread about > a somewhat similar issue [2] (where volatile expressions were the > issue), but that thread resulted in a committed patch, and beyond that > it seems that this issue deserves its own discussion. > > I've been digging further into the problem, and my first question was > "why doesn't this result in a similar error but with a full sort when > incremental sort is disabled?" After all, we have code to consider a > gather merge + full sort on the cheapest partial path in > generate_useful_gather_paths(), and the plan that I get on Luis's test > case with incremental sort disabled has a full sort above a gather, > which presumably it'd be cheaper to push down into the parallel plan > (using gather merge instead of gather). > > It turns out that there's a separate bug here: I'm guessing that in > the original commit of generate_useful_gather_paths() we had a > copy/paste thinko in this code: > > /* > * If the path has no ordering at all, then we can't use either > * incremental sort or rely on implicit sorting with a gather > * merge. > */ > if (subpath->pathkeys == NIL) > continue; > > The comment claims that we should skip subpaths that aren't sorted at > all since we can't possibly use a gather merge directly or with an > incremental sort applied first. It's true that we can't do either of > those things unless the subpath is already at least partially sorted. 
> But subsequently we attempt to construct a gather merge path on top of > a full sort (for the cheapest path), and there's no reason to limit > that to partially sorted paths (indeed in the presence of incremental > sort it seems unlikely that that would be the cheapest path). > > We still don't want to build incremental sort paths if the subpath has > no pathkeys, but that will imply presorted_keys = 0, which we already > check. > > So 0001 removes that branch entirely. And, as expected, we now get a > gather merge + full sort and the resulting error about the subplan not being > initialized. > Yeah, that's clearly a thinko in generate_useful_gather_paths. Thanks for spotting it! The 0001 patch seems fine to me. > With that oddity out of the way, I started looking at the actually > reported problem, and nominally the issue is that we have a correlated > subquery (in the final target list and the sort clause), and > correlated subqueries (not sure why exactly in this case; see [3]) > aren't considered parallel-safe. > > As to why that's happening: > 1. find_em_expr_usable_for_sorting_rel doesn't exclude that > expression, and arguably correctly so in the general case since one > might (though we don't now) use this same kind of logic for > non-parallel plans. > 2. generate_useful_pathkeys_for_relation is noting that it'd be useful > to sort on that expression and sees it as safe in part due to (1). > 3. We create a sort node that includes that expression in its output. > That seems pretty much the same (in terms of safety given generalized > input paths/plans, not in terms of what Robert brought up in [4]) as > what would happen in the prepare_sort_from_pathkeys call in > create_gather_merge_plan. > > So to fix this problem I've added the option to find_em_expr_for_rel > to restrict to parallel-safe expressions in 0002. > OK, that seems like a valid fix. I wonder if we can have an EC with members mixing parallel-safe and parallel-unsafe expressions? 
I guess no, but it's more a feeling than a clear reasoning and so I wonder what would happen in such cases. > Given point 3 above (and the discussion with Robert earlier) above it > seems to me that there's a bigger problem here that just hasn't > happened to have been exposed yet: in particular the > prepare_sort_from_pathkeys call in create_gather_merge_plan seems > suspect to me. But it's also possible other usages of > prepare_sort_from_pathkeys could be suspect given other seemingly > unrelated changed to path generation, since nothing other than > implicit knowledge about the conventions here prevents us from > introducing paths generating sorts with expressions not in the target > list (in fact a part of the issue here IMO is that non-var expression > pathkeys are added to the target list after path generation and > selection is completed). > > Alternatively we could "fix" this by being even more conservative in > find_em_expr_usable_for_sorting_rel and disregard any expressions not > currently found in the target list. But that seems unsatisfactory to > me since, given we know that there are cases w
Re: Why does create_gather_merge_plan need make_sort?
Tomas Vondra writes: > On 11/20/20 11:24 PM, James Coleman wrote: >> While looking at another issue I noticed that create_gather_merge_plan >> calls make_sort if the subplan isn't sufficiently sorted. In all of >> the cases I've seen where a gather merge path (not plan) is created >> the input path is expected to be properly sorted, so I was wondering >> if anyone happened to know what case is being handled by the make_sort >> call. Removing it doesn't seem to break any tests. > Yeah, I think you're right this is dead code, essentially. We're only > ever calling create_gather_merge_path() with pathkeys matching the > subpath. And it's like that on REL_12_STABLE too, i.e. before the > incremental sort was introduced. It's probably there by analogy to the other callers of prepare_sort_from_pathkeys, which all do at least a conditional make_sort(). I'd be inclined to leave it there; it's cheap insurance against somebody weakening the existing behavior. regards, tom lane
Re: Why does create_gather_merge_plan need make_sort?
On 11/20/20 11:24 PM, James Coleman wrote: > While looking at another issue I noticed that create_gather_merge_plan > calls make_sort if the subplan isn't sufficiently sorted. In all of > the cases I've seen where a gather merge path (not plan) is created > the input path is expected to be properly sorted, so I was wondering > if anyone happened to know what case is being handled by the make_sort > call. Removing it doesn't seem to break any tests. > Yeah, I think you're right this is dead code, essentially. We're only ever calling create_gather_merge_path() with pathkeys matching the subpath. And it's like that on REL_12_STABLE too, i.e. before the incremental sort was introduced. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [bug+patch] Inserting DEFAULT into generated columns from VALUES RTE
Dean Rasheed writes: > I think it's actually easier to just do it all in the rewriter -- at > the point where we see that we're about to insert potentially illegal > values from a VALUES RTE into a generated column, scan it to see if > all the values in that column are DEFAULTs, and if so trigger the > existing code to replace the Var in the tlist with the generated > column expression. That way we only do extra work in the case for > which we're currently throwing an error, rather than for every query. That makes sense, and it leads to a nicely small patch. I reviewed this and pushed it. I found only one nitpicky bug: in findDefaultOnlyColumns, the test must be bms_is_empty(default_only_cols) not just default_only_cols == NULL, or it will fail to fall out early as intended when the first row contains some DEFAULTs but later rows don't. I did tweak some of the commentary, too. > I haven't touched the existing error messages. I think that's a > subject for a separate patch. Fair. After looking around a bit, I think that getting an error cursor as I'd speculated about is more trouble than it's worth. For starters, we'd have to pass down the query string into this code, and there might be some ticklish issues about whether a given chunk of parsetree came from that string or from some rule or view. However, I think that just adjusting the error string would be helpful, as attached. (I'm also wondering why the second case is generic ERRCODE_SYNTAX_ERROR and not ERRCODE_GENERATED_ALWAYS. Didn't change it here, though.) 
regards, tom lane diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c index 839583f834..00a48686c4 100644 --- a/src/backend/rewrite/rewriteHandler.c +++ b/src/backend/rewrite/rewriteHandler.c @@ -861,7 +861,7 @@ rewriteTargetListIU(List *targetList, if (!apply_default) ereport(ERROR, (errcode(ERRCODE_GENERATED_ALWAYS), - errmsg("cannot insert into column \"%s\"", + errmsg("cannot insert a non-DEFAULT value into column \"%s\"", NameStr(att_tup->attname)), errdetail("Column \"%s\" is an identity column defined as GENERATED ALWAYS.", NameStr(att_tup->attname)), @@ -900,7 +900,7 @@ rewriteTargetListIU(List *targetList, if (!apply_default) ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), - errmsg("cannot insert into column \"%s\"", + errmsg("cannot insert a non-DEFAULT value into column \"%s\"", NameStr(att_tup->attname)), errdetail("Column \"%s\" is a generated column.", NameStr(att_tup->attname; diff --git a/src/test/regress/expected/generated.out b/src/test/regress/expected/generated.out index 559fafa09e..ca721d38bf 100644 --- a/src/test/regress/expected/generated.out +++ b/src/test/regress/expected/generated.out @@ -93,16 +93,16 @@ LINE 1: ...E gtest_err_8 (a int PRIMARY KEY, b int GENERATED BY DEFAULT... INSERT INTO gtest1 VALUES (1); INSERT INTO gtest1 VALUES (2, DEFAULT); -- ok INSERT INTO gtest1 VALUES (3, 33); -- error -ERROR: cannot insert into column "b" +ERROR: cannot insert a non-DEFAULT value into column "b" DETAIL: Column "b" is a generated column. INSERT INTO gtest1 VALUES (3, 33), (4, 44); -- error -ERROR: cannot insert into column "b" +ERROR: cannot insert a non-DEFAULT value into column "b" DETAIL: Column "b" is a generated column. INSERT INTO gtest1 VALUES (3, DEFAULT), (4, 44); -- error -ERROR: cannot insert into column "b" +ERROR: cannot insert a non-DEFAULT value into column "b" DETAIL: Column "b" is a generated column. 
INSERT INTO gtest1 VALUES (3, 33), (4, DEFAULT); -- error -ERROR: cannot insert into column "b" +ERROR: cannot insert a non-DEFAULT value into column "b" DETAIL: Column "b" is a generated column. INSERT INTO gtest1 VALUES (3, DEFAULT), (4, DEFAULT); -- ok SELECT * FROM gtest1 ORDER BY a; @@ -193,25 +193,25 @@ SELECT * FROM gtest1v; (1 row) INSERT INTO gtest1v VALUES (4, 8); -- error -ERROR: cannot insert into column "b" +ERROR: cannot insert a non-DEFAULT value into column "b" DETAIL: Column "b" is a generated column. INSERT INTO gtest1v VALUES (5, DEFAULT); -- ok INSERT INTO gtest1v VALUES (6, 66), (7, 77); -- error -ERROR: cannot insert into column "b" +ERROR: cannot insert a non-DEFAULT value into column "b" DETAIL: Column "b" is a generated column. INSERT INTO gtest1v VALUES (6, DEFAULT), (7, 77); -- error -ERROR: cannot insert into column "b" +ERROR: cannot insert a non-DEFAULT value into column "b" DETAIL: Column "b" is a generated column. INSERT INTO gtest1v VALUES (6, 66), (7, DEFAULT); -- error -ERROR: cannot insert into column "b" +ERROR: cannot insert a non-DEFAULT value into column "b" DETAIL: Column "b" is a generated column. INSERT INTO gtest1v VALUES (6, DEFAULT), (7, DEFAULT); -- ok ALTER VIEW gtest1v ALTER COLUMN b SET DEFAULT 100; INSERT INTO gtest1v VALUES (8, DEFAULT); --
Re: More time spending with "delete pending"
19.11.2020 01:28, Tom Lane wrote: > Alexander Lakhin writes: >> 18.11.2020 23:39, Tom Lane wrote: >>> Now, the ones on the 10 and 11 branches are all from pg_ctl, which >>> does not use pgwin32_open() in those branches, only native open(). >>> So those aren't fair to count against it. But we have nearly as >>> many similar failures in HEAD, which surely is going through >>> pgwin32_open(). So either we don't really have adequate protection >>> against delete-pending files, or there is some other effect we haven't >>> explained yet. >> Can you confirm that there are no such failures on REL_12_STABLE and >> REL_13_STABLE for last three (or maybe six) months? Maybe something >> changed in HEAD then? > Hmm, that is an interesting question isn't it. Here's a search going > back a full year. There are a few in v12 --- interestingly, all on > the statistics file, none from pg_ctl --- and none in v13. Of course, > v13 has only existed as a separate branch since 2020-06-07. > > There's also a buildfarm test-coverage artifact involved. The bulk > of the HEAD reports are from dory and walleye, neither of which are > building v13 as yet, because of unclear problems [1]. I think those > two animals build much more frequently than our other Windows animals, > too, so the fact that they have more may be just because of that and > not because they're somehow more susceptible. Because of that, I'm not > sure that we have enough evidence to say that v13 is better than HEAD. > If there is some new bug, it's post-v12, but maybe not post-v13. But > v12 is evidently not perfect either. Thanks for the complete list! As I can see from it, only lorikeet fails on REL_12_STABLE. And it has a simple explanation. lorikeet uses cygwin, not win32, but improved open() and fopen() functions are not defined on cygwin: #if defined(WIN32) && !defined(__CYGWIN__) /* * open() and fopen() replacements to allow deletion of open files and * passing of other special options. 
*/ #define O_DIRECT 0x8000 extern int pgwin32_open(const char *, int,...); extern FILE *pgwin32_fopen(const char *, const char *); #define open(a,b,c) pgwin32_open(a,b,c) #define fopen(a,b) pgwin32_fopen(a,b) ... And aside from the "Permission denied" failures (not necessarily related to the "delete pending" state), we can see "Device or resource busy" failures on the same global.stat file (e.g., [1], [2], [3]). All these failures occur when VACUUM (or REINDEX [3]) tries to access statistics simultaneously with the statistics collector. I've tried to make open()/fopen() replacements for cygwin, but haven't succeeded yet (I need more time to find out how to get the original GetLastError() [4], as knowing errno is not sufficient to implement pgwin32_open()'s logic). (And I'm inclined to file a separate bug related to this cygwin behavior when I get more info.) The failures on HEAD (on dory and walleye) need another investigation (maybe they are caused by the modified stat()...). [1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=lorikeet&dt=2020-09-07%2020%3A31%3A16&stg=install-check-C [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2019-10-27%2009%3A36%3A07 [3] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=lorikeet&dt=2020-09-02%2008%3A28%3A53&stg=install-check-C [4] https://github.com/openunix/cygwin/blob/master/winsup/cygwin/errno.cc Best regards, Alexander
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
On Sun, Nov 22, 2020 at 9:21 PM Andy Fan wrote: > Hi David: > > I did a review on the v8, it looks great to me. Here are some tiny > things noted, > just FYI. > > 1. modified src/include/utils/selfuncs.h > @@ -70,9 +70,9 @@ > * callers to provide further details about some assumptions which were > made > * during the estimation. > */ > -#define SELFLAG_USED_DEFAULT (1 << 0) /* Estimation fell back on one of > - * the DEFAULTs as defined above. > - */ > +#define SELFLAG_USED_DEFAULT (1 << 0) /* Estimation fell back on one > + * of the DEFAULTs as defined > + * above. */ > > Looks nothing has changed. > > > 2. leading spaces is not necessary in comments. > > /* > * ResultCacheTuple Stores an individually cached tuple > */ > typedef struct ResultCacheTuple > { > MinimalTuple mintuple; /* Cached tuple */ > struct ResultCacheTuple *next; /* The next tuple with the same parameter > * values or NULL if it's the last one */ > } ResultCacheTuple; > > > 3. We define ResultCacheKey as below. > > /* > * ResultCacheKey > * The hash table key for cached entries plus the LRU list link > */ > typedef struct ResultCacheKey > { > MinimalTuple params; > dlist_node lru_node; /* Pointer to next/prev key in LRU list */ > } ResultCacheKey; > > Since we store it as a MinimalTuple, we need some FETCH_INNER_VAR step for > each element during the ResultCacheHash_equal call. I am thinking if we > can > store a "Datum *" directly to save these steps. > exec_aggvalues/exec_aggnulls looks > a good candidate for me, except that the name looks not good. IMO, we can > rename exec_aggvalues/exec_aggnulls and try to merge > EEOP_AGGREF/EEOP_WINDOW_FUNC into a more generic step which can be > reused in this case. > > 4. I think the ExecClearTuple in prepare_probe_slot is not a must, since > the > data tts_values/tts_flags/tts_nvalid are all reset later, and tts_tid is > not > real used in our case. 
Since both prepare_probe_slot > and ResultCacheHash_equal are on a pretty hot path, we may need to consider > it. > > static inline void > prepare_probe_slot(ResultCacheState *rcstate, ResultCacheKey *key) > { > ... > ExecClearTuple(pslot); > ... > } > > > static void > tts_virtual_clear(TupleTableSlot *slot) > { > if (unlikely(TTS_SHOULDFREE(slot))) > { > VirtualTupleTableSlot *vslot = (VirtualTupleTableSlot *) slot; > > pfree(vslot->data); > vslot->data = NULL; > > slot->tts_flags &= ~TTS_FLAG_SHOULDFREE; > } > > slot->tts_nvalid = 0; > slot->tts_flags |= TTS_FLAG_EMPTY; > ItemPointerSetInvalid(&slot->tts_tid); > } > > -- > Best Regards > Andy Fan > Adding 2 more comments. 1. I'd suggest adding Assert(false); in the RC_END_OF_SCAN case to make the error clearer. case RC_END_OF_SCAN: /* * We've already returned NULL for this scan, but just in case * something calls us again by mistake. */ return NULL; 2. Currently we handle the (!cache_store_tuple(node, outerslot))) case by setting it to RC_CACHE_BYPASS_MODE. The only reason for a cache_store_tuple failure is that cache_reduce_memory fails. I guess if cache_reduce_memory failed once, it would not succeed later (no more tuples can be stored, since nothing has changed). So I think we can record this state and avoid any new cache_reduce_memory call. /* * If we failed to create the entry or failed to store the * tuple in the entry, then go into bypass mode. */ if (unlikely(entry == NULL || !cache_store_tuple(node, outerslot))) to if (unlikely(entry == NULL || node->memory_cant_be_reduced || !cache_store_tuple(node, outerslot))) -- Best Regards Andy Fan
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Hi David: I did a review on the v8; it looks great to me. Here are some tiny things noted, just FYI. 1. modified src/include/utils/selfuncs.h @@ -70,9 +70,9 @@ * callers to provide further details about some assumptions which were made * during the estimation. */ -#define SELFLAG_USED_DEFAULT (1 << 0) /* Estimation fell back on one of - * the DEFAULTs as defined above. - */ +#define SELFLAG_USED_DEFAULT (1 << 0) /* Estimation fell back on one + * of the DEFAULTs as defined + * above. */ It looks like nothing has changed. 2. The leading spaces are not necessary in comments. /* * ResultCacheTuple Stores an individually cached tuple */ typedef struct ResultCacheTuple { MinimalTuple mintuple; /* Cached tuple */ struct ResultCacheTuple *next; /* The next tuple with the same parameter * values or NULL if it's the last one */ } ResultCacheTuple; 3. We define ResultCacheKey as below. /* * ResultCacheKey * The hash table key for cached entries plus the LRU list link */ typedef struct ResultCacheKey { MinimalTuple params; dlist_node lru_node; /* Pointer to next/prev key in LRU list */ } ResultCacheKey; Since we store it as a MinimalTuple, we need some FETCH_INNER_VAR step for each element during the ResultCacheHash_equal call. I wonder if we can store a "Datum *" directly to save these steps. exec_aggvalues/exec_aggnulls looks like a good candidate to me, except that the name doesn't look good. IMO, we can rename exec_aggvalues/exec_aggnulls and try to merge EEOP_AGGREF/EEOP_WINDOW_FUNC into a more generic step which can be reused in this case. 4. I think the ExecClearTuple in prepare_probe_slot is not a must, since the data tts_values/tts_flags/tts_nvalid are all reset later, and tts_tid is not really used in our case. Since both prepare_probe_slot and ResultCacheHash_equal are on a pretty hot path, we may need to consider it. static inline void prepare_probe_slot(ResultCacheState *rcstate, ResultCacheKey *key) { ... ExecClearTuple(pslot); ... 
} static void tts_virtual_clear(TupleTableSlot *slot) { if (unlikely(TTS_SHOULDFREE(slot))) { VirtualTupleTableSlot *vslot = (VirtualTupleTableSlot *) slot; pfree(vslot->data); vslot->data = NULL; slot->tts_flags &= ~TTS_FLAG_SHOULDFREE; } slot->tts_nvalid = 0; slot->tts_flags |= TTS_FLAG_EMPTY; ItemPointerSetInvalid(&slot->tts_tid); } -- Best Regards Andy Fan
Parallel plans and "union all" subquery
Hi, I have a question about parallel plans. I also posted it on the general list but perhaps it's a question for hackers. Here is my test case : select version(); version -- PostgreSQL 13.1 (Ubuntu 13.1-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit create unlogged table drop_me as select generate_series(1,7e7) n1; SELECT 7000 explain select count(*) from (select n1 from drop_me ) s; QUERY PLAN -- Finalize Aggregate (cost=675319.13..675319.14 rows=1 width=8) -> Gather (cost=675318.92..675319.13 rows=2 width=8) Workers Planned: 2 -> Partial Aggregate (cost=674318.92..674318.93 rows=1 width=8) -> Parallel Seq Scan on drop_me (cost=0.00..601402.13 rows=29166713 width=0) JIT: Functions: 4 Options: Inlining true, Optimization true, Expressions true, Deforming true Parallel plan, 1s explain select count(*) from (select n1 from drop_me union all select n1 from drop_me) ua; QUERY PLAN -- Finalize Aggregate (cost=1640315.00..1640315.01 rows=1 width=8) -> Gather (cost=1640314.96..1640314.99 rows=2 width=8) Workers Planned: 2 -> Partial Aggregate (cost=1640304.96..1640304.97 rows=1 width=8) -> Parallel Append (cost=0.00..1494471.40 rows=58333426 width=0) -> Parallel Seq Scan on drop_me (cost=0.00..601402.13 rows=29166713 width=0) -> Parallel Seq Scan on drop_me drop_me_1 (cost=0.00..601402.13 rows=29166713 width=0) JIT: Functions: 6 Options: Inlining true, Optimization true, Expressions true, Deforming true Parallel plan, 2s2 explain select count(*) from (select n1 from drop_me union all values(1)) ua; QUERY PLAN Aggregate (cost=2934739.24..2934739.25 rows=1 width=8) -> Append (cost=0.00..2059737.83 rows=7113 width=32) -> Seq Scan on drop_me (cost=0.00..1009736.12 rows=7112 width=6) -> Subquery Scan on "*SELECT* 2" (cost=0.00..0.02 rows=1 width=32) -> Result (cost=0.00..0.01 rows=1 width=4) JIT: Functions: 4 Options: Inlining true, Optimization true, Expressions true, Deforming true No parallel plan, 2s6 I read the 
documentation, but I don't understand the reason for the non-parallel seq scan of drop_me in the last case? Best regards, Phil
Re: Removal of currtid()/currtid2() and some table AM cleanup
On Sat, Nov 21, 2020 at 09:39:28PM -0500, Tom Lane wrote: > Michael Paquier writes: >> So, what you are basically saying is to switch currtid_byreloid() to >> become a function local to tid.c. And then have just >> currtid_byrelname() and currtid_for_view() call that, right? > > Yeah, that sounds about right. Okay, here you go with the attached. If there are any other comments, please feel free. -- Michael diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index 92b19dba32..54b2eb7378 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -129,7 +129,6 @@ extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation, bool *all_dead, bool first_call); extern void heap_get_latest_tid(TableScanDesc scan, ItemPointer tid); -extern void setLastTid(const ItemPointer tid); extern BulkInsertState GetBulkInsertState(void); extern void FreeBulkInsertState(BulkInsertState); diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h index c6da0df868..5f33dc9db0 100644 --- a/src/include/catalog/catversion.h +++ b/src/include/catalog/catversion.h @@ -53,6 +53,6 @@ */ /* mmddN */ -#define CATALOG_VERSION_NO 202011191 +#define CATALOG_VERSION_NO 202011221 #endif diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 33dacfd340..e7fbda9f81 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -2549,9 +2549,6 @@ { oid => '1292', proname => 'tideq', proleakproof => 't', prorettype => 'bool', proargtypes => 'tid tid', prosrc => 'tideq' }, -{ oid => '1293', descr => 'latest tid of a tuple', - proname => 'currtid', provolatile => 'v', proparallel => 'u', - prorettype => 'tid', proargtypes => 'oid tid', prosrc => 'currtid_byreloid' }, { oid => '1294', descr => 'latest tid of a tuple', proname => 'currtid2', provolatile => 'v', proparallel => 'u', prorettype => 'tid', proargtypes => 'text tid', diff --git a/src/backend/executor/nodeModifyTable.c 
b/src/backend/executor/nodeModifyTable.c index 29e07b7228..e0f24283b8 100644 --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -645,10 +645,7 @@ ExecInsert(ModifyTableState *mtstate, } if (canSetTag) - { (estate->es_processed)++; - setLastTid(&slot->tts_tid); - } /* * If this insert is the result of a partition key update that moved the diff --git a/src/backend/utils/adt/tid.c b/src/backend/utils/adt/tid.c index 509a0fdffc..6f8b400e83 100644 --- a/src/backend/utils/adt/tid.c +++ b/src/backend/utils/adt/tid.c @@ -47,6 +47,8 @@ #define DELIM ',' #define NTIDARGS 2 +static ItemPointer currtid_for_view(Relation viewrel, ItemPointer tid); + /* * tidin * @@ -275,12 +277,44 @@ hashtidextended(PG_FUNCTION_ARGS) * Maybe these implementations should be moved to another place */ -static ItemPointerData Current_last_tid = {{0, 0}, 0}; - -void -setLastTid(const ItemPointer tid) +/* + * Utility wrapper for current CTID functions. + * Returns the latest version of a tuple pointing at "tid" for + * relation "rel". 
+ */ +static ItemPointer +currtid_internal(Relation rel, ItemPointer tid) { - Current_last_tid = *tid; + ItemPointer result; + AclResult aclresult; + Snapshot snapshot; + TableScanDesc scan; + + result = (ItemPointer) palloc(sizeof(ItemPointerData)); + + aclresult = pg_class_aclcheck(RelationGetRelid(rel), GetUserId(), + ACL_SELECT); + if (aclresult != ACLCHECK_OK) + aclcheck_error(aclresult, get_relkind_objtype(rel->rd_rel->relkind), + RelationGetRelationName(rel)); + + if (rel->rd_rel->relkind == RELKIND_VIEW) + return currtid_for_view(rel, tid); + + if (!RELKIND_HAS_STORAGE(rel->rd_rel->relkind)) + elog(ERROR, "cannot look at latest visible tid for relation \"%s.%s\"", + get_namespace_name(RelationGetNamespace(rel)), + RelationGetRelationName(rel)); + + ItemPointerCopy(tid, result); + + snapshot = RegisterSnapshot(GetLatestSnapshot()); + scan = table_beginscan_tid(rel, snapshot); + table_tuple_get_latest_tid(scan, result); + table_endscan(scan); + UnregisterSnapshot(snapshot); + + return result; } /* @@ -288,7 +322,7 @@ setLastTid(const ItemPointer tid) * CTID should be defined in the view and it must * correspond to the CTID of a base relation. */ -static Datum +static ItemPointer currtid_for_view(Relation viewrel, ItemPointer tid) { TupleDesc att = RelationGetDescr(viewrel); @@ -338,12 +372,12 @@ currtid_for_view(Relation viewrel, ItemPointer tid) rte = rt_fetch(var->varno, query->rtable); if (rte) { - Datum result; + ItemPointer result; + Relation rel; - result = DirectFunctionCall2(currtid_byreloid, - ObjectIdGetDatum(rte->relid), - PointerGetDatum(tid)); - table_close(viewrel, A
Re: Issue with server side statement-level rollback
Le 20/11/2020 à 16:18, Gilles Darold a écrit : > I will work later on a POC to demonstrate the use case I want to > implement. Hi Andres, I have created a new version of the pg_statement_rollback extension [1] to demonstrate the use of the hooks on start_xact_command(), finish_xact_command() and AbortCurrentTransaction() to implement the statement-level rollback feature entirely driven server-side. It requires that the patch [2] I've provided be applied to the PostgreSQL source first. Here is what can be achieved with this patch: LOAD 'pg_statement_rollback.so'; LOAD SET pg_statement_rollback.enabled TO on; SET CREATE SCHEMA testrsl; CREATE SCHEMA SET search_path TO testrsl,public; SET BEGIN; BEGIN CREATE TABLE tbl_rsl(id integer, val varchar(256)); CREATE TABLE INSERT INTO tbl_rsl VALUES (1, 'one'); INSERT 0 1 WITH write AS (INSERT INTO tbl_rsl VALUES (2, 'two') RETURNING id, val) SELECT * FROM write; id | val +- 2 | two (1 row) UPDATE tbl_rsl SET id = 'two', val = 2 WHERE id = 1; -- > will fail psql:simple.sql:14: ERROR: invalid input syntax for type integer: "two" LINE 1: UPDATE tbl_rsl SET id = 'two', val = 2 WHERE id = 1; ^ SELECT * FROM tbl_rsl; -- Should show records id 1 + 2 id | val +- 1 | one 2 | two (2 rows) COMMIT; COMMIT Actually, unlike what I thought, it is the hook on finish_xact_command() that is useless. In the extension I'm executing the RELEASE/SAVEPOINT in the start_xact_command() hook before executing the next statement. The hook on AbortCurrentTransaction() is used to signal that a ROLLBACK TO SAVEPOINT needs to be executed in the start_xact_command() hook instead of a RELEASE/SAVEPOINT. This works perfectly and does not crash PG anymore when compiled with asserts. Advanced tests (with triggers, client savepoints, CTEs, etc.) are available in the test/sql/ directory. Running "make installcheck" executes the regression tests. 
Based on this result I really think that these hooks should be included so that PostgreSQL can be extended for such a feature, although I have not thought of any use other than this one. Regards, I've attached all code for archiving but the current version can be found here too: [1] https://github.com/darold/pg_statement_rollbackv2 [2] https://raw.githubusercontent.com/darold/pg_statement_rollbackv2/main/command-start-finish-hook-v1.patch -- Gilles Darold http://www.darold.net/ diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 03c553e7ea..81e1df27ef 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -314,6 +314,9 @@ typedef struct SubXactCallbackItem static SubXactCallbackItem *SubXact_callbacks = NULL; +/* Hook for plugins to get control of at end of AbortCurrentTransaction */ +AbortCurrentTransaction_hook_type abort_current_transaction_hook = NULL; + /* local function prototypes */ static void AssignTransactionId(TransactionState s); static void AbortTransaction(void); @@ -3358,6 +3361,9 @@ AbortCurrentTransaction(void) AbortCurrentTransaction(); break; } + + if (abort_current_transaction_hook) + (*abort_current_transaction_hook) (); } /* diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 7c5f7c775b..3fff54ff51 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -192,6 +192,14 @@ static void log_disconnections(int code, Datum arg); static void enable_statement_timeout(void); static void disable_statement_timeout(void); +/* + * Hook for plugins to get control at end/start of + * start_xact_command/finish_xact_command. The hook + * on start_xact_command might be useless as it happens + * before UtilityProccess and the Executor* hooks. 
+ */ +XactCommandStart_hook_type start_xact_command_hook = NULL; +XactCommandFinish_hook_type finish_xact_command_hook = NULL; /* * routines to obtain user input @@ -2634,6 +2642,7 @@ exec_describe_portal_message(const char *portal_name) static void start_xact_command(void) { + if (!xact_started) { StartTransactionCommand(); @@ -2649,11 +2658,19 @@ start_xact_command(void) * not desired, the timeout has to be disabled explicitly. */ enable_statement_timeout(); + + if (start_xact_command_hook) + (*start_xact_command_hook) (); + } + static void finish_xact_command(void) { + if (finish_xact_command_hook) + (*finish_xact_command_hook) (); + /* cancel active statement timeout after each command */ disable_statement_timeout(); diff --git a/src/include/access/xact.h b/src/include/access/xact.h index 7320de345c..2e866b2a91 100644 --- a/src/include/access/xact.h +++ b/src/include/access/xact.h @@ -467,4 +467,8 @@ extern void EnterParallelMode(void); extern void ExitPar
[PATCH 1/1] Initial mach based shared memory support.
OSX implements sysv shmem support via a mach wrapper, however the mach sysv shmem wrapper has some severe restrictions that prevent us from allocating enough memory segments in some cases. These limits appear to be due to the way the wrapper itself is implemented and not mach. For example when running a test suite that spins up many postgres instances I commonly see this issue: DETAIL: Failed system call was shmget(key=5432002, size=56, 03600). In order to avoid hitting these limits we can bypass the wrapper layer and just use mach directly. This is roughly how all major browsers seem to implement shared memory support on OSX. This still needs a lot of cleanup but seems to compile and basic tests seem to pass. --- configure | 15 +- configure.ac | 11 +- src/backend/port/mach_shmem.c | 797 ++ src/include/pg_config.h.in| 3 + src/include/postgres.h| 14 + 5 files changed, 831 insertions(+), 9 deletions(-) create mode 100644 src/backend/port/mach_shmem.c diff --git a/configure b/configure index dd64692345..5a2eb14dea 100755 --- a/configure +++ b/configure @@ -18043,16 +18043,21 @@ fi # Select shared-memory implementation type. -if test "$PORTNAME" != "win32"; then +if test "$PORTNAME" = "win32"; then -$as_echo "#define USE_SYSV_SHARED_MEMORY 1" >>confdefs.h +$as_echo "#define USE_WIN32_SHARED_MEMORY 1" >>confdefs.h - SHMEM_IMPLEMENTATION="src/backend/port/sysv_shmem.c" + SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c" +elif test "$PORTNAME" = "darwin"; then + +$as_echo "#define USE_MACH_SHARED_MEMORY 1" >>confdefs.h + + SHMEM_IMPLEMENTATION="src/backend/port/mach_shmem.c" else -$as_echo "#define USE_WIN32_SHARED_MEMORY 1" >>confdefs.h +$as_echo "#define USE_SYSV_SHARED_MEMORY 1" >>confdefs.h - SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c" + SHMEM_IMPLEMENTATION="src/backend/port/sysv_shmem.c" fi # Select random number source. 
If a TLS library is used then it will be the diff --git a/configure.ac b/configure.ac index 748fb50236..1bbcd5dc78 100644 --- a/configure.ac +++ b/configure.ac @@ -2144,12 +2144,15 @@ fi # Select shared-memory implementation type. -if test "$PORTNAME" != "win32"; then - AC_DEFINE(USE_SYSV_SHARED_MEMORY, 1, [Define to select SysV-style shared memory.]) - SHMEM_IMPLEMENTATION="src/backend/port/sysv_shmem.c" -else +if test "$PORTNAME" = "win32"; then AC_DEFINE(USE_WIN32_SHARED_MEMORY, 1, [Define to select Win32-style shared memory.]) SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c" +elif test "$PORTNAME" = "darwin"; then + AC_DEFINE(USE_MACH_SHARED_MEMORY, 1, [Define to select Mach-style shared memory.]) + SHMEM_IMPLEMENTATION="src/backend/port/mach_shmem.c" +else + AC_DEFINE(USE_SYSV_SHARED_MEMORY, 1, [Define to select SysV-style shared memory.]) + SHMEM_IMPLEMENTATION="src/backend/port/sysv_shmem.c" fi # Select random number source. If a TLS library is used then it will be the diff --git a/src/backend/port/mach_shmem.c b/src/backend/port/mach_shmem.c new file mode 100644 index 00..2fe42ed770 --- /dev/null +++ b/src/backend/port/mach_shmem.c @@ -0,0 +1,797 @@ +/*- + * + * mach_shmem.c + * Implement shared memory using mach facilities + * + * These routines used to be a fairly thin layer on top of mach shared + * memory functionality. With the addition of anonymous-shmem logic, + * they're a bit fatter now. We still require a mach shmem block to + * exist, though, because mmap'd shmem provides no way to find out how + * many processes are attached, which we need for interlocking purposes. 
+ * + * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/port/mach_shmem.c + * + *- + */ +#include "postgres.h" + +#include +#include +#include +#ifdef HAVE_SYS_IPC_H +#include +#endif + +#include "miscadmin.h" +#include "portability/mem.h" +#include "storage/dsm.h" +#include "storage/ipc.h" +#include "storage/pg_shmem.h" +#include "utils/pidfile.h" + +#include +#include +#include + + +/* + * As of PostgreSQL 9.3, we normally allocate only a very small amount of + * System V shared memory, and only for the purposes of providing an + * interlock to protect the data directory. The real shared memory block + * is allocated using mmap(). This works around the problem that many + * systems have very low limits on the amount of System V shared memory + * that can be allocated. Even a limit of a few megabytes will be enough + * to run many copies of PostgreSQL without needing to adjust system settings. + * + * We assume that no one will attempt to run PostgreSQL 9.3 or later on + * systems that are ancient enough that anonymous shared memory is not