Re: bug in pageinspect's "tuple data" feature

2020-11-22 Thread Heikki Linnakangas

On 21/11/2020 21:32, Alvaro Herrera wrote:

If you have a sufficiently broken data page, pageinspect throws an
error when trying to examine the page:

ERROR:  invalid memory alloc request size 18446744073709551451

This is pretty unhelpful; it would be better not to try to print the
data instead of dying.  With that, at least you can know where the
problem is.

This was introduced in d6061f83a166 (2015).  Proposed patch to fix it
(by having the code print a null "data" instead of dying) is attached.
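
For context, a hedged sketch of the kind of pageinspect call that trips over
this (get_raw_page and heap_page_item_attrs are pageinspect functions; the
relation name is hypothetical). On a sufficiently broken page the call
currently dies with the alloc error above instead of flagging the bad tuple:

-- examine the per-attribute tuple data of block 0 of a damaged table
SELECT lp, t_ctid, t_attrs
FROM heap_page_item_attrs(get_raw_page('damaged_table', 0),
                          'damaged_table'::regclass);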


Null seems misleading. Maybe something like "invalid", or print a warning?

- Heikki




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-11-22 Thread Bharath Rupireddy
>
> Here is how I'm making 4 separate patches:
>
> 1. new function and its documentation.
> 2. GUC and its documentation.
> 3. server level option and its documentation.
> 4. test cases for all of the above patches.
>

Hi, I'm attaching the patches here. Note that, though the code changes
for this feature are small, I divided them up as separate patches to
make review easy.

v1-0001-postgres_fdw-function-to-discard-cached-connections.patch
This patch adds a new function that gets defined as part of CREATE
EXTENSION postgres_fdw. When called with a foreign server name,
postgres_fdw_disconnect() discards the cached connections associated with
that server; when called without any argument, it discards all existing
cached connections.
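
A hedged usage sketch based on this description (the server name is
hypothetical, and the exact signature is whatever the patch defines):

-- discard the cached connection(s) associated with one foreign server
SELECT postgres_fdw_disconnect('foreign_pg_server');

-- discard all cached connections held by this session
SELECT postgres_fdw_disconnect();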

v1-0002-postgres_fdw-add-keep_connections-GUC-to-not-cache-connections.patch
This patch adds a new GUC, postgres_fdw.keep_connections, default on;
when set to off, no remote connections are cached by the local session.
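
For illustration, assuming the GUC behaves as described (a sketch, not
tested against the patch):

-- stop caching remote connections for the rest of this session
SET postgres_fdw.keep_connections TO off;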

v1-0003-postgres_fdw-server-level-option-keep_connection.patch
This patch adds a new server-level option, keep_connection, default on;
when set to off, the local session doesn't cache the connections
associated with the foreign server.
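
Again as a sketch, using standard ALTER SERVER syntax with the option name
from this patch (the server name is hypothetical):

-- per-server override: don't cache connections to this particular server
ALTER SERVER foreign_pg_server OPTIONS (ADD keep_connection 'off');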

v1-0004-postgres_fdw-connection-cache-discard-tests-and-documentation.patch
This patch adds the tests and documentation related to this feature.

Please review the patches.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v1-0001-postgres_fdw-function-to-discard-cached-connections.patch
Description: Binary data


v1-0002-postgres_fdw-add-keep_connections-GUC-to-not-cache-connections.patch
Description: Binary data


v1-0003-postgres_fdw-server-level-option-keep_connection.patch
Description: Binary data


v1-0004-postgres_fdw-connection-cache-discard-tests-and-documentation.patch
Description: Binary data


Re: Parallel plans and "union all" subquery

2020-11-22 Thread Greg Nancarrow
On Sun, Nov 22, 2020 at 11:51 PM Phil Florent  wrote:
>
>
> Hi,
>
>
> I have a question about parallel plans. I also posted it on the general list 
> but perhaps it's a question for hackers. Here is my test case:
>
>
> explain
> select count(*)
> from (select
> n1
> from drop_me
> union all
> values(1)) ua;
>
>
> QUERY PLAN
> 
> Aggregate  (cost=2934739.24..2934739.25 rows=1 width=8)
>   ->  Append  (cost=0.00..2059737.83 rows=7113 width=32)
>         ->  Seq Scan on drop_me  (cost=0.00..1009736.12 rows=7112 width=6)
>         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..0.02 rows=1 width=32)
>               ->  Result  (cost=0.00..0.01 rows=1 width=4)
> JIT:
> Functions: 4
> Options: Inlining true, Optimization true, Expressions true, Deforming true
>
>
> No parallel plan, 2.6 s
>
>
> I read the documentation but I don't understand the reason for the
> non-parallel seq scan of drop_me in the last case?
>

Without debugging this, it looks to me like the UNION type resolution
isn't working as well as it possibly could in this case for the
generation of a parallel plan. I found that with a minor tweak to your
SQL, either to the table creation or to the query, it will produce a
parallel plan.

Noting that currently you're creating the drop_me table with a
"numeric" column, you can either:

(1) Change the table creation

FROM:
create unlogged table drop_me as select generate_series(1,7e7) n1;
TO:
create unlogged table drop_me as select generate_series(1,7e7)::int n1;


OR


(2) Change the query

FROM:
explain
select count(*)
from (select
n1
from drop_me
union all
values(1)) ua;

TO:

explain
select count(*)
from (select
n1
from drop_me
union all
values(1::numeric)) ua;


QUERY PLAN

 Finalize Aggregate  (cost=821152.71..821152.72 rows=1 width=8)
   ->  Gather  (cost=821152.50..821152.71 rows=2 width=8)
 Workers Planned: 2
 ->  Partial Aggregate  (cost=820152.50..820152.51 rows=1 width=8)
   ->  Parallel Append  (cost=0.00..747235.71 rows=29166714 width=0)
 ->  Result  (cost=0.00..0.01 rows=1 width=0)
 ->  Parallel Seq Scan on drop_me
(cost=0.00..601402.13 rows=29166713 width=0)
(7 rows)


Regards,
Greg Nancarrow
Fujitsu Australia




Re: PoC/WIP: Extended statistics on expressions

2020-11-22 Thread Tomas Vondra



On 11/23/20 3:26 AM, Justin Pryzby wrote:
> On Sun, Nov 22, 2020 at 08:03:51PM +0100, Tomas Vondra wrote:
>> attached is a significantly improved version of the patch, allowing
>> defining extended statistics on expressions. This fixes most of the
>> problems in the previous WIP version and AFAICS it does pass all
>> regression tests (including under valgrind). There's a bunch of FIXMEs
>> and a couple loose ends, but overall I think it's ready for reviews.
> 
> I was looking at the previous patch, so now read this one instead, and attach
> some proposed fixes.
> 
> +  * This matters especially for * expensive expressions, of course.
> 

The point this was trying to make is that we evaluate the expressions
only once and use the results to build all the extended statistics,
instead of leaving it up to every "build" to re-evaluate them.

> +   The expression can refer only to columns of the underlying table, but
> +   it can use all columns, not just the ones the statistics is defined
> +   on.
> 
> I don't know what these are trying to say?
> 

D'oh. That's a bogus paragraph, copied from the CREATE INDEX docs (where
it talked about the index predicate, which is irrelevant here).

> +errmsg("statistics expressions and 
> predicates can refer only to the table being indexed")));
> +* partial-index predicates.  Create it in the per-index context to be
> 
> I think these are copied and shouldn't mention "indexes" or "predicates".  Or
> should statistics support predicates, too ?
> 

Right. Stupid copy-pasto.

> Idea: if a user specifies no stakinds, and there's no expression specified,
> then you automatically build everything except for expressional stats.  But if
> they specify only one statistics "column", it gives an error.  If that's a
> non-simple column reference, should that instead build *only* expressional
> stats (possibly with a NOTICE, since the user might be intending to make MV
> stats).
> 

Right, that was the intention - but I messed up and it works only if you
specify the "expressions" kind explicitly (and I also added the ERROR
message to expected output by mistake). I agree we should handle this
automatically, so that

   CREATE STATISTICS s ON (a+b) FROM t

works and only creates statistics for the expression.
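
To make that concrete, a hedged sketch of the two spellings per this
thread's WIP patch (the table is hypothetical):

CREATE TABLE t (a int, b int);

-- works in the current WIP only with the kind spelled out explicitly
CREATE STATISTICS s1 (expressions) ON (a + b) FROM t;

-- intended to create expression statistics only, with no explicit kind
CREATE STATISTICS s2 ON (a + b) FROM t;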


> I think pg_stats_ext should allow inspecting the pg_statistic data in
> pg_statistic_ext_data.stxdexprs.  I guess array_agg() should be ordered by
> something, so maybe it should use ORDINALITY (?)
> 

I agree we should expose the expression statistics, but I'm not
convinced we should do that in the pg_stats_ext view itself. The problem
is that it's essentially a table nested in a table, with non-trivial
structure, so I was thinking about adding a separate view exposing just
this one part - something like pg_stats_ext_expressions, with about the
same structure as pg_stats.

> I hacked more on bootstrap.c so included that here.

Thanks. As for the 0004-0007 patches:

0004 - Seems fine. IMHO not really "silly errors" but OK.

0005 - Mostly OK. The docs wording mostly comes from CREATE INDEX docs,
though. The paragraph about "t1" is old, so if we want to reword it then
maybe we should backpatch too.

0006 - Not sure. I think CreateStatistics can be fixed with less code,
keeping it more like PG13 (good for backpatching). Not sure why it
renames extended statistics to multi-variate statistics - we use
"extended" everywhere. Not sure what the point of the
serialize_expr_stats changes is; that code is mostly copy-pasted from
update_attstats.

0007 - I suspect this makes the pg_stats_ext too complex to work with,
IMHO we should move this to a separate view.


Thanks for the review! I'll try to look more closely at those patches
sometime next week, and merge most of it.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




RE: Disable WAL logging to speed up data loading

2020-11-22 Thread tsunakawa.ta...@fujitsu.com
From: Osumi, Takamichi/大墨 昂道 
> This time, I updated my patch to address comments below only.

I forgot to mention this.  I confirmed that all review comments are
reflected correctly.


Regards
Takayuki Tsunakawa



RE: Disable WAL logging to speed up data loading

2020-11-22 Thread tsunakawa.ta...@fujitsu.com
From: Osumi, Takamichi/大墨 昂道 
> > case TRANS_STMT_PREPARE:
> > +   if (wal_level == WAL_LEVEL_NONE)
> > +   ereport(ERROR,
> > +   errmsg("cannot execute PREPARE TRANSACTION when WAL logging is disabled"));
> Usually, this safeguard is not needed for PREPARE TRANSACTION.
> But when an unexpected crash happens, the user needs to recreate the cluster
> from the backup and execute again the operations that were executed on the
> crashed server, if the user sets wal_level=none.
> 
> While 2PC guarantees that after issuing PREPARE TRANSACTION, the
> processing or the contents of the transaction becomes *secure*, setting
> wal_level=none makes the server never start up again if a database crash
> occurs.
> In other words, the new wal_level damages an important aspect of
> PREPARE TRANSACTION's functionality, in my opinion.
> 
> I don't think this is what it should be.
> 
> Therefore, I don't want users to be able to use PREPARE TRANSACTION here,
> to prevent them from thinking that they can use PREPARE TRANSACTION
> safely under wal_level=none and executing it.
> Did that make sense?

PREPARE TRANSACTION is the same as COMMIT in that it persists transaction 
updates.  A crash during wal_level = none loses both of them.  So, I don't 
think PREPARE TRANSACTION needs special treatment.
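
For reference, a minimal 2PC sequence of the kind being discussed (the
table and GID are hypothetical; max_prepared_transactions must be > 0):

BEGIN;
INSERT INTO loaded_data VALUES (1);
PREPARE TRANSACTION 'tx_load_1';   -- phase one: persist the transaction
COMMIT PREPARED 'tx_load_1';       -- phase two

Under wal_level = none, a crash between the two phases loses the prepared
state just as it loses committed data, which is the argument above that
PREPARE TRANSACTION needs no special safeguard.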

Does PREPARE TRANSACTION complete successfully?  I remember Fujii-san said it
PANICed if --enable-cassert is used in configure.


Regards
Takayuki Tsunakawa





RE: [POC] Fast COPY FROM command for the table with foreign partitions

2020-11-22 Thread tsunakawa.ta...@fujitsu.com
Hi Andrey-san,

From: Tomas Vondra 
> I needed to look at this patch while working on something related, and I 
> found it
> got broken by 6973533650c a couple days ago. So here's a fixed version, to 
> keep
> cfbot happy. I haven't done any serious review yet.

Could I or my colleague continue this patch in a few days?  It looks like
it has stalled for over a month.


Regards
Takayuki Tsunakawa





RE: POC: postgres_fdw insert batching

2020-11-22 Thread tsunakawa.ta...@fujitsu.com
From: Tomas Vondra 
> I don't think this is usable in practice, because a single session may
> be using multiple FDW servers, with different implementations, latency
> to the data nodes, etc. It's unlikely a single GUC value will be
> suitable for all of them.

That makes sense.  The row size varies from table to table, so the user may 
want to tune this option to reduce memory consumption.
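
As a sketch of the kind of tuning that implies (assuming the option is
named batch_size as in this patch series; the server and table names are
hypothetical):

-- per-server default, overridable per foreign table
ALTER SERVER foreign_pg_server OPTIONS (ADD batch_size '100');
ALTER FOREIGN TABLE ft_orders OPTIONS (ADD batch_size '1000');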

I think the attached patch reflects all your comments.  I hope this will
pass.


Regards
Takayuki Tsunakawa





v4-0001-Add-bulk-insert-for-foreign-tables.patch
Description: v4-0001-Add-bulk-insert-for-foreign-tables.patch


Re: [PoC] Non-volatile WAL buffer

2020-11-22 Thread Tomas Vondra
Hi,

On 10/30/20 6:57 AM, Takashi Menjo wrote:
> Hi Heikki,
> 
>> I had a new look at this thread today, trying to figure out where 
>> we are.
> 
> I'm a bit confused.
>> 
>> One thing we have established: mmap()ing WAL files performs worse 
>> than the current method, if pg_wal is not on a persistent memory 
>> device. This is because the kernel faults in existing content of 
>> each page, even though we're overwriting everything.
> 
> Yes. In addition, after a certain page (in the sense of OS page) is 
> msync()ed, another page fault will occur again when something is 
> stored into that page.
> 
>> That's unfortunate. I was hoping that mmap() would be a good option
>> even without persistent memory hardware. I wish we could tell the
>> kernel to zero the pages instead of reading them from the file.
>> Maybe clear the file with ftruncate() before mmapping it?
> 
> The area extended by ftruncate() appears as if it were zero-filled
> [1]. Please note that it merely "appears as if." It might not be
> actually zero-filled as data blocks on the device, so pre-allocating
> files should improve transaction performance. At least, on Linux 5.7
> and ext4, it takes more time to store into a mapped file that was just
> open(O_CREAT)ed and ftruncate()d than into one that was already
> actually filled.
> 

Does it really matter that it only appears zero-filled? I think Heikki's
point was that ftruncate() might prevent the kernel from faulting in the
existing page content when we're overwriting it.

Not sure I understand what the benchmark with ext4 was doing, exactly.
How was that measured? Might be interesting to have some simple
benchmarking tool to demonstrate this (I believe a small standalone tool
written in C should do the trick).

>> That should not be problem with a real persistent memory device, 
>> however (or when emulating it with DRAM). With DAX, the storage is 
>> memory-mapped directly and there is no page cache, and no 
>> pre-faulting.
> 
> Yes, with filesystem DAX, there is no page cache for file data. A 
> page fault still occurs but for each 2MiB DAX hugepage, so its 
> overhead decreases compared with 4KiB page fault. Such a DAX
> hugepage fault is only applied to DAX-mapped files and is different
> from a general transparent hugepage fault.
> 

I don't follow - if there are page faults even when overwriting all the
data, I'd say it's still an issue even with 2MB DAX pages. How big is
the difference between 4kB and 2MB pages?

Not sure I understand how this is different from a general THP fault?

>> Because of that, I'm baffled by what the 
>> v4-0002-Non-volatile-WAL-buffer.patch does. If I understand it 
>> correctly, it puts the WAL buffers in a separate file, which is 
>> stored on the NVRAM. Why? I realize that this is just a Proof of 
>> Concept, but I'm very much not interested in anything that requires
>> the DBA to manage a second WAL location. Did you test the mmap()
>> patches with persistent memory hardware? Did you compare that with
>> the pmem patchset, on the same hardware? If there's a meaningful
>> performance difference between the two, what's causing it?

> Yes, this patchset puts the WAL buffers into the file specified by 
> "nvwal_path" in postgresql.conf.
> 
> The reason this patchset puts the buffers into a separate file, not the
> existing segment files in PGDATA/pg_wal, is that it reduces the
> overhead of system calls such as open(), mmap(), munmap(), and
> close(). It open()s and mmap()s the file "nvwal_path" once, and keeps
> that file mapped while running. On the other hand, with the patchset
> mmap()ing the segment files, a backend process must munmap() and
> close() the currently mapped file and open() and mmap() the new one
> each time the insert location for that process crosses a segment
> boundary. This causes the performance difference between the two.
> 

I kinda agree with Heikki here - having to manage yet another location
for WAL data is rather inconvenient. We should aim not to make the life
of DBAs unnecessarily difficult, IMO.

I wonder how significant the syscall overhead is - can you share some
numbers? I don't see any such results in this thread, so I'm not sure if
it means losing 1% or 10% throughput.

Also, maybe there are alternative ways to reduce the overhead? For
example, we could increase the size of the WAL segment - with 1GB
segments we'd do 1/64 of the syscalls. Or we could do some of this
asynchronously - request a segment ahead and let another process do the
actual work, so that the running process does not wait.


Do I understand correctly that the patch removes "regular" WAL buffers
and instead writes the data into the non-volatile PMEM buffer, without
writing that to the WAL segments at all (unless in archiving mode)?

Firstly, I guess many (most?) instances will have to write the WAL
segments anyway because of PITR/backups, so I'm not sure we can save
much here.

But more importantly - doesn't that mean the nvwal_siz

Re: [PoC] Non-volatile WAL buffer

2020-11-22 Thread Tomas Vondra
Hi,

These patches no longer apply :-( A rebased version would be nice.

I've been interested in what performance improvements this might bring,
so I've been running some extensive benchmarks on a machine with PMEM
hardware. So let me share some interesting results. (I used a commit from
early September, to make the patch apply cleanly.)

Note: The hardware was provided by Intel, and they are interested in
supporting the development and providing access to machines with PMEM to
developers. So if you're interested in this patch & PMEM, but don't have
access to suitable hardware, try contacting Steve Shaw, who's the person
responsible for open source databases at Intel (he's also the author of
HammerDB).


The benchmarks were done on a machine with 2 x Xeon Platinum (24/48
cores), 128GB RAM, NVMe and PMEM SSDs. I did some basic pgbench tests
with different scales (500, 5000, 15000) with and without these patches.
I did some usual tuning (shared buffers, max_wal_size etc.), the most
important changes being:

- maintenance_work_mem = 256MB
- max_connections = 200
- random_page_cost = 1.2
- shared_buffers = 16GB
- work_mem = 64MB
- checkpoint_completion_target = 0.9
- checkpoint_timeout = 20min
- max_wal_size = 96GB
- autovacuum_analyze_scale_factor = 0.1
- autovacuum_vacuum_insert_scale_factor = 0.05
- autovacuum_vacuum_scale_factor = 0.01
- vacuum_cost_limit = 1000

And on the patched version:

- nvwal_size = 128GB
- nvwal_path = … points to the PMEM DAX device …

The machine has multiple SSDs (all Optane-based, IIRC):

- NVMe SSD (Optane)
- PMEM in BTT mode
- PMEM in DAX mode

So I've tested all of them - the data was always on the NVMe device, and
the WAL was placed on one of those devices. That means we have these
four cases to compare:

- nvme - master with WAL on the NVMe SSD
- pmembtt - master with WAL on PMEM in BTT mode
- pmemdax - master with WAL on PMEM in DAX mode
- pmemdax-ntt - patched version with WAL on PMEM in DAX mode

The "nvme" is a bit disadvantaged as it places both data and WAL on the
same device, so consider that while evaluating the results. But for the
smaller data sets this should be fairly negligible, I believe.

I'm not entirely sure whether the "pmemdax" case (i.e. the unpatched
instance with WAL on the PMEM DAX device) is actually safe, but I
included it anyway to see what the difference is.

Now let's look at results for the basic data sizes and client counts.
I've also attached some charts to illustrate this. These numbers are tps
averages from 3 runs, each about 30 minutes long.


1) scale 500 (fits into shared buffers)
---

wal                1      16      32      64      96
----------------------------------------------------
nvme            6321   73794  132687  185409  192228
pmembtt         6248   60105   85272   82943   84124
pmemdax         6686   86188  154850  105219  149224
pmemdax-ntt     8062  104887  211722  231085  252593

The NVMe performs well (the single device is not an issue, as there
should be very little non-WAL I/O). The PMEM/BTT has a clear bottleneck
at ~85k tps. It's interesting that PMEM/DAX performs much worse without
the patch, and that there's a drop at 64 clients. Not sure what that's
about.


2) scale 5000 (fits into RAM)
-

wal               1     16     32     64     96
-----------------------------------------------
nvme           4804  43636  61443  79807  86414
pmembtt        4203  28354  37562  41562  43684
pmemdax        5580  62180  92361 112935 117261
pmemdax-ntt    6325  79887 128259 141793 127224

The differences are more significant compared to the small scale. The
BTT seems to have a bottleneck around ~43k tps; the PMEM/DAX dominates.


3) scale 15000 (bigger than RAM)


wal               1     16     32     64     96
-----------------------------------------------
pmembtt        3638  20630  28985  32019  31303
pmemdax        5164  48230  69822  85740  90452
pmemdax-ntt    5382  62359  80038  83779  80191

I have not included the nvme results here, because the impact of placing
both data and WAL on the same device was too significant, IMHO.

The remaining results seem nice. It's interesting that the patched case
is a bit slower than master. Not sure why.

Overall, these results seem pretty nice, I guess. Of course, this does
not say the current patch is the best way to implement this (or whether
it's correct), but it does suggest supporting PMEM might bring sizeable
performance boost.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [doc] improve tableoid description

2020-11-22 Thread Ian Lawrence Barwick
On Sat, 21 Nov 2020 at 16:29, Peter Eisentraut wrote:
>
> On 2020-10-19 14:28, Ian Lawrence Barwick wrote:
> > On further reflection, I think trying to explain all that is going to
> > end up as a
> > mini-tutorial which is beyond the scope of the explanation of a column, so
> > the existing reference to pg_class should be enough.
> >
> > Revised patch attached just mentioning partitioned tables.
>
> committed

Thanks!


-- 
EnterpriseDB: https://www.enterprisedb.com




Re: Bogus documentation for bogus geometric operators

2020-11-22 Thread Tom Lane
Pavel Borisov  writes:
>> undocumented.  Maybe instead of removing, change the text to be
>> "Deprecated, use the equivalent XXX operator instead."  Or we could
>> add a footnote similar to what was there for a previous renaming:

> The problem is that this new <<| is equivalent to <^ only for points. (To
> recap: the source of the problem is that the <^ operator has the same name
> for points and boxes, with different meanings for these types.)

I don't think it's that hard to be clear; see proposed wording below.

The other loose end is that I don't think we can take away the opclass
entries for the old spellings, unless we're willing to visibly break
people's queries by removing those operator names altogether.  That
doesn't seem like it'll fly when we haven't even deprecated the old
names yet.  So for now, we have to support both names in the opclasses.
I extended the patch to do that.

This version seems committable to me --- any thoughts?
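
To make the renaming concrete, a quick hedged illustration (the new
spelling exists only with this patch applied; both return the same result):

-- old spelling, deprecated by this patch
SELECT point '(1,0)' <^ point '(1,1)';    -- strictly below: t

-- new spelling
SELECT point '(1,0)' <<| point '(1,1)';   -- strictly below: t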

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7d06b979eb..507bc1a668 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -10609,7 +10609,7 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple


 Is first object strictly below second?
-Available for box, polygon,
+Available for point, box, polygon,
 circle.


@@ -10625,7 +10625,7 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple


 Is first object strictly above second?
-Available for box, polygon,
+Available for point, box, polygon,
 circle.


@@ -10680,21 +10680,6 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple

   
 
-  
-   
-point <^ point
-boolean
-   
-   
-Is first object strictly below second?
-(This operator is misnamed; it should be <<|.)
-   
-   
-point '(1,0)' <^ point '(1,1)'
-t
-   
-  
-
   

 box >^ box
@@ -10709,21 +10694,6 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple

   
 
-  
-   
-point >^ point
-boolean
-   
-   
-Is first object strictly above second?
-(This operator is misnamed; it should be |>>.)
-   
-   
-point '(1,1)' >^ point '(1,0)'
-t
-   
-  
-
   

 geometric_type ?# geometric_type
@@ -10877,6 +10847,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
 

 
+   
+
+ Before PostgreSQL 14, the point
+ is strictly below/above comparison operators point
+ <<| point and point
+ |>> point were respectively
+ called <^ and >^.  These
+ names are still available, but are deprecated and will eventually be
+ removed.
+
+   
+

 Geometric Functions
 
diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index 1bf5f09659..d1b6cc9a01 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -118,12 +118,12 @@
 
  
   point_ops
-  >^ (point,point)
+  |>> (point,point)
   <-> (point,point)
  
  << (point,point)
  >> (point,point)
- <^ (point,point)
+ <<| (point,point)
  ~= (point,point)
  <@ (point,box)
  <@ (point,polygon)
diff --git a/doc/src/sgml/spgist.sgml b/doc/src/sgml/spgist.sgml
index 68d09951d9..ea88ae45e5 100644
--- a/doc/src/sgml/spgist.sgml
+++ b/doc/src/sgml/spgist.sgml
@@ -76,7 +76,7 @@
  
   box_ops
   << (box,box)
-  <-> (box,point)  
+  <-> (box,point)
  
  &< (box,box)
  &> (box,box)
@@ -92,12 +92,12 @@
 
  
   kd_point_ops
-  >^ (point,point)
+  |>> (point,point)
   <-> (point,point)
  
  << (point,point)
  >> (point,point)
- <^ (point,point)
+ <<| (point,point)
  ~= (point,point)
  <@ (point,box)
 
@@ -132,16 +132,16 @@
  <<| (polygon,polygon)
  &<| (polygon,polygon)
  |>> (polygon,polygon)
- |&> (polygon,polygon)  
+ |&> (polygon,polygon)
 
  
   quad_point_ops
-  >^ (point,point)
+  |>> (point,point)
   <-> (point,point)
  
  << (point,point)
  >> (point,point)
- <^ (point,point)
+ <<| (point,point)
  ~= (point,point)
  <@ (point,box)
 
@@ -159,7 +159,7 @@
  &< (anyrange,anyrange)
  &> (anyrange,anyrange)
  -|- (anyrange,anyrange)
- 
+
  
   text_ops
   = (text,text)
diff --git a/src/backend/access/gist/gistproc.c b/src/backend/access/gist/gistproc.c
index b03c4b55a0..784807c636 100644
--- a/src/backend/access/gist/gistproc.c
+++ b/src/backend/access/gist/gistproc.c
@@ -1341,8 +1341,18 @@ gist_point_consistent(PG_FUNCTION_ARGS)
 	StrategyNumber strategy = (StrategyNumber) PG_GETARG_UINT16(2);
 	bo

Re: Why does create_gather_merge_plan need make_sort?

2020-11-22 Thread Tomas Vondra
On 11/22/20 10:31 PM, Tom Lane wrote:
> Tomas Vondra  writes:
>> On 11/20/20 11:24 PM, James Coleman wrote:
>>> While looking at another issue I noticed that create_gather_merge_plan
>>> calls make_sort if the subplan isn't sufficiently sorted. In all of
>>> the cases I've seen where a gather merge path (not plan) is created
>>> the input path is expected to be properly sorted, so I was wondering
>>> if anyone happened to know what case is being handled by the make_sort
>>> call. Removing it doesn't seem to break any tests.
> 
>> Yeah, I think you're right this is dead code, essentially. We're only
>> ever calling create_gather_merge_path() with pathkeys matching the
>> subpath. And it's like that on REL_12_STABLE too, i.e. before the
>> incremental sort was introduced.
> 
> It's probably there by analogy to the other callers of
> prepare_sort_from_pathkeys, which all do at least a conditional
> make_sort().  I'd be inclined to leave it there; it's cheap insurance
> against somebody weakening the existing behavior.
> 

But how do we know it's safe to actually do the sort there, e.g. in
light of the volatility/parallel-safety issues discussed in other threads?

I agree the check may be useful, but maybe we should just do elog(ERROR)
instead.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Fix generate_useful_gather_paths for parallel unsafe pathkeys

2020-11-22 Thread Tomas Vondra
On 11/21/20 2:55 AM, James Coleman wrote:
> Over on the -bugs list we had a report [1] of a seg fault with
> incremental sort. The short of the investigation there was that a
> subplan wasn't being serialized since it wasn't parallel safe, and
> that attempting to initialize that subplan resulted in the seg fault.
> Tom pushed a change to raise an ERROR when this occurs so that at
> least we don't crash the server.
> 
> There was some small amount of discussion about this on a thread about
> a somewhat similar issue [2] (where volatile expressions were the
> issue), but that thread resulted in a committed patch, and beyond that
> it seems that this issue deserves its own discussion.
> 
> I've been digging further into the problem, and my first question was
> "why doesn't this result in a similar error but with a full sort when
> incremental sort is disabled?" After all, we have code to consider a
> gather merge + full sort on the cheapest partial path in
> generate_useful_gather_paths(), and the plan that I get on Luis's test
> case with incremental sort disabled has an full sort above a gather,
> which presumably it'd be cheaper to push down into the parallel plan
> (using gather merge instead of gather).
> 
> It turns out that there's a separate bug here: I'm guessing that in
> the original commit of generate_useful_gather_paths() we had a
> copy/paste thinko in this code:
> 
> /*
>  * If the path has no ordering at all, then we can't use either
>  * incremental sort or rely on implicit sorting with a gather
>  * merge.
>  */
> if (subpath->pathkeys == NIL)
>  continue;
> 
> The comment claims that we should skip subpaths that aren't sorted at
> all since we can't possibly use a gather merge directly or with an
> incremental sort applied first. It's true that we can't do either of
> those things unless the subpath is already at least partially sorted.
> But subsequently we attempt to construct a gather merge path on top of
> a full sort (for the cheapest path), and there's no reason to limit
> that to partially sorted paths (indeed in the presence of incremental
> sort it seems unlikely that that would be the cheapest path).
> 
> We still don't want to build incremental sort paths if the subpath has
> no pathkeys, but that will imply presorted_keys = 0, which we already
> check.
> 
> So 0001 removes that branch entirely. And, as expected, we now get a
> gather merge + full sort and the resulting error about the subplan not
> being initialized.
> 

Yeah, that's clearly a thinko in generate_useful_gather_paths. Thanks
for spotting it! The 0001 patch seems fine to me.

> With that oddity out of the way, I started looking at the actually
> reported problem, and nominally issue is that we have a correlated
> subquery (in the final target list and the sort clause), and
> correlated subqueries (not sure why exactly in this case; see [3])
> aren't considered parallel-safe.
> 
> As to why that's happening:
> 1. find_em_expr_usable_for_sorting_rel doesn't exclude that
> expression, and arguably correctly so in the general case since one
> might (though we don't now) use this same kind of logic for
> non-parallel plans.
> 2. generate_useful_pathkeys_for_relation is noting that it'd be useful
> to sort on that expression and sees it as safe in part due to the (1).
> 3. We create a sort node that includes that expression in its output.
> That seems pretty much the same (in terms of safety given generalized
> input paths/plans, not in terms of what Robert brought up in [4]) as
> what would happen in the prepare_sort_from_pathkeys call in
> create_gather_merge_plan.
> 
> So to fix this problem I've added the option to find_em_expr_for_rel
> to restrict to parallel-safe expressions in 0002.
> 

OK, that seems like a valid fix. I wonder if we can have an EC with
members mixing parallel-safe and parallel-unsafe expressions? I guess
not, but that's more a feeling than clear reasoning, so I wonder what
would happen in such cases.

> Given point 3 above (and the discussion with Robert earlier) above it
> seems to me that there's a bigger problem here that just hasn't
> happened to have been exposed yet: in particular the
> prepare_sort_from_pathkeys call in create_gather_merge_plan seems
> suspect to me. But it's also possible other usages of
> prepare_sort_from_pathkeys could be suspect given other seemingly
> unrelated changed to path generation, since nothing other than
> implicit knowledge about the conventions here prevents us from
> introducing paths generating sorts with expressions not in the target
> list (in fact a part of the issue here IMO is that non-var expression
> pathkeys are added to the target list after path generation and
> selection is completed).
> 
> Alternatively we could "fix" this by being even more conservative in
> find_em_expr_usable_for_sorting_rel and disregard any expressions not
> currently found in the target list. But that seems unsatisfactory to
> me since, given we know that there are cases w

Re: Why does create_gather_merge_plan need make_sort?

2020-11-22 Thread Tom Lane
Tomas Vondra  writes:
> On 11/20/20 11:24 PM, James Coleman wrote:
>> While looking at another issue I noticed that create_gather_merge_plan
>> calls make_sort if the subplan isn't sufficiently sorted. In all of
>> the cases I've seen where a gather merge path (not plan) is created
>> the input path is expected to be properly sorted, so I was wondering
>> if anyone happened to know what case is being handled by the make_sort
>> call. Removing it doesn't seem to break any tests.

> Yeah, I think you're right this is dead code, essentially. We're only
> ever calling create_gather_merge_path() with pathkeys matching the
> subpath. And it's like that on REL_12_STABLE too, i.e. before the
> incremental sort was introduced.

It's probably there by analogy to the other callers of
prepare_sort_from_pathkeys, which all do at least a conditional
make_sort().  I'd be inclined to leave it there; it's cheap insurance
against somebody weakening the existing behavior.

regards, tom lane




Re: Why does create_gather_merge_plan need make_sort?

2020-11-22 Thread Tomas Vondra
On 11/20/20 11:24 PM, James Coleman wrote:
> While looking at another issue I noticed that create_gather_merge_plan
> calls make_sort if the subplan isn't sufficiently sorted. In all of
> the cases I've seen where a gather merge path (not plan) is created
> the input path is expected to be properly sorted, so I was wondering
> if anyone happened to know what case is being handled by the make_sort
> call. Removing it doesn't seem to break any tests.
> 

Yeah, I think you're right this is dead code, essentially. We're only
ever calling create_gather_merge_path() with pathkeys matching the
subpath. And it's like that on REL_12_STABLE too, i.e. before the
incremental sort was introduced.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [bug+patch] Inserting DEFAULT into generated columns from VALUES RTE

2020-11-22 Thread Tom Lane
Dean Rasheed  writes:
> I think it's actually easier to just do it all in the rewriter -- at
> the point where we see that we're about to insert potentially illegal
> values from a VALUES RTE into a generated column, scan it to see if
> all the values in that column are DEFAULTs, and if so trigger the
> existing code to replace the Var in the tlist with the generated
> column expression. That way we only do extra work in the case for
> which we're currently throwing an error, rather than for every query.

That makes sense, and it leads to a nicely small patch.  I reviewed
this and pushed it.  I found only one nitpicky bug: in
findDefaultOnlyColumns, the test must be bms_is_empty(default_only_cols)
not just default_only_cols == NULL, or it will fail to fall out early
as intended when the first row contains some DEFAULTs but later rows
don't.  I did tweak some of the commentary, too.

> I haven't touched the existing error messages. I think that's a
> subject for a separate patch.

Fair.  After looking around a bit, I think that getting an error
cursor as I'd speculated about is more trouble than it's worth.
For starters, we'd have to pass down the query string into this
code, and there might be some ticklish issues about whether a given
chunk of parsetree came from that string or from some rule or view.
However, I think that just adjusting the error string would be
helpful, as attached.

(I'm also wondering why the second case is generic ERRCODE_SYNTAX_ERROR
and not ERRCODE_GENERATED_ALWAYS.  Didn't change it here, though.)

regards, tom lane

diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index 839583f834..00a48686c4 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -861,7 +861,7 @@ rewriteTargetListIU(List *targetList,
 	if (!apply_default)
 		ereport(ERROR,
 (errcode(ERRCODE_GENERATED_ALWAYS),
- errmsg("cannot insert into column \"%s\"",
+ errmsg("cannot insert a non-DEFAULT value into column \"%s\"",
 		NameStr(att_tup->attname)),
  errdetail("Column \"%s\" is an identity column defined as GENERATED ALWAYS.",
 		   NameStr(att_tup->attname)),
@@ -900,7 +900,7 @@ rewriteTargetListIU(List *targetList,
 if (!apply_default)
 	ereport(ERROR,
 			(errcode(ERRCODE_SYNTAX_ERROR),
-			 errmsg("cannot insert into column \"%s\"",
+			 errmsg("cannot insert a non-DEFAULT value into column \"%s\"",
 	NameStr(att_tup->attname)),
 			 errdetail("Column \"%s\" is a generated column.",
 	   NameStr(att_tup->attname;
diff --git a/src/test/regress/expected/generated.out b/src/test/regress/expected/generated.out
index 559fafa09e..ca721d38bf 100644
--- a/src/test/regress/expected/generated.out
+++ b/src/test/regress/expected/generated.out
@@ -93,16 +93,16 @@ LINE 1: ...E gtest_err_8 (a int PRIMARY KEY, b int GENERATED BY DEFAULT...
 INSERT INTO gtest1 VALUES (1);
 INSERT INTO gtest1 VALUES (2, DEFAULT);  -- ok
 INSERT INTO gtest1 VALUES (3, 33);  -- error
-ERROR:  cannot insert into column "b"
+ERROR:  cannot insert a non-DEFAULT value into column "b"
 DETAIL:  Column "b" is a generated column.
 INSERT INTO gtest1 VALUES (3, 33), (4, 44);  -- error
-ERROR:  cannot insert into column "b"
+ERROR:  cannot insert a non-DEFAULT value into column "b"
 DETAIL:  Column "b" is a generated column.
 INSERT INTO gtest1 VALUES (3, DEFAULT), (4, 44);  -- error
-ERROR:  cannot insert into column "b"
+ERROR:  cannot insert a non-DEFAULT value into column "b"
 DETAIL:  Column "b" is a generated column.
 INSERT INTO gtest1 VALUES (3, 33), (4, DEFAULT);  -- error
-ERROR:  cannot insert into column "b"
+ERROR:  cannot insert a non-DEFAULT value into column "b"
 DETAIL:  Column "b" is a generated column.
 INSERT INTO gtest1 VALUES (3, DEFAULT), (4, DEFAULT);  -- ok
 SELECT * FROM gtest1 ORDER BY a;
@@ -193,25 +193,25 @@ SELECT * FROM gtest1v;
 (1 row)
 
 INSERT INTO gtest1v VALUES (4, 8);  -- error
-ERROR:  cannot insert into column "b"
+ERROR:  cannot insert a non-DEFAULT value into column "b"
 DETAIL:  Column "b" is a generated column.
 INSERT INTO gtest1v VALUES (5, DEFAULT);  -- ok
 INSERT INTO gtest1v VALUES (6, 66), (7, 77);  -- error
-ERROR:  cannot insert into column "b"
+ERROR:  cannot insert a non-DEFAULT value into column "b"
 DETAIL:  Column "b" is a generated column.
 INSERT INTO gtest1v VALUES (6, DEFAULT), (7, 77);  -- error
-ERROR:  cannot insert into column "b"
+ERROR:  cannot insert a non-DEFAULT value into column "b"
 DETAIL:  Column "b" is a generated column.
 INSERT INTO gtest1v VALUES (6, 66), (7, DEFAULT);  -- error
-ERROR:  cannot insert into column "b"
+ERROR:  cannot insert a non-DEFAULT value into column "b"
 DETAIL:  Column "b" is a generated column.
 INSERT INTO gtest1v VALUES (6, DEFAULT), (7, DEFAULT);  -- ok
 ALTER VIEW gtest1v ALTER COLUMN b SET DEFAULT 100;
 INSERT INTO gtest1v VALUES (8, DEFAULT);  -- 

Re: More time spending with "delete pending"

2020-11-22 Thread Alexander Lakhin
19.11.2020 01:28, Tom Lane wrote:
> Alexander Lakhin  writes:
>> 18.11.2020 23:39, Tom Lane wrote:
>>> Now, the ones on the 10 and 11 branches are all from pg_ctl, which
>>> does not use pgwin32_open() in those branches, only native open().
>>> So those aren't fair to count against it.  But we have nearly as
>>> many similar failures in HEAD, which surely is going through
>>> pgwin32_open().  So either we don't really have adequate protection
>>> against delete-pending files, or there is some other effect we haven't
>>> explained yet.
>> Can you confirm that there are no such failures on REL_12_STABLE and
>> REL_13_STABLE for last three (or maybe six) months? Maybe something
>> changed in HEAD then?
> Hmm, that is an interesting question isn't it.  Here's a search going
> back a full year.  There are a few in v12 --- interestingly, all on
> the statistics file, none from pg_ctl --- and none in v13.  Of course,
> v13 has only existed as a separate branch since 2020-06-07.
>
> There's also a buildfarm test-coverage artifact involved.  The bulk
> of the HEAD reports are from dory and walleye, neither of which are
> building v13 as yet, because of unclear problems [1].  I think those
> two animals build much more frequently than our other Windows animals,
> too, so the fact that they have more may be just because of that and
> not because they're somehow more susceptible.  Because of that, I'm not
> sure that we have enough evidence to say that v13 is better than HEAD.
> If there is some new bug, it's post-v12, but maybe not post-v13.  But
> v12 is evidently not perfect either.
Thanks for the complete list! As I can see from it, only lorikeet fails
on REL_12_STABLE. And that has a simple explanation:
lorikeet uses cygwin, not win32, and the improved open() and fopen()
functions are not defined on cygwin:
#if defined(WIN32) && !defined(__CYGWIN__)

/*
 * open() and fopen() replacements to allow deletion of open files and
 * passing of other special options.
 */
#define        O_DIRECT    0x8000
extern int    pgwin32_open(const char *, int,...);
extern FILE *pgwin32_fopen(const char *, const char *);
#define        open(a,b,c) pgwin32_open(a,b,c)
#define        fopen(a,b) pgwin32_fopen(a,b)
...


And aside from the "Permission denied" failures (not necessarily related
to the "delete pending" state), we can see the "Device or resource busy"
failures on the same global.stat file (e. g., [1], [2], [3]). All these
failures occur when VACUUM (or REINDEX [3]) tries to access statistics
simultaneously with the statistics collector.

I've tried to make open()/fopen() replacements for cygwin, but haven't
succeeded yet (I need more time to find out how to get the original
GetLastError() [4], as knowing errno is not sufficient to implement
pgwin32_open()'s logic). (And I'm inclined to file a separate bug for
this cygwin behavior when I get more info.)

The failures on HEAD (on dory and walleye) need separate investigation
(maybe they are caused by the modified stat()...).

[1]
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=lorikeet&dt=2020-09-07%2020%3A31%3A16&stg=install-check-C
[2]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2019-10-27%2009%3A36%3A07
[3]
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=lorikeet&dt=2020-09-02%2008%3A28%3A53&stg=install-check-C
[4] https://github.com/openunix/cygwin/blob/master/winsup/cygwin/errno.cc

Best regards,
Alexander




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-11-22 Thread Andy Fan
On Sun, Nov 22, 2020 at 9:21 PM Andy Fan  wrote:

> Hi David:
>
> I did a review of v8; it looks great to me.  Here are some tiny things
> I noted, just FYI.
>
>  1. modified   src/include/utils/selfuncs.h
> @@ -70,9 +70,9 @@
>   * callers to provide further details about some assumptions which were
> made
>   * during the estimation.
>   */
> -#define SELFLAG_USED_DEFAULT (1 << 0) /* Estimation fell back on one of
> -  * the DEFAULTs as defined above.
> -  */
> +#define SELFLAG_USED_DEFAULT (1 << 0) /* Estimation fell back on one
> + * of the DEFAULTs as defined
> + * above. */
>
> Looks like nothing has changed.
>
>
> 2. Leading spaces are not necessary in the comments.
>
>  /*
>   * ResultCacheTuple Stores an individually cached tuple
>   */
> typedef struct ResultCacheTuple
> {
> MinimalTuple mintuple; /* Cached tuple */
> struct ResultCacheTuple *next; /* The next tuple with the same parameter
> * values or NULL if it's the last one */
> } ResultCacheTuple;
>
>
> 3. We define ResultCacheKey as below.
>
> /*
>  * ResultCacheKey
>  * The hash table key for cached entries plus the LRU list link
>  */
> typedef struct ResultCacheKey
> {
> MinimalTuple params;
> dlist_node lru_node; /* Pointer to next/prev key in LRU list */
> } ResultCacheKey;
>
> Since we store it as a MinimalTuple, we need a FETCH_INNER_VAR step for
> each element during the ResultCacheHash_equal call.  I am wondering if
> we can store a "Datum *" directly to save these steps.
> exec_aggvalues/exec_aggnulls looks like a good candidate to me, except
> that the name doesn't look good. IMO, we can
> rename exec_aggvalues/exec_aggnulls and try to merge
> EEOP_AGGREF/EEOP_WINDOW_FUNC into a more generic step which can be
> reused in this case.
>
> 4. I think the ExecClearTuple in prepare_probe_slot is not a must,
> since the data tts_values/tts_flags/tts_nvalid are all reset later, and
> tts_tid is not really used in our case.  Since both prepare_probe_slot
> and ResultCacheHash_equal are in a pretty hot path, we may need to
> consider it.
>
> static inline void
> prepare_probe_slot(ResultCacheState *rcstate, ResultCacheKey *key)
> {
> ...
> ExecClearTuple(pslot);
> ...
> }
>
>
> static void
> tts_virtual_clear(TupleTableSlot *slot)
> {
> if (unlikely(TTS_SHOULDFREE(slot)))
> {
> VirtualTupleTableSlot *vslot = (VirtualTupleTableSlot *) slot;
>
> pfree(vslot->data);
> vslot->data = NULL;
>
> slot->tts_flags &= ~TTS_FLAG_SHOULDFREE;
> }
>
> slot->tts_nvalid = 0;
> slot->tts_flags |= TTS_FLAG_EMPTY;
> ItemPointerSetInvalid(&slot->tts_tid);
> }
>
> --
> Best Regards
> Andy Fan
>


add 2 more comments.

1. I'd suggest adding Assert(false); in the RC_END_OF_SCAN case to make
the error clearer.

case RC_END_OF_SCAN:
/*
* We've already returned NULL for this scan, but just in case
* something calls us again by mistake.
*/
return NULL;

2. Currently we handle the (!cache_store_tuple(node, outerslot)) case by
setting it to RC_CACHE_BYPASS_MODE. The only reason for a
cache_store_tuple failure is that we can't cache_reduce_memory.  I guess
if cache_reduce_memory failed once, it would not succeed later (no more
tuples can be stored, nothing is changed). So I think we can record this
state and avoid any new cache_reduce_memory call.

/*
* If we failed to create the entry or failed to store the
* tuple in the entry, then go into bypass mode.
*/
if (unlikely(entry == NULL ||
!cache_store_tuple(node, outerslot)))

 to

if (unlikely(entry == NULL || node->memory_cant_be_reduced ||
!cache_store_tuple(node, outerslot)))

-- 
Best Regards
Andy Fan


Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-11-22 Thread Andy Fan
Hi David:

I did a review of v8; it looks great to me.  Here are some tiny things I
noted, just FYI.

 1. modified   src/include/utils/selfuncs.h
@@ -70,9 +70,9 @@
  * callers to provide further details about some assumptions which were
made
  * during the estimation.
  */
-#define SELFLAG_USED_DEFAULT (1 << 0) /* Estimation fell back on one of
-  * the DEFAULTs as defined above.
-  */
+#define SELFLAG_USED_DEFAULT (1 << 0) /* Estimation fell back on one
+ * of the DEFAULTs as defined
+ * above. */

Looks like nothing has changed.


2. Leading spaces are not necessary in the comments.

 /*
  * ResultCacheTuple Stores an individually cached tuple
  */
typedef struct ResultCacheTuple
{
MinimalTuple mintuple; /* Cached tuple */
struct ResultCacheTuple *next; /* The next tuple with the same parameter
* values or NULL if it's the last one */
} ResultCacheTuple;


3. We define ResultCacheKey as below.

/*
 * ResultCacheKey
 * The hash table key for cached entries plus the LRU list link
 */
typedef struct ResultCacheKey
{
MinimalTuple params;
dlist_node lru_node; /* Pointer to next/prev key in LRU list */
} ResultCacheKey;

Since we store it as a MinimalTuple, we need a FETCH_INNER_VAR step for
each element during the ResultCacheHash_equal call.  I am wondering if we
can store a "Datum *" directly to save these steps.
exec_aggvalues/exec_aggnulls looks like a good candidate to me, except
that the name doesn't look good. IMO, we can rename
exec_aggvalues/exec_aggnulls and try to merge EEOP_AGGREF/EEOP_WINDOW_FUNC
into a more generic step which can be reused in this case.

4. I think the ExecClearTuple in prepare_probe_slot is not a must, since
the data tts_values/tts_flags/tts_nvalid are all reset later, and tts_tid
is not really used in our case.  Since both prepare_probe_slot and
ResultCacheHash_equal are in a pretty hot path, we may need to consider
it.

static inline void
prepare_probe_slot(ResultCacheState *rcstate, ResultCacheKey *key)
{
...
ExecClearTuple(pslot);
...
}


static void
tts_virtual_clear(TupleTableSlot *slot)
{
if (unlikely(TTS_SHOULDFREE(slot)))
{
VirtualTupleTableSlot *vslot = (VirtualTupleTableSlot *) slot;

pfree(vslot->data);
vslot->data = NULL;

slot->tts_flags &= ~TTS_FLAG_SHOULDFREE;
}

slot->tts_nvalid = 0;
slot->tts_flags |= TTS_FLAG_EMPTY;
ItemPointerSetInvalid(&slot->tts_tid);
}

-- 
Best Regards
Andy Fan


Parallel plans and "union all" subquery

2020-11-22 Thread Phil Florent

Hi,


I have a question about parallel plans. I also posted it on the general list
but perhaps it's a question for hackers. Here is my test case:


select version();
version


--
PostgreSQL 13.1 (Ubuntu 13.1-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by 
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit


create unlogged table drop_me as select generate_series(1,7e7) n1;
SELECT 70000000


explain
select count(*)
from (select
n1
from drop_me
) s;


QUERY PLAN
--
Finalize Aggregate  (cost=675319.13..675319.14 rows=1 width=8)
  ->  Gather  (cost=675318.92..675319.13 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=674318.92..674318.93 rows=1 width=8)
              ->  Parallel Seq Scan on drop_me  (cost=0.00..601402.13 rows=29166713 width=0)
JIT:
  Functions: 4
  Options: Inlining true, Optimization true, Expressions true, Deforming true


Parallel plan, 1 s


explain
select count(*)
from (select
n1
from drop_me
union all
select
n1
from drop_me) ua;


QUERY PLAN
--
Finalize Aggregate  (cost=1640315.00..1640315.01 rows=1 width=8)
  ->  Gather  (cost=1640314.96..1640314.99 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=1640304.96..1640304.97 rows=1 width=8)
              ->  Parallel Append  (cost=0.00..1494471.40 rows=58333426 width=0)
                    ->  Parallel Seq Scan on drop_me  (cost=0.00..601402.13 rows=29166713 width=0)
                    ->  Parallel Seq Scan on drop_me drop_me_1  (cost=0.00..601402.13 rows=29166713 width=0)
JIT:
  Functions: 6
  Options: Inlining true, Optimization true, Expressions true, Deforming true


Parallel plan, 2.2 s


explain
select count(*)
from (select
n1
from drop_me
union all
values(1)) ua;


QUERY PLAN

Aggregate  (cost=2934739.24..2934739.25 rows=1 width=8)
  ->  Append  (cost=0.00..2059737.83 rows=7113 width=32)
        ->  Seq Scan on drop_me  (cost=0.00..1009736.12 rows=7112 width=6)
        ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..0.02 rows=1 width=32)
              ->  Result  (cost=0.00..0.01 rows=1 width=4)
JIT:
  Functions: 4
  Options: Inlining true, Optimization true, Expressions true, Deforming true


No parallel plan, 2.6 s


I read the documentation but I don't understand the reason for the
non-parallel seq scan of drop_me in the last case?


Best regards,

Phil



Re: Removal of currtid()/currtid2() and some table AM cleanup

2020-11-22 Thread Michael Paquier
On Sat, Nov 21, 2020 at 09:39:28PM -0500, Tom Lane wrote:
> Michael Paquier  writes:
>> So, what you are basically saying is to switch currtid_byreloid() to
>> become a function local to tid.c.  And then have just
>> currtid_byrelname() and currtid_for_view() call that, right?
> 
> Yeah, that sounds about right.

Okay, here you go with the attached.  If there are any other comments,
please feel free.
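
For readers unfamiliar with these functions, a hedged sketch of the two
call styles (the table name is hypothetical); the by-OID variant is the
one removed here:

-- kept for now: latest tuple version looked up by relation name
SELECT currtid2('mytable', '(0,1)'::tid);

-- removed by this patch: the by-OID variant
SELECT currtid('mytable'::regclass, '(0,1)'::tid);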
--
Michael
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 92b19dba32..54b2eb7378 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -129,7 +129,6 @@ extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
    bool *all_dead, bool first_call);
 
 extern void heap_get_latest_tid(TableScanDesc scan, ItemPointer tid);
-extern void setLastTid(const ItemPointer tid);
 
 extern BulkInsertState GetBulkInsertState(void);
 extern void FreeBulkInsertState(BulkInsertState);
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index c6da0df868..5f33dc9db0 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*			mmddN */
-#define CATALOG_VERSION_NO	202011191
+#define CATALOG_VERSION_NO	202011221
 
 #endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 33dacfd340..e7fbda9f81 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -2549,9 +2549,6 @@
 { oid => '1292',
   proname => 'tideq', proleakproof => 't', prorettype => 'bool',
   proargtypes => 'tid tid', prosrc => 'tideq' },
-{ oid => '1293', descr => 'latest tid of a tuple',
-  proname => 'currtid', provolatile => 'v', proparallel => 'u',
-  prorettype => 'tid', proargtypes => 'oid tid', prosrc => 'currtid_byreloid' },
 { oid => '1294', descr => 'latest tid of a tuple',
   proname => 'currtid2', provolatile => 'v', proparallel => 'u',
   prorettype => 'tid', proargtypes => 'text tid',
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 29e07b7228..e0f24283b8 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -645,10 +645,7 @@ ExecInsert(ModifyTableState *mtstate,
 	}
 
 	if (canSetTag)
-	{
 		(estate->es_processed)++;
-		setLastTid(&slot->tts_tid);
-	}
 
 	/*
 	 * If this insert is the result of a partition key update that moved the
diff --git a/src/backend/utils/adt/tid.c b/src/backend/utils/adt/tid.c
index 509a0fdffc..6f8b400e83 100644
--- a/src/backend/utils/adt/tid.c
+++ b/src/backend/utils/adt/tid.c
@@ -47,6 +47,8 @@
 #define DELIM			','
 #define NTIDARGS		2
 
+static ItemPointer currtid_for_view(Relation viewrel, ItemPointer tid);
+
 /* 
  *		tidin
  * 
@@ -275,12 +277,44 @@ hashtidextended(PG_FUNCTION_ARGS)
  *	Maybe these implementations should be moved to another place
  */
 
-static ItemPointerData Current_last_tid = {{0, 0}, 0};
-
-void
-setLastTid(const ItemPointer tid)
+/*
+ * Utility wrapper for current CTID functions.
+ *		Returns the latest version of a tuple pointing at "tid" for
+ *		relation "rel".
+ */
+static ItemPointer
+currtid_internal(Relation rel, ItemPointer tid)
 {
-	Current_last_tid = *tid;
+	ItemPointer result;
+	AclResult	aclresult;
+	Snapshot	snapshot;
+	TableScanDesc scan;
+
+	result = (ItemPointer) palloc(sizeof(ItemPointerData));
+
+	aclresult = pg_class_aclcheck(RelationGetRelid(rel), GetUserId(),
+  ACL_SELECT);
+	if (aclresult != ACLCHECK_OK)
+		aclcheck_error(aclresult, get_relkind_objtype(rel->rd_rel->relkind),
+	   RelationGetRelationName(rel));
+
+	if (rel->rd_rel->relkind == RELKIND_VIEW)
+		return currtid_for_view(rel, tid);
+
+	if (!RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
+		elog(ERROR, "cannot look at latest visible tid for relation \"%s.%s\"",
+			 get_namespace_name(RelationGetNamespace(rel)),
+			 RelationGetRelationName(rel));
+
+	ItemPointerCopy(tid, result);
+
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	scan = table_beginscan_tid(rel, snapshot);
+	table_tuple_get_latest_tid(scan, result);
+	table_endscan(scan);
+	UnregisterSnapshot(snapshot);
+
+	return result;
 }
 
 /*
@@ -288,7 +322,7 @@ setLastTid(const ItemPointer tid)
  *		CTID should be defined in the view and it must
  *		correspond to the CTID of a base relation.
  */
-static Datum
+static ItemPointer
 currtid_for_view(Relation viewrel, ItemPointer tid)
 {
 	TupleDesc	att = RelationGetDescr(viewrel);
@@ -338,12 +372,12 @@ currtid_for_view(Relation viewrel, ItemPointer tid)
 	rte = rt_fetch(var->varno, query->rtable);
 	if (rte)
 	{
-		Datum		result;
+		ItemPointer	result;
+		Relation	rel;
 
-		result = DirectFunctionCall2(currtid_byreloid,
-	 ObjectIdGetDatum(rte->relid),
-	 PointerGetDatum(tid));
-		table_close(viewrel, A

Re: Issue with server side statement-level rollback

2020-11-22 Thread Gilles Darold
On 20/11/2020 at 16:18, Gilles Darold wrote:
> I will work later on a POC to demonstrate the use case I want to
> implement. 


Hi Andres,


I have created a new version of the pg_statement_rollback extension [1]
to demonstrate the use of the hooks on start_xact_command(),
finish_xact_command() and AbortCurrentTransaction() to implement the
statement-level rollback feature entirely driven server-side. It
requires that the patch [2] I've provided be applied to the PostgreSQL
sources first.


Here is what can be achieved with this patch:


LOAD 'pg_statement_rollback.so';
LOAD
SET pg_statement_rollback.enabled TO on;
SET
CREATE SCHEMA testrsl;
CREATE SCHEMA
SET search_path TO testrsl,public;
SET
BEGIN;
BEGIN
CREATE TABLE tbl_rsl(id integer, val varchar(256));
CREATE TABLE
INSERT INTO tbl_rsl VALUES (1, 'one');
INSERT 0 1
WITH write AS (INSERT INTO tbl_rsl VALUES (2, 'two') RETURNING id,
val) SELECT * FROM write;
 id | val
----+-----
  2 | two
(1 row)

UPDATE tbl_rsl SET id = 'two', val = 2 WHERE id = 1; -- > will fail
psql:simple.sql:14: ERROR:  invalid input syntax for type integer: "two"
LINE 1: UPDATE tbl_rsl SET id = 'two', val = 2 WHERE id = 1;
    ^
SELECT * FROM tbl_rsl; -- Should show records id 1 + 2
 id | val
----+-----
  1 | one
  2 | two
(2 rows)

COMMIT;
COMMIT


Actually, unlike what I first thought, it is the hook on
finish_xact_command() that is useless. In the extension I execute the
RELEASE/SAVEPOINT pair in the start_xact_command() hook before executing
the next statement. The hook on AbortCurrentTransaction() is used to
signal that a ROLLBACK TO/SAVEPOINT pair needs to be executed in the
start_xact_command() hook instead of a RELEASE/SAVEPOINT one; a rough
sketch of this wiring is shown below.

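To make the control flow concrete, here is a minimal, hypothetical
sketch of how an extension could wire the two hooks together. Only the
hook variable names come from the patch below; the function bodies, the
savepoint name "pgsr_auto" and the assumption that the externs live in
access/xact.h and tcop/postgres.h are illustrative, not the actual
extension code:

#include "postgres.h"
#include "fmgr.h"
#include "access/xact.h"
#include "tcop/postgres.h"

PG_MODULE_MAGIC;

static bool rollback_pending = false;

/*
 * Runs at the end of AbortCurrentTransaction(): remember that the
 * failed statement must be undone before the next one executes.
 */
static void
pgsr_abort_hook(void)
{
	rollback_pending = true;
}

/*
 * Runs from start_xact_command() before the next statement: replay
 * ROLLBACK TO or RELEASE, then install a fresh automatic savepoint.
 * NB: deliberately simplified -- the very first statement of a block
 * has no previous savepoint to release, and the real extension must
 * also drive these commands through the normal transaction-command
 * machinery and cope with nesting and client-defined savepoints.
 */
static void
pgsr_start_hook(void)
{
	if (!IsTransactionBlock())
		return;

	if (rollback_pending)
	{
		RollbackToSavepoint("pgsr_auto");	/* undo the failed statement */
		rollback_pending = false;
	}
	else
		ReleaseSavepoint("pgsr_auto");		/* discard the previous one */

	DefineSavepoint("pgsr_auto");			/* protect the next statement */
}

void
_PG_init(void)
{
	abort_current_transaction_hook = pgsr_abort_hook;
	start_xact_command_hook = pgsr_start_hook;
}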

This works perfectly and no longer crashes PG when compiled with
asserts. Advanced tests (with triggers, client savepoints, CTEs, etc.)
are available in the test/sql/ directory. Running "make installcheck"
executes the regression tests.


Based on this result I really think that these hooks should be included
so that PostgreSQL can be extended with such a feature, although I have
not thought of a use for them other than this one.


Regards, 


I've attached all the code for archiving, but the current version can
be found here too:

[1] https://github.com/darold/pg_statement_rollbackv2

[2]

https://raw.githubusercontent.com/darold/pg_statement_rollbackv2/main/command-start-finish-hook-v1.patch


-- 

Gilles Darold
http://www.darold.net/

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 03c553e7ea..81e1df27ef 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -314,6 +314,9 @@ typedef struct SubXactCallbackItem
 static SubXactCallbackItem *SubXact_callbacks = NULL;
 
 
+/* Hook for plugins to get control at the end of AbortCurrentTransaction */
+AbortCurrentTransaction_hook_type abort_current_transaction_hook = NULL;
+
 /* local function prototypes */
 static void AssignTransactionId(TransactionState s);
 static void AbortTransaction(void);
@@ -3358,6 +3361,9 @@ AbortCurrentTransaction(void)
 			AbortCurrentTransaction();
 			break;
 	}
+
+	if (abort_current_transaction_hook)
+		(*abort_current_transaction_hook) ();
 }
 
 /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 7c5f7c775b..3fff54ff51 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -192,6 +192,14 @@ static void log_disconnections(int code, Datum arg);
 static void enable_statement_timeout(void);
 static void disable_statement_timeout(void);
 
+/*
+ * Hook for plugins to get control at the end of start_xact_command
+ * and the start of finish_xact_command. The hook on
+ * start_xact_command might be useless, as it happens before
+ * ProcessUtility and the Executor* hooks.
+ */
+XactCommandStart_hook_type start_xact_command_hook = NULL;
+XactCommandFinish_hook_type finish_xact_command_hook = NULL;
 
 /* ----------------------------------------------------------------
  *		routines to obtain user input
@@ -2634,6 +2642,7 @@ exec_describe_portal_message(const char *portal_name)
 static void
 start_xact_command(void)
 {
+
 	if (!xact_started)
 	{
 		StartTransactionCommand();
@@ -2649,11 +2658,19 @@ start_xact_command(void)
 	 * not desired, the timeout has to be disabled explicitly.
 	 */
 	enable_statement_timeout();
+
+	if (start_xact_command_hook)
+		(*start_xact_command_hook) ();
+
 }
 
+
 static void
 finish_xact_command(void)
 {
+	if (finish_xact_command_hook)
+		(*finish_xact_command_hook) ();
+
 	/* cancel active statement timeout after each command */
 	disable_statement_timeout();
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7320de345c..2e866b2a91 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -467,4 +467,8 @@ extern void EnterParallelMode(void);
 extern void ExitPar

[PATCH 1/1] Initial mach based shared memory support.

2020-11-22 Thread James Hilliard
OSX implements sysv shmem support via a mach wrapper; however, the mach
sysv shmem wrapper has some severe restrictions that prevent us from
allocating enough memory segments in some cases.

These limits appear to be due to the way the wrapper itself is
implemented, not to mach itself.

For example, when running a test suite that spins up many postgres
instances, I commonly see this issue:
DETAIL:  Failed system call was shmget(key=5432002, size=56, 03600).

In order to avoid hitting these limits we can bypass the wrapper layer
and just use mach directly.

This is roughly how all major browsers seem to implement shared memory
support on OSX.

This still needs a lot of cleanup but seems to compile and basic tests
seem to pass.
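
For reference, here is a minimal, hypothetical sketch of the mach
primitives this approach builds on: allocate anonymous memory in the
current task, wrap it in a named memory entry whose send right another
task could map, and map a second view the way a cooperating process
would. It is illustrative only and is not code from the attached patch:

#include <mach/mach.h>
#include <mach/mach_vm.h>
#include <stdio.h>

int
main(void)
{
	mach_vm_address_t	addr = 0;
	mach_vm_address_t	view = 0;
	memory_object_size_t size = 1024 * 1024;	/* 1MB, rounded to pages */
	mach_port_t			handle = MACH_PORT_NULL;

	/* Allocate anonymous memory in our own task. */
	if (mach_vm_allocate(mach_task_self(), &addr, size,
						 VM_FLAGS_ANYWHERE) != KERN_SUCCESS)
		return 1;

	/*
	 * Wrap it in a named memory entry.  The resulting send right is the
	 * shareable handle; another task would map it with mach_vm_map().
	 */
	if (mach_make_memory_entry_64(mach_task_self(), &size, addr,
								  VM_PROT_READ | VM_PROT_WRITE,
								  &handle, MACH_PORT_NULL) != KERN_SUCCESS)
		return 1;

	/* Map a second view of the same memory, as a peer process would. */
	if (mach_vm_map(mach_task_self(), &view, size, 0, VM_FLAGS_ANYWHERE,
					handle, 0, FALSE,
					VM_PROT_READ | VM_PROT_WRITE,
					VM_PROT_READ | VM_PROT_WRITE,
					VM_INHERIT_SHARE) != KERN_SUCCESS)
		return 1;

	((char *) addr)[0] = 42;
	printf("byte via second view: %d\n", ((char *) view)[0]);	/* 42 */
	return 0;
}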
---
 configure |  15 +-
 configure.ac  |  11 +-
 src/backend/port/mach_shmem.c | 797 ++
 src/include/pg_config.h.in|   3 +
 src/include/postgres.h|  14 +
 5 files changed, 831 insertions(+), 9 deletions(-)
 create mode 100644 src/backend/port/mach_shmem.c

diff --git a/configure b/configure
index dd64692345..5a2eb14dea 100755
--- a/configure
+++ b/configure
@@ -18043,16 +18043,21 @@ fi
 
 
 # Select shared-memory implementation type.
-if test "$PORTNAME" != "win32"; then
+if test "$PORTNAME" = "win32"; then
 
-$as_echo "#define USE_SYSV_SHARED_MEMORY 1" >>confdefs.h
+$as_echo "#define USE_WIN32_SHARED_MEMORY 1" >>confdefs.h
 
-  SHMEM_IMPLEMENTATION="src/backend/port/sysv_shmem.c"
+  SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
+elif test "$PORTNAME" = "darwin"; then
+
+$as_echo "#define USE_MACH_SHARED_MEMORY 1" >>confdefs.h
+
+  SHMEM_IMPLEMENTATION="src/backend/port/mach_shmem.c"
 else
 
-$as_echo "#define USE_WIN32_SHARED_MEMORY 1" >>confdefs.h
+$as_echo "#define USE_SYSV_SHARED_MEMORY 1" >>confdefs.h
 
-  SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
+  SHMEM_IMPLEMENTATION="src/backend/port/sysv_shmem.c"
 fi
 
 # Select random number source. If a TLS library is used then it will be the
diff --git a/configure.ac b/configure.ac
index 748fb50236..1bbcd5dc78 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2144,12 +2144,15 @@ fi
 
 
 # Select shared-memory implementation type.
-if test "$PORTNAME" != "win32"; then
-  AC_DEFINE(USE_SYSV_SHARED_MEMORY, 1, [Define to select SysV-style shared memory.])
-  SHMEM_IMPLEMENTATION="src/backend/port/sysv_shmem.c"
-else
+if test "$PORTNAME" = "win32"; then
   AC_DEFINE(USE_WIN32_SHARED_MEMORY, 1, [Define to select Win32-style shared memory.])
   SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
+elif test "$PORTNAME" = "darwin"; then
+  AC_DEFINE(USE_MACH_SHARED_MEMORY, 1, [Define to select Mach-style shared memory.])
+  SHMEM_IMPLEMENTATION="src/backend/port/mach_shmem.c"
+else
+  AC_DEFINE(USE_SYSV_SHARED_MEMORY, 1, [Define to select SysV-style shared memory.])
+  SHMEM_IMPLEMENTATION="src/backend/port/sysv_shmem.c"
 fi
 
 # Select random number source. If a TLS library is used then it will be the
diff --git a/src/backend/port/mach_shmem.c b/src/backend/port/mach_shmem.c
new file mode 100644
index 00..2fe42ed770
--- /dev/null
+++ b/src/backend/port/mach_shmem.c
@@ -0,0 +1,797 @@
+/*-
+ *
+ * mach_shmem.c
+ *   Implement shared memory using mach facilities
+ *
+ * These routines used to be a fairly thin layer on top of mach shared
+ * memory functionality.  With the addition of anonymous-shmem logic,
+ * they're a bit fatter now.  We still require a mach shmem block to
+ * exist, though, because mmap'd shmem provides no way to find out how
+ * many processes are attached, which we need for interlocking purposes.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *   src/backend/port/mach_shmem.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include 
+#include 
+#include 
+#ifdef HAVE_SYS_IPC_H
+#include <sys/ipc.h>
+#endif
+
+#include "miscadmin.h"
+#include "portability/mem.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/pg_shmem.h"
+#include "utils/pidfile.h"
+
+#include 
+#include 
+#include 
+
+
+/*
+ * As of PostgreSQL 9.3, we normally allocate only a very small amount of
+ * System V shared memory, and only for the purposes of providing an
+ * interlock to protect the data directory.  The real shared memory block
+ * is allocated using mmap().  This works around the problem that many
+ * systems have very low limits on the amount of System V shared memory
+ * that can be allocated.  Even a limit of a few megabytes will be enough
+ * to run many copies of PostgreSQL without needing to adjust system settings.
+ *
+ * We assume that no one will attempt to run PostgreSQL 9.3 or later on
+ * systems that are ancient enough that anonymous shared memory is not