Re: Single transaction in the tablesync worker?

2020-12-06 Thread Amit Kapila
On Mon, Dec 7, 2020 at 10:02 AM Craig Ringer
 wrote:
>
> On Mon, 7 Dec 2020 at 11:44, Peter Smith  wrote:
>>
>>
>> Basically, I was wondering why can't the "tablesync" worker just
>> gather messages in a similar way to how the current streaming feature
>> gathers messages into a "changes" file, so that they can be replayed
>> later.
>>
>
> See the related thread "Logical archiving"
>
> https://www.postgresql.org/message-id/20d9328b-a189-43d1-80e2-eb25b9284...@yandex-team.ru
>
> where I addressed some parts of this topic in detail earlier today.
>
>> A) The "tablesync" worker (after the COPY) does not ever apply any of
>> the incoming messages, but instead it just gobbles them into a
>> "changes" file until it decides it has reached SYNCDONE state and
>> exits.
>
>
> This has a few issues.
>
> Most importantly, the sync worker must cooperate with the main apply worker 
> to achieve a consistent end-of-sync cutover.
>

In this idea, there is no need to change the end-of-sync cutover. It
will work as it is now. I am not sure what makes you think so.

> The sync worker must have replayed the pending changes in order to make this 
> cut-over, because the non-sync apply worker will need to start applying 
> changes on top of the resync'd table potentially as soon as the next 
> transaction it starts applying, so it needs to see the rows there.
>

The change here would be that the apply worker will check for the changes
file and, if it exists, apply those changes before it sets the relstate
to SUBREL_STATE_READY in process_syncing_tables_for_apply(). So, it
will not miss seeing any rows.
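
To make it concrete, a rough sketch of where such a check could go (the
helpers for locating and replaying the changes file are hypothetical,
nothing like them exists today):

    /* In process_syncing_tables_for_apply(), once the tablesync worker is SYNCDONE */
    if (rstate->state == SUBREL_STATE_SYNCDONE)
    {
        char    path[MAXPGPATH];

        changes_filename(path, MyLogicalRepWorker->subid, rstate->relid);  /* hypothetical */
        if (changes_file_exists(path))        /* hypothetical */
            apply_spooled_changes(path);      /* hypothetical: replay via apply_dispatch() */

        /* Only after replaying do we mark the relation as ready. */
        UpdateSubscriptionRelState(MyLogicalRepWorker->subid, rstate->relid,
                                   SUBREL_STATE_READY, rstate->lsn);
    }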

> Doing this would also add another round of write multiplication since the 
> data would get spooled then applied to WAL then heap. Write multiplication is 
> already an issue for logical replication so adding to it isn't particularly 
> desirable without a really compelling reason.
>

It will solve our problem of allowing decoding of prepared xacts in
pgoutput. I have explained the problem above [1]. The other idea which
we discussed is to allow having an additional state in
pg_subscription_rel, make the slot permanent in the tablesync worker,
and then process transaction-by-transaction in the apply worker. Does
that approach sound better? Is there any bigger change involved in this
approach (making the tablesync slot permanent) which I am missing?

> With  the write multiplication comes disk space management issues for big 
> transactions as well as the obvious performance/throughput impact.
>
> It adds even more latency between upstream commit and downstream apply, 
> something that is again already an issue for logical replication.
>
> Right now we don't have any concept of a durable and locally flushed spool.
>

I think we have a concept quite close to it for writing changes for
in-progress xacts as done in PG-14. It is not durable but that
shouldn't be a big problem if we allow syncing the changes file.

> It's not impossible to do as you suggest but the cutover requirement makes it 
> far from simple. As discussed in the logical archiving thread I think it'd be 
> good to have something like this, and there are times the write 
> multiplication price would be well worth paying. But it's not easy.
>
>> B) Then, when the "apply" worker proceeds, if it detects the existence
>> of the "changes" file it will replay/apply_dispatch all those gobbled
>> messages before just continuing as normal.
>
>
> That's going to introduce a really big stall in the apply worker's progress 
> in many cases. During that time it won't be receiving from upstream (since we 
> don't spool logical changes to disk at this time) so the upstream lag will 
> grow. That will impact synchronous replication, pg_wal size management, 
> catalog bloat, etc. It'll also leave the upstream logical decoding session 
> idle, so when it resumes it may create a spike of I/O and CPU load as it 
> catches up, as well as a spike of network traffic. And depending on how close 
> the upstream write rate is to the max decode speed, network throughput max, 
> and downstream apply speed max, it may take some time to catch up over the 
> resulting lag.
>

This is just for the initial tablesync phase. I think it is equivalent
to saying that we need to start physical replication in parallel with a
basebackup. I agree that sometimes it can take a lot of time to copy
large tables, but it will be just one time and no worse than other
situations like basebackup.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1KFsjf6x-S7b0dJLvEL3tcn9x-voBJiFoGsccyH5xgDzQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-06 Thread k.jami...@fujitsu.com
On Friday, December 4, 2020 8:27 PM, Amit Kapila  
wrote:
> On Fri, Nov 27, 2020 at 11:36 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Fri, 27 Nov 2020 02:19:57 +, "k.jami...@fujitsu.com"
> >  wrote in
> > > > From: Kyotaro Horiguchi  Hello, Kirk.
> > > > Thank you for the new version.
> > >
> > > Hi, Horiguchi-san. Thank you for your very helpful feedback.
> > > I'm updating the patches addressing those.
> > >
> > > > +   if (!smgrexists(rels[i], j))
> > > > +   continue;
> > > > +
> > > > +   /* Get the number of blocks for a relation's fork */
> > > > +   blocks[i][numForks] = smgrnblocks(rels[i], j,
> > > > NULL);
> > > >
> > > > If we see a fork which its size is not cached we must give up this
> > > > optimization for all target relations.
> > >
> > > I did not use the "cached" flag in DropRelFileNodesAllBuffers and
> > > use InRecovery when deciding for optimization because of the following
> reasons:
> > > XLogReadBufferExtended() calls smgrnblocks() to apply changes to
> > > relation page contents. So in DropRelFileNodeBuffers(),
> > > XLogReadBufferExtended() is called during VACUUM replay because
> VACUUM changes the page content.
> > > OTOH, TRUNCATE doesn't change the relation content, it just
> > > truncates relation pages without changing the page contents. So
> > > XLogReadBufferExtended() is not called, and the "cached" flag will
> > > always return false. I tested with "cached" flags before, and this
> >
> > A bit different from the point, but if some tuples have been inserted
> > to the truncated table, XLogReadBufferExtended() is called for the
> > table and the length is cached.
> >
> > > always return false, at least in DropRelFileNodesAllBuffers. Due to
> > > this, we cannot use the cached flag in DropRelFileNodesAllBuffers().
> > > However, I think we can still rely on smgrnblocks to get the file
> > > size as long as we're InRecovery. That cached nblocks is still guaranteed
> to be the maximum in the shared buffer.
> > > Thoughts?
> >
> > That means that we always think as if smgrnblocks returns "cached" (or
> > "safe") value during recovery, which is out of our current consensus.
> > If we go on that side, we don't need to consult the "cached" returned
> > from smgrnblocks at all and it's enough to see only InRecovery.
> >
> > I got confused..
> >
> > We are relying on the "fact" that the first lseek() call of a
> > (startup) process tells the truth.  We added an assertion so that we
> > make sure that the cached value won't be cleared during recovery.  A
> > possible remaining danger would be closing of an smgr object of a live
> > relation just after a file extension failure. I think we are thinking
> > that that doesn't happen during recovery.  Although it seems to me
> > true, I'm not confident.
> >
> 
> Yeah, I also think it might not be worth depending upon whether smgr close
> has been done before or not. I feel the current idea of using 'cached'
> parameter is relatively solid and we should rely on that.
> Also, which means that in DropRelFileNodesAllBuffers() we should rely on
> the same, I think doing things differently in this regard will lead to 
> confusion. I
> agree in some cases we might not get benefits but it is more important to be
> correct and keep the code consistent to avoid introducing bugs now or in the
> future.
> 
Hi, 
I have reported before that it is not always the case that the "cached" flag of
smgrnblocks() returns true. So when I checked the truncate test case used in my
patch, it does not enter the optimization path despite doing an INSERT before
truncating the table.
The reason is that in TRUNCATE, a new RelFileNode is assigned to the relation
when creating a new file. In recovery, XLogReadBufferExtended() always opens the
RelFileNode and calls smgrnblocks() for that RelFileNode for the first time. And
for recovery processing, different RelFileNodes are used for the INSERTs to the
table and the TRUNCATE of the same table.

Based on the above, we cannot use the "cached" flag for both
DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers().
I am thinking that if we want consistency, correctness, and to still make use of
the optimization, we can completely drop the "cached" flag parameter in
smgrnblocks and use InRecovery instead.
Tsunakawa-san mentioned in [1] that it is safe because smgrclose is not called
by the startup process in recovery, and shared-inval messages are not sent to
the startup process.

Otherwise, we use the current patch form as it is: using "cached" in
DropRelFileNodeBuffers() and InRecovery for DropRelFileNodesAllBuffers().
However, that does not seem to be what is wanted in this thread.
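
To spell out the two options, roughly (illustrative only, not the exact patch
code; "threshold" stands in for whatever buffer-count limit the patch uses):

    /* Option A: keep the "cached" out-parameter and rely on it */
    nblocks = smgrnblocks(smgr_reln, forkNum, &cached);
    if (cached && nblocks < threshold)
        ;   /* use the optimized per-relation buffer invalidation */

    /* Option B: drop the out-parameter entirely and trust InRecovery */
    nblocks = smgrnblocks(smgr_reln, forkNum);
    if (InRecovery && nblocks < threshold)
        ;   /* use the optimized per-relation buffer invalidation */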

Thoughts?

Regards,
Kirk Jamison

[1] 
https://www.postgresql.org/message-id/TYAPR01MB2990B42570A5FAC349EE983AFEF40%40TYAPR01MB2990.jpnprd01.prod.outlook.com


Re: ModifyTable overheads in generic plans

2020-12-06 Thread Amit Langote
On Thu, Nov 12, 2020 at 5:04 PM Amit Langote  wrote:
> Attached new 0002 which does these adjustments.  I went with
> ri_RootTargetDesc to go along with ri_RelationDesc.
>
> Also, I have updated the original 0002 (now 0003) to make
> GetChildToRootMap() use ri_RootTargetDesc instead of
> ModifyTableState.rootResultRelInfo.ri_RelationDesc, so that even
> AfterTriggerSaveEvent() can now use that function.  This allows us to
> avoid having to initialize ri_ChildToRootMap anywhere but inside
> GetChildRootMap(), with that long comment defending doing so. :-)

These needed to be rebased due to recent copy.c upheavals.  Attached.

-- 
Amit Langote
EDB: http://www.enterprisedb.com


v11-0001-Set-ForeignScanState.resultRelInfo-lazily.patch
Description: Binary data


v11-0002-Rethink-ResultRelInfo.ri_PartitionRoot.patch
Description: Binary data


v11-0003-Initialize-result-relation-information-lazily.patch
Description: Binary data


Re: POC: Better infrastructure for automated testing of concurrency issues

2020-12-06 Thread Craig Ringer
On Wed, 25 Nov 2020 at 22:11, Alexander Korotkov 
wrote:

> Hackers,
>
> PostgreSQL is a complex multi-process system, and we are periodically
> faced with complicated concurrency issues. While the postgres community
> does a great job on investigating and fixing the problems, our ability to
> reproduce concurrency issues in the source code test suites is limited.
>
> I think we currently have two general ways to reproduce the concurrency
> issues.
> 1. A text scenario for manual reproduction of the issue, which could
> involve psql sessions, gdb sessions etc. Couple of examples are [1] and
> [2]. This method provides reliable reproduction of concurrency issues. But
> it's  hard to automate, because it requires external instrumentation
> (debugger) and it's not stable in terms of postgres code changes (that is
> particular line numbers for breakpoints could be changed). I think this is
> why we currently don't have such scenarios among postgres test suites.
> 2. Another way is to reproduce the concurrency issue without directly
> touching the database internals using pgbench or other way to simulate the
> workload (see [3] for example). This way is easier to automate, because it
> doesn't need external instrumentation and it's not so sensitive to source
> code changes. But at the same time this way is not reliable and is
> resource-consuming.
>

Agreed.

For a useful but limited set of cases there's (3) the isolation tester and
pg_isolation_regress. But IIRC the patches to teach it to support multiple
upstream nodes never got in, so it's essentially useless for any
replication related testing.

There's also (4), write a TAP test that uses concurrent psql sessions via
IPC::Run. Then play games with heavyweight or advisory lock waits to order
events, use instance starts/stops, change ports or connstrings to simulate
network issues, use SIGSTOP/SIGCONTs, add src/test/modules extensions that
inject faults or provide custom blocking wait functions for the event you
want, etc. I've done that more than I'd care to, and I don't want to do it
any more than I have to in future.

In some cases I've gone further and written tests that use systemtap in
"guru" mode (read/write, with embedded C enabled) to twiddle the memory of
the target process(es) when a probe is hit, e.g. to modify a function
argument or return value or inject a fault. Not exactly portable or
convenient, though very powerful.


In the view of above, I'd like to propose a POC patch, which implements new
> builtin infrastructure for reproduction of concurrency issues in automated
> test suites.  The general idea is so-called "stop events", which are
> special places in the code, where the execution could be stopped on some
> condition.  Stop event also exposes a set of parameters, encapsulated into
> jsonb value.  The condition over stop event parameters is defined using
> jsonpath language.
>

The patched PostgreSQL used by 2ndQuadrant internally has a feature called
PROBE_POINT()s that is somewhat akin to this. Since it's not a customer
facing feature I'm sure I can discuss it here, though I'll need to seek
permission before I can show code.

TL;DR: PROBE_POINT()s let you inject ERRORs, sleeps, crashes, and various
other behaviour at points in the code marked by name, using GUCs, hooks
loaded from test extensions, or even systemtap scripts to control what
fires and when. Performance impact is essentially zero when no probes are
currently enabled at runtime, so they're fine for cassert builds.

Details:

A PROBE_POINT() is a macro that works as a marker, a bit like a
TRACE_POSTGRESQL_ dtrace macro. But instead of the super lightweight
tracepoint that SDT marker points emit, a PROBE_POINT tests an
unlikely(probe_points_enabled) flag, and if true, it prepares arguments for
the probe handler: A probe name, probe action, sleep duration, and a hit
counter.

The default probe action and sleep duration come from GUCs. So your control
of the probe is limited to the granularity you can easily manage GUCs at.
That's often sufficient.
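
To be clear, this is not our actual code, just an illustration of the general
shape described above, with all the names invented for the example:

    #define PROBE_POINT(name) \
        do { \
            if (unlikely(probe_points_enabled)) \
                probe_point_hit((name), probe_default_action, \
                                probe_default_sleep_ms, &probe_hit_counts); \
        } while (0)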

But if you want finer control for something, there are two ways to achieve
it.

After picking the default arguments to the handler, the probe point checks
for a hook. If defined, it calls it with the probe point name and pointers
to the action and sleep duration values, so the hook function can modify
them per probe-point hit. That way you can use hook functions in
src/test/modules extensions or your own test extensions, with the probe
point name as an argument and the action and sleep duration as out-params,
as well as any accessible global state, custom GUCs you define in your test
extension, etc. That's usually enough to target a probe very specifically,
but it's a bit of a hassle.

Another option is to use a systemtap script. You can write your code in
systemtap with its language. When the systemtap marker for a probe point
event fires, decide if it's the one you want and twiddle the target process
variables that store the probe 

RE: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently

2020-12-06 Thread tsunakawa.ta...@fujitsu.com
From: Bharath Rupireddy 
> IMHO, we should also change the parent table. Say, I have 2 local
> partitions for a logged table, then I alter that table to
> unlogged(with your patch, parent table doesn't become unlogged whereas
> the partitions will), and I detach all the partitions for some reason.
> Now, the user would think that he set the parent table to unlogged but
> it didn't happen. So, I think we should also change the parent table's
> logged/unlogged property though it may not have data associated with
> it when it has all the partitions. Thoughts?

I'd like to think that the logged/unlogged property is basically specific to 
each storage unit, which is a partition here, and that ALTER TABLE on a partitioned 
table conveniently changes the properties of the underlying storage units.  (In 
that regard, it's unfortunate that trying to change storage parameters like 
fillfactor on a partitioned table with ALTER TABLE fails with an ERROR.)


> This may lead to a situation where a partition table has few
> partitions as logged and few as unlogged. Will it lead to some
> problems?

No, I don't see any problem.


> I think we can add foreign partition case into postgres_fdw.sql so
> that the tests will run as part of make check-world.

I was hesitant to add this test in postgres_fdw, but it seems to be a good 
place considering that postgres_fdw, and other contrib modules as well, are 
part of Postgres core.


> How about documenting all these points in alter_table.sgml,
> create_table.sgml and create_foreign_table.sgml under the partition
> section?

I'd like to add a statement in the description of ALTER TABLE SET 
LOGGED/UNLOGGED as other ALTER actions do.  I don't want to add a new partition 
section for all CREATE/ALTER actions in this patch.

If there's no objection, I think I'll submit the (hopefully final) revised 
patch after a few days.


Regards
Takayuki Tsunakawa



RE: Parallel Inserts in CREATE TABLE AS

2020-12-06 Thread Hou, Zhijie
Hi

+   /*
+    * Flag to let the planner know that the SELECT query is for CTAS. This is
+    * used to calculate the tuple transfer cost from workers to gather node(in
+    * case parallelism kicks in for the SELECT part of the CTAS), to zero as
+    * each worker will insert its share of tuples in parallel.
+    */
+   if (IsParallelInsertInCTASAllowed(into, NULL))
+       query->isForCTAS = true;


+   /*
+    * We do not compute the parallel_tuple_cost for CTAS because the number of
+    * tuples that are transferred from workers to the gather node is zero as
+    * each worker, in parallel, inserts the tuples that are resulted from its
+    * chunk of plan execution. This change may make the parallel plan cheap
+    * among all other plans, and influence the planner to consider this
+    * parallel plan.
+    */
+   if (!(root->parse->isForCTAS &&
+         root->query_level == 1))
+       run_cost += parallel_tuple_cost * path->path.rows;

I noticed that the parallel_tuple_cost will still be ignored
when Gather is not the top node.

Example:
Create table test(i int);
insert into test values(generate_series(1,1000,1));
explain create table ntest3 as select * from test where i < 200 limit 1;

  QUERY PLAN   
---
 Limit  (cost=1000.00..97331.33 rows=1000 width=4)
   ->  Gather  (cost=1000.00..97331.33 rows=1000 width=4)
 Workers Planned: 2
 ->  Parallel Seq Scan on test  (cost=0.00..96331.33 rows=417 width=4)
   Filter: (i < 200)


The isForCTAS will be true because this is CREATE TABLE AS, and the
query_level is always 1 because there is no subquery.
So even if Gather is not the top node, the parallel cost will still be ignored.

Does that work as expected?

Best regards,
houzj







Re: Improving spin-lock implementation on ARM.

2020-12-06 Thread Amit Khandekar
On Sat, 5 Dec 2020 at 02:55, Alexander Korotkov  wrote:
>
> On Wed, Dec 2, 2020 at 6:58 AM Krunal Bauskar  wrote:
> > Let me know what do you think about this analysis and any specific 
> > direction that we should consider to help move forward.
>
> BTW, it would be also nice to benchmark my lwlock patch on the
> Kunpeng.  I'm very optimistic about this patch, but it wouldn't be
> fair to completely throw it away.  It still might be useful for
> LSE-disabled builds.

Coincidentally I was also looking at some hotspot locations around
LWLockAcquire() and LWLockAttempt() for read-only workloads on both
arm64 and x86, and had narrowed down to the place where
pg_atomic_compare_exchange_u32() is called. So it's likely we are
working on the same hotspot area. When I get a chance, I do plan to
look at your patch while I myself am trying to see if we can do some
optimizations. This is unrelated to the optimization in this mail
thread, though, so it will need a different thread.



-- 
Thanks,
-Amit Khandekar
Huawei Technologies




Re: Parallel Inserts in CREATE TABLE AS

2020-12-06 Thread Bharath Rupireddy
Thanks for the comments.

On Mon, Dec 7, 2020 at 8:56 AM Zhihong Yu  wrote:
>
>
> +   (void) SetCurrentCommandIdUsedForWorker();
> +   myState->output_cid = GetCurrentCommandId(false);
>
> SetCurrentCommandIdUsedForWorker already has void as return type. The 
> '(void)' is not needed.
>

Removed.

>
> +* rd_createSubid is marked invalid, otherwise, the table is
> +* not allowed to extend by the workers.
>
> nit: to extend by the workers -> to be extended by the workers
>

Changed.

>
> For IsParallelInsertInCTASAllowed, logic is inside 'if (IS_CTAS(into))' block.
> You can return false when (!IS_CTAS(into)) - this would save some indentation 
> for the body.
>

Done.

>
> +   if (rel && rel->relpersistence != RELPERSISTENCE_TEMP)
> +   allowed = true;
>
> Similarly, when the above condition doesn't hold, you can return false 
> directly - reducing the next if condition to 'if (queryDesc)'.
>

Done.

>
> The composite condition is negated. Maybe you can write without negation:
>

Done.

>
> +* Write out the number of tuples this worker has inserted. Leader will 
> use
> +* it to inform to the end client.
>
> 'inform to the end client' -> 'inform the end client' (without to)
>

Changed.

Attaching v8 patch. Consider this for further review.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v8-0001-Parallel-Inserts-in-CREATE-TABLE-AS.patch
Description: Binary data


Re: Proposed patch for key managment

2020-12-06 Thread Bruce Momjian
On Mon, Dec  7, 2020 at 11:03:30AM +0900, Michael Paquier wrote:
> On Sat, Dec 05, 2020 at 10:42:05AM -0500, Bruce Momjian wrote:
> > On Sat, Dec  5, 2020 at 09:54:33PM +0900, Michael Paquier wrote:
> >> On Fri, Dec 04, 2020 at 10:52:29PM -0500, Bruce Momjian wrote:
> >>>   if (state->evpctx == NULL)
> >>>   {
> >>> - explicit_bzero(state, sizeof(pg_cryptohash_state));
> >>> - explicit_bzero(ctx, sizeof(pg_cryptohash_ctx));
> >>>  #ifndef FRONTEND
> >>>   ereport(ERROR,
> >>>   (errcode(ERRCODE_OUT_OF_MEMORY),
> >> 
> >> And this original position is IMO more correct.
> > 
> > Do we even need them?
> 
> That's the intention to clean up things in a consistent way.

Ah, I see your point.  Added:

https://github.com/postgres/postgres/compare/master...bmomjian:key.diff

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Proposed patch for key managment

2020-12-06 Thread Bruce Momjian
On Mon, Dec  7, 2020 at 09:30:03AM +0900, Masahiko Sawada wrote:
> Thank you for updating the patch!
> 
> I think we need explicit_bzero() also in freeing the keywrap context.

pg_cryptohash_free() already has this:

explicit_bzero(state, sizeof(pg_cryptohash_state));
explicit_bzero(ctx, sizeof(pg_cryptohash_ctx));

Do we need more?

> BTW, when we need -R option pg_ctl command to start the server, how
> can we start it in the single-user mode?

I added code for that, but I hadn't tested it yet.  Now that I tried it,
I realized that it is awkward to supply a file descriptor number (that
will be closed) from the command-line, so I added code and docs to allow
-1 to duplicate standard error, and it worked:

$ postgres --single -R -1 -D /u/pg/data

Enter password:
PostgreSQL stand-alone backend 14devel
backend> select 100;
 1: ?column?(typeid = 23, len = 4, typmod = -1, byval = t)

 1: ?column? = "100"(typeid = 23, len = 4, typmod = -1, byval = t)


Updated patch at the same URL:

https://github.com/postgres/postgres/compare/master...bmomjian:key.diff

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Single transaction in the tablesync worker?

2020-12-06 Thread Craig Ringer
On Mon, 7 Dec 2020 at 11:44, Peter Smith  wrote:

>
> Basically, I was wondering why can't the "tablesync" worker just
> gather messages in a similar way to how the current streaming feature
> gathers messages into a "changes" file, so that they can be replayed
> later.
>
>
See the related thread "Logical archiving"

https://www.postgresql.org/message-id/20d9328b-a189-43d1-80e2-eb25b9284...@yandex-team.ru

where I addressed some parts of this topic in detail earlier today.

A) The "tablesync" worker (after the COPY) does not ever apply any of
> the incoming messages, but instead it just gobbles them into a
> "changes" file until it decides it has reached SYNCDONE state and
> exits.
>

This has a few issues.

Most importantly, the sync worker must cooperate with the main apply worker
to achieve a consistent end-of-sync cutover. The sync worker must have
replayed the pending changes in order to make this cut-over, because the
non-sync apply worker will need to start applying changes on top of the
resync'd table potentially as soon as the next transaction it starts
applying, so it needs to see the rows there.

Doing this would also add another round of write multiplication since the
data would get spooled then applied to WAL then heap. Write multiplication
is already an issue for logical replication so adding to it isn't
particularly desirable without a really compelling reason. With  the write
multiplication comes disk space management issues for big transactions as
well as the obvious performance/throughput impact.

It adds even more latency between upstream commit and downstream apply,
something that is again already an issue for logical replication.

Right now we don't have any concept of a durable and locally flushed spool.

It's not impossible to do as you suggest but the cutover requirement makes
it far from simple. As discussed in the logical archiving thread I think
it'd be good to have something like this, and there are times the write
multiplication price would be well worth paying. But it's not easy.

> B) Then, when the "apply" worker proceeds, if it detects the existence
> of the "changes" file it will replay/apply_dispatch all those gobbled
> messages before just continuing as normal.
>

That's going to introduce a really big stall in the apply worker's progress
in many cases. During that time it won't be receiving from upstream (since
we don't spool logical changes to disk at this time) so the upstream lag
will grow. That will impact synchronous replication, pg_wal size
management, catalog bloat, etc. It'll also leave the upstream logical
decoding session idle, so when it resumes it may create a spike of I/O and
CPU load as it catches up, as well as a spike of network traffic. And
depending on how close the upstream write rate is to the max decode speed,
network throughput max, and downstream apply speed max, it may take some
time to catch up over the resulting lag.

Not a big fan of that approach.


RFC: Deadlock detector hooks for edge injection

2020-12-06 Thread Craig Ringer
Hi all

Related to my other post about deadlock detector hooks for victim
selection, I'd like to gauge opinions here about whether it's feasible to
inject edges into the deadlock detector's waits-for graph.

Doing so would help with detecting deadlocks relating to shm_mq waits, help
with implementing distributed deadlock detection solutions, make it
possible to spot deadlocks relating to condition-variable waits, etc.

I'm not sure quite how the implementation would look yet, this is an early
RFC and sanity check so I don't invest any work into it if it has no hope
of going anywhere.

I expect we'd want to build the graph only when the detector is triggered,
rather than proactively maintain such edges, so the code implementing the
hook would be responsible for keeping track of whatever state it needs to
in order to do so.

When called, it'd append struct EDGE entries to the deadlock detector's
waits-for list.

We'd need to support a node representation other than a LOCK* for struct
EDGE, and to abstract edge sorting (TopoSort etc) to allow for other edge
types. So it wouldn't be a trivial change to make, hence opening with this
RFC.

I expect it'd be fine to require each EDGE* to have a PGPROC and to require
the PGPROC for waits-for and waiting-for not be the same proc. Distributed
systems that use libpq connections to remote nodes, or anything else, would
have to register the local-side PGPROC as the involved waiter or waited-on
object, and handle any mapping between the remote object and local resource
holder/acquirer themselves, probably using their own shmem state.

Bonus points if the callback could assign weights to the injected edges to
bias victim selection more gently, or if there were a way to tag a waited-for
node as not being a candidate victim for cancellation.
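
To make the shape of the API concrete, something along these lines is what I
have in mind (all names are invented for illustration, nothing is set in
stone):

    typedef struct InjectedWaitEdge
    {
        PGPROC *waiter;     /* local proc that is waiting */
        PGPROC *blocker;    /* local proc it waits on */
        int     weight;     /* optional bias for victim selection */
    } InjectedWaitEdge;

    /*
     * Called while the deadlock detector builds its waits-for graph;
     * implementations append InjectedWaitEdge entries to *edges.
     */
    typedef void (*deadlock_inject_edges_hook_type) (List **edges);

    extern PGDLLIMPORT deadlock_inject_edges_hook_type deadlock_inject_edges_hook;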

General thoughts?


RFC: Deadlock detector hooks for victim selection and edge injection

2020-12-06 Thread Craig Ringer
Hi folks

Now that we're well on track for streaming logical decoding, it's becoming
more practical to look at parallel logical apply.

The support for this in pglogical3 benefits from a deadlock detector hook.
It was added in the optionally patched postgres that pglogical can use to
enable various extra features that weren't possible without core changes,
but it isn't present in community postgres yet.

I'd like to add it.

The main benefit is that it lets the logical replication support tell the
deadlock detector that it should prefer to kill the victim whose txn has a
higher upstream commit lsn. That helps encourage parallel logical apply to
make progress in the face of deadlocks between concurrent txns.

Any in-principle objections?


Re: Single transaction in the tablesync worker?

2020-12-06 Thread Amit Kapila
On Mon, Dec 7, 2020 at 6:20 AM Craig Ringer
 wrote:
>
> On Sat, 5 Dec 2020, 10:03 Amit Kapila,  wrote:
>>
>> On Fri, Dec 4, 2020 at 7:12 PM Ashutosh Bapat
>>  wrote:
>>
>> I think the problem is not that the changes are visible after COPY
>> rather it is that we don't have a mechanism to restart if it crashes
>> after COPY unless we do all the sync up in one transaction. Assume we
>> commit after COPY and then process transaction-by-transaction and it
>> errors out (due to connection loss) or crashes, in-between one of the
>> following transactions after COPY then after the restart we won't know
>> from where to start for that relation. This is because the catalog
>> (pg_subscription_rel) will show the state as 'd' (data is being
>> copied) and the slot would have gone as it was a temporary slot. But
>> as mentioned in one of my emails above [1] we can solve these problems
>> which Craig also seems to be advocating for as there are many
>> advantages of not doing the entire sync (initial copy + stream changes
>> for that relation) in one single transaction. It will allow us to
>> support decode of prepared xacts in the subscriber. Also, it seems
>> pglogical already does processing transaction-by-transaction after the
>> initial copy. The only thing which is not clear to me is why we
>> haven't decided to go ahead initially and it would be probably better
>> if the original authors would also chime-in to at least clarify the
>> same.
>
>
> It's partly a resource management issue.
>
> Replication origins are a limited resource. We need to use a replication 
> origin for any sync we want to be durable across restarts.
>
> Then again so are slots and we use temp slots for each sync.
>
> If a sync fails cleanup on the upstream side is simple with a temp slot. With 
> persistent slots we have more risk of creating upstream issues. But then, so 
> long as the subscriber exists it can deal with that. And if the subscriber no 
> longer exists its primary slot is an issue too.
>

I think if the only issue is slot cleanup, then the same exists today
for the slot created by the apply worker (which I think you are
referring to as a primary slot). This can only happen if the
subscriber goes away without dropping the subscription. Also, if we
are worried about using up too many slots then the slots used by
tablesync workers will probably be freed sooner.

> It'd help if we could register pg_shdepend entries between catalog entries 
> and slots, and from a main subscription slot to any extra slots used for 
> resynchronization.
>

Which catalog entries are you referring to here?

> And I should write a patch for a resource retention summarisation view.
>

That would be great.

>>
>> I am not sure why but it seems acceptable to original authors that the
>> data of transactions are visibly partially during the initial
>> synchronization phase for a subscription.
>
>
> I don't think there's much alternative there.
>

I am not sure about this. I think it is primarily to allow some more
parallelism among apply and sync workers. One primitive way to achieve
parallelism and avoid this problem is to make the apply worker wait
till all the tablesync workers are in DONE state. Then we will
never have an inconsistency problem or the prepared xact problem. Now,
surely if large copies are required for multiple relations then the
apply worker would be delayed a bit in replaying transactions, but I
don't know how much that matters compared to the transaction
visibility issue, and anyway we would have achieved the maximum
parallelism by allowing the copy via multiple workers.

-- 
With Regards,
Amit Kapila.




Re: Single transaction in the tablesync worker?

2020-12-06 Thread Peter Smith
Hi,

I wanted to float another idea to solve these tablesync/apply worker problems.

This idea may or may not have merit. Please consider it.

~

Basically, I was wondering why can't the "tablesync" worker just
gather messages in a similar way to how the current streaming feature
gathers messages into a "changes" file, so that they can be replayed
later.

e.g. Imagine if

A) The "tablesync" worker (after the COPY) does not ever apply any of
the incoming messages, but instead it just gobbles them into a
"changes" file until it decides it has reached SYNCDONE state and
exits.

B) Then, when the "apply" worker proceeds, if it detects the existence
of the "changes" file it will replay/apply_dispatch all those gobbled
messages before just continuing as normal.

So
- IIUC this kind of replay is like how the current code applies the
streamed "changes" file at stream commit.
- "tablesync" worker would only be doing table sync (COPY) as its name
suggests. Any detected "changes" are recorded and left for the "apply"
worker to handle.
- "tablesync" worker would just operate in single tx with a temporary
slot as per current code
- Then the "apply" worker would be the *only* worker that actually
applies anything. (as its name suggests)

Thoughts?

---

Kind Regards,
Peter Smith.
Fujitsu Australia




RFC: Giving bgworkers walsender-like grace during shutdown (for logical replication)

2020-12-06 Thread Craig Ringer
Hi folks

TL;DR: Anyone object to a new bgworker flag that exempts background workers
(such as logical apply workers) from the first round of fast shutdown
signals, and instead lets them finish their currently in-progress txn
and exit?

This is a change proposal raised for comment before patch submission so
please consider it. Explanation of why I think we need it comes first, then
proposed implementation.

Rationale:

Currently a fast shutdown causes logical replication subscribers to abort
their currently in-progress transaction and terminate along with user
backends. This means they cannot finish receiving and flushing the
currently in-progress transaction, possibly wasting a very large amount of
work.

After restart the subscriber must reconnect, decode and reorder buffer from
the restart_lsn up to the current confirmed_flush_lsn, receive the whole
txn on the wire all over again, and apply the whole txn again locally. We
don't currently spool received txn change-streams to disk on the subscriber
and flush them so we can't repeat just the local apply part (see the
related thread "Logical archiving" for relevant discussion there). This can
create a lot of bloat, a lot of excess WAL, etc, if a big txn was in
progress at the time.

I'd like to add a bgworker flag that tells the postmaster to treat the
logical apply bgworker (or extension equivalents) somewhat like a walsender
for the purpose of fast shutdown. Instead of immediately terminating it
like user backends on fast shutdown, the bgworker should be sent a
ProcSignal warning that shutdown is pending and instructing it to finish
receiving and applying its current transaction, then exit gracefully.

It's not quite the same as the walsender, since there we try to flush
changes to downstreams up to the end of the last commit before shutting
down. That doesn't make sense on a subscriber because the upstream is
likely still generating txns. We just want to avoid wasting our effort on
any in-flight txn.

Any objections?

Proposed implementation:

* Add new bgworker flag like BGW_DELAYED_SHUTDOWN

* Define new ProcSignal PROCSIG_SHUTDOWN_REQUESTED. On fast shutdown send
this instead of a SIGTERM to bgworker backends flagged
BGW_DELAYED_SHUTDOWN. On smart shutdown send it to all backends when the
shutdown request arrives, since that could be handy for other uses too.

* Flagged bgworker is expected to finish its current txn and exit promptly.
Impose a grace period after which they get SIGTERM'd anyway. Also send a
SIGTERM if the postmaster receives a second fast shutdown request.

* Defer sending PROCSIG_WALSND_INIT_STOPPING to walsenders until all
BGW_DELAYED_SHUTDOWN flagged bgworkers have exited, so we can ensure that
cascaded downstreams receive any txns applied from the upstream.

This doesn't look likely to be particularly complicated to implement.
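
For illustration, usage from an apply worker might look roughly like this
(BGW_DELAYED_SHUTDOWN and PROCSIG_SHUTDOWN_REQUESTED are just the names
proposed above, and the loop body is a placeholder, so treat this as a
sketch only):

    BackgroundWorker worker;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
                       BGWORKER_BACKEND_DATABASE_CONNECTION |
                       BGW_DELAYED_SHUTDOWN;    /* proposed flag */
    snprintf(worker.bgw_name, BGW_MAXLEN, "logical apply worker");
    RegisterBackgroundWorker(&worker);

    /*
     * In the worker's main loop, the handler for the proposed
     * PROCSIG_SHUTDOWN_REQUESTED just sets a flag that we test between
     * transactions, so an in-progress txn is finished before exiting.
     */
    for (;;)
    {
        apply_one_transaction();        /* placeholder for the real apply loop */

        if (DelayedShutdownRequested)   /* set by the proposed procsignal */
            proc_exit(0);
    }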

It might be better to use a flag in PGPROC rather than the bgworker struct,
in case we want to extend this to other backend types in future. Also to
make it easier for the postmaster to check the flag during shutdown. Could
just claim a bit from statusFlags for the purpose. Thoughts?


Re: Parallel Inserts in CREATE TABLE AS

2020-12-06 Thread Zhihong Yu
Hi, Bharath :

+   (void) SetCurrentCommandIdUsedForWorker();
+   myState->output_cid = GetCurrentCommandId(false);

SetCurrentCommandIdUsedForWorker already has void as return type. The
'(void)' is not needed.

+* rd_createSubid is marked invalid, otherwise, the table is
+* not allowed to extend by the workers.

nit: to extend by the workers -> to be extended by the workers

For IsParallelInsertInCTASAllowed, logic is inside 'if (IS_CTAS(into))'
block.
You can return false when (!IS_CTAS(into)) - this would save some
indentation for the body.

+   if (rel && rel->relpersistence != RELPERSISTENCE_TEMP)
+   allowed = true;

Similarly, when the above condition doesn't hold, you can return false
directly - reducing the next if condition to 'if (queryDesc)'.

+   if (!(ps && IsA(ps, GatherState) && !ps->ps_ProjInfo &&
+   plannedstmt->parallelModeNeeded &&
+   plannedstmt->planTree &&
+   IsA(plannedstmt->planTree, Gather) &&
+   plannedstmt->planTree->lefttree &&
+   plannedstmt->planTree->lefttree->parallel_aware &&
+   plannedstmt->planTree->lefttree->parallel_safe))

The composite condition is negated. Maybe you can write without negation:

+   return (ps && IsA(ps, GatherState) && !ps->ps_ProjInfo &&
+   plannedstmt->parallelModeNeeded &&
+   plannedstmt->planTree &&
+   IsA(plannedstmt->planTree, Gather) &&
+   plannedstmt->planTree->lefttree &&
+   plannedstmt->planTree->lefttree->parallel_aware &&
+   plannedstmt->planTree->lefttree->parallel_safe)

+* Write out the number of tuples this worker has inserted. Leader will
use
+* it to inform to the end client.

'inform to the end client' -> 'inform the end client' (without to)

Cheers

On Sun, Dec 6, 2020 at 4:37 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> Thanks Amit for the review comments.
>
> On Sat, Dec 5, 2020 at 4:27 PM Amit Kapila 
> wrote:
> >
> > > If I'm not wrong, I think currently we have no exec nodes for DDLs.
> > > I'm not sure whether we would like to introduce one for this.
> >
> > Yeah, I am also not in favor of having an executor node for CTAS but
> > OTOH, I also don't like the way you have jammed the relevant
> > information in generic PlanState. How about keeping it in GatherState
> > and initializing it in ExecCreateTableAs() after the executor start.
> > You are already doing special treatment for the Gather node in
> > ExecCreateTableAs (via IsParallelInsertInCTASAllowed) so we can as
> > well initialize the required information in GatherState in
> > ExecCreateTableAs. I think that might help in reducing the special
> > treatment for intoclause at different places.
> >
>
> Done. Added required info to GatherState node. While this reduced the
> changes at many other places, but had to pass the into clause and
> object id to ExecInitParallelPlan() as we do not send GatherState node
> to it. Hope that's okay.
>
> >
> > Few other assorted comments:
> > =
> > 1.
> > This looks a bit odd. The function name
> > 'IsParallelInsertInCTASAllowed' suggests that it just checks whether
> > parallelism is allowed but it is internally changing the plan_rows. It
> > might be better to do this separately if the parallelism is allowed.
> >
>
> Changed.
>
> >
> > 2.
> >  static void ExecShutdownGatherWorkers(GatherState *node);
> > -
> > +static void ExecParallelInsertInCTAS(GatherState *node);
> >
> > Spurious line removal.
> >
>
> Corrected.
>
> >
> > 3.
> > The comment and code appear a bit misleading as the function seems to
> > shutdown the workers rather than waiting for them to finish. How about
> > using something like below:
> >
> > /*
> > * Next, accumulate buffer and WAL usage.  (This must wait for the workers
> > * to finish, or we might get incomplete data.)
> > */
> > if (nworkers > 0)
> > {
> > int i;
> >
> > /* Wait for all vacuum workers to finish */
> > WaitForParallelWorkersToFinish(lps->pcxt);
> >
> > for (i = 0; i < lps->pcxt->nworkers_launched; i++)
> > InstrAccumParallelQuery(&lps->buffer_usage[i], &lps->wal_usage[i]);
> > }
> >
> > This is how it works for parallel vacuum.
> >
>
> Done.
>
> >
> > 4.
> > The above comment doesn't seem to convey what it intends to convey.
> > How about changing it slightly as: "We don't compute the
> > parallel_tuple_cost for CTAS because the number of tuples that are
> > transferred from workers to the gather node is zero as each worker
> > parallelly inserts the tuples that are resulted from its chunk of plan
> > execution. This change may make the parallel plan cheap among all
> > other plans, and influence the planner to consider this parallel
> > plan."
> >
>
> Changed.
>
> >
> > Then, we can also have an Assert for path->path.rows to zero for the
> CTAS case.
> >
>
> We can not have Assert(path->path.rows == 0), because we are not
> 

Re: Logical archiving

2020-12-06 Thread Craig Ringer
Actually CC'd Petr this time.

On Mon, 7 Dec 2020 at 11:05, Craig Ringer 
wrote:

> Reply follows inline. I addressed your last point first, so it's out of
> order.
>
> On Fri, 4 Dec 2020 at 15:33, Andrey Borodin  wrote
>
> > If OLAP cannot consume data fast enough - we are out of space due to
> repl slot.
>
> There is a much simpler solution to this than logical PITR.
>
> What we should be doing is teaching xlogreader how to invoke the
> restore_command to fetch archived WALs for decoding.
>
> Replication slots already have a WAL retention limit, but right now when
> that limit is reached the slot is invalidated and becomes useless, it's
> effectively dropped. Instead, if WAL archiving is enabled, we should leave
> the slot as valid. If a consumer of the slot needs WAL that no longer
> exists in pg_wal, we should have the walsender invoke the restore_command
> to read the missing WAL segment, decode it, and remove it again.
>
> This would not be a technically difficult patch, and it's IMO one of the
> more important ones for improving logical replication.
>
> > I was discussing problems of CDC with scientific community and they
> asked this simple question: "So you have efficient WAL archive on a very
> cheap storage, why don't you have a logical archive too?"
>
> I've done work in this area, as has Petr (CC'd).
>
> In short, logical archiving and PITR is very much desirable, but we're not
> nearly ready for it yet and we're missing a lot of the foundations needed
> to make it really useful.
>
> IMO the strongest pre-requisite is that we need integrated DDL capture and
> replication in Pg. While this could be implemented in the
> publisher/subscriber logic for logical replication, it would make much more
> sense (IMO) to make it easier to feed DDL events into any logical
> replication output plugin.
>
> pglogical3 (the closed one) has quite comprehensive DDL replication
> support. Doing it is not simple though - there are plenty of complexities:
>
> * Reliably identifying the target objects and mapping them to replication
> set memberships for DML-replication
> * Capturing, replicating and managing the search_path and other DDL
> execution context (DateStyle and much more) reliably
>
>- Each statement type needs specific logic to indicate whether it
>needs DDL replication (and often filter functions since we have lots of
>sub-types where some need replication and some don't)
>- Handling DDL affecting global objects in pg_global correctly, like
>those affecting roles, grants, database security labels etc. There's no one
>right answer for this, it depends on the deployment and requires the user
>to cooperate.
>- Correct handling of transactions that mix DDL and DML (mostly only
>an issue for multimaster).
>- Identifying statements that target a mix of replicated and
>non-replicated objects and handling them appropriately, including for
>CASCADEs
>- Gracefully handling DDL statements that mix TEMPORARY and persistent
>targets. We can do this ok for DROPs but it still requires care. Anything
>else gets messier.
>- Lack of hooks into table rewrite operations and the extremely clumsy
>and inefficient way logical decoding currently exposes decoding of the
>temp-table data during decoding of rewrites means handling table-rewriting
>DDL is difficult and impractical to do correctly. In pglogical we punt on
>it entirely and refuse to permit DDL that would rewrite a table except
>where we can prove it's reliant only on immutable inputs so we can discard
>the upstream rewrite and rely on statement replication.
>- As a consequence of the above, reliably determining whether a given
>statement will cause a table rewrite.
>- Handling re-entrant ProcessUtility_hook calls for ALTER TABLE etc.
>- Handling TRUNCATE's pseudo-DDL pseudo-DML halfway state, doing
>something sensible for truncate cascade.
>- Probably more I've forgotten
>
>
> If we don't handle these, then any logical change-log archives will become
> largely useless as soon as there's any schema change.
>
> So we kind of have to solve DDL replication first IMO.
>
> Some consideration is also required for metadata management. Right now
> relation and type metadata has session-lifetime, but you'd want to be able
> to discard old logical change-stream archives and have the later ones still
> be usable. So we'd need to define some kind of restartpoint where we repeat
> the metadata, or we'd have to support externalizing the metadata so it can
> be retained when the main change archives get aged out.
>
> We'd also need to separate the existing apply worker into a "receiver" and
> "apply/writer" part, so the wire-protocol handling isn't tightly coupled
> with the actual change apply code, in order to make it possible to actually
> consume those archives and apply them to the database. In pglogical3 we did
> that by splitting them into two processes, 

Re: Logical archiving

2020-12-06 Thread Craig Ringer
Reply follows inline. I addressed your last point first, so it's out of
order.

On Fri, 4 Dec 2020 at 15:33, Andrey Borodin  wrote

> If OLAP cannot consume data fast enough - we are out of space due to repl
slot.

There is a much simpler solution to this than logical PITR.

What we should be doing is teaching xlogreader how to invoke the
restore_command to fetch archived WALs for decoding.

Replication slots already have a WAL retention limit, but right now when
that limit is reached the slot is invalidated and becomes useless, it's
effectively dropped. Instead, if WAL archiving is enabled, we should leave
the slot as valid. If a consumer of the slot needs WAL that no longer
exists in pg_wal, we should have the walsender invoke the restore_command
to read the missing WAL segment, decode it, and remove it again.

This would not be a technically difficult patch, and it's IMO one of the
more important ones for improving logical replication.

> I was discussing problems of CDC with scientific community and they asked
this simple question: "So you have efficient WAL archive on a very cheap
storage, why don't you have a logical archive too?"

I've done work in this area, as has Petr (CC'd).

In short, logical archiving and PITR is very much desirable, but we're not
nearly ready for it yet and we're missing a lot of the foundations needed
to make it really useful.

IMO the strongest pre-requisite is that we need integrated DDL capture and
replication in Pg. While this could be implemented in the
publisher/subscriber logic for logical replication, it would make much more
sense (IMO) to make it easier to feed DDL events into any logical
replication output plugin.

pglogical3 (the closed one) has quite comprehensive DDL replication
support. Doing it is not simple though - there are plenty of complexities:

* Reliably identifying the target objects and mapping them to replication
set memberships for DML-replication
* Capturing, replicating and managing the search_path and other DDL
execution context (DateStyle and much more) reliably

   - Each statement type needs specific logic to indicate whether it needs
   DDL replication (and often filter functions since we have lots of sub-types
   where some need replication and some don't)
   - Handling DDL affecting global objects in pg_global correctly, like
   those affecting roles, grants, database security labels etc. There's no one
   right answer for this, it depends on the deployment and requires the user
   to cooperate.
   - Correct handling of transactions that mix DDL and DML (mostly only an
   issue for multimaster).
   - Identifying statements that target a mix of replicated and
   non-replicated objects and handling them appropriately, including for
   CASCADEs
   - Gracefully handling DDL statements that mix TEMPORARY and persistent
   targets. We can do this ok for DROPs but it still requires care. Anything
   else gets messier.
   - Lack of hooks into table rewrite operations and the extremely clumsy
   and inefficient way logical decoding currently exposes decoding of the
   temp-table data during decoding of rewrites means handling table-rewriting
   DDL is difficult and impractical to do correctly. In pglogical we punt on
   it entirely and refuse to permit DDL that would rewrite a table except
   where we can prove it's reliant only on immutable inputs so we can discard
   the upstream rewrite and rely on statement replication.
   - As a consequence of the above, reliably determining whether a given
   statement will cause a table rewrite.
   - Handling re-entrant ProcessUtility_hook calls for ALTER TABLE etc.
   - Handling TRUNCATE's pseudo-DDL pseudo-DML halfway state, doing
   something sensible for truncate cascade.
   - Probably more I've forgotten


If we don't handle these, then any logical change-log archives will become
largely useless as soon as there's any schema change.

So we kind of have to solve DDL replication first IMO.

Some consideration is also required for metadata management. Right now
relation and type metadata has session-lifetime, but you'd want to be able
to discard old logical change-stream archives and have the later ones still
be usable. So we'd need to define some kind of restartpoint where we repeat
the metadata, or we'd have to support externalizing the metadata so it can
be retained when the main change archives get aged out.

We'd also need to separate the existing apply worker into a "receiver" and
"apply/writer" part, so the wire-protocol handling isn't tightly coupled
with the actual change apply code, in order to make it possible to actually
consume those archives and apply them to the database. In pglogical3 we did
that by splitting them into two processes, connected by a shm_mq.
Originally the process split was optional and you could run a combined
receiver/writer process without the shm_mq if you wanted, but we quickly
found it difficult to reliably handle locking issues etc that way so the
writers all moved 

Re: A few new options for CHECKPOINT

2020-12-06 Thread Michael Paquier
On Sun, Dec 06, 2020 at 10:03:08AM -0500, Stephen Frost wrote:
> * Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
>> You keep making this statement, and I don't necessarily disagree, but if
>> that is the case, please explain why don't we have
>> checkpoint_completion_target set to 0.9 by default?  Should we change
>> that?
> 
> Yes, I do think we should change that..

Agreed.  FWIW, I don't know about others, but it is one of those parameters I
keep telling people to update after a default installation.

> In fact, I'd argue that we can
> probably get rid of checkpoint_completion_target entirely as an option.
> The main argument against that is that it could be annoying for people
> upgrading, but changing the default to 0.9 would definitely be an
> improvement.

Not sure there is enough ground to do that though.
--
Michael




Re: Proposed patch for key managment

2020-12-06 Thread Michael Paquier
On Sat, Dec 05, 2020 at 10:42:05AM -0500, Bruce Momjian wrote:
> On Sat, Dec  5, 2020 at 09:54:33PM +0900, Michael Paquier wrote:
>> On Fri, Dec 04, 2020 at 10:52:29PM -0500, Bruce Momjian wrote:
>>> if (state->evpctx == NULL)
>>> {
>>> -   explicit_bzero(state, sizeof(pg_cryptohash_state));
>>> -   explicit_bzero(ctx, sizeof(pg_cryptohash_ctx));
>>>  #ifndef FRONTEND
>>> ereport(ERROR,
>>> (errcode(ERRCODE_OUT_OF_MEMORY),
>> 
>> And this original position is IMO more correct.
> 
> Do we even need them?

That's the intention to clean up things in a consistent way.
--
Michael




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-12-06 Thread Zhihong Yu
> +   /* Make a guess at a good size when we're not given a valid size. */
> +   if (size == 0)
> +   size = 1024;
>
> Should the default size be logged ?

> I'm not too sure if I know what you mean here. Should it be a power of
> 2? It is.  Or do you mean I shouldn't use a magic number?

Using 1024 seems to be fine. I meant logging the default value of 1024 so
that a user / dev can have a better idea of the actual value used (without
looking at the code).

>> Or do you think 98% is not a good number?

Since you have played with Result Cache, I would trust your judgment in
choosing the percentage.
It is fine not to expose this constant until the need arises.

Cheers

On Sun, Dec 6, 2020 at 5:15 PM David Rowley  wrote:

> On Sat, 5 Dec 2020 at 16:51, Zhihong Yu  wrote:
> >
> > There are two blocks with almost identical code (second occurrence in
> cache_store_tuple):
> >
> > +   if (rcstate->mem_used > rcstate->mem_upperlimit)
> > +   {
> > It would be nice if the code can be extracted to a method and shared.
>
> It's true, they're *almost* identical.  I quite like the fact that one
> of the cases can have an unlikely() macro in there. It's pretty
> unlikely that we'd go into cache overflow just when adding a new cache
> entry. work_mem would likely have to be set to a couple of dozen bytes
> for that to happen. 64k is the lowest it can be set.  However, I
> didn't really check to see if having that unlikely() macro increases
> performance.  I've changed things locally here to add a new function
> named cache_check_mem(). I'll keep that for now, but I'm not sure if
> there's enough code there to warrant a function. The majority of the
> additional lines are from the comment being duplicated.
>
> > node->rc_status = RC_END_OF_SCAN;
> > return NULL;
> > }
> > else
> >
> > There are several places where the else keyword for else block can be
> omitted because the if block ends with return.
> > This would allow the code in else block to move leftward (for easier
> reading).
>
> OK, I've removed the "else" in places where it can be removed.
>
> > if (!get_op_hash_functions(hashop, &left_hashfn, &right_hashfn))
> >
> > I noticed that right_hashfn isn't used. Would this cause some warning
> from the compiler (for some compiler the warning would be treated as error)
> ?
> > Maybe NULL can be passed as the last parameter. The return value of
> get_op_hash_functions would keep the current meaning (find both hash fn's).
>
> It's fine not to use output parameters.  It's not the only one in the
> code base ignoring one from that very function. See
> execTuplesHashPrepare().
>
> > rcstate->mem_lowerlimit = rcstate->mem_upperlimit * 0.98;
> >
> > Maybe (in subsequent patch) GUC variable can be introduced for tuning
> the constant 0.98.
>
> I don't think exposing such knobs for users to adjust is a good idea.
> Can you think of a case where you'd want to change it? Or do you think
> 98% is not a good number?
>
> >
> > For +paraminfo_get_equal_hashops :
> >
> > +   else
> > +   Assert(false);
>
> I'm keen to leave it like it is.  I don't see any need to bloat the
> compiled code with an elog(ERROR).
>
> There's a comment in RelOptInfo.lateral_vars that mentions:
>
> /* LATERAL Vars and PHVs referenced by rel */
>
> So, if anyone, in the future, wants to add some other node type to
> that list then they'll have a bit more work to do. Plus, I'm only
> doing the same as what's done in create_lateral_join_info().
>
> I'll run the updated patch which includes the cache_check_mem()
> function for a bit and post an updated patch to the main thread a bit
> later.
>
> Thanks for having a look at this patch.
>
> David
>


Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-12-06 Thread David Rowley
On Sat, 5 Dec 2020 at 16:51, Zhihong Yu  wrote:
>
> There are two blocks with almost identical code (second occurrence in 
> cache_store_tuple):
>
> +   if (rcstate->mem_used > rcstate->mem_upperlimit)
> +   {
> It would be nice if the code can be extracted to a method and shared.

It's true, they're *almost* identical.  I quite like the fact that one
of the cases can have an unlikely() macro in there. It's pretty
unlikely that we'd go into cache overflow just when adding a new cache
entry. work_mem would likely have to be set to a couple of dozen bytes
for that to happen. 64k is the lowest it can be set.  However, I
didn't really check to see if having that unlikely() macro increases
performance.  I've changed things locally here to add a new function
named cache_check_mem(). I'll keep that for now, but I'm not sure if
there's enough code there to warrant a function. The majority of the
additional lines are from the comment being duplicated.
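
For reference, the shape of that helper is roughly the following.  This is
only a sketch of the idea; the name cache_check_mem() comes from the text
above, but the exact signature and behaviour are guesses rather than the
actual patch:

    /*
     * cache_check_mem
     *		Check if we've gone over the cache's memory budget and, if so,
     *		evict entries.  Returns false if 'specialkey' itself had to be
     *		evicted to get back under the limit.
     */
    static bool
    cache_check_mem(ResultCacheState *rcstate, ResultCacheKey *specialkey)
    {
        if (rcstate->mem_used > rcstate->mem_upperlimit)
            return cache_reduce_memory(rcstate, specialkey);

        return true;
    }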

> node->rc_status = RC_END_OF_SCAN;
> return NULL;
> }
> else
>
> There are several places where the else keyword for else block can be omitted 
> because the if block ends with return.
> This would allow the code in else block to move leftward (for easier reading).

OK, I've removed the "else" in places where it can be removed.

> if (!get_op_hash_functions(hashop, &left_hashfn, &right_hashfn))
>
> I noticed that right_hashfn isn't used. Would this cause some warning from 
> the compiler (for some compiler the warning would be treated as error) ?
> Maybe NULL can be passed as the last parameter. The return value of 
> get_op_hash_functions would keep the current meaning (find both hash fn's).

It's fine not to use output parameters.  It's not the only one in the
code base ignoring one from that very function. See
execTuplesHashPrepare().

> rcstate->mem_lowerlimit = rcstate->mem_upperlimit * 0.98;
>
> Maybe (in subsequent patch) GUC variable can be introduced for tuning the 
> constant 0.98.

I don't think exposing such knobs for users to adjust is a good idea.
Can you think of a case where you'd want to change it? Or do you think
98% is not a good number?

>
> For +paraminfo_get_equal_hashops :
>
> +   else
> +   Assert(false);

I'm keen to leave it like it is.  I don't see any need to bloat the
compiled code with an elog(ERROR).

There's a comment in RelOptInfo.lateral_vars that mentions:

/* LATERAL Vars and PHVs referenced by rel */

So, if anyone, in the future, wants to add some other node type to
that list then they'll have a bit more work to do. Plus, I'm only
doing the same as what's done in create_lateral_join_info().

I'll run the updated patch which includes the cache_check_mem()
function for a bit and post an updated patch to the main thread a bit
later.

Thanks for having a look at this patch.

David




Re: Single transaction in the tablesync worker?

2020-12-06 Thread Craig Ringer
On Sat, 5 Dec 2020, 10:03 Amit Kapila,  wrote:

> On Fri, Dec 4, 2020 at 7:12 PM Ashutosh Bapat
>  wrote:
> >
> > On Thu, Dec 3, 2020 at 7:24 PM Amit Kapila 
> wrote:
> > >
> > > On Thu, Dec 3, 2020 at 7:04 PM Ashutosh Bapat
> > >  wrote:
> > > >
> > > > On Thu, Dec 3, 2020 at 2:55 PM Amit Kapila 
> wrote:
> > > > >
> > > > > The tablesync worker in logical replication performs the table data
> > > > > sync in a single transaction which means it will copy the initial
> data
> > > > > and then catch up with apply worker in the same transaction. There
> is
> > > > > a comment in LogicalRepSyncTableStart ("We want to do the table
> data
> > > > > sync in a single transaction.") saying so but I can't find the
> > > > > concrete theory behind the same. Is there any fundamental problem
> if
> > > > > we commit the transaction after initial copy and slot creation in
> > > > > LogicalRepSyncTableStart and then allow the apply of transactions
> as
> > > > > it happens in apply worker? I have tried doing so in the attached
> (a
> > > > > quick prototype to test) and didn't find any problems with
> regression
> > > > > tests. I have tried a few manual tests as well to see if it works
> and
> > > > > didn't find any problem. Now, it is quite possible that it is
> > > > > mandatory to do the way we are doing currently, or maybe something
> > > > > else is required to remove this requirement but I think we can do
> > > > > better with respect to comments in this area.
> > > >
> > > > If we commit the initial copy, the data upto the initial copy's
> > > > snapshot will be visible downstream. If we apply the changes by
> > > > committing changes per transaction, the data visible to the other
> > > > transactions will differ as the apply progresses.
> > > >
> > >
> > > It is not clear what you mean by the above.  The way you have written
> > > appears that you are saying that instead of copying the initial data,
> > > I am saying to copy it transaction-by-transaction. But that is not the
> > > case. I am saying copy the initial data by using REPEATABLE READ
> > > isolation level as we are doing now, commit it and then process
> > > transaction-by-transaction till we reach sync-point (point till where
> > > apply worker has already received the data).
> >
> > Craig in his mail has clarified this. The changes after the initial
> > COPY will be visible before the table sync catches up.
> >
>
> I think the problem is not that the changes are visible after COPY
> rather it is that we don't have a mechanism to restart if it crashes
> after COPY unless we do all the sync up in one transaction. Assume we
> commit after COPY and then process transaction-by-transaction and it
> errors out (due to connection loss) or crashes, in-between one of the
> following transactions after COPY then after the restart we won't know
> from where to start for that relation. This is because the catalog
> (pg_subscription_rel) will show the state as 'd' (data is being
> copied) and the slot would have gone as it was a temporary slot. But
> as mentioned in one of my emails above [1] we can solve these problems
> which Craig also seems to be advocating for as there are many
> advantages of not doing the entire sync (initial copy + stream changes
> for that relation) in one single transaction. It will allow us to
> support decode of prepared xacts in the subscriber. Also, it seems
> pglogical already does processing transaction-by-transaction after the
> initial copy. The only thing which is not clear to me is why we
> haven't decided to go ahead initially and it would be probably better
> if the original authors would also chime-in to at least clarify the
> same.
>

It's partly a resource management issue.

Replication origins are a limited resource. We need to use a replication
origin for any sync we want to be durable across restarts.

Then again so are slots and we use temp slots for each sync.

If a sync fails, cleanup on the upstream side is simple with a temp slot.
With persistent slots we have more risk of creating upstream issues. But
then, so long as the subscriber exists, it can deal with that. And if the
subscriber no longer exists, its primary slot is an issue too.

It'd help if we could register pg_shdepend entries between catalog entries
and slots, and from a main subscription slot to any extra slots used for
resynchronization.

And I should write a patch for a resource retention summarisation view.


> I am not sure why but it seems acceptable to original authors that the
> data of transactions is partially visible during the initial
> synchronization phase for a subscription.


I don't think there's much alternative there.

Pg would need some kind of cross-commit visibility control mechanism that
separates durable commit from visibility.


Re: Parallel Inserts in CREATE TABLE AS

2020-12-06 Thread Bharath Rupireddy
Thanks Amit for the review comments.

On Sat, Dec 5, 2020 at 4:27 PM Amit Kapila  wrote:
>
> > If I'm not wrong, I think currently we have no exec nodes for DDLs.
> > I'm not sure whether we would like to introduce one for this.
>
> Yeah, I am also not in favor of having an executor node for CTAS but
> OTOH, I also don't like the way you have jammed the relevant
> information in generic PlanState. How about keeping it in GatherState
> and initializing it in ExecCreateTableAs() after the executor start.
> You are already doing special treatment for the Gather node in
> ExecCreateTableAs (via IsParallelInsertInCTASAllowed) so we can as
> well initialize the required information in GatherState in
> ExecCreateTableAs. I think that might help in reducing the special
> treatment for intoclause at different places.
>

Done. Added the required info to the GatherState node. While this reduced
the changes at many other places, I had to pass the into clause and object
id to ExecInitParallelPlan() as we do not send the GatherState node to it.
Hope that's okay.
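
For reference, based on the fields quoted in comment 5 below, the information
that moved into GatherState amounts to roughly this (a sketch, not the exact
v7 hunk):

    /* In GatherState: parallel-inserts-in-CTAS information (sketch) */
    IntoClause *intoclause;    /* CTAS INTO clause */
    Oid         objectid;      /* OID of the table being created */
    DestReceiver *dest;        /* CTAS destination receiver */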

>
> Few other assorted comments:
> =
> 1.
> This looks a bit odd. The function name
> 'IsParallelInsertInCTASAllowed' suggests that it just checks whether
> parallelism is allowed but it is internally changing the plan_rows. It
> might be better to do this separately if the parallelism is allowed.
>

Changed.

>
> 2.
>  static void ExecShutdownGatherWorkers(GatherState *node);
> -
> +static void ExecParallelInsertInCTAS(GatherState *node);
>
> Spurious line removal.
>

Corrected.

>
> 3.
> The comment and code appear a bit misleading as the function seems to
> shutdown the workers rather than waiting for them to finish. How about
> using something like below:
>
> /*
> * Next, accumulate buffer and WAL usage.  (This must wait for the workers
> * to finish, or we might get incomplete data.)
> */
> if (nworkers > 0)
> {
> int i;
>
> /* Wait for all vacuum workers to finish */
> WaitForParallelWorkersToFinish(lps->pcxt);
>
> for (i = 0; i < lps->pcxt->nworkers_launched; i++)
> InstrAccumParallelQuery(&lps->buffer_usage[i], &lps->wal_usage[i]);
> }
>
> This is how it works for parallel vacuum.
>

Done.

>
> 4.
> The above comment doesn't seem to convey what it intends to convey.
> How about changing it slightly as: "We don't compute the
> parallel_tuple_cost for CTAS because the number of tuples that are
> transferred from workers to the gather node is zero as each worker
> parallelly inserts the tuples that are resulted from its chunk of plan
> execution. This change may make the parallel plan cheap among all
> other plans, and influence the planner to consider this parallel
> plan."
>

Changed.

>
> Then, we can also have an Assert for path->path.rows to zero for the CTAS 
> case.
>

We cannot have Assert(path->path.rows == 0), because we are not changing
this parameter upstream in or before the planning phase; we are just
skipping taking it into account for CTAS. We would have to add extra checks
in various places if we wanted the planner to set path->path.rows to 0 for
CTAS. IMHO, that's not necessary. We can just skip taking this value into
account in cost_gather. Thoughts?

>
> 5.
> + /* Prallel inserts in CTAS related info is specified below. */
> + IntoClause *intoclause;
> + Oid objectid;
> + DestReceiver *dest;
>  } PlanState;
>
> Typo. /Prallel/Parallel
>

Corrected.

>
> 6.
> Currently, it seems the plan look like:
>  Gather (actual time=970.524..972.913 rows=0 loops=1)
>->  Create t1_test
>  Workers Planned: 2
>  Workers Launched: 2
>  ->  Parallel Seq Scan on t1 (actual time=0.028..86.623 rows=33 
> loops=3)
>
> I would prefer it to be:
> Gather (actual time=970.524..972.913 rows=0 loops=1)
>  Workers Planned: 2
>  Workers Launched: 2
> ->  Create t1_test
>  ->  Parallel Seq Scan on t1 (actual time=0.028..86.623 rows=33 
> loops=3)
>
> This way it looks like the writing part is done below the Gather node
> and also it will match the Parallel Insert patch of Greg.
>

Done.

Attaching v7 patch. Please review it further.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 7eff47d971c0ac5722ddc7de5a4c7e7550bb71aa Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sun, 6 Dec 2020 08:21:42 +0530
Subject: [PATCH v7] Parallel Inserts in CREATE TABLE AS

The idea of this patch is to allow the leader and each worker
insert the tuples in parallel if the SELECT part of the CTAS is
parallelizable.

The design:
Let the planner know that the SELECT is from CTAS in createas.c
so that it can set the number of tuples transferred from the
workers to Gather node to 0. With this change, there are chances
that the planner may choose the parallel plan. After the planning,
check if the upper plan node is Gather in createas.c and mark a
parallelism flag in the CTAS dest receiver. Pass the into clause,
object id, command id from the leader to workers, so that each
worker can create its own CTAS dest receiver. 

Re: Proposed patch for key managment

2020-12-06 Thread Masahiko Sawada
On Sun, 6 Dec 2020 at 00:42, Bruce Momjian  wrote:
>
> On Sat, Dec  5, 2020 at 09:54:33PM +0900, Michael Paquier wrote:
> > On Fri, Dec 04, 2020 at 10:52:29PM -0500, Bruce Momjian wrote:
> > > OK, I worked with Sawada-san and added the attached patch.  The updated
> > > full patch is at the same URL:  :-)
> > >
> > > 
> > > https://github.com/postgres/postgres/compare/master...bmomjian:key.diff
> >
> > Oh, I see that you use SHA256 during firstboot, which is why you
> > bypass the resowner paths in cryptohash_openssl.c.  Wouldn't it be
> > better to use IsBootstrapProcessingMode() then?
>
> I tried that, but we also use the resource references before the
> resource system is started even in non-bootstrap mode.  Maybe we should
> be creating a resource owner for this, but it is so early in the boot
> process I don't know if that is safe.
>
> > > @@ -72,14 +72,15 @@ pg_cryptohash_create(pg_cryptohash_type type)
> > > ctx = ALLOC(sizeof(pg_cryptohash_ctx));
> > > if (ctx == NULL)
> > > return NULL;
> > > +   explicit_bzero(ctx, sizeof(pg_cryptohash_ctx));
> > >
> > > state = ALLOC(sizeof(pg_cryptohash_state));
> > > if (state == NULL)
> > > {
> > > -   explicit_bzero(ctx, sizeof(pg_cryptohash_ctx));
> > > FREE(ctx);
> > > return NULL;
> > > }
> > > +   explicit_bzero(state, sizeof(pg_cryptohash_state));
> >
> > explicit_bzero() is used to give the guarantee that any sensitive data
> > gets cleaned up after an error or on free.  So I think that your use
> > is incorrect here for an initialization.
>
> It turns out what we were missing was a clear of state->resowner in
> cases where the resowner was null.  I removed the bzero calls completely
> and it now runs fine.
>
> > > if (state->evpctx == NULL)
> > > {
> > > -   explicit_bzero(state, sizeof(pg_cryptohash_state));
> > > -   explicit_bzero(ctx, sizeof(pg_cryptohash_ctx));
> > >  #ifndef FRONTEND
> > > ereport(ERROR,
> > > (errcode(ERRCODE_OUT_OF_MEMORY),
> >
> > And this original position is IMO more correct.
>
> Do we even need them?
>
> > Anyway, at quick glance, the backtrace of upthread is indicating a
> > double-free with an attempt to free a resource owner that has been
> > already free'd.
>
> I think that is now fixed too.  Updated patch at the same URL:
>
> 
> https://github.com/postgres/postgres/compare/master...bmomjian:key.diff

Thank you for updating the patch!

I think we need explicit_bzero() also when freeing the keywrap context.

BTW, when the -R option of the pg_ctl command is needed to start the
server, how can we start it in single-user mode?

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Consider parallel for lateral subqueries with limit

2020-12-06 Thread Brian Davis
> Note that near the end of grouping planner we have a similar check:
>
> if (final_rel->consider_parallel && root->query_level > 1 &&
> !limit_needed(parse))
> 
> guarding copying the partial paths from the current rel to the final
> rel. I haven't managed to come up with a test case that exposes that

Played around with this a bit, here's a non-correlated subquery that gets us to 
that if statement

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (bar int);

INSERT INTO foo (bar)
SELECT
  g
FROM
  generate_series(1, 1) AS g;


SELECT
  (
SELECT
  bar
FROM
  foo
LIMIT 1
  ) AS y
FROM
  foo;


I also was thinking about the LATERAL part.

I couldn't think of any reason why the uncorrelated subquery's results would 
need to be shared and therefore the same, when we'll be "looping" over each row 
of the source table, running the subquery anew for each, conceptually.

But then I tried this...

test=# CREATE TABLE foo (bar int);
CREATE TABLE
test=#
test=# INSERT INTO foo (bar)
test-# SELECT
test-#   g
test-# FROM
test-#   generate_series(1, 10) AS g;
INSERT 0 10
test=#
test=#
test=# SELECT
test-#   foo.bar,
test-#   lat.bar
test-# FROM
test-#   foo JOIN LATERAL (
test(# SELECT
test(#   bar
test(# FROM
test(#   foo AS foo2
test(# ORDER BY
test(#   random()
test(# LIMIT 1
test(#   ) AS lat ON true;
 bar | bar
-----+-----
   1 |   7
   2 |   7
   3 |   7
   4 |   7
   5 |   7
   6 |   7
   7 |   7
   8 |   7
   9 |   7
  10 |   7
(10 rows)


As you can see, random() is only called once.  If postgres were supposed to be 
running the subquery for each source row, conceptually, it would be a mistake 
to cache the results of a volatile function like random().

The docs say: "When a FROM item contains LATERAL cross-references, evaluation 
proceeds as follows: for each row of the FROM item providing the 
cross-referenced column(s), or set of rows of multiple FROM items providing the 
columns, the LATERAL item is evaluated using that row or row set's values of 
the columns. The resulting row(s) are joined as usual with the rows they were 
computed from. This is repeated for each row or set of rows from the column 
source table(s)."

They don't say what happens with LATERAL when there aren't cross-references 
though.  As we expect, adding one does show random() being called once for each 
source row.

test=# SELECT
test-#   foo.bar,
test-#   lat.bar
test-# FROM
test-#   foo JOIN LATERAL (
test(# SELECT
test(#   bar
test(# FROM
test(#   foo AS foo2
test(# WHERE
test(#   foo2.bar < foo.bar + 10
test(# ORDER BY
test(#   random()
test(# LIMIT 1
test(#   ) AS lat ON true;
 bar | bar
-----+-----
   1 |   5
   2 |   8
   3 |   3
   4 |   4
   5 |   5
   6 |   5
   7 |   1
   8 |   3
   9 |   7
  10 |   3
(10 rows)

It seems like to keep the same behavior that exists today, results of LATERAL 
subqueries would need to be the same if they aren't correlated, and so you 
couldn't run them in parallel with a limit if the order wasn't guaranteed.  But 
I'll be the first to admit that it's easy enough for me to miss a key piece of 
logic on something like this, so I could be way off base too.




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-12-06 Thread David Rowley
Thanks for having a look at this.

On Sat, 5 Dec 2020 at 14:08, Zhihong Yu  wrote:
> +#define SH_EQUAL(tb, a, b) ResultCacheHash_equal(tb, a, b) == 0
>
> I think it would be safer if the comparison is enclosed in parentheses (in 
> case the macro appears in composite condition).

That seems fair.  Likely it would be nicer if simplehash.h played it
safe and put usages of the macro in additional parentheses.  I see a
bit of a mix of other places where we #define SH_EQUAL. It looks like
the ones in execGrouping.c and tidbitmap.c are also missing the
additional parentheses.
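
For illustration, the extra parentheses would just mean defining it as:

    #define SH_EQUAL(tb, a, b) (ResultCacheHash_equal(tb, a, b) == 0)

so that an expansion such as !SH_EQUAL(tb, a, b), or one combined with &&,
still groups the way the author intended.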

> +ResultCacheHash_equal(struct resultcache_hash *tb, const ResultCacheKey 
> *key1,
> + const ResultCacheKey *key2)
>
> Since key2 is not used, maybe name it unused_key ?

I'm not so sure it's a great change.  The only place where people see
that is in the same area that mentions " 'key2' is never used"

> +   /* Make a guess at a good size when we're not given a valid size. */
> +   if (size == 0)
> +   size = 1024;
>
> Should the default size be logged ?

I'm not too sure if I know what you mean here. Should it be a power of
2? It is.  Or do you mean I shouldn't use a magic number?

> +   /* Update the memory accounting */
> +   rcstate->mem_used -= freed_mem;
>
> Maybe add an assertion that mem_used is >= 0 after the decrement (there is an 
> assertion in remove_cache_entry however, that assertion is after another 
> decrement).

Good idea.

> + * 'specialkey', if not NULL, causes the function to return false if the 
> entry
> + * entry which the key belongs to is removed from the cache.
>
> duplicate entry (one at the end of first line and one at the beginning of 
> second line).

Well spotted.

> For cache_lookup(), new key is allocated before checking whether 
> rcstate->mem_used > rcstate->mem_upperlimit. It seems new entries should 
> probably have the same size.
> Can we check whether upper limit is crossed (assuming the addition of new 
> entry) before allocating new entry ?

I'd like to leave this as is. If we were to check if we've gone over
memory budget before the resultcache_insert() then we're doing a
memory check even for cache hits. That's additional effort. I'd prefer
only to check if we've gone over the memory budget in cases where
we've actually allocated more memory.

In each case we can go one allocation over budget, so I don't see what
doing the check beforehand gives us.
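
In other words, the accounting check only runs on the path that has just
allocated a new entry, roughly like this (a sketch; entry_mem is a stand-in
for however the patch sizes the new entry):

    /* Cache hits never reach this code. */
    rcstate->mem_used += entry_mem;    /* charge the newly added entry */
    if (unlikely(rcstate->mem_used > rcstate->mem_upperlimit))
    {
        /* Evict entries; give up if "key" itself had to be evicted. */
        if (unlikely(!cache_reduce_memory(rcstate, key)))
            return NULL;
    }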

> +   if (unlikely(!cache_reduce_memory(rcstate, key)))
> +   return NULL;
>
> Does the new entry need to be released in the above case?

No. cache_reduce_memory returning false will have removed "key" from the cache.

I'll post an updated patch on the main thread once I've looked at your
followup review.

David




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-12-06 Thread David Rowley
On Sun, 6 Dec 2020 at 03:52, Andy Fan  wrote:
>
> On Fri, Dec 4, 2020 at 10:41 PM David Rowley  wrote:
>>
>> I also
>> noticed that the code I'd written to build the cache lookup expression
>> included a step to deform the outer tuple. This was unnecessary and
>> slowed down the expression evaluation.
>>
>
> I thought it would be something like my 3rd suggestion on [1], however after
> I read the code,  it looked like no.  Could you explain what changes it is?
> I probably missed something.

Basically, an extra argument in ExecBuildParamSetEqual() which allows
the TupleTableSlotOps for the left and right side to be set
individually. Previously I was passing a single TupleTableSlotOps of
TTSOpsMinimalTuple.  The probeslot is a TTSOpsVirtual tuple, so
passing TTSOpsMinimalTuple causes the function to add a needless
EEOP_OUTER_FETCHSOME step to the expression.
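
So the call ends up looking something like this (a sketch only; apart from
the two TupleTableSlotOps arguments described above, the parameter list here
is a guess at the patch's interface, not a confirmed signature):

    /* probe side is a virtual slot, cached keys are minimal tuples */
    cache_eq_expr = ExecBuildParamSetEqual(hashkeydesc,
                                           &TTSOpsVirtual,       /* probeslot */
                                           &TTSOpsMinimalTuple,  /* cached key */
                                           eqfuncoids,
                                           collations,
                                           param_exprs,
                                           (PlanState *) rcstate);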

David




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-12-06 Thread David Rowley
Thanks for this review. I somehow missed addressing what's mentioned
here for the v10 patch. Comments below.

On Mon, 23 Nov 2020 at 02:21, Andy Fan  wrote:
>  1. modified   src/include/utils/selfuncs.h
> @@ -70,9 +70,9 @@
>   * callers to provide further details about some assumptions which were made
>   * during the estimation.
>   */
> -#define SELFLAG_USED_DEFAULT (1 << 0) /* Estimation fell back on one of
> -  * the DEFAULTs as defined above.
> -  */
> +#define SELFLAG_USED_DEFAULT (1 << 0) /* Estimation fell back on one
> + * of the DEFAULTs as defined
> + * above. */
>
> Looks nothing has changed.

I accidentally took the changes made by pgindent into the wrong patch.
Fixed that in v10.

> 2. leading spaces is not necessary in comments.
>
>  /*
>   * ResultCacheTuple Stores an individually cached tuple
>   */
> typedef struct ResultCacheTuple
> {
> MinimalTuple mintuple; /* Cached tuple */
> struct ResultCacheTuple *next; /* The next tuple with the same parameter
> * values or NULL if it's the last one */
> } ResultCacheTuple;

OK, I've changed that so that they're on 1 line instead of 3.

> 3. We define ResultCacheKey as below.
>
> /*
>  * ResultCacheKey
>  * The hash table key for cached entries plus the LRU list link
>  */
> typedef struct ResultCacheKey
> {
> MinimalTuple params;
> dlist_node lru_node; /* Pointer to next/prev key in LRU list */
> } ResultCacheKey;
>
> Since we store it as a MinimalTuple, we need some FETCH_INNER_VAR step for
> each element during the ResultCacheHash_equal call.  I am thinking if we can
> store a "Datum *" directly to save these steps.  exec_aggvalues/exec_aggnulls 
> looks
> a good candidate for me, except that the name looks not good. IMO, we can
> rename exec_aggvalues/exec_aggnulls and try to merge
> EEOP_AGGREF/EEOP_WINDOW_FUNC into a more generic step which can be
> reused in this case.

I think this is along the lines of what I'd been thinking about and
mentioned internally to Thomas and Andres.  I called it a MemTuple and
it was basically a contiguous block of memory with Datum and isnull
arrays and any varlena attributes at the end of the contiguous
allocation.  These could quickly be copied into a VirtualSlot with
zero deforming.  I've not given this too much thought yet, but if I
was to do this I'd be aiming to store the cached tuple this way so as to
save having to deform it each time we get a cache hit.  We'd use more
memory storing entries this way, but if we're not expecting the Result
Cache to fill work_mem, then perhaps it's another approach that the
planner could decide on.  Perhaps the cached tuple pointer could be a
union to allow us to store either without making the struct any
larger.

However, FWIW, I'd prefer to think about this later though.
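
For the record, a very rough sketch of the kind of layout being described;
this struct does not exist anywhere, it is purely illustrative:

    typedef struct MemTuple
    {
        int     natts;      /* number of attributes */
        Datum  *values;     /* points into 'data' below */
        bool   *isnull;     /* likewise */
        char    data[FLEXIBLE_ARRAY_MEMBER];    /* both arrays plus any
                                                 * varlena values, all in one
                                                 * contiguous allocation */
    } MemTuple;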

> 4. I think the  ExecClearTuple in prepare_probe_slot is not a must, since the
> data tts_values/tts_flags/tts_nvalid are all reset later, and tts_tid is not
> real used in our case.  Since both prepare_probe_slot
> and ResultCacheHash_equal are in  pretty hot path, we may need to consider it.

I agree that it would be nice not to do the ExecClearTuple(), but the
only way I can see to get rid of it also requires getting rid of the
ExecStoreVirtualTuple().  The problem is ExecStoreVirtualTuple()
Asserts that the slot is empty, which it won't be the second time
around unless we ExecClearTuple it.  It seems that to make that work
we'd have to manually set slot->tts_nvalid. I see other places in the
code doing this ExecClearTuple() / ExecStoreVirtualTuple() dance, so I
don't think it's going to be up to this patch to start making
optimisations just for this 1 case.
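
For context, the dance in question looks roughly like this (a sketch of
prepare_probe_slot(), not the exact patch code):

    ExecClearTuple(pslot);        /* reset the probe slot */
    for (int i = 0; i < nkeys; i++)
        pslot->tts_values[i] = ExecEvalExpr(param_exprs[i], econtext,
                                            &pslot->tts_isnull[i]);
    ExecStoreVirtualTuple(pslot); /* Asserts that the slot was empty */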

David




Re: Default role -> Predefined role

2020-12-06 Thread Daniel Gustafsson
> On 20 Nov 2020, at 22:13, Stephen Frost  wrote:

> Attached is a patch to move from 'default role' terminology to
> 'predefined role' in the documentation.  In the code, I figured it made
> more sense to avoid saying either one and instead opted for just
> 'ROLE_NAME_OF_ROLE' as the #define, which we were very close to (one
> removing the 'DEFAULT' prefix) but didn't actually entirely follow.

Predefined is a much better terminology than default for these roles; it makes
the concept a lot clearer.  +1 on this patch.

> I had contemplated calling these 'capabilities' or similar but ended up
> deciding not to- they really are roles, after all.

Agreed, roles is better here.

cheers ./daniel



Wrong check in pg_visibility?

2020-12-06 Thread Konstantin Knizhnik

Hi hackers!

Due to the error in PG-ProEE we have added the following test to 
pg_visibility:


create table vacuum_test as select 42 i;
vacuum vacuum_test;
select count(*) > 0 from pg_check_visible('vacuum_test');
drop table vacuum_test;

Sometimes (very rarely) this test fails: pg_visibility reports
"corrupted" tuples.
The same error can be reproduced not only with PG-Pro but also with 
vanilla REL_11_STABLE, REL_12_STABLE and REL_13_STABLE.
It is not reproduced on master after Andres' snapshot optimization
(commit dc7420c2).


It is not so easy to reproduce this error: it is necessary to repeat
this test many times.

The probability increases with more aggressive autovacuum settings.
But even with such settings and thousands of iterations I was not able
to reproduce this error on my notebook - only on a virtual machine.


The error is reported here:

            /*
             * If we're checking whether the page is all-visible, we expect
             * the tuple to be all-visible.
             */
            if (check_visible &&
                !tuple_all_visible(&tuple, OldestXmin, buffer))
            {
                TransactionId RecomputedOldestXmin;

                /*
                 * Time has passed since we computed OldestXmin, so it's
                 * possible that this tuple is all-visible in reality even
                 * though it doesn't appear so based on our
                 * previously-computed value.  Let's compute a new value so we
                 * can be certain whether there is a problem.
                 *
                 * From a concurrency point of view, it sort of sucks to
                 * retake ProcArrayLock here while we're holding the buffer
                 * exclusively locked, but it should be safe against
                 * deadlocks, because surely GetOldestXmin() should never take
                 * a buffer lock. And this shouldn't happen often, so it's
                 * worth being careful so as to avoid false positives.
                 */
                RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);

                if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
                    record_corrupt_item(items, &tuple.t_self);


In the debugger I have checked that OldestXmin = RecomputedOldestXmin =
tuple.t_data->xmin.
I wonder whether this check in pg_visibility is really correct, i.e.
whether it is guaranteed that OldestXmin can never equal tuple.t_data->xmin.
Please notice that tuple_all_visible returns false if
!TransactionIdPrecedes(xmin, OldestXmin).
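
For reference, the xmin test inside tuple_all_visible() is essentially this
(paraphrased), and the equality case lands in the same branch as the
"younger xmin" case:

    xmin = HeapTupleHeaderGetXmin(tuple->t_data);
    if (!TransactionIdPrecedes(xmin, OldestXmin))
        return false;        /* xmin == OldestXmin also ends up here */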


Thanks in advance,
Konstantin






pg_upgrade test for binary compatibility of core data types

2020-12-06 Thread Justin Pryzby
I'm finally returning to this 14 month old thread:
(was: Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade 
from PG11 to PG12)

On Tue, Oct 15, 2019 at 09:07:25AM +0200, Tomas Vondra wrote:
> On Mon, Oct 14, 2019 at 11:41:18PM -0500, Justin Pryzby wrote:
> > 
> > Perhaps it'd be worth creating a test for on-disk format ?
> > 
> > Like a table with a column for each core type, which is either SELECTed from
> > after pg_upgrade, or pg_dump output compared before and after.
> 
> IMO that would be useful - we now have a couple of these checks for
> different data types (line, unknown, sql_identifier), with a couple of
> combinations each. And I've been looking if we do similar pg_upgrade
> tests, but I haven't found anything. I mean, we do pg_upgrade the
> cluster used for regression tests, but here we need to test a number of
> cases that are meant to abort the pg_upgrade. So we'd need a number of
> pg_upgrade runs, to test that.

I meant to notice if the binary format is accidentally changed again, which was
what happened here:
7c15cef86 Base information_schema.sql_identifier domain on name, not varchar.

I added a table to the regression tests so it's processed by pg_upgrade tests,
run like:

| time make -C src/bin/pg_upgrade check oldsrc=`pwd`/11 
oldbindir=`pwd`/11/tmp_install/usr/local/pgsql/bin

I checked that if I cherry-pick 0002 to v11, and comment out
old_11_check_for_sql_identifier_data_type_usage(), then pg_upgrade/test.sh
detects the original problem:
pg_dump: error: Error message from server: ERROR:  invalid memory alloc request 
size 18446744073709551613

I understand the buildfarm has its own cross-version-upgrade test, which I
think would catch this on its own.

The commits below all seem to complicate use of pg_upgrade/test.sh, so 0001 is
needed to allow testing upgrades from older releases.

e78900afd217fa3eaa77c51e23a94c1466af421c Create by default sql/ and expected/ 
for output directory in pg_regress
40b132c1afbb4b1494aa8e48cc35ec98d2b90777 In the pg_upgrade test suite, don't 
write to src/test/regress.
fc49e24fa69a15efacd5b8958115ed9c43c48f9a Make WAL segment size configurable at 
initdb time.
c37b3d08ca6873f9d4eaf24c72a90a550970cbb8 Allow group access on PGDATA
da9b580d89903fee871cf54845ffa2b26bda2e11 Refactor dir/file permissions

-- 
Justin
>From e9d70f3d043211011f7c7774ed2ed5eaee3760dc Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 5 Dec 2020 22:31:19 -0600
Subject: [PATCH 1/2] WIP: pg_upgrade/test.sh: changes needed to allow testing
 upgrade from v11

---
 src/bin/pg_upgrade/check.c |  2 +-
 src/bin/pg_upgrade/test.sh | 44 --
 2 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 357997972b..6dfe3cff65 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -122,7 +122,7 @@ check_and_dump_old_cluster(bool live_check)
 	 * to prevent upgrade when used in user objects (tables, indexes, ...).
 	 */
 	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1100)
-		old_11_check_for_sql_identifier_data_type_usage(&old_cluster);
+		; // old_11_check_for_sql_identifier_data_type_usage(&old_cluster);
 
 	/*
 	 * Pre-PG 10 allowed tables with 'unknown' type columns and non WAL logged
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index 04aa7fd9f5..b39265f66d 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -23,7 +23,7 @@ standard_initdb() {
 	# To increase coverage of non-standard segment size and group access
 	# without increasing test runtime, run these tests with a custom setting.
 	# Also, specify "-A trust" explicitly to suppress initdb's warning.
-	"$1" -N --wal-segsize 1 -g -A trust
+	"$1" -N -A trust
 	if [ -n "$TEMP_CONFIG" -a -r "$TEMP_CONFIG" ]
 	then
 		cat "$TEMP_CONFIG" >> "$PGDATA/postgresql.conf"
@@ -108,6 +108,9 @@ export EXTRA_REGRESS_OPTS
 mkdir "$outputdir"
 mkdir "$outputdir"/testtablespace
 
+mkdir "$outputdir"/sql
+mkdir "$outputdir"/expected
+
 logdir=`pwd`/log
 rm -rf "$logdir"
 mkdir "$logdir"
@@ -175,13 +178,36 @@ if "$MAKE" -C "$oldsrc" installcheck-parallel; then
 fix_sql="DROP FUNCTION public.myfunc(integer); DROP FUNCTION public.oldstyle_length(integer, text);"
 ;;
 			*)
-fix_sql="DROP FUNCTION public.oldstyle_length(integer, text);"
+fix_sql="DROP FUNCTION IF EXISTS public.oldstyle_length(integer, text);"
+
+# commit 1ed6b8956
+fix_sql="$fix_sql DROP OPERATOR public.#@# (pg_catalog.int8, NONE);"
+fix_sql="$fix_sql DROP OPERATOR public.#%# (pg_catalog.int8, NONE);"
+fix_sql="$fix_sql DROP OPERATOR public.!=- (pg_catalog.int8, NONE);"
+fix_sql="$fix_sql DROP OPERATOR public.#@%# (pg_catalog.int8, NONE);"
+
+# commit 76f412ab3
+fix_sql="$fix_sql DROP OPERATOR IF EXISTS @#@(bigint,NONE);"
+fix_sql="$fix_sql DROP OPERATOR IF EXISTS @#@(NONE,bigint);"
+
+# commit 9e38c2bb5 and 97f73a978
+fix_sql="$fix_sql DROP AGGREGATE IF 

Re: Change definitions of bitmap flags to bit-shifting style

2020-12-06 Thread Andrew Dunstan


On 12/6/20 11:44 AM, James Coleman wrote:
> On Sun, Dec 6, 2020 at 1:25 AM Michael Paquier  wrote:
> > On Sat, Dec 05, 2020 at 10:31:09PM -0300, Alvaro Herrera wrote:
> > > The hexadecimal representation is more natural to me than bit-shifting,
> > > so I would prefer to use that style too.  But maybe I'm trained to it
> > > because of looking at t_infomask symbols constantly.
> >
> > If we are going to change all that, hexa style sounds good to me too.
> > Would it be worth an addition to the docs, say in [1] to tell that
> > this is a preferred style?
> >
> > [1]: https://www.postgresql.org/docs/devel/source-conventions.html?
> > --
> > Michael
>
> In my view the bit shifting approach makes it more obvious a single
> bit is being set, but on the other hand the hex approach makes it
> easier to compare in debugging.
>
> I’m not really sure which to prefer, though I think I would have
> leaned slightly towards the former.

Perhaps we should put one style or the other in a comment. I take Tom's
point, but after the number of bits shifted gets above some number I
have trouble remembering which bit it is, and while of course I can work
it out, it can be a very minor nuisance.
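
For anyone following along, the two styles being compared look like this
(illustrative flag names only, not from any real header):

    /* bit-shifting style */
    #define MYFLAG_DIRTY      (1 << 0)
    #define MYFLAG_PINNED     (1 << 1)
    #define MYFLAG_VALID      (1 << 2)

    /* hexadecimal style */
    #define MYFLAG_DIRTY      0x0001
    #define MYFLAG_PINNED     0x0002
    #define MYFLAG_VALID      0x0004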


cheers


andrew





Re: Change definitions of bitmap flags to bit-shifting style

2020-12-06 Thread James Coleman
On Sun, Dec 6, 2020 at 1:25 AM Michael Paquier  wrote:

> On Sat, Dec 05, 2020 at 10:31:09PM -0300, Alvaro Herrera wrote:
> > The hexadecimal representation is more natural to me than bit-shifting,
> > so I would prefer to use that style too.  But maybe I'm trained to it
> > because of looking at t_infomask symbols constantly.
>
> If we are going to change all that, hexa style sounds good to me too.
> Would it be worth an addition to the docs, say in [1] to tell that
> this is a preferred style?
>
> [1]: https://www.postgresql.org/docs/devel/source-conventions.html?
> --
> Michael



In my view the bit shifting approach makes it more obvious a single bit is
being set, but on the other hand the hex approach makes it easier to
compare in debugging.

I’m not really sure which to prefer, though I think I would have leaned
slightly towards the former.

James

>


Re: A few new options for CHECKPOINT

2020-12-06 Thread Stephen Frost
Greetings,

* Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> On 2020-Dec-05, Stephen Frost wrote:
> > So- just to be clear, CHECKPOINTs are more-or-less always happening in
> > PG, and running this command might do something or might end up doing
> > nothing depending on if a checkpoint is already in progress and this
> > request just gets consolidated into an existing one, and it won't
> > actually reduce the amount of WAL replay except in the case where
> > checkpoint completion target is set to make a checkpoint happen in less
> > time than checkpoint timeout, which ultimately isn't a great way to run
> > the system anyway.
> 
> You keep making this statement, and I don't necessarily disagree, but if
> that is the case, please explain why don't we have
> checkpoint_completion_target set to 0.9 by default?  Should we change
> that?

Yes, I do think we should change that..  In fact, I'd argue that we can
probably get rid of checkpoint_completion_target entirely as an option.
The main argument against that is that it could be annoying for people
upgrading, but changing the default to 0.9 would definitely be an
improvement.

Thanks,

Stephen

