Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Tom Lane
Alexander Korotkov  writes:
> On Fri, Nov 27, 2020 at 1:55 AM Tom Lane  wrote:
>> ... and, after retrieving my jaw from the floor, I present the
>> attached.  Apple's chips evidently like this style of spinlock a LOT
>> better.  The difference is so remarkable that I wonder if I made a
>> mistake somewhere.  Can anyone else replicate these results?

> Results look very surprising to me.  I didn't expect there would be
> any very busy spin-lock when the number of clients is as low as 4.

Yeah, that wasn't making sense to me either.  The most likely explanation
seems to be that I messed up the test somehow ... but I don't see where.
So, again, I'm wondering if anyone else can replicate or refute this.
I can't be the only geek around here who sprang for an M1.

regards, tom lane




Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-11-26 Thread Bharath Rupireddy
On Fri, Nov 27, 2020 at 12:26 PM Fujii Masao
 wrote:
>
> >  > > When I read the patch again, I found that, with the patch, the shutdown
> >  > > of worker_spi causes the following FATAL message to be reported.
> >  > >
> >  > >  FATAL:  terminating connection due to administrator command
> >  > >
> >  > > Isn't this message confusing because it's not a connection? If so,
> >  > > we need to update ProcessInterrupts() so that the proper message is
> >  > > reported like other bgworkers do.
> >  > >
> >  >
> >  > This is also true for all the bgworkers that use the die() handler. How
> > about doing it the way bgworker_die() does in ProcessInterrupts()? This
> > would give meaningful information. Thoughts? If okay, I can make a separate
> > patch.
> >  >
> >
> > Attaching the patch that improved the message for bg workers in
> > ProcessInterrupts(). For instance, now it looks like "FATAL:  terminating
> > background worker "worker_spi" due to administrator command" or "FATAL:
> > terminating background worker "parallel worker" due to administrator
> > command" and so on for other bg workers.
> >
> > Please review the patch.
>
> Thanks for the patch! It looks good to me.
>

Thanks!

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-26 Thread Bharath Rupireddy
On Fri, Nov 27, 2020 at 12:22 PM Luc Vlaming  wrote:
>
> >
> > I wonder, how can the do_multi_insert() API decide on when to flush, I
> > mean, based on the number of pages to flush? Do we need to pass the
> > maximum number of pages the buffered tuples can occupy and track the
> > pages currently buffered tuples occupy to decide when to flush? Or is
> > it something that the existing table AM infrastructure already
> > supports? If we use the number of pages to decide on when to flush,
> > how well it works with parallel inserts?
>
> I was assuming each tableam to use its own logic, based on its needs and
> the tradeoffs a storage engine might want to provide. This does not mean
> it should not consider outside parameters, like the aforementioned
> memory usage.
> I think it would imply that each tableam implements its own tracking
> mechanism for how much has accumulated, how, and when to flush, because
> they might track different statistics. IMHO given that each tableam
> anyway would want to implement its own logic on how to store a slot into
> a page, adding the logic for tracking these statistics seemed minor to
> me. Maybe I missed some parts that should be extracted out to a generic
> interface however?
>

I agree with you that tracking page-level or other such info is
dependent on the table AM implementation.

>
> Some examples of why a tableam could decide on its own on when to flush:
> - the current heap implementation could accumulate a few pages (say up
> to 64) and thereby limit the amount of calls to write() and limit the
> accompanying blocks/context switches. This would also then make the
> writes more sequential wrt the processes which can help with the
> flushing I presume, like how the sequential scan was optimized to
> process a consecutive set of blocks per worker (see
> table_block_parallelscan_nextpage).
> - something like zheap could accumulate data based on the amount of
> columns so that a page with column data is completely filled, thereby
> limiting the write amplification.
> - something that would implement an lsm storage might accumulate a full
> in-memory level before flushing it out.
>

Thanks for the details.

>
> >> One thing I'm wondering is in which memory context the slots end up
> >> being allocated. I'd assume we would want to keep the slots around
> >> between flushes. If they are in the temporary context this might prove
> >> problematic however?
> >>
> >
> > I should not have used the word temporary, it actually is not
> > temporary. This memory context will be created in begin_multi_insert(),
> > all the buffered tuples are copied using this context, it will be
> > reset at the end of each flush and reused. It can get destroyed at the
> > end in end_multi_insert(). I think we should even do this with the new
> > APIs implementation.
> >
>
> Okay. But in which context are the slots themselves allocated then?
> Because if we allocate the slots themselves in the context and then
> reset with each flush we would have to also re-allocate the slots every
> flush, which seems wasteful to me?
>

Buffer slots are allocated in the memory context in which the new APIs
get called. We don't have to re-allocate the slots every time after
flushing, but we have to clear them using ExecClearTuple() and reuse them.

And the memory context I specified in the MultiInsertStateData
structure is for using table_multi_insert() inside the new
do_multi_insert API after we decide to flush. There's a comment at the
existing table_multi_insert() call site in the copy code which says that
table_multi_insert() may leak memory; for the same reason we need that
temporary memory context, which gets switched to just before
table_multi_insert() and reset after it. This happens for each
batch of tuples. And in the end this context can be deleted in the
end_multi_insert API.
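
To make the intended flow concrete, here's a rough sketch of the caller
side, using the callback names from the proposal in this thread
(get_next_slot(), mycid and ti_options are just placeholders, and in
reality these callbacks would be reached through the table AM routine
table):

    MultiInsertStateData *mistate;
    TupleTableSlot *slot;

    /* Creates the temporary context, bistate and the buffered-slot array. */
    begin_multi_insert(rel, &mistate, 1000 /* nslots */, 65536 /* nbytes */);

    while ((slot = get_next_slot()) != NULL)
    {
        /*
         * Buffers the tuple; once nslots or nbytes is exceeded, the buffered
         * tuples are flushed with table_multi_insert(), micontext is reset,
         * and the slots are cleared with ExecClearTuple() for reuse.
         */
        do_multi_insert(rel, mistate, slot, mycid, ti_options);
    }

    /* Flushes any remaining tuples, deletes micontext, frees mistate. */
    end_multi_insert(rel, mistate, mycid, ti_options);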

Hope this helps.

I'm planning to summarize and post the new APIs description here again
for other opinions.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel Inserts in CREATE TABLE AS

2020-11-26 Thread Luc Vlaming

On 25-11-2020 03:40, Bharath Rupireddy wrote:

On Tue, Nov 24, 2020 at 4:43 PM Hou, Zhijie  wrote:


I'm very interested in this feature,
and I'm looking at the patch, here are some comments.



Thanks for the review.



How about the following style:

 if (TupIsNull(outerTupleSlot))
     break;

 (void) node->ps.dest->receiveSlot(outerTupleSlot, 
node->ps.dest);
 node->ps.state->es_processed++;

Which looks cleaner.



Done.



The check can be replaced by ISCTAS(into).



Done.



'inerst' looks like a typo (insert).



Corrected.



The code here calls strlen(intoclausestr) two times.
Checking the existing code in ExecInitParallelPlan,
it stores the strlen in a variable.

So how about the following style:

 intoclause_len = strlen(intoclausestr);
 ...
 /* Store serialized intoclause. */
 intoclause_space = shm_toc_allocate(pcxt->toc, intoclause_len + 1);
 memcpy(intoclause_space, intoclausestr, intoclause_len + 1);
 shm_toc_insert(pcxt->toc, PARALLEL_KEY_INTO_CLAUSE, intoclause_space);



Done.



The two check about intoclausestr seems can be combined like:

if (intoclausestr != NULL)
{
...
}
else
{
...
}



Done.

Attaching v5 patch. Please consider it for further review.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Disclaimer: I have by no means thoroughly reviewed all the involved parts
and am probably missing quite a bit of context so if I understood parts 
wrong or they have been discussed before then I'm sorry. Most notably 
the whole situation about the command-id is still elusive for me and I 
can really not judge yet anything related to that.


IMHO the patch makes the Gather node do most of the CTAS work, which
seems unwanted. For the non-CTAS insert/update case, it seems that a
ModifyTable node exists to actually do the work. What I'm wondering is
whether it would not be better to introduce a CreateTable node as well?

This would have several merits:
- the rowcount of that node would be 0 for the parallel case, and
non-zero for the serial case. Then the Gather node and the Query struct
don't have to know about CTAS for the most part, removing e.g. the case
distinctions in cost_gather.
- the inserted rows can now be accounted for in this new node instead of
the parallel executor state, and this node can also do its own DSM
initializations
- the generation of a partial variant of the CreateTable node can now
be done in the optimizer instead of in ExecCreateTableAs, which IMHO is
a more logical place to make these kinds of decisions, and which then
also makes it potentially play nicer with costs and the like.
- the explain code can now be in its own place instead of being part of
the Gather node
- IIUC it would allow the removal of the code that only launches parallel
workers if it's not CTAS, which IMHO would be quite a big benefit.


Thoughts?

Some small things I noticed while going through the patch:
- Typo for the comment about "inintorel_startup" which should be 
intorel_startup
-   if (node->nworkers_launched == 0 && !node->need_to_scan_locally)
  can be changed into
  if (node->nworkers_launched == 0)
  because either way it'll be true.

Regards,
Luc
Swarm64




Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Alexander Korotkov
On Fri, Nov 27, 2020 at 2:20 AM Tom Lane  wrote:
> Alexander Korotkov  writes:
> > On Thu, Nov 26, 2020 at 1:32 PM Heikki Linnakangas  wrote:
> >> Is there some official ARM documentation, like a programmer's reference
> >> manual or something like that, that would show a reference
> >> implementation of a spinlock on ARM? It would be good to refer to an
> >> authoritative source on this.
>
> > I've compared assembly output of gcc implementations of CAS and TAS.
>
> FWIW, I see quite different assembly using Apple's clang on their M1
> processor.  What I get for SpinLockAcquire on HEAD is (lock pointer
> initially in x0):

Yep, ARM v8.1 implements single-instruction atomic operations swpal
and casa, which look much more like x86 atomic instructions than
loops of ldxr/stlxr.

So, all the reasoning upthread shouldn't apply here, and yet the
advantage is much larger.
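
For illustration, a minimal sketch of the two acquire styles in terms of
GCC builtins (not the exact s_lock.h code; with LSE, i.e. armv8.1-a or
newer, compilers typically emit swpa(l) for the first and casa(l) for the
second, while without LSE both become ldxr/stlxr loops):

    typedef volatile int slock_t;

    /* test-and-set style: unconditionally swap 1 in, return the old value */
    static inline int
    tas_swap(slock_t *lock)
    {
        return __sync_lock_test_and_set(lock, 1);
    }

    /* compare-and-swap style: store 1 only if the lock currently holds 0 */
    static inline int
    tas_cas(slock_t *lock)
    {
        int expected = 0;

        return !__atomic_compare_exchange_n(lock, &expected, 1, 0 /* strong */,
                                            __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
    }

Both return 0 when the lock was acquired.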

--
Regards,
Alexander Korotkov




Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Alexander Korotkov
On Fri, Nov 27, 2020 at 1:55 AM Tom Lane  wrote:
>
> Krunal Bauskar  writes:
> > On Thu, 26 Nov 2020 at 10:50, Tom Lane  wrote:
> >> Also, exactly what hardware/software platform were these curves
> >> obtained on?
>
> > Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for
> > server and 8 for client) [2 numa nodes]
> > Storage: 3.2 TB NVMe SSD
> > OS: CentOS Linux release 7.6
> > PGSQL: baseline = Release Tag 13.1
>
> Hmm, might not be the sort of hardware ordinary mortals can get their
> hands on.  What's likely to be far more common ARM64 hardware in the
> near future is Apple's new gear.  So I thought I'd try this on the new
> M1 mini I just got.
>
> ... and, after retrieving my jaw from the floor, I present the
> attached.  Apple's chips evidently like this style of spinlock a LOT
> better.  The difference is so remarkable that I wonder if I made a
> mistake somewhere.  Can anyone else replicate these results?

Results look very surprising to me.  I didn't expect there would be
any very busy spin-lock when the number of clients is as low as 4.
Especially in read-only pgbench.

I don't have an M1 at hand.  Could you do some profiling to identify
the source of such a huge difference?

--
Regards,
Alexander Korotkov




RE: Disable WAL logging to speed up data loading

2020-11-26 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> While testing the patch on some workload, I realized that
> XLOG_FPI_FOR_HINT record could still be emitted even when wal_level =
> none. IIUC that WAL record is not necessary during wal_level = none
> since the server cannot be the primary server and the server crash
> ends up requiring to restore the whole database.

Nice catch!  XLOG_FPI_FOR_HINT and XLOG_FPI should be eliminated, otherwise a
large amount of WAL may be written.  (It seems that RMIDs other than RM_XLOG_ID
and RM_XACT_ID do not have to be written.)
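
One possible shape of that change, as a sketch only (extending the
wal_level = none filter that the current patch adds to XLogInsert(); the
real fix may look different):

    /* Sketch: also skip FPI records, but keep other XLOG and XACT records. */
    if (wal_level == WAL_LEVEL_NONE &&
        rmid != RM_XACT_ID &&
        (rmid != RM_XLOG_ID ||
         (info & XLR_RMGR_INFO_MASK) == XLOG_FPI ||
         (info & XLR_RMGR_INFO_MASK) == XLOG_FPI_FOR_HINT))
    {
        XLogResetInsertion();
        return GetXLogInsertRecPtr();
    }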

I'm afraid "none" doesn't represent the behavior because RM_XLOG_ID and 
RM_XACT_ID WAL records, except for XLOG_FPI_*, are emitted.  What's the good 
name?  IIUC, "minimal" is named after the fact that the minimal amount of WAL 
necessary for crash recovery is generated.  "norecovery" or "unrecoverable"?

Regards
Takayuki Tsunakawa



Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-11-26 Thread Fujii Masao




On 2020/11/27 12:13, Bharath Rupireddy wrote:

On Wed, Nov 25, 2020 at 8:08 PM Bharath Rupireddy wrote:
 >
 > > When I read the patch again, I found that, with the patch, the shutdown
 > > of worker_spi causes the following FATAL message to be reported.
 > >
 > >      FATAL:  terminating connection due to administrator command
 > >
 > > Isn't this message confusing because it's not a connection? If so,
 > > we need to update ProcessInterrupts() so that the proper message is
 > > reported like other bgworkers do.
 > >
 >
 > This is also true for all the bgworkers that use the die() handler. How about
doing it the way bgworker_die() does in ProcessInterrupts()? This would give 
meaningful information. Thoughts? If okay, I can make a separate patch.
 >

Attaching the patch that improved the message for bg workers in ProcessInterrupts(). For instance,
now it looks like "FATAL:  terminating background worker "worker_spi" due to
administrator command" or "FATAL:  terminating background worker "parallel worker" due to
administrator command" and so on for other bg workers.

Please review the patch.


Thanks for the patch! It looks good to me.




I'm also mentioning the 2 previous patches posted in [1]. One of the patches is
for using die() instead of handle_sigterm() in test_shm_mq/worker.c and the other
is for replacing the custom SIGTERM handler worker_spi_sigterm() with die() and
the custom SIGHUP handler worker_spi_sighup() with the standard
SignalHandlerForConfigReload().


Yeah, I pushed them. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Parallel plans and "union all" subquery

2020-11-26 Thread Luc Vlaming

On 27-11-2020 04:14, Greg Nancarrow wrote:

On Thu, Nov 26, 2020 at 6:11 PM Luc Vlaming  wrote:


If interesting I can make a draft of what this would look like if this
makes it easier to discuss?



Sure, that would help clarify it.
Okay. I will try to build an example but this will take a few weeks as 
vacations and such are coming up too.




I did debug this a bit, but it seems my gut feeling was wrong: even
though it knows a type coercion is required and can be done, the
parse/analyze code doesn't actually modify the nodes in place "for
fear of changing the semantics". So when the types don't exactly match,
it's all left up to the planner, but for this parse tree it fails to
produce a parallel plan.



Yes. However, I think here also lies an opportunity, because to me it
seems much more appealing to have the planner be able to deal
correctly with all the situations rather than having things like
flatten_simple_union_all() that provide a solution only for the ideal case.



Regards,
Greg Nancarrow
Fujitsu Australia



Regards,
Luc
Swarm64




Re: Parallel Inserts in CREATE TABLE AS

2020-11-26 Thread Bharath Rupireddy
On Fri, Nov 27, 2020 at 11:57 AM Hou, Zhijie  wrote:
>
> > Thanks a lot for the use case. Yes, with the current patch the table will lose
> > data related to the subplan. On analyzing further, I think we cannot allow
> > parallel inserts in the cases when the Gather node has some projections
> > to do, because the workers cannot perform that projection. So, having
> > ps_ProjInfo in the Gather node is an indication for us to disable parallel
> > inserts; only the leader can do the insertions after the Gather node
> > does the required projections.
> >
> > Thoughts?
> >
>
> Agreed.
>

Thanks! I will add/modify IsParallelInsertInCTASAllowed() to return
false in this case.
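
Something along these lines, as a sketch only (the actual change may end
up different, and gatherstate here is just a placeholder for however the
Gather node's state is passed in):

    /*
     * In IsParallelInsertInCTASAllowed(): if the Gather node has a
     * projection to do, the workers cannot perform it, so fall back to
     * leader-only inserts.
     */
    if (gatherstate->ps.ps_ProjInfo != NULL)
        return false;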

>
> 2.
> @@ -166,6 +228,16 @@ ExecGather(PlanState *pstate)
> {
> ParallelContext *pcxt;
>
> +   /*
> +* Take the necessary information to be passed to 
> workers for
> +* parallel inserts in CTAS.
> +*/
> +   if (ISCTAS(node->ps.intoclause))
> +   {
> +   node->ps.lefttree->intoclause = 
> node->ps.intoclause;
> +   node->ps.lefttree->objectid = 
> node->ps.objectid;
> +   }
> +
> /* Initialize, or re-initialize, shared state needed 
> by workers. */
> if (!node->pei)
> node->pei = 
> ExecInitParallelPlan(node->ps.lefttree,
>
> I found the code pass intoclause and objectid to Gather node's lefttree.
> Is it necessary? It seems only Gather node will use the information.
>

I am passing the required information from the upper level down to here
through the PlanState structure. Since the Gather node's lefttree is also
a PlanState structure variable, here I just assigned them to pass that
information to ExecInitParallelPlan().

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-26 Thread Luc Vlaming

On 26-11-2020 14:45, Bharath Rupireddy wrote:

On Thu, Nov 26, 2020 at 5:34 PM Luc Vlaming  wrote:


On 26-11-2020 12:36, Bharath Rupireddy wrote:

Few things:

IIUC Andres mentioned similar kinds of APIs earlier in [1].

[1] -
https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx%40alap3.anarazel.de


I would like to add some more info to one of the API:

typedef struct MultiInsertStateData
{
  MemoryContext micontext; /* A temporary memory context for
multi insert. */
  BulkInsertStateData *bistate;   /* Bulk insert state. */
  TupleTableSlot  **mislots; /* Array of buffered slots. */
  uint32  nslots; /* Total number of buffered slots. */
  int64  nbytes; /* Flush buffers if the total tuple size
  >= nbytes. */
  int32  nused; /* Number of current buffered slots for a
multi insert batch. */
  int64  nsize; /* Total tuple size for a multi insert
batch. */
} MultiInsertStateData;

/* Creates a temporary memory context, allocates the
MultiInsertStateData, BulkInsertStateData and initializes other members. */
  void(*begin_multi_insert) (Relation rel,
MultiInsertStateData **mistate, uint32 nslots, uint64 nbytes);

/* Buffers the input slot into mistate slots, computes the size of the
tuple, and adds it total buffer tuple size, if this size crosses
mistate->nbytes, flush the buffered tuples into table. For heapam,
existing heap_multi_insert can be used. Once the buffer is flushed, then
the micontext can be reset and buffered slots can be cleared. *If nbytes
i.e. total tuple size of the batch is not given, tuple size is not
calculated, tuples are buffered until all the nslots are filled and then
flushed.* */
  void(*do_multi_insert) (Relation rel, MultiInsertStateData
*mistate, struct TupleTableSlot *slot, CommandId cid, int options);

/* Flush the buffered tuples if any. For heapam, existing
heap_multi_insert can be used. Deletes temporary memory context and
deallocates mistate. */
  void(*end_multi_insert) (Relation rel, MultiInsertStateData
*mistate, CommandId cid, int options);


Looks all good to me, except for the nbytes part.
Could you explain to me what use case that supports? IMHO the tableam
can best decide itself that its time to flush, based on its
implementation that e.g. considers how many pages to flush at a time and
such, etc? This means also that most of the fields of
MultiInsertStateData can be private as each tableam would return a
derivative of that struct (like with the destreceivers).



nbytes is basically to support the following case: say the number of
tuples to buffer is 1000, and all the tuples are toasted with sizes of
a few hundred MB or even GB. Do we want to wait until 1000 tuples are
buffered, in which case, for one query, we occupy 1000 * the toasted
tuple size, i.e. gigabytes of memory? So, if we have a memory limit,
then it will give flexibility. Whether to use it or not is up to the
table AM implementation. Also, the existing copy code (since it can know
the tuple size after parsing the input data) uses this mechanism to
decide when to flush.

If nbytes is not used in a table AM, then the multi insert can wait
until all the tuples are buffered, however much memory they occupy.

IMO, we can retain nbytes for now to decide on when to flush. Thoughts?
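
In other words, the flush test would look roughly like this (field names
from the MultiInsertStateData proposal above; flush_buffered_tuples() is
a hypothetical stand-in for the table AM's flush step):

    if (mistate->nused >= mistate->nslots ||
        (mistate->nbytes > 0 && mistate->nsize >= mistate->nbytes))
        flush_buffered_tuples(rel, mistate);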


I'm very sorry I had not realized at all that the toasted data would be 
kept in memory until written out. I guess I'm not familiar enough with 
that part yet. I assumed this would be toasted beforehand and be tableam 
agnostic, and that any decision from the tableam to flush would happen 
way before a lot of memory would have accumulated, which is a bit naive in
hindsight.




I wonder, how can the do_multi_insert() API decide on when to flush, I
mean, based on the number of pages to flush? Do we need to pass the
maximum number of pages the buffered tuples can occupy and track the
pages currently buffered tuples occupy to decide when to flush? Or is
it something that the existing table AM infrastructure already
supports? If we use the number of pages to decide on when to flush,
how well it works with parallel inserts?



I was assuming each tableam to use its own logic, based on its needs and 
the tradeoffs a storage engine might want to provide. This does not mean 
it should not consider outside parameters, like the aforementioned 
memory usage.
I think it would imply that each tableam implements its own tracking 
mechanism for how much has accumulated, how, and when to flush, because 
they might track different statistics. IMHO given that each tableam 
anyway would want to implement its own logic on how to store a slot into 
a page, adding the logic for tracking these statistics seemed minor to
me. Maybe I missed some parts that should be extracted out to a generic 
interface however?


Some examples of why

Re: Disable WAL logging to speed up data loading

2020-11-26 Thread Kyotaro Horiguchi
At Fri, 27 Nov 2020 15:07:34 +0900, Masahiko Sawada  
wrote in 
> On Tue, Nov 24, 2020 at 11:19 AM osumi.takami...@fujitsu.com
>  wrote:
> >
> > Hi
> >
> > On Monday, Nov 23, 2020 12:17 PM Tsunakawa, Takayuki 
> >  wrote:
> > > PREPARE TRANSACTION is the same as COMMIT in that it persists
> > > transaction updates.  A crash during wal_level = none loses both of them.
> > > So, I don't think PREPARE TRANSACTION needs special treatment.
> > Yeah, I got it. That makes sense.
> > Attached is the one I removed the special treatment.
> >
> >
> > > Does PREPARE TRANSACTION complete successfully?  I remember
> > > Fujii-san said it PANICed if --enable-asserts is used in configure.
> > Yes. That assertion failure Fujii-San reported in the past
> > was solved by having rmid != RM_XACT_ID in XLogInsert()
> > to add WAL generation for transaction completes.
> > That I missed the condition was the cause but fine now.
> >
> > Plus, PREPARE TRANSACTION and COMMIT PREPARED
> > works successfully at present when no happening occurs.
> >
> 
> Thank you for updating the patch.
> 
> -   (errmsg("WAL was generated with wal_level=minimal,
> data may be missing"),
> +   (errmsg("WAL was generated with wal_level<=minimal,
> data may be missing"),
>  errhint("This happens if you temporarily set
> wal_level=minimal without taking a new base backup.")));
> 
> 'wal_level=minimal' in errhint also needs to be changed to 
> 'wal_level<=minimal'?
> 
> While testing the patch on some workload, I realized that
> XLOG_FPI_FOR_HINT record could still be emitted even when wal_level =
> none. IIUC that WAL record is not necessary during wal_level = none
> since the server cannot be the primary server and the server crash
> ends up requiring to restore the whole database.

It seems to be on purpose.

@@ -449,6 +449,14 @@ XLogInsert(RmgrId rmid, uint8 info)
return EndPos;
}
 
+   /* Issues WAL related to XLOG resources and transactions only */
+   if (wal_level == WAL_LEVEL_NONE &&
+   rmid != RM_XLOG_ID && rmid != RM_XACT_ID)
+   {
+   XLogResetInsertion();
+   return GetXLogInsertRecPtr();

What is the reason for those kinds of records to be emitted?

I think we must emit at least the shutdown checkpoint record, which
has redo-LSN that points to the record itself.  I'm not sure what
other kinds of records are needed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: Parallel Inserts in CREATE TABLE AS

2020-11-26 Thread Hou, Zhijie
Hi,

> > I took a deep look at the projection logic.
> > In most cases, you are right that Gather node does not need projection.
> >
> > In some rare cases, such as a SubPlan (or an InitPlan, I guess),
> > the projection will happen in the Gather node.
> >
> > The example:
> >
> > Create table test(i int);
> > Create table test2(a int, b int);
> > insert into test values(generate_series(1,1000,1));
> > insert into test2 values(generate_series(1,1000,1),
> > generate_series(1,1000,1));
> >
> > postgres=# explain(verbose, costs off) select test.i,(select i from
> (select * from test2) as tt limit 1) from test where test.i < 2000;
> >QUERY PLAN
> > 
> >  Gather
> >Output: test.i, (SubPlan 1)
> >Workers Planned: 2
> >->  Parallel Seq Scan on public.test
> >  Output: test.i
> >  Filter: (test.i < 2000)
> >SubPlan 1
> >  ->  Limit
> >Output: (test.i)
> >->  Seq Scan on public.test2
> >  Output: test.i
> >
> > In this case, projection is necessary, because the subplan will be
> > executed during projection.
> >
> > If skipped, the created table will lose some data.
> >
> 
> Thanks a lot for the use case. Yes, with the current patch the table will lose
> data related to the subplan. On analyzing further, I think we cannot allow
> parallel inserts in the cases when the Gather node has some projections
> to do, because the workers cannot perform that projection. So, having
> ps_ProjInfo in the Gather node is an indication for us to disable parallel
> inserts; only the leader can do the insertions after the Gather node
> does the required projections.
> 
> Thoughts?
> 

Agreed.


2.
@@ -166,6 +228,16 @@ ExecGather(PlanState *pstate)
{
ParallelContext *pcxt;
 
+   /*
+* Take the necessary information to be passed to 
workers for
+* parallel inserts in CTAS.
+*/
+   if (ISCTAS(node->ps.intoclause))
+   {
+   node->ps.lefttree->intoclause = 
node->ps.intoclause;
+   node->ps.lefttree->objectid = node->ps.objectid;
+   }
+
/* Initialize, or re-initialize, shared state needed by 
workers. */
if (!node->pei)
node->pei = 
ExecInitParallelPlan(node->ps.lefttree,

I found that the code passes intoclause and objectid to the Gather node's lefttree.
Is it necessary? It seems only the Gather node will use the information.


Best regards,
houzj





Re: Disable WAL logging to speed up data loading

2020-11-26 Thread Masahiko Sawada
On Tue, Nov 24, 2020 at 11:19 AM osumi.takami...@fujitsu.com
 wrote:
>
> Hi
>
> On Monday, Nov 23, 2020 12:17 PM Tsunakawa, Takayuki 
>  wrote:
> > PREPARE TRANSACTION is the same as COMMIT in that it persists
> > transaction updates.  A crash during wal_level = none loses both of them.
> > So, I don't think PREPARE TRANSACTION needs special treatment.
> Yeah, I got it. That makes sense.
> Attached is the one I removed the special treatment.
>
>
> > Does PREPARE TRANSACTION complete successfully?  I remember
> > Fujii-san said it PANICed if --enable-asserts is used in configure.
> Yes. That assertion failure Fujii-San reported in the past
> was solved by having rmid != RM_XACT_ID in XLogInsert()
> so that WAL generation is kept for transaction completion.
> The cause was that I had missed that condition, but it is fine now.
>
> Plus, PREPARE TRANSACTION and COMMIT PREPARED
> work successfully at present when nothing unexpected occurs.
>

Thank you for updating the patch.

-   (errmsg("WAL was generated with wal_level=minimal,
data may be missing"),
+   (errmsg("WAL was generated with wal_level<=minimal,
data may be missing"),
 errhint("This happens if you temporarily set
wal_level=minimal without taking a new base backup.")));

'wal_level=minimal' in errhint also needs to be changed to 'wal_level<=minimal'?

While testing the patch on some workload, I realized that
XLOG_FPI_FOR_HINT record could still be emitted even when wal_level =
none. IIUC that WAL record is not necessary during wal_level = none
since the server cannot be the primary server and the server crash
ends up requiring to restore the whole database.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-11-26 Thread Kyotaro Horiguchi
At Fri, 27 Nov 2020 02:19:57 +, "k.jami...@fujitsu.com" 
 wrote in 
> > From: Kyotaro Horiguchi 
> > Hello, Kirk. Thank you for the new version.
> 
> Hi, Horiguchi-san. Thank you for your very helpful feedback.
> I'm updating the patches addressing those.
> 
> > +   if (!smgrexists(rels[i], j))
> > +   continue;
> > +
> > +   /* Get the number of blocks for a relation's fork */
> > +   blocks[i][numForks] = smgrnblocks(rels[i], j,
> > NULL);
> > 
> > If we see a fork whose size is not cached, we must give up this
> > optimization for all target relations.
> 
> I did not use the "cached" flag in DropRelFileNodesAllBuffers and use 
> InRecovery
> when deciding for optimization because of the following reasons:
> XLogReadBufferExtended() calls smgrnblocks() to apply changes to relation page
> contents. So in DropRelFileNodeBuffers(), XLogReadBufferExtended() is called
> during VACUUM replay because VACUUM changes the page content.
> OTOH, TRUNCATE doesn't change the relation content, it just truncates 
> relation pages
> without changing the page contents. So XLogReadBufferExtended() is not 
> called, and
> the "cached" flag will always return false. I tested with "cached" flags 
> before, and this

A bit different from the point, but if some tuples have been inserted
to the truncated table, XLogReadBufferExtended() is called for the
table and the length is cached.

> always return false, at least in DropRelFileNodesAllBuffers. Due to this, we 
> cannot use
> the cached flag in DropRelFileNodesAllBuffers(). However, I think we can 
> still rely on
> smgrnblocks to get the file size as long as we're InRecovery. That cached 
> nblocks is still
> guaranteed to be the maximum in the shared buffer.
> Thoughts?

That means that we always think as if smgrnblocks returns a "cached" (or
"safe") value during recovery, which is out of our current
consensus. If we go on that side, we don't need to consult the
"cached" returned from smgrnblocks at all and it's enough to see only
InRecovery.

I got confused..

We are relying on the "fact" that the first lseek() call of a
(startup) process tells the truth.  We added an assertion so that we
make sure that the cached value won't be cleared during recovery.  A
possible remaining danger would be closing of an smgr object of a live
relation just after a file extension failure. I think we are thinking
that that doesn't happen during recovery.  Although it seems to me
true, I'm not confident.

If that's true, we don't even need to look at the "cached" flag at all
and can always rely on the value returned from smgrnblocks()
during recovery.  Otherwise, we need to avoid that dangerous situation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: POC: postgres_fdw insert batching

2020-11-26 Thread tsunakawa.ta...@fujitsu.com
From: Craig Ringer  
> But in the libpq pipelining patch I demonstrated a 300 times (3000%) 
> performance improvement on a test workload...

Wow, an impressive number.  I've just seen it at the beginning of the libpq
pipelining thread (oh, already four years ago..!)  Could you share the workload 
and the network latency (ping time)?  I'm sorry I'm just overlooking it.

Thank you for your (always) concise explanation.  I'd like to check other DBMSs 
and your rich references for the FDW interface.  (My first intuition is that 
many major DBMSs might not have client C APIs that can be used to implement an 
async pipelining FDW interface.  Also, I'm afraid it requires major surgery or 
reform of executor.  I don't want it to delay the release of reasonably good 
(10x) improvement with the synchronous interface.)

(It'd be kind of you to send emails in text format.  I've changed the format of 
this reply from HTML to text.)


Regards
Takayuki Tsunakawa
 


Re: Add Information during standby recovery conflicts

2020-11-26 Thread Masahiko Sawada
On Thu, Nov 26, 2020 at 12:49 AM Drouvot, Bertrand  wrote:
>
> Hi,
>
> On 11/25/20 2:20 PM, Masahiko Sawada wrote:
> >
> > On Fri, Nov 20, 2020 at 6:18 PM Drouvot, Bertrand  
> > wrote:
> >> Hi,
> >>
> >> On 11/17/20 4:44 PM, Fujii Masao wrote:
> >>> Thanks for updating the patch! Here are review comments.
> >>>
> >>> +Controls whether a log message is produced when the startup
> >>> process
> >>> +is waiting longer than deadlock_timeout
> >>> +for recovery conflicts.
> >>>
> >>> But a log message can be produced also when the backend is waiting
> >>> for recovery conflict. Right? If yes, this description needs to be
> >>> corrected.
> >> Thanks for looking at it!
> >>
> >> I don't think so, only the startup process should write those new log
> >> messages.
> >>
> >> What makes you think that would not be the case?
> >>
> >>>
> >>> +for recovery conflicts.  This is useful in determining if
> >>> recovery
> >>> +conflicts prevents the recovery from applying WAL.
> >>>
> >>> "prevents" should be "prevent"?
> >> Indeed: fixed in the new attached patch.
> >>
> >>>
> >>> +   TimestampDifference(waitStart, GetCurrentTimestamp(), &secs,
> >>> &usecs);
> >>> +   msecs = secs * 1000 + usecs / 1000;
> >>>
> >>> GetCurrentTimestamp() is basically called before LogRecoveryConflict()
> >>> is called. So isn't it better to avoid calling GetCurrentTimestamp()
> >>> newly in
> >>> LogRecoveryConflict() and to reuse the timestamp that we got?
> >>> It's helpful to avoid the waste of cycles.
> >>>
> >> good catch! fixed in the new attached patch.
> >>
> >>> +   while (VirtualTransactionIdIsValid(*vxids))
> >>> +   {
> >>> +   PGPROC *proc =
> >>> BackendIdGetProc(vxids->backendId);
> >>>
> >>> BackendIdGetProc() can return NULL if the backend is not active
> >>> at that moment. This case should be handled.
> >>>
> >> handled in the new attached patch.
> >>> +   case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
> >>> +   reasonDesc = gettext_noop("recovery is still
> >>> waiting recovery conflict on buffer pin");
> >>>
> >>> It's natural to use "waiting for recovery" rather than "waiting
> >>> recovery"?
> >>>
> >> I would be tempted to say so, the new patch makes use of "waiting for".
> >>> +   /* Also, set deadlock timeout for logging purpose if
> >>> necessary */
> >>> +   if (log_recovery_conflict_waits)
> >>> +   {
> >>> +   timeouts[cnt].id = STANDBY_TIMEOUT;
> >>> +   timeouts[cnt].type = TMPARAM_AFTER;
> >>> +   timeouts[cnt].delay_ms = DeadlockTimeout;
> >>> +   cnt++;
> >>> +   }
> >>>
> >>> This needs to be executed only when the message has not been logged yet.
> >>> Right?
> >>>
> >> good catch: fixed in the new attached patch.
> >>
> > Thank you for updating the patch! Here are review comments on the
> > latest version patch.
> >
> > +   while (VirtualTransactionIdIsValid(*vxids))
> > +   {
> > +   PGPROC *proc = BackendIdGetProc(vxids->backendId);
> > +
> > +   if (proc)
> > +   {
> > +   if (nprocs == 0)
> > +   appendStringInfo(&buf, "%d", proc->pid);
> > +   else
> > +   appendStringInfo(&buf, ", %d", proc->pid);
> > +
> > +   nprocs++;
> > +   vxids++;
> > +   }
> > +   }
> >
> > We need to increment vxids even if *proc is null. Otherwise, the loop won't 
> > end.
>
> My bad, that's fixed.
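
For reference, a minimal sketch of the corrected loop, with vxids
advanced unconditionally (not the exact patch text):

    while (VirtualTransactionIdIsValid(*vxids))
    {
        PGPROC *proc = BackendIdGetProc(vxids->backendId);

        /* the backend may already have exited; skip it but keep going */
        if (proc)
        {
            if (nprocs == 0)
                appendStringInfo(&buf, "%d", proc->pid);
            else
                appendStringInfo(&buf, ", %d", proc->pid);
            nprocs++;
        }

        vxids++;
    }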
>
> >
> > ---
> > +   TimestampTz cur_ts = GetCurrentTimestamp();;
> Fixed
> >
> > There is an extra semi-colon.
> >
> > ---
> >   intmax_standby_streaming_delay = 30 * 1000;
> > +bool   log_recovery_conflict_waits = false;
> > +bool   logged_lock_conflict = false;
> >
> >
> > +   if (log_recovery_conflict_waits && !logged_lock_conflict)
> > +   {
> > +   timeouts[cnt].id = STANDBY_TIMEOUT;
> > +   timeouts[cnt].type = TMPARAM_AFTER;
> > +   timeouts[cnt].delay_ms = DeadlockTimeout;
> > +   cnt++;
> > +   }
> >
> > Can we pass a bool indicating if a timeout may be needed for recovery
> > conflict logging from ProcSleep() to ResolveRecoveryConflictWithLock()
> > instead of using a static variable?
>
> Yeah that makes more sense, specially as we already have
> logged_recovery_conflict at our disposal.
>
> New patch version attached.
>

Thank you for updating the patch! The patch works fine and looks good
to me except for the following small comments:

+/*
+ * Log the recovery conflict.
+ *
+ * waitStart is the timestamp when the caller started to wait. This
function also
+ * reports the details about 

Re: POC: postgres_fdw insert batching

2020-11-26 Thread Craig Ringer
On Fri, Nov 27, 2020 at 10:47 AM tsunakawa.ta...@fujitsu.com <
tsunakawa.ta...@fujitsu.com> wrote:

Covering this one first:

I expect postgresExecForeignBatchInsert() would be able to use the libpq
> batching API, because it receives an array of tuples and can generate and
> issue INSERT statement for each tuple.


Sure, you can generate big multi-inserts. Or even do a COPY. But you still
have to block for a full round-trip until the foreign server replies. So if
you have 6000 calls to postgresExecForeignBatchInsert() during a single
query, and a 100ms round trip time to the foreign server, you're going to
waste 6000*0.1 = 600s = 10 min blocked in  postgresExecForeignBatchInsert()
waiting for results from the foreign server.

Such batches have some major downsides:

* The foreign server cannot start executing the first query in the batch
until the last query in the batch has been accumulated and the whole batch
has been sent to the foreign server;
* The FDW has to block waiting for the batch to execute on the foreign
server and for a full network round-trip before it can start another batch
or let the backend do other work.
This means RTTs get multiplied by batch counts. Still a lot better than
individual statements, but plenty slow for high latency connections.

* Prepare 1000 rows to insert [10ms]
* INSERT 1000 values [100ms RTT + 50ms foreign server execution time]
* Prepare 1000 rows to insert [10ms]
* INSERT 1000 values [100ms RTT + 50ms foreign server execution time]
* ...

If you can instead send new inserts (or sets of inserts) to the foreign
server without having to wait for the result of the previous batch to
arrive, you can spend 100ms total waiting for results instead of 10 mins.
You can start the execution of the first query earlier, spend less time
blocked waiting on network, and let the local backend continue doing other
work while the foreign server is busy executing the statements.

The time spent preparing local rows to insert now overlaps with the RTT and
remote execution time, instead of happening serially. And there only has to
be one RTT wait, assuming the foreign server and network can keep up with
the rate we are generating requests at.

I can throw together some diagrams if it'll help. But in the libpq
pipelining patch I demonstrated a 300 times (3000%) performance improvement
on a test workload...

Anyway, this thread's batch insert can be progressed (and hopefully
> committed), and once the libpq batching has been committed, we can give it
> a try to use it and modify postgres_fdw to see if we can get further
> performance boost.
>

My point is that you should seriously consider whether batching is the
appropriate interface here, or whether the FDW can expose a pipeline-like
"queue work" then "wait for results" interface. That can be used to
implement batching exactly as currently proposed, it does not have to wait
for any libpq pipelining features. But it can *also* be used to implement
concurrent async requests in other FDWs, and to implement pipelining in
postgres_fdw once the needed libpq support is available.
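
To make the shape of that interface concrete, a purely conceptual sketch
(none of these names exist in PostgreSQL, and everything here is a
placeholder; the point is only that enqueueing is non-blocking and the
round-trip cost is paid once per drain rather than once per batch):

    typedef struct FdwRequestQueue FdwRequestQueue;   /* per-FDW opaque state */

    /* Queue one remote statement without waiting for the server's reply. */
    extern void fdw_queue_work(FdwRequestQueue *queue, const char *sql);

    /* Block once, collect all outstanding replies, and report any errors. */
    extern void fdw_wait_for_results(FdwRequestQueue *queue);

    /* Caller side: enqueue many, pay the round trip once at the end. */
    for (i = 0; i < nbatches; i++)
        fdw_queue_work(queue, batch_sql[i]);
    fdw_wait_for_results(queue);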

I don't know the FDW to postgres API well enough, and it's possible I'm
talking entirely out of my hat here.


> From: Tomas Vondra 
> > Not sure how is this related to app developers? I think the idea was
> > that the libpq features might be useful between the two PostgreSQL
> > instances. I.e. the postgres_fdw would use the libpq batching to send
> > chunks of data to the other side.
>
> > Well, my point was that we could keep the API, but maybe it should be
> > implemented using the proposed libpq batching. They could still use the
> > postgres_fdw example how to use the API, but the internals would need to
> > be different, of course.
>
> Yes, I understand them.  I just wondered if app developers use the
> statement batching API for libpq or JDBC in what kind of apps.


For JDBC, yes, it's used very heavily and has been for a long time, because
PgJDBC doesn't rely on libpq - it implements the protocol directly and
isn't bound by libpq's limitations. The application interface for it in
JDBC is a batch interface [1][2], not a pipelined interface, so that's what
PgJDBC users interact with [3] but batch execution is implemented using
protocol pipelining support inside PgJDBC [4]. A while ago I did some work
on deadlock prevention to work around issues with PgJDBC's implementation
[5] which was needed because the feature was so heavily used. Both were to
address customer needs in real world applications. The latter increased
application performance over 50x through round-trip elimination.

For libpq, no, batching and pipelining are not yet used by anybody because
application authors have to write to the libpq API and there hasn't been
any in-core support for batching. We've had async / non-blocking support
for a while, but it still enforces strict request/response ordering without
interleaving, so application authors cannot make use of the same postgres
server and proto

Re: Parallel plans and "union all" subquery

2020-11-26 Thread Greg Nancarrow
On Thu, Nov 26, 2020 at 6:11 PM Luc Vlaming  wrote:
>
> If interesting I can make a draft of what this would look like if this
> makes it easier to discuss?
>

Sure, that would help clarify it.

I did debug this a bit, but it seems my gut feeling was wrong: even
though it knows a type coercion is required and can be done, the
parse/analyze code doesn't actually modify the nodes in place "for
fear of changing the semantics". So when the types don't exactly match,
it's all left up to the planner, but for this parse tree it fails to
produce a parallel plan.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-11-26 Thread Bharath Rupireddy
On Wed, Nov 25, 2020 at 8:08 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:
>
> > When I read the patch again, I found that, with the patch, the shutdown
> > of worker_spi causes to report the following FATAL message.
> >
> >  FATAL:  terminating connection due to administrator command
> >
> > Isn't this message confusing because it's not a connection? If so,
> > we need to update ProcessInterrupts() so that the proper message is
> > reported like other bgworkers do.
> >
>
> This is also true for all the bgworkers that use the die() handler. How
about doing it the way bgworker_die() does in ProcessInterrupts()? This
would give meaningful information. Thoughts? If okay, I can make a separate
patch.
>

Attaching the patch that improved the message for bg workers in
ProcessInterrupts(). For instance, now it looks like "FATAL:  terminating
background worker "worker_spi" due to administrator command" or "FATAL:
terminating background worker "parallel worker" due to administrator
command" and so on for other bg workers.

Please review the patch.
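
The relevant branch in ProcessInterrupts() ends up looking roughly like
this, modeled on bgworker_die() (a sketch, not the exact patch):

    if (IsBackgroundWorker)
        ereport(FATAL,
                (errcode(ERRCODE_ADMIN_SHUTDOWN),
                 errmsg("terminating background worker \"%s\" due to administrator command",
                        MyBgworkerEntry->bgw_type)));
    else
        ereport(FATAL,
                (errcode(ERRCODE_ADMIN_SHUTDOWN),
                 errmsg("terminating connection due to administrator command")));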

I'm also mentioning the 2 previous patches posted in [1]. One of the patches
is for using die() instead of handle_sigterm() in test_shm_mq/worker.c and
the other is for replacing the custom SIGTERM handler worker_spi_sigterm() with
die() and the custom SIGHUP handler worker_spi_sighup() with the standard
SignalHandlerForConfigReload().

[1]
https://www.postgresql.org/message-id/CALj2ACWWy1YcngpCUn09AsXMfOzwjfNqbVosfoRY0vhhVWhVBw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v1-0001-Improve-error-message-in-ProcessInterrupts-for-bg-worker.patch
Description: Binary data


Re: Allow some recovery parameters to be changed with reload

2020-11-26 Thread Kyotaro Horiguchi
At Fri, 27 Nov 2020 09:48:25 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2020/11/27 9:30, Kyotaro Horiguchi wrote:
> > At Thu, 26 Nov 2020 22:43:48 +0900, Fujii Masao
> >  wrote in
> >>
> >>
> >> On 2020/11/12 4:38, Sergei Kornilov wrote:
> >>> Hello
> >>>
>  Anyway, for now I think that your first patch would be enough, i.e.,
>  just change the context of restore_command to PGC_SIGHUP.
> >>> Glad to hear. Attached a rebased version of the original proposal.
> >>
> >> Thanks for rebasing the patch!
> >>
> >>  This parameter is required for archive recovery,
> >>
> >> I found the above description in config.sgml. I was just wondering
> >> if it should be updated so that the actual specification is described
> >> or not.
> >> The actual spec is that restore_command is required to start archive
> >> recovery, but optional (i.e., the parameter can be reset to an empty)
> >> after archive recovery has started. But this updated version of
> >> description would be rather confusing to users. So I'm now thinking
> >> not to update that.
> >>
> >> Does anyone object to the patch? If no, I'm thinking to commit the
> >> patch.
> > Although I don't object to make the parameter reloadable, I think it
> > needs to be documented that server could stop after reloading if the
> > server failed to execute the new command line.
> 
> You mean that we should document that if restore_command is set to
> improper command mistakenly, archive recovery may fail to restore some
> archived WAL files and finish without replaying those WAL? But isn't
> this true even without applying the patch?

If we set a wrong command in .conf and start the server in recovery
mode, the server immediately stops. That's fine.  If we change
restore_command the wrong way on a running server, "pg_ctl reload" stops
the server.  If it is a hot standby, the server stops unexpectedly.

However, after rechecking, I found that recovery_end_command with
wrong content causes the server to stop at the end of recovery, or at
promotion. And that variable is PGC_SIGHUP.

So I agree not to document that.  Sorry for the noise.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: POC: postgres_fdw insert batching

2020-11-26 Thread tsunakawa.ta...@fujitsu.com
From: Tomas Vondra 
> Not sure how is this related to app developers? I think the idea was
> that the libpq features might be useful between the two PostgreSQL
> instances. I.e. the postgres_fdw would use the libpq batching to send
> chunks of data to the other side.

> Well, my point was that we could keep the API, but maybe it should be
> implemented using the proposed libpq batching. They could still use the
> postgres_fdw example how to use the API, but the internals would need to
> be different, of course.

Yes, I understand them.  I just wondered if app developers use the statement 
batching API for libpq or JDBC in what kind of apps.  That is, I talked about 
the batching API itself, not related to FDW.  (So, I mentioned I think I should 
ask such a question in the libpq batching thread.)

I expect postgresExecForeignBatchInsert() would be able to use the libpq 
batching API, because it receives an array of tuples and can generate and issue 
INSERT statement for each tuple.  But I'm not sure either if the libpq batching 
is likely to be committed in the near future.  (The thread looks too long...)  
Anyway, this thread's batch insert can be progressed (and hopefully committed), 
and once the libpq batching has been committed, we can give it a try to use it 
and modify postgres_fdw to see if we can get further performance boost.


Regards
Takayuki Tsunakawa




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-11-26 Thread Andy Fan
On Fri, Nov 27, 2020 at 8:10 AM David Rowley  wrote:

> Thanks for having another look at this.
>
> > On Sun, Nov 22, 2020 at 9:21 PM Andy Fan 
> wrote:
> > add 2 more comments.
> >
> > 1. I'd suggest adding Assert(false); in RC_END_OF_SCAN  case to make the
> error clearer.
> >
> > case RC_END_OF_SCAN:
> > /*
> > * We've already returned NULL for this scan, but just in case
> > * something call us again by mistake.
> > */
> > return NULL;
>
> This just took some inspiration from nodeMaterial.c where it says:
>
> /*
> * If necessary, try to fetch another row from the subplan.
> *
> * Note: the eof_underlying state variable exists to short-circuit further
> * subplan calls.  It's not optional, unfortunately, because some plan
> * node types are not robust about being called again when they've already
> * returned NULL.
> */
>
> I'm not feeling a pressing need to put an Assert(false); in there as
> it's not what nodeMaterial.c does.  nodeMaterial is nodeResultCache's
> sister node which can also be seen below Nested Loops.
>
>
OK, even though I do not quite understand the above now, I will try to
figure it out by myself. I'm OK with this decision.




> > 2. Currently we handle the (!cache_store_tuple(node, outerslot))) case
> by set it
> >to RC_CACHE_BYPASS_MODE. The only reason for the cache_store_tuple
> failure is
> >we can't cache_reduce_memory.  I guess if cache_reduce_memory
> >failed once, it would not succeed later(no more tuples can be stored,
> >nothing is changed). So I think we can record this state and avoid
> any new
> >cache_reduce_memory call.
> >
> > /*
> > * If we failed to create the entry or failed to store the
> > * tuple in the entry, then go into bypass mode.
> > */
> > if (unlikely(entry == NULL ||
> > !cache_store_tuple(node, outerslot)))
> >
> >  to
> >
> > if (unlikely(entry == NULL || node->memory_cant_be_reduced ||
> > !cache_store_tuple(node, outerslot)))
>
> The reason for RC_CACHE_BYPASS_MODE is if there's a single set of
> parameters that have so many results that they, alone, don't fit in
> the cache. We call cache_reduce_memory() whenever we go over our
> memory budget. That function returns false if it was unable to free
> enough memory without removing the "specialkey", which in this case is
> the current cache entry that's being populated. Later, when we're
> caching some entry that isn't quite so large, we still want to be able
> to cache that.  In that case, we'll have removed the remnants of the
> overly large entry that didn't fit, to make way for newer and, hopefully,
> smaller entries. No problems.  I'm not sure why there's a need for
> another flag here.
>
>
Thanks for the explanation, I'm sure I made some mistakes before at
this part.


-- 
Best Regards
Andy Fan


Re: Add table access method as an option to pgbench

2020-11-26 Thread David Zhang

Thanks Fabien for the comments.

On 2020-11-25 11:29 p.m., Fabien COELHO wrote:


Hello David,

The previous patch was based on branch "REL_13_STABLE". Now, the 
attached new patch v2 is based on master branch. I followed the new 
code structure using appendPQExpBuffer to append the new clause 
"using TABLEAM" in a proper position and tested. In the meantime, I 
also updated the pgbench.sqml file to reflect the changes.


My 0.02€: I'm fine with the added feature.

The patch lacks minimal coverage test. Consider adding something to 
pgbench tap tests, including failures (ie non existing method).
I added 3 test cases to the tap tests, but I am not sure if they are
following the rules. One question about the failures test: my thought
is that it should be in the no_server test cases, but it is hard to
verify the input as the table access method can be any name, such as
zheap, zedstore or zheap2, etc. Any suggestion would be great.


The option in the help string is not at the right alphabetical place.

Fixed in v3 patch.


I would merge the tableam declaration into the previous one with an
extended comment, e.g. "tablespace and access method selection".

Updated in v3 patch.


escape_tableam -> escaped_tableam ?


Updated in v3 patch.

By the way, I saw the same style for other variables, such as 
escape_tablespace, should this be fixed as well?


--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca





Re: Add table access method as an option to pgbench

2020-11-26 Thread David Zhang

Thanks a lot again for the review and comments.

On 2020-11-25 9:36 p.m., Michael Paquier wrote:

On Wed, Nov 25, 2020 at 12:13:55PM -0800, David Zhang wrote:

The previous patch was based on branch "REL_13_STABLE". Now, the attached
new patch v2 is based on master branch. I followed the new code structure
using appendPQExpBuffer to append the new clause "using TABLEAM" in a proper
position and tested. In the meantime, I also updated the pgbench.sqml file
to reflect the changes.

+ 
+  
--table-am=TABLEAM
+  
+   
+Create all tables with specified table access method
+TABLEAM.
+If unspecified, default is heap.
+   
+  
+ 
This needs to be in alphabetical order.

The order issue is fixed in v3 patch.

And what you say here is
wrong.  The default would not be heap if default_table_access_method
is set to something else.
Right, if the user changes the default setting in the GUC, then the default is
not `heap` any more.

I would suggest to use table_access_method
instead of TABLEAM,
For all other options that require values, the words are all capitalized,
such as TABLESPACE. Therefore, I used TABLEACCESSMETHOD instead of
table_access_method in the v3 patch.

and keep the paragraph minimal, say:
"Create tables using the specified table access method, rather than
the default."

Updated in v3 patch.


--table-am is really the best option name?  --table-access-method
sounds a bit more consistent to me.

Updated in v3 patch.


+   if (tableam != NULL)
+   {
+   char   *escape_tableam;
+
+   escape_tableam = PQescapeIdentifier(con, tableam, strlen(tableam));
+   appendPQExpBuffer(&query, " using %s", escape_tableam);
+   PQfreemem(escape_tableam);
+   }

Thanks for pointing this out. The order I believe is fixed in v3 patch.

The order is wrong here, generating an unsupported grammar, see by
yourself with this command:
pgbench --partition-method=hash --table-am=heap -i  --partitions 2


Tested in v3 patch, with a command like,

pgbench --partition-method=hash --table-access-method=heap -i --partitions 2



This makes the patch trickier than it looks as the USING clause is
located between PARTITION BY and WITH.  Also, partitioned tables
cannot use the USING clause so you need to be careful that
createPartitions() also uses the correct table AM.

This stuff needs tests.
3 test cases have been added to the tap test, but honestly I am not sure if
I am following the rules. Any comments will be great.

--
Michael

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca
From c64173842a163abbdd6210af5f9643b3301d8de0 Mon Sep 17 00:00:00 2001
From: David Zhang 
Date: Thu, 26 Nov 2020 18:08:49 -0800
Subject: [PATCH] add table access method as an option to pgbench

---
 doc/src/sgml/ref/pgbench.sgml|  9 ++
 src/bin/pgbench/pgbench.c| 23 ++-
 src/bin/pgbench/t/001_pgbench_with_server.pl | 30 +++-
 3 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 7180fedd65..3af5230f85 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -359,6 +359,15 @@ pgbench  options 
 d
   
  
 
+ 
+  
--table-access-method=TABLEACCESSMETHOD
+  
+   
+Create tables using the specified table access method, rather than the 
default. 
+   
+  
+ 
+ 
  
   
--tablespace=tablespace
   
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 3057665bbe..c3939b6bb4 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -184,10 +184,11 @@ doublethrottle_delay = 0;
 int64  latency_limit = 0;
 
 /*
- * tablespace selection
+ * tablespace and table access method selection
  */
 char  *tablespace = NULL;
 char  *index_tablespace = NULL;
+char  *table_access_method = NULL;
 
 /*
  * Number of "pgbench_accounts" partitions.  0 is the default and means no
@@ -641,6 +642,9 @@ usage(void)
   "  --partition-method=(range|hash)\n"
   "   partition pgbench_accounts with 
this method (default: range)\n"
   "  --partitions=NUM partition pgbench_accounts into 
NUM parts (default: 0)\n"
+  "  --table-access-method=TABLEACCESSMETHOD\n"
+  "   create tables with using 
specified table access method, \n"
+  "   rather than the default.\n"
   "  --tablespace=TABLESPACE  create tables in the specified 
tablespace\n"
   "  --unlogged-tablescreate tables as unlogged 
tables\n"
   "\nOptions to select what to run:\n"
@@ -3759,6 +3763,18 @@ initCreateTables(PGconn *con)
  ddl->table,
 

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-26 Thread k.jami...@fujitsu.com
> From: Kyotaro Horiguchi 
> Hello, Kirk. Thank you for the new version.

Hi, Horiguchi-san. Thank you for your very helpful feedback.
I'm updating the patches addressing those.

> + if (!smgrexists(rels[i], j))
> + continue;
> +
> + /* Get the number of blocks for a relation's fork */
> + blocks[i][numForks] = smgrnblocks(rels[i], j,
> NULL);
> 
> If we see a fork whose size is not cached, we must give up this
> optimization for all target relations.

I did not use the "cached" flag in DropRelFileNodesAllBuffers() and instead
use InRecovery when deciding whether to apply the optimization, for the
following reasons:
XLogReadBufferExtended() calls smgrnblocks() to apply changes to relation
page contents, so in DropRelFileNodeBuffers() it is called during VACUUM
replay because VACUUM changes page contents.
OTOH, TRUNCATE doesn't change the relation contents; it just truncates
relation pages without touching their contents. So XLogReadBufferExtended()
is not called, and the "cached" flag will always return false. I tested with
the "cached" flag before, and it always returned false, at least in
DropRelFileNodesAllBuffers(). Because of this, we cannot use the cached flag
in DropRelFileNodesAllBuffers(). However, I think we can still rely on
smgrnblocks() to get the file size as long as we're InRecovery; that cached
nblocks is still guaranteed to be the maximum in shared buffers.
Thoughts?
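
To make it concrete, the shape I have in mind is roughly the following
(a sketch only, not the actual patch; BUF_DROP_FULL_SCAN_THRESHOLD is a
placeholder name, and the three-argument smgrnblocks() is the patched one):

/* Sketch: gate the optimized (per-block) invalidation on InRecovery. */
if (InRecovery)
{
	BlockNumber total_blocks = 0;

	for (int i = 0; i < nrels; i++)
	{
		for (int fork = 0; fork <= MAX_FORKNUM; fork++)
		{
			if (!smgrexists(rels[i], fork))
				continue;
			/* during recovery this cached value is an upper bound */
			total_blocks += smgrnblocks(rels[i], fork, NULL);
		}
	}

	if (total_blocks < BUF_DROP_FULL_SCAN_THRESHOLD)
	{
		/* ... invalidate buffers by looking up each block ... */
		return;
	}
}
/* ... otherwise fall back to the full scan of shared buffers ... */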

Regards,
Kirk Jamison




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-11-26 Thread Bharath Rupireddy
On Wed, Nov 25, 2020 at 12:13 AM Alexey Kondratov
 wrote:
>
> 1) Return 'true' if there were open connections and we successfully
> closed them.
> 2) Return 'false' in the no-op case, i.e. there were no open
> connections.
> 3) Raise an error if something went wrong. And the non-existing server case
> belongs to this last category, IMO.
>

Done this way.
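
In code terms the behaviour is now roughly the following (a sketch of the
convention only, not the exact patch; disconnect_cached_connections is a
placeholder name for the internal helper):

Datum
postgres_fdw_disconnect(PG_FUNCTION_ARGS)
{
	char	   *servername = text_to_cstring(PG_GETARG_TEXT_PP(0));
	ForeignServer *server;
	bool		closed;

	/* raises an error if no such server exists */
	server = GetForeignServerByName(servername, false);

	/* true: open connection(s) existed and were closed; false: no-op */
	closed = disconnect_cached_connections(server);

	PG_RETURN_BOOL(closed);
}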

>
> I am not sure, but I think, that instead of adding this additional flag
> into ConnCacheEntry structure we can look on entry->xact_depth and use
> local:
>
> bool used_in_current_xact = entry->xact_depth > 0;
>
> for exactly the same purpose. Since we set entry->xact_depth to zero at
> the end of xact, then it was used if it is not zero. It is set to 1 by
> begin_remote_xact() called by GetConnection(), so everything seems to be
> fine.
>

Done.

>
> In the case of pgfdw_inval_callback() this argument makes sense, since
> syscache callbacks work that way, but here I can hardly imagine a case
> where we can use it. Thus, it still looks as a preliminary complication
> for me, since we do not have plans to use it, do we? Anyway, everything
> seems to be working fine, so it is up to you to keep this additional
> argument.
>

Removed the cacheid variable.

>
> Following this logic:
>
> 1) If keep_connections == true, then per-server keep_connection has a
> *higher* priority, so one can disable caching of a single foreign
> server.
>
> 2) But if keep_connections == false, then it works like a global switch
> off indifferently of per-server keep_connection's, i.e. they have a
> *lower* priority.
>
> It looks fine for me, at least I cannot propose anything better, but
> maybe it should be documented in 0004?
>

Done.

>
> I think that GUC acronym is used widely only in the source code and
> Postgres docs tend to do not use it at all, except from acronyms list
> and a couple of 'GUC parameters' collocation usage. And it never used in
> a singular form there, so I think that it should be rather:
>
> A configuration parameter,
> postgres_fdw.keep_connections, default being...
>

Done.

>
> The whole paragraph is really difficult to follow. It could be something
> like that:
>
>   
>Note that setting postgres_fdw.keep_connections
> to
>off does not discard any previously made and still open
> connections immediately.
>They will be closed only at the end of a future transaction, which
> operated on them.
>
>To close all connections immediately use
>postgres_fdw_disconnect function.
>   
>

Done.

Attaching the v2 patch set. Please review it further.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v2-0001-postgres_fdw-connection-cache-disconnect-function.patch
Description: Binary data


v2-0004-postgre_fdw-connection-cache-discard-tests-and-doc.patch
Description: Binary data


v2-0002-postgres_fdw-add-keep_connections-GUC-to-not-cache.patch
Description: Binary data


v2-0003-postgres_fdw-server-level-option-keep_connection.patch
Description: Binary data


Re: POC: postgres_fdw insert batching

2020-11-26 Thread Craig Ringer
On Fri, Nov 27, 2020 at 3:34 AM Tomas Vondra 
wrote:


> Not sure how this is related to app developers? I think the idea was
> that the libpq features might be useful between the two PostgreSQL
> instances. I.e. the postgres_fdw would use the libpq batching to send
> chunks of data to the other side.
>

Right. Or at least, when designing the FDW API, do so in a way that doesn't
strictly enforce Request/Response alternation without interleaving, so you
can benefit from it in the future.

It's hardly just libpq after all. A *lot* of client libraries and drivers
will be capable of non-blocking reads or writes with multiple ones in
flight at once. Any REST-like API generally can, for example. So for
performance reasons we should if possible avoid baking the assumption that
a request cannot be made until the response from the previous request is
received, and instead have a wait interface to use for when a new request
requires the prior response's result before it can proceed.

Well, my point was that we could keep the API, but maybe it should be
> implemented using the proposed libpq batching. They could still use the
> postgres_fdw example how to use the API, but the internals would need to
> be different, of course.
>

Sure. Or just allow room for it in the FDW API, though using the pipelining
support natively would be nice.

If the FDW interface allows Pg to

Insert A
Insert B
Insert C
Wait for outcome of insert A
...

then that'll be useful for using libpq pipelining, but also FDWs for all
sorts of other DBs, especially cloud-y ones where latency is a big concern.
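
Purely as an illustration of the shape I mean - these names are invented
here, not an existing or proposed API:

/* Hypothetical split "start"/"await" insert callbacks for the FDW API. */
typedef struct AsyncInsertHandle AsyncInsertHandle;		/* FDW-private */

typedef AsyncInsertHandle *(*BeginForeignInsertAsync_function)
			(EState *estate, ResultRelInfo *rinfo, TupleTableSlot *slot);

typedef TupleTableSlot *(*AwaitForeignInsert_function)
			(EState *estate, ResultRelInfo *rinfo, AsyncInsertHandle *h);

/*
 * Executor-side usage, conceptually:
 *
 *    h1 = BeginForeignInsertAsync(estate, rinfo, slotA);
 *    h2 = BeginForeignInsertAsync(estate, rinfo, slotB);
 *    h3 = BeginForeignInsertAsync(estate, rinfo, slotC);
 *    AwaitForeignInsert(estate, rinfo, h1);   (block only when the outcome
 *                                              is actually needed)
 */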


Re: Allow some recovery parameters to be changed with reload

2020-11-26 Thread Fujii Masao




On 2020/11/27 9:30, Kyotaro Horiguchi wrote:

At Thu, 26 Nov 2020 22:43:48 +0900, Fujii Masao  
wrote in



On 2020/11/12 4:38, Sergei Kornilov wrote:

Hello


Anyway, for now I think that your first patch would be enough, i.e.,
just change the context of restore_command to PGC_SIGHUP.

Glad to hear. Attached a rebased version of the original proposal.


Thanks for rebasing the patch!

 This parameter is required for archive recovery,

I found the above description in config.sgml. I was just wondering
if it should be updated so that the actual specification is described
or not.
The actual spec is that restore_command is required to start archive
recovery, but optional (i.e., the parameter can be reset to an empty)
after archive recovery has started. But this updated version of
description would be rather confusing to users. So I'm now thinking
not to update that.

Does anyone object to the patch? If no, I'm thinking to commit the
patch.


Although I don't object to making the parameter reloadable, I think it
needs to be documented that the server could stop after reloading if it
fails to execute the new command line.


You mean that we should document that if restore_command is mistakenly set
to an improper command, archive recovery may fail to restore some archived
WAL files and finish without replaying that WAL? But isn't this true even
without applying the patch?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Allow some recovery parameters to be changed with reload

2020-11-26 Thread Kyotaro Horiguchi
At Thu, 26 Nov 2020 22:43:48 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2020/11/12 4:38, Sergei Kornilov wrote:
> > Hello
> > 
> >> Anyway, for now I think that your first patch would be enough, i.e.,
> >> just change the context of restore_command to PGC_SIGHUP.
> > Glad to hear. Attached a rebased version of the original proposal.
> 
> Thanks for rebasing the patch!
> 
> This parameter is required for archive recovery,
> 
> I found the above description in config.sgml. I was just wondering
> if it should be updated so that the actual specification is described
> or not.
> The actual spec is that restore_command is required to start archive
> recovery, but optional (i.e., the parameter can be reset to an empty)
> after archive recovery has started. But this updated version of
> description would be rather confusing to users. So I'm now thinking
> not to update that.
> 
> Does anyone object to the patch? If no, I'm thinking to commit the
> patch.

Although I don't object to making the parameter reloadable, I think it
needs to be documented that the server could stop after reloading if it
fails to execute the new command line.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Rename of triggers for partitioned tables

2020-11-26 Thread Arne Roland
I'm sorry for the spam. Here is a patch with updated comments; I missed one before.


Regards

Arne


From: Arne Roland 
Sent: Friday, November 27, 2020 1:05:02 AM
To: Pg Hackers
Subject: Re: Rename of triggers for partitioned tables


I'm too unhappy with the interface change, so please view the attached patch
instead.


Regards

Arne


From: Arne Roland 
Sent: Friday, November 27, 2020 12:28:31 AM
To: Pg Hackers
Subject: Rename of triggers for partitioned tables


Hello,


I got too annoyed at building queries for gexec all the time, so I wrote a
patch to fix the issue that renaming a partitioned table's trigger doesn't
affect its children.


In case there isn't a dependent trigger with the same name on the child, the
function errors out. If the user has touched the child's trigger, it's not a
sensible idea to mess with it.


I'm not happy with the interface change in trigger.h. Yet I don't like the idea 
of creating another layer of indirection either.


Even though it's a small patch, there might be other things I'm missing, so I'd 
appreciate feedback.


Regards

Arne
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index c336b238aa..f4647d38c5 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -1357,12 +1357,7 @@ get_trigger_oid(Oid relid, const char *trigname, bool missing_ok)
 	return oid;
 }
 
-/*
- * Perform permissions and integrity checks before acquiring a relation lock.
- */
-static void
-RangeVarCallbackForRenameTrigger(const RangeVar *rv, Oid relid, Oid oldrelid,
- void *arg)
+static void callbackForRenameTrigger(char* relname, Oid relid)
 {
 	HeapTuple	tuple;
 	Form_pg_class form;
@@ -1379,25 +1374,36 @@ RangeVarCallbackForRenameTrigger(const RangeVar *rv, Oid relid, Oid oldrelid,
 		ereport(ERROR,
 (errcode(ERRCODE_WRONG_OBJECT_TYPE),
  errmsg("\"%s\" is not a table, view, or foreign table",
-		rv->relname)));
+		relname)));
 
 	/* you must own the table to rename one of its triggers */
 	if (!pg_class_ownercheck(relid, GetUserId()))
-		aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relid)), rv->relname);
+		aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relid)), relname);
 	if (!allowSystemTableMods && IsSystemClass(relid, form))
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  errmsg("permission denied: \"%s\" is a system catalog",
-		rv->relname)));
+		relname)));
 
 	ReleaseSysCache(tuple);
 }
 
 /*
- *		renametrig		- changes the name of a trigger on a relation
+ * Perform permissions and integrity checks before acquiring a relation lock.
+ */
+static void
+RangeVarCallbackForRenameTrigger(const RangeVar *rv, Oid relid, Oid oldrelid,
+ void *arg)
+{
+		callbackForRenameTrigger(rv->relname, relid);
+}
+
+/*
+ *		renameonlytrig		- changes the name of a trigger on a relation
  *
  *		trigger name is changed in trigger catalog.
  *		No record of the previous name is kept.
+ *	  This function assumes that the row is already locked
  *
  *		get proper relrelation from relation catalog (if not arg)
  *		scan trigger catalog
@@ -1407,41 +1413,26 @@ RangeVarCallbackForRenameTrigger(const RangeVar *rv, Oid relid, Oid oldrelid,
  *		update row in catalog
  */
 ObjectAddress
-renametrig(RenameStmt *stmt)
+renameonlytrig(RenameStmt *stmt, Relation tgrel, Oid relid, Oid tgparent)
 {
 	Oid			tgoid;
 	Relation	targetrel;
-	Relation	tgrel;
 	HeapTuple	tuple;
 	SysScanDesc tgscan;
 	ScanKeyData key[2];
-	Oid			relid;
 	ObjectAddress address;
 
-	/*
-	 * Look up name, check permissions, and acquire lock (which we will NOT
-	 * release until end of transaction).
-	 */
-	relid = RangeVarGetRelidExtended(stmt->relation, AccessExclusiveLock,
-	 0,
-	 RangeVarCallbackForRenameTrigger,
-	 NULL);
 
 	/* Have lock already, so just need to build relcache entry. */
 	targetrel = relation_open(relid, NoLock);
 
+
 	/*
 	 * Scan pg_trigger twice for existing triggers on relation.  We do this in
 	 * order to ensure a trigger does not exist with newname (The unique index
 	 * on tgrelid/tgname would complain anyway) and to ensure a trigger does
 	 * exist with oldname.
 	 *
-	 * NOTE that this is cool only because we have AccessExclusiveLock on the
-	 * relation, so the trigger set won't be changing underneath us.
-	 */
-	tgrel = table_open(TriggerRelationId, RowExclusiveLock);
-
-	/*
 	 * First pass -- look for name conflict
 	 */
 	ScanKeyInit(&key[0],
@@ -1472,6 +1463,7 @@ renametrig(RenameStmt *stmt)
 Anum_pg_trigger_tgname,
 BTEqualStrategyNumber, F_NAMEEQ,
 PointerGetDatum(stmt->subname));
+
 	tgscan = systable_beginscan(tgrel, TriggerRelidNameIndexId, true,
 NULL, 2, key);
 	if (HeapTupleIsValid(tuple = systable_getnext(tgscan)))
@@ -1484,7 +1476,12 @@ renametrig(RenameStmt *stmt)
 		tuple = heap_copytuple(tuple);	/* need a modifiable copy */
 		tr

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-11-26 Thread David Rowley
Thanks for having another look at this.

> On Sun, Nov 22, 2020 at 9:21 PM Andy Fan  wrote:
> add 2 more comments.
>
> 1. I'd suggest adding Assert(false); in RC_END_OF_SCAN  case to make the 
> error clearer.
>
> case RC_END_OF_SCAN:
> /*
> * We've already returned NULL for this scan, but just in case
> * something call us again by mistake.
> */
> return NULL;

This just took some inspiration from nodeMaterial.c where it says:

/*
* If necessary, try to fetch another row from the subplan.
*
* Note: the eof_underlying state variable exists to short-circuit further
* subplan calls.  It's not optional, unfortunately, because some plan
* node types are not robust about being called again when they've already
* returned NULL.
*/

I'm not feeling a pressing need to put an Assert(false); in there as
it's not what nodeMaterial.c does.  nodeMaterial is nodeResultCache's
sister node which can also be seen below Nested Loops.

> 2. Currently we handle the (!cache_store_tuple(node, outerslot))) case by set 
> it
>to RC_CACHE_BYPASS_MODE. The only reason for the cache_store_tuple failure 
> is
>we can't cache_reduce_memory.  I guess if cache_reduce_memory
>failed once, it would not succeed later(no more tuples can be stored,
>nothing is changed). So I think we can record this state and avoid any new
>cache_reduce_memory call.
>
> /*
> * If we failed to create the entry or failed to store the
> * tuple in the entry, then go into bypass mode.
> */
> if (unlikely(entry == NULL ||
> !cache_store_tuple(node, outerslot)))
>
>  to
>
> if (unlikely(entry == NULL || node->memory_cant_be_reduced ||
> !cache_store_tuple(node, outerslot)))

The reason for RC_CACHE_BYPASS_MODE is if there's a single set of
parameters that have so many results that they, alone, don't fit in
the cache. We call cache_reduce_memory() whenever we go over our
memory budget. That function returns false if it was unable to free
enough memory without removing the "specialkey", which in this case is
the current cache entry that's being populated. Later, when we're
caching some entry that isn't quite so large, we still want to be able
to cache that.  In that case, we'll have removed the remnants of the
overly large entry that didn't fit, to make way for newer and, hopefully,
smaller entries. No problems.  I'm not sure why there's a need for
another flag here.

A bit more background.

When caching a new entry, or finding an existing entry, we move that
entry to the top of the MRU dlist.  When adding entries or tuples to
existing entries, if we've gone over memory budget, then we remove
cache entries from the MRU list starting at the tail (least recently
used). If we begin caching tuples for an entry and need to free some
space, then since we've put the current entry to the top of the MRU
list, it'll be the last one to be removed. However, it's still
possible that we run through the entire MRU list and end up at the
most recently used item. So the entry we're populating can also be
removed if freeing everything else was still not enough to give us
enough free memory. The code refers to this as a cache overflow. This
causes the state machine to move into RC_CACHE_BYPASS_MODE mode. We'll
just read tuples directly from the subnode in that case, no need to
attempt to cache them. They're not going to fit.   We'll come out of
RC_CACHE_BYPASS_MODE when doing the next rescan with a different set
of parameters.  This is our chance to try caching things again. The
code does that. There might be far fewer tuples for the next parameter
we're scanning for, or those tuples might be more narrow. So it makes
sense to give caching them another try. Perhaps there's some point
where we should give up doing that, but given good statistics, it's
unlikely the planner would have thought a result cache would have been
worth the trouble and would likely have picked some other way to
execute the plan.  The planner does estimate the average size of a
cache entry and calculates how many of those fit into a hash_mem.  If
that number is too low then Result Caching the inner side won't be too
appealing.  Of course, calculating the average does not mean there are
no outliers. We'll deal with the large side of the outliers with the
bypass code.
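
Or, putting the transitions in code form (simplified; apart from
RC_CACHE_BYPASS_MODE, the field and state names here are placeholders rather
than the patch's exact ones):

/* Storing a tuple for the current entry (as quoted above). */
if (unlikely(entry == NULL || !cache_store_tuple(node, outerslot)))
{
	/* cache overflow: stop caching for this parameter set */
	node->rc_status = RC_CACHE_BYPASS_MODE;
}

/* At the next rescan, with a different parameter set: */
if (node->rc_status == RC_CACHE_BYPASS_MODE)
{
	/* give caching another chance; the next entry may be much smaller */
	node->rc_status = RC_CACHE_LOOKUP;		/* placeholder state name */
}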

I currently don't really see what needs to be changed about that.

David




Re: Rename of triggers for partitioned tables

2020-11-26 Thread Arne Roland
I'm too unhappy with the interface change, so please view the attached patch
instead.


Regards

Arne


From: Arne Roland 
Sent: Friday, November 27, 2020 12:28:31 AM
To: Pg Hackers
Subject: Rename of triggers for partitioned tables


Hello,


I got too annoyed at building queries for gexec all the time, so I wrote a
patch to fix the issue that renaming a partitioned table's trigger doesn't
affect its children.


In case there isn't a dependent trigger with the same name on the child, the
function errors out. If the user has touched the child's trigger, it's not a
sensible idea to mess with it.


I'm not happy with the interface change in trigger.h. Yet I don't like the idea 
of creating another layer of indirection either.


Even though it's a small patch, there might be other things I'm missing, so I'd 
appreciate feedback.


Regards

Arne
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index c336b238aa..f4647d38c5 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -1357,12 +1357,7 @@ get_trigger_oid(Oid relid, const char *trigname, bool missing_ok)
 	return oid;
 }
 
-/*
- * Perform permissions and integrity checks before acquiring a relation lock.
- */
-static void
-RangeVarCallbackForRenameTrigger(const RangeVar *rv, Oid relid, Oid oldrelid,
- void *arg)
+static void callbackForRenameTrigger(char* relname, Oid relid)
 {
 	HeapTuple	tuple;
 	Form_pg_class form;
@@ -1379,25 +1374,36 @@ RangeVarCallbackForRenameTrigger(const RangeVar *rv, Oid relid, Oid oldrelid,
 		ereport(ERROR,
 (errcode(ERRCODE_WRONG_OBJECT_TYPE),
  errmsg("\"%s\" is not a table, view, or foreign table",
-		rv->relname)));
+		relname)));
 
 	/* you must own the table to rename one of its triggers */
 	if (!pg_class_ownercheck(relid, GetUserId()))
-		aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relid)), rv->relname);
+		aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relid)), relname);
 	if (!allowSystemTableMods && IsSystemClass(relid, form))
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  errmsg("permission denied: \"%s\" is a system catalog",
-		rv->relname)));
+		relname)));
 
 	ReleaseSysCache(tuple);
 }
 
 /*
- *		renametrig		- changes the name of a trigger on a relation
+ * Perform permissions and integrity checks before acquiring a relation lock.
+ */
+static void
+RangeVarCallbackForRenameTrigger(const RangeVar *rv, Oid relid, Oid oldrelid,
+ void *arg)
+{
+		callbackForRenameTrigger(rv->relname, relid);
+}
+
+/*
+ *		renameonlytrig		- changes the name of a trigger on a relation
  *
  *		trigger name is changed in trigger catalog.
  *		No record of the previous name is kept.
+ *	  This function assumes that the row is already locked
  *
  *		get proper relrelation from relation catalog (if not arg)
  *		scan trigger catalog
@@ -1407,41 +1413,26 @@ RangeVarCallbackForRenameTrigger(const RangeVar *rv, Oid relid, Oid oldrelid,
  *		update row in catalog
  */
 ObjectAddress
-renametrig(RenameStmt *stmt)
+renameonlytrig(RenameStmt *stmt, Relation tgrel, Oid relid, Oid tgparent)
 {
 	Oid			tgoid;
 	Relation	targetrel;
-	Relation	tgrel;
 	HeapTuple	tuple;
 	SysScanDesc tgscan;
 	ScanKeyData key[2];
-	Oid			relid;
 	ObjectAddress address;
 
-	/*
-	 * Look up name, check permissions, and acquire lock (which we will NOT
-	 * release until end of transaction).
-	 */
-	relid = RangeVarGetRelidExtended(stmt->relation, AccessExclusiveLock,
-	 0,
-	 RangeVarCallbackForRenameTrigger,
-	 NULL);
 
 	/* Have lock already, so just need to build relcache entry. */
 	targetrel = relation_open(relid, NoLock);
 
+
 	/*
 	 * Scan pg_trigger twice for existing triggers on relation.  We do this in
 	 * order to ensure a trigger does not exist with newname (The unique index
 	 * on tgrelid/tgname would complain anyway) and to ensure a trigger does
 	 * exist with oldname.
 	 *
-	 * NOTE that this is cool only because we have AccessExclusiveLock on the
-	 * relation, so the trigger set won't be changing underneath us.
-	 */
-	tgrel = table_open(TriggerRelationId, RowExclusiveLock);
-
-	/*
 	 * First pass -- look for name conflict
 	 */
 	ScanKeyInit(&key[0],
@@ -1472,6 +1463,7 @@ renametrig(RenameStmt *stmt)
 Anum_pg_trigger_tgname,
 BTEqualStrategyNumber, F_NAMEEQ,
 PointerGetDatum(stmt->subname));
+
 	tgscan = systable_beginscan(tgrel, TriggerRelidNameIndexId, true,
 NULL, 2, key);
 	if (HeapTupleIsValid(tuple = systable_getnext(tgscan)))
@@ -1484,7 +1476,12 @@ renametrig(RenameStmt *stmt)
 		tuple = heap_copytuple(tuple);	/* need a modifiable copy */
 		trigform = (Form_pg_trigger) GETSTRUCT(tuple);
 		tgoid = trigform->oid;
-
+		if (tgparent != 0 && tgparent != trigform->tgparentid) {
+			ereport(ERROR,
+	(errcode(ERRCODE_UNDEFINED_OBJECT),
+	 errmsg("trigger \"%s\" for table \"%s\" is not dependent on the 

Re: [PoC] Non-volatile WAL buffer

2020-11-26 Thread Tomas Vondra



On 11/26/20 10:19 PM, Tomas Vondra wrote:
> 
> 
> On 11/26/20 9:59 PM, Heikki Linnakangas wrote:
>> On 26/11/2020 21:27, Tomas Vondra wrote:
>>> Hi,
>>>
>>> Here's the "simple patch" that I'm currently experimenting with. It
>>> essentially replaces open/close/write/fsync with pmem calls
>>> (map/unmap/memcpy/persist variants), and it's by no means committable.
>>> But it works well enough for experiments / measurements, etc.
>>>
>>> The numbers (5-minute pgbench runs on scale 500) look like this:
>>>
>>>   master/btt    master/dax   ntt    simple
>>>     ---
>>>   1 5469  7402  7977  6746
>>>  16    48222 80869    107025 82343
>>>  32    73974    158189    214718    158348
>>>  64    85921    154540    225715    164248
>>>  96   150602    221159    237008    217253
>>>
>>> A chart illustrating these results is attached. The four columns are
>>> showing unpatched master with WAL on a pmem device, in BTT or DAX modes,
>>> "ntt" is the patch submitted to this thread, and "simple" is the patch
>>> I've hacked together.
>>>
>>> As expected, the BTT case performs poorly (compared to the rest).
>>>
>>> The "master/dax" and "simple" perform about the same. There are some
>>> differences, but those may be attributed to noise. The NTT patch does
>>> outperform these cases by ~20-40% in some cases.
>>>
>>> The question is why. I recall suggestions this is due to page faults
>>> when writing data into the WAL, but I did experiment with various
>>> settings that I think should prevent that (e.g. disabling WAL reuse
>>> and/or disabling zeroing the segments) but that made no measurable
>>> difference.
>>
>> The page faults are only a problem when mmap() is used *without* DAX.
>>
>> Takashi tried a patch earlier to mmap() WAL segments and insert WAL to
>> them directly. See 0002-Use-WAL-segments-as-WAL-buffers.patch at
>> https://www.postgresql.org/message-id/01d5dff4%24995ed180%24cc1c7480%24%40hco.ntt.co.jp_1.
>> Could you test that patch too, please? Using your nomenclature, that
>> patch skips wal_buffers and does:
>>
>>   clients -> wal segments (PMEM DAX)
>>
>> He got good results with that with DAX, but otherwise it performed
>> worse. And then we discussed why that might be, and the page fault
>> hypothesis was brought up.
>>
> 
> D'oh, I haven't noticed there's a patch doing that. This thread has so
> many different patches - which is good, but a bit confusing.
> 
>> I think 0002-Use-WAL-segments-as-WAL-buffers.patch is the most promising
>> approach here. But because it's slower without DAX, we need to keep the
>> current code for non-DAX systems. Unfortunately it means that we need to
>> maintain both implementations, selectable with a GUC or some DAX
>> detection magic. The question then is whether the code complexity is
>> worth the performance gain on DAX-enabled systems.
>>
> 
> Sure, I can give it a spin. The question is whether it applies to
> current master, or whether some sort of rebase is needed. I'll try.
> 

Unfortunately, that patch seems to fail for me :-(

The patches seem to be for PG12, so I applied them on REL_12_STABLE (all
the parts 0001-0005) and then I did this:

LIBS="-lpmem" ./configure --prefix=/home/tomas/pg-12-pmem --enable-debug
make -s install

initdb -X /opt/pmemdax/benchmarks/wal -D /opt/nvme/benchmarks/data

pg_ctl -D /opt/nvme/benchmarks/data/ -l pg.log start

createdb test
pgbench -i -s 500 test


which however fails after just about 70k rows generated (PQputline
failed), and the pg.log says this:

PANIC:  could not open or mmap file
"pg_wal/00010006": No such file or directory
CONTEXT:  COPY pgbench_accounts, line 721000
STATEMENT:  copy pgbench_accounts from stdin

Takashi-san, can you check and provide a fixed version? Ideally, I'll
take a look too, but I'm not familiar with this patch so it may take
more time.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Rename of triggers for partitioned tables

2020-11-26 Thread Arne Roland
Hello,


I got too annoyed at building queries for gexec all the time, so I wrote a
patch to fix the issue that renaming a partitioned table's trigger doesn't
affect its children.


In case there isn't a dependent trigger with the same name on the child, the
function errors out. If the user has touched the child's trigger, it's not a
sensible idea to mess with it.


I'm not happy with the interface change in trigger.h. Yet I don't like the idea 
of creating another layer of indirection either.


Even though it's a small patch, there might be other things I'm missing, so I'd 
appreciate feedback.


Regards

Arne
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index b11ebf0f61..e92dae78f7 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -365,7 +365,7 @@ ExecRenameStmt(RenameStmt *stmt)
 	 stmt->newname);
 
 		case OBJECT_TRIGGER:
-			return renametrig(stmt);
+			return renametrig(stmt, 0, 0);
 
 		case OBJECT_POLICY:
 			return rename_policy(stmt);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index c336b238aa..039386958a 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -1357,12 +1357,7 @@ get_trigger_oid(Oid relid, const char *trigname, bool missing_ok)
 	return oid;
 }
 
-/*
- * Perform permissions and integrity checks before acquiring a relation lock.
- */
-static void
-RangeVarCallbackForRenameTrigger(const RangeVar *rv, Oid relid, Oid oldrelid,
- void *arg)
+static void callbackForRenameTrigger(char* relname, Oid relid)
 {
 	HeapTuple	tuple;
 	Form_pg_class form;
@@ -1379,22 +1374,32 @@ RangeVarCallbackForRenameTrigger(const RangeVar *rv, Oid relid, Oid oldrelid,
 		ereport(ERROR,
 (errcode(ERRCODE_WRONG_OBJECT_TYPE),
  errmsg("\"%s\" is not a table, view, or foreign table",
-		rv->relname)));
+		relname)));
 
 	/* you must own the table to rename one of its triggers */
 	if (!pg_class_ownercheck(relid, GetUserId()))
-		aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relid)), rv->relname);
+		aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relid)), relname);
 	if (!allowSystemTableMods && IsSystemClass(relid, form))
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  errmsg("permission denied: \"%s\" is a system catalog",
-		rv->relname)));
+		relname)));
 
 	ReleaseSysCache(tuple);
 }
 
 /*
- *		renametrig		- changes the name of a trigger on a relation
+ * Perform permissions and integrity checks before acquiring a relation lock.
+ */
+static void
+RangeVarCallbackForRenameTrigger(const RangeVar *rv, Oid relid, Oid oldrelid,
+ void *arg)
+{
+		callbackForRenameTrigger(rv->relname, relid);
+}
+
+/*
+ *		renameonlytrig		- changes the name of a trigger on a relation
  *
  *		trigger name is changed in trigger catalog.
  *		No record of the previous name is kept.
@@ -1407,41 +1412,26 @@ RangeVarCallbackForRenameTrigger(const RangeVar *rv, Oid relid, Oid oldrelid,
  *		update row in catalog
  */
 ObjectAddress
-renametrig(RenameStmt *stmt)
+renameonlytrig(RenameStmt *stmt, Relation tgrel, Oid relid, Oid tgparent)
 {
 	Oid			tgoid;
 	Relation	targetrel;
-	Relation	tgrel;
 	HeapTuple	tuple;
 	SysScanDesc tgscan;
 	ScanKeyData key[2];
-	Oid			relid;
 	ObjectAddress address;
 
-	/*
-	 * Look up name, check permissions, and acquire lock (which we will NOT
-	 * release until end of transaction).
-	 */
-	relid = RangeVarGetRelidExtended(stmt->relation, AccessExclusiveLock,
-	 0,
-	 RangeVarCallbackForRenameTrigger,
-	 NULL);
 
 	/* Have lock already, so just need to build relcache entry. */
 	targetrel = relation_open(relid, NoLock);
 
+
 	/*
 	 * Scan pg_trigger twice for existing triggers on relation.  We do this in
 	 * order to ensure a trigger does not exist with newname (The unique index
 	 * on tgrelid/tgname would complain anyway) and to ensure a trigger does
 	 * exist with oldname.
 	 *
-	 * NOTE that this is cool only because we have AccessExclusiveLock on the
-	 * relation, so the trigger set won't be changing underneath us.
-	 */
-	tgrel = table_open(TriggerRelationId, RowExclusiveLock);
-
-	/*
 	 * First pass -- look for name conflict
 	 */
 	ScanKeyInit(&key[0],
@@ -1472,6 +1462,7 @@ renametrig(RenameStmt *stmt)
 Anum_pg_trigger_tgname,
 BTEqualStrategyNumber, F_NAMEEQ,
 PointerGetDatum(stmt->subname));
+
 	tgscan = systable_beginscan(tgrel, TriggerRelidNameIndexId, true,
 NULL, 2, key);
 	if (HeapTupleIsValid(tuple = systable_getnext(tgscan)))
@@ -1484,7 +1475,12 @@ renametrig(RenameStmt *stmt)
 		tuple = heap_copytuple(tuple);	/* need a modifiable copy */
 		trigform = (Form_pg_trigger) GETSTRUCT(tuple);
 		tgoid = trigform->oid;
-
+		if (tgparent != 0 && tgparent != trigform->tgparentid) {
+			ereport(ERROR,
+	(errcode(ERRCODE_UNDEFINED_OBJECT),
+	 errmsg("trigger \"%s\" for table \"%s\" is not dependent on the trigger on

Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Tom Lane
Alexander Korotkov  writes:
> On Thu, Nov 26, 2020 at 1:32 PM Heikki Linnakangas  wrote:
>> Is there some official ARM documentation, like a programmer's reference
>> manual or something like that, that would show a reference
>> implementation of a spinlock on ARM? It would be good to refer to an
>> authoritative source on this.

> I've compared assembly output of gcc implementations of CAS and TAS.

FWIW, I see quite different assembly using Apple's clang on their M1
processor.  What I get for SpinLockAcquire on HEAD is (lock pointer
initially in x0):

mov x19, x0
mov w8, #1
swpal   w8, w8, [x0]
cbz w8, LBB0_2
adrpx1, l_.str@PAGE
add x1, x1, l_.str@PAGEOFF
adrpx3, l___func__.foo@PAGE
add x3, x3, l___func__.foo@PAGEOFF
mov x0, x19
mov w2, #12
bl  _s_lock
LBB0_2:
... lock is acquired

while SpinLockRelease is just

stlrwzr, [x19]

With the patch, I get

mov x19, x0
mov w8, #0
mov w9, #1
casaw8, w9, [x0]
cmp w8, #0  ; =0
b.eqLBB0_2
adrpx1, l_.str@PAGE
add x1, x1, l_.str@PAGEOFF
adrpx3, l___func__.foo@PAGE
add x3, x3, l___func__.foo@PAGEOFF
mov x0, x19
mov w2, #12
bl  _s_lock
LBB0_2:
... lock is acquired

and SpinLockRelease is the same.

Don't know much of anything about ARM assembly, so I don't
know if these instructions are late-model-only.
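
For what it's worth, here's a sketch of the two C-level primitives I believe
those listings come from (GCC/Clang __sync builtins; the actual s_lock.h
macros differ in detail):

/* HEAD: atomic swap, compiled to swpal above */
static inline int
tas_swap(volatile int *lock)
{
    return __sync_lock_test_and_set(lock, 1);       /* returns old value */
}

/* patched: compare-and-swap, compiled to casa above */
static inline int
tas_cas(volatile int *lock)
{
    return __sync_val_compare_and_swap(lock, 0, 1); /* returns old value */
}

/* either way, the lock was free iff the returned old value is 0 */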

regards, tom lane




Re: pgbench and timestamps (bounced)

2020-11-26 Thread Fabien COELHO




CFM reminder.

Hi, this entry is "Waiting on Author" and the thread was inactive for a 
while. I see this discussion still has some open questions. Are you 
going to continue working on it, or should I mark it as "returned with 
feedback" until a better time?


IMHO the proposed fix is reasonable and addresses the issue.

I have responded to Tom's remarks about it, and it is waiting for his 
answer to my counter arguments. So ISTM that the patch is currently still 
waiting for some feedback.


--
Fabien.




Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Tom Lane
Krunal Bauskar  writes:
> On Thu, 26 Nov 2020 at 10:50, Tom Lane  wrote:
>> Also, exactly what hardware/software platform were these curves
>> obtained on?

> Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for
> server and 8 for client) [2 numa nodes]
> Storage: 3.2 TB NVMe SSD
> OS: CentOS Linux release 7.6
> PGSQL: baseline = Release Tag 13.1

Hmm, might not be the sort of hardware ordinary mortals can get their
hands on.  What's likely to be far more common ARM64 hardware in the
near future is Apple's new gear.  So I thought I'd try this on the new
M1 mini I just got.

... and, after retrieving my jaw from the floor, I present the
attached.  Apple's chips evidently like this style of spinlock a LOT
better.  The difference is so remarkable that I wonder if I made a
mistake somewhere.  Can anyone else replicate these results?

Test conditions are absolutely brain dead:

Today's HEAD (dcfff74fb), no special build options

All server parameters are out-of-the-box defaults, except
I had to raise max_connections for the larger client counts

pgbench scale factor 100

Read-only tests are like
pgbench -S -T 60 -c 32 -j 16 bench
Quoted figure is median of three runs; except for the lowest
client count, results were quite repeatable.  (I speculate that
at -c 4, the scheduler might've been doing something funny about
sometimes using the slow cores instead of fast cores.)

Read-write tests are like
pgbench -T 300 -c 16 -j 8 bench
I didn't have the patience to run three full repetitions,
but again the numbers seemed pretty repeatable.

I used -j equal to half -c, except I could not get -j above 128
to work, so the larger client counts have -j 128.  Did not try
to run down that problem yet, but I'm probably hitting some ulimit
somewhere.  (I did have to raise "ulimit -n" to get these results.)

Anyway, this seems to be a slam-dunk win on M1.

regards, tom lane



Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-26 Thread Alvaro Herrera
On 2020-Nov-26, Alvaro Herrera wrote:

> So let's discuss the next step in this series: what to do about REINDEX
> CONCURRENTLY.

> [...]

... as in the attached.

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index ca24620fd0..9c1c0aad3e 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -385,7 +385,7 @@ CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY
+ * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
  * on indexes that are neither expressional nor partial are also safe to
  * ignore, since we know that those processes won't examine any data
  * outside the table they're indexing.
@@ -1564,9 +1564,11 @@ DefineIndex(Oid relationId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+	/*
+	 * This transaction doesn't need to set the PROC_IN_SAFE_IC flag, because
+	 * it only takes a snapshot to do some catalog manipulations, after the
+	 * wait is over.
+	 */
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -3020,6 +3022,13 @@ ReindexMultipleInternal(List *relids, int options)
  * concerns, so there's no need for the more complicated concurrent build,
  * anyway, and a non-concurrent reindex is more efficient.
  */
+
+typedef struct reindex_idx
+{
+	Oid		indexId;
+	bool	safe;
+} reindex_idx;
+
 static bool
 ReindexRelationConcurrently(Oid relationOid, int options)
 {
@@ -3132,10 +3141,15 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		get_rel_name(cellOid;
 	else
 	{
+		reindex_idx	   *idx;
+
 		/* Save the list of relation OIDs in private context */
 		oldcontext = MemoryContextSwitchTo(private_context);
 
-		indexIds = lappend_oid(indexIds, cellOid);
+		idx = palloc(sizeof(reindex_idx));
+		idx->indexId = cellOid;
+
+		indexIds = lappend(indexIds, idx);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -3172,13 +3186,17 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 			get_rel_name(cellOid;
 		else
 		{
+			reindex_idx	   *idx;
+
 			/*
 			 * Save the list of relation OIDs in private
 			 * context
 			 */
 			oldcontext = MemoryContextSwitchTo(private_context);
 
-			indexIds = lappend_oid(indexIds, cellOid);
+			idx = palloc(sizeof(reindex_idx));
+			idx->indexId = cellOid;
+			indexIds = lappend(indexIds, idx);
 
 			MemoryContextSwitchTo(oldcontext);
 		}
@@ -3197,6 +3215,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 Oid			heapId = IndexGetRelation(relationOid,
 	  (options & REINDEXOPT_MISSING_OK) != 0);
 Relation	heapRelation;
+reindex_idx *idx;
 
 /* if relation is missing, leave */
 if (!OidIsValid(heapId))
@@ -3247,7 +3266,9 @@ ReindexRelationConcurrently(Oid relationOid, int options)
  * Save the list of relation OIDs in private context.  Note
  * that invalid indexes are allowed here.
  */
-indexIds = lappend_oid(indexIds, relationOid);
+idx = palloc(sizeof(reindex_idx));
+idx->indexId = relationOid;
+indexIds = lappend(indexIds, idx);
 
 MemoryContextSwitchTo(oldcontext);
 break;
@@ -3306,8 +3327,10 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
-		Oid			indexId = lfirst_oid(lc);
+		reindex_idx *index = lfirst(lc);
+		Oid			indexId = index->indexId;
 		Oid			newIndexId;
+		reindex_idx *newidx;
 		Relation	indexRel;
 		Relation	heapRel;
 		Relation	newIndexRel;
@@ -3347,12 +3370,20 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 
+		/* determine safety of this index for set_indexsafe_procflags */
+		index->safe = (newIndexRel->rd_indexprs == NIL &&
+	   newIndexRel->rd_indpred == NIL);
+
 		/*
 		 * Save the list of OIDs and locks in private context
 		 */
 		oldcontext = MemoryContextSwitchTo(private_context);
 
-		newIndexIds = lappend_oid(newIndexIds, newIndexId);
+		newidx = palloc(sizeof(reindex_idx));
+		newidx->indexId = newIndexId;
+		newidx->safe = index->safe;
+
+		newIndexIds = lappend(newIndexIds, newidx);
 
 		/*
 		 * Save lockrelid to protect each relation from drop then close
@@ -3416,6 +3447,11 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no
+	 * need to set the PRO

Re: [PoC] Non-volatile WAL buffer

2020-11-26 Thread Tomas Vondra



On 11/26/20 9:59 PM, Heikki Linnakangas wrote:
> On 26/11/2020 21:27, Tomas Vondra wrote:
>> Hi,
>>
>> Here's the "simple patch" that I'm currently experimenting with. It
>> essentially replaces open/close/write/fsync with pmem calls
>> (map/unmap/memcpy/persist variants), and it's by no means committable.
>> But it works well enough for experiments / measurements, etc.
>>
>> The numbers (5-minute pgbench runs on scale 500) look like this:
>>
>>   master/btt    master/dax   ntt    simple
>>     ---
>>   1 5469  7402  7977  6746
>>  16    48222 80869    107025 82343
>>  32    73974    158189    214718    158348
>>  64    85921    154540    225715    164248
>>  96   150602    221159    237008    217253
>>
>> A chart illustrating these results is attached. The four columns are
>> showing unpatched master with WAL on a pmem device, in BTT or DAX modes,
>> "ntt" is the patch submitted to this thread, and "simple" is the patch
>> I've hacked together.
>>
>> As expected, the BTT case performs poorly (compared to the rest).
>>
>> The "master/dax" and "simple" perform about the same. There are some
>> differences, but those may be attributed to noise. The NTT patch does
>> outperform these cases by ~20-40% in some cases.
>>
>> The question is why. I recall suggestions this is due to page faults
>> when writing data into the WAL, but I did experiment with various
>> settings that I think should prevent that (e.g. disabling WAL reuse
>> and/or disabling zeroing the segments) but that made no measurable
>> difference.
> 
> The page faults are only a problem when mmap() is used *without* DAX.
> 
> Takashi tried a patch earlier to mmap() WAL segments and insert WAL to
> them directly. See 0002-Use-WAL-segments-as-WAL-buffers.patch at
> https://www.postgresql.org/message-id/01d5dff4%24995ed180%24cc1c7480%24%40hco.ntt.co.jp_1.
> Could you test that patch too, please? Using your nomenclature, that
> patch skips wal_buffers and does:
> 
>   clients -> wal segments (PMEM DAX)
> 
> He got good results with that with DAX, but otherwise it performed
> worse. And then we discussed why that might be, and the page fault
> hypothesis was brought up.
> 

D'oh, I haven't noticed there's a patch doing that. This thread has so
many different patches - which is good, but a bit confusing.

> I think 0002-Use-WAL-segments-as-WAL-buffers.patch is the most promising
> approach here. But because it's slower without DAX, we need to keep the
> current code for non-DAX systems. Unfortunately it means that we need to
> maintain both implementations, selectable with a GUC or some DAX
> detection magic. The question then is whether the code complexity is
> worth the performance gain on DAX-enabled systems.
> 

Sure, I can give it a spin. The question is whether it applies to
current master, or whether some sort of rebase is needed. I'll try.

> Andres was not excited about mmapping the WAL segments because of
> performance reasons. I'm not sure how much of his critique applies if we
> keep supporting both methods and only use mmap() if so configured.
> 

Yeah. I don't think we can just discard the current approach, there are
far too many OS variants that even if Linux is happy one of the other
critters won't be.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PoC] Non-volatile WAL buffer

2020-11-26 Thread Tomas Vondra
Hi,

Here's the "simple patch" that I'm currently experimenting with. It
essentially replaces open/close/write/fsync with pmem calls
(map/unmap/memcpy/persist variants), and it's by no means committable.
But it works well enough for experiments / measurements, etc.

The numbers (5-minute pgbench runs on scale 500) look like this:

 master/bttmaster/dax   nttsimple
   ---
 1 5469  7402  7977  6746
1648222 80869107025 82343
3273974158189214718158348
6485921154540225715164248
96   150602221159237008217253

A chart illustrating these results is attached. The four columns are
showing unpatched master with WAL on a pmem device, in BTT or DAX modes,
"ntt" is the patch submitted to this thread, and "simple" is the patch
I've hacked together.

As expected, the BTT case performs poorly (compared to the rest).

The "master/dax" and "simple" perform about the same. There are some
differences, but those may be attributed to noise. The NTT patch does
outperform these cases by ~20-40% in some cases.

The question is why. I recall suggestions this is due to page faults
when writing data into the WAL, but I did experiment with various
settings that I think should prevent that (e.g. disabling WAL reuse
and/or disabling zeroing the segments) but that made no measurable
difference.

So I've added some primitive instrumentation to the code, counting the
calls and measuring duration for each of the PMEM operations, and
printing the stats regularly into log (after ~1M ops).
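
The instrumentation is nothing fancier than wrappers of roughly this shape
(a sketch; the exact counters in my hacked patch may differ):

#include <stdint.h>
#include <time.h>
#include <libpmem.h>

typedef struct PmemStats
{
    uint64_t    count;      /* number of calls */
    uint64_t    time_ns;    /* total duration in nanoseconds */
    uint64_t    length;     /* bytes copied/persisted, where relevant */
} PmemStats;

static PmemStats stats_memcpy;

static inline void *
pmem_memcpy_counted(void *dst, const void *src, size_t len, unsigned flags)
{
    struct timespec t0, t1;
    void       *ret;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    ret = pmem_memcpy(dst, src, len, flags);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    stats_memcpy.count++;
    stats_memcpy.length += len;
    stats_memcpy.time_ns += (uint64_t) (t1.tv_sec - t0.tv_sec) * 1000000000
        + (t1.tv_nsec - t0.tv_nsec);

    return ret;
}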

Typical results from a run with a single client look like this (slightly
formatted/wrapped for e-mail):

  PMEM STATS
  COUNT total 100 map 30 unmap 20
memcpy 510210 persist 489740
   TIME total 0 map 931080 unmap 188750
memcpy 4938866752 persist 187846686
 LENGTH memcpy 4337647616 persist 329824672

This shows that a majority of the 1M calls is memcpy/persist, the rest
is mostly negligible - both in terms of number of calls and duration.
The time values are in nanoseconds, BTW.

So for example we did 30 map_file calls, taking ~0.9ms in total, and the
unmap calls took even less time. So the direct impact of map/unmap calls
is rather negligible, I think.

The dominant part is clearly the memcpy (~5s) and persist (~2s). It's
not much per call, but overall it costs much more than the map and
unmap calls.

Finally, let's look at the LENGTH, which is a sum of the ranges either
copied to PMEM (memcpy) or fsynced (persist). Those are in bytes, and
the memcpy value is way higher than the persist one. In this particular
case, it's something like 4.3MB vs. 300kB, so an order of magnitude.

It's entirely possible this is a bug/measurement error in the patch. I'm
not all that familiar with the XLOG stuff, so maybe I did some silly
mistake somewhere.

But I think it might be also explained by the fact that XLogWrite()
always writes the WAL in a multiple of 8kB pages. Which is perfectly
reasonable for regular block-oriented storage, but pmem/dax is exactly
about not having to do that - PMEM is byte-addressable. And with pgbench,
the individual WAL records are tiny, so having to instead write/flush
the whole 8kB page (or more of them) repeatedly, as we append the WAL
records, seems a bit wasteful. So I wonder if this is why the trivial
patch does not show any benefits.
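
To illustrate the contrast I mean (a sketch only; mapped_segment and
insert_offset are made-up names, not actual symbols in the patch):

/*
 * Block-oriented path: XLogWrite() rewrites whole XLOG_BLCKSZ pages as
 * records are appended.  On a DAX-mapped segment one could instead copy
 * and persist just the newly inserted bytes:
 */
pmem_memcpy_persist(mapped_segment + insert_offset,     /* destination */
                    record_data, record_len);           /* copy + flush */
insert_offset += record_len;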


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/configure b/configure
index dd64692345..91226ee880 100755
--- a/configure
+++ b/configure
@@ -867,6 +867,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_nvwal
 with_gnu_ld
 enable_largefile
 '
@@ -1572,6 +1573,7 @@ Optional Packages:
   use system time zone data in DIR
   --without-zlib  do not use Zlib
   --with-gnu-ld   assume the C compiler uses GNU ld [default=no]
+  --with-nvwaluse non-volatile WAL buffer (on a PMEM device)
 
 Some influential environment variables:
   CC  C compiler command
@@ -8598,6 +8600,203 @@ else
 
 fi
 
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with non-volatile WAL buffer (NVWAL)" >&5
+$as_echo_n "checking whether to build with non-volatile WAL buffer (NVWAL)... " >&6; }
+
+
+
+# Check whether --with-nvwal was given.
+if test "${with_nvwal+set}" = set; then :
+  withval=$with_nvwal;
+  case $withval in
+yes)
+
+$as_echo "#define USE_NVWAL 1" >>confdefs.h
+
+  ;;
+no)
+  :
+  ;;
+*)
+  as_fn_error $? "no argument expected for --with-nvwal option" "$LINENO" 5
+  ;;
+  esac
+
+else
+  with_nvwal=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_nvwal" >&5
+$as_echo "$with_nvwal" >&6; }
+
+#
+#

Re: [PoC] Non-volatile WAL buffer

2020-11-26 Thread Heikki Linnakangas

On 26/11/2020 21:27, Tomas Vondra wrote:

Hi,

Here's the "simple patch" that I'm currently experimenting with. It
essentially replaces open/close/write/fsync with pmem calls
(map/unmap/memcpy/persist variants), and it's by no means committable.
But it works well enough for experiments / measurements, etc.

The numbers (5-minute pgbench runs on scale 500) look like this:

  master/bttmaster/dax   nttsimple
---
  1 5469  7402  7977  6746
 1648222 80869107025 82343
 3273974158189214718158348
 6485921154540225715164248
 96   150602221159237008217253

A chart illustrating these results is attached. The four columns are
showing unpatched master with WAL on a pmem device, in BTT or DAX modes,
"ntt" is the patch submitted to this thread, and "simple" is the patch
I've hacked together.

As expected, the BTT case performs poorly (compared to the rest).

The "master/dax" and "simple" perform about the same. There are some
differences, but those may be attributed to noise. The NTT patch does
outperform these cases by ~20-40% in some cases.

The question is why. I recall suggestions this is due to page faults
when writing data into the WAL, but I did experiment with various
settings that I think should prevent that (e.g. disabling WAL reuse
and/or disabling zeroing the segments) but that made no measurable
difference.


The page faults are only a problem when mmap() is used *without* DAX.

Takashi tried a patch earlier to mmap() WAL segments and insert WAL to 
them directly. See 0002-Use-WAL-segments-as-WAL-buffers.patch at 
https://www.postgresql.org/message-id/01d5dff4%24995ed180%24cc1c7480%24%40hco.ntt.co.jp_1. 
Could you test that patch too, please? Using your nomenclature, that 
patch skips wal_buffers and does:


  clients -> wal segments (PMEM DAX)

He got good results with that with DAX, but otherwise it performed 
worse. And then we discussed why that might be, and the page fault 
hypothesis was brought up.


I think 0002-Use-WAL-segments-as-WAL-buffers.patch is the most promising 
approach here. But because it's slower without DAX, we need to keep the 
current code for non-DAX systems. Unfortunately it means that we need to 
maintain both implementations, selectable with a GUC or some DAX 
detection magic. The question then is whether the code complexity is 
worth the performance gain on DAX-enabled systems.


Andres was not excited about mmapping the WAL segments because of 
performance reasons. I'm not sure how much of his critique applies if we 
keep supporting both methods and only use mmap() if so configured.


- Heikki




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-26 Thread Alvaro Herrera
So let's discuss the next step in this series: what to do about REINDEX
CONCURRENTLY.

I started with Dmitry's patch (an updated version of which I already
posted in [1]).  However, Dmitry missed (and I hadn't noticed) that some
of the per-index loops are starting transactions also, and that we need
to set the flag in those.  And what's more, in a couple of the
function-scope transactions we do set the flag pointlessly: the
transactions there do not acquire a snapshot, so there's no reason to
set the flag at all, because WaitForOlderSnapshots ignores sessions
whose Xmin is 0.

There are also transactions that wait first, without setting a snapshot,
and then do some catalog manipulations.  I think it's pretty much useless
to set the flag for those, because they're going to be very short
anyway.  (There's also one case of this in CREATE INDEX CONCURRENTLY.)

But there's a more interesting point also.  In Dmitry's patch, we
determine safety for *all* indexes being processed as a set, and then
apply the flag only if they're all deemed safe.  But we can optimize
this, and set the flag for each index' transaction individually, and
only skip it for those specific indexes that are unsafe.  So I propose
to change the data structure used in ReindexRelationConcurrently from
the current list of OIDs to a list of (oid,boolean) pairs, to be able to
track setting the flag individually.
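
Concretely, something along these lines (sketch):

/* Track per-index safety alongside the OID, instead of a plain OID list. */
typedef struct reindex_idx
{
	Oid		indexId;
	bool	safe;		/* may this index's transactions set
						 * PROC_IN_SAFE_IC? */
} reindex_idx;

/* ... and in the per-index build/validate transactions: */
if (idx->safe)
	set_indexsafe_procflags();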

There's one more useful observation: in the function-scope transactions
(outside the per-index loops), we don't touch the contents of any
indexes; we just wait or do some catalog manipulation.  So we can set
the flag *regardless of the safety of any indexes*.  We only need to
care about the safety of the indexes in the phases where we build the
indexes and when we validate them.


[1] https://postgr.es/m/20201118175804.GA23027@alvherre.pgsql




Re: POC: postgres_fdw insert batching

2020-11-26 Thread Tomas Vondra



On 11/26/20 2:48 AM, tsunakawa.ta...@fujitsu.com wrote:
> From: Tomas Vondra 
>> Well, good that we all agree this is a useful feature to have (in 
>> general). The question is whether postgres_fdw should be doing 
>> batching on it's onw (per this thread) or rely on some other 
>> feature (libpq pipelining). I haven't followed the other thread,
>> so I don't have an opinion on that.
> 
> Well, as someone said in this thread, I think bulk insert is much 
> more common than updates/deletes.  Thus, major DBMSs have INSERT 
> VALUES(record1), (record2)... and INSERT SELECT.  Oracle has direct 
> path INSERT in addition.  As for the comparison of INSERT with 
> multiple records and libpq batching (= multiple INSERTs), I think
> the former is more efficient because the amount of data transfer is
> less and the parsing-planning of INSERT for each record is
> eliminated.
> 
> I never deny the usefulness of libpq batch/pipelining, but I'm not 
> sure if app developers would really use it.  If they want to reduce 
> the client-server round-trips, won't they use traditional stored 
> procedures?  Yes, the stored procedure language is very 
> DBMS-specific.  Then, I'd like to know what kind of well-known 
> applications are using standard batching API like JDBC's batch 
> updates.  (Sorry, I think that should be discussed in libpq 
> batch/pipelining thread and this thread should not be polluted.)
> 

Not sure how this is related to app developers? I think the idea was
that the libpq features might be useful between the two PostgreSQL
instances. I.e. the postgres_fdw would use the libpq batching to send
chunks of data to the other side.

> 
>> Note however we're doing two things here, actually - we're 
>> implementing custom batching for postgres_fdw, but we're also 
>> extending the FDW API to allow other implementations do the same 
>> thing. And most of them won't be able to rely on the connection 
>> library providing that, I believe.
> 
> I'm afraid so, too.  Then, postgres_fdw would be an example that 
> other FDW developers would look at when they use INSERT with
> multiple records.
> 

Well, my point was that we could keep the API, but maybe it should be
implemented using the proposed libpq batching. They could still use the
postgres_fdw example of how to use the API, but the internals would need to
be different, of course.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Covering SPGiST index

2020-11-26 Thread Pavel Borisov
>
> > The way that seems acceptable to me is to add (optional) nulls mask into
> > the end of existing style SpGistLeafTuple header and use indextuple
> > routines to attach attributes after it. In this case, we can reduce the
> > amount of code at the cost of adding one extra MAXALIGN size to the
> overall
> > tuple size on 32-bit arch as now tuple header size of 12 bytes already
> fits 3
> > MAXALIGNS (on 64 bit the header now is shorter than 2 maxaligns (12 bytes
> > of 16) and nulls mask will be free of cost). If you mean this I try to
> make
> > changes soon. What do you think of it?
>
> Yeah, that was pretty much the same conclusion I came to.  For
> INDEX_MAX_KEYS values up to 32, the nulls bitmap will fit into what's
> now padding space on 64-bit machines.  For backwards compatibility,
> we'd have to be careful that the code knows there's no nulls bitmap in
> an index without included columns, so I'm not sure how messy that will
> be.  But it's worth trying that way to see how it comes out.
>

I made a refactoring of the patch code according to the discussion:
1. Changed the leaf tuple format to: header - (optional) nulls bitmask - key
value - (optional) INCLUDE values (see the layout sketch after this list)
2. Re-used the existing heap_fill_tuple() code to fill the data part of a
leaf tuple
3. Split index_deform_tuple() into two parts: (a) a bigger 'inner' one
- index_deform_anyheader_tuple() - that processes index-like tuples (now
IndexTuple and SpGistLeafTuple) independently of the tuple header type, and
(b) small 'outer' index_deform_tuple() and spgDeformLeafTuple() wrappers
that do all the header-specific processing and then call the inner (a)
4. Put the tuple descriptor into the SpGistCache chunk of memory, so cleaning
the cached chunk also invalidates the tuple descriptor, which can therefore
neither dangle nor leak. This also avoids rebuilding it every time unless the
cache is invalidated.
5. Corrected amroutine->amcaninclude according to the new upstream fix.
6. Moved the big chunks that were shifted in spgist_private.h back to their
initial places where possible, and made other cosmetic changes to improve
the patch.
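
For clarity, the intended leaf tuple layout is roughly the following (a sketch
only; the names are illustrative, see the patch for the real definitions):

/*
 * SpGistLeafTupleData      fixed header (size, nextOffset, ...)
 * [nulls bitmask]          present only when INCLUDE columns exist
 * key value                the SP-GiST key datum
 * [INCLUDE values]         data part filled with heap_fill_tuple()-style
 *                          code and decoded through the shared
 *                          index_deform_anyheader_tuple() path
 */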

PFA v.11 of the patch.
Do you think the proposed changes are in the right direction?

Thank you!
-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


v11-0001-Covering-SP-GiST-index-support-for-INCLUDE-colum.patch
Description: Binary data


Re: [patch] CLUSTER blocks scanned progress reporting

2020-11-26 Thread Fujii Masao




On 2020/11/27 1:45, Matthias van de Meent wrote:

On Thu, 26 Nov 2020 at 00:55, Fujii Masao  wrote:


+				 * A heap scan need not return tuples for the last page it has
+				 * scanned. To ensure that heap_blks_scanned is equivalent to
+				 * total_heap_blks after the table scan phase, this parameter
+				 * is manually updated to the correct value when the table scan
+				 * finishes.

So it's better to update this comment a bit? For example,

  If the scanned last pages are empty, it's possible to go to the next phase
  while heap_blks_scanned != heap_blks_total. To ensure that they are
  equivalent after the table scan phase, this parameter is manually updated
  to the correct value when the table scan finishes.



PFA a patch with updated message and comment. I've reworded the
messages to specifically mention empty pages and the need for setting
heap_blks_scanned = total_heap_blks explicitly.


Thanks for updating the patch! It looks good to me.
Barring any objection, I will commit this patch (also back-patch to v12).

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: configure and DocBook XML

2020-11-26 Thread Tom Lane
Paul Förster  writes:
> On 26. Nov, 2020, at 17:21, Tom Lane  wrote:
>> There's a nearby thread in which I was suggesting that we should just
>> not bother with this configure test [1].
>> [1] 
>> https://www.postgresql.org/message-id/flat/E2EE6B76-2D96-408A-B961-CAE47D1A86F0%40yesql.se

> I haven't installed DocBook at all. So the check for DocBook naturally always 
> fails. Could that be the reason?

If you don't have the docbook stylesheets, but you do have xmllint,
configure's probe will cause xmllint to try to download those
stylesheets off the net.  For me, that always succeeds, but it
takes two or three seconds.  I find it curious that it seems to be
timing out for you.

regards, tom lane




Re: [patch] CLUSTER blocks scanned progress reporting

2020-11-26 Thread Matthias van de Meent
On Thu, 26 Nov 2020 at 00:55, Fujii Masao  wrote:
>
> +				 * A heap scan need not return tuples for the last page it has
> +				 * scanned. To ensure that heap_blks_scanned is equivalent to
> +				 * total_heap_blks after the table scan phase, this parameter
> +				 * is manually updated to the correct value when the table scan
> +				 * finishes.
>
> So it's better to update this comment a bit? For example,
>
>  If the scanned last pages are empty, it's possible to go to the next 
> phase
>  while heap_blks_scanned != heap_blks_total. To ensure that they are
>  equivalent after the table scan phase, this parameter is manually updated
>  to the correct value when the table scan finishes.
>

PFA a patch with updated message and comment. I've reworded the
messages to specifically mention empty pages and the need for setting
heap_blks_scanned = total_heap_blks explicitly.

Feel free to update the specifics, other than that I think this is a
complete fix for the issues at hand.

Regards,

Matthias van de Meent
From 66af670588b88b7fb4eb78aa6251609d10651e6e Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Fri, 20 Nov 2020 16:23:59 +0100
Subject: [PATCH v3] Fix CLUSTER progress reporting of number of blocks
 scanned.

The heapScan need not start at block 0, so heapScan->rs_cblock need not be the
correct value for amount of blocks scanned. A more correct value is
 ((heapScan->rs_cblock - heapScan->rs_startblock + heapScan->rs_nblocks) %
   heapScan->rs_nblocks), as it accounts for the wraparound and the initial
offset of the heapScan.

Additionally, if the last pages of the scan were empty, the heap scan does
not return tuples from the last scanned page. This means that when
table_scan_getnextslot returns false, indicating the end of the table scan, we
must manually update the heap_blks_scanned parameter to the number of blocks
in the heap scan to ensure that heap_blks_scanned == heap_blks_total when
starting the next phase.
---
 src/backend/access/heap/heapam_handler.c | 29 ++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..720d92cac9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -698,6 +698,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	Datum	   *values;
 	bool	   *isnull;
 	BufferHeapTupleTableSlot *hslot;
+	BlockNumber prev_cblock = InvalidBlockNumber;
 
 	/* Remember if it's a system catalog */
 	is_system_catalog = IsSystemRelation(OldHeap);
@@ -793,14 +794,38 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		else
 		{
 			if (!table_scan_getnextslot(tableScan, ForwardScanDirection, slot))
+			{
+				/*
+				 * If the last pages of the scan were empty, we would go to
+				 * the next phase while heap_blks_scanned != heap_blks_total.
+				 * Instead, to ensure that heap_blks_scanned is equivalent to
+				 * total_heap_blks after the table scan phase, this parameter
+				 * is manually updated to the correct value when the table scan
+				 * finishes.
+				 */
+				pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+											 heapScan->rs_nblocks);
 				break;
+			}
 
 			/*
 			 * In scan-and-sort mode and also VACUUM FULL, set heap blocks
 			 * scanned
+			 *
+			 * Note that heapScan may start at an offset and wrap around, i.e.
+			 * rs_startblock may be >0, and rs_cblock may end with a number
+			 * below rs_startblock. To prevent showing this wraparound to the
+			 * user, we offset rs_cblock by rs_startblock (modulo rs_nblocks).
 			 */
-			pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
-										 heapScan->rs_cblock + 1);
+			if (prev_cblock != heapScan->rs_cblock)
+			{
+				pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+											 (heapScan->rs_cblock +
+											  heapScan->rs_nblocks -
+											  heapScan->rs_startblock
+											 ) % heapScan->rs_nblocks + 1);
+				prev_cblock = heapScan->rs_cblock;
+			}
 		}
 
 		tuple = ExecFetchSlotHeapTuple(slot, false, NULL);
-- 
2.20.1



Re: autovac issue with large number of tables

2020-11-26 Thread Fujii Masao




On 2020/11/26 10:41, Kasahara Tatsuhito wrote:

On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada  wrote:


On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito
 wrote:


Hi,

On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada  wrote:


On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
 wrote:


Hi,

On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
 wrote:

I wonder if we could have table_recheck_autovac do two probes of the stats
data.  First probe the existing stats data, and if it shows the table to
be already vacuumed, return immediately.  If not, *then* force a stats
re-read, and check a second time.

Does the above mean that the second and subsequent table_recheck_autovac()
will be improved to first check using the previous refreshed statistics?
I think that certainly works.

If that's correct, I'll try to create a patch for the PoC


I still don't know how to reproduce Jim's troubles, but I was able to reproduce
what was probably a very similar problem.

This problem seems to be more likely to occur in cases where you have
a large number of tables,
i.e., a large amount of stats, and many small tables need VACUUM at
the same time.

So I followed Tom's advice and created a patch for the PoC.
This patch adds a flag so that table_recheck_autovac() first checks the
existing stats; if VACUUM (or ANALYZE) has already been done by another
worker since the stats were refreshed, the table is skipped without
re-reading the stats. If a table still appears to require VACUUM, the
stats are refreshed and checked again instead of trusting the existing snapshot.
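
To illustrate the idea (a sketch only, not the actual PoC code; both helpers
below are hypothetical stand-ins for the existing pgstat machinery):

static bool
relation_still_needs_autovacuum(Oid relid)
{
	/* Probe 1: cheap check against the stats snapshot we already have. */
	if (!needs_vacuum_from_cached_stats(relid))	/* hypothetical helper */
		return false;		/* another worker has already taken care of it */

	/* Probe 2: only now force a fresh read of the stats and re-check. */
	force_pgstat_snapshot_refresh();			/* hypothetical helper */
	return needs_vacuum_from_cached_stats(relid);
}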

I did a simple test with HEAD and HEAD + this PoC patch.
The tests were conducted in two cases.
(I changed a few configurations; see the attached scripts.)

1. Normal VACUUM case (a psql sketch follows the two lists)
   - SET autovacuum = off
   - CREATE tables with 100 rows
   - DELETE 90 rows from each table
   - SET autovacuum = on and restart PostgreSQL
   - Measure the time it takes for all tables to be VACUUMed

2. Anti-wraparound VACUUM case
   - CREATE blank tables
   - SELECT from all of these tables (to generate stats)
   - SET autovacuum_freeze_max_age to low values and restart PostgreSQL
   - Consume a lot of XIDs by using txid_current()
   - Measure the time it takes for all tables to be VACUUMed
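
As a rough psql sketch of test case 1 (illustrative only; the real steps are
in the attached scripts):

ALTER SYSTEM SET autovacuum = off; SELECT pg_reload_conf();
SELECT format('CREATE TABLE t%s (c int)', g)
  FROM generate_series(1, 1000) g \gexec
SELECT format('INSERT INTO t%s SELECT generate_series(1, 100)', g)
  FROM generate_series(1, 1000) g \gexec
SELECT format('DELETE FROM t%s WHERE c <= 90', g)
  FROM generate_series(1, 1000) g \gexec
ALTER SYSTEM SET autovacuum = on;  -- then restart and measure the time
                                   -- until all tables are VACUUMed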

For each test case, the following results were obtained by changing
the autovacuum_max_workers parameter to 1, 2, 3 (default), 5 and 10,
and the number of tables to 1000, 5000, 10000 and 20000.

Due to the poor VM environment (2 vCPU/4 GB), the results are a little unstable,
but I think they are enough to show the trend.

===
[1. Normal VACUUM case]
  tables:1000
   autovacuum_max_workers 1:   (HEAD) 20 sec VS (with patch)  20 sec
   autovacuum_max_workers 2:   (HEAD) 18 sec VS (with patch)  16 sec
   autovacuum_max_workers 3:   (HEAD) 18 sec VS (with patch)  16 sec
   autovacuum_max_workers 5:   (HEAD) 19 sec VS (with patch)  17 sec
   autovacuum_max_workers 10:  (HEAD) 19 sec VS (with patch)  17 sec

  tables:5000
   autovacuum_max_workers 1:   (HEAD) 77 sec VS (with patch)  78 sec
   autovacuum_max_workers 2:   (HEAD) 61 sec VS (with patch)  43 sec
   autovacuum_max_workers 3:   (HEAD) 38 sec VS (with patch)  38 sec
   autovacuum_max_workers 5:   (HEAD) 45 sec VS (with patch)  37 sec
   autovacuum_max_workers 10:  (HEAD) 43 sec VS (with patch)  35 sec

  tables:10000
   autovacuum_max_workers 1:   (HEAD) 152 sec VS (with patch)  153 sec
   autovacuum_max_workers 2:   (HEAD) 119 sec VS (with patch)   98 sec
   autovacuum_max_workers 3:   (HEAD)  87 sec VS (with patch)   78 sec
   autovacuum_max_workers 5:   (HEAD) 100 sec VS (with patch)   66 sec
   autovacuum_max_workers 10:  (HEAD)  97 sec VS (with patch)   56 sec

  tables:20000
   autovacuum_max_workers 1:   (HEAD) 338 sec VS (with patch)  339 sec
   autovacuum_max_workers 2:   (HEAD) 231 sec VS (with patch)  229 sec
   autovacuum_max_workers 3:   (HEAD) 220 sec VS (with patch)  191 sec
   autovacuum_max_workers 5:   (HEAD) 234 sec VS (with patch)  147 sec
   autovacuum_max_workers 10:  (HEAD) 320 sec VS (with patch)  113 sec

[2. Anti-wraparound VACUUM case]
  tables:1000
   autovacuum_max_workers 1:   (HEAD) 19 sec VS (with patch) 18 sec
   autovacuum_max_workers 2:   (HEAD) 14 sec VS (with patch) 15 sec
   autovacuum_max_workers 3:   (HEAD) 14 sec VS (with patch) 14 sec
   autovacuum_max_workers 5:   (HEAD) 14 sec VS (with patch) 16 sec
   autovacuum_max_workers 10:  (HEAD) 16 sec VS (with patch) 14 sec

  tables:5000
   autovacuum_max_workers 1:   (HEAD) 69 sec VS (with patch) 69 sec
   autovacuum_max_workers 2:   (HEAD) 66 sec VS (with patch) 47 sec
   autovacuum_max_workers 3:   (HEAD) 59 sec VS (with patch) 37 sec
   autovacuum_max_workers 5:   (HEAD) 39 sec VS (with patch) 28 sec
   autovacuum_max_workers 10:  (HEAD) 39 sec VS (with patch) 29 sec

  tables:10000
   autovacuum_max_workers 1:   (HEAD) 139 sec VS (with patch) 138 sec
   autovacuum_max_workers 2:   (HEAD) 130 sec VS (with patch)  

Re: configure and DocBook XML

2020-11-26 Thread Paul Förster
Hi Tom,

> On 26. Nov, 2020, at 17:21, Tom Lane  wrote:
> 
> On machines where I don't have the stylesheets installed, it always
> takes several seconds (2 or 3, I think, though I've not put a stopwatch
> on it).  15 to 20 sec does seem like a lot, so it makes me wonder if
> Paul's network environment is well-configured.

can't complain here. The net and all other stuff run smoothly. No timeouts, no 
things being slow or any other problem.

> There's a nearby thread in which I was suggesting that we should just
> not bother with this configure test [1].
> [1] 
> https://www.postgresql.org/message-id/flat/E2EE6B76-2D96-408A-B961-CAE47D1A86F0%40yesql.se

I haven't installed DocBook at all. So the check for DocBook naturally always 
fails. Could that be the reason?

Cheers,
Paul



Re: Improper use about DatumGetInt32

2020-11-26 Thread Peter Eisentraut

On 2020-11-26 14:27, Alvaro Herrera wrote:

On 2020-Nov-26, Peter Eisentraut wrote:


The point of the patch is to have the range check somewhere.  If you just
cast it, then you won't notice out of range arguments.  Note that other
contrib modules that take block numbers work the same way.


I'm not saying not to do that; just saying we should not propagate it to
places that don't need it.  get_raw_page gets its page number from
PG_GETARG_INT64(), and the range check should be there.  But then it
calls get_raw_page_internal, and it could pass a BlockNumber -- there's
no need to pass an int64.  So get_raw_page_internal does not need a
range check.


Yeah, I had it like that for a moment, but then you need to duplicate 
the check in get_raw_page() and get_raw_page_fork().  I figured since 
get_raw_page_internal() does all the other argument checking also, it 
seems sensible to put the block range check there too.  But it's not a 
big deal either way.





Re: configure and DocBook XML

2020-11-26 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Nov-26, Paul Förster wrote:
>> On 26. Nov, 2020, at 14:47, Alvaro Herrera  wrote:
>>> My guess is that it's related to trying to obtain stylesheets from
>>> remote Internet locations that are missing locally.

>> I don't know DocBook at all, so I can't say. But it takes about the
>> same time, whether I run configure on a machine that is connected to
>> the internet or one that isn't.

> It might be timing out, then.  (The docbook test takes well under a
> second for me, but that's probably because I have all stylesheets
> locally).

On machines where I don't have the stylesheets installed, it always
takes several seconds (2 or 3, I think, though I've not put a stopwatch
on it).  15 to 20 sec does seem like a lot, so it makes me wonder if
Paul's network environment is well-configured.

There's a nearby thread in which I was suggesting that we should just
not bother with this configure test [1].

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/E2EE6B76-2D96-408A-B961-CAE47D1A86F0%40yesql.se




Re: proposal: possibility to read dumped table's name from file

2020-11-26 Thread Tom Lane
Dean Rasheed  writes:
> Actually, that raises a different possible benefit of passing options
> in an options file -- if the user wants to pass in a table name
> pattern, it can be a nuisance if the shell's argument processing does
> additional unwanted things like globbing and environment variable
> substitutions. Using an options file could provide a handy way to
> ensure that any option values are interpreted exactly as written,
> without any additional mangling.

Huh?  Any format we might devise, or borrow, will have to have some
kind of escaping/quoting convention.  The idea that "we don't need
that" tends to lead to very ugly workarounds later.

I do agree that the shell's quoting conventions are pretty messy
and so those aren't the ones we should borrow.  We could do a lot
worse than to use some established data format like JSON or YAML.
Given that we already have src/common/jsonapi.c, it seems like
JSON would be the better choice of those two.

regards, tom lane




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-26 Thread Alvaro Herrera
On 2020-Nov-25, Alvaro Herrera wrote:

> On 2020-Nov-23, Andres Freund wrote:
> 
> > On 2020-11-23 12:30:05 -0300, Alvaro Herrera wrote:
> 
> > > In other words, my conclusion is that there definitely is a bug here and
> > > I am going to restore the use of exclusive lock for setting the
> > > statusFlags.
> > 
> > Cool.
> 
> Here's a patch.

Pushed, thanks.





Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-11-26 Thread Fujii Masao




On 2020/11/27 0:15, Bharath Rupireddy wrote:

On Thu, Nov 26, 2020 at 7:37 PM Fujii Masao  wrote:



What do you mean by normal shutdown of bgworker? Is it that bgworker has exited 
successfully with exit code 0 or for some reason with exit code other than 0? 
Or is it when the postmaster is shutdown normally?

IIUC, when a bgworker exits either normally with an exit code 0 or with a code other than 0, then 
CleanupBackgroundWorker() is called in postmaster, a message (like below) is prepared, and the 
LogChildExit() is called with either DEBUG1 or LOG level and for instance the message you specified 
gets printed "background worker ... exited with exit code 1". I have not seen a FATAL 
message similar to "background worker ... exited with exit code 1" at the normal shutdown.

snprintf(namebuf, MAXPGPATH, _("background worker \"%s\""), 
rw->rw_worker.bgw_type);

LogChildExit(EXIT_STATUS_0(exitstatus) ? DEBUG1 : LOG, namebuf, pid, 
exitstatus);

Am I missing something?

If my analysis is right, then for instance, when the logical replication launcher exits, it logs 
"background worker "logical replication launcher" exited with exit code X" with 
either DEBUG1 or LOG level but not with FATAL level.


Yes, it's not with FATAL level. But that message looks like it's
reporting an error message. This is why we sometimes received
the complaints (e.g., [1][2]) about that message.



Oh. Should we do something about it now?


No. This is not directly related to the issue that we are discussing
as I told upthread. Of course, it's better to work on this if we can
easily fix it. But seems not... So please ignore this my comment.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-11-26 Thread Bharath Rupireddy
On Thu, Nov 26, 2020 at 7:37 PM Fujii Masao  wrote:
>
> > What do you mean by normal shutdown of bgworker? Is it that bgworker has 
> > exited successfully with exit code 0 or for some reason with exit code 
> > other than 0? Or is it when the postmaster is shutdown normally?
> >
> > IIUC, when a bgworker exits either normally with an exit code 0 or with a code 
> > other than 0, then CleanupBackgroundWorker() is called in postmaster, a 
> > message (like below) is prepared, and the LogChildExit() is called with 
> > either DEBUG1 or LOG level and for instance the message you specified gets 
> > printed "background worker ... exited with exit code 1". I have not seen a 
> > FATAL message similar to "background worker ... exited with exit code 1" at 
> > the normal shutdown.
> >
> > snprintf(namebuf, MAXPGPATH, _("background worker \"%s\""), 
> > rw->rw_worker.bgw_type);
> >
> > LogChildExit(EXIT_STATUS_0(exitstatus) ? DEBUG1 : LOG, namebuf, pid, 
> > exitstatus);
> >
> > Am I missing something?
> >
> > If my analysis is right, then for instance, when the logical replication 
> > launcher exits, it logs "background worker "logical replication 
> > launcher" exited with exit code X" with either DEBUG1 or LOG level but not 
> > with FATAL level.
>
> Yes, it's not with FATAL level. But that message looks like it's
> reporting an error message. This is why we sometimes received
> the complaints (e.g., [1][2]) about that message.
>

Oh. Should we do something about it now? Any thoughts? I have not read
fully through the threads specified, I will go through them later.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-11-26 Thread Andy Fan
This patch has not moved for a while; any suggestions about
how to move it forward are appreciated.

-- 
Best Regards
Andy Fan


Re: configure and DocBook XML

2020-11-26 Thread Paul Förster
Hi Alvaro,

> On 26. Nov, 2020, at 15:19, Alvaro Herrera  wrote:
> 
> One way to know for sure would be to run it under strace, and see where
> it takes a large number of seconds.  Maybe something like
> strace -f -etrace=%network -T -tt -o/tmp/configure.trace ./configure 

ok, I'll try this. Thanks very much.

> Well, it is what it is partly because people have struggled to polish
> these little annoyances :-)

it's not an annoyance for me for something I do 3 or 4 times a year. I'm a DBA, 
not a developer, so I just build new packages when new versions are released. I 
don't have to build a couple of times per day. :-)

Cheers,
Paul



Re: configure and DocBook XML

2020-11-26 Thread Alvaro Herrera
On 2020-Nov-26, Paul Förster wrote:

> Hi Alvaro,
> 
> > On 26. Nov, 2020, at 14:47, Alvaro Herrera  wrote:
> > 
> > My guess is that it's related to trying to obtain stylesheets from
> > remote Internet locations that are missing locally.
> 
> I don't know DocBook at all, so I can't say. But it takes about the
> same time, whether I run configure on a machine that is connected to
> the internet or one that isn't.

It might be timing out, then.  (The docbook test takes well under a
second for me, but that's probably because I have all stylesheets
locally).

One way to know for sure would be to run it under strace, and see where
it takes a large number of seconds.  Maybe something like
 strace -f -etrace=%network -T -tt -o/tmp/configure.trace ./configure 

> Whatever it does, it's more out of curiosity. PostgreSQL compiles fine
> and this is what counts. :-)

Well, it is what it is partly because people have struggled to polish
these little annoyances :-)




Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-11-26 Thread Fujii Masao




On 2020/11/25 23:38, Bharath Rupireddy wrote:

On Wed, Nov 25, 2020 at 3:29 PM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
 >
 > On 2020/11/20 19:33, Bharath Rupireddy wrote:
 > > On Wed, Nov 18, 2020 at 5:52 PM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
 > >  >
 > >  > > handle_sigterm() and die() are similar except that die() has extra
 > >  > > handling(below) for exiting immediately when waiting for input/command
 > >  > > from the client.
 > >  > >      /*
 > >  > >       * If we're in single user mode, we want to quit immediately - 
we can't
 > >  > >       * rely on latches as they wouldn't work when stdin/stdout is a 
file.
 > >  > >       * Rather ugly, but it's unlikely to be worthwhile to invest 
much more
 > >  > >       * effort just for the benefit of single user mode.
 > >  > >       */
 > >  > >      if (DoingCommandRead && whereToSendOutput != DestRemote)
 > >  > >          ProcessInterrupts();
 > >  > >
 > >  > > Having this extra handling is correct for normal backends as they can
 > >  > > connect directly to clients for reading input commands, the above if
 > >  > > condition may become true and ProcessInterrupts() may be called to
 > >  > > handle the SIGTERM immediately.
 > >  > >
 > >  > > Note that DoingCommandRead can never be true in bg workers as it's
 > >  > > being set to true only in normal backend PostgresMain(). If we use
 > >  > > die()(instead of custom SIGTERM handlers such as handle_sigterm()) in
 > >  > > a bg worker/non-backend process, there are no chances that the above
 > >  > > part of the code gets hit and the interrupts are processed
 > >  > > immediately. And also here are the bg worker process that use die()
 > >  > > instead of their own custom handlers: parallel workers, autoprewarm
 > >  > > worker, autovacuum worker, logical replication launcher and apply
 > >  > > worker, wal sender process
 > >  > >
 > >  > > I think we can also use die() instead of handle_sigterm() in
 > >  > > test_shm_mq/worker.c.
 > >  > >
 > >  > > I attached a patch for this change.
 > >  >
 > >  > Thanks for the patch! It looks good to me.
 > >  >
 > >
 > > Thanks.
 >
 > When I read the patch again, I found that, with the patch, the shutdown
 > of worker_spi causes to report the following FATAL message.
 >
 >      FATAL:  terminating connection due to administrator command
 >
 > Isn't this message confusing because it's not a connection? If so,
 > we need to update ProcessInterrupts() so that the proper message is
 > reported like other bgworkers do.
 >

This is also true for all the bgworker that use the die() handler. How about 
doing it the way bgworker_die() does in ProcessInterrupts()? This would give 
meaningful information. Thoughts? If okay, I can make a separate patch.


+1
Thanks!




         else if (RecoveryConflictPending)
         {
             /* Currently there is only one non-retryable recovery conflict */
             Assert(RecoveryConflictReason == 
PROCSIG_RECOVERY_CONFLICT_DATABASE);
             pgstat_report_recovery_conflict(RecoveryConflictReason);
             ereport(FATAL,
                     (errcode(ERRCODE_DATABASE_DROPPED),
                      errmsg("terminating connection due to conflict with 
recovery"),
                      errdetail_recovery_conflict()));
         }
*       else if (IsBackgroundWorker)
        {
                ereport(FATAL,
                            (errcode(ERRCODE_ADMIN_SHUTDOWN),
                             errmsg("terminating background worker \"%s\" due to 
administrator command",
                             MyBgworkerEntry->bgw_type)));
        }*
         else
             ereport(FATAL,
                     (errcode(ERRCODE_ADMIN_SHUTDOWN),
                      errmsg("terminating connection due to administrator 
command")));

 >
 > BTW, though this is not directly related to this topic, it's strange to me
 > that the normal shutdown of bgworker causes to emit FATAL message
 > and the message "background worker ... exited with exit code 1"
 > at the normal shutdown. I've heard sometimes that users coplained that
 > it's confusing to report that message for logical replication launcher
 > at the normal shutdown. Maybe the mechanism to shutdown bgworker
 > normally seems to have not been implemented well yet.
 >

What do you mean by normal shutdown of bgworker? Is it that bgworker has exited 
successfully with exit code 0 or for some reason with exit code other than 0? 
Or is it when the postmaster is shutdown normally?

IIUC, when a bgworker exits either normally with an exit code 0 or with a code other than 0, then 
CleanupBackgroundWorker() is called in postmaster, a message (like below) is prepared, and the 
LogChildExit() is called with either DEBUG1 or LOG level and for instance the message you specified 
gets printed "background worker ... exited with exit code 1". I have not seen a FATAL 
message similar to "b

Re: configure and DocBook XML

2020-11-26 Thread Paul Förster
Hi Alvaro,

> On 26. Nov, 2020, at 14:47, Alvaro Herrera  wrote:
> 
> My guess is that it's related to trying to obtain stylesheets from
> remote Internet locations that are missing locally.

I don't know DocBook at all, so I can't say. But it takes about the same time, 
whether I run configure on a machine that is connected to the internet or one 
that isn't.

Whatever it does, it's more out of curiosity. PostgreSQL compiles fine and this 
is what counts. :-)

Cheers,
Paul



Re: configure and DocBook XML

2020-11-26 Thread Alvaro Herrera
On 2020-Nov-26, Paul Förster wrote:

> Hi,
> 
> I always compile PostgreSQL from source and never install a pre-compiled 
> package. When I run configure, the last check, which is for DocBook XML 
> always takes very long compared to all others. It's about 15 to 20 secs or 
> so. I noticed that with many configure scripts, not only the one of 
> PostgreSQL. Why is that?

My guess is that it's related to trying to obtain stylesheets from
remote Internet locations that are missing locally.




Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-26 Thread Bharath Rupireddy
On Thu, Nov 26, 2020 at 5:34 PM Luc Vlaming  wrote:
>
> On 26-11-2020 12:36, Bharath Rupireddy wrote:
> > Few things:
> >
> > IIUC Andres mentioned similar kinds of APIs earlier in [1].
> >
> > [1] -
> > https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx%40alap3.anarazel.de
> > 
> >
> > I would like to add some more info to one of the API:
> >
> > typedef struct MultiInsertStateData
> > {
> >  MemoryContext micontext; /* A temporary memory context for
> > multi insert. */
> >  BulkInsertStateData *bistate;   /* Bulk insert state. */
> >  TupleTableSlot  **mislots; /* Array of buffered slots. */
> >  uint32  nslots; /* Total number of buffered slots. */
> >  int64  nbytes; /* Flush buffers if the total tuple size
> >  >= nbytes. */
> >  int32  nused; /* Number of current buffered slots for a
> > multi insert batch. */
> >  int64  nsize; /* Total tuple size for a multi insert
> > batch. */
> > } MultiInsertStateData;
> >
> > /* Creates a temporary memory context, allocates the
> > MultiInsertStateData, BulkInsertStateData and initializes other members. */
> >  void(*begin_multi_insert) (Relation rel,
> > MultiInsertStateData **mistate, uint32 nslots, uint64 nbytes);
> >
> > /* Buffers the input slot into mistate slots, computes the size of the
> > tuple, and adds it total buffer tuple size, if this size crosses
> > mistate->nbytes, flush the buffered tuples into table. For heapam,
> > existing heap_multi_insert can be used. Once the buffer is flushed, then
> > the micontext can be reset and buffered slots can be cleared. *If nbytes
> > i.e. total tuple size of the batch is not given, tuple size is not
> > calculated, tuples are buffered until all the nslots are filled and then
> > flushed.* */
> >  void(*do_multi_insert) (Relation rel, MultiInsertStateData
> > *mistate, struct TupleTableSlot *slot, CommandId cid, int options);
> >
> > /* Flush the buffered tuples if any. For heapam, existing
> > heap_multi_insert can be used. Deletes temporary memory context and
> > deallocates mistate. */
> >  void(*end_multi_insert) (Relation rel, MultiInsertStateData
> > *mistate, CommandId cid, int options);
>
> Looks all good to me, except for the nbytes part.
> Could you explain to me what use case that supports? IMHO the tableam
> can best decide itself that it's time to flush, based on its
> implementation that e.g. considers how many pages to flush at a time and
> such, etc? This means also that most of the fields of
> MultiInsertStateData can be private as each tableam would return a
> derivative of that struct (like with the destreceivers).
>

nbytes is basically there to support the following case: say the number
of tuples to buffer is 1000, and all the tuples are toasted with sizes
of a few hundred MB or even GB. Do we really want to wait until 1000
tuples are buffered, in which case a single query would hold
1000 * the toasted tuple size in memory? Having a memory limit gives us
flexibility there. Whether to use it or not is up to the table AM
implementation. The existing copy code (which knows the tuple size after
parsing the input data) already uses this mechanism to decide when to
flush.

If nbytes is not used by a table AM, then the multi insert can simply
wait until the requested number of tuples is buffered, however much
memory they occupy.

IMO, we can retain nbytes for now to decide when to flush. Thoughts?
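
For illustration, here is a rough sketch of the buffering logic I have in mind
(the names follow the MultiInsertStateData proposed above;
estimate_slot_tuple_size() is a hypothetical stand-in, and this is not meant
as the final implementation):

static void
heapam_do_multi_insert(Relation rel, MultiInsertStateData *mistate,
					   TupleTableSlot *slot, CommandId cid, int options)
{
	/*
	 * Buffer a copy of the incoming tuple.  Where exactly the copied data
	 * lives is the memory context question discussed below.
	 */
	ExecCopySlot(mistate->mislots[mistate->nused], slot);
	mistate->nsize += estimate_slot_tuple_size(slot);	/* hypothetical */
	mistate->nused++;

	/*
	 * Flush when the slot array is full, or when the optional byte limit
	 * (nbytes) is given and the buffered size has reached it.
	 */
	if (mistate->nused == mistate->nslots ||
		(mistate->nbytes > 0 && mistate->nsize >= mistate->nbytes))
	{
		heap_multi_insert(rel, mistate->mislots, mistate->nused,
						  cid, options, mistate->bistate);
		mistate->nused = 0;
		mistate->nsize = 0;
		MemoryContextReset(mistate->micontext);
	}
}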

I wonder how the do_multi_insert() API would decide when to flush based
on the number of pages. Do we need to pass the maximum number of pages
the buffered tuples may occupy, and track how many pages the currently
buffered tuples occupy, to decide when to flush? Or is that something
the existing table AM infrastructure already supports? And if we use the
number of pages to decide when to flush, how well does that work with
parallel inserts?

>
> One thing I'm wondering is in which memory context the slots end up
> being allocated. I'd assume we would want to keep the slots around
> between flushes. If they are in the temporary context this might prove
> problematic however?
>

I should not have used the word temporary; it actually is not
temporary. This memory context is created in begin_multi_insert(),
all the buffered tuples are copied using this context, and it is
reset at the end of each flush and reused. It gets destroyed at the
end in end_multi_insert(). I think we should do the same in the new
APIs implementation.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Allow some recovery parameters to be changed with reload

2020-11-26 Thread Fujii Masao




On 2020/11/12 4:38, Sergei Kornilov wrote:

Hello


Anyway, for now I think that your first patch would be enough, i.e.,
just change the context of restore_command to PGC_SIGHUP.


Glad to hear. Attached a rebased version of the original proposal.


Thanks for rebasing the patch!

This parameter is required for archive recovery,

I found the above description in config.sgml. I was just wondering
whether it should be updated so that the actual specification is described.
The actual spec is that restore_command is required to start archive
recovery, but optional (i.e., the parameter can be reset to an empty string)
after archive recovery has started. But such an updated description
would be rather confusing to users, so I'm now inclined not to
update it.

Does anyone object to the patch? If no, I'm thinking to commit the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Improper use about DatumGetInt32

2020-11-26 Thread Alvaro Herrera
On 2020-Nov-26, Peter Eisentraut wrote:

> The point of the patch is to have the range check somewhere.  If you just
> cast it, then you won't notice out of range arguments.  Note that other
> contrib modules that take block numbers work the same way.

I'm not saying not to do that; just saying we should not propagate it to
places that don't need it.  get_raw_page gets its page number from
PG_GETARG_INT64(), and the range check should be there.  But then it
calls get_raw_page_internal, and it could pass a BlockNumber -- there's
no need to pass an int64.  So get_raw_page_internal does not need a
range check.




Re: libpq compression

2020-11-26 Thread Daniil Zakhlystov
Hi,

> On Nov 24, 2020, at 11:35 PM, Robert Haas  wrote:
> 
> So the time to talk about the
> general approach here is now, before anything gets committed, before
> the project has committed itself to any particular design. If we
> decide in that discussion that certain things can be left for the
> future, that's fine. If we've have discussed how they could be added
> without breaking backward compatibility, even better. But we can't
> just skip over having that discussion.

> If the client requests compression and the server supports it, it
> should return a new SupportedCompressionTypes message following
> NegotiateProtocolMessage response. That should be a list of
> compression methods which the server understands. At this point, the
> clent and the server each know what methods the other understands.
> Each should now feel free to select a compression method the other
> side understands, and to switch methods whenever desired, as long as
> they only select from methods the other side has said that they
> understand. The patch seems to think that the compression method has
> to be the same in both directions and that it can never change, but
> there's no real reason for that. Let each side start out uncompressed
> and then let it issue a new SetCompressionMethod protocol message to
> switch the compression method whenever it wants. After sending that
> message it begins using the new compression type. The other side
> doesn't have to agree. That way, you don't have to worry about
> synchronizing the two directions. Each side is just telling the other
> what is choosing to do, from among the options the other side said it
> could understand.

I’ve read your suggestions about switchable on-the-fly compression that is 
independent for each direction.

While the proposed protocol seems straightforward, the ability to switch the 
compression mode at an arbitrary moment significantly complicates the 
implementation, which may lead to lower adoption of this really useful 
feature in custom frontends/backends.

However, I don’t mean by this that we shouldn’t support switchable compression. 
I propose that we offer two compression modes: permanent (which is what the 
current state of the patch implements) and switchable on-the-fly. Permanent 
compression allows us to deliver a robust solution of the kind already present 
in some databases. Switchable compression allows us to support more complex 
scenarios when the frontend and backend really need it and can afford the 
development effort to implement it.

I’ve made a draft of the protocol that may cover both these compression modes, 
also the following protocol supports independent frontend and backend 
compression.

In StartupPacket _pq_.compression frontend will specify the:

1. Supported compression modes in the order of preference.
For example: “permanent, switchable” means that the frontend supports both 
permanent and switchable modes and prefers to use the permanent mode.

2. List of the compression algorithms which the frontend is able to decompress 
in the order of preference.
For example:
“zlib:1,3,5;zstd:7,8;uncompressed” means that frontend is able to:
 - decompress zlib with 1,3 or 5 compression levels
 - decompress zstd with 7 or 8 compression levels
 - “uncompressed” at the end means that the frontend agrees to receive 
uncompressed messages. If there is no “uncompressed” compression algorithm 
specified it means that the compression is required. 

After receiving the StartupPacket message from the frontend, the backend will 
either ignore the _pq_.compression as an unknown parameter (if the backend is 
before November 2017) or respond with the CompressionAck message which will 
include:

1. Index of the chosen compression mode, or -1 if the backend doesn’t support 
any of the compression modes sent by the frontend.
In the case of the startup packet from the previous example:
It may be ‘0’ if the server chose permanent mode, ‘1’ if switchable, or ‘-1’ if 
the server doesn’t support any of these.

2. List of the compression algorithms which the backend is able to decompress 
in the order of preference.
For example, “zstd:2,4;uncompressed;zlib:7” means that the backend is able to:
 - decompress zstd with 2 and 4 compression levels
 - work in uncompressed mode
 - decompress zlib with compression level 7

After sending the CompressionAck message, the backend will also send the 
SetCompressionMessage with one of the following:
 - Index of the chosen backend compression algorithm followed by the index of 
the chosen compression level. In this case, the frontend now should use the 
chosen decompressor for incoming messages, the backend should also use the 
chosen compressor for outgoing messages. 
 - '-1', if the backend doesn’t support the compression using any of the 
algorithms sent by the frontend. In this case, the frontend must terminate the 
connection after receiving this message.
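
To make the flow concrete, a purely illustrative exchange based on the draft
above could look like this (the indices refer to positions in the lists each
side sent; nothing here is a settled wire format):

Frontend -> Backend: StartupPacket
    _pq_.compression = "permanent,switchable" + "zlib:1,3,5;zstd:7,8;uncompressed"
Backend -> Frontend: CompressionAck
    chosen mode index = 0 (permanent)
    decompressible list = "zstd:2,4;uncompressed;zlib:7"
Backend -> Frontend: SetCompressionMessage
    algorithm index = 1 (zstd), level index = 0 (level 7)
    (from here on the backend compresses its outgoing messages with zstd
    level 7 and the frontend decompresses them accordingly)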

After receiving the SetCompressionMessage from the backend, the frontend should 
also 

Re: cleanup temporary files after crash

2020-11-26 Thread Euler Taveira
On Thu, 26 Nov 2020 at 05:48, Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:

>
> I also think that the GUC is not needed here. This 'feature' was
> internal from the very beginning, so users shouldn't care about
> preserving old behavior. Without the GUC the patch is very simple,
> please see attached version. I also omit the test, because I am not sure
> it will be stable given that the RemovePgTempFiles() allows the
> possibility of failure.
>
Anastasia, thanks for reviewing it. As I said, I'm fine without the GUC.
However, if we decide to go with the GUC, the default behavior should be to
remove the temporary files after the crash.

Regards,

-- 
Euler Taveira http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Improper use about DatumGetInt32

2020-11-26 Thread Peter Eisentraut

On 2020-11-25 20:04, Alvaro Herrera wrote:

On 2020-Nov-25, Peter Eisentraut wrote:


  bt_page_stats(PG_FUNCTION_ARGS)
  {
text   *relname = PG_GETARG_TEXT_PP(0);
-   uint32  blkno = PG_GETARG_UINT32(1);
+   int64   blkno = PG_GETARG_INT64(1);


As a matter of style, I think it'd be better to have an int64 variable
that gets the value from PG_GETARG_INT64(), then you cast that to
another variable that's a BlockNumber and use that throughout the rest
of the code.  So you'd avoid changes like this:


  static bytea *get_raw_page_internal(text *relname, ForkNumber forknum,
 -									BlockNumber blkno);
 +									int64 blkno);


where the previous coding was correct, and the new one is dubious and it
forces you to add unnecessary range checks in that function:


@@ -144,11 +144,16 @@ get_raw_page_internal(text *relname, ForkNumber forknum, BlockNumber blkno)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
+	if (blkno < 0 || blkno > MaxBlockNumber)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid block number")));
+


The point of the patch is to have the range check somewhere.  If you 
just cast it, then you won't notice out of range arguments.  Note that 
other contrib modules that take block numbers work the same way.





Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Krunal Bauskar
On Thu, 26 Nov 2020 at 16:02, Heikki Linnakangas  wrote:

> On 26/11/2020 06:30, Krunal Bauskar wrote:
> > Improving spin-lock implementation on ARM.
> > 
> >
> > * Spin-Lock is known to have a significant effect on performance
> >with increasing scalability.
> >
> > * Existing Spin-Lock implementation for ARM is sub-optimal due to
> >use of TAS (test and swap)
> >
> > * TAS is implemented on ARM as load-store so even if the lock is not
> free,
> >store operation will execute to replace the same value.
> >This redundant operation (mainly store) is costly.
> >
> > * CAS is implemented on ARM as load-check-store-check that means if the
> >lock is not free, check operation, post-load will cause the loop to
> >return there-by saving on costlier store operation. [1]
>
> Can you add some code comments to explain that why CAS is cheaper than
> TAS on ARM?
>

1. As Alexey also pointed out in a follow-up email:
CAS = load value -> check if the value is the expected one -> if yes, replace
(store new value) -> else exit/break
TAS = load value -> store new value

This means each TAS is converted to 2 operations, LOAD and STORE, and of
course STORE is costlier in terms of latency. CAS avoids this with an early
check that skips the store when the lock is not free.

Let's look at some micro-benchmarking. I implemented a simple spin-loop
using both approaches and, as expected, with increasing scalability CAS
continues to outperform TAS by a margin of up to 50%.

 TAS 
Running 128 parallelism
Elapsed time: 1.34271 s
Running 256 parallelism
Elapsed time: 3.6487 s
Running 512 parallelism
Elapsed time: 11.3725 s
Running 1024 parallelism
Elapsed time: 43.5376 s
 CAS 
Running 128 parallelism
Elapsed time: 1.00131 s
Running 256 parallelism
Elapsed time: 2.53202 s
Running 512 parallelism
Elapsed time: 7.66829 s
Running 1024 parallelism
Elapsed time: 22.6294 s

This could be also observed from the perf profiling

TAS:
 15.57 │44:   ldxr  w0, [x19]
 83.93 │  stxr  w1, w21, [x19]

CAS:
 81.29 │58: ↓ b.ne  cc

  9.86 │cc:   ldaxr w0, [x22]
  8.84 │  cmp   w0, #0x0
   │↑ b.ne  58
   │  stxr  w1, w20, [x22]

*In TAS: STORE is pretty costly.*

2.  I have added the needed comment in the patch. Updated patch attached.

--
Thanks for taking a look at this, and do let me know if any more info is
needed.


>
> Is there some official ARM documentation, like a programmer's reference
> manual or something like that, that would show a reference
> implementation of a spinlock on ARM? It would be good to refer to an
> authoritative source on this.
>
> - Heikki
>


-- 
Regards,
Krunal Bauskar
#include <iostream>
#include <thread>
#include <chrono>
#include <cstring>
#include <cstdlib>
#include <cstdint>
#include <zlib.h>

#define likely(x)   __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)

const size_t data_block_size = 64*1024;
char data_block[64*1024];

size_t k_iter=100;
unsigned long global_crcval = 0;

/*  spin-lock  */
uint32_t lock_unit;

void lock()
{
  while (true) {
uint32_t expected = 0;
#ifdef CAS
if (!__atomic_compare_exchange_n(&lock_unit, &expected, 1, false,
 __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE)) {
#elif TAS
if (__sync_lock_test_and_set(&lock_unit, 1)) {
#endif
  /* Spin-and-Retry as the lock is acquired by some other thread */
  __asm__ __volatile__("" ::: "memory");
  continue;
}
break;
  }
}

void unlock()
{
#ifdef CAS
  __atomic_store_n(&lock_unit, 0, __ATOMIC_RELEASE);
#else
  __sync_lock_release(&lock_unit);
#endif
}

/*  workload  */
void workload_execute()
{
  for (size_t i = 0; i < k_iter; ++i) {
lock();
/* Each thread try to take lock -> execute critical section -> unlock */
memset(data_block, rand() % 255, data_block_size);
unsigned long crcval = 0;
    crcval = crc32(crcval, (const unsigned char *)data_block, data_block_size);
global_crcval += crcval;
unlock();
  }
}

int main(int argc, char *argv[]) {
  if (argc != 2) {
std::cerr << "usage:  " << std::endl;
return 1;
  }
  size_t num_of_threads = atol(argv[1]);

  std::thread* handles[num_of_threads];

  auto start = std::chrono::high_resolution_clock::now();

  for (size_t i = 0; i < num_of_threads; ++i) {
handles[i] = new std::thread(workload_execute);
  }
  for (size_t i = 0; i < num_of_threads; ++i) {
handles[i]->join();
  }

  auto finish = std::chrono::high_resolution_clock::now();

  for (size_t i = 0; i < num_of_threads; ++i) {
delete handles[i];
  }

  std::chrono::duration<double> elapsed = finish - start;
  std::cout << "Elapsed time: " << elapsed.count() << " s\n";
}
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 31a5ca6fb3..5110de37d4 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -321,7 +321,33 @@ tas(volatile slock_t *lock)
  * than other widths.
  */
 #if defined(_

Re: [PATCH] Add features to pg_stat_statements

2020-11-26 Thread Fujii Masao




On 2020/11/25 17:07, Fujii Masao wrote:



On 2020/11/25 15:40, Seino Yuki wrote:

2020-11-25 13:13 に Fujii Masao さんは書きました:

On 2020/11/25 12:02, Seino Yuki wrote:

2020-11-17 01:46 に Fujii Masao さんは書きました:

On 2020/11/16 12:22, Seino Yuki wrote:

Thanks for updating the patch!

+    pgss_info->dealloc = 0;
+    SpinLockInit(&pgss_info->mutex);
+    Assert(pgss_info->dealloc == 0);

Why is this assertion check necessary? It seems not necessary.

+    {
+    Assert(found == found_info);

Having pgssSharedState and pgssInfoCounters separately might make
the code a bit more complicated like the above? If this is true, what about
including pgssInfoCounters in pgssSharedState?

PGSS_FILE_HEADER needs to be changed since the patch changes
the format of pgss file?

+    /* Read pgss_info */
+    if (feof(file) == 0)
+    if (fread(pgss_info, sizeof(pgssInfoCounters), 1, file) != 1)
+    goto read_error;

Why does feof(file) need to be called here?

+pgss_info_update(void)
+{
+    {

Why is the second "{" necessary? It seems redundant.

+pgss_info_reset(void)
+{
+    {

Same as above.

+pg_stat_statements_info(PG_FUNCTION_ARGS)
+{
+    int64    d_count = 0;
+    {

Same as above.

+    SpinLockAcquire(&c->mutex);
+    d_count = Int64GetDatum(c->dealloc);
+    SpinLockRelease(&c->mutex);

Why does Int64GetDatum() need to be called here? It seems not necessary.

+   
+    
+ pg_stat_statements_info() returns bigint
+ 
+  pg_stat_statements_info
+ 
+    

Isn't it better not to expose pg_stat_statements_info() function in the
document because pg_stat_statements_info view is enough and there
seems no use case for the function?

Regards,


Thanks for the comment.
I'll post a fixed patch.
Due to similar fixes being needed, we have also merged the patches discussed in the following 
following thread.
https://commitfest.postgresql.org/30/2738/


I agree that these two patches should use the same infrastructure
because they both try to add the global stats for pg_stat_statements.
But IMO they should not be merged to one patch. It's better to
develop them one by one for ease of review. Thought?

So I extracted the "dealloc" part from the merged version of your patch.
Also I refactored the code and applied some cosmetic changes into
the patch. Attached is the updated version of the patch that implements
only "dealloc" part. Could you review this version?

Regards,


Thank you for posting the new patch.

I checked "regression test" and "document" and "operation of the view".
No particular problems were found.


Thanks for the review and test!
So you think that this patch can be marked as ready for committer?




I just want to check one thing: will the log output be unnecessary this time?
Quotes from v2.patch


I'm not sure if it's really good idea to add this log message.
If we adopt that logging, in the system where pgss entries are deallocated
very frequently, that message also would be logged very frequently.
Such too many log messages might be noisy to users. To address this issue,
we may want to add new parameter that controls whether log message is
emitted or not when entries are deallocated. But that parameter sounds
too specific... Thought?

Regards,




Thanks for the review and test!
So you think that this patch can be marked as ready for committer?


Updated status to ready for committer.


Thanks!


I modified the docs a bit and pushed the patch. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-26 Thread Luc Vlaming

On 26-11-2020 12:36, Bharath Rupireddy wrote:

Few things:

IIUC Andres mentioned similar kinds of APIs earlier in [1].

[1] - 
https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx%40alap3.anarazel.de 



I would like to add some more info to one of the API:

typedef struct MultiInsertStateData
{
     MemoryContext         micontext; /* A temporary memory context for 
multi insert. */

     BulkInsertStateData *bistate;   /* Bulk insert state. */
     TupleTableSlot      **mislots; /* Array of buffered slots. */
     uint32              nslots; /* Total number of buffered slots. */
     int64              nbytes; /* Flush buffers if the total tuple size 
 >= nbytes. */
     int32              nused; /* Number of current buffered slots for a 
multi insert batch. */
     int64              nsize; /* Total tuple size for a multi insert 
batch. */

} MultiInsertStateData;

/* Creates a temporary memory context, allocates the 
MultiInsertStateData, BulkInsertStateData and initializes other members. */
     void        (*begin_multi_insert) (Relation rel, 
MultiInsertStateData **mistate, uint32 nslots, uint64 nbytes);


/* Buffers the input slot into mistate slots, computes the size of the 
tuple, and adds it total buffer tuple size, if this size crosses 
mistate->nbytes, flush the buffered tuples into table. For heapam, 
existing heap_multi_insert can be used. Once the buffer is flushed, then 
the micontext can be reset and buffered slots can be cleared. *If nbytes 
i.e. total tuple size of the batch is not given, tuple size is not 
calculated, tuples are buffered until all the nslots are filled and then 
flushed.* */
     void        (*do_multi_insert) (Relation rel, MultiInsertStateData 
*mistate, struct TupleTableSlot *slot, CommandId cid, int options);


/* Flush the buffered tuples if any. For heapam, existing 
heap_multi_insert can be used. Deletes temporary memory context and 
deallocates mistate. */
     void        (*end_multi_insert) (Relation rel, MultiInsertStateData 
*mistate, CommandId cid, int options);


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com 


Looks all good to me, except for the nbytes part.
Could you explain to me what use case that supports? IMHO the tableam 
can best decide itself that it's time to flush, based on its 
implementation that e.g. considers how many pages to flush at a time and 
such, etc? This means also that most of the fields of 
MultiInsertStateData can be private as each tableam would return a 
derivative of that struct (like with the destreceivers).


One thing I'm wondering is in which memory context the slots end up 
being allocated. I'd assume we would want to keep the slots around 
between flushes. If they are in the temporary context this might prove 
problematic however?


Regards,
Luc




Re: proposal: possibility to read dumped table's name from file

2020-11-26 Thread Dean Rasheed
On Thu, 26 Nov 2020 at 06:43, Pavel Stehule  wrote:
>
> st 25. 11. 2020 v 21:00 odesílatel Tom Lane  napsal:
>>
>> (One thing to consider is
>> how painful will it be for people to quote table names containing
>> funny characters, for instance.  On the command line, we largely
>> depend on the shell's quoting behavior to solve that, but we'd not
>> have that infrastructure when reading from a file.)
>
> This is not a problem with the current patch - and the last version of this 
> patch supports well obscure names.
>

Actually, that raises a different possible benefit of passing options
in an options file -- if the user wants to pass in a table name
pattern, it can be a nuisance if the shell's argument processing does
additional unwanted things like globbing and environment variable
substitutions. Using an options file could provide a handy way to
ensure that any option values are interpreted exactly as written,
without any additional mangling.

> There was a requirement for supporting all and future pg_dump options - ok it 
> can make sense. I have not a problem to use instead a line format
>
> "option argument" or "long-option=argument"
>
> This format - can it be a solution? I'll try to rewrite the parser for this 
> format.
>

Yes, that's the sort of thing I was thinking of, to make the feature
more general-purpose.
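
To be concrete, a file in that plain format might look like this (purely
illustrative, nothing here is a committed syntax; the names are just
existing pg_dump long options):

    # one option per line, long-option=argument
    schema=public
    table=measurement_*
    exclude-table-data=audit_log
    no-owner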

> It is implementable, but this is in collision with Stephen's requirement for 
> a human-readable format designed for handy writing. These requests 
> have no intersection. A well readable format needs a more complex parser. 
> And machine-generating this format needs more work - generating a flat file 
> is simpler and more robust than generating JSON or YAML.
>

To be clear, I wasn't suggesting that this patch implement multiple
formats. Just the "plain" format above. Other formats like YAML might
well be more human-readable, and be able to take advantage of values
that are lists to avoid repeating option names, and they would have
the advantage of being readable and writable by other standard tools,
which might be useful. But I think such things would be for the
future. Maybe no one will ever add support for other formats, or maybe
someone will just write a separate external tool to convert YAML or
JSON to our plain format. I don't know. But supporting all pg_dump
options makes more things possible.

>> I wasn't very excited about multiple switch files either, though
>> depending on how the implementation is done, that could be simple
>> enough to be in the might-as-well category.
>>

That's what I was hoping.

Regards,
Dean




Re: [HACKERS] logical decoding of two-phase transactions

2020-11-26 Thread Amit Kapila
On Thu, Nov 26, 2020 at 4:24 PM Ajin Cherian  wrote:
>
> On Wed, Nov 25, 2020 at 11:54 PM Amit Kapila  wrote:
>
> > One problem with this patch is: What if we have assembled a consistent
> > snapshot after prepare and before commit prepared. In that case, it
> > will currently just send commit prepared record which would be a bad
> > idea. It should decode the entire transaction for such cases at commit
> > prepared time. This same problem is noticed by Arseny Sher, see
> > problem-2 in email [1].
>
> I'm not sure I understand how you could assemble a consistent snapshot
> after prepare but before commit prepared?
> Doesn't a consistent snapshot require that all in-progress
> transactions commit?
>

By above, I don't mean that the transaction is not committed. I am
talking about the timing of WAL. It is possible that between WAL of
Prepare and Commit Prepared, we reach a consistent state.

> I've tried starting a new subscription after
> a prepare on the publisher and I see that the create subscription just
> hangs till the transaction on the publisher is either committed or
> rolled back.
>

I think what you need to do to reproduce this is to follow the
snapshot machinery in SnapBuildFindSnapshot. Basically, first, start a
transaction  (say transaction-id is 500) and do some operations but
don't commit. Here, if you create a slot (via subscription or
otherwise), it will wait for 500 to complete and make the state as
SNAPBUILD_BUILDING_SNAPSHOT. Here, you can commit 500 and then having
debugger in that state, start another transaction (say 501), do some
operations but don't commit. Next time when you reach this function,
it will change the state to SNAPBUILD_FULL_SNAPSHOT and wait for 501,
now you can start another transaction (say 502) which you can prepare
but don't commit. Again start one more transaction 503, do some ops,
commit both 501 and 503. At this stage somehow we need to ensure that an
XLOG_RUNNING_XACTS record is written. Then commit prepared 502. Now, I think you
should notice that the consistent point is reached after 502's prepare
and before its commit. Now, this is just a theoretical scenario, you
need something on these lines and probably a way to force
XLOG_RUNNING_XACTS WAL (probably via debugger or some other way) at
the right times to reproduce it.

Thanks for trying to build a test case for this, it is really helpful.

-- 
With Regards,
Amit Kapila.




Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-26 Thread Bharath Rupireddy
Few things:

IIUC Andres mentioned similar kinds of APIs earlier in [1].

[1] -
https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx%40alap3.anarazel.de

I would like to add some more info to one of the API:

typedef struct MultiInsertStateData
{
    MemoryContext        micontext;  /* A temporary memory context for multi insert. */
    BulkInsertStateData *bistate;    /* Bulk insert state. */
    TupleTableSlot     **mislots;    /* Array of buffered slots. */
    uint32               nslots;     /* Total number of buffered slots. */
    int64                nbytes;     /* Flush buffers if the total tuple size >= nbytes. */
    int32                nused;      /* Number of currently buffered slots for a multi insert batch. */
    int64                nsize;      /* Total tuple size for a multi insert batch. */
} MultiInsertStateData;

/* Creates a temporary memory context, allocates the MultiInsertStateData,
BulkInsertStateData and initializes other members. */
void        (*begin_multi_insert) (Relation rel, MultiInsertStateData **mistate,
                                   uint32 nslots, uint64 nbytes);

/* Buffers the input slot into mistate slots, computes the size of the
tuple, and adds it to the total buffered tuple size; if this size crosses
mistate->nbytes, the buffered tuples are flushed into the table. For heapam,
the existing heap_multi_insert can be used. Once the buffer is flushed, the
micontext can be reset and the buffered slots can be cleared. *If nbytes,
i.e. the total tuple size of the batch, is not given, tuple sizes are not
calculated and tuples are buffered until all the nslots are filled and then
flushed.* */
void        (*do_multi_insert) (Relation rel, MultiInsertStateData *mistate,
                                struct TupleTableSlot *slot, CommandId cid,
                                int options);

/* Flush the buffered tuples if any. For heapam, the existing heap_multi_insert
can be used. Deletes the temporary memory context and deallocates mistate. */
void        (*end_multi_insert) (Relation rel, MultiInsertStateData *mistate,
                                 CommandId cid, int options);
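
Just to make the intended call flow concrete, a caller such as CTAS or COPY
could drive these proposed callbacks roughly as below. This is only my sketch
against the proposal (the three callbacks don't exist in TableAmRoutine today,
and the 1000-slot / 8 MB limits are made-up numbers):

#include "access/tableam.h"
#include "utils/rel.h"

/* Sketch: buffer up to 1000 slots or ~8 MB of tuples, whichever fills first. */
static void
multi_insert_rows(Relation rel, TupleTableSlot **rows, int nrows,
                  CommandId cid, int options)
{
    MultiInsertStateData *mistate;

    rel->rd_tableam->begin_multi_insert(rel, &mistate, 1000, 8 * 1024 * 1024);

    for (int i = 0; i < nrows; i++)
        rel->rd_tableam->do_multi_insert(rel, mistate, rows[i], cid, options);

    /* Flushes any leftover buffered tuples and frees mistate. */
    rel->rd_tableam->end_multi_insert(rel, mistate, cid, options);
}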

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Alexander Korotkov
On Thu, Nov 26, 2020 at 1:32 PM Heikki Linnakangas  wrote:
> On 26/11/2020 06:30, Krunal Bauskar wrote:
> > Improving spin-lock implementation on ARM.
> > 
> >
> > * Spin-Lock is known to have a significant effect on performance
> >with increasing scalability.
> >
> > * Existing Spin-Lock implementation for ARM is sub-optimal due to
> >use of TAS (test and swap)
> >
> > * TAS is implemented on ARM as load-store so even if the lock is not free,
> >store operation will execute to replace the same value.
> >This redundant operation (mainly store) is costly.
> >
> > * CAS is implemented on ARM as load-check-store-check that means if the
> >lock is not free, check operation, post-load will cause the loop to
> >    return, thereby saving on the costlier store operation. [1]
>
> Can you add some code comments to explain that why CAS is cheaper than
> TAS on ARM?
>
> Is there some official ARM documentation, like a programmer's reference
> manual or something like that, that would show a reference
> implementation of a spinlock on ARM? It would be good to refer to an
> authoritative source on this.

Let me add my 2 cents.

I've compared assembly output of gcc implementations of CAS and TAS.
The sample C-program is attached.  I've compiled it on raspberry pi 4
using gcc 9.3.0.
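
(The attachment isn't inlined in the archive. For readers following along, a
minimal program along these lines -- my reconstruction, the actual test.c may
differ -- produces equivalent inner loops when compiled with -O2 for armv8-a
without LSE atomics:)

#include <stdint.h>

volatile uint32_t lock;

/* CAS-style lock attempt, roughly what a compare-and-swap based S_LOCK would do */
uint32_t
lock_cas(void)
{
    return __sync_val_compare_and_swap(&lock, 0, 1);
}

/* TAS-style lock attempt, roughly what the current s_lock.h default does */
uint32_t
lock_tas(void)
{
    return __sync_lock_test_and_set(&lock, 1);
}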

The inner loop of CAS is as follows.  So, if the value loaded by ldaxr
doesn't match expected value, then we immediately quit the loop.

.L3:
        ldxr    w3, [x0]
        cmp     w3, w1
        bne     .L4
        stlxr   w4, w2, [x0]
        cbnz    w4, .L3
.L4:

The inner loop of TAS is as follows.  So it really does "stxr"
unconditionally.  In principle, it could check if a new value matches
the observed value and there is nothing to do, but it doesn't.
Moreover, stxr might fail, then we can spend multiple loops of
ldxr/stxr due to concurrent memory access.  AFAIK, those concurrent
accesses could reflect not only lock release, but also other
unsuccessful lock attempts.  So, good chance for extra loops to be
useless.

.L7:
        ldxr    w2, [x0]
        stxr    w3, w1, [x0]
        cbnz    w3, .L7

I've also googled for spinlock implementation on arm and found a blog
post about spinlock implementation in linux kernel [1].  Surprisingly
it doesn't use the trick to skip stxr if the lock is busy.  Instead,
they use some arm-specific power-saving option called WFE.

So, I think it's quite clear that switching from TAS to CAS on arm
would be a win.  But there could be other options to do this even
better.

Links
1. 
https://linux-concepts.blogspot.com/2018/05/spinlock-implementation-in-arm.html

--
Regards,
Alexander Korotkov


test.c
Description: Binary data


Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-26 Thread Bharath Rupireddy
On Thu, Nov 26, 2020 at 12:25 PM Luc Vlaming  wrote:
>
> What I'm wondering about is the reason for wanting a cap on data volume.
> When doing some local (highly concurrent) ingest speed tests a few weeks
> ago it seemed to mostly matter how many pages were being written and the
> resulting pressure on locks, etc. and not necessarily so much the actual
> memory usage. I didn't collect proof on that though (yet). There was
> however a very clearly observable contention point where with bigger
> buffers the performance would not only stagnate but actually drop.
>
> So what I'm kinda wondering is if we should worry more about the amount
> of pages that are going to be written and maybe not so much about the
> memory usage?
>
> If this were to be the case then maybe we can consider improving the
> current design, potentially in a follow-up patch? The problem I see is
> that generically each tableam will have different choices to make on how
> to buffer and flush multiple rows, given that a storage engine might
> have more or less write amplification, a different way of extending a
> relation, fsm use, etc.
> Assuming we indeed want a per-tableam implementation, we could either:
> - make multi_insert buffer the tuples itself and add a flush_multi_insert.
> - add a new function called create_multi_insert which returns something
> like a MultiInsertState, which, like a destreceiver, has a set of
> callbacks to start, shutdown and insert.
>
> With both solutions one part that to me seems appealing is that we
> buffer the data in something that likely resembles the disk format very
> much. Thoughts?
>

IMHO, I would like to go with your option 1 i.e. add a few APIs to the
TableAmRoutine structure. Advantage is that we could use these APIs in
at least 3 places, without much code duplication: 1) COPY 2) CTAS and
3) Refresh Materialized View. I could roughly sketch the APIs in below
way:

typedef struct MultiInsertStateData
{
    MemoryContext        micontext;  /* A temporary memory context for multi insert. */
    BulkInsertStateData *bistate;    /* Bulk insert state. */
    TupleTableSlot     **mislots;    /* Array of buffered slots. */
    uint32               nslots;     /* Total number of buffered slots. */
    uint64               nbytes;     /* Flush buffers if the total tuple size >= nbytes. */
    int32                nused;      /* Number of currently buffered slots for a multi insert batch. */
    int64                nsize;      /* Total tuple size for a multi insert batch. */
} MultiInsertStateData;

/* Creates a temporary memory context, allocates the
MultiInsertStateData, BulkInsertStateData and initializes other
members. */
void        (*begin_multi_insert) (Relation rel, MultiInsertStateData **mistate,
                                   uint32 nslots, uint64 nbytes);

/* Buffers the input slot into mistate slots, computes the size of the
tuple, and adds it to the total tuple size of the buffered tuples; if
this size crosses mistate->nbytes, the buffered tuples are flushed into
the table. For heapam, the existing heap_multi_insert can be used. Once
the buffer is flushed, the micontext can be reset and the buffered slots
can be cleared. */
void        (*do_multi_insert) (Relation rel, MultiInsertStateData *mistate,
                                struct TupleTableSlot *slot, CommandId cid,
                                int options);

/* Flush the buffered tuples if any. For heapam, the existing
heap_multi_insert can be used. Deletes the temporary memory context and
deallocates mistate. */
void        (*end_multi_insert) (Relation rel, MultiInsertStateData *mistate,
                                 CommandId cid, int options);

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] logical decoding of two-phase transactions

2020-11-26 Thread Ajin Cherian
On Wed, Nov 25, 2020 at 11:54 PM Amit Kapila  wrote:

> One problem with this patch is: What if we have assembled a consistent
> snapshot after prepare and before commit prepared. In that case, it
> will currently just send commit prepared record which would be a bad
> idea. It should decode the entire transaction for such cases at commit
> prepared time. This same problem is noticed by Arseny Sher, see
> problem-2 in email [1].

I'm not sure I understand how you could assemble a consistent snapshot
after prepare but before commit prepared?
Doesn't a consistent snapshot require that all in-progress
transactions commit? I've tried starting a new subscription after
a prepare on the publisher and I see that the create subscription just
hangs till the transaction on the publisher is either committed or
rolled back.
Even if I try to create a replication slot using
pg_create_logical_replication_slot when a transaction has been
prepared but not yet committed, it just hangs till the transaction is
committed/aborted.

regards,
Ajin Cherian
Fujitsu Australia




Re: [Proposal] Global temporary tables

2020-11-26 Thread Magnus Hagander
On Thu, Nov 26, 2020 at 11:16 AM Julien Rouhaud  wrote:
>
> On Thu, Nov 26, 2020 at 4:05 PM 曾文旌  wrote:
> >
> > I found that the new Patch mail failed to register to Commitfest
> > https://commitfest.postgresql.org/28/2349/#
> > I don't know what's wrong and how to check it?
> > Could you help me figure it out?
>
> Apparently the attachment in
> https://www.postgresql.org/message-id/a3f1ebd9-e694-4384-8049-37b093084...@alibaba-inc.com
> wasn't detected.  I have no idea why, maybe Magnus will know.
> Otherwise you could try to ask on -www.

Not offhand. The email appears to have a fairly complex nested MIME
structure, so something in the python library that parses the MIME
decides that it's not there. For some reason the email has 7 parts. One
is the signature, the rest seem complexly nested. And the attachment
seems to be squeezed in between two different HTML parts.

Basically, at the top it's multipart/alternative, which says there are
two choices. One is text/plain, which is what the archives uses. The
other is a combination of text/html followed by
application/octetstream (the patch) followed by another text/html.

The archives picks the first alternative, which is text/plain, which
does not contain the attachment. The attachment only exists in the
HTML view.

I think the easiest solution is to re-send as plain text email with
the attachment, which would then put the attachment on the email
itself instead of embedded in the HTML, I would guess.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Heikki Linnakangas

On 26/11/2020 06:30, Krunal Bauskar wrote:

Improving spin-lock implementation on ARM.


* Spin-Lock is known to have a significant effect on performance
   with increasing scalability.

* Existing Spin-Lock implementation for ARM is sub-optimal due to
   use of TAS (test and swap)

* TAS is implemented on ARM as load-store so even if the lock is not free,
   store operation will execute to replace the same value.
   This redundant operation (mainly store) is costly.

* CAS is implemented on ARM as load-check-store-check that means if the
   lock is not free, check operation, post-load will cause the loop to
   return there-by saving on costlier store operation. [1]


Can you add some code comments to explain that why CAS is cheaper than 
TAS on ARM?


Is there some official ARM documentation, like a programmer's reference 
manual or something like that, that would show a reference 
implementation of a spinlock on ARM? It would be good to refer to an 
authoritative source on this.


- Heikki




Re: [Proposal] Global temporary tables

2020-11-26 Thread Julien Rouhaud
On Thu, Nov 26, 2020 at 4:05 PM 曾文旌  wrote:
>
> I found that the new Patch mail failed to register to Commitfest
> https://commitfest.postgresql.org/28/2349/#
> I don't know what's wrong and how to check it?
> Could you help me figure it out?

Apparently the attachment in
https://www.postgresql.org/message-id/a3f1ebd9-e694-4384-8049-37b093084...@alibaba-inc.com
wasn't detected.  I have no idea why, maybe Magnus will know.
Otherwise you could try to ask on -www.




Re: fixing old_snapshot_threshold's time->xid mapping

2020-11-26 Thread Anastasia Lubennikova

On 06.10.2020 08:32, Thomas Munro wrote:

On Fri, Sep 25, 2020 at 2:00 PM Michael Paquier  wrote:

On Thu, Sep 24, 2020 at 03:46:14PM -0400, Robert Haas wrote:


Thomas, with respect to your part of this patch set, I wonder if we
can make the functions that you're using to write tests safe enough
that we could add them to contrib/old_snapshot and let users run them
if they want. As you have them, they are hedged around with vague and
scary warnings, but is that really justified? And if so, can it be
fixed? It would be nicer not to end up with two loadable modules here,
and maybe the right sorts of functions could even have some practical
use.

Yeah, you may be right.  I am thinking about that.  In the meantime,
here is a rebase.  A quick recap of these remaining patches:

0001 replaces the current "magic test mode" that didn't really test
anything with a new test mode that verifies pruning and STO behaviour.
0002 fixes a separate bug that Andres reported: the STO XID map
suffers from wraparound-itis.
0003 adds a simple smoke test for Robert's commit 55b7e2f4.  Before
that fix landed, it failed.


I have switched this item as waiting on author in the CF app then, as
we are not completely done yet.

Thanks.  For the record, I think there is still one more complaint
from Andres that remains unaddressed even once these are in the tree:
there are thought to be some more places that lack
TestForOldSnapshot() calls.


Status update for a commitfest entry.

This entry is "Waiting on author" and the thread was inactive for a 
while.  As far as I see, part of the fixes is already committed. Is 
there anything left to work on or this patch set needs review now?


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Parallel Inserts in CREATE TABLE AS

2020-11-26 Thread Bharath Rupireddy
On Thu, Nov 26, 2020 at 12:15 PM Hou, Zhijie  wrote:
>
> I took a deep look at the projection logic.
> In most cases, you are right that Gather node does not need projection.
>
> In some rare cases, such as a SubPlan (or initplan I guess),
> the projection will happen in the Gather node.
>
> The example:
>
> Create table test(i int);
> Create table test2(a int, b int);
> insert into test values(generate_series(1,1000,1));
> insert into test2 values(generate_series(1,1000,1), 
> generate_series(1,1000,1));
>
> postgres=# explain(verbose, costs off) select test.i,(select i from (select * 
> from test2) as tt limit 1) from test where test.i < 2000;
>QUERY PLAN
> --------------------------------------------
>  Gather
>Output: test.i, (SubPlan 1)
>Workers Planned: 2
>->  Parallel Seq Scan on public.test
>  Output: test.i
>  Filter: (test.i < 2000)
>SubPlan 1
>  ->  Limit
>Output: (test.i)
>->  Seq Scan on public.test2
>  Output: test.i
>
> In this case, projection is necessary,
> because the subplan will be executed during projection.
>
> If skipped, the created table will lose some data.
>

Thanks a lot for the use case. Yes, with the current patch the table will
lose data related to the subplan. On analyzing further, I think we cannot
allow parallel inserts in cases where the Gather node has some
projection to do, because the workers cannot perform that
projection. So, having ps_ProjInfo set in the Gather node is an indication
for us to disable parallel inserts; only the leader can do the
insertions, after the Gather node does the required projections.
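
As a rough sketch of that check (mine, not code from the patch), the
executor side could simply test whether the Gather node carries a
projection:

#include "postgres.h"
#include "nodes/execnodes.h"

/*
 * Sketch only: parallel inserts are safe only if the Gather node has no
 * projection of its own (ps_ProjInfo is set e.g. when a SubPlan has to be
 * evaluated above the workers); otherwise the leader must do the inserts.
 */
static bool
gather_allows_parallel_inserts(GatherState *gstate)
{
    return gstate->ps.ps_ProjInfo == NULL;
}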

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




configure and DocBook XML

2020-11-26 Thread Paul Förster
Hi,

I always compile PostgreSQL from source and never install a pre-compiled 
package. When I run configure, the last check, which is for DocBook XML, always 
takes very long compared to all the others - about 15 to 20 secs or so. I have 
noticed that with many configure scripts, not only PostgreSQL's. Why 
is that?

Cheers,
Paul



Re: ResourceOwner refactoring

2020-11-26 Thread Kyotaro Horiguchi
At Thu, 19 Nov 2020 17:27:18 +0800, Julien Rouhaud  wrote 
in 
> On Thu, Nov 19, 2020 at 10:16 AM Michael Paquier  wrote:
> >
> > On Wed, Nov 18, 2020 at 10:50:08AM +0200, Heikki Linnakangas wrote:
> > > If RESOWNER_ARRAY_STATS is increased to 16, all the lookups fit in the
> > > array. But I haven't done any benchmarking to see which is faster.
> >
> > My gut tells me that your guess is right, but it would be better to be
> > sure.
> >
> > > BTW, I think there would be an easy win in the hashing codepath, by 
> > > changing
> > > to a cheaper hash function. Currently, with or without this patch, we use
> > > hash_any(). Changing that to murmurhash32() or something similar would be 
> > > a
> > > drop-in replacement.
> >
> > Good idea.
> 
> +1, and +1 for this refactoring.

+1 for making the interface more generic. I thought a similar thing
when I added a resowner array for WaitEventSet (not committed yet).
About performance I'm not sure, but there's no reason not to
do this as long as the resowner mechanism doesn't get slower, and +2 if
it gets faster.

> I just saw a minor issue in a comment while reviewing the patch:
> 
> [...]
> + /* Is there space in the hash? If not, enlarge it. */
> 
>   /* Double the capacity of the array (capacity must stay a power of 2!) */
> - newcap = (oldcap > 0) ? oldcap * 2 : RESARRAY_INIT_SIZE;
> [...]
> 
> The existing comment is kept as-is, but it should mention that it's
> now the hash capacity that is increased.

+   /* And release old array. */
+   pfree(oldhash);

:p


+   for (int i = 0; i < owner->narr; i++)
{
+   if (owner->arr[i].kind->phase == phase)
+   {
+   Datum   value = owner->arr[i].item;
+   ResourceOwnerFuncs *kind = owner->arr[i].kind;
+
+   if (printLeakWarnings)
+   kind->PrintLeakWarning(value);
+   kind->ReleaseResource(value);
+   found = true;
+   }
+   /*
+* If any resources were released, check again because some of the
+* elements might have moved by the callbacks. We don't want to
+* miss them.
+*/
+   } while (found && owner->narr > 0);

Couldn't that miss be avoided by just not incrementing i if found?
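
In other words, something like the following (my sketch, not code from the
patch; it assumes, as the comment above says, that releasing an entry can
move another element into the same slot, which is why the index must not
advance in that case):

    int         i = 0;

    while (i < owner->narr)
    {
        if (owner->arr[i].kind->phase == phase)
        {
            Datum       value = owner->arr[i].item;
            ResourceOwnerFuncs *kind = owner->arr[i].kind;

            if (printLeakWarnings)
                kind->PrintLeakWarning(value);
            kind->ReleaseResource(value);
            /* slot i may now hold a different element; re-check it */
            continue;
        }
        i++;
    }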


+   /*
+* Like with the array, we must check again after we reach the
+* end, if any callbacks were called. XXX: We could probably
+* stipulate that the callbacks may not do certain thing, like
+* remember more references in the same resource owner, to avoid
+* that.
+*/

If I read this patch correctly, ResourceOwnerForget doesn't seem to do
such a thing for hash?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: cleanup temporary files after crash

2020-11-26 Thread Anastasia Lubennikova

On 01.11.2020 04:25, Michael Paquier wrote:

On Sat, Oct 31, 2020 at 09:01:15PM -0300, Euler Taveira wrote:

I thought about not providing a GUC at all or provide it in the developer
section. I've never heard someone saying that they use those temporary
files to investigate an issue. Regarding a crash, all information is already
available and temporary files don't provide extra details. This new
GUC is just to keep the previous behavior. I'm fine without the GUC, though.

The original behavior is as old as 4a5f38c4, and last time we talked
about that there were arguments about still keeping the existing
behavior to not cleanup files during a restart-after-crash scenario
for the sake of being useful just "in case".  I have never used that
property myself, TBH, and I have seen much more cases of users caring
about the data folder not facing an ENOSPC particularly if they don't
use different partitions for pg_wal/ and the main data folder.
--
Michael


Thank you, Euler for submitting this.
+1 for the feature. One of the platforms we support uses temp files a 
lot and we faced the problem, while never actually used these orphan 
files for debugging purposes.


I also think that the GUC is not needed here. This 'feature' was 
internal from the very beginning, so users shouldn't care about 
preserving old behavior. Without the GUC the patch is very simple, 
please see attached version. I also omit the test, because I am not sure 
it will be stable given that the RemovePgTempFiles() allows the 
possibility of failure.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit 417581b287f4060178595cd96465adc639639290
Author: anastasia 
Date:   Thu Nov 26 11:45:05 2020 +0300

Remove temporary files after a backend crash

Successive crashes could result in useless storage usage until service
is restarted. It could be the case on hosts with inadequate resources.
Such manual intervention is not desirable for some environments.

The only reason we kept temp files was that someone might want to examine
them for debugging purposes. With this patch we favor service continuity
over debugging.

Author: Euler Taveira
Reviewer: Anastasia Lubennikova

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b7799ed1d24..a151250734f 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3999,6 +3999,9 @@ PostmasterStateMachine(void)
 		ereport(LOG,
 (errmsg("all server processes terminated; reinitializing")));
 
+		/* remove leftover temporary files after a crash */
+		RemovePgTempFiles();
+
 		/* allow background workers to immediately restart */
 		ResetBackgroundWorkerCrashTimes();
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 05abcf72d68..d30bfc28541 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2992,15 +2992,15 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
  * remove any leftover files created by OpenTemporaryFile and any leftover
  * temporary relation files created by mdcreate.
  *
- * NOTE: we could, but don't, call this during a post-backend-crash restart
- * cycle.  The argument for not doing it is that someone might want to examine
- * the temp files for debugging purposes.  This does however mean that
- * OpenTemporaryFile had better allow for collision with an existing temp
- * file name.
+ * This is also called during a post-backend-crash restart cycle. The argument
+ * for not doing it is that crashes of queries using temp files can result in
+ * useless storage usage that can only be reclaimed by a service restart.
  *
  * NOTE: this function and its subroutines generally report syscall failures
  * with ereport(LOG) and keep going.  Removing temp files is not so critical
  * that we should fail to start the database when we can't do it.
+ * Because this routine is not strict, OpenTemporaryFile allows for collision
+ * with an existing temp file name.
  */
 void
 RemovePgTempFiles(void)


Re: POC: Cleaning up orphaned files using undo logs

2020-11-26 Thread Antonin Houska
Amit Kapila  wrote:

> On Wed, Nov 25, 2020 at 8:00 PM Antonin Houska  wrote:

> > I meant that AbortOutOfAnyTransaction should PANIC itself if it sees that
> > there is unapplied undo, so nothing changes for its callers. Do I still miss
> > something?
> >
> 
> Adding PANIC in some generic code-path sounds scary. Why can't we
> simply try to execute undo?

Indeed it should try. I imagined it this way but probably got distracted by
some other thought when writing the email :-)

> > Actually I hit the problem of missing connection when playing with the
> > "undoxacttest" module. Those tests use table_open() / table_close() 
> > functions,
> > but it might not be necessary for the real RMGRs.
> >
> 
> How can we apply the action on a page without opening the relation?

If the undo record contains RelFileNode, ReadBufferWithoutRelcache() can be
used, just like it happens with WAL. Not sure how much it would affect zheap.
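
For illustration, a minimal sketch (mine, assuming the undo record stores the
target's RelFileNode and block number, similar to what WAL records do):

#include "postgres.h"
#include "storage/bufmgr.h"

/* Read and lock the page targeted by an undo action without touching the
 * relcache, the same way WAL redo does. */
static Buffer
undo_read_target_page(RelFileNode rnode, BlockNumber blkno)
{
    Buffer      buf;

    buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
                                    RBM_NORMAL, NULL);
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    return buf;
}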

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: POC: Cleaning up orphaned files using undo logs

2020-11-26 Thread Antonin Houska
Amit Kapila  wrote:

> On Wed, Nov 25, 2020 at 7:47 PM Antonin Houska  wrote:
> >
> > Antonin Houska  wrote:
> >
> > > Amit Kapila  wrote:
> >
> > > > I think we also need to maintain oldestXidHavingUndo for CLOG 
> > > > truncation and
> > > > transaction-wraparound. We can't allow CLOG truncation for the 
> > > > transaction
> > > > whose undo is not discarded as that could be required by some other
> > > > transaction.
> > >
> > > Good point. Even the discard worker might need to check the transaction 
> > > status
> > > when deciding whether undo log of that transaction should be discarded.
> >
> > In the zheap code [1] I see that DiscardWorkerMain() discards undo log up to
> > OldestXmin:
> >
> >
> > OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_AUTOVACUUM |
> >                                  PROCARRAY_FLAGS_VACUUM);
> >
> > oldestXidHavingUndo = 
> > GetXidFromEpochXid(pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo));
> >
> > /*
> >  * Call the discard routine if there oldestXidHavingUndo is lagging
> >  * behind OldestXmin.
> >  */
> > if (OldestXmin != InvalidTransactionId &&
> > TransactionIdPrecedes(oldestXidHavingUndo, OldestXmin))
> > {
> > UndoDiscard(OldestXmin, &hibernate);
> >
> > and that UndoDiscard() eventually advances oldestXidHavingUndo in the shared
> > memory.
> >
> > I'm not sure this is correct because, IMO, OldestXmin can advance as soon as
> > AbortTransaction() has cleared both xid and xmin fields of the transaction's
> > PGXACT (by calling ProcArrayEndTransactionInternal). However the 
> > corresponding
> > undo log may still be waiting for processing. Am I wrong?

> The UndoDiscard->UndoDiscardOneLog ensures that we don't discard the
> undo if there is a pending abort.

ok, I should have dug deeper than just reading the header comment of
UndoDiscard(). Checked now and seem to understand why no information is lost.

Nevertheless, I see in the zheap code that the discard worker may need to scan
a lot of undo log each time. While the oldest_xid and oldest_data fields of
UndoLogControl help to skip parts of the log, I'm not sure such information
fits into the undo-record-set (URS) approach. For now I tend to try to
implement the "exhaustive" scan for the URS too, and later let's teach the
discard worker to store some metadata so that the processing is rather
incremental.

> > I think that oldestXidHavingUndo should be advanced at the time transaction
> > commits or when the undo log of an aborted transaction has been applied.
> >
> 
> We can't advance oldestXidHavingUndo just on commit because later we
> need to rely on it for visibility, basically any transaction older
> than oldestXidHavingUndo should be all-visible.

ok

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Online verification of checksums

2020-11-26 Thread Magnus Hagander
On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier  wrote:
>
> On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> > We are not just looking at one LSN value. Here are the steps we are
> > proposing (I'll skip checks for zero pages here):
> >
> > 1) Test the page checksum. If it passes the page is OK.
> > 2) If the checksum does not pass then record the page offset and LSN and
> > continue.
>
> But here the checksum is broken, so while the offset is something we
> can rely on how do you make sure that the LSN is fine?  A broken
> checksum could perfectly mean that the LSN is actually *not* fine if
> the page header got corrupted.
>
> > 3) After the file is copied, reopen and reread the file, seeking to offsets
> > where possible invalid pages were recorded in the first pass.
> > a) If the page is now valid then it is OK.
> > b) If the page is not valid but the LSN has increased from the LSN
>
> Per se the previous point about the LSN value that we cannot rely on.

We cannot rely on the LSN itself. But it's a lot more likely that we
can rely on the LSN changing, and on the LSN changing in a "correct
way". That is, if the LSN *decreases* we know it's corrupt. If the LSN
*doesn't change* we know it's corrupt. But if the LSN *increases* AND
the new page now has a correct checksum, it's very most likely to be
correct. You could perhaps even put a cap on it saying "if the LSN
increased, but by less than <n>", where <n> is a sufficiently high number
that it's entirely unreasonable to advance that far between the two
reads of the block. But it has to have a very high margin in that
case.
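
Expressed as (hypothetical) code, the re-check described above would be
roughly the following. MAX_LSN_ADVANCE stands for that cap and is a made-up
constant, and zero/new-page handling is omitted:

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"

#define MAX_LSN_ADVANCE ((XLogRecPtr) 16 * 1024 * 1024 * 1024)  /* made-up cap */

/* Re-evaluate a page that failed checksum verification on the first read;
 * first_lsn is the LSN recorded from that first, failing read. */
static bool
page_ok_on_retry(char *page, BlockNumber blkno, XLogRecPtr first_lsn)
{
    PageHeader  phdr = (PageHeader) page;
    XLogRecPtr  new_lsn = PageGetLSN(page);
    bool        checksum_ok = (pg_checksum_page(page, blkno) == phdr->pd_checksum);

    if (new_lsn <= first_lsn)
        return false;           /* LSN went backwards or didn't move: corrupt */
    if (new_lsn - first_lsn > MAX_LSN_ADVANCE)
        return false;           /* implausibly large advance: treat as corrupt */
    return checksum_ok;         /* advanced plausibly and now verifies */
}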


> > A malicious attacker could easily trick these checks, but as Stephen pointed
> > out elsewhere they would likely make the checksums valid which would escape
> > detection anyway.
> >
> > We believe that the chances of random storage corruption passing all these
> > checks is incredibly small, but eventually we'll also check against the WAL
> > to be completely sure.
>
> The lack of check for any concurrent I/O on the follow-up retries is
> disturbing.  How do you guarantee that on the second retry what you
> have is a torn page and not something corrupted?  Init forks for
> example are made of up to 2 blocks, so the window would get short for
> at least those.  There are many instances with tables that have few
> pages as well.

Here I was more worried that the window might get *too long* if tables
are large :)

The risk is certainly that you get a torn page *again* on the second
read. It could be the same torn page (if it hasn't changed), but you
can detect that (by the fact that it hasn't actually changed) and
possibly do a short delay before trying again if it gets that far.
That could happen if the process is too quick. It could also be that
you are unlucky and that you hit a *new* write, and you were so
unlucky that both times it happened to hit exactly when you were
reading the page the next time. I'm not sure the chance of that
happening is even big enough we have to care about it, though?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: OpenSSL 3.0.0 compatibility

2020-11-26 Thread Michael Paquier
On Tue, Sep 29, 2020 at 12:25:05PM +0200, Daniel Gustafsson wrote:
> The attached adds config loading to pgcrypto for < 1.1.0 and a doc notice for
> enabling the legacy provider in 3.0.0.  This will require an alternative 
> output
> file for non-legacy configs, but that should wait until 3.0.0 is GA since the
> returned error messages have changed over course of development and may not be
> set in stone just yet.

FWIW, testing with 3.0.0-alpha9 dev (2d84089), I can see that the
error we have in our SSL tests when using a wrong password in the
private PEM key leads now to "PEM lib" instead of "bad decrypt".

Upthread, we had "nested asn1 error":
https://www.postgresql.org/message-id/9ce70af4-e1a0-4d24-86fa-4c3067077...@yesql.se
It looks like not everything is sorted out there yet.

pgcrypto is also throwing new errors.  Daniel, what if we set this
patch aside until upstream has sorted out their stuff?
--
Michael




Re: Enumize logical replication message actions

2020-11-26 Thread Ashutosh Bapat
On Thu, Nov 26, 2020 at 10:15 AM Amit Kapila 
wrote:

> On Wed, Nov 25, 2020 at 2:52 PM Amit Kapila 
> wrote:
> >
> > On Wed, Nov 25, 2020 at 2:26 PM Peter Smith 
> wrote:
> > >
> > > Hi Hackers.
> > >
> > > Last month there was a commit [1] for replacing logical replication
> > > message type characters with enums of equivalent values.
> > >
> > > I was revisiting this code recently and I think due to oversight that
> > > initial patch was incomplete. IIUC there are several more enum
> > > substitutions which should have been made.
> > >
> > > PSA my patch which adds those missing substitutions.
> > >
> >
> > Good catch. I'll review it in a day or so.
> >
>
> The patch looks good to me and I have pushed it.
>

Thanks Peter and Amit for noticing the missing substitutions and fixing
those.

--
Best Wishes,
Ashutosh


Re: [Proposal] Global temporary tables

2020-11-26 Thread 曾文旌

I found that the new Patch mail failed to register to Commitfest
https://commitfest.postgresql.org/28/2349/# 

I don't know what's wrong and how to check it?
Could you help me figure it out?



> On Nov 25, 2020 at 14:19, Pavel Stehule wrote:
> 
> 
> 
> On Mon, Nov 23, 2020 at 10:27, 曾文旌 wrote:
> 
> 
>> On Nov 21, 2020 at 02:28, Pavel Stehule wrote:
>> 
>> Hi
>> 
>> On Fri, Sep 11, 2020 at 17:00, 曾文旌 wrote:
>> I have written the README for the GTT, which contains the GTT requirements 
>> and design.
>> I found that compared to my first email a year ago, many GTT Limitations are 
>> now gone.
>> Now, I'm adding comments to some of the necessary functions.
>> 
>> There are problems with patching. Please, can you rebase your patch?
> Sure.
> I'm still working on sort code and comments.
> If you have any suggestions, please let me know.
> 
> It is broken again 
> 
> There is bad white space
> 
> +   /*
> +* For global temp table only
> +* use ShareUpdateExclusiveLock for ensure safety
> +*/
> +   {
> +   {
> +   "on_commit_delete_rows",
> +   "global temp table on commit options",
> +   RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
> +   ShareUpdateExclusiveLock
> +   },
> +   true
> +   },  <=
> /* list terminator */
> {{NULL}}
> 
> +7 OTHERS
> +Parallel query
> +Planner does not produce parallel query plans for SQL related to GTT. 
> Because <=
> +GTT private data cannot be accessed across processes.
> diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
> 
> 
> +/*
> + * Update global temp table relstats(relpage/reltuple/relallvisible) 
> <
> + * to local hashtable
> + */
> +void
> 
> +/*
> + * Search global temp table relstats(relpage/reltuple/relallvisible) 
> <==
> + * from lo
> 
> and there are lot of more places ...
> 
> I found other issue
> 
> postgres=# create global temp table foo(a int);
> CREATE TABLE
> postgres=# create index on foo(a);
> CREATE INDEX
> 
> close session and in new session
> 
> postgres=# reindex index foo_a_idx ;
> WARNING:  relcache reference leak: relation "foo" not closed
> REINDEX
> 
> Regards
> 
> Pavel
> 
> 
> 
> 
> Wenjing
> 
> 
>> 
>> Regards
>> 
>> Pavel
>> 
>> 
>> 
>> Wenjing
>> 
>> 
>> 
>> 
>> 
>> > On Jul 31, 2020 at 4:57 AM, Robert Haas wrote:
>> > 
>> > On Thu, Jul 30, 2020 at 8:09 AM wenjing zeng wrote:
>> >> Please continue to review the code.
>> > 
>> > This patch is pretty light on comments. Many of the new functions have
>> > no header comments, for example. There are comments here and there in
>> > the body of the new functions that are added, and in places where
>> > existing code is changed there are comments here and there, but
>> > overall it's not a whole lot. There's no documentation and no README,
>> > either. Since this adds a new feature and a bunch of new SQL-callable
>> > functions that interact with that feature, the feature itself should
>> > be documented, along with its limitations and the new SQL-callable
>> > functions that interact with it. I think there should be either a
>> > lengthy comment in some suitable file, or maybe various comments in
>> > various files, or else a README file, that clearly sets out the major
>> > design principles behind the patch, and explaining also what that
>> > means in terms of features and limitations. Without that, it's really
>> > hard for anyone to jump into reviewing this code, and it will be hard
>> > for people who have to maintain it in the future to understand it,
>> > either. Or for users, for that matter.
>> > 
>> > -- 
>> > Robert Haas
>> > EnterpriseDB: http://www.enterprisedb.com 
>> > The Enterprise PostgreSQL Company


