Re: 64-bit XIDs in deleted nbtree pages

2021-02-12 Thread Victor Yegorov
Sat, 13 Feb 2021 at 05:39, Masahiko Sawada :

> > (BTW, I've been using txid_current() for my own "laptop testing", as a
> > way to work around this issue.)
> >
> > * More generally, if you really can't do recycling of pages that you
> > deleted during the last VACUUM during this VACUUM (perhaps because of
> > the presence of a long-running xact that holds open a snapshot), then
> > you have lots of *huge* problems already, and this is the least of
> > your concerns. Besides, at that point an affected VACUUM will be doing
> > work for an affected index through a btbulkdelete() call, so the
> > behavior of _bt_vacuum_needs_cleanup() becomes irrelevant.
> >
>
> I agree that there already are huge problems in that case. But I think
> we need to consider an append-only case as well; after bulk deletion
> on an append-only table, vacuum deletes heap tuples and index tuples,
> marking some index pages as dead and setting an XID into btpo.xact.
> Since we trigger autovacuums even by insertions based on
> autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on
> the table again. But if there is a long-running query a "wasted"
> cleanup scan could happen many times depending on the values of
> autovacuum_vacuum_insert_scale_factor/threshold and
> vacuum_cleanup_index_scale_factor. This should not happen in the old
> code. I agree this is DBA problem but it also means this could bring
> another new problem in a long-running query case.
>

I'd like to outline one relevant case.

Quite often bulk deletes are done on time-series data (the oldest rows)
and effectively remove a contiguous chunk of data at the (physical)
beginning of the table; this is especially true for append-only tables.
After the delete, planning queries takes a long time, because the
MergeJoin estimates rely on IndexScans (see
https://postgr.es/m/17467.1426090...@sss.pgh.pa.us ).
Right now we have to disable MergeJoins via ALTER SYSTEM to mitigate
this.
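For reference, the mitigation amounts to something like:

  ALTER SYSTEM SET enable_mergejoin = off;
  SELECT pg_reload_conf();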

So I would, actually, like it very much for VACUUM to kick in sooner in
such cases.

-- 
Victor Yegorov


Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-02-12 Thread japin

Thanks for your review again.

On Wed, 10 Feb 2021 at 21:49, Bharath Rupireddy 
 wrote:
> On Fri, Feb 5, 2021 at 6:51 PM japin  wrote:
>> On Fri, 05 Feb 2021 at 17:50, Bharath Rupireddy 
>>  wrote:
>> We will get cell == NULL when we iterate all items in publist.  I use it
>> to check whether the dropped publication is in publist or not.
>>
>> > If you
>> > have a strong reasong retain this error errmsg("publication name
>> > \"%s\" do not in subscription", then there's a typo
>> > errmsg("publication name \"%s\" does not exists in subscription".
>>
>> Fixed.
>
> I think we still have a typo in 0002, it's
> + errmsg("publication name \"%s\" does not exist
> in subscription",
> instead of
> + errmsg("publication name \"%s\" does not exists
> in subscription",
>

Fixed.

> IIUC, with the current patch, the new ALTER SUBSCRIPTION ... ADD/DROP
> errors out on the first publication that already exists/that doesn't
> exist right? What if there are multiple publications given in the
> ADD/DROP list, and few of them exist/don't exist. Isn't it good if we
> loop over the subscription's publication list and show all the already
> existing/not existing publications in the error message, instead of
> just erroring out for the first existing/not existing publication?
>

Yes, you are right. I agree, and have modified it accordingly. Please
consider v5 for further review.
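To illustrate the approach (a sketch of the idea only, not the code in
v5 -- the function name and error wording here are made up):

/* Collect every publication that conflicts (already present on ADD,
 * missing on DROP) and report them in a single error.  Assumes the
 * usual subscriptioncmds.c includes (stringinfo, pg_list, value). */
static void
report_conflicting_publications(List *oldpublist, List *newpublist, bool isadd)
{
    StringInfoData conflicts;
    ListCell   *lc;

    initStringInfo(&conflicts);

    foreach(lc, newpublist)
    {
        bool    found = list_member(oldpublist, lfirst(lc));

        if (found == isadd)     /* duplicate on ADD, missing on DROP */
            appendStringInfo(&conflicts, "%s\"%s\"",
                             conflicts.len > 0 ? ", " : "",
                             strVal(lfirst(lc)));
    }

    if (conflicts.len > 0)
        ereport(ERROR,
                (errcode(isadd ? ERRCODE_DUPLICATE_OBJECT : ERRCODE_UNDEFINED_OBJECT),
                 errmsg("publication(s) %s %s in subscription",
                        conflicts.data,
                        isadd ? "already exist" : "do not exist")));
}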

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

From e70fcf88c42909b21784db6911d4d1bed3998c63 Mon Sep 17 00:00:00 2001
From: Japin Li 
Date: Fri, 5 Feb 2021 20:49:40 +0800
Subject: [PATCH v5 1/5] Export textarray_to_stringlist

---
 src/backend/catalog/pg_subscription.c | 3 +--
 src/include/catalog/pg_subscription.h | 2 ++
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index c32fc8137d..e01c22c604 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -34,7 +34,6 @@
 #include "utils/rel.h"
 #include "utils/syscache.h"
 
-static List *textarray_to_stringlist(ArrayType *textarray);
 
 /*
  * Fetch the subscription from the syscache.
@@ -210,7 +209,7 @@ get_subscription_name(Oid subid, bool missing_ok)
  *
  * Note: the resulting list of strings is pallocated here.
  */
-static List *
+List *
 textarray_to_stringlist(ArrayType *textarray)
 {
 	Datum	   *elems;
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efdf20..3fc558c7c5 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -21,6 +21,7 @@
 #include "catalog/pg_subscription_d.h"
 
 #include "nodes/pg_list.h"
+#include "utils/array.h"
 
 /* 
  *		pg_subscription definition. cpp turns this into
@@ -101,6 +102,7 @@ extern Subscription *GetSubscription(Oid subid, bool missing_ok);
 extern void FreeSubscription(Subscription *sub);
 extern Oid	get_subscription_oid(const char *subname, bool missing_ok);
 extern char *get_subscription_name(Oid subid, bool missing_ok);
+extern  List *textarray_to_stringlist(ArrayType *textarray);
 
 extern int	CountDBSubscriptions(Oid dbid);
 
-- 
2.25.1

From 5e943f92b67caaf7349143280c030d77f24748ff Mon Sep 17 00:00:00 2001
From: Japin Li 
Date: Sat, 13 Feb 2021 05:11:25 +
Subject: [PATCH v5 2/5] Introduce a new syntax to add/drop publications

At present, if we want to update the publications in a subscription, we
can use SET PUBLICATION; however, it requires supplying all existing
publications as well as the new ones when we want to add publications,
which is inconvenient.  The new syntax only requires supplying the new
publications.  When refresh is true, only the new publications are
refreshed.
---
 src/backend/commands/subscriptioncmds.c | 144 
 src/backend/parser/gram.y   |  20 
 src/include/nodes/parsenodes.h  |   2 +
 3 files changed, 166 insertions(+)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 5cf874e0b4..ad271deb64 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -47,6 +47,7 @@
 #include "utils/syscache.h"
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
+static List *merge_subpublications(HeapTuple tuple, List *newpublist, bool addpub);
 static void ReportSlotConnectionError(List *rstates, Oid subid, char *slotname, char *err);
 
 
@@ -962,6 +963,53 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 break;
 			}
 
+		case ALTER_SUBSCRIPTION_ADD_PUBLICATION:
+		case ALTER_SUBSCRIPTION_DROP_PUBLICATION:
+			{
+boolcopy_data = false;
+boolisadd = stmt->kind == ALTER_SUBSCRIPTION_ADD_PUBLICATION;
+boolrefresh;
+List   *publist = NIL;
+
+publist = merge_subpublications(tup, stmt->publication, isadd);
+
+

Re: Why do we have MakeSingleTupleTableSlot instead of not using MakeTupleTableSlot?

2021-02-12 Thread Tom Lane
Bharath Rupireddy  writes:
> Right, we could as well have an inline function. My point was that why
> do we need to wrap MakeTupleTableSlot inside MakeSingleTupleTableSlot
> which just does nothing.  As I said upthread, how about renaming
> MakeTupleTableSlot to
> MakeSingleTupleTableSlot which requires  minimal changes?

I'm disinclined to change this just to save one level of function call.
If you dig in the git history (see f92e8a4b5 in particular) you'll note
that the current version of MakeTupleTableSlot originated as code shared
between ExecAllocTableSlot and MakeSingleTupleTableSlot.  The fact that
the latter is currently just equivalent to that shared functionality is
something that happened later and might need to change again.

It is fair to wonder why execTuples.c exports MakeTupleTableSlot at
all, though.  ExecAllocTableSlot is supposed to be used by code that
expects ExecutorEnd to clean up the slot, while MakeSingleTupleTableSlot
is supposed to pair with ExecDropSingleTupleTableSlot.  Direct use of
MakeTupleTableSlot leaves one wondering who is holding the bag for
slot cleanup.  The external callers of it all look to be pretty new
code, so I wonder how carefully that's been thought through.

In short: I'm not okay with doing
s/MakeTupleTableSlot/MakeSingleTupleTableSlot/g in a patch that doesn't
also introduce matching ExecDropSingleTupleTableSlot calls (unless those
exist somewhere already; but where?).  If we did clean that up, maybe
MakeTupleTableSlot could become "static".  But I'd still be inclined to
keep it physically separate, leaving it to the compiler to decide whether
to inline it into the callers.

There's a separate question of whether any of the call sites that lack
cleanup support represent live resource-leak bugs.  I see that they all
use TTSOpsVirtual, so maybe that's a slot type that never holds any
interesting resources (like buffer pins).  If so, maybe the best thing is
to invent a wrapper "MakeVirtualTupleTableSlot" or the like, ensuring such
callers use a TupleTableSlotOps type that doesn't require cleanup.
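As a sketch only (such a wrapper does not exist today, and it assumes
TTSOpsVirtual really holds nothing that needs explicit cleanup):

/* requires postgres.h + executor/tuptable.h */
static inline TupleTableSlot *
MakeVirtualTupleTableSlot(TupleDesc tupdesc)
{
    /* virtual slots are assumed to hold no buffer pins or similar */
    return MakeTupleTableSlot(tupdesc, &TTSOpsVirtual);
}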

regards, tom lane




Re: 64-bit XIDs in deleted nbtree pages

2021-02-12 Thread Peter Geoghegan
On Fri, Feb 12, 2021 at 8:38 PM Masahiko Sawada  wrote:
> I agree that there already are huge problems in that case. But I think
> we need to consider an append-only case as well; after bulk deletion
> on an append-only table, vacuum deletes heap tuples and index tuples,
> marking some index pages as dead and setting an XID into btpo.xact.
> Since we trigger autovacuums even by insertions based on
> autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on
> the table again. But if there is a long-running query a "wasted"
> cleanup scan could happen many times depending on the values of
> autovacuum_vacuum_insert_scale_factor/threshold and
> vacuum_cleanup_index_scale_factor. This should not happen in the old
> code. I agree this is DBA problem but it also means this could bring
> another new problem in a long-running query case.

I see your point.

This will only not be a problem with the old code because the oldest
XID in the metapage happens to restrict VACUUM in what turns out to be
exactly perfect. But why assume that? It's actually rather unlikely
that we won't be able to free even one block, even in this scenario.
The oldest XID isn't truly special -- at least not without the
restrictions that go with 32-bit XIDs.

The other thing is that vacuum_cleanup_index_scale_factor is mostly
about limiting how long we'll go before having stale statistics, and
so presumably the user gets the benefit of not having stale statistics
(maybe that theory is a bit questionable in some cases, but that
doesn't have all that much to do with page deletion -- in fact the
problem exists without page deletion ever occurring).

BTW, I am thinking about making recycling take place for pages that
were deleted during the same VACUUM. We can just use a
work_mem-limited array to remember a list of blocks that are deleted
but not yet recyclable (plus the XID found in the block). At the end
of the VACUUM, (just before calling IndexFreeSpaceMapVacuum() from
within btvacuumscan()), we can then determine which blocks are now
safe to recycle, and recycle them after all using some "late" calls to
RecordFreeIndexPage() (and without revisiting the pages a second
time). No need to wait for the next VACUUM to recycle pages this way,
at least in many common cases. The reality is that it usually doesn't
take very long for a deleted page to become recyclable -- why wait?
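In rough terms, something like this at the end of btvacuumscan() (a
sketch only -- the struct and field names are made up, and the XID test
stands in for whatever "safe to recycle" check we settle on):

typedef struct BTPendingRecycle
{
    BlockNumber     blkno;      /* page we deleted in this VACUUM */
    TransactionId   btpo_xact;  /* XID we stamped into that page */
} BTPendingRecycle;

/* ... just before IndexFreeSpaceMapVacuum() in btvacuumscan(): ... */
for (int i = 0; i < vstate.npendingrecycle; i++)
{
    BTPendingRecycle *page = &vstate.pendingrecycle[i];

    /* recyclable if no remaining snapshot could still need the page */
    if (TransactionIdPrecedes(page->btpo_xact, safexid_horizon))
        RecordFreeIndexPage(rel, page->blkno);
}
IndexFreeSpaceMapVacuum(rel);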

This idea is enabled by commit c79f6df75dd from 2018. I think it's the
next logical step.

-- 
Peter Geoghegan




Re: 64-bit XIDs in deleted nbtree pages

2021-02-12 Thread Masahiko Sawada
On Thu, Feb 11, 2021 at 12:10 PM Peter Geoghegan  wrote:
>
> On Wed, Feb 10, 2021 at 2:20 AM Masahiko Sawada  wrote:
> > Thank you for working on this!
>
> I'm glad that I finally found time for it! It seems like it'll make
> things easier elsewhere.
>
> Attached is v3 of the patch. I'll describe the changes I made in more
> detail in my response to your points below.
>
> > I agree that btm_oldest_btpo_xact will no longer be necessary in terms
> > of recycling deleted pages.
>
> Cool.
>
> > Given that we can guarantee that deleted pages never be leaked by
> > using 64-bit XID, I also think we don't need this value. We can do
> > amvacuumcleanup only if the table receives enough insertions to update
> > the statistics (i.g., vacuum_cleanup_index_scale_factor check). I
> > think this is a more desirable behavior. Not skipping amvacuumcleanup
> > if there is even one deleted page that we can recycle is very
> > conservative.
> >
> > Considering your idea of keeping newly deleted pages in the meta page,
> > I can see a little value that keeping btm_oldest_btpo_xact and making
> > it 64-bit XID. I described below.
>
> > Interesting.
> >
> > I like this idea that triggers btvacuumscan() if there are many newly
> > deleted pages. I think this would be helpful especially for the case
> > of bulk-deletion on the table. But why we use the number of *newly*
> > deleted pages but not the total number of deleted pages in the index?
>
> I was unclear here -- I should not have said "newly deleted" pages at
> all. What I actually do when calling _bt_vacuum_needs_cleanup() is
> this (from v3, at the end of btvacuumscan()):
>
> -   _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
> +   Assert(stats->pages_deleted >= stats->pages_free);
> +   pages_deleted_not_free = stats->pages_deleted - stats->pages_free;
> +   _bt_update_meta_cleanup_info(rel, pages_deleted_not_free,
>  info->num_heap_tuples);
>
> We're actually passing something I have called
> "pages_deleted_not_free" here, which is derived from the bulk delete
> stats in the obvious way that you see here (subtraction). I'm not
> using pages_newly_deleted at all now. Note also that the behavior
> inside _bt_update_meta_cleanup_info() no longer varies based on
> whether it is called during btvacuumcleanup() or during btbulkdelete()
> -- the same rules apply either way. We want to store
> pages_deleted_not_free in the metapage at the end of btvacuumscan(),
> no matter what.
>
> This same pages_deleted_not_free information is now used by
> _bt_vacuum_needs_cleanup() in an obvious and simple way: if it's too
> high (over 2.5%), then that will trigger a call to btbulkdelete() (we
> won't skip scanning the index). Though in practice it probably won't
> come up that often -- there just aren't ever that many deleted pages
> in most indexes.

Thanks for your explanation. That makes sense to me.

>
> > IIUC if several btbulkdelete executions deleted index pages less than
> > 2% of the index and those deleted pages could not be recycled yet,
> > then the number of recyclable pages would exceed 2% of the index in
> > total but amvacuumcleanup() would not trigger btvacuumscan() because
> > the last newly deleted pages are less than the 2% threshold. I might
> > be missing something though.
>
> I think you're right -- my idea of varying the behavior of
> _bt_update_meta_cleanup_info() based on whether it's being called
> during btvacuumcleanup() or during btbulkdelete() was a bad idea (FWIW
> half the problem was that I explained the idea badly to begin with).
> But, as I said, it's fixed in v3: we simply pass
> "pages_deleted_not_free" as an argument to _bt_vacuum_needs_cleanup()
> now.
>
> Does that make sense? Does it address this concern?

Yes!

>
> > Also, we need to note that having newly deleted pages doesn't
> > necessarily mean these always are recyclable at that time. If the
> > global xmin is still older than deleted page's btpo.xact values, we
> > still could not recycle them. I think btm_oldest_btpo_xact probably
> > will help this case. That is, we store the oldest btpo.xact among
> > those newly deleted pages to btm_oldest_btpo_xact and we trigger
> > btvacuumscan() if there are many newly deleted pages (more than 2% of
> > index) and the btm_oldest_btpo_xact is older than the global xmin (I
> > suppose each newly deleted pages could have different btpo.xact).
>
> I agree that having no XID in the metapage creates a new small
> problem. Specifically, there are certain narrow cases that can cause
> confusion in _bt_vacuum_needs_cleanup(). These cases didn't really
> exist before my patch (kind of).
>
> The simplest example is easy to run into when debugging the patch on
> your laptop. Because you're using your personal laptop, and not a real
> production server, there will be no concurrent sessions that might
> consume XIDs. You can run VACUUM VERBOSE manually several times, but
> that alone will never be enough to enable VACUUM 

Re: Why do we have MakeSingleTupleTableSlot instead of not using MakeTupleTableSlot?

2021-02-12 Thread Bharath Rupireddy
On Fri, Feb 12, 2021 at 9:37 PM Zhihong Yu  wrote:
> On Fri, Feb 12, 2021 at 5:44 AM Bharath Rupireddy 
>  wrote:
>>
>> Hi,
>>
>> I wonder, is there a specific reason that MakeTupleTableSlot is
>> wrapped up in MakeSingleTupleTableSlot without doing anything than
>> just returning the slot created by MakeTupleTableSlot? Do we really
>> need MakeSingleTupleTableSlot? Can't we just use MakeTupleTableSlot
>> directly? Am I missing something?
>>
>> I think we can avoid some unnecessary function call costs, for
>> instance when called 1000 times inside table_slot_create from
>> copyfrom.c or in some other places where MakeSingleTupleTableSlot is
>> called in a loop.
>>
>> If it's okay to remove MakeSingleTupleTableSlot and use
>> MakeTupleTableSlot instead, we might have to change in a lot of
>> places. If we don't want to change in those many files, we could
>> rename MakeTupleTableSlot to MakeSingleTupleTableSlot and change it in
>> only a few places.
>>
>> Thoughts?
>
> MakeSingleTupleTableSlot can be defined as a macro, calling 
> MakeTupleTableSlot().

Right, we could as well have an inline function. My point was: why do
we need to wrap MakeTupleTableSlot inside MakeSingleTupleTableSlot when
the wrapper does nothing extra?  As I said upthread, how about renaming
MakeTupleTableSlot to MakeSingleTupleTableSlot, which requires minimal
changes?  Patch attached. Both make check and make check-world pass on
it. Please have a look.
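(For context, the wrapper in execTuples.c currently amounts to nothing
more than this -- the shape of it, not a verbatim copy:)

TupleTableSlot *
MakeSingleTupleTableSlot(TupleDesc tupdesc, const TupleTableSlotOps *tts_ops)
{
    /* pure pass-through today; pair with ExecDropSingleTupleTableSlot */
    return MakeTupleTableSlot(tupdesc, tts_ops);
}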

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v1-0001-Remove-unnecessary-wrapping-of-MakeTupleTableSlot.patch
Description: Binary data


Re: Possible dereference null return (src/backend/replication/logical/reorderbuffer.c)

2021-02-12 Thread Zhihong Yu
Hi,
How about the following patch ?

ReorderBufferSetBaseSnapshot() can return a bool to indicate whether the
base snapshot is set up.

For the call from SnapBuildCommitTxn(), it seems the xid is a top-level
transaction, so the return value doesn't need to be checked.

Cheers

On Fri, Feb 12, 2021 at 6:40 PM Michael Paquier  wrote:

> On Fri, Feb 12, 2021 at 03:56:02PM +0900, Kyotaro Horiguchi wrote:
> > If the return from the first call is a subtransaction, the second call
> > always obtains the top transaction.  If the top transaction actually did
> > not exist, it's rather the correct behavior to cause a SEGV than to
> > create a bogus rbtxn.  Thus it is wrong to set create=true and
> > create_as_top=true.  We could change the assertion like Assert(txn &&
> > txn->base_snapshot) to make things clearer.
>
> I don't see much point in changing this code.  The result would be
> the same: a PANIC at this location.
> --
> Michael
>


reorder-buffer-base-snapshot.patch
Description: Binary data


Re: [PoC] Non-volatile WAL buffer

2021-02-12 Thread Masahiko Sawada
On Thu, Jan 28, 2021 at 1:41 AM Tomas Vondra
 wrote:
>
> On 1/25/21 3:56 AM, Masahiko Sawada wrote:
> >>
> >> ...
> >>
> >> On 1/21/21 3:17 AM, Masahiko Sawada wrote:
> >>> ...
> >>>
> >>> While looking at the two methods: NTT and simple-no-buffer, I realized
> >>> that in XLogFlush(), NTT patch flushes (by pmem_flush() and
> >>> pmem_drain()) WAL without acquiring WALWriteLock whereas
> >>> simple-no-buffer patch acquires WALWriteLock to do that
> >>> (pmem_persist()). I wonder if this also affected the performance
> >>> differences between those two methods since WALWriteLock serializes
> >>> the operations. With PMEM, multiple backends can concurrently flush
> >>> the records if the memory region is not overlapped? If so, flushing
> >>> WAL without WALWriteLock would be a big benefit.
> >>>
> >>
> >> That's a very good question - it's quite possible the WALWriteLock is
> >> not really needed, because the processes are actually "writing" the WAL
> >> directly to PMEM. So it's a bit confusing, because it's only really
> >> concerned about making sure it's flushed.
> >>
> >> And yes, multiple processes certainly can write to PMEM at the same
> >> time, in fact it's a requirement to get good throughput I believe. My
> >> understanding is we need ~8 processes, at least that's what I heard from
> >> people with more PMEM experience.
> >
> > Thanks, that's good to know.
> >
> >>
> >> TBH I'm not convinced the code in the "simple-no-buffer" code (coming
> >> from the 0002 patch) is actually correct. Essentially, consider the
> >> backend needs to do a flush, but does not have a segment mapped. So it
> >> maps it and calls pmem_drain() on it.
> >>
> >> But does that actually flush anything? Does it properly flush changes
> >> done by other processes that may not have called pmem_drain() yet? I
> >> find this somewhat suspicious and I'd bet all processes that did write
> >> something have to call pmem_drain().
> >
> For the record, from what I learned / been told by engineers with PMEM
> experience, calling pmem_drain() should properly flush changes done by
> other processes. So it should be sufficient to do that in XLogFlush(),
> from a single process.
>
> My understanding is that we have about three challenges here:
>
> (a) we still need to track how far we flushed, so this needs to be
> protected by some lock anyway (although perhaps a much smaller section
> of the function)
>
> (b) pmem_drain() flushes all the changes, so it flushes even "future"
> part of the WAL after the requested LSN, which may negatively affects
> performance I guess. So I wonder if pmem_persist would be a better fit,
> as it allows specifying a range that should be persisted.
>
> (c) As mentioned before, PMEM behaves differently with concurrent
> access, i.e. it reaches peak throughput with relatively low number of
> threads writing data, and then the throughput drops quite quickly. I'm
> not sure if the same thing applies to pmem_drain() too - if it does, we
> may need something like we have for insertions, i.e. a handful of locks
> allowing limited number of concurrent inserts.

Thanks. That's a good summary.
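For what it's worth, a minimal sketch of the (b) variant, assuming a
single libpmem-mapped WAL region and ignoring wrap-around at the end of
the mapping:

#include <libpmem.h>
#include "access/xlogdefs.h"    /* XLogRecPtr */

/* Persist only the requested byte range instead of draining everything. */
static void
persist_wal_range(char *mapped_base, size_t mapped_size,
                  XLogRecPtr start, XLogRecPtr upto)
{
    size_t  offset = (size_t) (start % mapped_size);
    size_t  length = (size_t) (upto - start);

    pmem_persist(mapped_base + offset, length);
}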

>
>
> > Yeah, in terms of experiments at least it's good to find out that the
> > approach mmapping each WAL segment is not good at performance.
> >
> Right. The problem with small WAL segments seems to be that each mmap
> causes the TLB to be thrown away, which means a lot of expensive cache
> misses. As the mmap needs to be done by each backend writing WAL, this
> is particularly bad with small WAL segments. The NTT patch works around
> that by doing just a single mmap.
>
> I wonder if we could pre-allocate and mmap small segments, and keep them
> mapped and just rename the underlying files when recycling them. That'd
> keep the regular segment files, as expected by various tools, etc. The
> question is what would happen when we temporarily need more WAL, etc.
>
> >>>
> >>> ...
> >>>
> >>> I think the performance improvement by NTT patch with the 16MB WAL
> >>> segment, the most common WAL segment size, is very good (150437 vs.
> >>> 212410 with 64 clients). But maybe evaluating writing WAL segment
> >>> files on PMEM DAX filesystem is also worth, as you mentioned, if we
> >>> don't do that yet.
> >>>
> >>
> >> Well, not sure. I think the question is still open whether it's actually
> >> safe to run on DAX, which does not have atomic writes of 512B sectors,
> >> and I think we rely on that e.g. for pg_config. But maybe for WAL that's
> >> not an issue.
> >
> > I think we can use the Block Translation Table (BTT) driver that
> > provides atomic sector updates.
> >
>
> But we have benchmarked that, see my message from 2020/11/26, which
> shows this table:
>
>        master/btt   master/dax      ntt   simple
>  ------------------------------------------------
>   1          5469         7402     7977     6746
>  16         48222        80869   107025    82343
>  32

Re: [DOC] add missing "[ NO ]" to various "DEPENDS ON" synopses

2021-02-12 Thread Michael Paquier
On Fri, Feb 12, 2021 at 10:32:14AM +0900, Ian Lawrence Barwick wrote:
> In the documentation, the "[ NO ]" option is listed in the synopsis for
> ALTER TRIGGER and ALTER FUNCTION, but not the others.
> Trivial patch attached.

There are two flavors to cover for 6 commands per gram.y, and you are
covering all of them.  So this looks good to me.  I'll apply and
backpatch in a bit.  It is worth noting that tab-complete.c does a bad
job in completing those clauses.
--
Michael


signature.asc
Description: PGP signature


Re: Possible dereference null return (src/backend/replication/logical/reorderbuffer.c)

2021-02-12 Thread Michael Paquier
On Fri, Feb 12, 2021 at 03:56:02PM +0900, Kyotaro Horiguchi wrote:
> If the return from the first call is a subtransaction, the second call
> always obtains the top transaction.  If the top transaction actually did
> not exist, it's rather the correct behavior to cause a SEGV than to
> create a bogus rbtxn.  Thus it is wrong to set create=true and
> create_as_top=true.  We could change the assertion like Assert(txn &&
> txn->base_snapshot) to make things clearer.

I don't see much point in changing this code.  The result would be
the same: a PANIC at this location.
--
Michael


signature.asc
Description: PGP signature


Re: Tightening up allowed custom GUC names

2021-02-12 Thread Michael Paquier
On Thu, Feb 11, 2021 at 02:50:13PM -0500, Robert Haas wrote:
> +1 for not back-patching whatever we do here.

+1.
--
Michael


signature.asc
Description: PGP signature


Re: some pointless HeapTupleHeaderIndicatesMovedPartitions calls

2021-02-12 Thread Michael Paquier
On Fri, Feb 12, 2021 at 04:42:26PM +, Anastasia Lubennikova wrote:
> I wonder why this patch has been hanging around in the commitfest for so long.
> The idea of the fix is clear, the code is correct and all tests pass, so I
> move it to ReadyForCommitter status.
> 
> The new status of this patch is: Ready for Committer

So that's this patch: https://commitfest.postgresql.org/32/2941/.
Alvaro is most likely going to take care of that, so let's wait for
him.
--
Michael


signature.asc
Description: PGP signature


Re: pg_cryptohash_final possible out-of-bounds access (per Coverity)

2021-02-12 Thread Michael Paquier
On Fri, Feb 12, 2021 at 03:21:40PM +0900, Kyotaro Horiguchi wrote:
> The v3 drops the changes of the uuid_ossp contrib.  I'm not sure the
> change of scram_HMAC_final is needed.

Meaning that v3 would fail to compile uuid-ossp.  v3 also produces
compilation warnings in auth-scram.c.

> In v2, int_md5_finish() calls pg_cryptohash_final() with
> h->block_size(h) (64) but it should be h->result_size(h)
> (16). int_sha1_finish() is wrong the same way. (and, v3 seems fixing
> them in the wrong way.)

Right.  These should just use h->result_size(h), and not
h->block_size(h).

-extern int scram_HMAC_final(uint8 *result, scram_HMAC_ctx *ctx);
+extern int scram_HMAC_final(scram_HMAC_ctx *ctx, uint8 *result);
There is no point in this change.  You just make back-patching harder
while doing nothing about the problem at hand.

-   if (pg_cryptohash_final(manifest->manifest_ctx, checksumbuf) < 0)
+   if (pg_cryptohash_final(manifest->manifest_ctx, checksumbuf,
+   PG_SHA256_DIGEST_LENGTH) < 0)
Here this could just use sizeof(checksumbuf)?  This pattern could be
used elsewhere as well, like in md5_common.c.
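Something like this, I mean (sketch; the error message wording is just
illustrative):

    uint8   checksumbuf[PG_SHA256_DIGEST_LENGTH];

    if (pg_cryptohash_final(manifest->manifest_ctx, checksumbuf,
                            sizeof(checksumbuf)) < 0)
        elog(ERROR, "could not finalize checksum of backup manifest");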

> Although I don't oppose to make things defensive, I think the derived
> interfaces should be defensive in the same extent if we do. Especially
> the calls to the function in checksum_helper.c is just nullifying the
> protection.

The checksum stuff just relies on PG_CHECKSUM_MAX_LENGTH and there are
already static assertions used as sanity checks, so I see little point
in adding a new argument that would be just PG_CHECKSUM_MAX_LENGTH.
This backup checksum code is already very specific, and it is not
intended for uses as generic as the cryptohash functions.  With such a
change, my guess is that it becomes really easy to miss that
pg_checksum_final() has to return the size of the digest result, and
not the maximum buffer size allocation.  Perhaps one thing this part
could do is just to save the digest length in a variable and use it
for retval and the third argument of pg_cryptohash_final(), but the
impact looks limited.

> Total 9/19 places.  I think at least pg_checksum_final() should take
> the buffer length.  I'm not sure about px_md_finish() and
> hmac_md_finish()..

I guess that you mean px_hmac_finish() for the second one.  The first
one is tied to passing down result_size() and down to the cryptohash
functions, meaning that there is no need to worry about it more than
that IMO.  The second one would be tied to the HMAC refactoring.  This
would be valuable in the case of pgcrypto when building with OpenSSL,
meaning that the code would go through the defenses put in place at
the PG level.
--
Michael


signature.asc
Description: PGP signature


Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-12 Thread John Naylor
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas  wrote:
>
> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>  mixed | ascii
> -------+-------
>    447 |    46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar your patch: you load the data 8 bytes at a time, check if
> there are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one, otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.

Attached is a patch that does roughly what simdjson fallback did, except I
use straight tests on the bytes and only calculate code points in assertion
builds. In the course of doing this, I found that my earlier concerns about
putting the ascii check in a static inline function were due to my
suboptimal loop implementation. I had assumed that if the chunked ascii
check failed, it had to check all those bytes one at a time. As it turns
out, that's a waste of the branch predictor. In the v2 patch, we do the
chunked ascii check every time we loop. With that, I can also confirm the
claim in the Lemire paper that it's better to do the check on 16-byte
chunks:

(MacOS, Clang 10)

master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

v2 patch, with 16-byte stride:

 chinese | mixed | ascii
---------+-------+-------
     806 |   474 |    83

patch but with 8-byte stride:

 chinese | mixed | ascii
---------+-------+-------
     792 |   490 |   105

I also included the fast path in all other multibyte encodings, and that is
also pretty good performance-wise. It regresses from master on pure
multibyte input, but that case is still faster than PG13, which I simulated
by reverting 6c5576075b0f9 and b80e10638e3:

~PG13:

 chinese | mixed | ascii
---------+-------+-------
    1565 |   848 |   365

ascii fast-path plus pg_*_verifychar():

 chinese | mixed | ascii
---------+-------+-------
    1279 |   656 |    94
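(For reference, the chunked ascii check is along these lines -- a
sketch rather than the exact patch code; the real thing also has to
reject embedded NUL bytes:)

/* true if the 16 bytes at s are all ASCII, i.e. no high bits set */
static inline bool
is_ascii_16(const unsigned char *s)
{
    uint64  chunk1;
    uint64  chunk2;

    memcpy(&chunk1, s, sizeof(uint64));
    memcpy(&chunk2, s + 8, sizeof(uint64));

    return ((chunk1 | chunk2) & UINT64CONST(0x8080808080808080)) == 0;
}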


v2 has a rough start to having multiple implementations in
src/backend/port. Next steps are:

1. Add more tests for utf-8 coverage (in addition to the ones to be added
by the noError argument patch)
2. Add SSE4 validator -- it turns out the demo I referred to earlier
doesn't match the algorithm in the paper. I plan to only copy the lookup
tables from simdjson verbatim, but the code will basically be written from
scratch, using simdjson as a hint.
3. Adjust configure.ac

--
John Naylor
EDB: http://www.enterprisedb.com


v2-add-portability-stub-and-new-fallback.patch
Description: Binary data


Re: Improvements and additions to COPY progress reporting

2021-02-12 Thread Justin Pryzby
--- a/doc/src/sgml/ref/analyze.sgml
+++ b/doc/src/sgml/ref/analyze.sgml
@@ -273,6 +273,12 @@ ANALYZE [ VERBOSE ] [ table_and_columns
+
+    Each backend running the ANALYZE command will report their
+    progress to the pg_stat_progress_analyze view.
+    See  for details.
+

I think this should say:

"..will report its progress to.."

Or:

"The progress of each backend running >ANALYZE< is reported in the
>pg_stat_progress_analyze< view."

-- 
Justin




Re: Detecting pointer misalignment (was Re: pgsql: Implementation of subscripting for jsonb)

2021-02-12 Thread Tom Lane
Alexander Korotkov  writes:
> On Fri, Feb 12, 2021 at 8:19 PM Tom Lane  wrote:
>> So maybe it'd better be __GNUC__ >= 6 not __GNUC__ >= 5.  I think
>> we can wait a little bit for more reports before messing with that,
>> though.

> I've rechecked this in the documentation. no_sanitize attribute seems
> to appear since gcc 8.0.  Much later than alignment sanitizer itself.

Yeah, I'd just come to that conclusion from scraping the buildfarm
logs.  Good to see it confirmed in the manual though.

>> Once this does settle, should we consider back-patching so that it's
>> possible to run alignment checks in the back branches too?

> +1

Let's make sure we have a clean set of builds and then do that.

regards, tom lane




Re: Detecting pointer misalignment (was Re: pgsql: Implementation of subscripting for jsonb)

2021-02-12 Thread Tom Lane
I wrote:
> Looking around at other recent reports, it looks like we'll need to tweak
> the compiler version cutoffs a bit.  I see for instance that spurfowl,
> with gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609, is whining:
> ...
> So maybe it'd better be __GNUC__ >= 6 not __GNUC__ >= 5.  I think
> we can wait a little bit for more reports before messing with that,
> though.

Further reports show that gcc 6.x and 7.x also produce warnings,
so I moved the cutoff up to 8.  Hopefully that's good enough.
We could write a configure test instead, but I'd just as soon not
expend configure cycles on this.
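The guard ends up looking roughly like this (a sketch; the macro name
just follows the pg_attribute_* convention in c.h, and the real test
also has to consider clang):

#if defined(__GNUC__) && __GNUC__ >= 8
#define pg_attribute_no_sanitize_alignment() \
    __attribute__((no_sanitize("alignment")))
#else
#define pg_attribute_no_sanitize_alignment()
#endif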

regards, tom lane




Re: PostgreSQL <-> Babelfish integration

2021-02-12 Thread Adam Brusselback
Just wanted to link to the discussion on this from HN for anyone intersted:
https://news.ycombinator.com/item?id=26114281


Re: Detecting pointer misalignment (was Re: pgsql: Implementation of subscripting for jsonb)

2021-02-12 Thread Alexander Korotkov
On Fri, Feb 12, 2021 at 8:19 PM Tom Lane  wrote:
> I've updated buildfarm member longfin to use "-fsanitize=alignment
> -fsanitize-trap=alignment", and it just got through a run successfully
> with that.  It'd be good perhaps if some other buildfarm owners
> followed suit (mumble JIT coverage mumble).
>
> Looking around at other recent reports, it looks like we'll need to tweak
> the compiler version cutoffs a bit.  I see for instance that spurfowl,
> with gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609, is whining:
>
> pg_crc32c_sse42.c:24:1: warning: \342\200\230no_sanitize\342\200\231 
> attribute directive ignored [-Wattributes]
>
> So maybe it'd better be __GNUC__ >= 6 not __GNUC__ >= 5.  I think
> we can wait a little bit for more reports before messing with that,
> though.

I've rechecked this in the documentation. no_sanitize attribute seems
to appear since gcc 8.0.  Much later than alignment sanitizer itself.
https://gcc.gnu.org/gcc-8/changes.html
"A new attribute no_sanitize can be applied to functions to instruct
the compiler not to do sanitization of the options provided as
arguments to the attribute. Acceptable values for no_sanitize match
those acceptable by the -fsanitize command-line option."

Yes, let's wait for more feedback from buildfarm and fix the version
requirement.

> Once this does settle, should we consider back-patching so that it's
> possible to run alignment checks in the back branches too?

+1

--
Regards,
Alexander Korotkov




Re: Preventing free space from being reused

2021-02-12 Thread Noah Bergbauer
> (My reaction to your previous thread was that it was simply a question
> of blindly insisting on using BRIN indexes for a case that they're quite
> badly adapted to.  The better answer is to not use BRIN.)

Apologies, perhaps I am completely misunderstanding the motivation for BRIN?

From the docs:
>BRIN is designed for handling very large tables in which certain columns
have some natural correlation with their physical location within the table.
>[...]
>a table storing a store's sale orders might have a date column on which
each order was placed, and most of the time the entries for earlier orders
will appear earlier in the table

My table is very large, and the column in question has a strong natural
correlation with each tuple's physical location. It is, in fact, a date
column where entries with earlier timestamps will appear earlier in the
table. To be honest, if this isn't a use case for BRIN, then I don't know
what is. The only exception to this is a small proportion of tuples which
are slotted into random older pages due to their small size.

A btree index on the same column is 700x the size of the BRIN index, or 10%
of the relation itself. It does not perform significantly better than BRIN. The
issue here is twofold: not only does slotting these tuples into older pages
significantly reduce the effectiveness of BRIN, it also causes
fragmentation on disk. Ultimately, this is why CLUSTER exists. One way to
look at this situation is that my data is inserted exactly in index order,
but Postgres keeps un-clustering it for reasons that are valid in general
(don't waste disk space) but don't apply at all in this case (the file
system uses compression, no space is wasted).

Any alternative ideas would of course be much appreciated! But at the
moment HEAP_INSERT_SKIP_FSM seems like the most practical solution to me.




On Fri, Feb 12, 2021 at 10:43 PM Tom Lane  wrote:

> Noah Bergbauer  writes:
> > I am working on a project where I do not want Postgres to reuse free
> space
> > in old pages (see
> >
> https://www.postgresql.org/message-id/flat/CABjy%2BRhbFu_Hs8ZEiOzaPaJSGB9jqFF0gDU5gtwCLiurG3NLjQ%40mail.gmail.com
> > for details). I found that the HEAP_INSERT_SKIP_FSM flag accomplishes
> this.
> > For a long-term solution I see two options:
> > 1. Introduce a reloption for this.
> > 2. Implement it as a custom table access method in an extension.
>
> TBH, I can't believe that this is actually a good idea.  If we introduce
> a reloption that does that, we'll just be getting users complaining about
> table bloat ... but probably only after they get to a state where it's
> going to be horribly painful to get out of.
>
> (My reaction to your previous thread was that it was simply a question
> of blindly insisting on using BRIN indexes for a case that they're quite
> badly adapted to.  The better answer is to not use BRIN.)
>
> regards, tom lane
>


Re: Preventing free space from being reused

2021-02-12 Thread Tom Lane
Noah Bergbauer  writes:
> I am working on a project where I do not want Postgres to reuse free space
> in old pages (see
> https://www.postgresql.org/message-id/flat/CABjy%2BRhbFu_Hs8ZEiOzaPaJSGB9jqFF0gDU5gtwCLiurG3NLjQ%40mail.gmail.com
> for details). I found that the HEAP_INSERT_SKIP_FSM flag accomplishes this.
> For a long-term solution I see two options:
> 1. Introduce a reloption for this.
> 2. Implement it as a custom table access method in an extension.

TBH, I can't believe that this is actually a good idea.  If we introduce
a reloption that does that, we'll just be getting users complaining about
table bloat ... but probably only after they get to a state where it's
going to be horribly painful to get out of.

(My reaction to your previous thread was that it was simply a question
of blindly insisting on using BRIN indexes for a case that they're quite
badly adapted to.  The better answer is to not use BRIN.)

regards, tom lane




Experimenting with redo batching

2021-02-12 Thread Thomas Munro
Hello,

As a very simple exploration of the possible gains from batching redo
records during replay, I tried to avoid acquiring and releasing
buffers pins and locks while replaying records that touch the same
page as the previous record.  The attached experiment-grade patch
works by trying to give a locked buffer to the next redo handler,
which then releases it if it wants a different buffer.  Crash recovery
on my dev machine went from 62s to 34s (1.8x speedup) for:

  create table t (i int, foo text);
  insert into t select generate_series(1, 5000), 'the quick brown
fox jumped over the lazy dog';
  delete from t;

Of course that workload was contrived to produce a suitable WAL
history for this demo.  The patch doesn't help more common histories
from the real world, involving (non-HOT) UPDATEs and indexes etc,
because then you have various kinds of interleaving that defeat this
simple-minded optimisation.  To get a more general improvement, it
seems that we'd need a smarter redo loop that could figure out what
can safely be reordered to maximise the page-level batching and
locality effects.  I haven't studied the complications of reordering
yet, and I'm not working on that for PostgreSQL 14, but I wanted to
see if others have thoughts about it.  The WAL prefetching patch that
I am planning to get into 14 opens up these possibilities by decoding
many records into a circular WAL decode buffer, so you can see a whole
chain of them at once.
From b481d7727d1732705b85e5935887ff4404eb014b Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 12 Feb 2021 01:08:04 +1300
Subject: [PATCH] Try to hold onto buffers between WAL records.

When replaying a sequence of redo records that touch the same page, try
to avoid having to look up, pin, lock, unlock and unpin the buffer each
time by holding onto one buffer between records.  This aims to speed up
replay of bulk operations.

XXX Are there really many interesting workload where records are
adjacent without (for example) interleaving Heap and Btree records?
Maybe for this to pay off for interleaving cases you might need to look
ahead and find suitable records, and replay them out of order!

XXX There is probably a better word than "keep"

WIP experiment; may contain nuts
---
 src/backend/access/heap/heapam.c| 12 ++--
 src/backend/access/nbtree/nbtxlog.c |  8 +--
 src/backend/access/transam/xlog.c   |  6 ++
 src/backend/access/transam/xlogreader.c |  2 +
 src/backend/access/transam/xlogutils.c  | 80 ++---
 src/backend/catalog/storage.c   |  3 +
 src/include/access/xlogreader.h |  3 +
 src/include/access/xlogutils.h  |  1 +
 8 files changed, 96 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9926e2bd54..4750c8ba2d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -8448,7 +8448,7 @@ heap_xlog_clean(XLogReaderState *record)
 	{
 		Size		freespace = PageGetHeapFreeSpace(BufferGetPage(buffer));
 
-		UnlockReleaseBuffer(buffer);
+		XLogKeepBufferForRedo(record, buffer);
 
 		/*
 		 * After cleaning records from a page, it's useful to update the FSM
@@ -8660,7 +8660,7 @@ heap_xlog_freeze_page(XLogReaderState *record)
 		MarkBufferDirty(buffer);
 	}
 	if (BufferIsValid(buffer))
-		UnlockReleaseBuffer(buffer);
+		XLogKeepBufferForRedo(record, buffer);
 }
 
 /*
@@ -8760,7 +8760,7 @@ heap_xlog_delete(XLogReaderState *record)
 		MarkBufferDirty(buffer);
 	}
 	if (BufferIsValid(buffer))
-		UnlockReleaseBuffer(buffer);
+		XLogKeepBufferForRedo(record, buffer);
 }
 
 static void
@@ -8869,7 +8869,7 @@ heap_xlog_insert(XLogReaderState *record)
 		MarkBufferDirty(buffer);
 	}
 	if (BufferIsValid(buffer))
-		UnlockReleaseBuffer(buffer);
+		XLogKeepBufferForRedo(record, buffer);
 
 	/*
 	 * If the page is running low on free space, update the FSM as well.
@@ -9410,7 +9410,7 @@ heap_xlog_lock(XLogReaderState *record)
 		MarkBufferDirty(buffer);
 	}
 	if (BufferIsValid(buffer))
-		UnlockReleaseBuffer(buffer);
+		XLogKeepBufferForRedo(record, buffer);
 }
 
 static void
@@ -9511,7 +9511,7 @@ heap_xlog_inplace(XLogReaderState *record)
 		MarkBufferDirty(buffer);
 	}
 	if (BufferIsValid(buffer))
-		UnlockReleaseBuffer(buffer);
+		XLogKeepBufferForRedo(record, buffer);
 }
 
 void
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c1d578cc01..4025f97227 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -237,7 +237,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, bool posting,
 		MarkBufferDirty(buffer);
 	}
 	if (BufferIsValid(buffer))
-		UnlockReleaseBuffer(buffer);
+		XLogKeepBufferForRedo(record, buffer);
 
 	/*
 	 * Note: in normal operation, we'd update the metapage while still holding
@@ -553,7 +553,7 @@ btree_xlog_dedup(XLogReaderState *record)
 	}
 
 	if (BufferIsValid(buf))
-		UnlockReleaseBuffer(buf);
+		

Preventing free space from being reused

2021-02-12 Thread Noah Bergbauer
Hello,

I am working on a project where I do not want Postgres to reuse free space
in old pages (see
https://www.postgresql.org/message-id/flat/CABjy%2BRhbFu_Hs8ZEiOzaPaJSGB9jqFF0gDU5gtwCLiurG3NLjQ%40mail.gmail.com
for details). I found that the HEAP_INSERT_SKIP_FSM flag accomplishes this.
For a long-term solution I see two options:

1. Introduce a reloption for this.
2. Implement it as a custom table access method in an extension.

As an experiment, I have created an extension which forwards all table
access functions to the builtin heap access method, but enables the
HEAP_INSERT_SKIP_FSM flag for heap_insert and heap_multi_insert. However,
the check in heap_getnext (
https://github.com/postgres/postgres/blob/REL_12_5/src/backend/access/heap/heapam.c#L1294-L1304)
is a showstopper. Because the custom access method uses a different
function table (so that I can override heap_insert and heap_multi_insert),
heap_getnext errors out with "only heap AM is supported". I am currently
hacking around this problem by duplicating all code up to and including
heap_getnext, with this check commented out. Clearly this is not ideal, as
changes to the heap code in future updates might cause incompatibilities.
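(For reference, the forwarding experiment is roughly along these lines
-- a trimmed sketch with made-up names, showing only tuple_insert:)

#include "postgres.h"

#include "access/heapam.h"
#include "access/tableam.h"
#include "executor/tuptable.h"
#include "fmgr.h"
#include "utils/rel.h"

PG_MODULE_MAGIC;

static TableAmRoutine noreuse_methods;

/* Forward to heap's insert routine, forcing the FSM to be skipped. */
static void
noreuse_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
                     int options, BulkInsertState bistate)
{
    const TableAmRoutine *heap = GetHeapamTableAmRoutine();

    heap->tuple_insert(rel, slot, cid, options | HEAP_INSERT_SKIP_FSM, bistate);
}

PG_FUNCTION_INFO_V1(noreuse_tableam_handler);

Datum
noreuse_tableam_handler(PG_FUNCTION_ARGS)
{
    /* copy heap's routine and override just the insert paths */
    noreuse_methods = *GetHeapamTableAmRoutine();
    noreuse_methods.tuple_insert = noreuse_tuple_insert;

    PG_RETURN_POINTER(&noreuse_methods);
}

The heap_getnext check then errors out precisely because rd_tableam no
longer points at heap's own routine.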

Any ideas on how to proceed with this issue?

Thank you,
Noah Bergbauer


Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2021-02-12 Thread Joel Jacobson
Hi Mark,

On Fri, Feb 12, 2021, at 20:56, Mark Rofail wrote:
>Indeed you are right, to support the correct behaviour we have to use 
>@>>(anycompatiblearray, anycompatiblenonarry) and >this throws a sanity error 
>in opr_santiy since the left operand doesn't equal the gin opclass which is 
>anyarray. I am thinking >to solve this we need to add a new opclass under gin 
>"compatible_array_ops" beside the already existing "array_ops", >what do you 
>think?

I'm afraid I have no idea. I don't really understand how these 
"anycompatible"-types work, I only knew of "anyarray" and "anyelement" until 
recently. I will study these in detail to get a better understanding. But 
perhaps you could just explain a quick question first:

Why couldn't/shouldn't @>> and <<@ be operating on anyarray and anyelement?
This would seem more natural to me since the Array Operators versions of @> and 
<@ operate on anyarray.

/Joel

Re: [BUG] Autovacuum not dynamically decreasing cost_limit and cost_delay

2021-02-12 Thread David Zhang

Thanks for the patch, Mead.

For 'MAXVACUUMCOSTLIMIT", it would be nice to follow the current GUC 
pattern to do define a constant.


For example, the constant "MAX_KILOBYTES" is defined in guc.h, with a 
pattern like, "MAX_" to make it easy to read.
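Something like the following, for instance (the name is only a
suggestion):

/* upper bound for vacuum_cost_limit, mirroring the GUC's maximum */
#define MAX_VACUUM_COST_LIMIT   10000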


Best regards,

David

On 2021-02-08 6:48 a.m., Mead, Scott wrote:

Hello,
   I recently looked at what it would take to make a running
autovacuum pick up a change to either cost_delay or cost_limit.  Users
frequently will have a conservative value set, and then wish to change
it when autovacuum initiates a freeze on a relation.  Most users end
up finding out they are in 'to prevent wraparound' mode after it has
happened; this means that if they want the vacuum to take advantage of
more I/O, they need to stop and then restart the currently running
vacuum (after reloading the GUCs).
  Initially, my goal was to determine feasibility for making this 
dynamic.  I added debug code to vacuum.c:vacuum_delay_point(void) and 
found that changes to cost_delay and cost_limit are already processed 
by a running vacuum.  There was a bug preventing the cost_delay or 
cost_limit from being configured to allow higher throughput however.
I believe this is a bug because currently, autovacuum will dynamically 
detect and increase the cost_limit or cost_delay, but it can never
decrease those values beyond their setting when the vacuum began.  The 
current behavior is for vacuum to limit the maximum throughput of 
currently running vacuum processes to the cost_limit that was set when 
the vacuum process began.
I changed this (see attached) to allow the cost_limit to be 
re-calculated up to the maximum allowable (currently 10,000).  This 
has the effect of allowing users to reload a configuration change and 
an in-progress vacuum can be ‘sped-up’ by setting either the 
cost_limit or cost_delay.

The problematic piece is:
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index c6ec657a93..d3c6b0d805 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1834,7 +1834,7 @@ autovac_balance_cost(void)
  * cost_limit to more than the base value.
  */
  worker->wi_cost_limit = Max(Min(limit,
- worker->wi_cost_limit_base),
+ MAXVACUUMCOSTLIMIT),
  1);
 }
We limit the worker to the max cost_limit that was set at the
beginning of the vacuum.  I introduced the MAXVACUUMCOSTLIMIT constant
(currently defined to 10,000, which is the current max limit already
defined) in miscadmin.h so that vacuum will now be able to adjust the
cost_limit up to 10,000 as the upper limit in a currently running vacuum.


The tests that I’ve run show that the performance of an existing 
vacuum can be increased commensurate with the parameter change. 
 Interestingly, /autovac_balance_cost(void) /is only updating the 
cost_limit, even if the cost_delay is modified.  This is done 
correctly, it was just a surprise to see the behavior.



2021-02-01 13:36:52.346 EST [37891] DEBUG:  VACUUM Sleep: Delay: 
20.00, CostBalance: 207, CostLimit: 200, msec: 20.70
2021-02-01 13:36:52.346 EST [37891] CONTEXT:  while scanning block 
1824 of relation "public.blah"
2021-02-01 13:36:52.362 EST [36460] LOG:  received SIGHUP, reloading 
configuration files

2021-02-01 13:36:52.364 EST [36460] LOG:  parameter
"autovacuum_vacuum_cost_delay" changed to "2"

2021-02-01 13:36:52.365 EST [36463] DEBUG:  checkpointer updated 
shared memory configuration values
2021-02-01 13:36:52.366 EST [36466] DEBUG: 
 autovac_balance_cost(pid=37891 db=13207, rel=16384, dobalance=yes 
cost_limit=2000, cost_limit_base=200, cost_delay=20)


2021-02-01 13:36:52.366 EST [36467] DEBUG:  received inquiry for 
database 0
2021-02-01 13:36:52.366 EST [36467] DEBUG:  writing stats file 
"pg_stat_tmp/global.stat"
2021-02-01 13:36:52.366 EST [36467] DEBUG:  writing stats file 
"pg_stat_tmp/db_0.stat"
2021-02-01 13:36:52.388 EST [37891] DEBUG:  VACUUM Sleep: Delay: 
20.00, CostBalance: 2001, CostLimit: 2000, msec: 20.01



--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca


Re: PostgreSQL <-> Babelfish integration

2021-02-12 Thread Peter Geoghegan
On Fri, Feb 12, 2021 at 11:13 AM Matthias van de Meent
 wrote:
> I agree. I believe that Babelfish's efforts can be compared with the
> zedstore and zheap efforts: they require work in core before they can
> be integrated or added as an extension that could replace the normal
> heap tableam, and while core is being prepared we can discover what
> can and cannot be prepared in core for this new feature.

I see what you mean, but even that seems generous to me, since, as I
said, we don't have any Babelfish code to evaluate today. Whereas
Zedstore and zheap can actually be downloaded and tested.

-- 
Peter Geoghegan




Re: PostgreSQL <-> Babelfish integration

2021-02-12 Thread Peter Geoghegan
On Fri, Feb 12, 2021 at 11:04 AM Álvaro Hernández  wrote:
> I'm sure if we embrace an open and honest conversation, we will be
> able to figure out what the integration costs are even before the source
> code gets published. As I said, this goes beyond the very technical
> detail of source code integration.

Perhaps that's true in one sense, but if the cost of integrating
Babelfish is prohibitive then it's still not going to go anywhere. If
it did happen then that would certainly involve at least one or two
senior community members that personally adopt it. That's our model
for everything, to some degree because there is no other way that it
could work. It's very bottom-up.

For better or worse, very high level discussion like this has always
followed from code, not the other way around. We don't really have the
ability or experience to do it any other way IMO.

> Waiting until the source code is
> published is a bit chicken-and-egg (as I presume the source will morph
> towards convergence if there's work that may be started, even if it is
> just for example for protocol extensibility).

Well, the priorities of Postgres development are not set in any fixed
way (except to the limited extent that you're on the hook for anything
you integrate that breaks). I myself am not convinced that this is
worth spending any time on right now, especially given the lack of
code to evaluate.

-- 
Peter Geoghegan




Re: PostgreSQL <-> Babelfish integration

2021-02-12 Thread Matthias van de Meent
On Fri, 12 Feb 2021 at 19:44, Peter Geoghegan  wrote:
>
> On Fri, Feb 12, 2021 at 10:26 AM Álvaro Hernández  wrote:
> > As I stated in the mentioned post, I believe Babelfish is a very
> > welcomed addition to the PostgreSQL ecosystem. It allows PostgreSQL to
> > reach other users, other use cases, other markets; something which in my
> > opinion PostgreSQL really needs to extend its reach, to become a more
> > relevant player in the database market. The potential is there,
> > specially given all the extensibility points that PostgreSQL already
> > has, which are unparalleled in the industry.
>
> Let's assume for the sake of argument that your analysis of the
> benefits is 100% correct -- let's take it for granted that Babelfish
> is manna from heaven. It's still not clear that it's worth embracing
> Babelfish in the way that you have advocated.
>
> We simply don't know what the costs are. Because there is no source
> code available. Maybe that will change tomorrow or next week, but as
> of this moment there is simply nothing substantive to evaluate.

I agree. I believe that Babelfish's efforts can be compared with the
zedstore and zheap efforts: they require work in core before they can
be integrated or added as an extension that could replace the normal
heap tableam, and while core is being prepared we can discover what
can and cannot be prepared in core for this new feature. But as long
as there is no information about what structural updates in core would
be required, no commitment can be made for inclusion. And although I
would agree that an extension system for custom protocols and parsers
would be interesting, I think it would be putting the cart before the
horse if you want to force a decision 4 years ahead of time [0],
without ever having seen the code or even a design document.

In general, I think postgres could indeed benefit from a pluggable
protocol and dialect frontend, but as long as there are no public and
open projects that demonstrate the benefits or would provide a guide
for implementing such frontend, I see no reason for the postgres
project to put work into such a feature.

With regards,

Matthias van de Meent

[0] I believe this is an optimistic guess, based on the changes that
were (and are yet still) required for the zedstore and/or zheap
tableam, but am happy to be proven wrong.




Re: PostgreSQL <-> Babelfish integration

2021-02-12 Thread Álvaro Hernández



On 12/2/21 19:44, Peter Geoghegan wrote:
> On Fri, Feb 12, 2021 at 10:26 AM Álvaro Hernández  wrote:
>> As I stated in the mentioned post, I believe Babelfish is a very
>> welcomed addition to the PostgreSQL ecosystem. It allows PostgreSQL to
>> reach other users, other use cases, other markets; something which in my
>> opinion PostgreSQL really needs to extend its reach, to become a more
>> relevant player in the database market. The potential is there,
>> especially given all the extensibility points that PostgreSQL already
>> has, which are unparalleled in the industry.
> Let's assume for the sake of argument that your analysis of the
> benefits is 100% correct -- let's take it for granted that Babelfish
> is manna from heaven. It's still not clear that it's worth embracing
> Babelfish in the way that you have advocated.
>
> We simply don't know what the costs are. Because there is no source
> code available. Maybe that will change tomorrow or next week, but as
> of this moment there is simply nothing substantive to evaluate.

    I'm sure if we embrace an open and honest conversation, we will be
able to figure out what the integration costs are even before the source
code gets published. As I said, this goes beyond the very technical
detail of source code integration. Waiting until the source code is
published is a bit chicken-and-egg (as I presume the source will morph
towards convergence if there's work that may be started, even if it is
just for example for protocol extensibility).

    I'm sure this can be also discussed at an architectural level,
getting an analysis of what parts of PostgreSQL would be changed, what
extension mechanisms are required, what is the volume of the project,
and many others.

    Álvaro


-- 

Alvaro Hernandez


---
OnGres





Re: PostgreSQL <-> Babelfish integration

2021-02-12 Thread Peter Geoghegan
On Fri, Feb 12, 2021 at 10:26 AM Álvaro Hernández  wrote:
> As I stated in the mentioned post, I believe Babelfish is a very
> welcomed addition to the PostgreSQL ecosystem. It allows PostgreSQL to
> reach other users, other use cases, other markets; something which in my
> opinion PostgreSQL really needs to extend its reach, to become a more
> relevant player in the database market. The potential is there,
> especially given all the extensibility points that PostgreSQL already
> has, which are unparalleled in the industry.

Let's assume for the sake of argument that your analysis of the
benefits is 100% correct -- let's take it for granted that Babelfish
is manna from heaven. It's still not clear that it's worth embracing
Babelfish in the way that you have advocated.

We simply don't know what the costs are. Because there is no source
code available. Maybe that will change tomorrow or next week, but as
of this moment there is simply nothing substantive to evaluate.

-- 
Peter Geoghegan




PostgreSQL <-> Babelfish integration

2021-02-12 Thread Álvaro Hernández


    I would like to share my thoughts in the list about the potential
PostgreSQL <-> Babelfish integration. There is already a thread about
protocol hooks [1], but I'd like to offer my PoV from a higher level
perspective and keep that thread for the technical aspects of the
protocol hooks. This is also a follow-up on a public blog post I
recently published [2] and on the feedback I received there, which suggested
bringing the topic to the ML.

    As I stated in the mentioned post, I believe Babelfish is a very
welcomed addition to the PostgreSQL ecosystem. It allows PostgreSQL to
reach other users, other use cases, other markets; something which in my
opinion PostgreSQL really needs to extend its reach, to become a more
relevant player in the database market. The potential is there,
especially given all the extensibility points that PostgreSQL already
has, which are unparalleled in the industry.

    I believe we should engage in a conversation, with AWS included,
about how we can possibly benefit from this integration. It must be
symbiotic, both "parties" should win with it, otherwise it won't work.
But I believe it can definitely be a win-win situation. There have been
some concerns that this may be for Amazon's own benefit and would impose an
increased maintenance burden on the PostgreSQL Community. I believe that
analysis does not include the many benefits that such compatibility would
bring to PostgreSQL on many fronts. And possibly the changes required in core
are beneficial for other areas of PostgreSQL. Several have already pointed out
in the extensibility hooks thread that this could allow new protocols into
PostgreSQL, including the much desired v4 or an HTTP one. I can only strongly
second that, and we should also analyze it from this perspective.

    There is also a risk factor that I believe needs to be factored into
the analysis: the risk of not doing anything. From my understanding, it is
very clear that AWS wants to treat Babelfish as a kind of development branch,
waiting for inclusion into mainline. But I also believe that if this branch
sits unmerged forever, at some point it risks taking on a life of its own and
becoming a fork. And if it does, it may become our "MariaDB". I would not like
this to happen.

    I'm happy to contribute what I can to this discussion: whether we want
Babelfish to be integrated, how, the pros and cons, etc. I see this as an
incredible gift that, if managed properly, will not only make PostgreSQL much
better in use cases it cannot reach today, but may also boost PostgreSQL's
extensibility even further, and maybe even spark development of some projects
(like a v4 or an HTTP protocol) that have long been dismissed because there
were (logically) too many requirements for any v3 replacement, which made
replacing it extremely hard.

    But of course, these are just the humble 2 cents of a casual
-hackers reader.


    Álvaro


[1]
https://www.postgresql.org/message-id/CAGBW59d5SjLyJLt-jwNv%2BoP6esbD8SCB%3D%3D%3D11WVe5%3DdOHLQ5wQ%40mail.gmail.com
[2] https://postgresql.fund/blog/babelfish-the-elephant-in-the-room/

-- 

Alvaro Hernandez


---
OnGres






Re: pg13.2: invalid memory alloc request size NNNN

2021-02-12 Thread Justin Pryzby
On Fri, Feb 12, 2021 at 06:44:54PM +0100, Tomas Vondra wrote:
> > (gdb) p len
> > $1 = -4
> > 
> > This VM had some issue early today and I killed the VM, causing PG to 
> > execute
> > recovery.  I'm tentatively blaming that on zfs, so this could conceivably 
> > be a
> > data error (although recovery supposedly would have resolved it).  I just
> > checked and data_checksums=off.
> 
> This seems very much like a corrupted varlena header - length (-4) is
> clearly bogus, and it's what triggers the problem, because that's what wraps
> around to 18446744073709551613 (which is 0xFFFFFFFFFFFFFFFD).
> 
> This has to be a value stored in a table, not some intermediate value
> created during execution. So I don't think the exact query matters. Can you
> try doing something like pg_dump, which has to detoast everything?

Right, COPY fails and VACUUM FULL crashes.

message| invalid memory alloc request size 18446744073709551613
query  | COPY child.tt TO '/dev/null';

> The question is whether this is due to the VM getting killed in some strange
> way (what VM system is this, how is the storage mounted?) or whether the
> recovery is borked and failed to do the right thing.

This is qemu/kvm, with block storage, and then more block devices for the ZFS
vdevs.

Those are LVM volumes (I know that ZFS/LVM is discouraged).

$ zpool list -v
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zfs    762G   577G   185G        -         -    71%    75%  1.00x  ONLINE  -
  vdj  127G  92.7G  34.3G        -         -    64%  73.0%      -  ONLINE
  vdd  127G  95.6G  31.4G        -         -    74%  75.2%      -  ONLINE
  vdf  127G  96.0G  31.0G        -         -    75%  75.6%      -  ONLINE
  vdg  127G  95.8G  31.2G        -         -    74%  75.5%      -  ONLINE
  vdh  127G  95.5G  31.5G        -         -    74%  75.2%      -  ONLINE
  vdi  128G   102G  25.7G        -         -    71%  79.9%      -  ONLINE

This is recently upgraded to ZFS 2.0.0, and then to 2.0.1:

Jan 21 09:33:26 Installed: zfs-dkms-2.0.1-1.el7.noarch
Dec 23 08:41:21 Installed: zfs-dkms-2.0.0-1.el7.noarch

The VM has gotten "wedged" and I've had to kill it a few times in the last 24h
(needless to say this is not normal).  That part seems like a kernel issue and
not a postgres problem.  It's unclear whether that's due to me trying to tickle the
postgres ERROR.  It's the latest centos7 kernel: 3.10.0-1160.15.2.el7.x86_64

-- 
Justin




Re: Trigger execution role

2021-02-12 Thread Tom Lane
Isaac Morland  writes:
> I was trying to use triggers, and ran into something I hadn't realized
> until now: triggers run, not as the owner of the table, but as the user who
> is doing the insert/update/delete.

If you don't want that, you can make the trigger function SECURITY
DEFINER.  If we forced such behavior, there'd be no way to get the
other behavior.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-02-12 Thread Tomas Vondra




On 2/12/21 5:46 AM, Andres Freund wrote:

Hi,

On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:

Yeah, that's a good point. I think it'd make sense to keep track of recent
FPIs and skip prefetching such blocks. But how exactly should we implement
that, how many blocks do we need to track? If you get an FPI, how long
should we skip prefetching of that block?

I don't think the history needs to be very long, for two reasons. Firstly,
the usual pattern is that we have FPI + several changes for that block
shortly after it. Secondly, maintenance_io_concurrency limits this naturally
- after crossing that, redo should place the FPI into shared buffers,
allowing us to skip the prefetch.

So I think using maintenance_io_concurrency is sufficient. We might track
more buffers to allow skipping prefetches of blocks that were evicted from
shared buffers, but that seems like an overkill.

However, maintenance_io_concurrency can be quite high, so just a simple
queue is not very suitable - searching it linearly for each block would be
too expensive. But I think we can use a simple hash table, tracking
(relfilenode, block, LSN), over-sized to minimize collisions.

Imagine it's a simple array with (2 * maintenance_io_concurrency) elements,
and whenever we prefetch a block or find an FPI, we simply add the block to
the array as determined by hash(relfilenode, block)

 hashtable[hash(...)] = {relfilenode, block, LSN}

and then when deciding whether to prefetch a block, we look at that one
position. If the (relfilenode, block) match, we check the LSN and skip the
prefetch if it's sufficiently recent. Otherwise we prefetch.
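
For illustration, a minimal standalone C sketch of that over-sized array (the
names, slot count and recency test are all invented for the example; this is
not code from the patch):

/* Illustrative sketch only: fixed-size filter of recently seen FPI blocks. */
#include <stdbool.h>
#include <stdint.h>

typedef struct RecentBlock
{
    uint32_t    relfilenode;    /* stand-in for RelFileNode */
    uint32_t    block;
    uint64_t    lsn;            /* LSN at which the FPI/prefetch was seen */
    bool        valid;
} RecentBlock;

#define FILTER_SLOTS 64         /* roughly 2 * maintenance_io_concurrency */

static RecentBlock filter[FILTER_SLOTS];

static uint32_t
filter_hash(uint32_t relfilenode, uint32_t block)
{
    /* cheap mix; a real implementation might use hash_bytes() instead */
    return (relfilenode * 0x9E3779B1u) ^ (block * 0x85EBCA77u);
}

/* Remember that this block was covered by an FPI (or was just prefetched). */
static void
filter_remember(uint32_t relfilenode, uint32_t block, uint64_t lsn)
{
    RecentBlock *slot = &filter[filter_hash(relfilenode, block) % FILTER_SLOTS];

    slot->relfilenode = relfilenode;    /* collisions simply overwrite */
    slot->block = block;
    slot->lsn = lsn;
    slot->valid = true;
}

/* Decide whether a prefetch of this block can be skipped. */
static bool
filter_skip_prefetch(uint32_t relfilenode, uint32_t block,
                     uint64_t replay_lsn, uint64_t recency_window)
{
    RecentBlock *slot = &filter[filter_hash(relfilenode, block) % FILTER_SLOTS];

    if (!slot->valid || slot->relfilenode != relfilenode || slot->block != block)
        return false;           /* not tracked: prefetch as usual */

    /* skip only if the remembered FPI/prefetch is sufficiently recent */
    return replay_lsn - slot->lsn <= recency_window;
}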


I'm a bit doubtful this is really needed at this point. Yes, the
prefetching will do a buffer table lookup - but it's a lookup that
already happens today. And the patch already avoids doing a second
lookup after prefetching (by optimistically caching the last Buffer id,
and re-checking).

I think there's potential for some significant optimization going
forward, but I think it's basically optimization over what we're doing
today. As this is already a nontrivial patch, I'd argue for doing so
separately.



I agree with treating this as an improvement - it's not something that 
needs to be solved in the first version. OTOH I think Stephen has a point 
that just skipping FPIs like we do now has limited effect, because the 
WAL usually contains additional changes to the same block.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Improvements and additions to COPY progress reporting

2021-02-12 Thread Matthias van de Meent
On Fri, 12 Feb 2021 at 13:40, Bharath Rupireddy
 wrote:
>
> On Fri, Feb 12, 2021 at 5:40 PM Matthias van de Meent
>  wrote:
> >
> > On Fri, 12 Feb 2021 at 12:23, Matthias van de Meent
> >  wrote:
> > >
> > > On Thu, 11 Feb 2021 at 15:44, Bharath Rupireddy
> > >  wrote:
> > > >
> > > >
> > > > On Thu, Feb 11, 2021, 8:08 PM Josef Šimánek  
> > > > wrote:
> > > >> I have split it since it should be the start of progress reporting
> > > >> testing at all. If you better consider this as part of COPY testing,
> > > >> feel free to move it to already existing copy testing related files.
> > > >> There's no real reason to keep it separated if not needed.
> > > >
> > > >
> > > > +1 to move those test cases to existing copy test files.
> > >
> > > Thanks for your reviews. PFA v4 of the patchset, in which the tests
> > > are put into copy.sql (well, input/copy.source). This also adds tests
> > > for correctly reporting COPY ... FROM 'file'.
> >
> > PFA v5, which fixes a failure in the pg_upgrade regression tests due
> > to incorrect usage of @abs_builddir@. I had the changes staged, but
> > forgot to add them to the patches.
> >
> > Sorry for the noise.
>
> Looks like the patch 0001 that was adding tuples_excluded was missing
> and cfbot is also not happy with the v5 patch set.
>
> Maybe, we may not need 6 patches as they are relatively very small
> patches. IMO, the following are enough:
>
> 0001 - tuples_excluded, lines to tuples change, COPY FROM/COPY TO
> addition, io_target -- basically all the code related patches can go
> into 0001
> 0002 - documentation
> 0003 - tests - I think we can only have a simple test(in copy2.sql)
> showing stdin/stdout and not have file related tests. Because these
> patches work as expected, please find my testing below:

I agree with that split; the current split was mainly because some of the
patches (except 1, 3 and 6, which are quite substantially different from
the rest) each had separate concerns voiced about the changes they contain
(be it directly or indirectly); e.g. the renaming of lines_* to tuples_* is
in my opinion a good thing, and Josef disagrees.

Anyway, please find attached patchset v6 applying that split.

Regarding only a simple test: I believe it is useful to have at least
a test that distinguishes between two different import types. I've
made a mistake before, so I think it is useful to add a regression
test to prevent someone else from making the same mistake (trivial
as it may be). Additionally, testing in copy.sql also allows for
validating the bytes_total column, which cannot be tested in copy2.sql
due to the lack of COPY FROM 'file' support over there. I'm +0.5 on
keeping it as-is in copy.sql, so unless someone has some strong
feelings about this, I'd like to keep it in copy.sql.

With regards,

Matthias van de Meent.
From daf793916a91157f917bafd669f50af3d3751ee4 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Fri, 5 Feb 2021 23:11:50 +0100
Subject: [PATCH v6 2/3] Add backlinks to progress reporting documentation

Previously, for most progress-reported features, the only place the
feature was mentioned is in the progress reporting document itself.
This makes the progress reporting more discoverable from the reported
commands.
---
 doc/src/sgml/ref/analyze.sgml   |  7 +++
 doc/src/sgml/ref/cluster.sgml   |  6 ++
 doc/src/sgml/ref/copy.sgml  | 14 ++
 doc/src/sgml/ref/create_index.sgml  |  8 
 doc/src/sgml/ref/pg_basebackup.sgml |  1 +
 doc/src/sgml/ref/reindex.sgml   |  7 +++
 doc/src/sgml/ref/vacuum.sgml| 11 +++
 7 files changed, 54 insertions(+)

diff --git a/doc/src/sgml/ref/analyze.sgml b/doc/src/sgml/ref/analyze.sgml
index 7d816c87c6..9db9070b62 100644
--- a/doc/src/sgml/ref/analyze.sgml
+++ b/doc/src/sgml/ref/analyze.sgml
@@ -273,6 +273,12 @@ ANALYZE [ VERBOSE ] [ table_and_columns
+
+  
+Each backend running the ANALYZE command will report their
+progress to the pg_stat_progress_analyze view.
+See  for details.
+  
  
 
  
@@ -291,6 +297,7 @@ ANALYZE [ VERBOSE ] [ table_and_columns


+   
   
  
 
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 5dd21a0189..5c2270f71b 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -192,6 +192,11 @@ CLUSTER [VERBOSE]
 are periodically reclustered.

 
+  
+Each backend running the CLUSTER command will report their
+progress to the pg_stat_progress_cluster view.
+See  for details.
+  
  
 
  
@@ -242,6 +247,7 @@ CLUSTER index_name ON 

+   
   
  
 
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 0fca6583af..af3ce72561 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -82,6 +82,12 @@ COPY { table_name [ ( 
+
+  
+Each backend running the COPY command will report their
+progress to the pg_stat_progress_copy view.
+See  for details.
+  
  

Re: pg13.2: invalid memory alloc request size NNNN

2021-02-12 Thread Tomas Vondra




On 2/12/21 2:48 AM, Justin Pryzby wrote:

ts=# \errverbose
ERROR:  XX000: invalid memory alloc request size 18446744073709551613

#0  pg_re_throw () at elog.c:1716
#1  0x00a33b12 in errfinish (filename=0xbff20e "mcxt.c", lineno=959, funcname=0xbff2db 
<__func__.6684> "palloc") at elog.c:502
#2  0x00a6760d in palloc (size=18446744073709551613) at mcxt.c:959
#3  0x009fb149 in text_to_cstring (t=0x2aaae8023010) at varlena.c:212
#4  0x009fbf05 in textout (fcinfo=0x2094538) at varlena.c:557
#5  0x006bdd50 in ExecInterpExpr (state=0x2093990, econtext=0x20933d8, 
isnull=0x7fff5bf04a87) at execExprInterp.c:1112
#6  0x006d4f18 in ExecEvalExprSwitchContext (state=0x2093990, 
econtext=0x20933d8, isNull=0x7fff5bf04a87) at 
../../../src/include/executor/executor.h:316
#7  0x006d4f81 in ExecProject (projInfo=0x2093988) at 
../../../src/include/executor/executor.h:350
#8  0x006d5371 in ExecScan (node=0x20932c8, accessMtd=0x7082e0 , 
recheckMtd=0x708385 ) at execScan.c:238
#9  0x007083c2 in ExecSeqScan (pstate=0x20932c8) at nodeSeqscan.c:112
#10 0x006d1b00 in ExecProcNodeInstr (node=0x20932c8) at 
execProcnode.c:466
#11 0x006e742c in ExecProcNode (node=0x20932c8) at 
../../../src/include/executor/executor.h:248
#12 0x006e77de in ExecAppend (pstate=0x2089208) at nodeAppend.c:267
#13 0x006d1b00 in ExecProcNodeInstr (node=0x2089208) at 
execProcnode.c:466
#14 0x0070964f in ExecProcNode (node=0x2089208) at 
../../../src/include/executor/executor.h:248
#15 0x00709795 in ExecSort (pstate=0x2088ff8) at nodeSort.c:108
#16 0x006d1b00 in ExecProcNodeInstr (node=0x2088ff8) at 
execProcnode.c:466
#17 0x006d1ad1 in ExecProcNodeFirst (node=0x2088ff8) at 
execProcnode.c:450
#18 0x006dec36 in ExecProcNode (node=0x2088ff8) at 
../../../src/include/executor/executor.h:248
#19 0x006df079 in fetch_input_tuple (aggstate=0x2088a20) at 
nodeAgg.c:589
#20 0x006e1fad in agg_retrieve_direct (aggstate=0x2088a20) at 
nodeAgg.c:2368
#21 0x006e1bfd in ExecAgg (pstate=0x2088a20) at nodeAgg.c:2183
#22 0x006d1b00 in ExecProcNodeInstr (node=0x2088a20) at 
execProcnode.c:466
#23 0x006d1ad1 in ExecProcNodeFirst (node=0x2088a20) at 
execProcnode.c:450
#24 0x006c6ffa in ExecProcNode (node=0x2088a20) at 
../../../src/include/executor/executor.h:248
#25 0x006c966b in ExecutePlan (estate=0x2032f48, planstate=0x2088a20, 
use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, 
direction=ForwardScanDirection, dest=0xbb3400 ,
 execute_once=true) at execMain.c:1632

#3  0x009fb149 in text_to_cstring (t=0x2aaae8023010) at varlena.c:212
212 result = (char *) palloc(len + 1);

(gdb) l
207 /* must cast away the const, unfortunately */
208 text   *tunpacked = pg_detoast_datum_packed(unconstify(text 
*, t));
209 int len = VARSIZE_ANY_EXHDR(tunpacked);
210 char   *result;
211
212 result = (char *) palloc(len + 1);

(gdb) p len
$1 = -4

This VM had some issue early today and I killed the VM, causing PG to execute
recovery.  I'm tentatively blaming that on zfs, so this could conceivably be a
data error (although recovery supposedly would have resolved it).  I just
checked and data_checksums=off.



This seems very much like a corrupted varlena header - length (-4) is 
clearly bogus, and it's what triggers the problem, because that's what 
wraps around to 18446744073709551613 (which is 0xFFFFFFFFFFFFFFFD).
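
To see the arithmetic, here is a tiny standalone C example (the -4 comes from
the corrupted length; the conversion to the unsigned size argument of palloc()
is what produces the huge request on a 64-bit build):

#include <stdio.h>
#include <stddef.h>

int
main(void)
{
    int     len = -4;                       /* bogus VARSIZE_ANY_EXHDR() result */
    size_t  request = (size_t) (len + 1);   /* what palloc(len + 1) ends up asking for */

    printf("%zu\n", request);               /* prints 18446744073709551613 */
    return 0;
}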


This has to be a value stored in a table, not some intermediate value 
created during execution. So I don't think the exact query matters. Can 
you try doing something like pg_dump, which has to detoast everything?


The question is whether this is due to the VM getting killed in some 
strange way (what VM system is this, how is the storage mounted?) or 
whether the recovery is borked and failed to do the right thing.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Trigger execution role

2021-02-12 Thread Isaac Morland
I was trying to use triggers, and ran into something I hadn't realized
until now: triggers run, not as the owner of the table, but as the user who
is doing the insert/update/delete.

It seems to me that for a lot of the suggested uses of triggers this is not
the desired behaviour. For example, in the canonical "logging table"
example, we want to be able to allow users to make changes to the base
table, but we don't want them to be able to insert to the log table except
indirectly by causing the trigger to fire. Even in cases where a trigger is
just checking whether the update is permissible, it seems to me that it
might be useful to refer to information not accessible to the user doing
the changes.

Has there been any discussion of this before? I know Postgres has had
triggers for a long time but at the same time I don't see how anybody could
do a lot of work with triggers without finding this to be a problem at some
point. I haven't found any discussion of this in the documentation.


Re: Detecting pointer misalignment (was Re: pgsql: Implementation of subscripting for jsonb)

2021-02-12 Thread Tom Lane
I've updated buildfarm member longfin to use "-fsanitize=alignment
-fsanitize-trap=alignment", and it just got through a run successfully
with that.  It'd be good perhaps if some other buildfarm owners
followed suit (mumble JIT coverage mumble).

Looking around at other recent reports, it looks like we'll need to tweak
the compiler version cutoffs a bit.  I see for instance that spurfowl,
with gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609, is whining:

pg_crc32c_sse42.c:24:1: warning: ‘no_sanitize’ attribute directive ignored [-Wattributes]

So maybe it'd better be __GNUC__ >= 6 not __GNUC__ >= 5.  I think
we can wait a little bit for more reports before messing with that,
though.
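
As a concrete sketch, the tweak being floated would look roughly like this
(illustrative only; the exact macro name and conditions live in c.h and may
differ):

/* Only emit the attribute where the compiler is known to accept it. */
#if defined(__clang__) || (defined(__GNUC__) && __GNUC__ >= 6)   /* was >= 5 */
#define pg_attribute_no_sanitize_alignment() \
    __attribute__((no_sanitize("alignment")))
#else
#define pg_attribute_no_sanitize_alignment()
#endif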

Once this does settle, should we consider back-patching so that it's
possible to run alignment checks in the back branches too?

regards, tom lane




Re: WIP: document the hook system

2021-02-12 Thread Anastasia Lubennikova

On 17.01.2021 16:53, Magnus Hagander wrote:

On Fri, Jan 15, 2021 at 8:28 AM Peter Eisentraut
 wrote:

On 2020-12-31 04:28, David Fetter wrote:

This could probably use a lot of filling in, but having it in the
actual documentation beats needing to know folklore even to know
that the capability is there.

This patch seems quite short of a state where one could begin to
evaluate it.  Documenting the hooks better seems a worthwhile goal.   I
think the question is whether we can develop documentation that is
genuinely useful by itself without studying the relevant source code.
This submission does not address that question.

Even just having a list of available hooks would be a nice improvement though :)

But maybe it's something that belongs better in a README file instead,
since as you say it's unlikely to be properly useful without looking
at the source anyway. But just a list of hooks and a *very* high
overview of where each of them hooks in would definitely be useful to
have somewhere, I think. Having to find with "grep" whether there may
or may not exist a hook for approximately what it is you're looking
for is definitely a process to improve on.


+1 for README.
Hooks are intended for developers and can be quite dangerous without 
proper understanding of the internal code.


I also want to point to a README gathered by mentees [1]. It was
done under the PostgreSQL license, so we can use it.
By the way, is there any agreement that PostgreSQL README files must be
plain text, or can we use Markdown?


[1] https://github.com/AmatanHead/psql-hooks/blob/master/Detailed.md

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: some pointless HeapTupleHeaderIndicatesMovedPartitions calls

2021-02-12 Thread Anastasia Lubennikova
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

I wonder why this patch has been sitting in the commitfest for so long.
The idea of the fix is clear, the code is correct, and all tests pass, so I am
moving it to Ready for Committer status.

The new status of this patch is: Ready for Committer


Re: logical replication seems broken

2021-02-12 Thread er
> On 02/12/2021 1:51 PM Amit Kapila  wrote:
> 
>  
> On Fri, Feb 12, 2021 at 6:04 PM Erik Rijkers  wrote:
> >
> > Hello,
> >
> > I am seeing errors in replication in a test program that I've been running 
> > for years with very little change (since 2017, really [1]).

Hi,

Here is a test program.  Careful, it deletes stuff.  And it will need some 
changes:

I compile postgres server versions into directories like:
 $HOME/pg_stuff/pg_installations/pgsql.$project    where project is a name

The attached script (logrep_cascade_bug.sh) assumes that two such compiled
versions are present (on my machine they are called HEAD and head0):
 $HOME/pg_stuff/pg_installations/pgsql.HEAD   --> git master as of today, Friday 12 February 2021
 $HOME/pg_stuff/pg_installations/pgsql.head0  --> 3063eb17593c, so that's from 11 February, before the replication changes

In the test script, the bash variables 'project' (and 'BIN') reflect my setup,
so they should probably be changed.

The instance from today, 12 February ('HEAD'), has the bug:
  it keeps endlessly waiting/looping with 'NOK' (= Not OK).
  'Not OK' means: the primary is not identical to all replicas (replica1 seems ok, but replica2 remains empty).

The instance from yesterday, 11 February ('head0'), is ok:
  it finishes in 20 s after waiting/looping just 2 or 3 times.
  'ok' means: all replicas are identical to the primary (as proven by the md5s).

That's all I have for now - I have no deeper idea about what exactly goes wrong.

I hope that helps; let me know if you cannot reproduce the problem.

Erik Rijkers

logrep_cascade_bug.sh
Description: application/shellscript


Re: Possible dereference after null check (src/backend/executor/ExecUtils.c)

2021-02-12 Thread Ranier Vilela
Em sex., 12 de fev. de 2021 às 03:28, Kyotaro Horiguchi <
horikyota@gmail.com> escreveu:

> At Wed, 10 Feb 2021 19:54:46 -0300, Ranier Vilela 
> wrote in
> > Hi,
> >
> > Per Coverity.
> >
> > The functions ExecGetInsertedCols and ExecGetUpdatedCols at ExecUtils.c
> > only are safe to call if the variable "ri_RangeTableIndex" is  != 0.
> >
> > Otherwise a possible Dereference after null check (FORWARD_NULL) can be
> > raised.
>
> As it turns out, it's a false positive. And perhaps we don't want to
> take action just to satisfy the static code analyzer.
>
>
> The coment in ExecGetInsertedCols says:
>
> > /*
> >  * The columns are stored in the range table entry. If this ResultRelInfo
> >  * doesn't have an entry in the range table (i.e. if it represents a
> >  * partition routing target), fetch the parent's RTE and map the columns
> >  * to the order they are in the partition.
> >  */
> > if (relinfo->ri_RangeTableIndex != 0)
> > {
>
> This means that any one of the two is always usable here.  AFAICS,
> actually, ri_RangeTableIndex is non-zero for partitioned (=leaf) and
> non-partitoned relations and ri_RootResultRelInfo is non-null for
> partitioned (parent or intermediate) relations (since they don't have
> a coressponding range table entry).
>
> The only cases where both are 0 and NULL are trigger-use, which is
> unrelated to the code path.
>
This is a case where it would be worth an assertion.
What do you think?
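
A standalone sketch of what such an assertion could express (the struct here
is a stand-in with only the two fields that matter; in PostgreSQL itself this
would simply be an Assert() at the top of ExecGetInsertedCols/ExecGetUpdatedCols):

#include <assert.h>
#include <stddef.h>

/* Stand-in for the two ResultRelInfo fields discussed above. */
typedef struct ResultRelInfoStandIn
{
    unsigned int ri_RangeTableIndex;    /* 0 if not in the range table */
    void        *ri_RootResultRelInfo;  /* non-NULL for partition routing targets */
} ResultRelInfoStandIn;

static void
check_result_rel_invariant(const ResultRelInfoStandIn *relinfo)
{
    /*
     * Either the relation has its own range table entry, or it is a
     * partition routing target carrying a pointer to its root's info.
     */
    assert(relinfo->ri_RangeTableIndex != 0 ||
           relinfo->ri_RootResultRelInfo != NULL);
}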

regards,
Ranier Vilela


Re: Possible dereference null return (src/backend/replication/logical/reorderbuffer.c)

2021-02-12 Thread Ranier Vilela
Em sex., 12 de fev. de 2021 às 03:56, Kyotaro Horiguchi <
horikyota@gmail.com> escreveu:

> At Wed, 10 Feb 2021 20:12:38 -0300, Ranier Vilela 
> wrote in
> > Hi,
> >
> > Per Coverity.
> >
> > If xid is a subtransaction, the setup of base snapshot on the top-level
> > transaction,
> > can be not optional, otherwise a Dereference null return value
> > (NULL_RETURNS) can be raised.
> >
> > Patch suggestion to fix this.
> >
> > diff --git a/src/backend/replication/logical/reorderbuffer.c
> > b/src/backend/replication/logical/reorderbuffer.c
> > index 5a62ab8bbc..3c6a81f716 100644
> > --- a/src/backend/replication/logical/reorderbuffer.c
> > +++ b/src/backend/replication/logical/reorderbuffer.c
> > @@ -2993,8 +2993,8 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb,
> > TransactionId xid,
> >   */
> >   txn = ReorderBufferTXNByXid(rb, xid, true, _new, lsn, true);
> >   if (rbtxn_is_known_subxact(txn))
> > - txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
> > - NULL, InvalidXLogRecPtr, false);
> > + txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, true,
> > + NULL, InvalidXLogRecPtr, true);
> >   Assert(txn->base_snapshot == NULL);
>
> If the return from the first call is a subtransaction, the second call
> always obtain the top transaction.  If the top transaction actualy did
> not exist, it's rather the correct behavior to cause SEGV, than
> creating a bogus rbtxn. THus it is wrong to set create=true and
> create_as_top=true.  We could change the assertion like Assert (txn &&
> txn->base_snapshot) to make things clearer.
>
It does not make sense to generate a SEGV on purpose, and an
assertion would not explain why this happens in a production environment.
It would be better to report an error if the second call fails.
What would you suggest as the error message?

regards,
Ranier Vilela


Re: Why do we have MakeSingleTupleTableSlot instead of not using MakeTupleTableSlot?

2021-02-12 Thread Zhihong Yu
Hi,
MakeSingleTupleTableSlot can be defined as a macro, calling
MakeTupleTableSlot().
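
A sketch of that idea (assuming both functions keep their current argument
lists; a static inline in tuptable.h would work equally well and preserve
type checking):

/* Illustrative only: make the wrapper disappear at compile time. */
#define MakeSingleTupleTableSlot(tupleDesc, tts_ops) \
    MakeTupleTableSlot(tupleDesc, tts_ops)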

Cheers

On Fri, Feb 12, 2021 at 5:44 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> Hi,
>
> I wonder, is there a specific reason that MakeTupleTableSlot is
> wrapped up in MakeSingleTupleTableSlot without doing anything other than
> just returning the slot created by MakeTupleTableSlot? Do we really
> need MakeSingleTupleTableSlot? Can't we just use MakeTupleTableSlot
> directly? Am I missing something?
>
> I think we can avoid some unnecessary function call costs, for
> instance when called 1000 times inside table_slot_create from
> copyfrom.c or in some other places where MakeSingleTupleTableSlot is
> called in a loop.
>
> If it's okay to remove MakeSingleTupleTableSlot and use
> MakeTupleTableSlot instead, we might have to change in a lot of
> places. If we don't want to change in those many files, we could
> rename MakeTupleTableSlot to MakeSingleTupleTableSlot and change it in
> only a few places.
>
> Thoughts?
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
>
>
>


Re: Detecting pointer misalignment (was Re: pgsql: Implementation of subscripting for jsonb)

2021-02-12 Thread Alexander Korotkov
Hi, Thomas!

On Fri, Feb 12, 2021 at 12:04 AM Thomas Munro  wrote:
> On Tue, Feb 9, 2021 at 1:34 PM Alexander Korotkov  
> wrote:
> > Could we have both cfbot + buildfarm animals?
> For cfbot, yeah it does seem like a good idea to throw whatever code
> sanitiser stuff we can into the automated tests, especially stuff that
> isn't prone to false alarms.  Can you please recommend an exact change
> to apply to:
>
> https://github.com/macdice/cfbot/blob/master/cirrus/.cirrus.yml
>
> Note that FreeBSD and macOS are using clang (though you might think
> the latter is using gcc from its configure output...), and Linux is
> using gcc.

Thank you for the feedback!
I'll propose a pull request on GitHub.

--
Regards,
Alexander Korotkov




Re: Dump public schema ownership & seclabels

2021-02-12 Thread Noah Misch
On Thu, Feb 11, 2021 at 04:08:34AM -0800, Noah Misch wrote:
> On Sun, Jan 17, 2021 at 12:00:06PM +0100, Vik Fearing wrote:
> > On 1/17/21 10:41 AM, Noah Misch wrote:
> > > On Sat, Jan 16, 2021 at 02:05:43PM +0100, Vik Fearing wrote:
> > >> On 12/30/20 12:59 PM, Noah Misch wrote:
> > >>> On Tue, Dec 29, 2020 at 05:49:24AM -0800, Noah Misch wrote:
> >  https://postgr.es/m/20201031163518.gb4039...@rfd.leadboat.com gave 
> >  $SUBJECT as
> >  one of the constituent projects for changing the public schema default 
> >  ACL.
> >  This ended up being simple.  Attached.
> > >>>
> > >>> This is defective; it fails to reproduce nspacl after "ALTER SCHEMA 
> > >>> public
> > >>> OWNER TO pg_write_server_files; REVOKE ALL ON SCHEMA public FROM
> > >>> pg_write_server_files;".  I will try again later.
> 
> Fixed.  The comment added to getNamespaces() explains what went wrong.
> 
> Incidentally, --no-acl is fragile without --no-owner, because any REVOKE
> statements assume a particular owner.  Since one can elect --no-owner at
> restore time, we can't simply adjust the REVOKE statements constructed at dump
> time.  That's not new with this patch or specific to initdb-created objects.
> 
> > >> Could I ask you to also include COMMENTs when you try again, please?
> > > 
> > > That may work.  I had not expected to hear of a person changing the 
> > > comment on
> > > schema public.  To what do you change it?
> > 
> > It was a while ago and I don't remember because it didn't appear in the
> > dump so I stopped doing it. :(
> > 
> > Mine was an actual comment, but there are some tools out there that
> > (ab)use COMMENTs as crude metadata for what they do.  For example:
> > https://postgresql-anonymizer.readthedocs.io/en/stable/declare_masking_rules/#declaring-rules-with-comments
> 
> I've attached a separate patch for this, which applies atop the ownership
> patch.  This makes more restores fail for non-superusers, which is okay.

Oops, I botched a refactoring late in the development of that patch.  Here's a
fixed pair of patches.
Author: Noah Misch 
Commit: Noah Misch 

Dump public schema ownership and security labels.

As a side effect, this corrects dumps of public schema ACLs in databases
where the DBA dropped and recreated that schema.

Reviewed by FIXME.

Discussion: https://postgr.es/m/20201229134924.ga1431...@rfd.leadboat.com

diff --git a/src/bin/pg_dump/dumputils.c b/src/bin/pg_dump/dumputils.c
index 60d306e..ea67e52 100644
--- a/src/bin/pg_dump/dumputils.c
+++ b/src/bin/pg_dump/dumputils.c
@@ -725,6 +725,7 @@ void
 buildACLQueries(PQExpBuffer acl_subquery, PQExpBuffer racl_subquery,
PQExpBuffer init_acl_subquery, PQExpBuffer 
init_racl_subquery,
const char *acl_column, const char *acl_owner,
+   const char *initprivs_expr,
const char *obj_kind, bool binary_upgrade)
 {
/*
@@ -765,23 +766,25 @@ buildACLQueries(PQExpBuffer acl_subquery, PQExpBuffer 
racl_subquery,
  "WITH ORDINALITY AS perm(acl,row_n) "
  "WHERE NOT EXISTS ( "
  "SELECT 1 FROM "
- 
"pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault(%s,%s))) "
+ 
"pg_catalog.unnest(coalesce(%s,pg_catalog.acldefault(%s,%s))) "
  "AS init(init_acl) WHERE acl = 
init_acl)) as foo)",
  acl_column,
  obj_kind,
  acl_owner,
+ initprivs_expr,
  obj_kind,
  acl_owner);
 
printfPQExpBuffer(racl_subquery,
  "(SELECT pg_catalog.array_agg(acl 
ORDER BY row_n) FROM "
  "(SELECT acl, row_n FROM "
- 
"pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault(%s,%s))) "
+ 
"pg_catalog.unnest(coalesce(%s,pg_catalog.acldefault(%s,%s))) "
  "WITH ORDINALITY AS initp(acl,row_n) "
  "WHERE NOT EXISTS ( "
  "SELECT 1 FROM "
  
"pg_catalog.unnest(coalesce(%s,pg_catalog.acldefault(%s,%s))) "
  "AS permp(orig_acl) WHERE acl = 
orig_acl)) as foo)",
+ initprivs_expr,
  obj_kind,
  acl_owner,
  

Re: Extensibility of the PostgreSQL wire protocol

2021-02-12 Thread Fabrízio de Royes Mello
On Thu, Feb 11, 2021 at 12:07 PM Tom Lane  wrote:
>
> Robert Haas  writes:
> > On Thu, Feb 11, 2021 at 9:42 AM Jonah H. Harris 
wrote:
> >> As Jan said in his last email, they're not proposing all the different
> >> aspects needed. In fact, nothing has actually been proposed yet. This
> >> is an entirely philosophical debate. I don't even know what's being
> >> proposed at this point - I just know it *could* be useful. Let's just
> >> wait and see what is actually proposed before shooting it down, yes?
>
> > I don't think I'm trying to shoot anything down, because as I said, I
> > like extensibility and am generally in favor of it. Rather, I'm
> > expressing a concern which seems to me to be justified, based on what
> > was posted. I'm sorry that my tone seems to have aggravated you, but
> > it wasn't intended to do so.
>
> Likewise, the point I was trying to make is that a "pluggable wire
> protocol" is only a tiny part of what would be needed to have a credible
> MySQL, Oracle, or whatever clone.  There are large semantic differences
> from those products; there are maintenance issues arising from the fact
> that we whack structures like parse trees around all the time; and so on.
> Maybe there is some useful thing that can be accomplished here, but we
> need to consider the bigger picture rather than believing (without proof)
> that a few hook variables will be enough to do anything.
>

Just so we don't miss the point: creating a compat protocol to mimic others
(TDS, MySQL, etc.) is just one use case.

There are other use cases for making the wire protocol extensible. For
example, for telemetry I could use some hooks to propagate context [1] and get
more detailed tracing information about the negotiation between frontend and
backend, enabling a true query-tracing tool.

Another use case is extending the current protocol to, for example, send more
information about query execution in the CommandComplete message instead of
just the number of affected rows.

As for the HTTP protocol, I think PG should have it, maybe pure HTTP (no REST,
just HTTP), because it's the most interoperable. Performance can still be very
good with HTTP2, and you have a huge ecosystem of tools and proxies (like
Envoy) that would do wonders with this. You could safely query a database from
a web page (passing through proxies that would do auth, TLS, etc.). Or maybe a
higher-performing gRPC version (which is also HTTP2 and is amazing), but that
makes it a bit more difficult to query from a web page. In either case, context
propagation is already built in, in a standard way.

Regards,

[1] https://www.w3.org/TR/trace-context/

-- 
   Fabrízio de Royes Mello
   PostgreSQL Developer at OnGres Inc. - https://ongres.com


Why do we have MakeSingleTupleTableSlot instead of not using MakeTupleTableSlot?

2021-02-12 Thread Bharath Rupireddy
Hi,

I wonder, is there a specific reason that MakeTupleTableSlot is
wrapped up in MakeSingleTupleTableSlot without doing anything other than
just returning the slot created by MakeTupleTableSlot? Do we really
need MakeSingleTupleTableSlot? Can't we just use MakeTupleTableSlot
directly? Am I missing something?

I think we can avoid some unnecessary function call costs, for
instance when called 1000 times inside table_slot_create from
copyfrom.c or in some other places where MakeSingleTupleTableSlot is
called in a loop.

If it's okay to remove MakeSingleTupleTableSlot and use
MakeTupleTableSlot instead, we might have to change in a lot of
places. If we don't want to change in those many files, we could
rename MakeTupleTableSlot to MakeSingleTupleTableSlot and change it in
only a few places.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-12 Thread Amit Langote
On Thu, Feb 11, 2021 at 4:43 PM Greg Nancarrow  wrote:
> On Thu, Feb 11, 2021 at 5:33 PM Greg Nancarrow  wrote:
> > On Tue, Feb 9, 2021 at 1:04 AM Amit Langote  wrote:
> > >
> > > * I think that the concerns raised by Tsunakawa-san in:
> > >
> > > https://www.postgresql.org/message-id/TYAPR01MB2990CCB6E24B10D35D28B949FEA30%40TYAPR01MB2990.jpnprd01.prod.outlook.com
> > >
> > > regarding how this interacts with plancache.c deserve a look.
> > > Specifically, a plan that uses parallel insert may fail to be
> > > invalidated when partitions are altered directly (that is without
> > > altering their root parent).  That would be because we are not adding
> > > partition OIDs to PlannerGlobal.invalItems despite making a plan
> > > that's based on checking their properties.  See this (tested with all
> > > patches applied!):
> > >
> >
> > Does any current Postgres code add partition OIDs to
> > PlannerGlobal.invalItems for a similar reason?

Currently, the planner opens partitions only for SELECT queries and
also adds them to the query's range table.  And because they are added
to the range table, their OIDs do get added to
PlannerGlobal.relationOids (not invalItems, sorry!) by way of
CompleteCachedPlan() calling extract_query_dependencies(), which looks
at Query.rtable to decide which tables/partitions to add.

> > I would have thought that, for example,  partitions with a default
> > column expression, using a function that is changed from SAFE to
> > UNSAFE, would suffer the same plancache issue (for current parallel
> > SELECT functionality) as we're talking about here - but so far I
> > haven't seen any code handling this.

AFAIK, default column expressions don't affect plans for SELECT
queries.  OTOH, consider a check constraint expression as an example.
The planner may use one to exclude a partition from the plan with its
constraint exclusion algorithm (separate from "partition pruning").
If the check constraint is dropped, any cached plans that used it will
be invalidated.

create table rp (a int) partition by range (a);
create table rp1 partition of rp for values from (minvalue) to (0);
create table rp2 partition of rp for values from (0) to (maxvalue);
alter table rp1 add constraint chk check (a >= -5);
set constraint_exclusion to on;

-- forces using a cached plan
set plan_cache_mode to force_generic_plan ;
prepare q as select * from rp where a < -5;

-- planner excluded rp1 because of the contradictory constraint
explain execute q;
QUERY PLAN
--
 Result  (cost=0.00..0.00 rows=0 width=0)
   One-Time Filter: false
(2 rows)

-- constraint dropped, plancache inval hook invoked
alter table rp1 drop constraint chk ;

-- old plan invalidated, new one made
explain execute q;
   QUERY PLAN
-
 Seq Scan on rp1 rp  (cost=0.00..41.88 rows=850 width=4)
   Filter: (a < '-5'::integer)
(2 rows)

> > (Currently invalItems seems to support PROCID and TYPEOID; relation
> > OIDs seem to be handled through a different mechanism)..
>
> > Can you elaborate on what you believe is required here, so that the
> > partition OID dependency is registered and the altered partition
> > results in the plan being invalidated?
> > Thanks in advance for any help you can provide here.
>
> Actually, I tried adding the following in the loop that checks the
> parallel-safety of each partition and it seemed to work:
>
> glob->relationOids =
> lappend_oid(glob->relationOids, pdesc->oids[i]);
>
> Can you confirm, is that what you were referring to?

Right.  I had mistakenly mentioned PlannerGlobal.invalItems, sorry.

Although it gets the job done, I'm not sure if manipulating
relationOids from max_parallel_hazard() or its subroutines is okay,
but I will let the committer decide that.  As I mentioned above, the
person who designed this decided for some reason that it is
extract_query_dependencies()'s job to populate
PlannerGlobal.relationOids/invalItems.

> (note that I've already updated the code to use
> CreatePartitionDirectory() and PartitionDirectoryLookup())

I will check your v16 to see if it indeed does the intended thing.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: PostgreSQL and Flashback Database

2021-02-12 Thread Philippe Beaudoin

Hi Didier,

Have you ever had a look at the E-Maj extension?

Depending on the features you are really looking for, it may fit your needs.

Here are some pointers :

- github repo for the extension : https://github.com/dalibo/emaj

- github repo for the web client : https://github.com/dalibo/emaj_web

- online documentation : https://emaj.readthedocs.io/en/latest/
  or even in French, especially for you ;-) https://emaj.readthedocs.io/fr/latest/


Feel free to contact me to talk about it.

Best regards.

Philippe.

Le 10/02/2021 à 09:56, ROS Didier a écrit :


Hi

My company is looking for a team of developers to implement the 
"flashback database" functionality in PostgreSQL.


  Do you think it's feasible to implement? how many days 
of development?


  Thanks in advance

Best Regards

Didier ROS

E.D.F








*DALIBO*
*L'expertise PostgreSQL*
43, rue du Faubourg Montmartre
75009 Paris *Philippe Beaudoin*
*Consultant Avant-Vente*
+33 (0)1 84 72 76 11
+33 (0)7 69 14 67 21
philippe.beaud...@dalibo.com
Valorisez vos compétences PostgreSQL, faites-vous certifier par Dalibo 
 !




Re: logical replication seems broken

2021-02-12 Thread Amit Kapila
On Fri, Feb 12, 2021 at 6:04 PM Erik Rijkers  wrote:
>
> Hello,
>
> I am seeing errors in replication in a test program that I've been running 
> for years with very little change (since 2017, really [1]).
>
> The symptom:
> HEAD-replication fails (most of the time) when cascading 3 instances 
> (master+2 replicas).
>
> HEAD-replication works with 2 instances (master+replica).
>
> I have also compiled a server on top of 4ad31bb2ef25 (avoiding some recent 
> changes) - and this server runs the same test program without failure; so I 
> think the culprit might be somewhere in those changes.
>

Oh, if you are running this on today's HEAD then the recent commit
bff456d7a0 could be the culprit, but I'm not sure without knowing the
details.

>  Or (always possible) there might be something my testing does wrong - but 
> then again, I do this test a few times every week, and it never fails.
>

Do you expect anything in particular while running this test, any
particular messages? What exactly is failing?

> This weekend I can dig into it some more (make a self-sufficient example) but 
> I thought I'd mention it already. perhaps one of you see the light 
> immediately...
>

That would be really helpful.

-- 
With Regards,
Amit Kapila.




Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-12 Thread Greg Nancarrow
On Fri, Feb 12, 2021 at 5:30 PM tsunakawa.ta...@fujitsu.com
 wrote:
>
> > If I disable parallel_leader_participation.
> >
> > For max_parallel_workers_per_gather = 4, It still have performance
> > degradation.
> >
> > For max_parallel_workers_per_gather = 2, the performance degradation will
> > not happen in most of the case.
> > There is sometimes a noise(performance degradation), but most of
> > result(about 80%) is good.
>
> Thank you.  The results indicate that it depends on the degree of parallelism 
> whether the gain from parallelism outweighs the loss of parallel insert 
> operations, at least in the bitmap scan case.
>

That seems to be the pattern for this particular query, but I think
we'd need to test a variety to determine if that's always the case.

> But can we conclude that this is limited to bitmap scan?  Even if that's the 
> case, the planner does not have information about insert operation to choose 
> other plans like serial execution or parallel sequential scan.  Should we 
> encourage the user in the manual to tune parameters and find the fastest plan?
>
>

It's all based on path costs, so we need to analyze and compare the
costing calculations done in this particular case against other cases,
and the values of the various parameters (costsize.c).
It's not difficult to determine for a parallel ModifyTablePath if it
has a BitmapHeapPath subpath - perhaps total_cost needs adjustment
(increase) for this case - and that will influence the planner to
choose a cheaper path. I was able to easily test the effect of doing
this, in the debugger - by increasing total_cost in cost_modifytable()
for the parallel bitmap heap scan case, the planner then chose a
serial Insert + bitmap heap scan, because it then had a cheaper cost.
Of course we need to better understand the problem and the observed
patterns in order to get a better feel for how those costs should be
adjusted.
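
For illustration, a standalone sketch of the kind of adjustment described
above (the struct fields and the 1.5 multiplier are placeholders invented for
the example; the real change would live in cost_modifytable() and work on the
planner's Path nodes):

#include <stdbool.h>

typedef double Cost;

/* Stand-ins for the relevant bits of ModifyTablePath and its subpath. */
typedef struct SubPathStandIn
{
    bool    is_bitmap_heap_scan;
    Cost    total_cost;
} SubPathStandIn;

typedef struct ModifyTablePathStandIn
{
    SubPathStandIn *subpath;
    bool            parallel_insert;
    Cost            total_cost;
} ModifyTablePathStandIn;

static void
cost_modifytable_sketch(ModifyTablePathStandIn *mtpath)
{
    mtpath->total_cost = mtpath->subpath->total_cost;

    /*
     * Penalize parallel INSERT over a bitmap heap scan so that, when the
     * parallel plan is not actually faster, the serial plan wins on cost.
     * The factor is purely a placeholder for illustration.
     */
    if (mtpath->parallel_insert && mtpath->subpath->is_bitmap_heap_scan)
        mtpath->total_cost *= 1.5;
}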

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Improvements and additions to COPY progress reporting

2021-02-12 Thread Bharath Rupireddy
On Fri, Feb 12, 2021 at 5:40 PM Matthias van de Meent
 wrote:
>
> On Fri, 12 Feb 2021 at 12:23, Matthias van de Meent
>  wrote:
> >
> > On Thu, 11 Feb 2021 at 15:44, Bharath Rupireddy
> >  wrote:
> > >
> > >
> > > On Thu, Feb 11, 2021, 8:08 PM Josef Šimánek  
> > > wrote:
> > >> I have split it since it should be the start of progress reporting
> > >> testing at all. If you better consider this as part of COPY testing,
> > >> feel free to move it to already existing copy testing related files.
> > >> There's no real reason to keep it separated if not needed.
> > >
> > >
> > > +1 to move those test cases to existing copy test files.
> >
> > Thanks for your reviews. PFA v4 of the patchset, in which the tests
> > are put into copy.sql (well, input/copy.source). This also adds tests
> > for correctly reporting COPY ... FROM 'file'.
>
> PFA v5, which fixes a failure in the pg_upgrade regression tests due
> to incorrect usage of @abs_builddir@. I had the changes staged, but
> forgot to add them to the patches.
>
> Sorry for the noise.

Looks like the patch 0001 that was adding tuples_excluded was missing
and cfbot is also not happy with the v5 patch set.

Maybe, we may not need 6 patches as they are relatively very small
patches. IMO, the following are enough:

0001 - tuples_excluded, lines to tuples change, COPY FROM/COPY TO
addition, io_target -- basically all the code related patches can go
into 0001
0002 - documentation
0003 - tests - I think we can only have a simple test(in copy2.sql)
showing stdin/stdout and not have file related tests. Because these
patches work as expected, please find my testing below:

postgres=# select * from pg_stat_progress_copy;
   pid   | datid | datname  | relid |  command  | io_target |
bytes_processed | bytes_total | tuples_processed | tuples_excluded
-+---+--+---+---+---+-+-+--+-
 2886103 | 12977 | postgres | 16384 | COPY FROM | FILE  |
83099648 |8595 |  9553999 | 111
(1 row)

postgres=# select * from pg_stat_progress_copy;
   pid   | datid | datname  | relid |  command  | io_target |
bytes_processed | bytes_total | tuples_processed | tuples_excluded
-+---+--+---+---+---+-+-+--+-
 2886103 | 12977 | postgres | 16384 | COPY FROM | STDIO |
 0 |   0 |0 |   0
(1 row)

postgres=# select * from pg_stat_progress_copy;
   pid   | datid | datname  | relid | command | io_target |
bytes_processed | bytes_total | tuples_processed | tuples_excluded
-+---+--+---+-+---+-+-+--+-
 2886103 | 12977 | postgres | 16384 | COPY TO | FILE  |
37771610 |   0 |  4999228 |   0
(1 row)

postgres=# select * from pg_stat_progress_copy;
   pid   | datid | datname  | relid |  command  | io_target |
bytes_processed | bytes_total | tuples_processed | tuples_excluded
-+---+--+---+---+---+-+-+--+-
 2892816 | 12977 | postgres | 16384 | COPY FROM | CALLBACK  |
249777823 |   0 | 3192 |   0
(1 row)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




logical replication seems broken

2021-02-12 Thread Erik Rijkers
Hello,

I am seeing errors in replication in a test program that I've been running for 
years with very little change (since 2017, really [1]).

The symptom:
HEAD-replication fails (most of the time) when cascading 3 instances (master+2 
replicas).

HEAD-replication works with 2 instances (master+replica).

I have also compiled a server on top of 4ad31bb2ef25 (avoiding some recent 
changes) - and this server runs the same test program without failure; so I 
think the culprit might be somewhere in those changes.  Or (always possible) 
there might be something my testing does wrong - but then again, I do this test 
a few times every week, and it never fails.

This weekend I can dig into it some more (make a self-sufficient example) but I 
thought I'd mention it already. perhaps one of you see the light immediately...


Erik Rijkers

[1] 
https://www.postgresql.org/message-id/flat/3897361c7010c4ac03f358173adbcd60%40xs4all.nl




Re: Improvements and additions to COPY progress reporting

2021-02-12 Thread Matthias van de Meent
On Fri, 12 Feb 2021 at 12:23, Matthias van de Meent
 wrote:
>
> On Thu, 11 Feb 2021 at 15:44, Bharath Rupireddy
>  wrote:
> >
> >
> > On Thu, Feb 11, 2021, 8:08 PM Josef Šimánek  wrote:
> >> I have split it since it should be the start of progress reporting
> >> testing at all. If you better consider this as part of COPY testing,
> >> feel free to move it to already existing copy testing related files.
> >> There's no real reason to keep it separated if not needed.
> >
> >
> > +1 to move those test cases to existing copy test files.
>
> Thanks for your reviews. PFA v4 of the patchset, in which the tests
> are put into copy.sql (well, input/copy.source). This also adds tests
> for correctly reporting COPY ... FROM 'file'.

PFA v5, which fixes a failure in the pg_upgrade regression tests due
to incorrect usage of @abs_builddir@. I had the changes staged, but
forgot to add them to the patches.

Sorry for the noise.

-Matthias
From 5727e7d6de2e0ffa9c5f319fb92c8eb35421f50b Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Fri, 12 Feb 2021 12:53:14 +0100
Subject: [PATCH v5 6/6] Add copy progress reporting regression tests.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This tests some basic features of copy progress reporting.

Sadly, we can only request one snapshot of progress_reporting each
transaction / statement, so we can't (easily) get intermediate results without
each result requiring a separate statement being run.

Author: Josef Šimánek, Matthias van de Meent
---
 src/test/regress/input/copy.source  | 2 +-
 src/test/regress/output/copy.source | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/test/regress/input/copy.source b/src/test/regress/input/copy.source
index 0ce267e1cc..ddde33e7cc 100644
--- a/src/test/regress/input/copy.source
+++ b/src/test/regress/input/copy.source
@@ -250,7 +250,7 @@ bill	20	(11,10)	1000	sharon
 
 -- reporting of FILE imports, and correct reporting of tuples-excluded
 
-copy progress_reporting from '@abs_builddir@/data/emp.data'
+copy progress_reporting from '@abs_srcdir@/data/emp.data'
 	where (salary < 2000);
 
 -- cleanup progress_reporting
diff --git a/src/test/regress/output/copy.source b/src/test/regress/output/copy.source
index 32edc60319..60f4206aa1 100644
--- a/src/test/regress/output/copy.source
+++ b/src/test/regress/output/copy.source
@@ -202,7 +202,7 @@ create trigger check_after_progress_reporting
 copy progress_reporting from stdin;
 INFO:  progress: {"command": "COPY FROM", "datname": "regression", "io_target": "STDIO", "bytes_total": 0, "bytes_processed": 79, "tuples_excluded": 0, "tuples_processed": 3}
 -- reporting of FILE imports, and correct reporting of tuples-excluded
-copy progress_reporting from '@abs_builddir@/data/emp.data'
+copy progress_reporting from '@abs_srcdir@/data/emp.data'
 	where (salary < 2000);
 INFO:  progress: {"command": "COPY FROM", "datname": "regression", "io_target": "FILE", "bytes_total": 79, "bytes_processed": 79, "tuples_excluded": 1, "tuples_processed": 2}
 -- cleanup progress_reporting
-- 
2.20.1

From 181e95a07d7f1d3bac08f1596622e6da2519d069 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Josef=20=C5=A0im=C3=A1nek?= 
Date: Tue, 9 Feb 2021 13:06:45 +0100
Subject: [PATCH v5 5/6] Add copy progress reporting regression tests.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This tests some basic features of copy progress reporting.

Sadly, we can only request one snapshot of progress_reporting per
transaction / statement, so we can't (easily) get intermediate results without
each result requiring a separate statement being run.

Author: Josef Šimánek, Matthias van de Meent
---
 src/test/regress/input/copy.source  | 57 +
 src/test/regress/output/copy.source | 44 ++
 2 files changed, 101 insertions(+)

diff --git a/src/test/regress/input/copy.source b/src/test/regress/input/copy.source
index a1d529ad36..0ce267e1cc 100644
--- a/src/test/regress/input/copy.source
+++ b/src/test/regress/input/copy.source
@@ -201,3 +201,60 @@ select * from parted_copytest where b = 1;
 select * from parted_copytest where b = 2;
 
 drop table parted_copytest;
+
+--
+-- progress reporting
+-- 
+
+-- setup
+-- reuse employer datatype, that has a small sized data set
+
+create table progress_reporting (
+	name 		text,
+	age			int4,
+	location 	point,
+	salary 		int4,
+	manager 	name
+);
+
+create function notice_after_progress_reporting() returns trigger AS
+$$
+declare report record;
+begin
+	-- We cannot expect 'pid' or 'relid' to be consistent across runs due to
+	-- variance in system process ids and concurrency between test runs.
+	-- Additionally, because this test is run under pg_regress, the 'datid'
+	-- is not consistent between runs either.
+	select into report (to_jsonb(r) - '{pid,relid,datid}'::text[]) as value
+		from pg_stat_progress_copy r
+		where pid = pg_backend_pid();
+
+	

Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-12 Thread Zhihong Yu
Greg:
Thanks for more debugging.

Cheers

On Thu, Feb 11, 2021 at 9:43 PM Greg Nancarrow  wrote:

> On Fri, Feb 12, 2021 at 3:21 PM Zhihong Yu  wrote:
> >
> > Greg:
> > bq. we should just return parsetree->hasModifyingCTE at this point,
> >
> > Maybe you can clarify a bit.
> > The if (parsetree->hasModifyingCTE) check is followed by if
> (!hasModifyingCTE).
> > When parsetree->hasModifyingCTE is false, !hasModifyingCTE would be
> true, resulting in the execution of the if (!hasModifyingCTE) block.
> >
> > In your reply, did you mean that the if (!hasModifyingCTE) block is no
> longer needed ? (I guess not)
> >
>
> Sorry for not making it clear. What I meant was that instead of:
>
> if (parsetree->querySource == QSRC_ORIGINAL)
> {
>   /* Assume original queries have hasModifyingCTE set correctly */
>   if (parsetree->hasModifyingCTE)
> hasModifyingCTE = true;
> }
>
> I thought I should be able to use the following (if the setting for
> QSRC_ORIGINAL can really be trusted):
>
> if (parsetree->querySource == QSRC_ORIGINAL)
> {
>   /* Assume original queries have hasModifyingCTE set correctly */
>   return parsetree->hasModifyingCTE;
> }
>
> (and then the "if (!hasModifyingCTE)" test on the code following
> immediately below it can be removed)
>
>
> BUT - after testing that change, the problem test case (in the "with"
> tests) STILL fails.
> I then checked if hasModifyingCTE is always false in the QSRC_ORIGINAL
> case (by adding an Assert), and it always is false.
> So actually, there is no point in having the "if
> (parsetree->querySource == QSRC_ORIGINAL)" code block - even the so
> called "original" query doesn't maintain the setting correctly (even
> though the actual original query sent into the query rewriter does).
> And also then the "if (!hasModifyingCTE)" test on the code following
> immediately below it can be removed.
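>
> (Roughly, the extra check was just
>     Assert(!parsetree->hasModifyingCTE);
> inside the QSRC_ORIGINAL block, and it never failed.)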
>
> Regards,
> Greg Nancarrow
> Fujitsu Australia
>


Re: Improvements and additions to COPY progress reporting

2021-02-12 Thread Matthias van de Meent
On Thu, 11 Feb 2021 at 15:44, Bharath Rupireddy
 wrote:
>
>
> On Thu, Feb 11, 2021, 8:08 PM Josef Šimánek  wrote:
>> I have split it out since it is meant to be the start of progress
>> reporting testing in general. If you think it is better considered part
>> of COPY testing, feel free to move it to the already existing COPY test
>> files. There's no real reason to keep it separate if not needed.
>
>
> +1 to move those test cases to existing copy test files.

Thanks for your reviews. PFA v4 of the patchset, in which the tests
are put into copy.sql (well, input/copy.source). This also adds tests
for correctly reporting COPY ... FROM 'file'.

I've changed the format of the raised notice from manually naming each
column to calling to_jsonb and removing the unstable columns from the
reported value; the output should therefore be stable, and will directly
reflect any changes to the view.
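
For reference, the test trigger now builds the reported value roughly like
this (an illustration of the approach; the exact form is in the attached
patch):

    select (to_jsonb(r) - '{pid,relid,datid}'::text[]) as value
      from pg_stat_progress_copy r
     where pid = pg_backend_pid();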

With regards,

Matthias van de Meent.
From f7d761f6774753d4914d0dbc80effbb1ab09b58e Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Mon, 8 Feb 2021 17:36:00 +0100
Subject: [PATCH v4 4/6] Add a command column to the copy progress view

This allows filtering on COPY FROM / COPY TO for progress reporting, and makes
it possible to determine the further meaning of the columns involved.
---
 doc/src/sgml/monitoring.sgml | 10 ++
 src/backend/catalog/system_views.sql |  3 +++
 src/backend/commands/copyfrom.c  |  1 +
 src/backend/commands/copyto.c|  1 +
 src/include/commands/progress.h  |  5 +
 src/test/regress/expected/rules.out  |  5 +
 6 files changed, 25 insertions(+)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 940e9dcee4..ca84b53896 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6544,6 +6544,16 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
   
  
 
+ 
+  
+   command text
+  
+  
+   The command that is running: COPY FROM, or
+   COPY TO.
+  
+ 
+
  
   
bytes_processed bigint
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e7e227792c..1082b7d253 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1129,6 +1129,9 @@ CREATE VIEW pg_stat_progress_copy AS
 SELECT
 S.pid AS pid, S.datid AS datid, D.datname AS datname,
 S.relid AS relid,
+CASE S.param5 WHEN 1 THEN 'COPY FROM'
+  WHEN 2 THEN 'COPY TO'
+  END AS command,
 S.param1 AS bytes_processed,
 S.param2 AS bytes_total,
 S.param3 AS tuples_processed,
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index fb3c7e2c0c..ce343dbf80 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1428,6 +1428,7 @@ BeginCopyFrom(ParseState *pstate,
 	/* initialize progress */
 	pgstat_progress_start_command(PROGRESS_COMMAND_COPY,
   cstate->rel ? RelationGetRelid(cstate->rel) : InvalidOid);
+	pgstat_progress_update_param(PROGRESS_COPY_COMMAND, PROGRESS_COPY_COMMAND_FROM);
 	cstate->bytes_processed = 0;
 
 	/* We keep those variables in cstate. */
diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index 9ffe7a6ee3..534c091c75 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -772,6 +772,7 @@ BeginCopyTo(ParseState *pstate,
 	/* initialize progress */
 	pgstat_progress_start_command(PROGRESS_COMMAND_COPY,
   cstate->rel ? RelationGetRelid(cstate->rel) : InvalidOid);
+	pgstat_progress_update_param(PROGRESS_COPY_COMMAND, PROGRESS_COPY_COMMAND_TO);
 	cstate->bytes_processed = 0;
 
 	MemoryContextSwitchTo(oldcontext);
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 8b2b188bd5..1c30d09abb 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -138,5 +138,10 @@
 #define PROGRESS_COPY_BYTES_TOTAL 1
 #define PROGRESS_COPY_TUPLES_PROCESSED 2
 #define PROGRESS_COPY_TUPLES_EXCLUDED 3
+#define PROGRESS_COPY_COMMAND 4
+
+/* Commands of PROGRESS_COPY_COMMAND */
+#define PROGRESS_COPY_COMMAND_FROM 1
+#define PROGRESS_COPY_COMMAND_TO 2
 
 #endif
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 970f6909c2..63b5e33083 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1948,6 +1948,11 @@ pg_stat_progress_copy| SELECT s.pid,
 s.datid,
 d.datname,
 s.relid,
+CASE s.param5
+WHEN 1 THEN 'COPY FROM'::text
+WHEN 2 THEN 'COPY TO'::text
+ELSE NULL::text
+END AS command,
 s.param1 AS bytes_processed,
 s.param2 AS bytes_total,
 s.param3 AS tuples_processed,
-- 
2.20.1

From 181e95a07d7f1d3bac08f1596622e6da2519d069 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Josef=20=C5=A0im=C3=A1nek?= 
Date: Tue, 9 Feb 2021 13:06:45 +0100
Subject: 

Re: [HACKERS] logical decoding of two-phase transactions

2021-02-12 Thread Amit Kapila
On Fri, Feb 12, 2021 at 12:29 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Thursday, February 11, 2021 5:10 PM Peter Smith  
> wrote:
> > Please find attached the new 2PC patch set v39*
> I started to review the patchset,
> so let me give some comments I have at this moment.
>
> (1)
>
> File : v39-0007-Support-2PC-txn-tests-for-concurrent-aborts.patch
> Modification :
>
> @@ -620,6 +666,9 @@ pg_decode_change(LogicalDecodingContext *ctx, 
> ReorderBufferTXN *txn,
> }
> txndata->xact_wrote_changes = true;
>
> +   /* For testing concurrent  aborts */
> +   test_concurrent_aborts(data);
> +
> class_form = RelationGetForm(relation);
> tupdesc = RelationGetDescr(relation);
>
> Comment : There is unnecessary whitespace in comments like the above in v37-007.
> Please also check functions such as pg_decode_change(), pg_decode_truncate(),
> and pg_decode_stream_truncate().
> I suggest you align the code format with pgindent.
>

This patch (v39-0007-Support-2PC-txn-tests-for-concurrent-aborts.patch)
is mostly for dev-testing purposes. We don't intend to commit it, as it
has a lot of timing-dependent tests and I am not sure it is valuable
enough at this stage. So we can ignore cosmetic comments on this patch
for now.

-- 
With Regards,
Amit Kapila.




Re: repeated decoding of prepared transactions

2021-02-12 Thread Amit Kapila
On Fri, Feb 12, 2021 at 1:10 AM Robert Haas  wrote:
>
> On Thu, Feb 11, 2021 at 5:37 AM Amit Kapila  wrote:
> > to explain the exact case to you which is not very apparent. The basic
> > idea is that we ship/replay all transactions where commit happens
> > after the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see
> > atop snapbuild.c for details. Now, for transactions where prepare is
> > before snapshot state SNAPBUILD_CONSISTENT and commit prepared is
> > after SNAPBUILD_CONSISTENT, we need to send the entire transaction
> > including prepare at the commit time.
>
> This might be a dumb question, but: why?
>
> Is this because the effects of the prepared transaction might
> otherwise be included neither in the initial synchronization of the
> data nor in any subsequently decoded transaction, thus leaving the
> replica out of sync?
>

Yes.
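
To spell out the case with a hypothetical timeline (the gid 'px' is made up
for illustration):

  PREPARE TRANSACTION 'px';   -- decoded before the snapshot reaches
                              -- SNAPBUILD_CONSISTENT, so it can't be shipped
  ...                         -- snapshot reaches SNAPBUILD_CONSISTENT
  COMMIT PREPARED 'px';       -- decoded after the consistent point

If we shipped only the commit prepared here, the replica would never see the
changes of 'px', so we send the entire transaction, including the prepare, at
commit time.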

-- 
With Regards,
Amit Kapila.




Re: Asynchronous Append on postgres_fdw nodes.

2021-02-12 Thread Kyotaro Horiguchi
At Wed, 10 Feb 2021 21:31:15 +0900, Etsuro Fujita  
wrote in 
> On Wed, Feb 10, 2021 at 7:31 PM Etsuro Fujita  wrote:
> > Attached is an updated version of the patch.  Sorry for the delay.
> 
> I noticed that I forgot to add new files.  :-(.  Please find attached
> an updated patch.

Thanks for the new version.

It seems too specific to async Append, so I regard it as a PoC of the
mechanism.

It creates a hash table keyed by connection umid to record the planids
run on the connection, triggered by the core planner via a dedicated
API function.  It seems to me that ConnCacheEntry.state could hold that,
and the hash would not be needed at all.
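
A rough sketch of what I mean (names hypothetical; the exact shape depends
on how the patch defines the state struct):

|	/* kept in ConnCacheEntry's state, instead of the separate hash: */
|	Bitmapset  *planids;		/* plan node ids run on this connection */
|
|	/* registering a plan id would then just be: */
|	entry->state.planids = bms_add_member(entry->state.planids, plan_node_id);

Now, looking at the checks in postgresReconsiderAsyncForeignScan():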

| postgresReconsiderAsyncForeignScan(ForeignScanState *node, AsyncContext *acxt)
| {
| ...
|	/*
|	 * If the connection used for the ForeignScan node is used in other parts
|	 * of the query plan tree except async subplans of the parent Append node,
|	 * disable async execution of the ForeignScan node.
|	 */
|	if (!bms_is_subset(fsplanids, asyncplanids))
|		return false;

This would be a reasonable restriction.

|	/*
|	 * If the subplans of the Append node are all async-capable, and use the
|	 * same connection, then we won't execute them asynchronously.
|	 */
|	if (requestor->as_nasyncplans == requestor->as_nplans &&
|		!bms_nonempty_difference(asyncplanids, fsplanids))
|		return false;

Is this the correct restriction?  I understand that the currently
intended restriction is that one connection accepts at most one FDW-scan
node.  This looks like something different...

(Sorry, time's up for now.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center