Re: doc review for v14

2020-12-28 Thread Michael Paquier
On Tue, Dec 29, 2020 at 01:59:58PM +1300, Thomas Munro wrote:
> LGTM.

Thanks, I have done this one then.
--
Michael




Re: [HACKERS] logical decoding of two-phase transactions

2020-12-28 Thread Masahiko Sawada
Hi Ajin,

On Wed, Dec 23, 2020 at 6:38 PM Ajin Cherian  wrote:
>
> On Tue, Dec 22, 2020 at 8:59 PM Amit Kapila  wrote:
> >
> > On Tue, Dec 22, 2020 at 2:51 PM Ajin Cherian  wrote:
> > >
> > > On Sat, Dec 19, 2020 at 2:13 PM Amit Kapila  
> > > wrote:
> > >
> > > > Okay, I have changed the rollback_prepare API as discussed above and
> > > > accordingly handle the case where rollback is received without prepare
> > > > in apply_handle_rollback_prepared.
> > >
> > >
> > > I have reviewed and tested your new patchset. I agree with all the
> > > changes that you have made, and I have tested quite a few scenarios;
> > > they seem to be working as expected.
> > > No major comments, but some minor observations:
> > >
> > > Patch 1:
> > > logical.c: 984
> > > Comment should be "rollback prepared" rather than "abort prepared".
> > >
> >
> > Agreed.
>
> Changed.
>
> >
> > > Patch 2:
> > > decode.c: 737: The comments in the header of DecodePrepare seem out of
> > > place; I think they should describe what the function does rather
> > > than what it does not.
> > >
> >
> > Hmm, I have written it because it is important to explain the theory
> > of concurrent aborts, as that is not quite obvious. Also, the
> > functionality is quite similar to DecodeCommit, and the comments inside
> > the function clearly explain any differences, so I'm not sure what
> > else we can write. Do you have any suggestions?
>
> I have slightly re-worded it. Have a look.
>
> >
> > > reorderbuffer.c: 2422: It looks like pgindent has mangled the
> > > comments; the numbering is no longer aligned.
> > >
> >
> > Yeah, I had also noticed that, but I'm not sure there is a better
> > alternative because we don't want to change it after each pgindent
> > run. We might want to use (a), (b) ... notation instead, but otherwise
> > there is no big problem with how it is.
>
> Leaving this as is.
>
> >
> > > Patch 5:
> > > worker.c: 753: Typo: change "dont" to "don't"
> > >
> >
> > Okay.
>
> Changed.
>
> >
> > > Patch 6: logicaldecoding.sgml
> > > The logicaldecoding example is no longer correct. It was true prior to
> > > the changes done to replay prepared transactions after a restart; now
> > > the whole transaction will get decoded again after the COMMIT
> > > PREPARED.
> > >
> > > postgres=# COMMIT PREPARED 'test_prepared1';
> > > COMMIT PREPARED
> > > postgres=# select * from
> > > pg_logical_slot_get_changes('regression_slot', NULL, NULL,
> > > 'two-phase-commit', '1');
> > >     lsn    | xid |                    data
> > > -----------+-----+---------------------------------------------
> > >  0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529
> > > (1 row)
> > >
> >
> > Agreed.
>
> Changed.
>
> >
> > > Patch 8:
> > > worker.c: 2798 :
> > > worker.c: 3445 : disabling two-phase in tablesync worker.
> > > Considering the new design of multiple commits in tablesync, do we
> > > need to disable two-phase in tablesync?
> > >
> >
> > No, but let Peter's patch get committed then we can change it.
>
> OK, leaving it.
>
> > Can you please update the patch for the points we agreed upon?
>
> Changed and attached.

Thank you for updating the patches!

I realized that this patch is not yet registered for the next
CommitFest[1], which starts in a couple of days. I found the old entry
for this patch[2], but it's marked as "Returned with feedback". Although
this patch is being reviewed actively, I suggest adding it before
2021-01-01 AoE[3] so that cfbot can also test your patch.

Regards,

[1] https://commitfest.postgresql.org/31/
[2] https://commitfest.postgresql.org/22/944/
[3] https://en.wikipedia.org/wiki/Anywhere_on_Earth

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: New IndexAM API controlling index vacuum strategies

2020-12-28 Thread Peter Geoghegan
On Mon, Dec 28, 2020 at 11:20 PM Peter Geoghegan  wrote:
> > That's a very good result in terms of skipping lazy_vacuum_heap(). How
> > much did the table and indexes bloat? Also, I'm curious about which
> > test in choose_vacuum_strategy() turned vacuum_heap on: the 130 test,
> > or the test for whether maintenance_work_mem space is running out? And
> > what was the impact on clearing all-visible bits?
>
> The pgbench_accounts heap table and 3 out of 4 of its indexes (i.e.
> all indexes except "abalance_ruin") had zero growth. They did not even
> become larger by 1 block. As I often say when talking about work in
> this area, this is not a quantitative difference -- it's a qualitative
> difference. (If they grew even a tiny amount, say by only 1 block,
> further growth is likely to follow.)

I forgot to say: I don't know what the exact impact was on the VM bit
setting, but I doubt that it was noticeably worse for the patch. It
cannot have been better, though.

It's inherently almost impossible to keep most of the VM bits set for
long with this workload. Perhaps VM bit setting would be improved with
workloads that have some HOT updates, but as I mentioned this workload
only had non-HOT updates (except in a tiny number of cases where
abalance did not change, just by random luck).

I also forgot to say that the maintenance_work_mem test wasn't that
relevant, though I believe it triggered once. maintenance_work_mem was
set very high (5GB).

Here is a link with more detailed information, in case that is
interesting: 
https://drive.google.com/file/d/1TqpAQnqb4SMMuhehD8ELpf6Cv9A8ux2E/view?usp=sharing

-- 
Peter Geoghegan




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-12-28 Thread Masahiko Sawada
Hi Simon,

On Mon, Nov 30, 2020 at 1:53 AM Simon Riggs  wrote:
>
> On Fri, 20 Nov 2020 at 15:29, Alvaro Herrera  wrote:
> >
> > Note on heap_prepare_freeze_tuple()'s fifth parameter, it's not valid to
> > pass OldestXmin; you need a multixact limit there, not an Xid limit.  I
> > think the return value of GetOldestMultiXactId is a good choice.  AFAICS
> > this means that you'll need to add a new output argument to
> > vacuum_set_xid_limits (You *could* call GetOldestMultiXactId again, but
> > it seems a waste).
> >
> > Maybe it's time for vacuum_set_xid_limits to have a caller's-stack-
> > allocated struct for input and output values, rather than so many
> > arguments and output pointers to fill.
>
> The idea was to maximise freezing because we are already dirtying this
> data block, so the reasoning doesn't extend directly to multixacts.
> Being more active with multixacts could easily cause more churn there.
>
> So I'm changing the patch to only work with xids, not both xids and
> multixacts. If this gets committed, we can think more about multixacts
> and whether to optimize freezing them in the same situation or not.
>

You sent in your patch, one_freeze_then_max_freeze.v7.patch to
pgsql-hackers on Nov 30, but you did not post it to the next
CommitFest[1].  If this was intentional, then you need to take no
action.  However, if you want your patch to be reviewed as part of the
upcoming CommitFest, then you need to add it yourself before
2021-01-01 AoE[2]. Thanks for your contributions.

Regards,

[1] https://commitfest.postgresql.org/31/
[2] https://en.wikipedia.org/wiki/Anywhere_on_Earth

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: Detecting File Damage & Inconsistencies

2020-12-28 Thread Masahiko Sawada
Hi Simon,

On Wed, Nov 18, 2020 at 2:14 AM Simon Riggs  wrote:
>
> On Fri, 13 Nov 2020 at 11:24, Simon Riggs  wrote:
> >
> > On Fri, 13 Nov 2020 at 00:50, tsunakawa.ta...@fujitsu.com
> >  wrote:
> > >
> > > From: Simon Riggs 
> > > > If a rogue user/process is suspected, this would allow you to identify
> > > > more easily the changes made by specific sessions/users.
> > >
> > > Isn't that kind of auditing a job of pgAudit or log_statement = mod?  Or, 
> > > does "more easily" mean that you find pgAudit complex to use and/or 
> > > log_statement's overhead is big?
> >
> > Well, I designed pgaudit, so yes, I think pgaudit is useful.
> >
> > However, pgaudit works at the statement level, not the data level. So
> > using pgaudit to locate data rows that have changed is fairly hard.
> >
> > What I'm proposing is an option to add 16 bytes onto each COMMIT
> > record, which is considerably less than turning on full auditing in
> > pgaudit. This option would allow identifying data at the row level, so
> > you could for example find all rows changed by specific sessions.
> > Also, because it is stored in WAL it will show updates that might no
> > longer exist in the database because the changed row versions might
> > have been vacuumed away. So pgaudit will tell you that happened, but
> > having extra info in WAL is important also.
> >
> > So thank you for the question because it has allowed me to explain why
> > it is useful and important.
>
> Patch attached to implement "wal_sessioninfo" option.

You sent in your patch, wal_sessioninfo.v2.patch to pgsql-hackers on
Nov 18, but you did not post it to the next CommitFest[1].  If this
was intentional, then you need to take no action.  However, if you
want your patch to be reviewed as part of the upcoming CommitFest,
then you need to add it yourself before 2021-01-01 AoE[2]. Thanks for
your contributions.

Regards,

[1] https://commitfest.postgresql.org/31/
[2] https://en.wikipedia.org/wiki/Anywhere_on_Earth

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: New IndexAM API controlling index vacuum strategies

2020-12-28 Thread Peter Geoghegan
On Mon, Dec 28, 2020 at 10:26 PM Masahiko Sawada  wrote:
> The second test, checking if maintenance_work_mem space is running out,
> also makes sense to me. Perhaps another idea would be to compare the
> number of collected garbage tuples to the total number of heap tuples,
> so that we do lazy_vacuum_heap() only when we're likely to reclaim a
> certain amount of garbage in the table.

Right. Or to consider if this is an anti-wraparound VACUUM might be
nice -- maybe we should skip index vacuuming + lazy_vacuum_heap() if
and only if we're under pressure to advance datfrozenxid for the whole
DB, and really need to hurry up. (I think that we could both probably
think of way too many ideas like this one.)

> That's a very good result in terms of skipping lazy_vacuum_heap(). How
> much did the table and indexes bloat? Also, I'm curious about which
> test in choose_vacuum_strategy() turned vacuum_heap on: the 130 test,
> or the test for whether maintenance_work_mem space is running out? And
> what was the impact on clearing all-visible bits?

The pgbench_accounts heap table and 3 out of 4 of its indexes (i.e.
all indexes except "abalance_ruin") had zero growth. They did not even
become larger by 1 block. As I often say when talking about work in
this area, this is not a quantitative difference -- it's a qualitative
difference. (If they grew even a tiny amount, say by only 1 block,
further growth is likely to follow.)

The "abalance_ruin" index was smaller with the patch. Its size started
off at 253,779 blocks with both the patch and master branch (which is
very small, because of B-Tree deduplication). By the end of 2 pairs of
runs for the patch (2 3 hour runs) the size grew to 502,016 blocks.
But with the master branch it grew to 540,090 blocks. (For reference,
the primary key on pgbench_accounts started out at 822,573 blocks.)

My guess is that this would compare favorably with "magic VACUUM" [1]
(I refer to a thought experiment that is useful for understanding the
principles behind bottom-up index deletion). The fact that
"abalance_ruin" becomes bloated probably doesn't have that much to do
with MVCC versioning. In other words, I suspect that the index
wouldn't be that much smaller in a traditional two-phase locking database
system with the same workload. Words like "bloat" and "fragmentation"
have always been overloaded/ambiguous in highly confusing ways, which
is why I find it useful to compare a real world workload/benchmark to
some kind of theoretical ideal behavior.

This test wasn't particularly sympathetic to the patch because most of
the indexes (all but the PK) were useless -- they did not get used by
query plans. So the final size of "abalance_ruin" (or any other index)
isn't even the truly important thing IMV (the benchmark doesn't
advertise the truly important thing for me). The truly important thing
is that the worst case number of versions *per logical row* is tightly
controlled. It doesn't necessarily matter all that much if 30% of an
index's tuples are garbage, as long as the garbage tuples are evenly
spread across all logical rows in the table (in practice it's pretty
unlikely that that would actually happen, but it's possible in theory,
and if it did happen it really wouldn't be so bad).

[1] 
https://postgr.es/m/CAH2-Wz=rpkb5vxs7az+v8t3ve75v_dkgro+w4nwd64e9yic...@mail.gmail.com
--
Peter Geoghegan




Re: New IndexAM API controlling index vacuum strategies

2020-12-28 Thread Masahiko Sawada
On Tue, Dec 29, 2020 at 7:06 AM Peter Geoghegan  wrote:
>
> On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan  wrote:
> > I experimented with this today, and I think that it is a good way to
> > do it. I like the idea of choose_vacuum_strategy() understanding that
> > heap pages that are subject to many non-HOT updates have a "natural
> > extra capacity for LP_DEAD items" that it must care about directly (at
> > least with non-default heap fill factor settings). My early testing
> > shows that it will often take a surprisingly long time for the most
> > heavily updated heap page to have more than about 100 LP_DEAD items.
>
> Attached is a rough patch showing what I did here. It was applied on
> top of my bottom-up index deletion patch series and your
> poc_vacuumstrategy.patch patch. This patch was written as a quick and
> dirty way of simulating what I thought would work best for bottom-up
> index deletion for one specific benchmark/test, which was
> non-hot-update heavy. This consists of a variant pgbench with several
> indexes on pgbench_accounts (almost the same as most other bottom-up
> deletion benchmarks I've been running). Only one index is "logically
> modified" by the updates, but of course we still physically modify all
> indexes on every update. I set fill factor to 90 for this benchmark,
> which is an important factor for how your VACUUM patch works during
> the benchmark.
>
> This rough supplementary patch includes VACUUM logic that assumes (but
> doesn't check) that the table has heap fill factor set to 90 -- see my
> changes to choose_vacuum_strategy(). This benchmark is really about
> stability over time more than performance (though performance is also
> improved significantly). I wanted to keep both the table/heap and the
> logically unmodified indexes (i.e. 3 out of 4 indexes on
> pgbench_accounts) exactly the same size *forever*.
>
> Does this make sense?

Thank you for sharing the patch. That makes sense.

+	if (!vacuum_heap)
+	{
+		if (maxdeadpage > 130 ||
+			/* Also check if maintenance_work_mem space is running out */
+			vacrelstats->dead_tuples->num_tuples >
+			vacrelstats->dead_tuples->max_tuples / 2)
+			vacuum_heap = true;
+	}

The second test, checking if maintenance_work_mem space is running out,
also makes sense to me. Perhaps another idea would be to compare the
number of collected garbage tuples to the total number of heap tuples,
so that we do lazy_vacuum_heap() only when we're likely to reclaim a
certain amount of garbage in the table.
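
Just to illustrate, a rough sketch of that alternative check might look
like the following; the 2% threshold and the old_live_tuples field are
assumptions made for the sketch, not something taken from the patch:

if (!vacuum_heap)
{
	/* illustration only: reclaim once garbage reaches ~2% of the table */
	double		reltuples = Max(vacrelstats->old_live_tuples, 1);

	if (vacrelstats->dead_tuples->num_tuples >= reltuples * 0.02)
		vacuum_heap = true;
}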

>
> Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that
> pg_stat_database shows an almost ~28% reduction in blks_read after an
> overnight run for the patch series (it was 508,820,699 for the
> patches, 705,282,975 for the master branch). I think that the VACUUM
> component is responsible for some of that reduction. There were 11
> VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap()
> (these 7 VACUUM operations all only did a btbulkdelete() call for the
> one problematic index on the table, named "abalance_ruin", which my
> supplementary patch has hard-coded knowledge of).

That's a very good result in terms of skipping lazy_vacuum_heap(). How
much did the table and indexes bloat? Also, I'm curious about which
test in choose_vacuum_strategy() turned vacuum_heap on: the 130 test,
or the test for whether maintenance_work_mem space is running out? And
what was the impact on clearing all-visible bits?

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: New IndexAM API controlling index vacuum strategies

2020-12-28 Thread Masahiko Sawada
On Mon, Dec 28, 2020 at 4:42 PM Peter Geoghegan  wrote:
>
> On Sun, Dec 27, 2020 at 10:55 PM Masahiko Sawada  
> wrote:
> > > As you said, the next question must be: How do we teach lazy vacuum to
> > > not do what gets requested by amvacuumcleanup() when it cannot respect
> > > the wishes of one individual indexes, for example when the
> > > accumulation of LP_DEAD items in the heap becomes a big problem in
> > > itself? That really could be the thing that forces full heap
> > > vacuuming, even with several indexes.
> >
> > You mean requested by amvacuumstrategy(), not by amvacuumcleanup()? I
> > think amvacuumstrategy() affects only ambulkdelete(). But when all
> > ambulkdelete() calls were skipped at the request of the index AMs, we
> > might want to skip amvacuumcleanup() as well.
>
> No, I was asking about how we should decide to do a real VACUUM even
> (a real ambulkdelete() call) when no index asks for it because
> bottom-up deletion works very well in every index. Clearly we will
> need to eventually remove remaining LP_DEAD items from the heap at
> some point if nothing else happens -- eventually LP_DEAD items in the
> heap alone will force a traditional heap vacuum (which will still have
> to go through indexes that have not grown, just to be safe/avoid
> recycling a TID that's still in the index).
>
> Postgres heap fillfactor is 100 by default, though I believe it's 90
> in another well known DB system. If you set Postgres heap fill factor
> to 90 you can fit a little over 200 LP_DEAD items in the "extra space"
> left behind in each heap page after initial bulk loading/INSERTs take
> place that respect our lower fill factor setting. This is about 4x the
> number of initial heap tuples in the pgbench_accounts table -- it's
> quite a lot!
>
> If we pessimistically assume that all updates are non-HOT updates,
> we'll still usually have enough space for each logical row to get
> updated several times before the heap page "overflows". Even when
> there is significant skew in the UPDATEs, the skew is not noticeable
> at the level of individual heap pages. We have a surprisingly large
> general capacity to temporarily "absorb" extra garbage LP_DEAD items
> in heap pages this way. Nobody really cared about this extra capacity
> very much before now, because it did not help with the big problem of
> index bloat that you naturally see with this workload. But that big
> problem may go away soon, and so this extra capacity may become
> important at the same time.
>
> I think that it could make sense for lazy_scan_heap() to maintain
> statistics about the number of LP_DEAD items remaining in each heap
> page (just local stack variables). From there, it can pass the
> statistics to the choose_vacuum_strategy() function from your patch.
> Perhaps choose_vacuum_strategy() will notice that the heap page with
> the most LP_DEAD items encountered within lazy_scan_heap() (among
> those encountered so far in the event of multiple index passes) has
> too many LP_DEAD items -- this indicates that there is a danger that
> some heap pages will start to "overflow" soon, which is now a problem
> that lazy_scan_heap() must think about. Maybe if the "extra space"
> left by applying heap fill factor (with settings below 100) is
> insufficient to fit perhaps 2/3 of the LP_DEAD items needed on the
> heap page that has the most LP_DEAD items (among all heap pages), we
> stop caring about what amvacuumstrategy()/the indexes say. So we do
> the right thing for the heap pages, while still mostly avoiding index
> vacuuming and the final heap pass.

Agreed. I like the idea that we calculate how many LP_DEAD items we
can absorb based on the extra space left by applying the fill factor.
Since there is a limit on the maximum number of line pointers in a
heap page, we might need to consider that limit in the calculation.

From another point of view, given that the maximum number of heap tuples
in one 8kB heap page (MaxHeapTuplesPerPage) is 291, I think how bad it is
to store LP_DEAD items in a heap page varies depending on the tuple size.

For example, suppose the tuple size is 200: we can store 40 tuples in
one heap page if there are no LP_DEAD items at all. Even if there are
150 LP_DEAD items on the page, we are still able to store 37 tuples,
because we can still have 141 more line pointers at most -- enough for
the maximum number of heap tuples possible when there are no LP_DEAD
items -- and we have (8192 - (4 * 150)) bytes of space to store tuples
(with line pointers). That is, having 150 LP_DEAD items ends up causing
an overflow of 3 tuples. On the other hand, suppose the tuple size is
40: we can store 204 tuples in one heap page if there are no LP_DEAD
items at all. If there are 150 LP_DEAD items on the page, we are able
to store only 141 tuples. That is, having 150 LP_DEAD items ends up
causing an overflow of 63 tuples. I think the impact on table bloat of
absorbing LP_DEAD items is larger in the latter case.
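
A back-of-the-envelope sketch of that calculation (the page header is
ignored and a 4-byte line pointer is charged per stored tuple, so the
exact counts differ slightly from the figures above):

#include <stdio.h>

#define BLCKSZ              8192    /* heap page size */
#define ITEMID_SIZE         4       /* sizeof(ItemIdData) */
#define MAX_TUPLES_PER_PAGE 291     /* MaxHeapTuplesPerPage for 8kB pages */

/*
 * Roughly how many tuples of a given size still fit on a page that
 * already holds ndead LP_DEAD line pointers?
 */
static int
tuples_that_fit(int tuple_size, int ndead)
{
    int     by_space = (BLCKSZ - ndead * ITEMID_SIZE) /
                       (tuple_size + ITEMID_SIZE);
    int     by_line_pointers = MAX_TUPLES_PER_PAGE - ndead;

    return by_space < by_line_pointers ? by_space : by_line_pointers;
}

int
main(void)
{
    printf("200-byte tuples: %d with no LP_DEAD, %d with 150 LP_DEAD\n",
           tuples_that_fit(200, 0), tuples_that_fit(200, 150));
    printf(" 40-byte tuples: %d with no LP_DEAD, %d with 150 LP_DEAD\n",
           tuples_that_fit(40, 0), tuples_that_fit(40, 150));
    return 0;
}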

The larger the tuple size, the 

Re: HOT chain bug in latestRemovedXid calculation

2020-12-28 Thread Peter Geoghegan
On Tue, Dec 22, 2020 at 9:52 AM Peter Geoghegan  wrote:
> ISTM that heap_compute_xid_horizon_for_tuples() calculates
> latestRemovedXid for index deletion callers without sufficient care.
> The function only follows line pointer redirects, which is necessary
> but not sufficient to visit all relevant heap tuple headers -- it also
> needs to traverse HOT chains, but that doesn't happen. AFAICT
> heap_compute_xid_horizon_for_tuples() might therefore fail to produce
> a sufficiently recent latestRemovedXid value for the index deletion
> operation as a whole. This might in turn lead to the REDO routine
> (e.g. btree_xlog_delete()) doing conflict processing incorrectly
> during hot standby.

I attach a concrete fix for this bug. My basic approach is to
restructure the code so that it follows both LP_REDIRECT redirects as
well as HOT chain t_ctid page offset numbers in the same loop. This is
loosely based on similar loops in places like heap_hot_search_buffer()
and heap_prune_chain().

I also replaced the old "conjecture" comments about why it is that our
handling of LP_DEAD line pointers is correct. These comments match
what you'll see in the original 2010 commit (commit a760893d), which
is inappropriate. At the time Simon wrote that comment, a
latestRemovedXid return value of InvalidTransactionId had roughly the
opposite meaning. The meaning changed significantly just a few months
after a760893d, in commit 52010027efc. The old "conjecture" comments
were intended to convey something along the lines of "here is why it
is currently thought necessary to take this conservative approach with
LP_DEAD line pointers". But the comment should say almost the opposite
thing now -- something close to "here is why it's okay that we take
the seemingly lax approach of skipping LP_DEAD line pointers -- that's
actually safe".

The patch has new comments that explain the issue by comparing it to
the approach taken by index AMs such as nbtree during VACUUM
proper/bulk deletion. Index vacuuming can rely on heap pruning records
having generated latestRemovedXid values that obviate any need for
nbtree VACUUM records to explicitly log their own latestRemovedXid
value (which is why nbtree VACUUM cannot include extra "recently dead"
index tuples). This makes it obvious, I think -- LP_DEAD line pointers
in heap pages come from pruning, and pruning generates its own
latestRemovedXid at precisely the point that line pointers become
LP_DEAD.

I would like to commit this patch to v12, the first version that did
this process during original execution rather than in REDO routines.
It seems worth keeping the back branches in sync here. I suspect that
the old approach used prior to Postgres 12 has subtle buglets caused
by inconsistencies during Hot Standby (I have heard rumors). I'd
rather just not go there given the lack of field reports about this
problem.

-- 
Peter Geoghegan


v1-0001-Fix-latestRemovedXid-index-deletion-calculation.patch
Description: Binary data


Re: Let's start using setenv()

2020-12-28 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Dec 29, 2020 at 4:21 PM Tom Lane  wrote:
>> I also changed our unsetenv() emulations to make them return an int
>> error indicator, as per POSIX.  I have no desire to change the call
>> sites to check for errors, but it seemed like our compatibility stubs
>> should be compatible with the spec.  (Note: I think that unsetenv()
>> did return void in old BSD versions, before POSIX standardized it.
>> So that might be a real reason not to change the callers.)

> +1 for conformance.  I suppose it's out of spec that unsetenv() can
> return -1 with errno = ENOMEM (from malloc()), but that hardly
> matters.  Hmm... could a static buffer be used for that clobbering
> trick?

Hm, hadn't thought about that particularly.  I did think about
hacking setenv.c to try to reduce memory leakage from repeated
sets of the same variable (basically, try to overwrite the existing
environ member whenever the new value is no longer than the old).
But on the whole I think it'd all be wasted effort.  Neither setenv.c
nor unsetenv.c will ever be used on any machine that anyone would care
about performance of.  If I were willing to retire gaur I'd vote for
just nuking both of them.
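
(For illustration, a minimal putenv()-based setenv() emulation along
these lines -- a sketch, not the actual src/port/setenv.c -- makes the
leak obvious: every repeated set of the same variable has to abandon the
previously malloc'd string, because putenv() may still be pointing at it.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* sketch of a putenv()-based fallback; not the real src/port/setenv.c */
int
pg_setenv(const char *name, const char *value, int overwrite)
{
	char	   *envstr;

	if (!overwrite && getenv(name) != NULL)
		return 0;

	envstr = malloc(strlen(name) + strlen(value) + 2);
	if (envstr == NULL)
		return -1;				/* POSIX would want errno = ENOMEM here */

	sprintf(envstr, "%s=%s", name, value);

	/* envstr is intentionally never freed: putenv() keeps the pointer */
	return putenv(envstr) == 0 ? 0 : -1;
}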

> For the note in parens:  it looks like the 3 main BSDs all switched
> from the historical void function to the POSIX one 12-18 years
> ago[1][2][3], so I wouldn't worry about that.  Glibc made a similar
> change 19 years ago.

Ah, thanks for the research.  I'd found the glibc change from references
in current Linux man pages, but I didn't run down the BSD history.

(My other pet dinosaur prairiedog still has void unsetenv(), btw,
presumably because macOS is based on turn-of-the-century NetBSD.
Modern macOS does follow POSIX here; not sure when they changed.)

regards, tom lane




Re: Let's start using setenv()

2020-12-28 Thread Thomas Munro
On Tue, Dec 29, 2020 at 4:21 PM Tom Lane  wrote:
> Back in 2001 we decided to prefer putenv() over setenv() because the
> latter wasn't in POSIX (SUSv2) at the time.  That decision has been
> overtaken by events: more recent editions of POSIX not only provide
> setenv() but recommend it over putenv().  So I think it's time to
> change our policy and prefer setenv().  We've had previous discussions
> about that but nobody did the legwork yet.
>
> The attached patch provides the infrastructure to allow using setenv().
> I added a src/port/ implementation of setenv() for any machines that
> haven't caught up with POSIX lately.  (I've tested that code on my old
> dinosaur gaur, and I would not be surprised if that is the only machine
> anywhere that ever runs it.  But I could be wrong.)  I also extended
> win32env.c to support setenv().

+1, nice modernisation.

> I also changed our unsetenv() emulations to make them return an int
> error indicator, as per POSIX.  I have no desire to change the call
> sites to check for errors, but it seemed like our compatibility stubs
> should be compatible with the spec.  (Note: I think that unsetenv()
> did return void in old BSD versions, before POSIX standardized it.
> So that might be a real reason not to change the callers.)

+1 for conformance.  I suppose it's out of spec that unsetenv() can
return -1 with errno = ENOMEM (from malloc()), but that hardly
matters.  Hmm... could a static buffer be used for that clobbering
trick?

For the note in parens:  it looks like the 3 main BSDs all switched
from the historical void function to the POSIX one 12-18 years
ago[1][2][3], so I wouldn't worry about that.  Glibc made a similar
change 19 years ago.

[1] 
https://github.com/NetBSD/src/commit/13dee93fb3a93189a74a3705a5f7cd1c6b290125
[2] 
https://github.com/openbsd/src/commit/471b62eeaa7f3a18c0aa98b5d605e5cec1625b62
[3] 
https://github.com/freebsd/freebsd/commit/196b6346ba4e13a3f7679e2de3317b6aa65983df




Let's start using setenv()

2020-12-28 Thread Tom Lane
Back in 2001 we decided to prefer putenv() over setenv() because the
latter wasn't in POSIX (SUSv2) at the time.  That decision has been
overtaken by events: more recent editions of POSIX not only provide
setenv() but recommend it over putenv().  So I think it's time to
change our policy and prefer setenv().  We've had previous discussions
about that but nobody did the legwork yet.

The attached patch provides the infrastructure to allow using setenv().
I added a src/port/ implementation of setenv() for any machines that
haven't caught up with POSIX lately.  (I've tested that code on my old
dinosaur gaur, and I would not be surprised if that is the only machine
anywhere that ever runs it.  But I could be wrong.)  I also extended
win32env.c to support setenv().

I haven't made any serious effort to expunge all uses of putenv() in this
version of the patch; I just wanted to exercise setenv() in both backend
and frontend.  Seeing that POSIX considers putenv() to be semi-deprecated,
maybe we should try to eliminate all calls outside the (un)setenv
implementations, but first it'd be good to see if win32env.c works.

I also changed our unsetenv() emulations to make them return an int
error indicator, as per POSIX.  I have no desire to change the call
sites to check for errors, but it seemed like our compatibility stubs
should be compatible with the spec.  (Note: I think that unsetenv()
did return void in old BSD versions, before POSIX standardized it.
So that might be a real reason not to change the callers.)

regards, tom lane

diff --git a/configure b/configure
index 11a4284e5b..07529825d1 100755
--- a/configure
+++ b/configure
@@ -15959,12 +15959,29 @@ case $host_os in
 # Windows uses a specialised env handler
 mingw*)
 
+$as_echo "#define HAVE_SETENV 1" >>confdefs.h
+
+
 $as_echo "#define HAVE_UNSETENV 1" >>confdefs.h
 
+ac_cv_func_setenv=yes
 ac_cv_func_unsetenv=yes
 ;;
 *)
-ac_fn_c_check_func "$LINENO" "unsetenv" "ac_cv_func_unsetenv"
+ac_fn_c_check_func "$LINENO" "setenv" "ac_cv_func_setenv"
+if test "x$ac_cv_func_setenv" = xyes; then :
+  $as_echo "#define HAVE_SETENV 1" >>confdefs.h
+
+else
+  case " $LIBOBJS " in
+  *" setenv.$ac_objext "* ) ;;
+  *) LIBOBJS="$LIBOBJS setenv.$ac_objext"
+ ;;
+esac
+
+fi
+
+ac_fn_c_check_func "$LINENO" "unsetenv" "ac_cv_func_unsetenv"
 if test "x$ac_cv_func_unsetenv" = xyes; then :
   $as_echo "#define HAVE_UNSETENV 1" >>confdefs.h
 
diff --git a/configure.ac b/configure.ac
index fc523c6aeb..7f855783f4 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1757,11 +1757,13 @@ fi
 case $host_os in
 # Windows uses a specialised env handler
 mingw*)
+AC_DEFINE(HAVE_SETENV, 1, [Define to 1 because replacement version used.])
 AC_DEFINE(HAVE_UNSETENV, 1, [Define to 1 because replacement version used.])
+ac_cv_func_setenv=yes
 ac_cv_func_unsetenv=yes
 ;;
 *)
-AC_REPLACE_FUNCS([unsetenv])
+AC_REPLACE_FUNCS([setenv unsetenv])
 ;;
 esac
 
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index 515ae95fe1..64d6e8 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -1059,24 +1059,16 @@ pg_GSS_recvauth(Port *port)
 		/*
 		 * Set default Kerberos keytab file for the Krb5 mechanism.
 		 *
-		 * setenv("KRB5_KTNAME", pg_krb_server_keyfile, 0); except setenv()
-		 * not always available.
+		 * We don't overwrite any existing environment variable ... is that
+		 * really a sane choice?
 		 */
-		if (getenv("KRB5_KTNAME") == NULL)
+		if (setenv("KRB5_KTNAME", pg_krb_server_keyfile, 0) != 0)
 		{
-			size_t		kt_len = strlen(pg_krb_server_keyfile) + 14;
-			char	   *kt_path = malloc(kt_len);
-
-			if (!kt_path ||
-				snprintf(kt_path, kt_len, "KRB5_KTNAME=%s",
-						 pg_krb_server_keyfile) != kt_len - 2 ||
-				putenv(kt_path) != 0)
-			{
-				ereport(LOG,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
-				return STATUS_ERROR;
-			}
+			/* We assume the error must be "out of memory" */
+			ereport(LOG,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of memory")));
+			return STATUS_ERROR;
 		}
 	}
 
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index c39d67645c..148a0984ea 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -105,20 +105,6 @@ char	   *localized_full_months[12 + 1];
 static bool CurrentLocaleConvValid = false;
 static bool CurrentLCTimeValid = false;
 
-/* Environment variable storage area */
-
-#define LC_ENV_BUFSIZE (NAMEDATALEN + 20)
-
-static char lc_collate_envbuf[LC_ENV_BUFSIZE];
-static char lc_ctype_envbuf[LC_ENV_BUFSIZE];
-
-#ifdef LC_MESSAGES
-static char lc_messages_envbuf[LC_ENV_BUFSIZE];
-#endif
-static char 

Re: Parallel Full Hash Join

2020-12-28 Thread Thomas Munro
On Mon, Dec 28, 2020 at 9:49 PM Masahiko Sawada  wrote:
> On Thu, Nov 5, 2020 at 7:34 AM Melanie Plageman
>  wrote:
> > I've attached a patch with the corrections I mentioned upthread.
> > I've gone ahead and run pgindent, though, I can't say that I'm very
> > happy with the result.
> >
> > I'm still not quite happy with the name
> > BarrierArriveAndDetachExceptLast(). It's so literal. As you said, there
> > probably isn't a nice name for this concept, since it is a function with
> > the purpose of terminating parallelism.
>
> You sent in your patch, v3-0001-Support-Parallel-FOJ-and-ROJ.patch to
> pgsql-hackers on Nov 5, but you did not post it to the next
> CommitFest[1].  If this was intentional, then you need to take no
> action.  However, if you want your patch to be reviewed as part of the
> upcoming CommitFest, then you need to add it yourself before
> 2021-01-01 AOE[2]. Also, rebasing to the current HEAD may be required
> as almost two months have passed since this patch was submitted. Thanks
> for your contributions.

Thanks for this reminder, Sawada-san.  I had some feedback I meant to
post in November but didn't get around to it:

+bool
+BarrierArriveAndDetachExceptLast(Barrier *barrier)

I committed this part (7888b099).  I've attached a rebase of the rest
of Melanie's v3 patch.

+WAIT_EVENT_HASH_BATCH_PROBE,

That new wait event isn't needed (we can't and don't wait).

  *  PHJ_BATCH_PROBING         -- all probe
 - *  PHJ_BATCH_DONE            -- end
 +
 + *  PHJ_BATCH_DONE            -- queries not requiring inner fill done
 + *  PHJ_BATCH_FILL_INNER_DONE -- inner fill completed, all queries done

Would it be better/tidier to keep _DONE as the final phase?  That is,
to switch around these two final phases.  Or does that make it too
hard to coordinate the detach-and-cleanup logic?

+/*
+ * ExecPrepHashTableForUnmatched
+ *		set up for a series of ExecScanHashTableForUnmatched calls
+ *		return true if this worker is elected to do the unmatched inner scan
+ */
+bool
+ExecParallelPrepHashTableForUnmatched(HashJoinState *hjstate)

Comment name doesn't match function name.
From 9199bfcfa84acbcfeb9a8d3c21962096c7ff645c Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 5 Nov 2020 16:20:24 +1300
Subject: [PATCH v4 1/2] Parallel Hash Full/Right Join.

Previously we allowed PHJ only for inner and left outer joins.
Share the hash table match bits between backends, so we can also
handle full and right joins.  In order to do that without introducing
any deadlock risks, for now we drop down to a single process for the
unmatched scan at the end of each batch.  Other processes detach and
look for another batch to help with.  If there aren't any more batches,
they'll finish the hash join early, making work distribution suboptimal.
Improving that might require bigger executor changes.

Also switch the unmatched tuple scan to work in memory-chunk order,
rather than bucket order.  This prepares for potential later
improvements that would use chunks as parallel grain, and seems to be a
little cache-friendlier than the bucket-scan scheme in the
meantime.

Author: Melanie Plageman 
Reviewed-by: Thomas Munro 
Discussion: https://postgr.es/m/CA%2BhUKG%2BA6ftXPz4oe92%2Bx8Er%2BxpGZqto70-Q_ERwRaSyA%3DafNg%40mail.gmail.com
---
 src/backend/executor/nodeHash.c | 190 ++--
 src/backend/executor/nodeHashjoin.c |  61 
 src/backend/optimizer/path/joinpath.c   |  14 +-
 src/include/executor/hashjoin.h |  15 +-
 src/include/executor/nodeHash.h |   3 +
 src/test/regress/expected/join_hash.out |  56 ++-
 src/test/regress/sql/join_hash.sql  |  23 ++-
 7 files changed, 274 insertions(+), 88 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index ea69eeb2a1..36cc752163 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -510,6 +510,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 		hashtable->spaceAllowed * SKEW_HASH_MEM_PERCENT / 100;
 	hashtable->chunks = NULL;
 	hashtable->current_chunk = NULL;
+	hashtable->current_chunk_idx = 0;
 	hashtable->parallel_state = state->parallel_state;
 	hashtable->area = state->ps.state->es_query_dsa;
 	hashtable->batches = NULL;
@@ -2053,9 +2054,58 @@ ExecPrepHashTableForUnmatched(HashJoinState *hjstate)
 	 * hj_CurTuple: last tuple returned, or NULL to start next bucket
 	 *--
 	 */
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+
 	hjstate->hj_CurBucketNo = 0;
 	hjstate->hj_CurSkewBucketNo = 0;
 	hjstate->hj_CurTuple = NULL;
+	hashtable->current_chunk = hashtable->chunks;
+	hashtable->current_chunk_idx = 0;
+}
+
+/*
+ * ExecPrepHashTableForUnmatched
+ *		set up for a series of ExecScanHashTableForUnmatched calls
+ *		return true if this worker is elected to do the unmatched inner scan
+ */
+bool
+ExecParallelPrepHashTableForUnmatched(HashJoinState *hjstate)
+{
+	HashJoinTable 

Re: doc review for v14

2020-12-28 Thread Thomas Munro
On Thu, Dec 24, 2020 at 9:12 PM Michael Paquier  wrote:
> On Mon, Dec 21, 2020 at 10:11:53PM -0600, Justin Pryzby wrote:
> >  Specifies the amount of memory that should be allocated at server
> > -startup time for use by parallel queries.  When this memory region is
> > +startup for use by parallel queries.  When this memory region is
> >  insufficient or exhausted by concurrent queries, new parallel queries
> >  try to allocate extra shared memory temporarily from the operating
> >  system using the method configured with
> >  dynamic_shared_memory_type, which may be slower due
> >  to memory management overheads.  Memory that is allocated at startup
> > -time with min_dynamic_shared_memory is affected by
> > +with min_dynamic_shared_memory is affected by
> >  the huge_pages setting on operating systems where
> >  that is supported, and may be more likely to benefit from larger pages
> >  on operating systems where that is managed automatically.
>
> The current formulation is not that confusing, but I agree that this
> is an improvement.  Thomas, you are behind this one.  What do you
> think?

LGTM.




Re: doc review for v14

2020-12-28 Thread Michael Paquier
On Mon, Dec 28, 2020 at 11:42:03AM +0100, Magnus Hagander wrote:
> Not as much "tightly controlled" as "nobody's really bothered to grant any
> permissions".

Magnus, do I have access to that?  This is the second time I have run
across this issue, but I don't really know if I should act on it or
not :)
--
Michael




Re: Proposed patch for key managment

2020-12-28 Thread Fabien COELHO



I want to repeat here what I said in another thread:


I think ultimately we will need three commands to control the keys.
First, there is the cluster_key_command, which we have now.  Second, I
think we will need an optional command which returns random bytes ---
this would allow users to get random bytes from a different source than
that used by the server code.

Third, we will probably need a command that returns the data encryption
keys directly, either heap/index or WAL keys, probably based on key
number --- you pass the key number you want, and the command returns the
data key.  There would not be a cluster key in this case, but the
command could still prompt the user for perhaps a password to the KMS
server. It could not be used if any of the previous two commands are
used. I assume an HMAC would still be stored in the pg_cryptokeys
directory to check that the right key has been returned.

I thought we should implement the first command, because it will
probably be the most common and easiest to use, and then see what people
want added.


There is also a fourth option where the command returns multiple keys,
one per line of hex digits.  That could be written in shell script, but
it would be fragile and complex.  It could be written in Perl, but that
would add a new language requirement for this feature.  It could be
written in C, but that would limits its flexibility because changes
would require a recompile, and you would probably need that C file to
call external scripts to tailor input like we do now from the server.

You could actually write a full implementation of what we do on the server
side in client code, but pg_alterckey would not work, since it would not
know the data format, so we would need another cluster key alter for that.

It could be written as a C extension, but that would also have C's
limitations.  In summary, having the server do most of the complex work
for the default case seems best, and eventually allowing the ability for
the client to do everything seems ideal.  I think we need more input
before we go beyond what we do now.


As I said in the commit thread, I disagree with this approach because it 
pushes toward no design, or a partial or possibly bad one.


I think that an API should be carefully thought about, without assumptions 
about the underlying cryptography (algorithm, key lengths, modes, how keys 
are derived and stored, and so on), and its usefulness be demonstrated by 
actually being used for one implementation which would be what is 
currently being proposed in the patch, and possibly others thrown in for 
free.


The implementations should not have to be in any particular language: 
Shell, Perl, Python, C should be possible.


After giving it more thought during the day, I think that only one
command and a basic protocol is needed. Maybe something as simple as

  /path/to/command --options arguments…

With a basic (text? binary?) protocol on stdin/stdout (?) for the 
different functions. What the command actually does (connect to a remote 
server, ask for a master password, open some other database, whatever) 
should be irrelevant to pg, which would just get and pass a bunch of bytes 
to functions, which could use them for keys, secrets, whatever, and be 
easily replaceable.
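
As a purely hypothetical illustration of that shape -- the command path, 
the "get keyname" line protocol and the function below are invented for 
the example, not taken from any patch -- the caller side could be as thin 
as:

#include <stdio.h>
#include <string.h>

/*
 * Run an external key-manager command and read one line of hex key
 * material from its stdout.  Hypothetical sketch: a real implementation
 * would avoid going through the shell and would report errors properly.
 */
static int
fetch_key_hex(const char *command, const char *key_name,
			  char *buf, size_t buflen)
{
	char		cmdline[1024];
	FILE	   *fh;

	snprintf(cmdline, sizeof(cmdline), "%s get %s", command, key_name);
	fh = popen(cmdline, "r");
	if (fh == NULL)
		return -1;
	if (fgets(buf, (int) buflen, fh) == NULL)
	{
		pclose(fh);
		return -1;
	}
	buf[strcspn(buf, "\n")] = '\0';		/* strip trailing newline */
	return pclose(fh) == 0 ? 0 : -1;
}

int
main(void)
{
	char		key_hex[256];

	/* "/path/to/keymgr" and "wal-key-0" are made up for the example */
	if (fetch_key_hex("/path/to/keymgr", "wal-key-0",
					  key_hex, sizeof(key_hex)) == 0)
		printf("got key material: %s\n", key_hex);
	return 0;
}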


The API should NOT make assumptions about the cryptographic design, what 
depends about what, where things are stored… ISTM that Pg should only care 
about naming keys, holding them when created/retrieved (but not create 
them), basically interacting with the key manager, passing the stuff to 
functions for encryption/decryption seen as black boxes.


I may have suggested something along these lines at the beginning of the 
key management thread, probably. Not going this way implicitly implies 
making some assumptions which may or may not suit other use cases, so it 
makes them specific, not generic. I do not think pg should do that.


--
Fabien.

Re: New IndexAM API controlling index vacuum strategies

2020-12-28 Thread Peter Geoghegan
On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan  wrote:
> I experimented with this today, and I think that it is a good way to
> do it. I like the idea of choose_vacuum_strategy() understanding that
> heap pages that are subject to many non-HOT updates have a "natural
> extra capacity for LP_DEAD items" that it must care about directly (at
> least with non-default heap fill factor settings). My early testing
> shows that it will often take a surprisingly long time for the most
> heavily updated heap page to have more than about 100 LP_DEAD items.

Attached is a rough patch showing what I did here. It was applied on
top of my bottom-up index deletion patch series and your
poc_vacuumstrategy.patch patch. This patch was written as a quick and
dirty way of simulating what I thought would work best for bottom-up
index deletion for one specific benchmark/test, which was
non-hot-update heavy. This consists of a variant pgbench with several
indexes on pgbench_accounts (almost the same as most other bottom-up
deletion benchmarks I've been running). Only one index is "logically
modified" by the updates, but of course we still physically modify all
indexes on every update. I set fill factor to 90 for this benchmark,
which is an important factor for how your VACUUM patch works during
the benchmark.

This rough supplementary patch includes VACUUM logic that assumes (but
doesn't check) that the table has heap fill factor set to 90 -- see my
changes to choose_vacuum_strategy(). This benchmark is really about
stability over time more than performance (though performance is also
improved significantly). I wanted to keep both the table/heap and the
logically unmodified indexes (i.e. 3 out of 4 indexes on
pgbench_accounts) exactly the same size *forever*.

Does this make sense?

Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that
pg_stat_database shows an almost ~28% reduction in blks_read after an
overnight run for the patch series (it was 508,820,699 for the
patches, 705,282,975 for the master branch). I think that the VACUUM
component is responsible for some of that reduction. There were 11
VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap()
(these 7 VACUUM operations all only did a btbulkdelete() call for the
one problematic index on the table, named "abalance_ruin", which my
supplementary patch has hard-coded knowledge of).

-- 
Peter Geoghegan


0007-btvacuumstrategy-bottom-up-index-deletion-changes.patch
Description: Binary data


Re: trailing junk in numeric literals

2020-12-28 Thread Tom Lane
Peter Eisentraut  writes:
> I was surprised to find that this doesn't error:
> => select 100a;
>a
> -
>   100

> I suspect this and similar cases used to error before aliases without AS 
> were introduced.  But now this seems possibly problematic.  Should we 
> try to handle this better?

Meh.  I think you'd get more brickbats than kudos if you start insisting
on a space there.

I'm too lazy to try to decipher the SQL spec right now, but ISTR that
it insists on whitespace between a numeric literal and an identifier.
So strictly speaking this SQL code is nonstandard anyway.  But our
lexer has always been forgiving about not requiring space if it's
not logically necessary to separate tokens.  I doubt trying to
change that would improve matters.

regards, tom lane




trailing junk in numeric literals

2020-12-28 Thread Peter Eisentraut

I was surprised to find that this doesn't error:

=> select 100a;
  a
-----
 100

I suspect this and similar cases used to error before aliases without AS 
were introduced.  But now this seems possibly problematic.  Should we 
try to handle this better?





Re: Better client reporting for "immediate stop" shutdowns

2020-12-28 Thread Tom Lane
Andres Freund  writes:
> I don't think it needs to be done right now, but I again want to suggest
> it'd be nice if we split log levels into a bitmask. If we bits, separate
> from the log level, for do-not-log-to-client and do-not-log-to-server
> some of this code would imo look nicer.

Hmm, maybe.  I agree that would be better done as a separate patch though.

I had a thought while looking at elog.c: we could further reduce the risk
of quickdie() crashing if we make it do what elog.c does when it gets into
error recursion trouble:

error_context_stack = NULL;
debug_query_string = NULL;

Not invoking error context callbacks would significantly reduce the
footprint of code that can be reached from quickdie's ereports, and
the current call stack isn't really relevant to a report of SIGQUIT
anyway.  The argument for not reporting debug_query_string is a little
thinner, but if that string is long it could result in additional
palloc work inside elog.c, thus increasing the amount of stuff that
has to work to get the report out.

regards, tom lane




Re: Better client reporting for "immediate stop" shutdowns

2020-12-28 Thread Andres Freund
Hi,

On 2020-12-28 13:25:14 -0500, Tom Lane wrote:
> The most straightforward way to do that is to introduce a new error
> level.  Having to renumber existing levels is a bit of a pain, but
> I'm not aware of anything that should break in source-code terms.
> We make similar ABI breaks in every major release.

I don't see a problem either.


>  	/* Select default errcode based on elevel */
>  	if (elevel >= ERROR)
>  		edata->sqlerrcode = ERRCODE_INTERNAL_ERROR;
> -	else if (elevel == WARNING)
> +	else if (elevel >= WARNING)
>  		edata->sqlerrcode = ERRCODE_WARNING;
>  	else
>  		edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION;

> @@ -2152,6 +2157,7 @@ write_eventlog(int level, const char *line, int len)
>  			eventlevel = EVENTLOG_INFORMATION_TYPE;
>  			break;
>  		case WARNING:
> +		case WARNING_CLIENT_ONLY:
>  			eventlevel = EVENTLOG_WARNING_TYPE;
>  			break;
>  		case ERROR:
> [...]

I don't think it needs to be done right now, but I again want to suggest
it'd be nice if we split log levels into a bitmask. If we had bits,
separate from the log level, for do-not-log-to-client and
do-not-log-to-server, some of this code would imo look nicer.
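
Something like the following, purely as a sketch (names and values
invented here, not a proposal for the actual elog.h numbering):

/* severity in the low byte, routing flags above it */
#define ELEVEL_MASK			0x00ff
#define ELOG_NO_CLIENT		0x0100	/* don't send to the connected client */
#define ELOG_NO_SERVER_LOG	0x0200	/* don't write to the server log */

/* the immediate-shutdown case could then be spelled: */
ereport(WARNING | ELOG_NO_SERVER_LOG,
		(errcode(ERRCODE_ADMIN_SHUTDOWN),
		 errmsg("terminating connection due to immediate shutdown command")));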


Looks good to me.

Greetings,

Andres Freund




Re: Better client reporting for "immediate stop" shutdowns

2020-12-28 Thread Tom Lane
Andres Freund  writes:
> On 2020-12-26 13:37:15 -0500, Tom Lane wrote:
>>> I'd like to not log all these repeated messages into the server
>>> log. It's quite annoying to have to digg through thousands of lines of
>>> repeated "terminating connection..."

>> Hm.  That's an orthogonal issue, but certainly worth considering.
>> There are a couple of levels we could consider:
>> 1. Just make the logged messages less verbose (they certainly don't
>> need the DETAIL and HINT lines).
>> 2. Suppress the log entries altogether.

> My vote would be #2, with the same reasoning as yours.

The most straightforward way to do that is to introduce a new error
level.  Having to renumber existing levels is a bit of a pain, but
I'm not aware of anything that should break in source-code terms.
We make similar ABI breaks in every major release.

regards, tom lane

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index d35c5020ea..e9504cdab9 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2791,6 +2791,10 @@ quickdie(SIGNAL_ARGS)
 	 *
 	 * Ideally these should be ereport(FATAL), but then we'd not get control
 	 * back to force the correct type of process exit.
+	 *
+	 * When responding to a postmaster-issued signal, send the message only to
+	 * the client; sending to the server log just creates log spam, plus it's
+	 * more code that we need to hope will work in a signal handler.
 	 */
 	switch (GetQuitSignalReason())
 	{
@@ -2802,7 +2806,7 @@ quickdie(SIGNAL_ARGS)
 			break;
 		case PMQUIT_FOR_CRASH:
 			/* A crash-and-restart cycle is in progress */
-			ereport(WARNING,
+			ereport(WARNING_CLIENT_ONLY,
 					(errcode(ERRCODE_CRASH_SHUTDOWN),
 					 errmsg("terminating connection because of crash of another server process"),
 					 errdetail("The postmaster has commanded this server process to roll back"
@@ -2814,7 +2818,7 @@ quickdie(SIGNAL_ARGS)
 			break;
 		case PMQUIT_FOR_STOP:
 			/* Immediate-mode stop */
-			ereport(WARNING,
+			ereport(WARNING_CLIENT_ONLY,
 					(errcode(ERRCODE_ADMIN_SHUTDOWN),
 					 errmsg("terminating connection due to immediate shutdown command")));
 			break;
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 3558e660c7..9a69038b80 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -202,6 +202,11 @@ is_log_level_output(int elevel, int log_min_level)
 		if (log_min_level == LOG || log_min_level <= ERROR)
 			return true;
 	}
+	else if (elevel == WARNING_CLIENT_ONLY)
+	{
+		/* never sent to log, regardless of log_min_level */
+		return false;
+	}
 	else if (log_min_level == LOG)
 	{
 		/* elevel != LOG */
@@ -453,7 +458,7 @@ errstart(int elevel, const char *domain)
 	/* Select default errcode based on elevel */
 	if (elevel >= ERROR)
 		edata->sqlerrcode = ERRCODE_INTERNAL_ERROR;
-	else if (elevel == WARNING)
+	else if (elevel >= WARNING)
 		edata->sqlerrcode = ERRCODE_WARNING;
 	else
 		edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION;
@@ -2152,6 +2157,7 @@ write_eventlog(int level, const char *line, int len)
 			eventlevel = EVENTLOG_INFORMATION_TYPE;
 			break;
 		case WARNING:
+		case WARNING_CLIENT_ONLY:
 			eventlevel = EVENTLOG_WARNING_TYPE;
 			break;
 		case ERROR:
@@ -3109,6 +3115,7 @@ send_message_to_server_log(ErrorData *edata)
 break;
 			case NOTICE:
 			case WARNING:
+			case WARNING_CLIENT_ONLY:
 syslog_level = LOG_NOTICE;
 break;
 			case ERROR:
@@ -3484,6 +3491,7 @@ error_severity(int elevel)
 			prefix = gettext_noop("NOTICE");
 			break;
 		case WARNING:
+		case WARNING_CLIENT_ONLY:
 			prefix = gettext_noop("WARNING");
 			break;
 		case ERROR:
diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h
index e8f04a1691..d2bdfa0be3 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -40,19 +40,19 @@
 #define WARNING		19			/* Warnings.  NOTICE is for expected messages
 								 * like implicit sequence creation by SERIAL.
 								 * WARNING is for unexpected messages. */
-#define ERROR		20			/* user error - abort transaction; return to
+#define WARNING_CLIENT_ONLY	20	/* Warnings to be sent to client as usual, but
+								 * never to the server log. */
+#define ERROR		21			/* user error - abort transaction; return to
 								 * known state */
 /* Save ERROR value in PGERROR so it can be restored when Win32 includes
  * modify it.  We have to use a constant rather than ERROR because macros
  * are expanded only when referenced outside macros.
  */
 #ifdef WIN32
-#define PGERROR		20
+#define PGERROR		21
 #endif
-#define FATAL		21			/* fatal error - abort process */
-#define PANIC		22			/* take down the other backends with me */
-
- /* #define DEBUG DEBUG1 */	/* Backward compatibility with pre-7.3 */
+#define FATAL		22			/* fatal error - abort process */
+#define PANIC		23			/* take down the other backends with me */
 
 
 /* macros for representing SQLSTATE strings compactly */


Re: pgsql: Fix memory leak in plpgsql's CALL processing.

2020-12-28 Thread Tom Lane
I wrote:
> so it looks like a memory clobber --- probably some data structure is
> being built in the wrong memory context.  Will look closer shortly.

Brain-dead typo ... where's my brown paper bag?

regards, tom lane




Re: pgsql: Add pg_alterckey utility to change the cluster key

2020-12-28 Thread Bruce Momjian
On Mon, Dec 28, 2020 at 10:09:11AM -0400, Fabien COELHO wrote:
> Yep, my point is that it should be possible to have the whole key management
> outside of postgres.

I think this kind of discussion has to happen in a different thread,
perhaps:


https://www.postgresql.org/message-id/flat/20201227172526.GB17037%40momjian.us#f7c6dd32a5f3af7b8615b7ebfda6e669

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: pgsql: Fix memory leak in plpgsql's CALL processing.

2020-12-28 Thread Tom Lane
Peter Eisentraut  writes:
> This change has broken the following test case (reduced from actual 
> production code):

Thanks for the report.  What I see here is

ERROR:  unsupported target type: 2139062143

so it looks like a memory clobber --- probably some data structure is
being built in the wrong memory context.  Will look closer shortly.

regards, tom lane




Re: pgsql: Add pg_alterckey utility to change the cluster key

2020-12-28 Thread Fabien COELHO


Hello Bruce,

I put the thread back on hackers.


The first two keys are stored in pg_cryptokeys/ in the data directory,
while the third one is retrieved using a GUC for validation at server
startup for the other two.


Do we necessarily have to store the first level keys within the data 
directory?  I guess that this choice has been made for performance, but 
is that really something that a user would want all the time?  AES256 
is the only option available for the data keys.  What if somebody wants 
to roll in their own encryption?


To clarify, we encrypt the data keys using AES256, but the data keys
themselves can be 128, 192, or 256 bits.


Companies can have many requirements in terms of accepting the use of
one option or another.


I think ultimately we will need three commands to control the keys.
First, there is the cluster_key_command, which we have now.  Second, I
think we will need an optional command which returns random bytes ---
this would allow users to get random bytes from a different source than
that used by the server code.

Third, we will probably need a command that returns the data encryption
keys directly, either heap/index or WAL keys, probably based on key
number --- you pass the key number you want, and the command returns the
data key.  There would not be a cluster key in this case, but the
command could still prompt the user for perhaps a password to the KMS
server. It could not be used if any of the previous two commands are
used. I assume an HMAC would still be stored in the pg_cryptokeys
directory to check that the right key has been returned.


Yep, my point is that it should be possible to have the whole key 
management outside of postgres.


This said, postgres should provide a reasonable default implementation, 
obviously, preferably by using the provided mechanism (*NOT* a direct 
internal implementation and a possible switch to something else, IMHO, 
because then it would not be tested for whether it provides the right 
level of usability).


I agree that keys need to be identified. I somehow disagree with the 
naming of the script and the implied usage.


ISTM that there could be an external interface:

 - to initialize something. It may start a suid process, it may connect to 
a remote host, it may ask for a master password, who knows?


   /path/to/init --options arguments…

the init process would return something which would be reused later on, eg 
an authentication token, or maybe a path to a socket for communication, or 
a file which contains something, or even a master/cluster key, but not 
necessarily. It may be anything. How it is passed to the next 
process/connection is an open question. Maybe on its stdin?


 - to start a process (?) which provides keys, either created (new) or 
existing (get), and possibly destroys them (or not?). The init process 
result should/could be passed somehow to this process, which may be suid 
something else. Another option would be to rely on some IPC mechanism.

I'm not sure what the best choice is.

ISTM that this process/connection could/should be persistent, with a 
simplistic text- or binary-based client/server interface. What this 
process/connection does is beyond postgres. In my mind, it could implement 
getting random data as well. I'd suggest that under no circumstances 
should the postgres process create cryptographic keys, although it should 
probably name them with some predefined length limit.


   /path/to/run --options arguments…

Then there should be a postgres-internal interface to store the results 
for local processing, retrieve them when needed, and so on, ok.


ISTM that there should also be an internal interface to load the 
cryptographic primitives. Possibly a so/dll would do, or maybe just an 
extension mechanism which would provide the necessary functions, but this 
raises the issue of bootstrapping, so maybe not so great an idea. The 
functions should probably be able to implement a counter mode, so that 
actual keys depend on the page position within the file, but what it 
really does is not postgres's concern.


A cryptographic concern for me is whether it would be possible to have 
authentication/integrity checks associated with each page. This means having 
the ability to reserve some space somewhere, possibly 8-16 bytes, in a 
page. Different algorithms could have different space requirements.


The same interface should be used by other back-end commands (pg_upgrade, 
whatever).


Somehow, the design should be abstract, without implying much, so that 
very different interfaces could be provided in terms of whether there 
exists a master key, how keys are derived, what key sizes are, what 
algorithms are used, and so on. Postgres itself should not store keys, 
only key identifiers.


I'm wondering whether replication should be able to work without some/all 
keys, so that a streaming replication could be implemented without the 
remote host being fully aware of the cryptographic keys.


Another 

Re: a misbehavior of partition row movement (?)

2020-12-28 Thread Arne Roland
While I'd agree that simply deleting with "on delete cascade" on tuple 
rerouting is a strong enough violation of the spirit of partitioning to justify 
changing that behavior, I am surprised by the idea of differentiating between 
FKs and other triggers. The way user-defined triggers work is a violation to 
the same degree; I get complaints about that on a monthly (if not weekly) 
basis. It's easy to point towards the docs, but the docs offer no solution to 
achieve the behavior that makes partitioning somewhat transparent. Most people 
I know don't partition because they like to complicate things, but just because 
they have too much data.

It isn't just a thing with after triggers. Imo before triggers are even worse: 
if we move a row between partitions, we fire all three triggers at once (on 
different leaf partitions, obviously). It's easy to point towards the docs. 
Heartbleed was well documented, but I am happy that it was fixed. I don't want 
to imply that this totally unrelated security issue has anything to do with our 
weird behavior; I just want to raise the question of whether that's a good 
thing, because frankly I haven't met anyone thus far who thought it is.
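
To make the described behavior concrete, here is a minimal, self-contained
sketch (all object names are made up for illustration). The cross-partition
UPDATE is carried out as a DELETE from the source partition plus an INSERT
into the destination partition, so the row-level triggers on the leaf
partitions see those operations in addition to the UPDATE trigger on the
source, rather than one transparent UPDATE:

CREATE TABLE t (a int) PARTITION BY LIST (a);
CREATE TABLE t1 PARTITION OF t FOR VALUES IN (1);
CREATE TABLE t2 PARTITION OF t FOR VALUES IN (2);

CREATE FUNCTION log_trig() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  -- Report which trigger fired on which leaf partition.
  RAISE NOTICE '% % on %', TG_WHEN, TG_OP, TG_TABLE_NAME;
  IF TG_OP = 'DELETE' THEN
    RETURN OLD;
  ELSE
    RETURN NEW;
  END IF;
END $$;

CREATE TRIGGER t1_log BEFORE INSERT OR UPDATE OR DELETE ON t1
  FOR EACH ROW EXECUTE FUNCTION log_trig();
CREATE TRIGGER t2_log BEFORE INSERT OR UPDATE OR DELETE ON t2
  FOR EACH ROW EXECUTE FUNCTION log_trig();

INSERT INTO t VALUES (1);
-- Moves the row from t1 to t2: the NOTICEs show the UPDATE and DELETE
-- triggers firing on t1 and the INSERT trigger firing on t2.
UPDATE t SET a = 2 WHERE a = 1;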

To me it seems the RI triggers are just a specific victim of the way we made 
triggers work in general.

What I tried to express, albeit I apparently failed: in the long run, I care 
about having triggers that make partitioning transparent.

> because what can happen
> today with CASCADE triggers does not seem like a useful behavior by
> any measure.

I wholeheartedly agree. I just want to point out, that you could state the same 
about triggers in general.

Backwards compatibility is usually a pretty compelling argument. I would still 
like to raise the question of whether this should change, because I frankly 
think the current behavior is a bad idea.

You might ask: Why am I raising this question in this thread?

In case we can not (instantly) agree on the fact that this behavior should 
change, I'd still prefer to think about making that behavior possible for other 
triggers (possibly later) as well. One possibility would be having an entry in 
the catalog to mark when the trigger should fire.

I don't want to stall your definitely useful RI-Trigger fix, but I sincerely 
believe that we have to do better with triggers in general.

If we made the condition of when to fire dependent on something besides the 
fact of whether the trigger is internal, we could at a later date choose to 
make both behaviors available if anyone makes a good case for it, even though I 
still think it's not worth it.

>I don't quite recall if the decision to implement it like this was
> based on assuming that this is what users would like to see happen in
> this case or the perceived difficulty of implementing it the other way
> around, that is, of firing AFTER UPDATE triggers in this case.

I tried to look that up, but I couldn't find any discussion about this. Do you 
have any ideas in which thread that was handled?

> Sorry, it seems you are right and the 2nd patch indeed fails to apply to 
> master.

Thank you! I hope to have a more in depth look later this week.

Regards
Arne




Re: New Table Access Methods for Multi and Single Inserts

2020-12-28 Thread Bharath Rupireddy
On Fri, Dec 25, 2020 at 8:10 AM Justin Pryzby  wrote:
> On Thu, Dec 24, 2020 at 05:48:42AM +0530, Bharath Rupireddy wrote:
> > I'm not posting the updated 0002 to 0004 patches, I plan to do so
> > after a couple of reviews happen on the design of the APIs in 0001.
> >
> > Thoughts?
>
> Are you familiar with this work ?
>
> https://commitfest.postgresql.org/31/2717/
> Reloptions for table access methods
>
> It seems like that can be relevant for your patch, and I think some of what
> your patch needs might be provided by AM opts.
>
> It's difficult to generalize AMs when we have only one, but your use-case 
> might
> be a concrete example which would help to answer some questions on the other
> thread.
>
> @Jeff: https://commitfest.postgresql.org/31/2871/

Note that I have not gone through the entire thread at [1]. From an
initial look, that patch proposes to allow different table AMs to
have custom rel options.

The v2 patch that I sent upthread [2] for new table AMs has the heap AM
multi-insert code moved inside the new heap AM implementation, and I
don't see any need for rel options there. In case any other AMs want
control over their multi-insert API implementation via rel options, I
think the proposal at [1] can be useful.

IIUC, there's no dependency as such between the new table AM patch and
the rel options thread [1]. If I'm right, can this new table AM patch
[2] be reviewed further?

Thoughts?

[1] - https://commitfest.postgresql.org/31/2717/
[2] - 
https://www.postgresql.org/message-id/CALj2ACWMnZZCu%3DG0PJkEeYYicKeuJ-X%3DSU19i6vQ1%2B%3DuXz8u0Q%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: EXPLAIN/EXPLAIN ANALYZE REFRESH MATERIALIZED VIEW

2020-12-28 Thread Bharath Rupireddy
On Tue, Dec 22, 2020 at 7:01 PM Bharath Rupireddy
 wrote:
> Currently, $subject is not allowed. We do plan the mat view query
> before every refresh. I propose to show the explain/explain analyze of
> the select part of the mat view in case of Refresh Mat View(RMV). It
> will be useful for the user to know what exactly is being planned and
> executed as part of RMV. Please note that we already have
> explain/explain analyze CTAS/Create Mat View(CMV), where we show the
> explain/explain analyze of the select part. This proposal will do the
> same thing.
>
> The behaviour can be like this:
> EXPLAIN REFRESH MATERIALIZED VIEW mv1;   --> will not refresh the mat
> view, but shows the select part's plan of mat view.
> EXPLAIN ANALYZE REFRESH MATERIALIZED VIEW mv1;   --> will refresh the
> mat view and shows the select part's plan of mat view.
>
> Thoughts? If okay, I will post a patch later.

Attaching below patches:

0001 - Rearrange Refresh Mat View Code - Currently, the function
ExecRefreshMatView in matview.c has many lines of code, which is not
good from readability and maintainability perspectives. This patch
adds a few functions and moves the code from ExecRefreshMatView into
them, making the code easier to follow.

0002 - EXPLAIN/EXPLAIN ANALYZE REFRESH MATERIALIZED VIEW support and tests.

If this proposal is useful, I have a few open points: 1) In the patch I
have added a new mat view info parameter to ExplainOneQuery(); do we
also need to add it to ExplainOneQuery_hook_type? 2) Do we document
(under the respective command pages or somewhere else) that we allow
EXPLAIN/EXPLAIN ANALYZE for a command?

Thoughts?
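
To make the proposal concrete, here is a small sketch of the intended usage
(the EXPLAIN forms are the syntax proposed in this thread, not something
released PostgreSQL accepts; the object names are made up and the plan output
is omitted since it depends on the data):

CREATE TABLE src (id int, val text);
CREATE MATERIALIZED VIEW mv1 AS SELECT id, count(*) AS cnt FROM src GROUP BY id;

-- Proposed: show the plan of the mat view's SELECT without refreshing it.
EXPLAIN REFRESH MATERIALIZED VIEW mv1;

-- Proposed: refresh the mat view and show the plan that was executed.
EXPLAIN ANALYZE REFRESH MATERIALIZED VIEW mv1;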



With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From e0422b72acbaed27182ad3816cdc921a0f962fe2 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 25 Dec 2020 14:03:17 +0530
Subject: [PATCH v1] Rearrange Refresh Mat View Code

Currently, the function ExecRefreshMatView in matview.c has many
lines of code, which is not good from readability and
maintainability perspectives. This patch adds a few functions and
moves the code from ExecRefreshMatView into them, making the code
easier to follow.
---
 src/backend/commands/matview.c | 452 -
 1 file changed, 273 insertions(+), 179 deletions(-)

diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index cfc63915f3..40cb436d16 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -64,7 +64,7 @@ static void transientrel_startup(DestReceiver *self, int operation, TupleDesc ty
 static bool transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void transientrel_shutdown(DestReceiver *self);
 static void transientrel_destroy(DestReceiver *self);
-static uint64 refresh_matview_datafill(DestReceiver *dest, Query *query,
+static uint64 refresh_matview_datafill(Oid OIDNewHeap, Query *query,
 	   const char *queryString);
 static char *make_temptable_name_n(char *tempname, int n);
 static void refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
@@ -73,6 +73,16 @@ static void refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersist
 static bool is_usable_unique_index(Relation indexRel);
 static void OpenMatViewIncrementalMaintenance(void);
 static void CloseMatViewIncrementalMaintenance(void);
+static Query *get_matview_query(RefreshMatViewStmt *stmt, Relation *rel,
+Oid *objectId);
+static Query *rewrite_refresh_matview_query(Query *dataQuery);
+static Oid get_new_heap_oid(RefreshMatViewStmt *stmt, Relation matviewRel,
+			Oid matviewOid, char *relpersistence);
+static void match_matview_with_new_data(RefreshMatViewStmt *stmt,
+		Relation matviewRel, Oid matviewOid,
+		Oid OIDNewHeap, Oid relowner,
+		int save_sec_context,
+		char relpersistence, uint64	processed);
 
 /*
  * SetMatViewPopulatedState
@@ -140,127 +150,20 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
 {
 	Oid			matviewOid;
 	Relation	matviewRel;
-	RewriteRule *rule;
-	List	   *actions;
 	Query	   *dataQuery;
-	Oid			tableSpace;
-	Oid			relowner;
 	Oid			OIDNewHeap;
-	DestReceiver *dest;
 	uint64		processed = 0;
-	bool		concurrent;
-	LOCKMODE	lockmode;
+	Oid			relowner;
 	char		relpersistence;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
 	ObjectAddress address;
 
-	/* Determine strength of lock needed. */
-	concurrent = stmt->concurrent;
-	lockmode = concurrent ? ExclusiveLock : AccessExclusiveLock;
-
-	/*
-	 * Get a lock until end of transaction.
-	 */
-	matviewOid = RangeVarGetRelidExtended(stmt->relation,
-		  lockmode, 0,
-		  RangeVarCallbackOwnsTable, NULL);
-	matviewRel = table_open(matviewOid, NoLock);
-
-	/* Make sure it is a materialized view. */
-	if (matviewRel->rd_rel->relkind != RELKIND_MATVIEW)
-		ereport(ERROR,
-(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("\"%s\" is not a materialized 

Re: On login trigger: take three

2020-12-28 Thread Masahiko Sawada
On Sat, Dec 26, 2020 at 4:04 PM Pavel Stehule  wrote:
>
>
>
> so 26. 12. 2020 v 8:00 odesílatel Pavel Stehule  
> napsal:
>>
>> Hi
>>
>>
>>> Thank you.
>>> I have applied all your fixes in on_connect_event_trigger-12.patch.
>>>
>>> Concerning enable_client_connection_trigger GUC, I think that it is really 
>>> useful: it is the fastest and simplest way to disable login triggers in case
>>> of some problems with them (not only for superuser itself, but for all 
>>> users). Yes, it can be also done using "ALTER EVENT TRIGGER DISABLE".
>>> But assume that you have a lot of databases with different login policies 
>>> enforced by on-login event triggers. And you want temporary disable them 
>>> all, for example for testing purposes.
>>> In this case GUC is most convenient way to do it.
>>>
>>
>> There was typo in patch
>>
>> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
>> index f810789..8861f1b 100644
>> --- a/doc/src/sgml/config.sgml
>> +++ b/doc/src/sgml/config.sgml
>> @@ -1,4 +1,4 @@
>> -<!-- doc/src/sgml/config.sgml -->
>> +\<!-- doc/src/sgml/config.sgml -->
>>
>> I don't have any objections against the functionality or design. I tested the 
>> performance, and there is no negative impact when this feature is not 
>> used. There is significant overhead related to plpgsql runtime 
>> initialization, but when this trigger is used, then probably some other 
>> PL/pgSQL procedures and functions will be used too, and then this overhead 
>> can be ignored.
>>
>> * make without warnings
>> * make check-world passed
>> * doc build passed
>>
>> Possible ToDo:
>>
>> The documentation could contain a note that using connect triggers in 
>> environments with short-lived sessions and very short, fast queries that do 
>> not otherwise use PL/pgSQL functions or procedures can have a negative impact 
>> on performance due to the overhead of initializing the PL/pgSQL engine.
>>
>> I'll mark this patch as ready for committers
>
>
> It looks like this patch has no entry in the commitfest app for 2021-01.
>

Yeah, please register this patch before the next CommitFest[1] starts,
2021-01-01 AoE[2].

Regards,

[1] https://commitfest.postgresql.org/31/
[2] https://en.wikipedia.org/wiki/Anywhere_on_Earth

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: Parallel Inserts in CREATE TABLE AS

2020-12-28 Thread Bharath Rupireddy
On Mon, Dec 28, 2020 at 11:24 AM vignesh C  wrote:
> Test comments are detailed in a few cases, and in a few others they are not 
> as detailed for similar kinds of tests where parallelism is selected. I felt 
> we could make the test comments consistent across the file.

Modified the test case description in the v17 patch set posted
upthread. Please have a look.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel Inserts in CREATE TABLE AS

2020-12-28 Thread Bharath Rupireddy
On Mon, Dec 28, 2020 at 1:16 AM Zhihong Yu  wrote:
> For v16-0002-Tuple-Cost-Adjustment-for-Parallel-Inserts-in-CTAS.patch:
>
> +   if (ignore &&
> +   (root->parse->CTASParallelInsInfo &
> +CTAS_PARALLEL_INS_TUP_COST_CAN_IGN))
>
> I wonder why CTAS_PARALLEL_INS_TUP_COST_CAN_IGN is checked again in the above 
> if since when ignore_parallel_tuple_cost returns true, 
> CTAS_PARALLEL_INS_TUP_COST_CAN_IGN is set already.

Sometimes we may set the flag CTAS_PARALLEL_INS_TUP_COST_CAN_IGN
before generate_useful_gather_paths, but generate_useful_gather_paths
can return without reaching cost_gather, where we reset it, for
example in the following case:

if (rel->partial_pathlist == NIL)
return;

So, for such cases, I'm resetting it here.

> + * In this function we only care Append and Gather nodes.
>
> 'care' -> 'care about'

Done.

> +   for (int i = 0; i < aps->as_nplans; i++)
> +   {
> +   parallel |= PushDownCTASParallelInsertState(dest,
> +   aps->appendplans[i],
> +   gather_exists);
>
> It seems the loop termination condition can include parallel since we can 
> come out of the loop once parallel is true.

No, we can not come out of the for loop if parallel is true, because
our intention there is to look for all the child/sub plans under
Append, and push the inserts to the Gather nodes wherever possible.

> +   if (!allow && tuple_cost_flags && gather_exists)
>
> As the above code shows, gather_exists is only checked when allow is false.

Yes, if at least one gather node exists under the Append for which the
planner would have ignored the tuple cost, and now if we don't allow
parallel inserts, we should assert that the parallelism is not picked
because of wrong parallel tuple cost enforcement.

> +* We set the flag for two cases when there is no parent path will
> +* be created(such as : limit,sort,distinct...):
>
> Please correct the grammar : there are two verbs following 'when'

Done.

> For set_append_rel_size:
>
> +   {
> +   root->parse->CTASParallelInsInfo |=
> +   CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND;
> +   }
> +   }
> +
> +   if (root->parse->CTASParallelInsInfo &
> +   CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND)
> +   {
> +   root->parse->CTASParallelInsInfo &=
> +   
> ~CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND;
>
> In the if block for childrel->rtekind == RTE_SUBQUERY, 
> CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND maybe set. Why is it cleared 
> immediately after ?

Thanks for pointing that out. It's a miss, intention is to reset it
after set_rel_size(). Corrected in the v17 patch.

> +   /* Set to this in case tuple cost needs to be ignored for Append cases. */
> +   CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND = 1 << 3
>
> Since each CTAS_PARALLEL_INS_ flag is a bit, maybe it's better to use 'turn 
> on' or similar term in the comment. Because 'set to' normally means 
> assignment.

Done.

All the above comments are addressed in the v17 patch set posted
upthread. Please have a look.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
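
For readers following along, the statement shape under discussion can be
sketched as follows (EXPLAIN of a CREATE TABLE AS is existing syntax; whether
the insert side itself runs in parallel is what the patch set reviewed above
is adding, and the table and column names are made up):

CREATE TABLE big AS
  SELECT g AS a, g % 10 AS b FROM generate_series(1, 1000000) g;

-- Plain EXPLAIN plans the CTAS without executing it, so it can be used to
-- check what kind of plan was chosen for the underlying SELECT.
EXPLAIN (COSTS OFF) CREATE TABLE big_copy AS SELECT * FROM big WHERE b = 1;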




Re: [PATCH] Identify LWLocks in tracepoints

2020-12-28 Thread Masahiko Sawada
Hi Craig,

On Sat, Dec 19, 2020 at 2:00 PM Craig Ringer
 wrote:
>
> Hi all
>
> The attached patch set follows on from the discussion in [1] "Add LWLock 
> blocker(s) information" by adding the actual LWLock* and the numeric tranche 
> ID to each LWLock related TRACE_POSTGRESQL_foo tracepoint.
>
> This does not provide complete information on blockers, because it's not 
> necessarily valid to compare any two LWLock* pointers between two process 
> address spaces. The locks could be in DSM segments, and those DSM segments 
> could be mapped at different addresses.
>
> I wasn't able to work out a sensible way to map a LWLock* to any sort of 
> (tranche-id, lock-index) because there's no requirement that locks in a 
> tranche be contiguous or known individually to the lmgr.
>
> Despite that, the patches improve the information available for LWLock 
> analysis significantly.
>
> Patch 1 fixes a bogus tracepoint where an lwlock__acquire event would be 
> fired from LWLockWaitForVar, despite that function never actually acquiring 
> the lock.
>
> Patch 2 adds the tranche id and lock pointer for each trace hit. This makes 
> it possible to differentiate between individual locks within a tranche, and 
> (so long as they aren't tranches in a DSM segment) compare locks between 
> processes. That means you can do lock-order analysis etc, which was not 
> previously especially feasible. Traces also don't have to do userspace reads 
> for the tranche name all the time, so the trace can run with lower overhead.
>
> Patch 3 adds a single-path tracepoint for all lock acquires and releases, so 
> you only have to probe the lwlock__acquired and lwlock__release events to see 
> all acquires/releases, whether conditional or otherwise. It also adds start 
> markers that can be used for timing wallclock duration of LWLock 
> acquires/releases.
>
> Patch 4 adds some comments on LWLock tranches to try to address some points I 
> found confusing and hard to understand when investigating this topic.
>

You sent in your patch to pgsql-hackers on Dec 19, but you did not
post it to the next CommitFest[1].  If this was intentional, then you
need to take no action.  However, if you want your patch to be
reviewed as part of the upcoming CommitFest, then you need to add it
yourself before 2021-01-01 AoE[2]. Thanks for your contributions.

Regards,

[1] https://commitfest.postgresql.org/31/
[2] https://en.wikipedia.org/wiki/Anywhere_on_Earth

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: OpenSSL connection setup debug callback issue

2020-12-28 Thread Masahiko Sawada
Hi Daniel,

On Thu, Dec 10, 2020 at 10:43 PM Daniel Gustafsson  wrote:
>
> I went looking at the SSL connection state change information callback we
> install when setting up connections with OpenSSL, and I wasn't getting the
> state changes I expected.  Turns out we install it at the tail end of setting
> up the connection so we miss most of the calls.  Moving it to the beginning of
> be_tls_open_server allows us to catch the handshake etc.  I also extended it 
> by
> printing the human readable state change message available from OpenSSL to 
> make
> the logs more detailed (SSL_state_string_long has existed since 0.9.8).
>
> A randomly selected sequence from a src/test/ssl testrun with the callback
> moved but not extended with state information:
>
> LOG:  connection received: host=localhost port=51177
> DEBUG:  SSL: handshake start
> DEBUG:  SSL: accept loop
> DEBUG:  SSL: accept exit (-1)
> DEBUG:  SSL: accept loop
> DEBUG:  SSL: accept loop
> DEBUG:  SSL: accept loop
> DEBUG:  SSL: accept loop
> DEBUG:  SSL: accept loop
> DEBUG:  SSL: accept loop
> DEBUG:  SSL: accept exit (-1)
> DEBUG:  SSL: accept exit (-1)
> DEBUG:  SSL: accept loop
> DEBUG:  SSL: accept loop
> DEBUG:  SSL: accept loop
> DEBUG:  SSL: accept loop
> DEBUG:  SSL: accept loop
> DEBUG:  SSL: accept loop
> DEBUG:  SSL: accept loop
> DEBUG:  SSL: handshake done
> DEBUG:  SSL: accept exit (1)
> DEBUG:  SSL connection from "(anonymous)"
>
> The same sequence with the patch applied:
>
> LOG:  connection received: host=localhost port=51177
> DEBUG:  SSL: handshake start: "before/accept initialization"
> DEBUG:  SSL: accept loop: "before/accept initialization"
> DEBUG:  SSL: accept exit (-1): "SSLv2/v3 read client hello A"
> DEBUG:  SSL: accept loop: "SSLv3 read client hello A"
> DEBUG:  SSL: accept loop: "SSLv3 write server hello A"
> DEBUG:  SSL: accept loop: "SSLv3 write certificate A"
> DEBUG:  SSL: accept loop: "SSLv3 write key exchange A"
> DEBUG:  SSL: accept loop: "SSLv3 write certificate request A"
> DEBUG:  SSL: accept loop: "SSLv3 flush data"
> DEBUG:  SSL: accept exit (-1): "SSLv3 read client certificate A"
> DEBUG:  SSL: accept exit (-1): "SSLv3 read client certificate A"
> DEBUG:  SSL: accept loop: "SSLv3 read client certificate A"
> DEBUG:  SSL: accept loop: "SSLv3 read client key exchange A"
> DEBUG:  SSL: accept loop: "SSLv3 read certificate verify A"
> DEBUG:  SSL: accept loop: "SSLv3 read finished A"
> DEBUG:  SSL: accept loop: "SSLv3 write change cipher spec A"
> DEBUG:  SSL: accept loop: "SSLv3 write finished A"
> DEBUG:  SSL: accept loop: "SSLv3 flush data"
> DEBUG:  SSL: handshake done: "SSL negotiation finished successfully"
> DEBUG:  SSL: accept exit (1): "SSL negotiation finished successfully"
> DEBUG:  SSL connection from "(anonymous)"
>
> The attached contains these two changes as well as comment fixups which Heikki
> noticed.

You sent in your patch,
0001-Move-information-callback-earlier-to-capture-connect.patch to
pgsql-hackers on Dec 10, but you did not post it to the next
CommitFest[1].  If this was intentional, then you need to take no
action.  However, if you want your patch to be reviewed as part of the
upcoming CommitFest, then you need to add it yourself before
2021-01-01 AoE[2]. Thanks for your contributions.

Regards,

[1] https://commitfest.postgresql.org/31/
[2] https://en.wikipedia.org/wiki/Anywhere_on_Earth

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: [Doc Patch] Clarify that CREATEROLE roles can GRANT default roles

2020-12-28 Thread Masahiko Sawada
Hi Michael,

On Sat, Nov 28, 2020 at 7:50 AM Michael Banck  wrote:
>
> Hi,
>
> https://www.postgresql.org/docs/current/default-roles.html mentions the
> "Administrator" several times, but does not specify it further. I
> understand that it is mentioned elsewhere [1], but I think it would be
> beneficial to remind the reader on that page at least once that
> "Administrators" includes "roles with the CREATEROLE privilege" in the
> context of GRANTing and REVOKEing default privileges, e.g. with the
> attached patch.
>

You sent in your patch, default-roles-createrole.patch to
pgsql-hackers on Nov 28, but you did not post it to the next
CommitFest[1].  If this was intentional, then you need to take no
action.  However, if you want your patch to be reviewed as part of the
upcoming CommitFest, then you need to add it yourself before
2021-01-01 AoE[2]. Thanks for your contributions.

Regards,

[1] https://commitfest.postgresql.org/31/
[2] https://en.wikipedia.org/wiki/Anywhere_on_Earth

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/
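
For context, the situation the quoted documentation patch is about can be
sketched as follows (role names are made up; pg_monitor is one of the default
roles documented on the page in question):

-- As a superuser, create an administrator that is not itself a superuser
-- but does have the CREATEROLE privilege, plus an ordinary user.
CREATE ROLE admin_role LOGIN CREATEROLE;
CREATE ROLE app_user LOGIN;

-- Connected as admin_role: per the patch, a CREATEROLE role counts as an
-- "administrator" that may GRANT and REVOKE the default roles.
GRANT pg_monitor TO app_user;
REVOKE pg_monitor FROM app_user;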




Re: Rename of triggers for partitioned tables

2020-12-28 Thread Masahiko Sawada
Hi Arne,

On Fri, Nov 27, 2020 at 9:20 AM Arne Roland  wrote:
>
> I'm sorry for the spam. Here a patch with updated comments, I missed one 
> before.
>

You sent in your patch, recursive_tgrename.2.patch to pgsql-hackers on
Nov 27, but you did not post it to the next CommitFest[1].  If this
was intentional, then you need to take no action.  However, if you
want your patch to be reviewed as part of the upcoming CommitFest,
then you need to add it yourself before 2021-01-01 AoE[2]. Thanks for
your contributions.

Regards,

[1] https://commitfest.postgresql.org/31/
[2] https://en.wikipedia.org/wiki/Anywhere_on_Earth

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/
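
For context, the command whose behavior the patch touches is the existing
trigger rename; a minimal sketch (object names are made up, and the recursion
to the partitions' cloned triggers is what the patch, judging by its name, is
meant to add):

CREATE TABLE p (a int) PARTITION BY LIST (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES IN (1);
CREATE FUNCTION noop_trig() RETURNS trigger LANGUAGE plpgsql AS
  $$ BEGIN RETURN NEW; END $$;
CREATE TRIGGER trg AFTER INSERT ON p FOR EACH ROW EXECUTE FUNCTION noop_trig();

-- Renames the trigger on the partitioned table; the question in this thread
-- is whether the clones on partitions such as p1 are renamed as well.
ALTER TRIGGER trg ON p RENAME TO trg_renamed;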




Re: Disable WAL logging to speed up data loading

2020-12-28 Thread Simon Riggs
On Fri, 25 Dec 2020 at 07:09, Michael Paquier  wrote:
>
> On Thu, Dec 03, 2020 at 03:52:47AM +, tsunakawa.ta...@fujitsu.com wrote:
> > The code looks good, and the performance seems to be nice, so I
> > marked this ready for committer.
>
> FWIW, I am extremely afraid of this proposal because this is basically
> a footgun able to corrupt customer instances, and I am ready to bet
> that people would switch wal_level to none because they see a lot of
> benefits in doing so at first sight, until the host running the server
> is plugged off and they need to use pg_resetwal in urgency to bring an
> instance online.

Agreed, it is a footgun. -1 to commit the patch as-is.

The patch to avoid WAL is simple but it is dangerous for both the user
and the PostgreSQL project.

In my experience, people will use this option and when it crashes and
they lose their data, they will claim PostgreSQL broke and that they
were not stupid enough to use this option. Data loss has always been
the most serious error for PostgreSQL and our reputation for
protecting data has been hard won; it can easily be lost in a moment
of madness. Please consider how the headlines will read, bearing in
mind past incidents and negative press. Yes, we did think of this
feature already and rejected it.

If we ever did allow such an option, it must contain these things (IMHO):
* the option should be called "unsafe" or "allows_data_loss", not just
"none" (anything less than "minimal" must be insufficient or
unsafe...)
* the option must be set in the control file and be part of the same
patch, so users cannot easily edit things to hide their unsafe usage
* we must not support the option of going down to "unsafe" and then
back up again. It must be a one-way transition from "unsafe" to a
higher level, so if people want to use this for temp reporting servers
or initial loading, great, but they can't use it as a quick speed-up
for databases containing needs-to-be-safe data. Possibly the state
change might be "unsafe" -> "needs_backup" -> "minimal"... or some
other way to signal to backup.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: postgres_fdw - cached connection leaks if the associated user mapping/foreign server is dropped

2020-12-28 Thread Fujii Masao




On 2020/12/24 23:45, Fujii Masao wrote:



On 2020/12/24 23:30, Bharath Rupireddy wrote:

On Thu, Dec 24, 2020 at 7:43 PM Fujii Masao  wrote:

Even when we're in the midst of transaction, if that transaction has not used
the cached connections yet, we close them immediately. So, to make the
comment more precise, what about updating the comment as follows?

-
  After a change to a pg_foreign_server or pg_user_mapping catalog entry,
  close connections depending on that entry immediately if current
  transaction has not used those connections yet. Otherwise, mark those
  connections as invalid and then make pgfdw_xact_callback() close them
  at the end of current transaction, since they cannot be closed in the 
midst
  of a transaction using them. Closed connections will be remade at the next
  opportunity if necessary.
-


Done.


+   /*
+    * Close the connection if it's not in midst of a xact. 
Otherwise
+    * mark it as invalid so that it can be disconnected at 
the end of
+    * main xact in pgfdw_xact_callback().
+    */

Because of the same reason as the above, what about updating this comment
as follows?

-
  Close the connection immediately if it's not used yet in this transaction.
  Otherwise mark it as invalid so that pgfdw_xact_callback() can close it
  at the end of this transaction.
-


Done.

Attaching v3 patch. Please have a look. Thanks!


Thanks a lot! Barring any objection, I will commit this version.


Pushed. Thanks!

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [doc] adding way to examine the plan type of prepared statements

2020-12-28 Thread Masahiko Sawada
Hi Torikoshi-san,

On Thu, Nov 19, 2020 at 3:19 PM torikoshia  wrote:
>
> On 2020-11-18 11:04, torikoshia wrote:
> > Hi,
> >
> >
> > Currently, EXPLAIN is the only way to know whether the plan is generic
> > or custom according to the manual of PREPARE.
> >
> >   https://www.postgresql.org/docs/devel/sql-prepare.html
> >
> > After commit d05b172, we can also use pg_prepared_statements view to
> > examine the plan types.
> >
> > How about adding this explanation like the attached patch?
>
> Sorry, but on second thought, since it seems better to add
> the explanation to the current description of pg_prepared_statements,
> I modified the patch.
>

You sent in your patch, v2-0001-add-way-to-examine-plan-type.patch to
pgsql-hackers on Nov 19, but you did not post it to the next
CommitFest[1].  If this was intentional, then you need to take no
action.  However, if you want your patch to be reviewed as part of the
upcoming CommitFest, then you need to add it yourself before
2021-01-01 AoE[2]. Thanks for your contributions.

Regards,

[1] https://commitfest.postgresql.org/31/
[2] https://en.wikipedia.org/wiki/Anywhere_on_Earth

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/
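
For reference, a minimal sketch of the check the quoted proposal wants to make
discoverable from the documentation (the generic_plans and custom_plans column
names are taken from the commit referenced above; the counts shown depend on
when you look):

-- Prepare and run a statement a few times.
PREPARE q (int) AS SELECT count(*) FROM pg_class WHERE relpages > $1;
EXECUTE q(10);
EXECUTE q(20);

-- Instead of EXPLAIN-ing an EXECUTE, inspect how often a generic vs. a
-- custom plan has been chosen for each prepared statement.
SELECT name, generic_plans, custom_plans FROM pg_prepared_statements;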




Re: doc review for v14

2020-12-28 Thread Magnus Hagander
On Sun, Dec 27, 2020 at 9:26 PM Justin Pryzby  wrote:

> On Thu, Dec 24, 2020 at 05:12:02PM +0900, Michael Paquier wrote:
> > 0001-pgindent-typos.not-a-patch touches pg_bsd_indent.
>
> I'm hoping that someone will apply it there, but I realize that access to
> its
> repository is tightly controlled :)
>

Not as much "tightly controlled" as "nobody's really bothered to grant any
permissions".

I've applied the patch, thanks! While at it I fixed the indentation of the
"target" row in the patch, I think you didn't take the fix all the way :)

You may also want to submit those fixes upstream in freebsd? The typos seem
to be present at
https://github.com/freebsd/freebsd/tree/master/usr.bin/indent as well. (If
so, please include the updated version that I applied, so we don't diverge
on that)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Table AM modifications to accept column projection lists

2020-12-28 Thread Masahiko Sawada
Hi Soumyadeep,

On Sat, Nov 14, 2020 at 3:02 AM Soumyadeep Chakraborty
 wrote:
>
> Hello,
>
> This patch introduces a set of changes to the table AM APIs, making them
> accept a column projection list. That helps columnar table AMs, so that
> they don't need to fetch all columns from disk, but only the ones
> actually needed.
>
> The set of changes in this patch is not exhaustive -
> there are many more opportunities that are discussed in the TODO section
> below. Before digging deeper, we want to elicit early feedback on the
> API changes and the column extraction logic.
>
> TableAM APIs that have been modified are:
>
> 1. Sequential scan APIs
> 2. Index scan APIs
> 3. API to lock and return a row
> 4. API to fetch a single row
>
> We have seen performance benefits in Zedstore for many of the optimized
> operations [0]. This patch is extracted from the larger patch shared in
> [0].
>
> 
> Building the column projection set:
>
> In terms of building the column projection set necessary for each of
> these APIs, this patch builds off of the scanCols patch [1], which
> Ashwin and Melanie had started earlier. As noted in [1], there are cases
> where the scanCols set is not representative of the columns to be
> projected. For instance, in a DELETE .. RETURNING query, there is
> typically a sequential scan and a separate invocation of
> tuple_fetch_row_version() in order to satisfy the RETURNING clause (see
> ExecDelete()). So for a query such as:
>
>   DELETE from foo WHERE i < 100 AND j < 1000 RETURNING k, l;
>
> We need to pass the set (i, j) to the scan and (k, l) to the
> tuple_fetch_row_version() invocation. This is why we had to introduce
> the returningCols field.
>
> In the same spirit, separate column projection sets are computed for any
> operations that involve an EPQ check (INSERT, DELETE, UPDATE, row-level
> locking etc), the columns involved in an ON CONFLICT UPDATE etc.
>
> Recognizing and collecting these sets of columns is done at various
> stages: analyze and rewrite, planner and executor - depending on the
> type of operation for which the subset of columns is calculated. The
> column bitmaps are stored in different places as well - such as the ones
> for scans and RETURNING are stored in RangeTblEntry, whereas the set of
> columns for ON CONFLICT UPDATE are stored in OnConflictSetState.
>
> 
> Table AM API changes:
>
> The changes made to the table AM API, introducing the column projection
> set, come in different flavors. We would like feedback on what style
> we need to converge to or if we should use different styles depending
> on the situation.
>
> - A new function variant that takes a column projection list, such as:
>
>   TableScanDesc (*scan_begin) (Relation rel,
> Snapshot snapshot,
> int nkeys, struct ScanKeyData *key,
> ParallelTableScanDesc pscan,
> uint32 flags);
> ->
>
>   TableScanDesc (*scan_begin_with_column_projection)(Relation relation,
>  Snapshot snapshot,
>  int nkeys, struct ScanKeyData *key,
>  ParallelTableScanDesc parallel_scan,
>  uint32 flags,
>  Bitmapset *project_columns);
>
> - Modifying the existing function to take a column projection list, such
> as:
>
>   TM_Result (*tuple_lock) (Relation rel,
>  ItemPointer tid,
>  Snapshot snapshot,
>  TupleTableSlot *slot,
>  CommandId cid,
>  LockTupleMode mode,
>  LockWaitPolicy wait_policy,
>  uint8 flags,
>  TM_FailureData *tmfd);
>
> ->
>
>   TM_Result (*tuple_lock) (Relation rel,
>  ItemPointer tid,
>  Snapshot snapshot,
>  TupleTableSlot *slot,
>  CommandId cid,
>  LockTupleMode mode,
>  LockWaitPolicy wait_policy,
>  uint8 flags,
>  TM_FailureData *tmfd,
>  Bitmapset *project_cols);
>
> - A new function index_fetch_set_column_projection() to be called after
> index_beginscan() to set the column projection set, which will be used
> later by index_getnext_slot().
>
>   void (*index_fetch_set_column_projection) (struct IndexFetchTableData *data,
>  Bitmapset *project_columns);
>
> The set of columns expected by the new/modified functions is represented
> as a Bitmapset of attnums for a specific base relation. An empty/NULL
> bitmap signals to the AM that no data columns are needed. A bitmap
> containing the single element 0 indicates that we want all data columns
> to be fetched.
>
> The bitmaps do not include system columns.
>
> Additionally, the TupleTableSlots populated by functions such
> as table_scan_getnextslot(), need to be densely filled upto the highest
> numbered column in the projection list (any column not in the projection
> list should be populated with NULL). This is due to the implicit
> assumptions of the slot_get_***() APIs.
>
> 
> TODOs:
>
> - Explore opportunities to push the 

Re: PATCH: Report libpq version and configuration

2020-12-28 Thread Masahiko Sawada
Hi Craig,

On Wed, Nov 11, 2020 at 1:11 PM Craig Ringer
 wrote:
>
> On Tue, Nov 10, 2020 at 2:22 PM Craig Ringer  
> wrote:
>>
>>
>> The main things I'd really like to get in place are a way to get the version 
>> as an ELF data symbol, and a simple way to ID the binary.
>>
>> So the minimal change would be to declare:
>>
>> const char LIBPQ_VERSION_STR[] = PG_VERSION_STR;
>> const int LIBPQ_VERSION_NUM = PG_VERSION_NUM;
>>
>> then change PQgetVersion() to return LIBPQ_VERSION_NUM and add a 
>> PQgetVersionStr() that returns LIBPQ_VERSION_STR.
>>
>> That OK with you?
>
>
> Proposed minimal patch attached.

You sent in your patch, v2-0001-Add-PQlibVersionString-to-libpq.patch
to pgsql-hackers on Nov 11, but you did not post it to the next
CommitFest[1].  If this was intentional, then you need to take no
action.  However, if you want your patch to be reviewed as part of the
upcoming CommitFest, then you need to add it yourself before
2021-01-01 AOE[2]. Thanks for your contributions.

Regards,

[1] https://commitfest.postgresql.org/31/
[2] https://en.wikipedia.org/wiki/Anywhere_on_Earth

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: Disable WAL logging to speed up data loading

2020-12-28 Thread Masahiko Sawada
On Mon, Dec 28, 2020 at 4:29 PM osumi.takami...@fujitsu.com
 wrote:
>
> Hello Sawada-San
>
>
> On Monday, December 28, 2020 2:29 PM Masahiko Sawada  
> wrote:
> > On Thu, Dec 3, 2020 at 12:14 PM osumi.takami...@fujitsu.com
> >  wrote:
> > >
> > > I've made a new patch v05 that took in comments to filter out WALs
> > > more strictly and addressed some minor fixes that were discussed
> > > within past few days.
> > > Also, I changed the documentations, considering those modifications.
> >
> > From a backup management tool perspective, how can they detect that
> > wal_level has been changed to ‘none' (and back to the normal)? IIUC once
> > we changed wal_level to none, old backups that are taken before setting to
> > ‘none’ can be used only for restoring the database to the point before the
> > LSN where setting 'wal_level = none'. The users can neither restore the
> > database to any points in the term of 'wal_level = none' nor use an old 
> > backup
> > to restore the database to the point after LSN where setting 'wal_level =
> > none’. I think we might need to provide a way to detect the changes other
> > than reading XLOG_PARAMETER_CHANGE.
> In the past, we discussed the backup management tool aspect in [1]
> and concluded that this should be a separate patch from this thread,
> because comparing the wal_level changes between snapshots
> applies to wal_level = minimal, too. Please have a look at the "second idea"
> in the e-mail at [1] and the responses to it.
>

Thank you for telling me about the discussion!

Has the discussion already started on another thread? If not, I think
it might be better to consider it in parallel. We can verify
beforehand that the proposed solutions will really work well together
with backup management tools. And from the user perspective, I wonder
if they would like to see this feature in the same release where
wal_level = none is introduced. Otherwise, the fact that some backup
management tools won't be able to work together with wal_level = none
will be a big restriction for users. That patch would probably not be
a blocker for this patch, but it would facilitate the discussion.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: pgsql: Fix memory leak in plpgsql's CALL processing.

2020-12-28 Thread Peter Eisentraut
This change has broken the following test case (reduced from actual 
production code):


CREATE OR REPLACE PROCEDURE prc_inout_test_inner(INOUT x integer)
LANGUAGE plpgsql
AS
$procedure$
BEGIN
  COMMIT;
  x := 2;
END;
$procedure$;

CREATE OR REPLACE PROCEDURE prc_inout_test_main()
LANGUAGE plpgsql
AS
$procedure$
DECLARE
  y INTEGER := 1;
BEGIN
  CALL prc_inout_test_inner(y);
END;
$procedure$;

CALL prc_inout_test_main();

With the patch, it fails like this:

ERROR:  XX000: unsupported target type: 0
CONTEXT:  PL/pgSQL function prc_inout_test_main() line 5 at CALL
LOCATION:  exec_move_row_from_fields, pl_exec.c:7196


On 2020-09-29 17:18, Tom Lane wrote:

Fix memory leak in plpgsql's CALL processing.

When executing a CALL or DO in a non-atomic context (i.e., not inside
a function or query), plpgsql creates a new plan each time through,
as a rather hacky solution to some resource management issues.  But
it failed to free this plan until exit of the current procedure or DO
block, resulting in serious memory bloat in procedures that called
other procedures many times.  Fix by remembering to free the plan,
and by being more honest about restoring the previous state (otherwise,
recursive procedure calls have a problem).

There was also a smaller leak associated with recalculation of the
"target" list of output variables.  Fix that by using the statement-
lifespan context to hold non-permanent values.

Back-patch to v11 where procedures were introduced.

Pavel Stehule and Tom Lane

Discussion: 
https://postgr.es/m/cafj8prdiiu1dqym+_p4_gutwm76knju7z9opwaybjtc0nqg...@mail.gmail.com

Branch
--
master

Details
---
https://git.postgresql.org/pg/commitdiff/a6b1f5365d58356b5d42829e9cd89a6c725d7a0a

Modified Files
--
src/pl/plpgsql/src/pl_exec.c | 91 ++--
1 file changed, 70 insertions(+), 21 deletions(-)







Re: Parallel copy

2020-12-28 Thread vignesh C
On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila  wrote:
>
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas  wrote:
> >
> > On 02/11/2020 08:14, Amit Kapila wrote:
> > > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas  
> > > wrote:
> > >>
> > >> In this design, you don't need to keep line boundaries in shared memory,
> > >> because each worker process is responsible for finding the line
> > >> boundaries of its own block.
> > >>
> > >> There's a point of serialization here, in that the next block cannot be
> > >> processed, until the worker working on the previous block has finished
> > >> scanning the EOLs, and set the starting position on the next block,
> > >> putting it in READY state. That's not very different from your patch,
> > >> where you had a similar point of serialization because the leader
> > >> scanned the EOLs,
> > >
> > > But in the design (single producer multiple consumer) used by the
> > > patch the worker doesn't need to wait till the complete block is
> > > processed, it can start processing the lines already found. This will
> > > also allow workers to start much earlier to process the data as it
> > > doesn't need to wait for all the offsets corresponding to 64K block
> > > ready. However, in the design where each worker is processing the 64K
> > > block, it can lead to much longer waits. I think this will impact the
> > > Copy STDIN case more where in most cases (200-300 bytes tuples) we
> > > receive line-by-line from client and find the line-endings by leader.
> > > If the leader doesn't find the line-endings the workers need to wait
> > > till the leader fill the entire 64K chunk, OTOH, with current approach
> > > the worker can start as soon as leader is able to populate some
> > > minimum number of line-endings
> >
> > You can use a smaller block size.
> >
>
> Sure, but the same problem can happen if the last line in that block
> is too long and we need to peek into the next block. And then there
> could be cases where a single line could be greater than 64K.
>
> > However, the point of parallel copy is
> > to maximize bandwidth.
> >
>
> Okay, but this first-phase (finding the line boundaries) can anyway be
> not done in parallel and we have seen in some of the initial
> benchmarking that this initial phase is a small part of work
> especially when the table has indexes, constraints, etc. So, I think
> it won't matter much if this splitting is done in a single process or
> multiple processes.
>

I wrote a patch to compare the performance of the current
implementation (the leader identifying the line boundaries) with a
design where the workers identify the line boundaries. The results are
given below. Each tuple reads as (parallel copy time in seconds with
the leader identifying the line boundaries, parallel copy time in
seconds with the workers identifying the line boundaries, number of
workers).

Use case 1 - 10million rows, 5.2GB data,3 indexes on integer columns:
(211.206, 632.583, 1), (165.402, 360.152, 2), (137.608, 219.623, 4),
(128.003, 206.851, 8), (114.518, 177.790, 16), (109.257, 170.058, 20),
(102.050, 158.376, 30)

Use case 2 - 10million rows, 5.2GB data,2 indexes on integer columns,
1 index on text column, csv file:
(1212.356, 1602.118, 1), (707.191, 849.105, 2), (369.620, 441.068, 4),
(221.359, 252.775, 8), (167.152, 180.207, 16), (168.804, 181.986, 20),
(172.320, 194.875, 30)

Use case 3 - 10million rows, 5.2GB data without index:
(96.317, 437.453, 1), (70.730, 240.517, 2), (64.436, 197.604, 4),
(67.186, 175.630, 8), (76.561, 156.015, 16), (81.025, 150.687, 20),
(86.578, 148.481, 30)

Use case 4 - 1 records, 9.6GB, toast data:
(147.076, 276.323, 1), (101.610, 141.893, 2), (100.703, 134.096, 4),
(112.583, 134.765, 8), (101.898, 135.789, 16), (109.258, 135.625, 20),
(109.219, 136.144, 30)

Attached is the patch that was used for this comparison. The patch is
written on top of the parallel copy patch.
The design that Amit, Andres, and I voted for, i.e. the leader
identifying the line boundaries and sharing them in shared memory,
performs better.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From dd9b6be19573b5391d01373b53e64a5c1dc305fd Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Mon, 28 Dec 2020 15:00:48 +0530
Subject: [PATCH v12 7/7] Parallel copy based on workers identifying line
 boundary.

Parallel copy based on workers identifying line boundary.
---
 src/backend/commands/copyfromparse.c |  93 +++
 src/backend/commands/copyparallel.c  | 441 +--
 src/include/commands/copyfrom_internal.h |  38 ++-
 src/test/regress/expected/copy2.out  |   2 +-
 4 files changed, 318 insertions(+), 256 deletions(-)

diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index a767bae..1d79da9 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -82,13 +82,15 @@ if (1) \
 { \
 	if (raw_buf_ptr > cstate->raw_buf_index) \
 	{ 

RE: [HACKERS] logical decoding of two-phase transactions

2020-12-28 Thread osumi.takami...@fujitsu.com
Hi, Amit-San


On Thursday, Dec 24, 2020 2:35 PM Amit Kapila  wrote:
> On Wed, Dec 23, 2020 at 3:08 PM Ajin Cherian  wrote:
> >
> > > Can you please update the patch for the points we agreed upon?
> >
> > Changed and attached.
> >
> 
> Thanks, I have looked at these patches again and it seems patches 0001 to
> 0004 are in good shape, and among those
> v33-0001-Extend-the-output-plugin-API-to-allow-decoding-o is good to go.
> So, I am planning to push the first patch (0001*) in next week sometime
> unless you or someone else has any comments on it.
I agree with this from the perspective of good code quality for memory management.

I reviewed the v33 patchset using valgrind and
concluded that the v33 patchset has no problems in terms of memory
management.
This also applies to v34 because the differences between the two versions are
really small.

I compared the valgrind log files between master and master with the v33
patchset applied.
I checked both the contrib/test_decoding and src/test/subscription tests,
of course using valgrind.

The first reason why I reached this conclusion is that
I did not find any memcheck errors described in the log files.
I grepped for the error message expressions listed in the valgrind
documentation [1],
but there were no matches.

Secondly, I surveyed the function stacks for valgrind's three types of memory
leaks, "definitely lost", "indirectly lost" and "possibly lost", and
it turned out that the patchset didn't add any new causes of memory leaks.

[1] - https://valgrind.org/docs/manual/mc-manual.html#mc-manual.errormsgs

Best Regards,
Takamichi Osumi


Re: [PATCH] TAP test showing that pg_replication_slot_advance() works on standby

2020-12-28 Thread Masahiko Sawada
Hi Craig,

On Tue, Nov 10, 2020 at 2:32 PM Craig Ringer
 wrote:
>
> Hi all
>
> The attached patch adds a test to the TAP logical decoding suite to show that 
> pg_replication_slot_advance() works on a server in standby mode.
>
> I didn't add a full demonstration of how to do failover with logical slots 
> because we're still missing a way to "sync" a logical slot from a primary to 
> a standby, or a way to directly create such a slot safely on a standby in a 
> way that enforces a safe catalog_xmin etc.
>
> You can't replay from the slot unless the server is promoted, so I don't test 
> that.
>
> I'm not sure if anyone's going to find it worth committing, but it's here so 
> searchers can find it at least.

You sent in your patch,
0001-Extend-TAP-test-for-pg_replication_slot_advance-to-c_patch.txt to
pgsql-hackers on Nov10, but you did not post it to the next
CommitFest[1].  If this was intentional, then you need to take no
action.  However, if you want your patch to be reviewed as part of the
upcoming CommitFest, then you need to add it yourself before
2021-01-01 AOE[2]. Thanks for your contributions.

Regards,

[1] https://commitfest.postgresql.org/31/
[2] https://en.wikipedia.org/wiki/Anywhere_on_Earth

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/
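
For readers not familiar with the function under test, here is a minimal
sketch of calling it, shown on a primary with a physical slot (the slot name
is made up; the TAP test discussed above makes the corresponding call against
a server running in standby mode):

-- Create a physical slot with an immediately reserved LSN, then advance it.
SELECT pg_create_physical_replication_slot('demo_slot', true);
SELECT * FROM pg_replication_slot_advance('demo_slot', pg_current_wal_lsn());

-- Clean up.
SELECT pg_drop_replication_slot('demo_slot');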




Re: Parallel Full Hash Join

2020-12-28 Thread Masahiko Sawada
Hi Melanie,

On Thu, Nov 5, 2020 at 7:34 AM Melanie Plageman
 wrote:
>
> I've attached a patch with the corrections I mentioned upthread.
> I've gone ahead and run pgindent, though, I can't say that I'm very
> happy with the result.
>
> I'm still not quite happy with the name
> BarrierArriveAndDetachExceptLast(). It's so literal. As you said, there
> probably isn't a nice name for this concept, since it is a function with
> the purpose of terminating parallelism.

You sent in your patch, v3-0001-Support-Parallel-FOJ-and-ROJ.patch to
pgsql-hackers on Nov 5, but you did not post it to the next
CommitFest[1].  If this was intentional, then you need to take no
action.  However, if you want your patch to be reviewed as part of the
upcoming CommitFest, then you need to add it yourself before
2021-01-01 AOE[2]. Also, rebasing to the current HEAD may be required
as almost two months passed since when this patch is submitted. Thanks
for your contributions.

Regards,

[1] https://commitfest.postgresql.org/31/
[2] https://en.wikipedia.org/wiki/Anywhere_on_Earth

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Lazy JIT IR code generation to increase JIT speed with partitions

2020-12-28 Thread Luc Vlaming

Hi,

I would like to propose a small patch to the JIT machinery which makes 
the IR code generation lazy. The reason for postponing the generation of 
the IR code is that with partitions we get an explosion in the number of 
JIT functions generated as many child tables are involved, each with 
their own JITted functions, especially when e.g. partition-aware 
joins/aggregates are enabled. However, only a fraction of those 
functions is actually executed, because the Parallel Append node 
distributes the workers among the nodes. With the attached patch the 
generation is lazy, so this is no longer a problem.


In benchmarks with TPC-H and TPC-DS-like queries using hash partitioning, 
I have seen query runtimes increase by 20+ seconds even on the simpler 
queries. I also created a small benchmark to reproduce the case easily 
(see the attached sql file):


without patch, using 7 launched workers:
- without jit: ~220ms
- with jit: ~1880ms
without patch, using 50 launched workers:
- without jit: ~280ms
- with jit: ~3400ms

with patch, using 7 launched workers:
- without jit: ~220ms
- with jit: ~590ms

with patch, using 50 launched workers:
- without jit: ~280ms
- with jit: ~530ms

Thoughts?
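
Since the attached jit_partitions.sql is not reproduced in this message, here
is a rough sketch of the kind of setup being described (table names, partition
counts and row counts are made up; the point is only that a hash-partitioned
table scanned under a Parallel Append causes JIT functions to be generated per
child plan):

CREATE TABLE t (a int, b int) PARTITION BY HASH (a);
DO $$
BEGIN
  FOR i IN 0..63 LOOP
    EXECUTE format(
      'CREATE TABLE t_%s PARTITION OF t FOR VALUES WITH (MODULUS 64, REMAINDER %s)',
      i, i);
  END LOOP;
END $$;
INSERT INTO t SELECT i, i % 100 FROM generate_series(1, 1000000) i;
ANALYZE t;

SET max_parallel_workers_per_gather = 7;
SET jit = on;
SET jit_above_cost = 0;  -- force JIT so the IR generation overhead is visible
EXPLAIN (ANALYZE) SELECT b, count(*) FROM t GROUP BY b;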

With Regards,
Luc Vlaming
Swarm64


jit_partitions.sql
Description: application/sql
>From 5dd5df7e29bb3262b8f7f02c4fd3896bb34c3133 Mon Sep 17 00:00:00 2001
From: Luc Vlaming 
Date: Mon, 28 Dec 2020 09:01:32 +0100
Subject: [PATCH v1] generate JIT IR code lazily

---
 src/backend/jit/llvm/llvmjit_expr.c | 98 +
 1 file changed, 59 insertions(+), 39 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 3aa08a9743..2ac79b7571 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -52,6 +52,7 @@ typedef struct CompiledExprState
 } CompiledExprState;
 
 
+static Datum ExecCompileExpr(ExprState *state, ExprContext *econtext, bool *isNull);
 static Datum ExecRunCompiledExpr(ExprState *state, ExprContext *econtext, bool *isNull);
 
 static LLVMValueRef BuildV1Call(LLVMJitContext *context, LLVMBuilderRef b,
@@ -70,18 +71,66 @@ static LLVMValueRef create_LifetimeEnd(LLVMModuleRef mod);
 	   lengthof(((LLVMValueRef[]){__VA_ARGS__})), \
 	   ((LLVMValueRef[]){__VA_ARGS__}))
 
-
 /*
- * JIT compile expression.
+ * Prepare the JIT compile expression.
  */
 bool
 llvm_compile_expr(ExprState *state)
 {
 	PlanState  *parent = state->parent;
-	char	   *funcname;
-
 	LLVMJitContext *context = NULL;
 
+
+	/*
+	 * Right now we don't support compiling expressions without a parent, as
+	 * we need access to the EState.
+	 */
+	Assert(parent);
+
+	llvm_enter_fatal_on_oom();
+
+	/* get or create JIT context */
+	if (parent->state->es_jit)
+		context = (LLVMJitContext *) parent->state->es_jit;
+	else
+	{
+		context = llvm_create_context(parent->state->es_jit_flags);
+		parent->state->es_jit = &context->base;
+	}
+
+	/*
+	 * Don't immediately emit or actually generate the function; instead,
+	 * do so the first time the expression is actually evaluated.  That
+	 * allows emitting a lot of functions together, avoiding a lot of
+	 * repeated llvm and memory remapping overhead.  It also helps with
+	 * not compiling functions that will never be evaluated, as can be
+	 * the case if e.g. a parallel append node is distributing workers
+	 * between its child nodes.
+	 */
+	{
+
+		CompiledExprState *cstate = palloc0(sizeof(CompiledExprState));
+
+		cstate->context = context;
+
+		state->evalfunc = ExecCompileExpr;
+		state->evalfunc_private = cstate;
+	}
+
+	llvm_leave_fatal_on_oom();
+
+	return true;
+}
+
+/*
+ * JIT compile expression.
+ */
+static Datum
+ExecCompileExpr(ExprState *state, ExprContext *econtext, bool *isNull)
+{
+	CompiledExprState *cstate = state->evalfunc_private;
+	LLVMJitContext *context = cstate->context;
+
 	LLVMBuilderRef b;
 	LLVMModuleRef mod;
 	LLVMValueRef eval_fn;
@@ -125,31 +174,16 @@ llvm_compile_expr(ExprState *state)
 
 	llvm_enter_fatal_on_oom();
 
-	/*
-	 * Right now we don't support compiling expressions without a parent, as
-	 * we need access to the EState.
-	 */
-	Assert(parent);
-
-	/* get or create JIT context */
-	if (parent->state->es_jit)
-		context = (LLVMJitContext *) parent->state->es_jit;
-	else
-	{
-		context = llvm_create_context(parent->state->es_jit_flags);
-		parent->state->es_jit = &context->base;
-	}
-
 	INSTR_TIME_SET_CURRENT(starttime);
 
 	mod = llvm_mutable_module(context);
 
 	b = LLVMCreateBuilder();
 
-	funcname = llvm_expand_funcname(context, "evalexpr");
+	cstate->funcname = llvm_expand_funcname(context, "evalexpr");
 
 	/* create function */
-	eval_fn = LLVMAddFunction(mod, funcname,
+	eval_fn = LLVMAddFunction(mod, cstate->funcname,
 			  llvm_pg_var_func_type("TypeExprStateEvalFunc"));
 	LLVMSetLinkage(eval_fn, LLVMExternalLinkage);
 	LLVMSetVisibility(eval_fn, LLVMDefaultVisibility);
@@ -2350,30 +2384,16 @@ llvm_compile_expr(ExprState *state)
 
 	LLVMDisposeBuilder(b);
 

Re: Add table AM 'tid_visible'

2020-12-28 Thread Masahiko Sawada
On Tue, Nov 3, 2020 at 5:23 PM Jinbao Chen  wrote:
>
> Hi Andres,
>
> > Yea, it's something we should improve. Have you checked if this has
> > performance impact for heap? Should we also consider planning costs?
>
> Since the visibility map is very small, all pages of the visibility map
> will usually reside in memory, so the IO cost of accessing it can be
> ignored. We should add the CPU cost of accessing the visibility map.
> That cost is usually smaller than cpu_tuple_cost, but Postgres does not
> have a macro to describe such a small cost. Should we add one?
>
> > As far as I can tell you have not actually attached the patch.
>
> Ah, I forgot to upload the patch. Attaching it below.
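 
As a purely hypothetical sketch of how such a constant might sit next to
cpu_tuple_cost: the vm-check name and its value below are invented, and
nothing like them exists in PostgreSQL today; only DEFAULT_CPU_TUPLE_COST
and its 0.01 default are real.

#include <stdio.h>

#define DEFAULT_CPU_TUPLE_COST		0.01	/* real cpu_tuple_cost default */
#define DEFAULT_CPU_VM_CHECK_COST	0.0025	/* invented for illustration */

int
main(void)
{
	double		ntuples = 1000000.0;

	/* per-tuple CPU charge the planner already applies to a scan ... */
	double		tuple_cost = DEFAULT_CPU_TUPLE_COST * ntuples;

	/* ... plus a much smaller charge for one visibility-map probe each */
	double		vm_cost = DEFAULT_CPU_VM_CHECK_COST * ntuples;

	printf("tuple cost %.0f, vm-check cost %.0f (%.0f%% extra)\n",
		   tuple_cost, vm_cost, 100.0 * vm_cost / tuple_cost);
	return 0;
}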

You sent in your patch, tid_visible-1.patch to pgsql-hackers on Nov 3,
but you did not post it to the next CommitFest[1].  If this was
intentional, then you need to take no action.  However, if you want
your patch to be reviewed as part of the upcoming CommitFest, then you
need to add it yourself before 2021-01-01 AoE[2]. Thanks for your
contributions.

Regards,

[1] https://commitfest.postgresql.org/31/
[2] https://en.wikipedia.org/wiki/Anywhere_on_Earth

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-28 Thread Tang, Haiying
Hi Amit,

>I think one table with a varying amount of data is sufficient for the vacuum 
>test. 
>I think with more number of tables there is a greater chance of variation. 
>We have previously used multiple tables in one of the tests because of 
>the Truncate operation (which uses DropRelFileNodesAllBuffers that 
>takes multiple relations as input) and that is not true for Vacuum operation 
>which I suppose you are testing here.

I retested the performance on a single table several times; the table
size varies with BUF_DROP_FULL_SCAN_THRESHOLD for the different
shared_buffers settings.
When shared_buffers is below 20G, there were no significant differences
between master (HEAD) and patched.
According to the results compared between 20G and 100G, we get an
optimization up to NBuffers/128, but there is no benefit at NBuffers/256.
I tested many times and mostly got the same results; I don't know why.
But if I use 5 tables (each sized at BUF_DROP_FULL_SCAN_THRESHOLD), then
we do get a benefit at NBuffers/256.

Here are my test results for the single table. If you have any questions
or suggestions, kindly let me know.

%reg = (patched - master(HEAD)) / patched

Optimized percentage:
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
-----------------------------------------------------------------------------------------------------------------------
128M                      0%            0%           -1%           0%           1%           0%          0%          0%
1G                       -1%            0%           -1%           0%           0%           0%          0%          0%
20G                       0%            0%          -33%           0%           0%         -13%          0%          0%
100G                    -32%            0%          -49%           0%          10%          30%          0%          6%

Result details (unit: seconds):

patched (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
-----------------------------------------------------------------------------------------------------------------------
128M                   0.107         0.107         0.107        0.107        0.108        0.107       0.108       0.208
1G                     0.107         0.107         0.107        0.108        0.208        0.208       0.308       0.409
20G                    0.208         0.308         0.308        0.409        0.609        0.808       1.511       2.713
100G                   0.309         0.408         0.609        1.010        2.011        5.017       6.620      13.931

master(HEAD) (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
-----------------------------------------------------------------------------------------------------------------------
128M                   0.107         0.107         0.108        0.107        0.107        0.107       0.108       0.208
1G                     0.108         0.107         0.108        0.108        0.208        0.207       0.308       0.409
20G                    0.208         0.309         0.409        0.409        0.609        0.910       1.511       2.712
100G                   0.408         0.408         0.909        1.010        1.811        3.515       6.619      13.032

Regards
Tang