Re: A wrong index choose issue because of inaccurate statistics
On Fri, 5 Jun 2020 at 8:19, David Rowley wrote:
> I'd more recently wondered if we shouldn't have some sort of "risk"
> factor to the cost model. I really don't have ideas on how exactly we
> would calculate the risk factor in all cases, but in this case, say
> the risk factor always starts out as 1. We could multiply that risk
> factor by some >1 const each time we find another index filter qual.
> add_path() can prefer lower risk paths when the costs are similar.
> Unsure what the exact add_path logic would be. Perhaps a GUC would
> need to assist with the decision there. Likewise, with
> NestedLoopPaths which have a large number of join quals, the risk
> factor could go up a bit with those so that we take a stronger
> consideration for hash or merge joins instead.

I thought about these ideas too. And I am not alone.

https://hal.archives-ouvertes.fr/hal-01316823/document

Regards

Pavel

> Anyway, it's pretty much a large research subject which would take a
> lot of work to iron out even just the design. It's likely not a
> perfect idea, but I think it has a bit more merit than trying to
> introduce lies to the cost model to account for a single case where
> there is a problem.
>
> David
Re: Removal of currtid()/currtid2() and some table AM cleanup
On Wed, Jun 03, 2020 at 10:10:21PM +0900, Inoue, Hiroshi wrote:
> On 2020/06/03 11:14, Michael Paquier wrote:
>> I have been looking at the ODBC driver and the need for currtid() as
>> well as currtid2(), and as mentioned already in [1], matching with my
>> lookup of things, these are actually not needed by the driver as long
>> as we connect to a server newer than 8.2 able to support RETURNING.
>
> Though currtid2() is necessary even for servers which support RETURNING,
> I don't object to remove it.

In which cases is it getting used then? From what I can see there is zero coverage for that part in the tests. And based on a rough read of the code, this would get called with LATEST_TUPLE_LOAD being set, where there is some kind of bulk deletion involved. Couldn't that be a problem?
--
Michael
Re: A wrong index choose issue because of inaccurate statistics
On Mon, 1 Jun 2020 at 01:24, Andy Fan wrote:
> The one-line fix describes the exact idea in my mind:
>
> +++ b/src/backend/optimizer/path/costsize.c
> @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
>
> 	cpu_run_cost += cpu_per_tuple * tuples_fetched;
>
> +	/*
> +	 * To make the planner more robust to handle some inaccurate statistics
> +	 * issue, we will add an extra cost to qpquals so that the fewer qpquals
> +	 * the lower cost it has.
> +	 */
> +	cpu_run_cost += 0.01 * list_length(qpquals);
> +
>
> This change does fix the issue above, but will it make some other cases
> worse? My answer is no based on my current knowledge, and this is the most
> important place I want to get advised on. The main impact I can figure out
> is: it not only changes the cost difference between (a, b) and (a, c), it
> also changes the cost difference between an index scan on (a, c) and a seq
> scan. However, the cost difference between index scan and seq scan is huge
> in practice, so I don't think this impact is harmful.

Didn't that idea already get shot down in the final paragraph on [1]?

I understand that you wish to increase the cost by some seemingly innocent constant to fix your problem case. Here are my thoughts about that: Telling lies is not really that easy to pull off. Bad liars think it's easy and good ones know it's hard. The problem is that the lies can start small, but then at some point the future you must fashion some more lies to account for your initial lie. Rinse and repeat that a few times and before you know it, your course is set well away from the point of truth. I feel the part about "rinse and repeat" applies reasonably well to how join costing works. The lie is likely to be amplified as the join level gets deeper.

I think you need to think of a more generic solution and propose that instead. There are plenty of other quirks in the planner that can cause suffering due to inaccurate or non-existing statistics.
For example, due to how we multiply individual selectivity estimates, having a few correlated columns in a join condition can cause the number of join rows to be underestimated. Subsequent joins can then end up making bad choices on which join operator to use based on those inaccurate row estimates. There's also a problem with WHERE ORDER BY col LIMIT n; sometimes choosing an index that provides pre-sorted input to the ORDER BY but cannot be used as an indexable condition. We don't record any stats to make better choices there; maybe we should, but either way, we're taking a big risk there, as all the rows matching might be right at the end of the index and we might need to scan the entire thing before hitting the LIMIT. For now, we just assume a completely even distribution of rows, i.e. if there are 50 rows estimated in the path and the limit is for 5 rows, then we'll assume we need to read 10% of those before finding all the ones we need. In reality, everything matching might be 95% through the index and we could end up reading 100% of rows. That particular problem is not just caused by the uneven distribution of rows in the index, but also by selectivity underestimation.

I'd more recently wondered if we shouldn't have some sort of "risk" factor to the cost model. I really don't have ideas on how exactly we would calculate the risk factor in all cases, but in this case, say the risk factor always starts out as 1. We could multiply that risk factor by some >1 const each time we find another index filter qual. add_path() can prefer lower risk paths when the costs are similar. Unsure what the exact add_path logic would be. Perhaps a GUC would need to assist with the decision there. Likewise, with NestedLoopPaths which have a large number of join quals, the risk factor could go up a bit with those so that we take a stronger consideration for hash or merge joins instead.
Anyway, it's pretty much a large research subject which would take a lot of work to iron out even just the design. It's likely not a perfect idea, but I think it has a bit more merit than trying to introduce lies to the cost model to account for a single case where there is a problem.

David

[1] https://www.postgresql.org/message-id/20200529001602.eu7vuiouuuiclpgb%40development
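David's risk-factor sketch can be put into toy code. This is a standalone illustration only, not planner code: the struct, RISK_PER_FILTER_QUAL, and COST_FUZZ are invented names, and the real add_path() comparison logic is far more involved than this tie-break.

```c
#include <assert.h>
#include <stdbool.h>

/* hypothetical per-qual risk multiplier and cost fuzz factor */
#define RISK_PER_FILTER_QUAL 1.25
#define COST_FUZZ 1.01

typedef struct SketchPath
{
	double		total_cost;
	double		risk;			/* starts at 1.0, grows with risky choices */
} SketchPath;

/* bump the risk factor once per index filter qual, as in the proposal */
static void
charge_filter_quals(SketchPath *p, int n_filter_quals)
{
	for (int i = 0; i < n_filter_quals; i++)
		p->risk *= RISK_PER_FILTER_QUAL;
}

/*
 * add_path()-style tie-break: when costs are within the fuzz factor,
 * prefer the lower-risk path; otherwise the plain cost comparison wins.
 */
static bool
path_is_better(const SketchPath *a, const SketchPath *b)
{
	if (a->total_cost < b->total_cost * COST_FUZZ &&
		b->total_cost < a->total_cost * COST_FUZZ)
		return a->risk < b->risk;
	return a->total_cost < b->total_cost;
}
```

With this shape, two near-identical index paths that differ only in how many quals became filters are separated by risk rather than by a lie baked into the cost itself.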
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, May 29, 2020 at 8:31 PM Dilip Kumar wrote:
> Apart from this one more fix in 0005, basically, CheckLiveXid was
> never reset, so I have fixed that as well.

I have made a number of modifications in the 0001 patch and attached is the result. I have changed/added comments, done some cosmetic cleanup, and ran pgindent. The most notable change is to remove the below code change:

DecodeXactOp()
{
..
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
	 * when the snapshot is being built: it is possible to get later records
	 * that require subxids to be properly assigned.
	 */
	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(XLogRecGetTopXid(r)))
..
}

I have not only removed the change done by the patch but the check related to XLOG_XACT_ASSIGNMENT as well. That check had been added by commit bac2fae05c to ensure that we process XLOG_XACT_ASSIGNMENT even if the snapshot state is not SNAPBUILD_FULL_SNAPSHOT. Now, with this patch, that is not required because we are making the subtransaction and top-level transaction association much earlier than this. I have verified that it doesn't reopen the bug by running the test provided in the original report [1].

Let me know what you think of the changes. If you find them okay, then feel free to include them in the next patch-set.

[1] - https://www.postgresql.org/message-id/CAONYFtOv%2BEr1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

v27-0001-Immediately-WAL-log-subtransaction-and-top-level.patch
Re: Removal of currtid()/currtid2() and some table AM cleanup
On Thu, Jun 04, 2020 at 10:59:05AM -0700, Andres Freund wrote:
> What's the point of that change? I think the differentiation between
> heapam_handler.c and heapam.c could be clearer, but if anything, I'd
> argue that heap_get_latest_tid is sufficiently low-level that it'd
> belong in heapam.c.

Well, heap_get_latest_tid() is only called in heapam_handler.c if anything, as it is not used elsewhere, and we do not publish it. And IMO we should try to encourage using table_get_latest_tid() instead if some plugins need that. Anyway, if you are opposed to this change, I won't push hard for it either.
--
Michael
Re: BufFileRead() error signalling
On Thu, May 28, 2020 at 7:10 PM Michael Paquier wrote:
> On Wed, May 27, 2020 at 11:59:59AM -0400, Alvaro Herrera wrote:
> > In the discussion that led to 811b6e36a9e2 I did suggest to use "read
> > only M of N" instead, but there wasn't enough discussion on that fine
> > point so we settled on what you now call prevalent without a lot of
> > support specifically on that. I guess it was enough of an improvement
> > over what was there. But like Robert, I too prefer the wording that
> > includes "only" and "bytes" over the wording that doesn't.
> >
> > I'll let it be known that from a translator's point of view, it's a
> > ten-seconds job to update a fuzzy string from not including "only" and
> > "bytes" to one that does. So let's not make that an argument for not
> > changing.
>
> Using "only" would be fine by me, though I tend to prefer the existing
> one. Now I think that we should avoid "bytes" to not have to worry
> about pluralization of error messages. This has been a concern in the
> past (see e5d11b9 and the likes).

Sorry for the delay in producing a new patch. Here's one that tries to take into account the various feedback in this thread:

Ibrar said:
> You are checking file->dirty twice, first before calling the function
> and within the function too. Same for the Assert.

Fixed.

Robert, Melanie, Michael and Alvaro put forward various arguments about the form of the "short read" and block seek error messages. While removing bogus cases of "%m", I changed them all to say "read only %zu of %zu bytes" in the same places. I removed the bogus "%m" from BufFileSeek() and BufFileSeekBlock() callers (its call to BufFileFlush() already throws). I added block numbers to the error messages about failure to seek by block.
Following existing practice, I made write failure assume ENOSPC if errno is 0, so that %m would make sense (I am not sure what platform reports out-of-space that way, but there are certainly a lot of copies of that code in the tree so it seemed to be missing here). I didn't change BufFileWrite() to be void, to be friendly to existing callers outside the tree (if there are any), though I removed all the code that checks the return code. We can make it void later.

For the future: it feels a bit like we're missing a one-line way to say "read this many bytes and error out if you run out".

From ab6f90f5e5f3b6e70de28a4ee734e732c2a8c30b Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Sat, 30 Nov 2019 12:44:47 +1300
Subject: [PATCH] Redesign error handling in buffile.c.

Convert error handling in buffile.c functions to raise errors rather
than expecting callers to check the return value, and improve the
error messages.

This fixes various cases that forgot to check for errors, or couldn't
check for errors because EOF was indistinguishable from error, or
checked for errors unless they fell on certain boundaries which they
assumed to be EOF. This had been identified by code inspection, but
was later identified as the probable cause of a crash report involving
a corrupted hash join spill file when disk space ran out.

Back-patch to all releases. For now, BufFileWrite() still returns a
redundant value so as not to change the function prototype for the
benefit of any extensions that might be using it.
Reported-by: Amit Khandekar
Reported-by: Alvaro Herrera
Reviewed-by: Melanie Plageman
Reviewed-by: Alvaro Herrera
Reviewed-by: Robert Haas
Reviewed-by: Ibrar Ahmed
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CA%2BhUKGJE04G%3D8TLK0DLypT_27D9dR8F1RQgNp0jK6qR0tZGWOw%40mail.gmail.com
---
 src/backend/access/gist/gistbuildbuffers.c | 23 -
 src/backend/executor/nodeHashjoin.c        | 24 --
 src/backend/replication/backup_manifest.c  |  2 +-
 src/backend/storage/file/buffile.c         | 51 +++-
 src/backend/utils/sort/logtape.c           | 18 ---
 src/backend/utils/sort/sharedtuplestore.c  | 21
 src/backend/utils/sort/tuplestore.c        | 56 ++
 7 files changed, 91 insertions(+), 104 deletions(-)

diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 4b562d8d3f..7ff513a5fe 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -757,26 +757,19 @@ gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
 static void
 ReadTempFileBlock(BufFile *file, long blknum, void *ptr)
 {
+	size_t		nread;
+
 	if (BufFileSeekBlock(file, blknum) != 0)
-		elog(ERROR, "could not seek temporary file: %m");
-	if (BufFileRead(file, ptr, BLCKSZ) != BLCKSZ)
-		elog(ERROR, "could not read temporary file: %m");
+		elog(ERROR, "could not seek block %ld in temporary file", blknum);
+	if ((nread = BufFileRead(file, ptr, BLCKSZ)) != BLCKSZ)
+		elog(ERROR, "could not read temporary file: read only %zu of %zu bytes",
+			 nread, (size_t) BLCKSZ);
 }

 static void
 WriteTempFileBlock(BufFile *file, long blknum, voi
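The "read this many bytes and error out if you run out" helper Thomas wishes for could look roughly like this outside the tree. This is a hypothetical sketch over plain stdio, not buffile.c code; read_exact and write_errno_for_short_write are invented names that merely mirror the two patterns the patch applies (report how short the read was, and assume ENOSPC when a short write leaves errno at 0).

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper: read exactly "len" bytes or fail, reporting how
 * much was actually read. Returns 0 on success, -1 on a short read,
 * with *nread telling the caller how far it got for the error message.
 */
static int
read_exact(FILE *f, void *buf, size_t len, size_t *nread)
{
	*nread = fread(buf, 1, len, f);
	return (*nread == len) ? 0 : -1;
}

/*
 * Write-side convention from the patch: if a short write left errno at
 * 0, assume the disk filled up so that %m reports something sensible.
 */
static int
write_errno_for_short_write(void)
{
	return (errno == 0) ? ENOSPC : errno;
}
```

A caller would then do the check-and-ereport dance once, in one place, instead of repeating the "read only %zu of %zu bytes" comparison at every call site.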
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Fri, Jun 5, 2020 at 10:57 AM David Rowley wrote:
> In standard_qp_callback(), what we'll do with uniquekeys is pretty
> much what we already do with pathkeys there. Basically pathkeys are
> set there to have the planner attempt to produce a plan that satisfies
> those pathkeys. Notice at the end of standard_qp_callback() we set the
> pathkeys according to the first upper planner operation that'll need
> to make use of those pathkeys. e.g. if there's a GROUP BY and a
> DISTINCT in the query, then use the pathkeys for GROUP BY, since that
> must occur before DISTINCT.

Thanks for your explanation. It looks like I understand now based on your comments.
Take root->group_pathkeys for example: similar information is also available in root->parse->groupClause, but we use root->group_pathkeys with pathkeys_count_contained_in in many places, mainly because the content of the two differs sometimes, as in the pathkey_is_redundant case.

> Likely uniquekeys will want to follow the same rules there for the
> operations that can make use of paths with uniquekeys, which in this
> case, I believe, will be the same as the example I just mentioned for
> pathkeys, except we'll only be able to support GROUP BY without any
> aggregate functions.

All the places I want to use UniqueKey so far (like distinct, group by and others) have an input relation (RelOptInfo), and the UniqueKey information can be obtained there. At the same time, all the pathkeys in PlannerInfo are used for the upper planner, but UniqueKey may be used in the current planner sometimes, like reduce_semianti_joins / remove_useless_join, so I am not sure we must maintain uniquekeys in PlannerInfo.

--
Best Regards
Andy Fan
Re: A wrong index choose issue because of inaccurate statistics
> > > > Why will the (a, c) be chosen? If planner thinks a = x has only 1 row ..

I just did more research and found the above statement is not accurate; the root cause of this situation is that IndexSelectivity = 0. Even though I don't think we can fix anything here, since IndexSelectivity is calculated from statistics and we don't know whether it is 1% wrong or 10% wrong or more, just like what Tomas said. The way of fixing it is just to add a "small" extra cost for qpquals for index scans, which should be small enough to have no impact on other parts. I will discuss how small is small later with details; we can say it is just a GUC variable for now.

+++ b/src/backend/optimizer/path/costsize.c
@@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,

 	cpu_run_cost += cpu_per_tuple * tuples_fetched;

+	/*
+	 * To make the planner more robust to handle some inaccurate statistics
+	 * issue, we will add an extra cost to qpquals so that the fewer qpquals
+	 * the lower cost it has.
+	 */
+	cpu_run_cost += stat_stale_cost * list_length(qpquals);
+

If we want to reduce the impact of this change further, we can only add this if IndexSelectivity == 0.

How to set the value of stat_stale_cost? Since the minimum cost for a query should be one cpu_tuple_cost, which is 0.01 by default, adding 0.01 for each qpqual in an index scan should not make a big difference. However, sometimes we may set it to 0.13 if we consider that index->tree_height was estimated wrongly by 1 (cost is 50 * 0.0025 = 0.125). I don't know how that happens, but it does happen in production environments. At the same time, it is unlikely that index->tree_height is estimated wrongly by 2 or more. So basically we can set this value to 0 (totally disable this feature), 0.01 (should be OK for most cases), or 0.13 (a bit aggressive).

The wrong estimation of IndexSelectivity = 0 might be a common case, and if people have 2 related indexes like (A, B) and (A, C), we have a 50% chance of making a wrong decision, so I would say this case is worth the trouble. My current implementation doesn't look great, so any suggestion for further research is pretty welcome.

--
Best Regards
Andy Fan
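A toy calculation makes the intended effect of the proposed knob concrete: when IndexSelectivity is (wrongly) estimated as 0, both candidate indexes get the same base cost, and only the per-qpqual penalty separates them. The numbers and toy_index_cost are illustrative only, not costsize.c code.

```c
#include <assert.h>

/* proposed GUC-style knob from the mail: extra cost per index filter qual */
static double stat_stale_cost = 0.01;

/*
 * Toy version of the cost_index() change: base cost plus a small penalty
 * for each qual that had to become a filter (qpqual) rather than an
 * index condition.
 */
static double
toy_index_cost(double base_cost, int n_qpquals)
{
	return base_cost + stat_stale_cost * n_qpquals;
}
```

With both indexes costed identically at, say, 10.0, the index that turns one qual into a filter comes out 0.01 more expensive, which is enough to break the tie without being visible next to the index-vs-seqscan gap.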
Re: Atomic operations within spinlocks
Hi,

On 2020-06-04 15:13:29 -0400, Tom Lane wrote:
> Andres Freund writes:
> > On 2020-06-04 14:50:40 -0400, Tom Lane wrote:
> >> 2. The computed completePasses value would go backwards. I bet
> >> that wouldn't matter too much either, or at least we could teach
> >> BgBufferSync to cope. (I notice the comments therein suggest that
> >> it is already designed to cope with completePasses wrapping around,
> >> so maybe nothing needs to be done.)
>
> > If we're not concerned about that, then we can remove the
> > atomic-inside-spinlock, I think. The only reason for that right now is
> > to avoid assuming a wrong pass number.
>
> Hmm. That might be a less-invasive path to a solution. I can take
> a look, if you don't want to.

First, I think it would be problematic:

	/*
	 * Find out where the freelist clock sweep currently is, and how many
	 * buffer allocations have happened since our last call.
	 */
	strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
...
	/*
	 * Compute strategy_delta = how many buffers have been scanned by the
	 * clock sweep since last time. If first time through, assume none. Then
	 * see if we are still ahead of the clock sweep, and if so, how many
	 * buffers we could scan before we'd catch up with it and "lap" it. Note:
	 * weird-looking coding of xxx_passes comparisons are to avoid bogus
	 * behavior when the passes counts wrap around.
	 */
	if (saved_info_valid)
	{
		int32		passes_delta = strategy_passes - prev_strategy_passes;

		strategy_delta = strategy_buf_id - prev_strategy_buf_id;
		strategy_delta += (long) passes_delta * NBuffers;

		Assert(strategy_delta >= 0);

ISTM that if we can get an out-of-sync strategy_passes and strategy_buf_id, we'll end up with a pretty wrong strategy_delta. Which, I think, can cause bgwriter's position to be reset:

	else
	{
		/*
		 * We're behind, so skip forward to the strategy point and start
		 * cleaning from there.
		 */
#ifdef BGW_DEBUG
		elog(DEBUG2, "bgwriter behind: bgw %u-%u strategy %u-%u delta=%ld",
			 next_passes, next_to_clean,
			 strategy_passes, strategy_buf_id,
			 strategy_delta);
#endif
		next_to_clean = strategy_buf_id;
		next_passes = strategy_passes;
		bufs_to_lap = NBuffers;
	}

While I think that the whole logic in BgBufferSync doesn't make a whole lot of sense, it does seem to me this has a fair potential to make it worse. In a scenario with a decent cache hit ratio (leading to high usagecounts) and a not-that-large NBuffers, we can end up doing quite a few passes (as in many a second), so it might not be that hard to hit this.

I am not immediately coming up with a cheap solution that doesn't do the atomics-within-spinlock thing. The best I can come up with is using a 64bit atomic, with the upper 32 bits containing the number of passes, and the lower 32 bits containing the current buffer. The lower 32 bits / the buffer would be handled like it currently is, i.e. we "try" to keep it below NBuffers, so % is only used for the "cold" path. That'd just add a 64->32 bit cast in the hot path, which shouldn't be measurable. But it'd regress platforms without 64bit atomics substantially.

We could obviously also just rewrite the BgBufferSync() logic so it doesn't care about things like "lapping", but that's not an easy change.

So the best I can really suggest, unless we were to agree on atomics being OK inside spinlocks, is probably to just replace the spinlock with an lwlock. That'd perhaps cause a small slowdown for a few cases, but it'd make workloads that e.g. use the freelist a lot (e.g. when tables are dropped regularly) scale better.

Greetings,

Andres Freund
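The 64bit variant Andres sketches, with passes in the high 32 bits and the current buffer in the low 32, could be packed and decoded along these lines. This is a standalone sketch under that packing assumption; the function names and NBUFFERS value are invented, and a real version would sit behind pg_atomic_read_u64/compare-exchange.

```c
#include <assert.h>
#include <stdint.h>

#define NBUFFERS 16384			/* stand-in for NBuffers */

/* pack the clock-sweep position: pass count above, buffer id below */
static uint64_t
pack_sweep(uint32_t passes, uint32_t buf_id)
{
	return ((uint64_t) passes << 32) | buf_id;
}

/*
 * A single 64-bit read then yields a consistent (passes, buf_id) pair,
 * instead of reading completePasses and nextVictimBuffer separately.
 */
static void
unpack_sweep(uint64_t v, uint32_t *passes, uint32_t *buf_id)
{
	*passes = (uint32_t) (v >> 32);
	*buf_id = (uint32_t) v;		/* the hot path only needs this 64->32 cast */
}

/* advancing past NBUFFERS bumps the pass count on the "cold" path */
static uint64_t
advance_sweep(uint64_t v)
{
	uint32_t	passes,
				buf_id;

	unpack_sweep(v + 1, &passes, &buf_id);
	if (buf_id >= NBUFFERS)
		return pack_sweep(passes + 1, 0);
	return v + 1;
}
```

The point of the layout is exactly the one in the mail: the common case is a plain 64-bit increment, and the wrap handling (and any %) only runs once per NBUFFERS advances.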
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Fri, 5 Jun 2020 at 14:36, Andy Fan wrote:
> On Mon, May 25, 2020 at 2:34 AM David Rowley wrote:
>>
>> On Sun, 24 May 2020 at 04:14, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>> >
>> > > On Fri, May 22, 2020 at 08:40:17AM +1200, David Rowley wrote:
>> > > I imagine we'll set some required UniqueKeys during
>> > > standard_qp_callback()
>> >
>> > In standard_qp_callback, because pathkeys are computed at this point I
>> > guess?
>>
>> Yes. In particular, we set the pathkeys for DISTINCT clauses there.
>>
>
> Actually I have some issues to understand from here, then try to read index
> skip scan patch to fully understand what is the requirement, but that doesn't
> get it so far[1]. So what is the "UniqueKeys" in "UniqueKeys during
> standard_qp_callback()" and what is the "pathkeys" in "pathkeys are computed
> at this point" means? I tried to think it as root->distinct_pathkeys, however I
> didn't fully understand where root->distinct_pathkeys is used for as well.

In standard_qp_callback(), what we'll do with uniquekeys is pretty much what we already do with pathkeys there. Basically pathkeys are set there to have the planner attempt to produce a plan that satisfies those pathkeys. Notice at the end of standard_qp_callback() we set the pathkeys according to the first upper planner operation that'll need to make use of those pathkeys. e.g. if there's a GROUP BY and a DISTINCT in the query, then use the pathkeys for GROUP BY, since that must occur before DISTINCT. Likely uniquekeys will want to follow the same rules there for the operations that can make use of paths with uniquekeys, which in this case, I believe, will be the same as the example I just mentioned for pathkeys, except we'll only be able to support GROUP BY without any aggregate functions.

David

> [1] https://www.postgresql.org/message-id/CAKU4AWq%3DwWkAo-CDOQ5Ea6UwYvZCgb501w6iqU0rtnTT-zg6bQ%40mail.gmail.com
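The "first upper planner operation that needs sorted input wins" rule David describes can be sketched as a simple selection. This is purely illustrative; the real standard_qp_callback() works on pathkey lists derived from the parse tree, not strings.

```c
#include <assert.h>
#include <stddef.h>

typedef const char *Keys;		/* stand-in for a list of pathkeys */

/*
 * Pick the pathkeys for the first upper-planner step that needs sorted
 * input: GROUP BY runs before DISTINCT, which runs before ORDER BY,
 * so its sort order is the one worth asking scan/join paths to satisfy.
 */
static Keys
choose_query_pathkeys(Keys group_keys, Keys distinct_keys, Keys sort_keys)
{
	if (group_keys != NULL)
		return group_keys;
	if (distinct_keys != NULL)
		return distinct_keys;
	return sort_keys;
}
```

Per the mail, uniquekeys would presumably get an analogous "required uniquekeys" choice at the same spot, so that lower levels can aim to produce already-unique paths.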
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Mon, May 25, 2020 at 2:34 AM David Rowley wrote:
> On Sun, 24 May 2020 at 04:14, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> >
> > > On Fri, May 22, 2020 at 08:40:17AM +1200, David Rowley wrote:
> > > I imagine we'll set some required UniqueKeys during
> > > standard_qp_callback()
> >
> > In standard_qp_callback, because pathkeys are computed at this point I
> > guess?
>
> Yes. In particular, we set the pathkeys for DISTINCT clauses there.

Actually I have some trouble understanding things from here, so I tried to read the index skip scan patch to fully understand what the requirement is, but I haven't gotten it so far [1]. So what is the "UniqueKeys" in "UniqueKeys during standard_qp_callback()", and what does "pathkeys are computed at this point" mean? I tried to think of it as root->distinct_pathkeys; however, I didn't fully understand what root->distinct_pathkeys is used for either.

[1] https://www.postgresql.org/message-id/CAKU4AWq%3DwWkAo-CDOQ5Ea6UwYvZCgb501w6iqU0rtnTT-zg6bQ%40mail.gmail.com

--
Best Regards
Andy Fan
Re: Atomic operations within spinlocks
Hi,

On 2020-06-04 15:07:34 -0400, Tom Lane wrote:
> Andres Freund writes:
> > I'd still like to know which problem we're actually trying to solve
> > here. I don't understand the "error" issues you mentioned upthread.
>
> If you error out of getting the inner spinlock, the outer spinlock
> is stuck, permanently, because there is no mechanism for spinlock
> release during transaction abort. Admittedly it's not very likely
> for the inner acquisition to fail, but it's possible.

We PANIC on stuck spinlocks, so I don't think that's a problem.

> * We map all spinlocks onto a set of NUM_SPINLOCK_SEMAPHORES semaphores.
> * It's okay to map multiple spinlocks onto one semaphore because no process
> * should ever hold more than one at a time.
>
> You've falsified that argument ... and no, I don't want to upgrade
> the spinlock infrastructure enough to make this OK.

Well, theoretically we take care to avoid this problem. That's why we have a separate define for spinlocks and atomics:

/*
 * When we don't have native spinlocks, we use semaphores to simulate them.
 * Decreasing this value reduces consumption of OS resources; increasing it
 * may improve performance, but supplying a real spinlock implementation is
 * probably far better.
 */
#define NUM_SPINLOCK_SEMAPHORES		128

/*
 * When we have neither spinlocks nor atomic operations support we're
 * implementing atomic operations on top of spinlock on top of semaphores. To
 * be safe against atomic operations while holding a spinlock separate
 * semaphores have to be used.
 */
#define NUM_ATOMICS_SEMAPHORES		64

and

#ifndef HAVE_SPINLOCKS

	/*
	 * NB: If we're using semaphore based TAS emulation, be careful to use a
	 * separate set of semaphores. Otherwise we'd get in trouble if an atomic
	 * var would be manipulated while spinlock is held.
	 */
	s_init_lock_sema((slock_t *) &ptr->sema, true);
#else
	SpinLockInit((slock_t *) &ptr->sema);
#endif

But it looks like that code is currently buggy (and looks like it always has been), because we don't look at the nested argument when initializing the semaphore. So we currently allocate too many semaphores, without benefiting from them :(.

> We shouldn't ever be holding spinlocks long enough, or doing anything
> complicated enough inside them, for such an upgrade to have merit.

Well, I don't think atomic instructions are that complicated. And I think prohibiting atomics-within-spinlock adds a problematic restriction without much in the way of benefits: there are plenty of things where it's somewhat easy to make the fast path lock-free, but the slow path still needs a lock (e.g. around a freelist). And for those it's really useful to still be able to have a coherent update to an atomic variable, to synchronize with the fast path that doesn't take the spinlock.

Greetings,

Andres Freund
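The point of the nested flag can be illustrated with a toy model: the atomics emulation has to lock from a pool disjoint from the spinlock pool, or an atomic operation performed while holding a spinlock could block on the very semaphore that spinlock already occupies. This is a single-threaded sketch with int flags standing in for semaphores; all names and the address-hash mapping are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SPINLOCK_SEMAPHORES 128
#define NUM_ATOMICS_SEMAPHORES	64

/* toy stand-ins for the two semaphore pools (0 = free, 1 = held) */
static int	spin_sema[NUM_SPINLOCK_SEMAPHORES];
static int	atomic_sema[NUM_ATOMICS_SEMAPHORES];

/* map a lock's address into one of the pools; "nested" picks the pool */
static int *
sema_for(const void *lock, int nested)
{
	uintptr_t	h = (uintptr_t) lock >> 4;

	return nested ? &atomic_sema[h % NUM_ATOMICS_SEMAPHORES]
		: &spin_sema[h % NUM_SPINLOCK_SEMAPHORES];
}

/*
 * Emulated atomic fetch-add. Because it locks from the atomics pool, it
 * cannot self-deadlock against a spinlock (spin pool) already held by
 * the same backend, which is the property the "nested" flag buys.
 */
static int
emulated_fetch_add(int *var, int add)
{
	int		   *sema = sema_for(var, 1);
	int			old;

	assert(*sema == 0);			/* would deadlock here with a shared pool */
	*sema = 1;
	old = *var;
	*var += add;
	*sema = 0;
	return old;
}
```

The bug Andres points out is then easy to state in these terms: if initialization ignores nested and hands everything semaphores from one pool, the separation exists on paper only.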
Re: REINDEX CONCURRENTLY and indisreplident
On Thu, Jun 04, 2020 at 11:23:36AM +0900, Michael Paquier wrote:
> On Wed, Jun 03, 2020 at 12:40:38PM -0300, Euler Taveira wrote:
> > On Wed, 3 Jun 2020 at 03:54, Michael Paquier wrote:
> >> I have bumped into $subject, causing a replica identity index to
> >> be considered as dropped if running REINDEX CONCURRENTLY on it. This
> >> means that the old tuple information would get lost in this case, as
> >> a REPLICA IDENTITY USING INDEX without a dropped index is the same as
> >> NOTHING.
> >
> > LGTM. I tested in both versions (12, master) and it works accordingly.
>
> Thanks for the review. I'll try to get that fixed soon.

Applied this one, just in time before the branching:
https://www.postgresql.org/message-id/1931934b-09dc-e93e-fab9-78c5bc727...@postgresql.org
--
Michael
Re: significant slowdown of HashAggregate between 9.6 and 10
Hi, On 2020-06-04 18:22:03 -0700, Jeff Davis wrote: > On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote: > > +/* minimum number of initial hash table buckets */ > > +#define HASHAGG_MIN_BUCKETS 256 > > > > > > I don't really see much explanation for that part in the commit, > > perhaps > > Jeff can chime in? > > I did this in response to a review comment (point #5): > > > https://www.postgresql.org/message-id/20200219191636.gvdywx32kwbix6kd@development > > Tomas suggested a min of 1024, and I thought I was being more > conservative choosing 256. Still too high, I guess? > I can lower it. What do you think is a reasonable minimum? I don't really see why there needs to be a minimum bigger than 1? Greetings, Andres Freund
Re: v13: Performance regression related to FORTIFY_SOURCE
Jeff Davis writes: > On Thu, 2020-06-04 at 16:35 -0400, Alvaro Herrera wrote: >> If it is something worth worrying about, let's discuss what's a good >> fix for it. > I did post a fix for it, but it's not a very clean fix. I'm slightly > inclined to proceed with that fix, but I was hoping someone else would > have a better suggestion. > How about if I wait another week, and if we still don't have a better > fix, I will commit this one. TBH, I don't think we should do this, at least not on the strength of the evidence you posted so far. It looks to me like you are micro-optimizing for one compiler on one platform. Moreover, you're basically trying to work around a compiler codegen bug that might not be there next year. I think what'd make more sense is to file this as a gcc bug ("why doesn't it remove the useless object size check?") and see what they say about that. If the answer is that this isn't a gcc bug for whatever reason, then we could think about whether we should work around it on the source-code level. Even then, I'd want more evidence than has been presented about this not causing a regression on other toolchains/CPUs. regards, tom lane
Re: significant slowdown of HashAggregate between 9.6 and 10
On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote: > +/* minimum number of initial hash table buckets */ > +#define HASHAGG_MIN_BUCKETS 256 > > > I don't really see much explanation for that part in the commit, > perhaps > Jeff can chime in? I did this in response to a review comment (point #5): https://www.postgresql.org/message-id/20200219191636.gvdywx32kwbix6kd@development Tomas suggested a min of 1024, and I thought I was being more conservative choosing 256. Still too high, I guess? I can lower it. What do you think is a reasonable minimum? Regards, Jeff Davis
Re: v13: Performance regression related to FORTIFY_SOURCE
On Thu, 2020-06-04 at 16:35 -0400, Alvaro Herrera wrote: > If it is something worth worrying about, let's discuss what's a good > fix for it. I did post a fix for it, but it's not a very clean fix. I'm slightly inclined to proceed with that fix, but I was hoping someone else would have a better suggestion. How about if I wait another week, and if we still don't have a better fix, I will commit this one. Regards, Jeff Davis
Re: v13: Performance regression related to FORTIFY_SOURCE
On Sun, 2020-04-19 at 16:19 -0700, Peter Geoghegan wrote: > Is it possible that the issue has something to do with what the > compiler knows about the alignment of the tapes back when they were a > flexible array vs. now, where it's a separate allocation? Perhaps I'm > over reaching, but it occurs to me that MemSetAligned() is itself > concerned about the alignment of data returned from palloc(). Could > be > a similar issue here, too. Perhaps, but if so, what remedy would that suggest? Regards, Jeff Davis
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Mon, 25 May 2020 at 19:14, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > On Mon, May 25, 2020 at 06:34:30AM +1200, David Rowley wrote:
> > The difference will be that you'd be setting some distinct_uniquekeys
> > in standard_qp_callback() to explicitly request that some skip scan
> > paths be created for the uniquekeys, whereas the patch here just does
> > not bother doing DISTINCT if the upper relation already has unique
> > keys that state that the DISTINCT is not required. The skip scans
> > patch should check if the RelOptInfo for the uniquekeys set in
> > standard_qp_callback() are already mentioned in the RelOptInfo's
> > uniquekeys. If they are then there's no point in skip scanning as the
> > rel is already unique for the distinct_uniquekeys.
>
> It sounds like that makes the semantics of UniqueKey a bit more
> confusing, doesn't it? At the moment it says:
>
> Represents the unique properties held by a RelOptInfo.
>
> With the proposed changes it would be "unique properties that are held"
> and "unique properties that are requested", which are partially
> duplicated, but stored in some different fields. From the skip scan
> patch perspective it probably doesn't make any difference; it seems like
> the implementation would be almost the same, just the created UniqueKeys
> would be of a different type. But I'm afraid potential future users of
> UniqueKeys could be easily confused.

If there's some comment that says UniqueKeys are for RelOptInfos, then
perhaps that comment just needs to be expanded to mention the Path
uniqueness when we add the uniquekeys field to Path.

I think the main point of basing skip scans on top of this uniquekeys
patch is to ensure it's the right thing for the job. I don't think it's
realistic to be maintaining two different sets of infrastructure which
serve a very similar purpose. It's important we make UniqueKeys general
purpose enough to support future useful forms of optimisation.
Basing skip scans on it seems like a good exercise towards that. I'm not expecting that we need to make zero changes here to allow it to work well with skip scans. David
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020-Jun-04, Andres Freund wrote:

> postgres[52656][1]=# SELECT 1;
> ┌──────────┐
> │ ?column? │
> ├──────────┤
> │        1 │
> └──────────┘
> (1 row)

> I am very much not in love with the way that was implemented, but it's
> there, and it's used as far as I know (cf tablesync.c).

Ouch ... so they made IDENT in the replication grammar be a trigger to
enter the regular grammar. Crazy. No way to put those worms back in the
tin now, I guess.

It is still my opinion that we should prohibit a logical replication
connection from being used to do physical replication. Horiguchi-san,
Sawada-san and Masao-san are all of the same opinion. Dave Cramer (of
the JDBC team) is not opposed to the change -- he says they're just
using it because they didn't realize they should be doing differently.
Both Michael P. and you are saying we shouldn't break it because it
works today, but there isn't a real use-case for it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
Hi,

On 2020-06-04 16:44:53 -0400, Alvaro Herrera wrote:
> A logical replication connection cannot run SQL anyway, can it?

You can:

andres@awork3:~/src/postgresql$ psql 'replication=database'

postgres[52656][1]=# IDENTIFY_SYSTEM;
┌─────────────────────┬──────────┬────────────┬──────────┐
│      systemid       │ timeline │  xlogpos   │  dbname  │
├─────────────────────┼──────────┼────────────┼──────────┤
│ 6821634567571961151 │        1 │ 1/D256EC40 │ postgres │
└─────────────────────┴──────────┴────────────┴──────────┘
(1 row)

postgres[52656][1]=# SELECT 1;
┌──────────┐
│ ?column? │
├──────────┤
│        1 │
└──────────┘
(1 row)

I am very much not in love with the way that was implemented, but it's
there, and it's used as far as I know (cf tablesync.c).

Greetings,

Andres Freund
Re: what can go in root.crt ?
Chapman Flack writes: > Sure. It seems sensible to me to start by documenting /what/ it is doing > now, and to what extent that should be called "its standard behavior" > versus "the way libpq is calling it", because even if nothing is to be > changed, there will be people who need to be able to find that information > to understand what will and won't work. Fair enough. I'm certainly prepared to believe that there might be things we're doing with that API that are not (anymore?) considered best practice. But I'd want to approach any changes as "what is considered best practice", not "how can we get this predetermined behavior". regards, tom lane
Re: what can go in root.crt ?
On 06/04/20 18:03, Tom Lane wrote: > It's possible that we could force openssl to validate cases it doesn't > accept now. Whether we *should* deviate from its standard behavior is > a fairly debatable question though. I would not be inclined to do so > unless we find that many other consumers of the library also do that. > Overriding a library in its specific area of expertise seems like a > good way to get your fingers burnt. Sure. It seems sensible to me to start by documenting /what/ it is doing now, and to what extent that should be called "its standard behavior" versus "the way libpq is calling it", because even if nothing is to be changed, there will be people who need to be able to find that information to understand what will and won't work. Regards, -Chap
Re: what can go in root.crt ?
Chapman Flack writes: > On 06/04/20 17:31, Andrew Dunstan wrote: >> Do we actually do any of this sort of thing? I confess my impression was >> this is all handled by the openssl libraries, we just hand over the >> certs and let openssl do its thing. Am I misinformed about that? > By analogy to other SSL libraries I have worked with, my guess would > be that there are certain settings and callbacks available that would > determine some of what it is doing. It's possible that we could force openssl to validate cases it doesn't accept now. Whether we *should* deviate from its standard behavior is a fairly debatable question though. I would not be inclined to do so unless we find that many other consumers of the library also do that. Overriding a library in its specific area of expertise seems like a good way to get your fingers burnt. regards, tom lane
Re: what can go in root.crt ?
On 06/04/20 17:31, Andrew Dunstan wrote: > Do we actually do any of this sort of thing? I confess my impression was > this is all handled by the openssl libraries, we just hand over the > certs and let openssl do its thing. Am I misinformed about that? I haven't delved very far into the code yet (my initial aim with this thread was not to pose a rhetorical question, but an ordinary one, and somebody would know the answer). By analogy to other SSL libraries I have worked with, my guess would be that there are certain settings and callbacks available that would determine some of what it is doing. In the javax.net.ssl package [1], for example, there are HostnameVerifier and TrustManager interfaces; client code can supply implementations of these that embody its desired policies. Regards, -Chap
Re: what can go in root.crt ?
On 6/3/20 7:57 PM, Chapman Flack wrote:
>
> In an ideal world, I think libpq would be using this algorithm:
>
>   I'm looking at the server's certificate, s.
>   Is s unexpired and in the trust file? If so, SUCCEED.
>
>   otherwise, loop:
>     get issuer certificate i from s (if s is self-signed, FAIL).
>     does i have CA:TRUE and Certificate Sign bits? If not, FAIL.
>     does i's Domain Constraint allow it to sign s? If not, FAIL.
>     is i unexpired, or has s a Signed Certificate Timestamp made
>       while i was unexpired? If not, FAIL.
>     is i in the trust file? If so, SUCCEED.
>     s := i, continue.
>
> (I left out steps like verify signature, check revocation, etc.)
>
> What it seems to be doing, though, is just:
>
>   I'm looking at s
>   Follow chain all the way to a self-signed cert
>   is that in the file?
>
> which seems too simplistic.

Do we actually do any of this sort of thing? I confess my impression was
this is all handled by the openssl libraries, we just hand over the
certs and let openssl do its thing. Am I misinformed about that?

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: repeat() function, CHECK_FOR_INTERRUPTS(), and unlikely()
On 2020-May-28, Joe Conway wrote:

> I backpatched and pushed the changes to the repeat() function. Any other
> opinions regarding backpatch of the unlikely() addition to
> CHECK_FOR_INTERRUPTS()?

We don't use unlikely() in 9.6 at all, so I would stop that backpatching
at 10 anyhow. (We did backpatch unlikely()'s definition afterwards.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: repeat() function, CHECK_FOR_INTERRUPTS(), and unlikely()
On 5/28/20 1:23 PM, Joe Conway wrote:
> On 5/27/20 3:29 AM, Michael Paquier wrote:
> >> I think that each of those tests should have a separate unlikely() marker,
> >> since the whole point here is that we don't expect either of those tests
> >> to yield true in the huge majority of CHECK_FOR_INTERRUPTS executions.
> >
> > +1. I am not sure that the addition of unlikely() should be
> > backpatched though, that's not something usually done.
>
> I backpatched and pushed the changes to the repeat() function. Any other
> opinions regarding backpatch of the unlikely() addition to
> CHECK_FOR_INTERRUPTS()?

So far I have:

  Tom     +1
  Michael -1
  me      +0

on backpatching the addition of unlikely() to CHECK_FOR_INTERRUPTS().
Assuming no one else chimes in I will push the attached to all supported
branches sometime before Tom creates the REL_13_STABLE branch on Sunday.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127..18bc8a7 100644
*** a/src/include/miscadmin.h
--- b/src/include/miscadmin.h
*** extern void ProcessInterrupts(void);
*** 98,113 ****
  #define CHECK_FOR_INTERRUPTS() \
  do { \
! 	if (InterruptPending) \
  		ProcessInterrupts(); \
  } while(0)

  #else	/* WIN32 */

  #define CHECK_FOR_INTERRUPTS() \
  do { \
! 	if (UNBLOCKED_SIGNAL_QUEUE()) \
  		pgwin32_dispatch_queued_signals(); \
! 	if (InterruptPending) \
  		ProcessInterrupts(); \
  } while(0)

  #endif	/* WIN32 */
--- 98,113 ----
  #define CHECK_FOR_INTERRUPTS() \
  do { \
! 	if (unlikely(InterruptPending)) \
  		ProcessInterrupts(); \
  } while(0)

  #else	/* WIN32 */

  #define CHECK_FOR_INTERRUPTS() \
  do { \
! 	if (unlikely(UNBLOCKED_SIGNAL_QUEUE())) \
  		pgwin32_dispatch_queued_signals(); \
! 	if (unlikely(InterruptPending)) \
  		ProcessInterrupts(); \
  } while(0)

  #endif	/* WIN32 */
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020-Jun-04, Michael Paquier wrote:

> On Wed, Jun 03, 2020 at 06:33:11PM -0700, Andres Freund wrote:
> >> I don't think having a physical replication connection access catalog
> >> data directly is a great idea. We already have gadgets like
> >> IDENTIFY_SYSTEM for physical replication that can do that, and if you
> >> need particular settings you can use SHOW (commit d1ecd539477). If
> >> there was a strong need for even more than that, we can add something to
> >> the grammar.
> >
> > Those special case things are a bad idea, and we shouldn't introduce
> > more. It's unrealistic that we can ever make that support everything,
> > and since we already have to support the database connected thing, I
> > don't see the point.
>
> Let's continue discussing this part as well.

A logical replication connection cannot run SQL anyway, can it? It's
limited to the replication grammar, so it's not like you can run
arbitrary queries to access catalog data. So even if we do need to
access the catalogs, we'd have to add stuff to the replication grammar
in order to support that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: v13: Performance regression related to FORTIFY_SOURCE
Speaking with my RMT hat on, I'm concerned that this item is not moving
forward at all. ISTM we first and foremost need to decide whether this
is a problem worth worrying about, or not. If it is something worth
worrying about, let's discuss what's a good fix for it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Atomic operations within spinlocks
Andres Freund writes: > On 2020-06-04 14:50:40 -0400, Tom Lane wrote: >> 2. The computed completePasses value would go backwards. I bet >> that wouldn't matter too much either, or at least we could teach >> BgBufferSync to cope. (I notice the comments therein suggest that >> it is already designed to cope with completePasses wrapping around, >> so maybe nothing needs to be done.) > If we're not concerned about that, then we can remove the > atomic-inside-spinlock, I think. The only reason for that right now is > to avoid assuming a wrong pass number. Hmm. That might be a less-invasive path to a solution. I can take a look, if you don't want to. regards, tom lane
Re: Atomic operations within spinlocks
Andres Freund writes: > I'd still like to know which problem we're actually trying to solve > here. I don't understand the "error" issues you mentioned upthread. If you error out of getting the inner spinlock, the outer spinlock is stuck, permanently, because there is no mechanism for spinlock release during transaction abort. Admittedly it's not very likely for the inner acquisition to fail, but it's possible. Aside from timeout scenarios (e.g., process holding lock gets swapped out to Timbuktu), it could be that both spinlocks are mapped onto a single implementation lock by spin.c, which notes * We map all spinlocks onto a set of NUM_SPINLOCK_SEMAPHORES semaphores. * It's okay to map multiple spinlocks onto one semaphore because no process * should ever hold more than one at a time. You've falsified that argument ... and no, I don't want to upgrade the spinlock infrastructure enough to make this OK. We shouldn't ever be holding spinlocks long enough, or doing anything complicated enough inside them, for such an upgrade to have merit. regards, tom lane
Re: Atomic operations within spinlocks
Hi,

On 2020-06-04 14:50:40 -0400, Tom Lane wrote:
> Actually ... we could probably use this design with a uint32 counter
> as well, on machines where the 64-bit operations would be slow.

On skylake-x even a 32bit [i]div is still 26 cycles. That's more than an
atomic operation (18 cycles).

> 2. The computed completePasses value would go backwards. I bet
> that wouldn't matter too much either, or at least we could teach
> BgBufferSync to cope. (I notice the comments therein suggest that
> it is already designed to cope with completePasses wrapping around,
> so maybe nothing needs to be done.)

If we're not concerned about that, then we can remove the
atomic-inside-spinlock, I think. The only reason for that right now is
to avoid assuming a wrong pass number.

I don't think completePasses wrapping around is comparable in frequency
to wrapping around nextVictimBuffer. It's not really worth worrying
about bgwriter wrongly assuming it lapped the clock sweep once every
UINT32_MAX * NBuffers ticks, but there being a risk every NBuffers seems
worth worrying about.

Greetings,

Andres Freund
Re: Atomic operations within spinlocks
Hi, On 2020-06-04 13:57:19 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2020-06-03 14:19:45 -0400, Tom Lane wrote: > >> This seems to me to be very bad code. > > > I think straight out prohibiting use of atomics inside a spinlock will > > lead to more complicated and slower code, rather than than improving > > anything. I'm to blame for the ClockSweepTick() case, and I don't really > > see how we could avoid the atomic-while-spinlocked without relying on > > 64bit atomics, which are emulated on more platforms, and without having > > a more complicated retry loop. > > Well, if you don't want to touch freelist.c then there is no point > worrying about what pg_stat_get_wal_receiver is doing. But having > now studied that freelist.c code, I don't like it one bit. It's > overly complicated and not very accurately commented, making it > *really* hard to convince oneself that it's not buggy. > > I think a good case could be made for ripping out what's there now > and just redefining nextVictimBuffer as a pg_atomic_uint64 that we > never reset (ie, make its comment actually true). Then ClockSweepTick > reduces to Note that we can't do that in the older back branches, there wasn't any 64bit atomics fallback before commit e8fdbd58fe564a29977f4331cd26f9697d76fc40 Author: Andres Freund Date: 2017-04-07 14:44:47 -0700 Improve 64bit atomics support. > pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1) % NBuffers > > and when we want to know how many cycles have been completed, we just > divide the counter by NBuffers. Now admittedly, this is relying on both > pg_atomic_fetch_add_u64 being fast and int64 division being fast ... but > to throw your own argument back at you, do we really care anymore about > performance on machines where those things aren't true? Furthermore, > since all this is happening in a code path that's going to lead to I/O, > I'm not exactly convinced that a few nanoseconds matter anyway. 
It's very easy to observe this code being a bottleneck. If we only
performed a single clock tick before IO, sure, then the cost would
obviously be swamped by the IO cost. But it's pretty common to end up
having to do that ~ NBuffers * 5 times for a single buffer.

I don't think it's realistic to rely on 64bit integer division being
fast in this path. The latency is pretty darn significant (64bit div is
35-88 cycles on skylake-x, 64bit idiv 42-95). And unless I
misunderstand, you'd have to do so (for % NBuffers) every single clock
tick, not just when we wrap around.

We could however avoid the spinlock if we were to use 64bit atomics, by
storing the number of passes in the upper 32 bits, and the next victim
buffer in the lower. But that doesn't seem that simple either, and will
regress performance on 32bit platforms.

I don't know the whole strategy logic all that well, so I guess there's
some argument to improve things from that end. It's probably possible to
avoid the lock with 32bit atomics as well.

I'd still like to know which problem we're actually trying to solve
here. I don't understand the "error" issues you mentioned upthread.

Greetings,

Andres Freund
Re: Atomic operations within spinlocks
I wrote: > I think a good case could be made for ripping out what's there now > and just redefining nextVictimBuffer as a pg_atomic_uint64 that we > never reset (ie, make its comment actually true). Then ClockSweepTick > reduces to > pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1) % NBuffers > and when we want to know how many cycles have been completed, we just > divide the counter by NBuffers. Actually ... we could probably use this design with a uint32 counter as well, on machines where the 64-bit operations would be slow. In that case, integer overflow of nextVictimBuffer would happen from time to time, resulting in 1. The next actual victim buffer index would jump strangely. This doesn't seem like it'd matter at all, as long as it was infrequent. 2. The computed completePasses value would go backwards. I bet that wouldn't matter too much either, or at least we could teach BgBufferSync to cope. (I notice the comments therein suggest that it is already designed to cope with completePasses wrapping around, so maybe nothing needs to be done.) If NBuffers was large enough to be a significant fraction of UINT_MAX, then these corner cases would happen often enough to perhaps be problematic. But I seriously doubt that'd be possible on hardware that wasn't capable of using the 64-bit code path. regards, tom lane
REL_13_STABLE Branch
Hi,

After conferring, the PostgreSQL 13 RMT[1] has decided that it is time
to create the REL_13_STABLE branch. Tom has volunteered to create the
branch this Sunday (2020-06-07).

Please let us know if you have any questions.

Thanks,

Alvaro, Peter, Jonathan

[1] https://wiki.postgresql.org/wiki/Release_Management_Team
Re: significant slowdown of HashAggregate between 9.6 and 10
Hi,

On 2020-06-03 13:26:43 -0700, Andres Freund wrote:
> On 2020-06-03 21:31:01 +0200, Tomas Vondra wrote:
> > So there seems to be +40% between 9.6 and 10, and further +25% between
> > 10 and master. However, plain hashagg, measured e.g. like this:

As far as I can tell the 10->master difference comes largely from the
difference in the number of buckets in the hashtable. In 10 it is:

Breakpoint 1, tuplehash_create (ctx=0x5628251775c8, nelements=75,
    private_data=0x5628251952f0)

and in master it is:

Breakpoint 1, tuplehash_create (ctx=0x5628293a0a70, nelements=256,
    private_data=0x5628293a0b90)

As far as I can tell the timing difference simply is the cost of
iterating 500k times over a hashtable with fairly few entries. Which is,
unsurprisingly, more expensive if the hashtable is larger.

The reason the hashtable got bigger in 12 is

commit 1f39bce021540fde00990af55b4432c55ef4b3c7
Author: Jeff Davis
Date:   2020-03-18 15:42:02 -0700

    Disk-based Hash Aggregation.

which introduced

+/* minimum number of initial hash table buckets */
+#define HASHAGG_MIN_BUCKETS 256

I don't really see much explanation for that part in the commit, perhaps
Jeff can chime in?

I think optimizing for the gazillion hash table scans isn't particularly
important. Rarely is a query going to have 500k scans of unchanging
aggregated data. So I'm not too concerned about the 13 regression - but
I also see very little reason to just always use 256 buckets? It's
pretty darn common to end up with 1-2 groups, what's the point of this?

I'll look into 9.6->10 after buying groceries... But I'd wish there were
a relevant benchmark, I don't think it's worth optimizing for this case.

Greetings,

Andres Freund
Re: Regarding TZ conversion
Thanks for the clarification. Is it advisable to modify the Default
file? Will it be overwritten when we apply a patch or upgrade the DB?
What about creating a new file like the one below and updating
postgresql.conf with the new name?

# New tz offset
@INCLUDE Default
@OVERRIDE
IST    19800

*Regards,*
*Rajin *

On Thu, Jun 4, 2020 at 7:23 PM Tom Lane wrote:
> Rajin Raj writes:
> > Option 1: AT TIME ZONE 'IST'
> > Option 2: AT TIME ZONE 'Asia/Kolkata'
> > In the first option, I get +2:00:00 offset (when *timezone_abbreviations =
> > 'Default'*) and for option 2, +5:30 offset.
>
> > I can see multiple entries for IST in pg_timezone_names with
> > different utc_offset, but in pg_timezone_abbrev there is one entry. I
> > guess the AT TIME ZONE function is using the offset shown in
> > pg_timezone_abbrev.
>
> No. If you use an abbreviation rather than a spelled-out zone name,
> you get whatever the timezone_abbreviations file says, which by default
> is
>
> $ grep IST .../postgresql/share/timezonesets/Default
> # CONFLICT! IST is not unique
> # - IST: Irish Standard Time (Europe)
> # - IST: Indian Standard Time (Asia)
> IST    7200    # Israel Standard Time
>
> If that's not what you want, change it. See
>
> https://www.postgresql.org/docs/current/datetime-config-files.html
>
> and also
>
> https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-TIMEZONES
>
> regards, tom lane
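Per the datetime-config-files documentation linked above, the usual answer to the "will it be overwritten" concern is exactly the approach sketched in the mail: leave Default alone and create a new abbreviations file next to it (the file name "India" below is just an example):

```
# share/timezonesets/India
@INCLUDE Default
@OVERRIDE
IST    19800    # Indian Standard Time (UTC+5:30)
```

then point the server at it with `timezone_abbreviations = 'India'` in postgresql.conf (the parameter can also be changed per-session with SET). Minor-version patches don't touch a custom file in share/timezonesets; a major-version upgrade installs a fresh share tree, so the custom file would need to be copied over — details depend on the packaging/installation layout.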
Re: Removal of currtid()/currtid2() and some table AM cleanup
Hi, On 2020-06-03 11:14:48 +0900, Michael Paquier wrote: > I would like to remove those two functions and the surrounding code > for v14, leading to some cleanup: > 6 files changed, 326 deletions(-) +1 > While on it, I have noticed that heap_get_latest_tid() is still > located within heapam.c, but we can just move it within > heapam_handler.c. What's the point of that change? I think the differentiation between heapam_handler.c and heapam.c could be clearer, but if anything, I'd argue that heap_get_latest_tid is sufficiently low-level that it'd belong in heapam.c. Greetings, Andres Freund
Re: Atomic operations within spinlocks
Andres Freund writes: > On 2020-06-03 14:19:45 -0400, Tom Lane wrote: >> This seems to me to be very bad code. > I think straight out prohibiting use of atomics inside a spinlock will > lead to more complicated and slower code, rather than than improving > anything. I'm to blame for the ClockSweepTick() case, and I don't really > see how we could avoid the atomic-while-spinlocked without relying on > 64bit atomics, which are emulated on more platforms, and without having > a more complicated retry loop. Well, if you don't want to touch freelist.c then there is no point worrying about what pg_stat_get_wal_receiver is doing. But having now studied that freelist.c code, I don't like it one bit. It's overly complicated and not very accurately commented, making it *really* hard to convince oneself that it's not buggy. I think a good case could be made for ripping out what's there now and just redefining nextVictimBuffer as a pg_atomic_uint64 that we never reset (ie, make its comment actually true). Then ClockSweepTick reduces to pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1) % NBuffers and when we want to know how many cycles have been completed, we just divide the counter by NBuffers. Now admittedly, this is relying on both pg_atomic_fetch_add_u64 being fast and int64 division being fast ... but to throw your own argument back at you, do we really care anymore about performance on machines where those things aren't true? Furthermore, since all this is happening in a code path that's going to lead to I/O, I'm not exactly convinced that a few nanoseconds matter anyway. regards, tom lane
Re: should INSERT SELECT use a BulkInsertState?
Hi,

On 2020-05-08 02:25:45 -0500, Justin Pryzby wrote:
> Seems to me it should, at least conditionally. At least if there's a
> function scan or a relation or ..

Well, the problem is that this can cause very very significant
regressions. As in 10x slower or more. The ringbuffer can cause constant
XLogFlush() calls (due to the lsn interlock), and the eviction from
shared_buffers (regardless of actual availability) will mean future
vacuums etc will be much slower. I think this is likely to cause pretty
widespread regressions on upgrades.

Now, it sucks that we have this problem in the general facility that's
supposed to be used for this kind of bulk operation. But I don't really
see it as realistic to expand the use of bulk insert strategies unless
we have some more fundamental fixes.

Regards,

Andres
Re: Expand the use of check_canonical_path() for more GUCs
Peter Eisentraut writes:
> This (and some other messages in this thread) appears to assume that
> canonicalize_path() turns relative paths into absolute paths, but
> AFAICT, it does not do that.

Ah, fair point --- I'd been assuming that we were applying
canonicalize_path as cleanup for an absolute-ification operation, but
you are right that check_canonical_path does not do that.

Digging around, though, I notice a different motivation. In
assign_pgstat_temp_directory we have

	/* check_canonical_path already canonicalized newval for us */
	...
	tname = guc_malloc(ERROR, strlen(newval) + 12);	/* /global.tmp */
	sprintf(tname, "%s/global.tmp", newval);
	fname = guc_malloc(ERROR, strlen(newval) + 13);	/* /global.stat */
	sprintf(fname, "%s/global.stat", newval);

and I believe what the comment is on about is that these path derivation
operations are unreliable if newval isn't in canonical form. I seem to
remember for example that in some Windows configurations, mixing slashes
and backslashes doesn't work. So the real point here is that we could
use the user's string unmodified as long as we only use it exactly
as-is, but cases where we derive other pathnames from it require more
work.

Of course, we could leave the GUC string alone and only canonicalize
while forming derived paths, but that seems mighty error-prone. In any
case, just ripping out the check_canonical_path usages without any other
mop-up will break things.

regards, tom lane
Re: Expand the use of check_canonical_path() for more GUCs
On 2020-06-03 20:45, Tom Lane wrote:
> Robert Haas writes:
>> On Tue, Jun 2, 2020 at 5:04 AM Peter Eisentraut wrote:
>>> The archeology reveals that these calls were originally added to
>>> canonicalize the data_directory and config_file settings (7b0f060d54),
>>> but that was then moved out of guc.c to be done early during postmaster
>>> startup (337ffcddba). The remaining calls of check_canonical_path() in
>>> guc.c appear to be leftovers from a previous regime.
>> Thanks for looking into it. Sounds like it can just be ripped out, then,
>> unless someone knows of a reason to do otherwise.
> In the abstract, I agree with Peter's point that we shouldn't alter
> user-given strings without need. However, I think there's strong reason
> for canonicalizing the data directory and config file locations.

It is not proposed to change that. It is only debated whether the same canonicalization should be applied to other GUCs that represent paths.

> We access those both before and after chdir'ing into the datadir, so
> we'd better have absolute paths to them --- and at least for the datadir,
> it's documented that you can initially give it as a path relative to
> wherever you started the postmaster from. If the other files are only
> accessed after the chdir happens then we could likely do without
> canonicalizing them. But ... do we know which directory the user (thought
> he) specified them with reference to? Forced canonicalization does have
> the advantage that it's clear to all onlookers how we are interpreting
> the paths.

This (and some other messages in this thread) appears to assume that canonicalize_path() turns relative paths into absolute paths, but AFAICT, it does not do that.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: libpq copy error handling busted
I wrote: > * As for control-C not getting out of it: there is > if (CancelRequested) > break; > in pgbench's loop, but this does nothing in this scenario because > fe-utils/cancel.c only sets that flag when it successfully sends a > Cancel ... which it certainly cannot if the postmaster is gone. > I suspect that this may be relatively recent breakage. It doesn't look > too well thought out, in any case. The places that are testing this > flag look like they'd rather not be bothered with the fine point of > whether the cancel request actually went anywhere. On closer inspection, it seems that scripts_parallel.c does have a dependency on the cancel request having been sent, because it insists on collecting a query result off the active connection after detecting CancelRequested. This seems dangerously overoptimistic to me; it will lock up if for any reason the server doesn't honor the cancel request. It's also pointless, because all the calling apps are just going to close their connections and exit(1) afterwards, so there's no use in trying to resynchronize the connection state. (Plus, disconnectDatabase will issue cancels on any busy connections; which would be necessary anyway in a parallel operation, since cancel.c could only have signaled one of them.) So the attached patch just removes the useless consumeQueryResult call, and then simplifies select_loop's API a bit. With that change, I don't see any place that wants the existing definition of CancelRequested rather than the simpler meaning of "SIGINT was received", so I just changed it to mean that. We could certainly also have a variable tracking whether a cancel request was sent, but I see no point in one right now. It's easiest to test this *without* the other patch -- just run the pgbench scenario Andres demonstrated, and see whether control-C gets pgbench to quit cleanly. 
			regards, tom lane

diff --git a/src/bin/scripts/scripts_parallel.c b/src/bin/scripts/scripts_parallel.c
index 45c69b8d19..c3384d452a 100644
--- a/src/bin/scripts/scripts_parallel.c
+++ b/src/bin/scripts/scripts_parallel.c
@@ -28,7 +28,7 @@
 #include "scripts_parallel.h"
 
 static void init_slot(ParallelSlot *slot, PGconn *conn);
-static int	select_loop(int maxFd, fd_set *workerset, bool *aborting);
+static int	select_loop(int maxFd, fd_set *workerset);
 
 static void
 init_slot(ParallelSlot *slot, PGconn *conn)
@@ -41,23 +41,16 @@ init_slot(ParallelSlot *slot, PGconn *conn)
 /*
  * Loop on select() until a descriptor from the given set becomes readable.
  *
- * If we get a cancel request while we're waiting, we forego all further
- * processing and set the *aborting flag to true.  The return value must be
- * ignored in this case.  Otherwise, *aborting is set to false.
+ * Returns -1 on failure (including getting a cancel request).
  */
 static int
-select_loop(int maxFd, fd_set *workerset, bool *aborting)
+select_loop(int maxFd, fd_set *workerset)
 {
 	int			i;
 	fd_set		saveSet = *workerset;
 
 	if (CancelRequested)
-	{
-		*aborting = true;
 		return -1;
-	}
-	else
-		*aborting = false;
 
 	for (;;)
 	{
@@ -90,7 +83,7 @@
 		if (i < 0 && errno == EINTR)
 			continue;			/* ignore this */
 		if (i < 0 || CancelRequested)
-			*aborting = true;	/* but not this */
+			return -1;			/* but not this */
 		if (i == 0)
 			continue;			/* timeout (Win32 only) */
 		break;
@@ -135,7 +128,6 @@ ParallelSlotsGetIdle(ParallelSlot *slots, int numslots)
 {
 	fd_set		slotset;
 	int			maxFd = 0;
-	bool		aborting;
 
 	/* We must reconstruct the fd_set for each call to select_loop */
 	FD_ZERO(&slotset);
@@ -157,19 +149,12 @@ ParallelSlotsGetIdle(ParallelSlot *slots, int numslots)
 	}
 
 	SetCancelConn(slots->connection);
-	i = select_loop(maxFd, &slotset, &aborting);
+	i = select_loop(maxFd, &slotset);
 	ResetCancelConn();
 
-	if (aborting)
-	{
-		/*
-		 * We set the cancel-receiving connection to the one in the zeroth
-		 * slot above, so fetch the error from there.
-		 */
-		consumeQueryResult(slots->connection);
+	/* failure? */
+	if (i < 0)
 		return NULL;
-	}
-	Assert(i != 0);
 
 	for (i = 0; i < numslots; i++)
 	{
diff --git a/src/fe_utils/cancel.c b/src/fe_utils/cancel.c
index eb4056a9a6..51fb67d384 100644
--- a/src/fe_utils/cancel.c
+++ b/src/fe_utils/cancel.c
@@ -43,11 +43,11 @@ static PGcancel *volatile cancelConn = NULL;
 
 /*
- * CancelRequested tracks if a cancellation request has completed after
- * a signal interruption.  Note that if cancelConn is not set, in short
- * if SetCancelConn() was never called or if ResetCancelConn() freed
- * the cancellation object, then CancelRequested is switched to true after
- * all cancellation attempts.
+ * CancelRequested is set when we receive SIGINT (or local equivalent).
+ * There is no provision in this module for resetting it; but applications
+ * might choose to clear it after successfully
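The simplified meaning of CancelRequested that the patch settles on ("SIGINT was received", nothing more) follows the classic async-signal-safe flag pattern. The following is a hedged, self-contained sketch of that pattern, not the actual fe_utils/cancel.c code:

```c
#include <assert.h>
#include <signal.h>

/*
 * Sketch only: CancelRequested records that SIGINT arrived.  It says
 * nothing about whether a cancel request was actually delivered to any
 * server; callers that care about that must track it separately.
 */
static volatile sig_atomic_t CancelRequested = 0;

static void
handle_sigint(int signo)
{
	(void) signo;
	CancelRequested = 1;		/* async-signal-safe: only set the flag */
}

static void
install_cancel_handler(void)
{
	signal(SIGINT, handle_sigint);
}
```

Loops like the one in pgbench can then test the flag and break out even when no server is reachable, which is exactly the scenario where the old "cancel was sent" semantics got stuck.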
Re: Just for fun: Postgres 20?
On Tue, Jun 2, 2020 at 2:45 PM Robert Haas wrote:
> On Mon, Jun 1, 2020 at 3:20 PM Tom Lane wrote:
> > Robert Haas writes:
> > > As has already been pointed out, it could definitely happen, but we
> > > could solve that by just using a longer version number, say, including
> > > the month and, in case we ever do multiple major releases in the same
> > > month, also the day. In fact, we might as well take it one step
> > > further and use the same format for the release number that we use for
> > > CATALOG_VERSION_NO: YYYYMMDDN. So this fall, piggybacking on the
> > > success of PostgreSQL 10, 11, and 12, we could then release
> > > PostgreSQL 202009241 or so.
> >
> > But then where do you put the minor number for maintenance releases?
>
> Oh, well that's easy. The first maintenance release would just be
> 202009241.1.
>
> Unless, of course, we want to simplify things by using the same format
> for both parts of the version number. Then, supposing the first
> maintenance release follows the major release by a month or so, it
> would be PostgreSQL 202009241.202010291 or something of this sort.

Since there is a proposal to have a 64-bit transaction ID, we could rather have a 64-bit random number, which could solve all of these problems. :P And then, if I ask my customer what Postgres version he/she is using, it could be a Postgres fun ride.

> It's hard to agree on anything around here but I suspect we can come
> to near-unanimous agreement on the topic of how much merit this
> proposal has.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

-- 
Regards,
Avinash Vallarapu
Re: what can go in root.crt ?
On 06/04/20 11:04, Laurenz Albe wrote: > I was referring to the wish to *not* use a self-signed CA certificate, > but an intermediate certificate as the ultimate authority, based on > a distrust of the certification authority that your organization says > you should trust. Are you aware of any principled reason it should be impossible to include an end-entity certificate in the trust store used by a client? Are you aware of any principled reason it should be impossible to include a certificate that has the CA:TRUE and Certificate Sign bits in the trust store used by a client, whether it is its own signer or has been signed by another CA? Regards, -Chap
Re: Possible bug on Postgres 12 (CASE THEN evaluated prematurely) - Change of behaviour compared to 11, 10, 9
Thanks Tom! I was just hoping somebody could point out whether this kind of issue has already been reported, before I spend 2 days fabricating a simpler self-contained example.

Best,
Juan

> On 4 Jun 2020, at 16:26, Tom Lane wrote:
>
> Juan Fuentes writes:
>> As you could see the query includes castings, we noticed testing with
>> Postgres 12 that the castings of the CASE THEN statement (commented out
>> below) where failing in some cases, of course if you do the INNER JOIN and
>> CASE WHEN first our expectation is that the value can be casted.
>
> You're unlikely to get any useful comments on this if you don't provide
> a self-contained example. The query by itself lacks too many details.
> As an example, one way "t7.value::numeric = 1" could fail despite being
> inside a CASE is if t7 is a view whose "value" column is actually a
> constant. Flattening of the view would replace "t7.value" with that
> constant, and then constant-folding would cause the failure, and neither
> of those things are prevented by a CASE. I kind of doubt that that's
> the specific issue here, but I'm not going to guess at what is in your
> thirty-some input tables.
>
> regards, tom lane
Re: what can go in root.crt ?
On Thu, 2020-06-04 at 08:25 -0400, Chapman Flack wrote: > > I feel bad about bending the basic idea of certificates and trust to suit > > some misbegotten bureaucratic constraints on good security. > > Can you elaborate on what, in the email message you replied to here, > represented a bending of the basic idea of certificates and trust? > > I didn't notice any. I was referring to the wish to *not* use a self-signed CA certificate, but an intermediate certificate as the ultimate authority, based on a distrust of the certification authority that your organization says you should trust. Yours, Laurenz Albe
Re: Wrong width of UNION statement
Hello,

Thank you for your quick response, and sorry for my late reply.

> (I suppose you're using UTF8 encoding...)

That is right. As you said, the encoding of my database is UTF8.

> There's room for improvement there, but this is all bound up in the legacy
> mess that we have in prepunion.c.

At first, I thought it would be easy to fix, by calculating the width in the same way as UNION ALL does. But now I understand it is not so easy. I'll report back if I find a reason strong enough to justify throwing away and rewriting prepunion.c.

Thank you.

Regards,
Kenichiro Tanaka

2020年6月2日(火) 0:04 Tom Lane :
>
> Kenichiro Tanaka writes:
> > I think table column width of UNION statement should be equal one of UNION
> > ALL.
>
> I don't buy that argument, because there could be type coercions involved,
> so that the result width isn't necessarily equal to any one of the inputs.
>
> Having said that, the example you give shows that we make use of
> pg_statistic.stawidth values when estimating the width of immediate
> relation outputs, but that data isn't available by the time we get to
> a UNION output. So we fall back to get_typavgwidth, which in this
> case is going to produce something involving the typmod times the
> maximum encoding length. (I suppose you're using UTF8 encoding...)
>
> There's room for improvement there, but this is all bound up in the legacy
> mess that we have in prepunion.c. For example, because we don't have
> RelOptInfo nodes associated with individual set-operation outputs, it's
> difficult to figure out where we might store data about the widths of such
> outputs. Nor could we easily access the data if we had it, since the
> associated Vars don't have valid RTE indexes. So to my mind, that code
> needs to be thrown away and rewritten, using actual relations to represent
> the different setop results and Paths to represent possible computations.
> In the meantime, it's hard to get excited about layering some additional
> hacks on top of what's there now.
> > regards, tom lane
Re: Possible bug on Postgres 12 (CASE THEN evaluated prematurely) - Change of behaviour compared to 11, 10, 9
Juan Fuentes writes: > As you could see the query includes castings, we noticed testing with > Postgres 12 that the castings of the CASE THEN statement (commented out > below) where failing in some cases, of course if you do the INNER JOIN and > CASE WHEN first our expectation is that the value can be casted. You're unlikely to get any useful comments on this if you don't provide a self-contained example. The query by itself lacks too many details. As an example, one way "t7.value::numeric = 1" could fail despite being inside a CASE is if t7 is a view whose "value" column is actually a constant. Flattening of the view would replace "t7.value" with that constant, and then constant-folding would cause the failure, and neither of those things are prevented by a CASE. I kind of doubt that that's the specific issue here, but I'm not going to guess at what is in your thirty-some input tables. regards, tom lane
Re: Regarding TZ conversion
Rajin Raj writes:
> Option 1: AT TIME ZONE 'IST'
> Option 2: AT TIME ZONE 'Asia/Kolkata'
> In the first option, I get a +02:00:00 offset (when *timezone_abbreviations =
> 'Default'*) and for option 2, a +05:30 offset.
> I can see multiple entries for IST in pg_timezone_names with
> different utc_offset, but in pg_timezone_abbrevs there is one entry. I guess
> the AT TIME ZONE function is using the offset shown in pg_timezone_abbrevs.

No. If you use an abbreviation rather than a spelled-out zone name, you get whatever the timezone_abbreviations file says, which by default is

$ grep IST .../postgresql/share/timezonesets/Default
# CONFLICT! IST is not unique
# - IST: Irish Standard Time (Europe)
# - IST: Indian Standard Time (Asia)
IST     7200    # Israel Standard Time

If that's not what you want, change it. See
https://www.postgresql.org/docs/current/datetime-config-files.html
and also
https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-TIMEZONES

			regards, tom lane
Re: question regarding copyData containers
Hello,

thank you for your feedback. I agree that modifying the COPY subprotocols is hard to do because it would have an impact on the client ecosystem. My understanding (which seems to be confirmed by what Tom Lane said) is that the server discards the framing and manages to make sense of the underlying data.

> the expectation is that clients can send CopyData messages that are
> split up however they choose; the message boundaries needn't correspond
> to any semantic boundaries in the data stream.

So I thought that a client could decide to have the same behavior and could start parsing the payload of a copyData message without assembling it first. It works perfectly with COPY TO, but I hit a roadblock with copyBoth during logical replication with test_decoding because the subprotocol doesn't have any framing.

> Right now all 'w' messages should be contained in one CopyData/'d' that
> doesn't contain anything but the XLogData/'w'.

The current format of the XLogData/'w' message is

  w lsn lsn time byte[n]

and even if it is maybe too late now, I was wondering why it was not decided to be

  w lsn lsn time n byte[n]

because it seems to me that the missing n ties the XLogData to the copyData framing.

> The input data exists in a linear
> buffer already, so you're not going to reduce peak memory usage by
> sending smaller CopyData chunks.

That is very surprising to me. Do you mean that on the server, in COPY TO mode, a full row is prepared in a linear buffer in memory before being sent as a copyData/'d'? I found the code around
https://github.com/postgres/postgres/blob/master/src/backend/commands/copy.c#L2153
and indeed the whole row seems to be buffered in memory. Good thing or bad thing, users tend to use bigger fields (text, jsonb, bytea) and that can be very memory hungry. Do you know of a case in postgres (other than large objects, I suppose) where the server can flush data from a field without buffering it in memory?

And then, as you noted, there is the multiplexing of events.
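To make the framing point concrete, here is a small sketch of parsing an XLogData/'w' message out of one complete CopyData payload, following the message layout in the protocol documentation (one leading 'w' byte, then three big-endian int64 fields). Note how `data_len` can only be computed from the CopyData envelope length, since the 'w' header itself carries no length field; the struct and function names are illustrative, not from the PostgreSQL sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a big-endian (network-order) int64, as the wire protocol uses. */
static int64_t
read_be64(const uint8_t *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return (int64_t) v;
}

typedef struct XLogDataMsg
{
	int64_t		wal_start;		/* LSN of the payload's first byte */
	int64_t		wal_end;		/* current end of WAL on the server */
	int64_t		send_time;		/* server send timestamp */
	const uint8_t *data;		/* payload, runs to end of the CopyData */
	size_t		data_len;
} XLogDataMsg;

/*
 * Parse an XLogData ('w') message, given the payload of one complete
 * CopyData message.  Returns 0 on success, -1 if it isn't a 'w' message.
 * data_len is derived purely from the CopyData framing (len - 25): this
 * is the "missing n" discussed above.
 */
static int
parse_xlogdata(const uint8_t *buf, size_t len, XLogDataMsg *out)
{
	if (len < 25 || buf[0] != 'w')
		return -1;
	out->wal_start = read_be64(buf + 1);
	out->wal_end = read_be64(buf + 9);
	out->send_time = read_be64(buf + 17);
	out->data = buf + 25;
	out->data_len = len - 25;
	return 0;
}
```

A streaming client that wants to process 'w' payloads incrementally therefore has to track the CopyData envelope itself, because nothing inside the 'w' message tells it where the payload ends.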
A very long copyData makes communication between the client and the server impossible during the transfer. I briefly looked at
https://github.com/postgres/postgres/blob/master/src/backend/replication/walsender.c
and I found

/*
 * Maximum data payload in a WAL data message.  Must be >= XLOG_BLCKSZ.
 *
 * We don't have a good idea of what a good value would be; there's some
 * overhead per message in both walsender and walreceiver, but on the other
 * hand sending large batches makes walsender less responsive to signals
 * because signals are checked only between messages.  128kB (with
 * default 8k blocks) seems like a reasonable guess for now.
 */
#define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)

so I thought that the maximum copyData/'d' I would receive during logical replication was MAX_SEND_SIZE, but it seems that this is not used for logical decoding. The whole output of the output plugin seems to be prepared in memory, so for an insert like

  insert into mytable (col) values (repeat('-', pow(2, 27)::int));

a 128MB linear buffer will be created on the server and sent as one copyData over many network chunks.

So I understand that in the long term copyData framing should not carry any semantics, to be able to keep messages small enough to allow multiplexing, but that there are many steps to climb before that. Would it make sense one day, in some way, to try and do streaming at the sub-field level? I guess that is a huge undertaking since most of the field unit interfaces are probably based on a buffer/field one-to-one mapping.

Greetings,
Jérôme

On Thu, Jun 4, 2020 at 12:08 AM Andres Freund wrote:
> Hi,
>
> On 2020-06-03 19:28:12 +0200, Jerome Wagner wrote:
> > I have been working on a node.js streaming client for different COPY
> > scenarios.
> > usually, during CopyOut, clients tend to buffer network chunks until they
> > have gathered a full copyData message and pass that to the user.
> >
> > In some cases, this can lead to very large copyData messages. when there
> > are very long text fields or bytea fields it will require a lot of memory
> > to be handled (up to 1GB I think in the worst case scenario)
> >
> > In COPY TO, I managed to relax that requirement, considering that copyData
> > is simply a transparent container. For each network chunk, the relevant
> > message content is forwarded, which makes for 64KB chunks at most.
>
> Uhm.
>
> > We lose the semantics of the "row" that copyData has according to the
> > documentation
> > https://www.postgresql.org/docs/10/protocol-flow.html#PROTOCOL-COPY
> >
> > > The backend sends a CopyOutResponse message to the frontend, followed by
> > > zero or more CopyData messages (**always one per row**), followed by
> > > CopyDone
> >
> > but it is not a problem because the raw bytes are still parsable (rows +
> > fields) in text mode (tsv) and in binary mode
>
> This seems like an extremely bad idea to me. Are we really going to ask
> clients to incur the overhead (both in complexity and runtime) to parse
> incoming
[PATCH] pg_dump: Add example and link for --encoding option
To let users know what kinds of character sets can be used, this adds an example and a link to the documentation of the --encoding option.

Thanks,
Dong wook

0001-pg_dump-Add-example-and-link-for-encoding-option.patch
Description: Binary data
Re: what can go in root.crt ?
On 06/04/20 02:07, Laurenz Albe wrote: > I feel bad about bending the basic idea of certificates and trust to suit > some misbegotten bureaucratic constraints on good security. Can you elaborate on what, in the email message you replied to here, represented a bending of the basic idea of certificates and trust? I didn't notice any. Regards, -Chap
Re: Read access for pg_monitor to pg_replication_origin_status view
Hi Kyotaro-san, > Sorry for not mentioning it at that time, but about the following diff: > > +GRANT SELECT ON pg_replication_origin_status TO pg_read_all_stats; > > system_views.sql already has a REVOKE command on the view. We should > put the above just below the REVOKE command. > > I'm not sure where to put the GRANT on > pg_show_replication_origin_status(), but maybe it also should be at > the same place. Yes, I agree that it makes the revoking/granting easier to read if it's grouped by objects, or groups of objects. Done. > In the previous comment, one point I meant is that the "to the > superuser" should be "to superusers", because a PostgreSQL server > (cluster) can define multiple superusers. Another is that "permitted > to other users by using the GRANT command." might be obscure for > users. In this regard I found a more specific description in the same > file: OK, now I understand what you were saying. :-) > Computes the total disk space used by the database with the specified > name or OID. To use this function, you must > have CONNECT privilege on the specified database > (which is granted by default) or be a member of > the pg_read_all_stats role. > > So, as the result it would be like the following: (Note that, as you > know, I'm not good at this kind of task..) > > Use of functions for replication origin is restricted to superusers. > Use for these functions may be permitted to other users by granting > EXECUTE privilege on the functions. > > And in regard to the view, granting privileges on both the view and > function to individual user is not practical so we should mention only > granting pg_read_all_stats to users, like the attached patch. I did some re-writing of the doc, which is pretty close to what you proposed above. > By the way, the attachements of your mail are out-of-order. I'm not > sure that that does something bad, though. That's likely Gmail giving them random order when you attach multiple files all at once. New patches attached. 
Regards,

-- 
Martín Marqués            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From 87c00782c691f2c6c13768cd6d96b19de69cab16 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mart=C3=ADn=20Marqu=C3=A9s?=
Date: Tue, 2 Jun 2020 10:44:42 -0300
Subject: [PATCH v4 1/4] Access to pg_replication_origin_status view has
 restricted access only for superusers due to it using
 pg_show_replication_origin_status(), and all replication origin functions
 requiring superuser rights.

This patch removes the check for superuser priviledges when executing
replication origin functions, which is hardcoded, and instead rely on
ACL checks.

This patch will also revoke execution of such functions from PUBLIC
---
 src/backend/catalog/system_views.sql     | 16 ++++++++++++++++
 src/backend/replication/logical/origin.c |  5 -----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 56420bbc9d6..6658f0c2eb2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1469,6 +1469,22 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+--
+-- Permision to execute Replication Origin functions should be revoked from public
+--
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_advance(text, pg_lsn) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_create(text) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_drop(text) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_oid(text) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_progress(text, boolean) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_is_setup() FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_progress(boolean) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_reset() FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_setup(text) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_xact_reset() FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_xact_setup(pg_lsn, timestamp with time zone) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_show_replication_origin_status() FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 981d60f135d..1f223daf21f 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -182,11 +182,6 @@ static ReplicationState *session_replication_state = NULL;
 static void
 replorigin_check_prerequisites(bool check_slots, bool recoveryOK)
 {
-	if (!superuser())
-		ereport(ERROR,
-				(errcode(ERRCODE_INSUFFICIENT_PRIVILE
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, 29 May 2020 at 15:52, Amit Kapila wrote:
>
> On Wed, May 27, 2020 at 5:19 PM Mahendra Singh Thalor wrote:
>>
>> On Tue, 26 May 2020 at 16:46, Amit Kapila wrote:
>>
>> Hi all,
>> On top of the v16 patch set [1], I did some testing of DDLs and DMLs to measure WAL size and performance. Below is the testing summary.
>>
>> Test parameters:
>>   wal_level = 'logical'
>>   max_connections = '150'
>>   wal_receiver_timeout = '600s'
>>   max_wal_size = '2GB'
>>   min_wal_size = '2GB'
>>   autovacuum = 'off'
>>   checkpoint_timeout = '1d'
>>
>> Test results (LSN diff in bytes, time in seconds, % = LSN change):
>>
>>                                   CREATE INDEX                     Add col int(date)                Add col text
>> SN. operation                     LSN diff  time       % LSN       LSN diff  time      % LSN        LSN diff  time      % LSN
>> 1   1 DDL           without       17728     0.89116    1.624548    976       0.764393  11.475409    33904     0.80044   2.80792
>>                     with patch    18016     0.804868               1088      0.763602               34856     0.787108
>> 2   2 DDL           without       19872     0.860348   2.73752     1632      0.763199  13.7254902   34560     0.806086  3.078703
>>                     with patch    20416     0.839065               1856      0.733147               35624     0.829281
>> 3   3 DDL           without       22016     0.894891   3.63372093  2288      0.776871  14.685314    35216     0.803493  3.339391186
>>                     with patch    22816     0.828028               2624      0.737177               36392     0.800194
>> 4   4 DDL           without       24160     0.901686   4.4701986   2944      0.768445  15.217391    35872     0.77489   3.590544
>>                     with patch    25240     0.887143               3392      0.768382               37160     0.82777
>> 5   5 DDL           without       26328     0.901686   4.9832877   3600      0.751879  15.55        36528     0.817928  3.832676
>>                     with patch    27640     0.914078               4160      0.74709                37928     0.820621
>> 6   6 DDL           without       28472     0.936385   5.5071649   4256      0.745179  15.78947368  37184     0.797043  4.066265
>>                     with patch    30040     0.958226               4928      0.725321               38696     0.814535
>> 7   8 DDL           without       32760     1.0022203  6.422466    5568      0.757468  16.091954    38496     0.83207   4.509559
>>                     with patch    34864     0.966777               6464      0.769072               40232     0.903604
>> 8   11 DDL          without       50296     1.0022203  5.662478    7536      0.748332  16.66        40464     0.822266  5.179913
>>                     with patch    53144     0.966777               8792      0.750553               42560     0.797133
>> 9   15 DDL          without       58896     1.267253   5.662478    10184     0.776875  16.496465    43112     0.821916  5.84524
>>                     with patch    62768     1.27234                11864     0.746844               45632     0.812567
>> 10  1 DDL & 3 DML   without       18240     0.812551   1.6228      1192      0.771993  10.067114    34120     0.849467  2.8113599
>>                     with patch    18536     0.819089               1312      0.785117               35080     0.855456
>> 11  3 DDL & 5 DML   without       23656     0.926616   3.4832606   2656      0.758029  13.55421687  35584     0.829377  3.372302
>>                     with patch    24480     0.915517               3016      0.797206               36784     0.839176
>> 12  10 DDL & 5 DML  without       52760     1.101005   4.958301744 7288      0.763065  16.02634468  40216     0.837843  4.993037
>>                     with patch    55376     1.105241               8456      0.779257               42224     0.835206
>> 13  10 DML          without       1008      0.791091   6.349206    1008      0.81105   6.349206     1008      0.78817   6.349206
>>                     with patch    1072      0.807875               1072      0.771113               1072      0.759789
>>
>> To see all operations, please see [2] test_results
>>

> Why are you seeing any additional WAL in case-13 (10 DML) where there is no DDL? I think it is because you have used savepoints in that case, which will add some additional WAL. You seem to have 9 savepoints in that test, which should ideally generate 36 bytes of additional WAL (4 bytes per transaction id for each subtransaction). Also, in the other cases where you took data for DDL and DML, you have also used savepoints in those tests. I suggest for savepoints let's do separate tests as you have done in case-13, but with 3, 5, 7, 10 savepoints, and probably each transaction can update a row of 200 bytes or so.

Thanks, Amit, for reviewing the results. Yes, you are correct: I used savepoints in the DML tests, so they were showing additional WAL. As suggested above, I did testing for DMLs, DDLs and savepoints. Below are the test results:

*Test results:* (LSN diff in bytes, time in seconds, % = LSN change)

                                  CREATE INDEX                     Add col int(date)                Add col text
SN. operation                     LSN diff  time       % LSN       LSN diff  time      % LSN        LSN diff  time      % LSN
1   1 DDL           without       17728     0.89116    1.624548    976       0.764393  11.475409    33904     0.80044   2.80792
                    with patch    18016     0.804868               1088      0.763602               34856     0.787108
2   2 DDL           without       19872     0.860348   2.73752     1632      0.763199  13.7254902   34560     0.806086  3.078703
                    with patch    20416     0.839065               1856      0.733147               35624     0.829281
3   3 DDL           without       22016     0.894891   3.63372093  2288      0.776871  14.685314    35216     0.803493  3.339391186
                    with patch    22816     0.828028               2624      0.737177               36392     0.800194
4   4 DDL           without       24160     0.901686   4.4701986   2944      0.768445  15.217391    35872     0.77489   3.590544
                    with patch    25240     0.887143               3392      0.768382               37160     0.82777
5   5 DDL           without       26328     0.901686   4.9832877   3600      0.751879  15.55        36528     0.817928  3.832676
                    with patch    27640     0.914078               4160      0.74709                37928     0.820621
6   6 DDL           without       28472     0.936385   5.5071649   4
Re: proposal - function string_to_table
+{ oid => '2228', descr => 'split delimited text',
+  proname => 'string_to_table', prorows => '1000', proretset => 't',
+  prorettype => 'text', proargtypes => 'text text',
+  prosrc => 'text_to_table' },
+{ oid => '2282', descr => 'split delimited text with null string',
+  proname => 'string_to_table', prorows => '1000', proretset => 't',
+  prorettype => 'text', proargtypes => 'text text text',
+  prosrc => 'text_to_table_null' },

I went through the patch, and everything looks good to me. But I do not know why it needs a separate 'text_to_table_null()'; I think it would be fine to use 'text_to_table' there as well.

Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
Regarding TZ conversion
Hi,

What is the right approach for using the AT TIME ZONE function?

Option 1: AT TIME ZONE 'IST'
Option 2: AT TIME ZONE 'Asia/Kolkata'

In the first option, I get a +02:00:00 offset (when *timezone_abbreviations = 'Default'*) and for option 2, a +05:30 offset. I can see multiple entries for IST in pg_timezone_names with different utc_offset values, but in pg_timezone_abbrevs there is one entry. I guess the AT TIME ZONE function is using the offset shown in pg_timezone_abbrevs.

ovdb=> select * from pg_timezone_names where abbrev = 'IST';
        name         | abbrev | utc_offset | is_dst
---------------------+--------+------------+--------
 Asia/Calcutta       | IST    | 05:30:00   | f
 Asia/Kolkata        | IST    | 05:30:00   | f
 Europe/Dublin       | IST    | 01:00:00   | t
 posix/Asia/Calcutta | IST    | 05:30:00   | f
 posix/Asia/Kolkata  | IST    | 05:30:00   | f
 posix/Europe/Dublin | IST    | 01:00:00   | t
 posix/Eire          | IST    | 01:00:00   | t
 Eire                | IST    | 01:00:00   | t

ovdb=> select * from pg_timezone_abbrevs where abbrev = 'IST';
 abbrev | utc_offset | is_dst
--------+------------+--------
 IST    | 02:00:00   | f

In my system, we receive TZ in abbreviated format (3 characters, like EST, PST, ...). I tried changing timezone_abbreviations = 'India', and then it worked fine (IST gives a +05:30 offset).

So, what is recommended: use names instead of abbreviations in the TZ conversion function, or change timezone_abbreviations to 'India'?

*Regards,*
*Rajin*
Re: libpq copy error handling busted
On Thu, Jun 4, 2020 at 6:22 PM Oleksandr Shulgin wrote: > On Thu, Jun 4, 2020 at 5:37 AM Thomas Munro wrote: >> Here's what I tested. First, I put this into pgdata/postgresql.conf: > Would it be feasible to capture this in a sort of a regression (TAP?) test? If I'm remembering correctly, it wouldn't work on Windows. After you've had an error sending to a socket, you can't receive (even if there was something sent to you earlier). At least that's how it seemed from the experiments on that other thread. The other problem is that it requires inserting a sleep into the code...
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Jun 3, 2020 at 2:43 PM Amit Kapila wrote:
> On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar wrote:
> > On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila wrote:
> > > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar wrote:
> > > >
> > > > I think for our use case BufFileCreateShared is more suitable. I think
> > > > we need to make some modifications so that we can use these APIs without
> > > > a SharedFileSet. Otherwise, we would unnecessarily need to create a
> > > > SharedFileSet for each transaction and also maintain it in an xid
> > > > array or xid hash until transaction commit/abort. So I suggest the
> > > > following modifications to the shared file set so that we can
> > > > conveniently use it:
> > > > 1. ChooseTablespace(const SharedFileSet *fileset, const char *name)
> > > >    If fileset is NULL then select DEFAULTTABLESPACE_OID.
> > > > 2. SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace)
> > > >    If fileset is NULL then in the directory path we can use MyProcPid or
> > > >    something instead of fileset->creator_pid.
> > >
> > > Hmm, I find these modifications a bit ad-hoc. So, not sure if it is
> > > better than having the patch maintain the sharedfileset information.
> >
> > I think we might do something better here, maybe by supplying a function
> > pointer or so, but maintaining a sharedfileset which contains a
> > tablespace/mutex that we don't need at all for our purpose also
> > doesn't sound very appealing.
>
> I think we can say something similar for the Relation (rel cache entry as
> well) maintained in LogicalRepRelMapEntry. I think we only need a
> pointer to that information.

Yeah, I see.

> > Let me see if I can come up with some clean way of avoiding the need
> > for a shared fileset; otherwise maybe we can go with the shared fileset idea.
>
> Fair enough.
While evaluating it further, I feel there are a few more problems to solve if we are using BufFile. The first is that in the subxact file we maintain the information of each xid and its offset in the changes file. So now we will also have to store the 'fileno', but we can find that using BufFileTell. Yet another problem is that currently we don't have a truncate operation for BufFile, but we need one if a sub-transaction gets aborted. I think we can implement an extra interface for BufFile, and it should not be very hard as we already know the fileno and the offset. I will evaluate this part further and let you know about the same.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Is it useful to record whether plans are generic or custom?
On Mon, May 25, 2020 at 10:54 AM Atsushi Torikoshi wrote:
> On Thu, May 21, 2020 at 5:10 PM Kyotaro Horiguchi wrote:
>
>> Cost numbers would look better if it is cooked a bit. Is it worth
>> being in core?
>
> I didn't come up with ideas about how to use them.
> IMHO they might not be so necessary..
>
>> Perhaps plan_generation is not needed there.
>
> +1.
> Now 'calls' is sufficient and 'plan_generation' may confuse users.
>
> BTW, considering that 'calls' in pg_stat_statements is the number of times
> statements were EXECUTED and 'plans' is the number of times
> statements were PLANNED, how about substituting 'calls' with 'plans'?

I've modified the above points and also exposed the counts of generic plans and custom plans. I'm now a little bit worried about the change below, which removed the overflow check for num_custom_plans introduced in 0001-Expose-counters-of-plancache-to-pg_prepared_statement.patch, but I've left the check out because the maximum of int64 seems large enough for counters. Other counters, such as those in pg_stat_user_tables, don't seem to care about overflow either.

```
-       /* Accumulate total costs of custom plans, but 'ware overflow */
-       if (plansource->num_custom_plans < INT_MAX)
-       {
-               plansource->total_custom_cost += cached_plan_cost(plan, true);
-               plansource->num_custom_plans++;
-       }
```

Regards,
Atsushi Torikoshi

> 0003-Expose-counters-of-plancache-to-pg_prepared_statement.patch
Description: Binary data
Possible bug on Postgres 12 (CASE THEN evaluated prematurely) - Change of behaviour compared to 11, 10, 9
Greetings,

Our system uses an EAV-like database and generates queries like the example below. As you can see, the query includes casts. Testing with Postgres 12, we noticed that the casts in the CASE THEN branches (commented out below) were failing in some cases; of course, if the INNER JOIN and CASE WHEN are applied first, our expectation is that the value can be cast. Changing INNER JOIN to LEFT JOIN solved the issue in Postgres 12; with earlier versions of Postgres, INNER JOIN worked perfectly. Has somebody already reported anything like this? Maybe an issue with some optimisation?

Best,
Juan

Example query:

SELECT DISTINCT t0.id FROM samples t0
INNER JOIN sample_properties t1 ON t0.id = t1.samp_id
INNER JOIN sample_type_property_types t2 ON t1.stpt_id = t2.id
INNER JOIN property_types t3 ON t2.prty_id = t3.id
INNER JOIN data_types t4 ON t3.daty_id = t4.id
LEFT JOIN controlled_vocabulary_terms t5 ON t1.cvte_id = t5.id
LEFT JOIN materials t6 ON t1.mate_prop_id = t6.id
INNER JOIN sample_properties t7 ON t0.id = t7.samp_id
INNER JOIN sample_type_property_types t8 ON t7.stpt_id = t8.id
INNER JOIN property_types t9 ON t8.prty_id = t9.id
INNER JOIN data_types t10 ON t9.daty_id = t10.id
LEFT JOIN controlled_vocabulary_terms t11 ON t7.cvte_id = t11.id
LEFT JOIN materials t12 ON t7.mate_prop_id = t12.id
INNER JOIN sample_properties t13 ON t0.id = t13.samp_id
INNER JOIN sample_type_property_types t14 ON t13.stpt_id = t14.id
INNER JOIN property_types t15 ON t14.prty_id = t15.id
INNER JOIN data_types t16 ON t15.daty_id = t16.id
LEFT JOIN controlled_vocabulary_terms t17 ON t13.cvte_id = t17.id
LEFT JOIN materials t18 ON t13.mate_prop_id = t18.id
INNER JOIN sample_properties t19 ON t0.id = t19.samp_id
INNER JOIN sample_type_property_types t20 ON t19.stpt_id = t20.id
INNER JOIN property_types t21 ON t20.prty_id = t21.id
INNER JOIN data_types t22 ON t21.daty_id = t22.id
LEFT JOIN controlled_vocabulary_terms t23 ON t19.cvte_id = t23.id
LEFT JOIN materials t24 ON
t19.mate_prop_id = t24.id INNER JOIN sample_properties t25 ON t0.id = t25.samp_id INNER JOIN sample_type_property_types t26 ON t25.stpt_id = t26.id INNER JOIN property_types t27 ON t26.prty_id = t27.id INNER JOIN data_types t28 ON t27.daty_id = t28.id LEFT JOIN controlled_vocabulary_terms t29 ON t25.cvte_id = t29.id LEFT JOIN materials t30 ON t25.mate_prop_id = t30.id WHERE t0.saty_id IN (SELECT unnest(ARRAY[5])) AND t3.is_internal_namespace = true AND t3.code = 'STORAGE_POSITION.STORAGE_CODE' AND (lower(t1.value) = 'default_storage' OR lower(t5.code) = 'default_storage' OR lower(t6.code) = 'default_storage') AND t7.stpt_id = (SELECT id FROM sample_type_property_types WHERE saty_id = 5 AND prty_id = (SELECT id FROM property_types WHERE is_internal_namespace = true AND code = 'STORAGE_POSITION.STORAGE_RACK_ROW')) AND t7.value::numeric = 1 -- AND --CASE WHEN t9.is_internal_namespace = true -- AND t9.code = 'STORAGE_POSITION.STORAGE_RACK_ROW' -- AND (t10.code = 'INTEGER' OR t10.code = 'REAL') --THEN t7.value::numeric = 1 --ELSE false --END AND t13.stpt_id = (SELECT id FROM sample_type_property_types WHERE saty_id = 5 AND prty_id = (SELECT id FROM property_types WHERE is_internal_namespace = true AND code = 'STORAGE_POSITION.STORAGE_RACK_COLUMN')) AND t13.value::numeric = 2 -- AND --CASE WHEN t15.is_internal_namespace = true -- AND t15.code = 'STORAGE_POSITION.STORAGE_RACK_COLUMN' -- AND (t16.code = 'INTEGER' OR t16.code = 'REAL') --THEN t13.value::numeric = 2 --ELSE false --END AND t21.is_internal_namespace = true AND t21.code = 'STORAGE_POSITION.STORAGE_BOX_NAME' AND (lower(t19.value) = 'box2' OR lower(t23.code) = 'box2' OR lower(t24.code) = 'box2') AND t27.is_internal_namespace = true AND t27.code = 'STORAGE_POSITION.STORAGE_BOX_POSITION' AND (t25.value ILIKE '%a3%' OR t29.code ILIKE '%a3%' OR t30.code ILIKE '%a3%');
Re: Read access for pg_monitor to pg_replication_origin_status view
Hi, Martin.

At Wed, 3 Jun 2020 13:32:28 -0300, Martín Marqués wrote in
> Hi Kyotaro-san,
>
> Thank you for taking the time to review my patches. Would you like to
> set yourself as a reviewer in the commitfest entry here?
> https://commitfest.postgresql.org/28/2577/

Done.

> > 0002:
> >
> > It forgets to grant pg_read_all_stats the right to execute
> > pg_show_replication_origin_status. As a result, pg_read_all_stats
> > gets an error on executing the function, not on doing a select on the view.
>
> Seems I was testing on a cluster where I had already been digging, so
> pg_read_all_stats had execute privileges on
> pg_show_replication_origin_status (I had manually granted that) and I
> didn't notice because I forgot to drop the cluster and initialize it
> again.
>
> Thanks for the pointer here!

Sorry for not mentioning it at that time, but about the following diff:

+GRANT SELECT ON pg_replication_origin_status TO pg_read_all_stats;

system_views.sql already has a REVOKE command on the view. We should put the above just below that REVOKE command. I'm not sure where to put the GRANT on pg_show_replication_origin_status(), but maybe it also should be at the same place.

> > 0003:
> >
> > It seems to be modeled on adminpack's documentation, but a
> > superuser is not the only privileged role in PostgreSQL. Something like the
> > description in 27.2.2 Viewing Statistics looks more suitable:
> >
> > > Superusers and members of the built-in role pg_read_all_stats (see
> > > also Section 21.5) can see all the information about all sessions.
> >
> > Section 21.5 already says the following:
> >
> > > pg_monitor
> > > Read/execute various monitoring views and functions. This role is a
> > > member of pg_read_all_settings, pg_read_all_stats and
> > > pg_stat_scan_tables.
>
> I'm not sure if I got this right, but I added some more text to point
> out that the pg_read_all_stats role can also access one specific
> function.
I personally think it's a bit too detailed, and if we wanted
> to add details it should be formatted differently, which would require
> a more invasive patch (I would prefer leaving that out, as it might even
> mean moving parts which are not part of this patch).
>
> In any case, I hope the change fits what you've kindly pointed out.

I forgot to mention it at that time, but pg_show_replication_origin_status is a function that backs a system view, like pg_stat_get_activity(), pg_show_all_file_settings() and so on. Such functions are not documented since users don't need to call them directly; pg_show_replication_origin_status is not documented for the same reason. Thus we don't need to mention the function.

In the previous comment, one point I meant is that "to the superuser" should be "to superusers", because a PostgreSQL server (cluster) can define multiple superusers. Another is that "permitted to other users by using the GRANT command." might be obscure for users. In this regard I found a more specific description in the same file:

    Computes the total disk space used by the database with the specified
    name or OID. To use this function, you must have CONNECT privilege on
    the specified database (which is granted by default) or be a member of
    the pg_read_all_stats role.

So, as a result, it would be like the following (note that, as you know, I'm not good at this kind of task..):

    Use of functions for replication origins is restricted to superusers.
    Use of these functions may be permitted to other users by granting
    EXECUTE privilege on the functions.

And in regard to the view, granting privileges on both the view and the function to individual users is not practical, so we should mention only granting pg_read_all_stats to users, as in the attached patch.

> > 0004:
> >
> > Looks fine by me.
>
> Here goes v3 of the patch

By the way, the attachments of your mail are out of order. I'm not sure whether that causes anything bad, though.
--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 700271fd40..66679e8045 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -11009,8 +11009,10 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
    The pg_replication_origin_status view
    contains information about how far replay for a certain origin has
-   progressed.  For more on replication origins
-   see .
+   progressed.  Use of this view is restricted to superusers and
+   pg_read_all_stats role. To use this view, you must be a member of
+   the pg_read_all_stats role.  For more on replication
+   origins see .
Re: Atomic operations within spinlocks
On Thu, Jun 04, 2020 at 09:40:31AM +1200, Thomas Munro wrote:
> Yeah. It'd be fine to move that after the spinlock release. Although
> it's really just for informational purposes only, not for any data
> integrity purpose, reading it before the spinlock acquisition would
> theoretically allow it to appear to be (reportedly) behind
> flushedUpto, which would be silly.

Indeed. This could just be done after the spinlock section. Sorry about that.
--
Michael