Re: A wrong index choose issue because of inaccurate statistics
On Fri, 5 Jun 2020 at 8:19, David Rowley wrote:
> I'd more recently wondered if we shouldn't have some sort of "risk"
> factor to the cost model. I really don't have ideas on how exactly we
> would calculate the risk factor in all cases, but in this case, say
> the risk factor always starts out as 1. We could multiply that risk
> factor by some >1 const each time we find another index filter qual.
> add_path() can prefer lower risk paths when the costs are similar.
> Unsure what the exact add_path logic would be. Perhaps a GUC would
> need to assist with the decision there. Likewise, with
> NestedLoopPaths which have a large number of join quals, the risk
> factor could go up a bit with those so that we take a stronger
> consideration for hash or merge joins instead.

I thought about these ideas too. And I am not alone.

https://hal.archives-ouvertes.fr/hal-01316823/document

Regards

Pavel

> Anyway, it's pretty much a large research subject which would take a
> lot of work to iron out even just the design. It's likely not a
> perfect idea, but I think it has a bit more merit than trying to
> introduce lies to the cost model to account for a single case where
> there is a problem.
>
> David
Re: Removal of currtid()/currtid2() and some table AM cleanup
On Wed, Jun 03, 2020 at 10:10:21PM +0900, Inoue, Hiroshi wrote:
> On 2020/06/03 11:14, Michael Paquier wrote:
>> I have been looking at the ODBC driver and the need for currtid() as
>> well as currtid2(), and as mentioned already in [1], matching with my
>> lookup of things, these are actually not needed by the driver as long
>> as we connect to a server newer than 8.2 able to support RETURNING.
>
> Though currtid2() is necessary even for servers which support RETURNING,
> I don't object to remove it.

In which cases is it getting used then? From what I can see there is zero coverage for that part in the tests. And based on a rough read of the code, this would get called with LATEST_TUPLE_LOAD being set, where there is some kind of bulk deletion involved. Couldn't that be a problem?
--
Michael
Re: A wrong index choose issue because of inaccurate statistics
On Mon, 1 Jun 2020 at 01:24, Andy Fan wrote:
> The one-line fix describes the exact idea in my mind:
>
> +++ b/src/backend/optimizer/path/costsize.c
> @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
>
> 	cpu_run_cost += cpu_per_tuple * tuples_fetched;
>
> +	/*
> +	 * To make the planner more robust to handle some inaccurate statistics
> +	 * issue, we will add an extra cost to qpquals so that the fewer qpquals
> +	 * the lower cost it has.
> +	 */
> +	cpu_run_cost += 0.01 * list_length(qpquals);
> +
>
> This change does fix the issue above, but will it make some other cases
> worse? My answer is no based on my current knowledge, and this is the most
> important place I want to get advised on. The main impact I can figure out
> is: it not only changes the cost difference between (a, b) and (a, c), it
> also changes the cost difference between an index scan on (a, c) and a seq
> scan. However, the cost difference between index scan and seq scan is huge
> in practice, so I don't think this impact is harmful.

Didn't that idea already get shot down in the final paragraph on [1]?

I understand that you wish to increase the cost by some seemingly innocent constant to fix your problem case. Here are my thoughts about that: Telling lies is not really that easy to pull off. Bad liars think it's easy and good ones know it's hard. The problem is that the lies can start small, but then at some point the future you must fashion some more lies to account for your initial lie. Rinse and repeat that a few times and before you know it, your course is set well away from the point of truth. I feel the part about "rinse and repeat" applies reasonably well to how join costing works. The lie is likely to be amplified as the join level gets deeper.

I think you need to think of a more generic solution and propose that instead. There are plenty of other quirks in the planner that can cause suffering due to inaccurate or non-existing statistics.
For example, due to how we multiply individual selectivity estimates, having a few correlated columns in a join condition can cause the number of join rows to be underestimated. Subsequent joins can then end up making bad choices on which join operator to use based on those inaccurate row estimates. There's also a problem with WHERE ORDER BY col LIMIT n; sometimes choosing an index that provides pre-sorted input to the ORDER BY but cannot be used as an indexable condition. We don't record any stats to make better choices there; maybe we should, but either way, we're taking a big risk there, as all the rows matching might be right at the end of the index and we might need to scan the entire thing before hitting the LIMIT. For now, we just assume a completely even distribution of rows, i.e. if there are 50 rows estimated in the path and the limit is for 5 rows, then we'll assume we need to read 10% of those before finding all the ones we need. In reality, everything matching might be 95% through the index and we could end up reading 100% of rows. That particular problem is not just caused by the uneven distribution of rows in the index, but also by selectivity underestimation.

I'd more recently wondered if we shouldn't have some sort of "risk" factor to the cost model. I really don't have ideas on how exactly we would calculate the risk factor in all cases, but in this case, say the risk factor always starts out as 1. We could multiply that risk factor by some >1 const each time we find another index filter qual. add_path() can prefer lower risk paths when the costs are similar. Unsure what the exact add_path logic would be. Perhaps a GUC would need to assist with the decision there. Likewise, with NestedLoopPaths which have a large number of join quals, the risk factor could go up a bit with those so that we take a stronger consideration for hash or merge joins instead.
Anyway, it's pretty much a large research subject which would take a lot of work to iron out even just the design. It's likely not a perfect idea, but I think it has a bit more merit than trying to introduce lies to the cost model to account for a single case where there is a problem.

David

[1] https://www.postgresql.org/message-id/20200529001602.eu7vuiouuuiclpgb%40development
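David's risk-factor sketch can be put into toy code. This is a standalone illustration only, not planner code: the struct, RISK_PER_FILTER_QUAL, and COST_FUZZ are invented names, and the real add_path() comparison logic is far more involved than this tie-break.

```c
#include <assert.h>
#include <stdbool.h>

/* hypothetical per-qual risk multiplier and cost fuzz factor */
#define RISK_PER_FILTER_QUAL 1.25
#define COST_FUZZ 1.01

typedef struct SketchPath
{
	double		total_cost;
	double		risk;			/* starts at 1.0, grows with risky choices */
} SketchPath;

/* bump the risk factor once per index filter qual, as in the proposal */
static void
charge_filter_quals(SketchPath *p, int n_filter_quals)
{
	for (int i = 0; i < n_filter_quals; i++)
		p->risk *= RISK_PER_FILTER_QUAL;
}

/*
 * add_path()-style tie-break: when costs are within the fuzz factor,
 * prefer the lower-risk path; otherwise the plain cost comparison wins.
 */
static bool
path_is_better(const SketchPath *a, const SketchPath *b)
{
	if (a->total_cost < b->total_cost * COST_FUZZ &&
		b->total_cost < a->total_cost * COST_FUZZ)
		return a->risk < b->risk;
	return a->total_cost < b->total_cost;
}
```

With this shape, two near-identical index paths that differ only in how many quals became filters are separated by risk rather than by a lie baked into the cost itself.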
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, May 29, 2020 at 8:31 PM Dilip Kumar wrote:
> Apart from this one more fix in 0005, basically, CheckLiveXid was
> never reset, so I have fixed that as well.

I have made a number of modifications in the 0001 patch and attached is the result. I have changed/added comments, done some cosmetic cleanup, and ran pgindent. The most notable change is to remove the below code change:

DecodeXactOp()
{
..
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
	 * when the snapshot is being built: it is possible to get later records
	 * that require subxids to be properly assigned.
	 */
	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(XLogRecGetTopXid(r)))
..
}

I have not only removed the change done by the patch but the check related to XLOG_XACT_ASSIGNMENT as well. That check had been added by commit bac2fae05c to ensure that we process XLOG_XACT_ASSIGNMENT even if the snapshot state is not SNAPBUILD_FULL_SNAPSHOT. Now, with this patch, that is not required because we are making the subtransaction and top-level transaction association much earlier than this. I have verified that it doesn't reopen the bug by running the test provided in the original report [1].

Let me know what you think of the changes. If you find them okay, then feel free to include them in the next patch-set.

[1] - https://www.postgresql.org/message-id/CAONYFtOv%2BEr1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

v27-0001-Immediately-WAL-log-subtransaction-and-top-level.patch
Re: Removal of currtid()/currtid2() and some table AM cleanup
On Thu, Jun 04, 2020 at 10:59:05AM -0700, Andres Freund wrote:
> What's the point of that change? I think the differentiation between
> heapam_handler.c and heapam.c could be clearer, but if anything, I'd
> argue that heap_get_latest_tid is sufficiently low-level that it'd
> belong in heapam.c.

Well, heap_get_latest_tid() is only called in heapam_handler.c if anything, as it is not used elsewhere, and we do not publish it. And IMO we should try to encourage using table_get_latest_tid() instead if some plugins need that. Anyway, if you are opposed to this change, I won't push hard for it either.
--
Michael
Re: BufFileRead() error signalling
On Thu, May 28, 2020 at 7:10 PM Michael Paquier wrote:
> On Wed, May 27, 2020 at 11:59:59AM -0400, Alvaro Herrera wrote:
> > In the discussion that led to 811b6e36a9e2 I did suggest to use "read
> > only M of N" instead, but there wasn't enough discussion on that fine
> > point so we settled on what you now call prevalent without a lot of
> > support specifically on that. I guess it was enough of an improvement
> > over what was there. But like Robert, I too prefer the wording that
> > includes "only" and "bytes" over the wording that doesn't.
> >
> > I'll let it be known that from a translator's point of view, it's a
> > ten-seconds job to update a fuzzy string from not including "only" and
> > "bytes" to one that does. So let's not make that an argument for not
> > changing.
>
> Using "only" would be fine by me, though I tend to prefer the existing
> one. Now I think that we should avoid "bytes" to not have to worry
> about pluralization of error messages. This has been a concern in the
> past (see e5d11b9 and the likes).

Sorry for the delay in producing a new patch. Here's one that tries to take into account the various feedback in this thread:

Ibrar said:
> You are checking file->dirty twice, first before calling the function
> and within the function too. Same for the Assert.

Fixed.

Robert, Melanie, Michael and Alvaro put forward various arguments about the form of the "short read" and block seek error messages. While removing bogus cases of "%m", I changed them all to say "read only %zu of %zu bytes" in the same places. I removed the bogus "%m" from BufFileSeek() and BufFileSeekBlock() callers (its call to BufFileFlush() already throws). I added block numbers to the error messages about failure to seek by block.
Following existing practice, I made write failure assume ENOSPC if errno is 0, so that %m would make sense (I am not sure what platform reports out-of-space that way, but there are certainly a lot of copies of that code in the tree so it seemed to be missing here). I didn't change BufFileWrite() to be void, to be friendly to existing callers outside the tree (if there are any), though I removed all the code that checks the return code. We can make it void later.

For the future: it feels a bit like we're missing a one-line way to say "read this many bytes and error out if you run out".

From ab6f90f5e5f3b6e70de28a4ee734e732c2a8c30b Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Sat, 30 Nov 2019 12:44:47 +1300
Subject: [PATCH] Redesign error handling in buffile.c.

Convert error handling in buffile.c functions to raise errors rather
than expecting callers to check the return value, and improve the
error messages.

This fixes various cases that forgot to check for errors, or couldn't
check for errors because EOF was indistinguishable from error, or
checked for errors unless they fell on certain boundaries which they
assumed to be EOF. This had been identified by code inspection, but
was later identified as the probable cause of a crash report involving
a corrupted hash join spill file when disk space ran out.

Back-patch to all releases. For now, BufFileWrite() still returns a
redundant value so as not to change the function prototype for the
benefit of any extensions that might be using it.
Reported-by: Amit Khandekar
Reported-by: Alvaro Herrera
Reviewed-by: Melanie Plageman
Reviewed-by: Alvaro Herrera
Reviewed-by: Robert Haas
Reviewed-by: Ibrar Ahmed
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CA%2BhUKGJE04G%3D8TLK0DLypT_27D9dR8F1RQgNp0jK6qR0tZGWOw%40mail.gmail.com
---
 src/backend/access/gist/gistbuildbuffers.c | 23 -
 src/backend/executor/nodeHashjoin.c        | 24 --
 src/backend/replication/backup_manifest.c  |  2 +-
 src/backend/storage/file/buffile.c         | 51 +++-
 src/backend/utils/sort/logtape.c           | 18 ---
 src/backend/utils/sort/sharedtuplestore.c  | 21
 src/backend/utils/sort/tuplestore.c        | 56 ++
 7 files changed, 91 insertions(+), 104 deletions(-)

diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 4b562d8d3f..7ff513a5fe 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -757,26 +757,19 @@ gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
 static void
 ReadTempFileBlock(BufFile *file, long blknum, void *ptr)
 {
+	size_t		nread;
+
 	if (BufFileSeekBlock(file, blknum) != 0)
-		elog(ERROR, "could not seek temporary file: %m");
-	if (BufFileRead(file, ptr, BLCKSZ) != BLCKSZ)
-		elog(ERROR, "could not read temporary file: %m");
+		elog(ERROR, "could not seek block %ld in temporary file", blknum);
+	if ((nread = BufFileRead(file, ptr, BLCKSZ)) != BLCKSZ)
+		elog(ERROR, "could not read temporary file: read only %zu of %zu bytes",
+			 nread, (size_t) BLCKSZ);
 }

 static void
 WriteTempFileBlock(BufFile *file, long blknum, voi
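The "read this many bytes and error out if you run out" helper Thomas wishes for could look roughly like this outside the tree. This is a hypothetical sketch over plain stdio, not buffile.c code; read_exact and write_errno_for_short_write are invented names that merely mirror the two patterns the patch applies (report how short the read was, and assume ENOSPC when a short write leaves errno at 0).

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper: read exactly "len" bytes or fail, reporting how
 * much was actually read. Returns 0 on success, -1 on a short read,
 * with *nread telling the caller how far it got for the error message.
 */
static int
read_exact(FILE *f, void *buf, size_t len, size_t *nread)
{
	*nread = fread(buf, 1, len, f);
	return (*nread == len) ? 0 : -1;
}

/*
 * Write-side convention from the patch: if a short write left errno at
 * 0, assume the disk filled up so that %m reports something sensible.
 */
static int
write_errno_for_short_write(void)
{
	return (errno == 0) ? ENOSPC : errno;
}
```

A caller would then do the check-and-ereport dance once, in one place, instead of repeating the "read only %zu of %zu bytes" comparison at every call site.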
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Fri, Jun 5, 2020 at 10:57 AM David Rowley wrote:
> In standard_qp_callback(), what we'll do with uniquekeys is pretty
> much what we already do with pathkeys there. Basically pathkeys are
> set there to have the planner attempt to produce a plan that satisfies
> those pathkeys. Notice at the end of standard_qp_callback() we set the
> pathkeys according to the first upper planner operation that'll need
> to make use of those pathkeys. e.g. if there's a GROUP BY and a
> DISTINCT in the query, then use the pathkeys for GROUP BY, since that
> must occur before DISTINCT.

Thanks for your explanation. It looks like I understand now based on your comments.
Take root->group_pathkeys for example: similar information is also available in root->parse->groupClause, but we use root->group_pathkeys with pathkeys_count_contained_in in many places, mainly because the content of the two differs sometimes, as in the pathkey_is_redundant case.

> Likely uniquekeys will want to follow the same rules there for the
> operations that can make use of paths with uniquekeys, which in this
> case, I believe, will be the same as the example I just mentioned for
> pathkeys, except we'll only be able to support GROUP BY without any
> aggregate functions.

All the places I want to use UniqueKey so far (like distinct, group by and others) have an input relation (RelOptInfo), and the UniqueKey information can be obtained there. At the same time, all the pathkeys in PlannerInfo are used for the upper planner, but UniqueKey may be used in the current planner sometimes, like reduce_semianti_joins / remove_useless_join, so I am not sure we must maintain uniquekeys in PlannerInfo.

--
Best Regards
Andy Fan
Re: A wrong index choose issue because of inaccurate statistics
> > > > Why will the (a, c) be chosen? If planner thinks a = x has only 1 row ..

I just did more research and found the above statement is not accurate; the root cause of this situation is that IndexSelectivity = 0. Even though I don't think we can fix anything here, since IndexSelectivity is calculated from statistics and we don't know whether it is 1% wrong or 10% wrong or more, just like what Tomas said. The way of fixing it is just to add a "small" extra cost for qpquals for index scans, which should be small enough to have no impact on other parts. I will discuss how small is small later with details; we can say it is just a GUC variable for now.

+++ b/src/backend/optimizer/path/costsize.c
@@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,

 	cpu_run_cost += cpu_per_tuple * tuples_fetched;

+	/*
+	 * To make the planner more robust to handle some inaccurate statistics
+	 * issue, we will add an extra cost to qpquals so that the fewer qpquals
+	 * the lower cost it has.
+	 */
+	cpu_run_cost += stat_stale_cost * list_length(qpquals);
+

If we want to reduce the impact of this change further, we can only add this if IndexSelectivity == 0.

How to set the value of stat_stale_cost? Since the minimum cost for a query should be one cpu_tuple_cost, which is 0.01 by default, adding 0.01 for each qpqual in an index scan should not make a big difference. However, sometimes we may set it to 0.13 if we consider that index->tree_height was estimated wrongly by 1 (cost is 50 * 0.0025 = 0.125). I don't know how that happens, but it does happen in production environments. At the same time, it is unlikely that index->tree_height is estimated wrongly by 2 or more. So basically we can set this value to 0 (totally disable this feature), 0.01 (should be OK for most cases), or 0.13 (a bit aggressive).

The wrong estimation of IndexSelectivity = 0 might be a common case, and if people have 2 related indexes like (A, B) and (A, C), we have a 50% chance of making a wrong decision, so I would say this case is worth the trouble. My current implementation doesn't look great, so any suggestion for further research is pretty welcome.

--
Best Regards
Andy Fan
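A toy calculation makes the intended effect of the proposed knob concrete: when IndexSelectivity is (wrongly) estimated as 0, both candidate indexes get the same base cost, and only the per-qpqual penalty separates them. The numbers and toy_index_cost are illustrative only, not costsize.c code.

```c
#include <assert.h>

/* proposed GUC-style knob from the mail: extra cost per index filter qual */
static double stat_stale_cost = 0.01;

/*
 * Toy version of the cost_index() change: base cost plus a small penalty
 * for each qual that had to become a filter (qpqual) rather than an
 * index condition.
 */
static double
toy_index_cost(double base_cost, int n_qpquals)
{
	return base_cost + stat_stale_cost * n_qpquals;
}
```

With both indexes costed identically at, say, 10.0, the index that turns one qual into a filter comes out 0.01 more expensive, which is enough to break the tie without being visible next to the index-vs-seqscan gap.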
Re: Atomic operations within spinlocks
Hi,

On 2020-06-04 15:13:29 -0400, Tom Lane wrote:
> Andres Freund writes:
> > On 2020-06-04 14:50:40 -0400, Tom Lane wrote:
> >> 2. The computed completePasses value would go backwards. I bet
> >> that wouldn't matter too much either, or at least we could teach
> >> BgBufferSync to cope. (I notice the comments therein suggest that
> >> it is already designed to cope with completePasses wrapping around,
> >> so maybe nothing needs to be done.)
>
> > If we're not concerned about that, then we can remove the
> > atomic-inside-spinlock, I think. The only reason for that right now is
> > to avoid assuming a wrong pass number.
>
> Hmm. That might be a less-invasive path to a solution. I can take
> a look, if you don't want to.

First, I think it would be problematic:

	/*
	 * Find out where the freelist clock sweep currently is, and how many
	 * buffer allocations have happened since our last call.
	 */
	strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
...
	/*
	 * Compute strategy_delta = how many buffers have been scanned by the
	 * clock sweep since last time. If first time through, assume none. Then
	 * see if we are still ahead of the clock sweep, and if so, how many
	 * buffers we could scan before we'd catch up with it and "lap" it. Note:
	 * weird-looking coding of xxx_passes comparisons are to avoid bogus
	 * behavior when the passes counts wrap around.
	 */
	if (saved_info_valid)
	{
		int32		passes_delta = strategy_passes - prev_strategy_passes;

		strategy_delta = strategy_buf_id - prev_strategy_buf_id;
		strategy_delta += (long) passes_delta * NBuffers;

		Assert(strategy_delta >= 0);

ISTM that if we can get an out-of-sync strategy_passes and strategy_buf_id, we'll end up with a pretty wrong strategy_delta. Which, I think, can cause bgwriter's position to be reset:

	else
	{
		/*
		 * We're behind, so skip forward to the strategy point and start
		 * cleaning from there.
		 */
#ifdef BGW_DEBUG
		elog(DEBUG2, "bgwriter behind: bgw %u-%u strategy %u-%u delta=%ld",
			 next_passes, next_to_clean,
			 strategy_passes, strategy_buf_id,
			 strategy_delta);
#endif
		next_to_clean = strategy_buf_id;
		next_passes = strategy_passes;
		bufs_to_lap = NBuffers;
	}

While I think that the whole logic in BgBufferSync doesn't make a whole lot of sense, it does seem to me this has a fair potential to make it worse. In a scenario with a decent cache hit ratio (leading to high usagecounts) and a not-that-large NBuffers, we can end up doing quite a few passes (as in many a second), so it might not be that hard to hit this.

I am not immediately coming up with a cheap solution that doesn't do the atomics-within-spinlock thing. The best I can come up with is using a 64bit atomic, with the upper 32 bits containing the number of passes, and the lower 32 bits containing the current buffer. The lower 32 bits / the buffer would be handled like it currently is, i.e. we "try" to keep it below NBuffers, so % is only used for the "cold" path. That'd just add a 64->32 bit cast in the hot path, which shouldn't be measurable. But it'd regress platforms without 64bit atomics substantially.

We could obviously also just rewrite the BgBufferSync() logic so it doesn't care about things like "lapping", but that's not an easy change.

So the best I can really suggest, unless we were to agree on atomics being OK inside spinlocks, is probably to just replace the spinlock with an lwlock. That'd perhaps cause a small slowdown for a few cases, but it'd make workloads that e.g. use the freelist a lot (e.g. when tables are dropped regularly) scale better.

Greetings,

Andres Freund
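The 64bit variant Andres sketches, with passes in the high 32 bits and the current buffer in the low 32, could be packed and decoded along these lines. This is a standalone sketch under that packing assumption; the function names and NBUFFERS value are invented, and a real version would sit behind pg_atomic_read_u64/compare-exchange.

```c
#include <assert.h>
#include <stdint.h>

#define NBUFFERS 16384			/* stand-in for NBuffers */

/* pack the clock-sweep position: pass count above, buffer id below */
static uint64_t
pack_sweep(uint32_t passes, uint32_t buf_id)
{
	return ((uint64_t) passes << 32) | buf_id;
}

/*
 * A single 64-bit read then yields a consistent (passes, buf_id) pair,
 * instead of reading completePasses and nextVictimBuffer separately.
 */
static void
unpack_sweep(uint64_t v, uint32_t *passes, uint32_t *buf_id)
{
	*passes = (uint32_t) (v >> 32);
	*buf_id = (uint32_t) v;		/* the hot path only needs this 64->32 cast */
}

/* advancing past NBUFFERS bumps the pass count on the "cold" path */
static uint64_t
advance_sweep(uint64_t v)
{
	uint32_t	passes,
				buf_id;

	unpack_sweep(v + 1, &passes, &buf_id);
	if (buf_id >= NBUFFERS)
		return pack_sweep(passes + 1, 0);
	return v + 1;
}
```

The point of the layout is exactly the one in the mail: the common case is a plain 64-bit increment, and the wrap handling (and any %) only runs once per NBUFFERS advances.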
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Fri, 5 Jun 2020 at 14:36, Andy Fan wrote:
> On Mon, May 25, 2020 at 2:34 AM David Rowley wrote:
>>
>> On Sun, 24 May 2020 at 04:14, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>> >
>> > > On Fri, May 22, 2020 at 08:40:17AM +1200, David Rowley wrote:
>> > > I imagine we'll set some required UniqueKeys during
>> > > standard_qp_callback()
>> >
>> > In standard_qp_callback, because pathkeys are computed at this point I
>> > guess?
>>
>> Yes. In particular, we set the pathkeys for DISTINCT clauses there.
>>
>
> Actually I have some issues to understand from here, then try to read index
> skip scan patch to fully understand what is the requirement, but that doesn't
> get it so far[1]. So what is the "UniqueKeys" in "UniqueKeys during
> standard_qp_callback()" and what is the "pathkeys" in "pathkeys are computed
> at this point" means? I tried to think it as root->distinct_pathkeys, however I
> didn't fully understand where root->distinct_pathkeys is used for as well.

In standard_qp_callback(), what we'll do with uniquekeys is pretty much what we already do with pathkeys there. Basically pathkeys are set there to have the planner attempt to produce a plan that satisfies those pathkeys. Notice at the end of standard_qp_callback() we set the pathkeys according to the first upper planner operation that'll need to make use of those pathkeys. e.g. if there's a GROUP BY and a DISTINCT in the query, then use the pathkeys for GROUP BY, since that must occur before DISTINCT. Likely uniquekeys will want to follow the same rules there for the operations that can make use of paths with uniquekeys, which in this case, I believe, will be the same as the example I just mentioned for pathkeys, except we'll only be able to support GROUP BY without any aggregate functions.

David

> [1] https://www.postgresql.org/message-id/CAKU4AWq%3DwWkAo-CDOQ5Ea6UwYvZCgb501w6iqU0rtnTT-zg6bQ%40mail.gmail.com
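The "first upper planner operation that needs sorted input wins" rule David describes can be sketched as a simple selection. This is purely illustrative; the real standard_qp_callback() works on pathkey lists derived from the parse tree, not strings.

```c
#include <assert.h>
#include <stddef.h>

typedef const char *Keys;		/* stand-in for a list of pathkeys */

/*
 * Pick the pathkeys for the first upper-planner step that needs sorted
 * input: GROUP BY runs before DISTINCT, which runs before ORDER BY,
 * so its sort order is the one worth asking scan/join paths to satisfy.
 */
static Keys
choose_query_pathkeys(Keys group_keys, Keys distinct_keys, Keys sort_keys)
{
	if (group_keys != NULL)
		return group_keys;
	if (distinct_keys != NULL)
		return distinct_keys;
	return sort_keys;
}
```

Per the mail, uniquekeys would presumably get an analogous "required uniquekeys" choice at the same spot, so that lower levels can aim to produce already-unique paths.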
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Mon, May 25, 2020 at 2:34 AM David Rowley wrote:
> On Sun, 24 May 2020 at 04:14, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> >
> > > On Fri, May 22, 2020 at 08:40:17AM +1200, David Rowley wrote:
> > > I imagine we'll set some required UniqueKeys during
> > > standard_qp_callback()
> >
> > In standard_qp_callback, because pathkeys are computed at this point I
> > guess?
>
> Yes. In particular, we set the pathkeys for DISTINCT clauses there.

Actually I have some trouble understanding things from here, so I tried to read the index skip scan patch to fully understand what the requirement is, but I haven't gotten it so far [1]. So what is the "UniqueKeys" in "UniqueKeys during standard_qp_callback()", and what does "pathkeys are computed at this point" mean? I tried to think of it as root->distinct_pathkeys; however, I didn't fully understand what root->distinct_pathkeys is used for either.

[1] https://www.postgresql.org/message-id/CAKU4AWq%3DwWkAo-CDOQ5Ea6UwYvZCgb501w6iqU0rtnTT-zg6bQ%40mail.gmail.com

--
Best Regards
Andy Fan
Re: Atomic operations within spinlocks
Hi,

On 2020-06-04 15:07:34 -0400, Tom Lane wrote:
> Andres Freund writes:
> > I'd still like to know which problem we're actually trying to solve
> > here. I don't understand the "error" issues you mentioned upthread.
>
> If you error out of getting the inner spinlock, the outer spinlock
> is stuck, permanently, because there is no mechanism for spinlock
> release during transaction abort. Admittedly it's not very likely
> for the inner acquisition to fail, but it's possible.

We PANIC on stuck spinlocks, so I don't think that's a problem.

> * We map all spinlocks onto a set of NUM_SPINLOCK_SEMAPHORES semaphores.
> * It's okay to map multiple spinlocks onto one semaphore because no process
> * should ever hold more than one at a time.
>
> You've falsified that argument ... and no, I don't want to upgrade
> the spinlock infrastructure enough to make this OK.

Well, theoretically we take care to avoid this problem. That's why we have a separate define for spinlocks and atomics:

/*
 * When we don't have native spinlocks, we use semaphores to simulate them.
 * Decreasing this value reduces consumption of OS resources; increasing it
 * may improve performance, but supplying a real spinlock implementation is
 * probably far better.
 */
#define NUM_SPINLOCK_SEMAPHORES		128

/*
 * When we have neither spinlocks nor atomic operations support we're
 * implementing atomic operations on top of spinlock on top of semaphores. To
 * be safe against atomic operations while holding a spinlock separate
 * semaphores have to be used.
 */
#define NUM_ATOMICS_SEMAPHORES		64

and

#ifndef HAVE_SPINLOCKS

	/*
	 * NB: If we're using semaphore based TAS emulation, be careful to use a
	 * separate set of semaphores. Otherwise we'd get in trouble if an atomic
	 * var would be manipulated while spinlock is held.
	 */
	s_init_lock_sema((slock_t *) &ptr->sema, true);
#else
	SpinLockInit((slock_t *) &ptr->sema);
#endif

But it looks like that code is currently buggy (and looks like it always has been), because we don't look at the nested argument when initializing the semaphore. So we currently allocate too many semaphores, without benefiting from them :(.

> We shouldn't ever be holding spinlocks long enough, or doing anything
> complicated enough inside them, for such an upgrade to have merit.

Well, I don't think atomic instructions are that complicated. And I think prohibiting atomics-within-spinlock adds a problematic restriction without much in the way of benefits: there are plenty of things where it's somewhat easy to make the fast path lock-free, but the slow path still needs a lock (e.g. around a freelist). And for those it's really useful to still be able to have a coherent update to an atomic variable, to synchronize with the fast path that doesn't take the spinlock.

Greetings,

Andres Freund
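The point of the nested flag can be illustrated with a toy model: the atomics emulation has to lock from a pool disjoint from the spinlock pool, or an atomic operation performed while holding a spinlock could block on the very semaphore that spinlock already occupies. This is a single-threaded sketch with int flags standing in for semaphores; all names and the address-hash mapping are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SPINLOCK_SEMAPHORES 128
#define NUM_ATOMICS_SEMAPHORES	64

/* toy stand-ins for the two semaphore pools (0 = free, 1 = held) */
static int	spin_sema[NUM_SPINLOCK_SEMAPHORES];
static int	atomic_sema[NUM_ATOMICS_SEMAPHORES];

/* map a lock's address into one of the pools; "nested" picks the pool */
static int *
sema_for(const void *lock, int nested)
{
	uintptr_t	h = (uintptr_t) lock >> 4;

	return nested ? &atomic_sema[h % NUM_ATOMICS_SEMAPHORES]
		: &spin_sema[h % NUM_SPINLOCK_SEMAPHORES];
}

/*
 * Emulated atomic fetch-add. Because it locks from the atomics pool, it
 * cannot self-deadlock against a spinlock (spin pool) already held by
 * the same backend, which is the property the "nested" flag buys.
 */
static int
emulated_fetch_add(int *var, int add)
{
	int		   *sema = sema_for(var, 1);
	int			old;

	assert(*sema == 0);			/* would deadlock here with a shared pool */
	*sema = 1;
	old = *var;
	*var += add;
	*sema = 0;
	return old;
}
```

The bug Andres points out is then easy to state in these terms: if initialization ignores nested and hands everything semaphores from one pool, the separation exists on paper only.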
Re: REINDEX CONCURRENTLY and indisreplident
On Thu, Jun 04, 2020 at 11:23:36AM +0900, Michael Paquier wrote:
> On Wed, Jun 03, 2020 at 12:40:38PM -0300, Euler Taveira wrote:
> > On Wed, 3 Jun 2020 at 03:54, Michael Paquier wrote:
> >> I have bumped into $subject, causing a replica identity index to
> >> be considered as dropped if running REINDEX CONCURRENTLY on it. This
> >> means that the old tuple information would get lost in this case, as
> >> a REPLICA IDENTITY USING INDEX without a dropped index is the same as
> >> NOTHING.
> >
> > LGTM. I tested in both versions (12, master) and it works accordingly.
>
> Thanks for the review. I'll try to get that fixed soon.

Applied this one, just in time before the branching:
https://www.postgresql.org/message-id/1931934b-09dc-e93e-fab9-78c5bc727...@postgresql.org
--
Michael
Re: significant slowdown of HashAggregate between 9.6 and 10
Hi, On 2020-06-04 18:22:03 -0700, Jeff Davis wrote: > On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote: > > +/* minimum number of initial hash table buckets */ > > +#define HASHAGG_MIN_BUCKETS 256 > > > > > > I don't really see much explanation for that part in the commit, > > perhaps > > Jeff can chime in? > > I did this in response to a review comment (point #5): > > > https://www.postgresql.org/message-id/20200219191636.gvdywx32kwbix6kd@development > > Tomas suggested a min of 1024, and I thought I was being more > conservative choosing 256. Still too high, I guess? > I can lower it. What do you think is a reasonable minimum? I don't really see why there needs to be a minimum bigger than 1? Greetings, Andres Freund
Re: v13: Performance regression related to FORTIFY_SOURCE
Jeff Davis writes: > On Thu, 2020-06-04 at 16:35 -0400, Alvaro Herrera wrote: >> If it is something worth worrying about, let's discuss what's a good >> fix for it. > I did post a fix for it, but it's not a very clean fix. I'm slightly > inclined to proceed with that fix, but I was hoping someone else would > have a better suggestion. > How about if I wait another week, and if we still don't have a better > fix, I will commit this one. TBH, I don't think we should do this, at least not on the strength of the evidence you posted so far. It looks to me like you are micro-optimizing for one compiler on one platform. Moreover, you're basically trying to work around a compiler codegen bug that might not be there next year. I think what'd make more sense is to file this as a gcc bug ("why doesn't it remove the useless object size check?") and see what they say about that. If the answer is that this isn't a gcc bug for whatever reason, then we could think about whether we should work around it on the source-code level. Even then, I'd want more evidence than has been presented about this not causing a regression on other toolchains/CPUs. regards, tom lane
Re: significant slowdown of HashAggregate between 9.6 and 10
On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote: > +/* minimum number of initial hash table buckets */ > +#define HASHAGG_MIN_BUCKETS 256 > > > I don't really see much explanation for that part in the commit, > perhaps > Jeff can chime in? I did this in response to a review comment (point #5): https://www.postgresql.org/message-id/20200219191636.gvdywx32kwbix6kd@development Tomas suggested a min of 1024, and I thought I was being more conservative choosing 256. Still too high, I guess? I can lower it. What do you think is a reasonable minimum? Regards, Jeff Davis
Re: v13: Performance regression related to FORTIFY_SOURCE
On Thu, 2020-06-04 at 16:35 -0400, Alvaro Herrera wrote: > If it is something worth worrying about, let's discuss what's a good > fix for it. I did post a fix for it, but it's not a very clean fix. I'm slightly inclined to proceed with that fix, but I was hoping someone else would have a better suggestion. How about if I wait another week, and if we still don't have a better fix, I will commit this one. Regards, Jeff Davis
Re: v13: Performance regression related to FORTIFY_SOURCE
On Sun, 2020-04-19 at 16:19 -0700, Peter Geoghegan wrote: > Is it possible that the issue has something to do with what the > compiler knows about the alignment of the tapes back when they were a > flexible array vs. now, where it's a separate allocation? Perhaps I'm > over reaching, but it occurs to me that MemSetAligned() is itself > concerned about the alignment of data returned from palloc(). Could > be > a similar issue here, too. Perhaps, but if so, what remedy would that suggest? Regards, Jeff Davis
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Mon, 25 May 2020 at 19:14, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > On Mon, May 25, 2020 at 06:34:30AM +1200, David Rowley wrote:
> > The difference will be that you'd be setting some distinct_uniquekeys
> > in standard_qp_callback() to explicitly request that some skip scan
> > paths be created for the uniquekeys, whereas the patch here just does
> > not bother doing DISTINCT if the upper relation already has unique
> > keys that state that the DISTINCT is not required. The skip scans
> > patch should check if the RelOptInfo for the uniquekeys set in
> > standard_qp_callback() are already mentioned in the RelOptInfo's
> > uniquekeys. If they are then there's no point in skip scanning as the
> > rel is already unique for the distinct_uniquekeys.
>
> It sounds like that makes the semantics of UniqueKey a bit more
> confusing, doesn't it? At the moment it says:
>
> Represents the unique properties held by a RelOptInfo.
>
> With the proposed changes it would be "unique properties that are held"
> and "unique properties that are requested", which are partially
> duplicated, but stored in some different fields. From the skip scan
> patch perspective it probably doesn't make any difference; it seems like
> the implementation would be almost the same, just the created UniqueKeys
> would be of a different type. But I'm afraid potential future users of
> UniqueKeys could be easily confused.

If there's some comment that says UniqueKeys are for RelOptInfos, then
perhaps that comment just needs to be expanded to mention the Path
uniqueness when we add the uniquekeys field to Path.

I think the main point of basing skip scans on top of this uniquekeys
patch is to ensure it's the right thing for the job. I don't think it's
realistic to be maintaining two different sets of infrastructure which
serve a very similar purpose. It's important we make UniqueKeys general
purpose enough to support future useful forms of optimisation.
Basing skip scans on it seems like a good exercise towards that. I'm not expecting that we need to make zero changes here to allow it to work well with skip scans. David
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020-Jun-04, Andres Freund wrote:

> postgres[52656][1]=# SELECT 1;
> ┌──────────┐
> │ ?column? │
> ├──────────┤
> │        1 │
> └──────────┘
> (1 row)

> I am very much not in love with the way that was implemented, but it's
> there, and it's used as far as I know (cf tablesync.c).

Ouch ... so they made IDENT in the replication grammar be a trigger to
enter the regular grammar. Crazy. No way to put those worms back in the
tin now, I guess.

It is still my opinion that we should prohibit a logical replication
connection from being used to do physical replication. Horiguchi-san,
Sawada-san and Masao-san are all of the same opinion. Dave Cramer (of
the JDBC team) is not opposed to the change -- he says they're just
using it because they didn't realize they should be doing differently.
Both Michael P. and you are saying we shouldn't break it because it
works today, but there isn't a real use-case for it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
Hi,

On 2020-06-04 16:44:53 -0400, Alvaro Herrera wrote:
> A logical replication connection cannot run SQL anyway, can it?

You can:

andres@awork3:~/src/postgresql$ psql 'replication=database'

postgres[52656][1]=# IDENTIFY_SYSTEM;
┌─────────────────────┬──────────┬────────────┬──────────┐
│      systemid       │ timeline │  xlogpos   │  dbname  │
├─────────────────────┼──────────┼────────────┼──────────┤
│ 6821634567571961151 │        1 │ 1/D256EC40 │ postgres │
└─────────────────────┴──────────┴────────────┴──────────┘
(1 row)

postgres[52656][1]=# SELECT 1;
┌──────────┐
│ ?column? │
├──────────┤
│        1 │
└──────────┘
(1 row)

I am very much not in love with the way that was implemented, but it's
there, and it's used as far as I know (cf tablesync.c).

Greetings,

Andres Freund
Re: what can go in root.crt ?
Chapman Flack writes: > Sure. It seems sensible to me to start by documenting /what/ it is doing > now, and to what extent that should be called "its standard behavior" > versus "the way libpq is calling it", because even if nothing is to be > changed, there will be people who need to be able to find that information > to understand what will and won't work. Fair enough. I'm certainly prepared to believe that there might be things we're doing with that API that are not (anymore?) considered best practice. But I'd want to approach any changes as "what is considered best practice", not "how can we get this predetermined behavior". regards, tom lane
Re: what can go in root.crt ?
On 06/04/20 18:03, Tom Lane wrote: > It's possible that we could force openssl to validate cases it doesn't > accept now. Whether we *should* deviate from its standard behavior is > a fairly debatable question though. I would not be inclined to do so > unless we find that many other consumers of the library also do that. > Overriding a library in its specific area of expertise seems like a > good way to get your fingers burnt. Sure. It seems sensible to me to start by documenting /what/ it is doing now, and to what extent that should be called "its standard behavior" versus "the way libpq is calling it", because even if nothing is to be changed, there will be people who need to be able to find that information to understand what will and won't work. Regards, -Chap
Re: what can go in root.crt ?
Chapman Flack writes: > On 06/04/20 17:31, Andrew Dunstan wrote: >> Do we actually do any of this sort of thing? I confess my impression was >> this is all handled by the openssl libraries, we just hand over the >> certs and let openssl do its thing. Am I misinformed about that? > By analogy to other SSL libraries I have worked with, my guess would > be that there are certain settings and callbacks available that would > determine some of what it is doing. It's possible that we could force openssl to validate cases it doesn't accept now. Whether we *should* deviate from its standard behavior is a fairly debatable question though. I would not be inclined to do so unless we find that many other consumers of the library also do that. Overriding a library in its specific area of expertise seems like a good way to get your fingers burnt. regards, tom lane
Re: what can go in root.crt ?
On 06/04/20 17:31, Andrew Dunstan wrote: > Do we actually do any of this sort of thing? I confess my impression was > this is all handled by the openssl libraries, we just hand over the > certs and let openssl do its thing. Am I misinformed about that? I haven't delved very far into the code yet (my initial aim with this thread was not to pose a rhetorical question, but an ordinary one, and somebody would know the answer). By analogy to other SSL libraries I have worked with, my guess would be that there are certain settings and callbacks available that would determine some of what it is doing. In the javax.net.ssl package [1], for example, there are HostnameVerifier and TrustManager interfaces; client code can supply implementations of these that embody its desired policies. Regards, -Chap
Re: what can go in root.crt ?
On 6/3/20 7:57 PM, Chapman Flack wrote:
>
> In an ideal world, I think libpq would be using this algorithm:
>
>   I'm looking at the server's certificate, s.
>   Is s unexpired and in the trust file? If so, SUCCEED.
>
>   otherwise, loop:
>     get issuer certificate i from s (if s is self-signed, FAIL).
>     does i have CA:TRUE and Certificate Sign bits? If not, FAIL.
>     does i's Domain Constraint allow it to sign s? If not, FAIL.
>     is i unexpired, or has s a Signed Certificate Timestamp made
>       while i was unexpired? If not, FAIL.
>     is i in the trust file? If so, SUCCEED.
>     s := i, continue.
>
> (I left out steps like verify signature, check revocation, etc.)
>
> What it seems to be doing, though, is just:
>
>   I'm looking at s
>   Follow chain all the way to a self-signed cert
>   is that in the file?
>
> which seems too simplistic.

Do we actually do any of this sort of thing? I confess my impression was
this is all handled by the openssl libraries, we just hand over the
certs and let openssl do its thing. Am I misinformed about that?

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: repeat() function, CHECK_FOR_INTERRUPTS(), and unlikely()
On 2020-May-28, Joe Conway wrote:

> I backpatched and pushed the changes to the repeat() function. Any other
> opinions regarding backpatch of the unlikely() addition to
> CHECK_FOR_INTERRUPTS()?

We don't use unlikely() in 9.6 at all, so I would stop that backpatching
at 10 anyhow. (We did backpatch unlikely()'s definition afterwards.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: repeat() function, CHECK_FOR_INTERRUPTS(), and unlikely()
On 5/28/20 1:23 PM, Joe Conway wrote:
> On 5/27/20 3:29 AM, Michael Paquier wrote:
> >> I think that each of those tests should have a separate unlikely() marker,
> >> since the whole point here is that we don't expect either of those tests
> >> to yield true in the huge majority of CHECK_FOR_INTERRUPTS executions.
> >
> > +1. I am not sure that the addition of unlikely() should be
> > backpatched though, that's not something usually done.
>
> I backpatched and pushed the changes to the repeat() function. Any other
> opinions regarding backpatch of the unlikely() addition to
> CHECK_FOR_INTERRUPTS()?

So far I have:

  Tom     +1
  Michael -1
  me      +0

on backpatching the addition of unlikely() to CHECK_FOR_INTERRUPTS().
Assuming no one else chimes in I will push the attached to all supported
branches sometime before Tom creates the REL_13_STABLE branch on Sunday.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127..18bc8a7 100644
*** a/src/include/miscadmin.h
--- b/src/include/miscadmin.h
*** extern void ProcessInterrupts(void);
*** 98,113 ****
  #define CHECK_FOR_INTERRUPTS() \
  do { \
! 	if (InterruptPending) \
  		ProcessInterrupts(); \
  } while(0)

  #else	/* WIN32 */

  #define CHECK_FOR_INTERRUPTS() \
  do { \
! 	if (UNBLOCKED_SIGNAL_QUEUE()) \
  		pgwin32_dispatch_queued_signals(); \
! 	if (InterruptPending) \
  		ProcessInterrupts(); \
  } while(0)

  #endif	/* WIN32 */
--- 98,113 ----
  #define CHECK_FOR_INTERRUPTS() \
  do { \
! 	if (unlikely(InterruptPending)) \
  		ProcessInterrupts(); \
  } while(0)

  #else	/* WIN32 */

  #define CHECK_FOR_INTERRUPTS() \
  do { \
! 	if (unlikely(UNBLOCKED_SIGNAL_QUEUE())) \
  		pgwin32_dispatch_queued_signals(); \
! 	if (unlikely(InterruptPending)) \
  		ProcessInterrupts(); \
  } while(0)

  #endif	/* WIN32 */
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020-Jun-04, Michael Paquier wrote:

> On Wed, Jun 03, 2020 at 06:33:11PM -0700, Andres Freund wrote:
> >> I don't think having a physical replication connection access catalog
> >> data directly is a great idea. We already have gadgets like
> >> IDENTIFY_SYSTEM for physical replication that can do that, and if you
> >> need particular settings you can use SHOW (commit d1ecd539477). If
> >> there was a strong need for even more than that, we can add something to
> >> the grammar.
> >
> > Those special case things are a bad idea, and we shouldn't introduce
> > more. It's unrealistic that we can ever make that support everything,
> > and since we already have to support the database connected thing, I
> > don't see the point.
>
> Let's continue discussing this part as well.

A logical replication connection cannot run SQL anyway, can it? It's
limited to the replication grammar, so it's not like you can run
arbitrary queries to access catalog data. So even if we do need to
access the catalogs, we'd have to add stuff to the replication grammar
in order to support that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: v13: Performance regression related to FORTIFY_SOURCE
Speaking with my RMT hat on, I'm concerned that this item is not moving
forward at all. ISTM we first and foremost need to decide whether this
is a problem worth worrying about, or not. If it is something worth
worrying about, let's discuss what's a good fix for it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Atomic operations within spinlocks
Andres Freund writes: > On 2020-06-04 14:50:40 -0400, Tom Lane wrote: >> 2. The computed completePasses value would go backwards. I bet >> that wouldn't matter too much either, or at least we could teach >> BgBufferSync to cope. (I notice the comments therein suggest that >> it is already designed to cope with completePasses wrapping around, >> so maybe nothing needs to be done.) > If we're not concerned about that, then we can remove the > atomic-inside-spinlock, I think. The only reason for that right now is > to avoid assuming a wrong pass number. Hmm. That might be a less-invasive path to a solution. I can take a look, if you don't want to. regards, tom lane
Re: Atomic operations within spinlocks
Andres Freund writes: > I'd still like to know which problem we're actually trying to solve > here. I don't understand the "error" issues you mentioned upthread. If you error out of getting the inner spinlock, the outer spinlock is stuck, permanently, because there is no mechanism for spinlock release during transaction abort. Admittedly it's not very likely for the inner acquisition to fail, but it's possible. Aside from timeout scenarios (e.g., process holding lock gets swapped out to Timbuktu), it could be that both spinlocks are mapped onto a single implementation lock by spin.c, which notes * We map all spinlocks onto a set of NUM_SPINLOCK_SEMAPHORES semaphores. * It's okay to map multiple spinlocks onto one semaphore because no process * should ever hold more than one at a time. You've falsified that argument ... and no, I don't want to upgrade the spinlock infrastructure enough to make this OK. We shouldn't ever be holding spinlocks long enough, or doing anything complicated enough inside them, for such an upgrade to have merit. regards, tom lane
Re: Atomic operations within spinlocks
Hi,

On 2020-06-04 14:50:40 -0400, Tom Lane wrote:
> Actually ... we could probably use this design with a uint32 counter
> as well, on machines where the 64-bit operations would be slow.

On skylake-x even a 32bit [i]div is still 26 cycles. That's more than an
atomic operation (18 cycles).

> 2. The computed completePasses value would go backwards. I bet
> that wouldn't matter too much either, or at least we could teach
> BgBufferSync to cope. (I notice the comments therein suggest that
> it is already designed to cope with completePasses wrapping around,
> so maybe nothing needs to be done.)

If we're not concerned about that, then we can remove the
atomic-inside-spinlock, I think. The only reason for that right now is
to avoid assuming a wrong pass number.

I don't think completePasses wrapping around is comparable in frequency
to wrapping around nextVictimBuffer. It's not really worth worrying
about bgwriter wrongly assuming it lapped the clock sweep once every
UINT32_MAX * NBuffers ticks, but there being a risk every NBuffers seems
worth worrying about.

Greetings,

Andres Freund
Re: Atomic operations within spinlocks
Hi, On 2020-06-04 13:57:19 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2020-06-03 14:19:45 -0400, Tom Lane wrote: > >> This seems to me to be very bad code. > > > I think straight out prohibiting use of atomics inside a spinlock will > > lead to more complicated and slower code, rather than than improving > > anything. I'm to blame for the ClockSweepTick() case, and I don't really > > see how we could avoid the atomic-while-spinlocked without relying on > > 64bit atomics, which are emulated on more platforms, and without having > > a more complicated retry loop. > > Well, if you don't want to touch freelist.c then there is no point > worrying about what pg_stat_get_wal_receiver is doing. But having > now studied that freelist.c code, I don't like it one bit. It's > overly complicated and not very accurately commented, making it > *really* hard to convince oneself that it's not buggy. > > I think a good case could be made for ripping out what's there now > and just redefining nextVictimBuffer as a pg_atomic_uint64 that we > never reset (ie, make its comment actually true). Then ClockSweepTick > reduces to Note that we can't do that in the older back branches, there wasn't any 64bit atomics fallback before commit e8fdbd58fe564a29977f4331cd26f9697d76fc40 Author: Andres Freund Date: 2017-04-07 14:44:47 -0700 Improve 64bit atomics support. > pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1) % NBuffers > > and when we want to know how many cycles have been completed, we just > divide the counter by NBuffers. Now admittedly, this is relying on both > pg_atomic_fetch_add_u64 being fast and int64 division being fast ... but > to throw your own argument back at you, do we really care anymore about > performance on machines where those things aren't true? Furthermore, > since all this is happening in a code path that's going to lead to I/O, > I'm not exactly convinced that a few nanoseconds matter anyway. 
It's very easy to observe this code being a bottleneck. If we only
performed a single clock tick before IO, sure, then the cost would
obviously be swamped by the IO cost. But it's pretty common to end up
having to do that ~ NBuffers * 5 times for a single buffer.

I don't think it's realistic to rely on 64bit integer division being
fast in this path. The latency is pretty darn significant (64bit div is
35-88 cycles on skylake-x, 64bit idiv 42-95). And unless I
misunderstand, you'd have to do so (for % NBuffers) every single clock
tick, not just when we wrap around.

We could however avoid the spinlock if we were to use 64bit atomics, by
storing the number of passes in the upper 32 bits, and the next victim
buffer in the lower. But that doesn't seem that simple either, and will
regress performance on 32bit platforms.

I don't know the whole strategy logic all that well, so I guess there's
some argument to improve things from that end. It's probably possible to
avoid the lock with 32bit atomics as well.

I'd still like to know which problem we're actually trying to solve
here. I don't understand the "error" issues you mentioned upthread.

Greetings,

Andres Freund
Re: Atomic operations within spinlocks
I wrote: > I think a good case could be made for ripping out what's there now > and just redefining nextVictimBuffer as a pg_atomic_uint64 that we > never reset (ie, make its comment actually true). Then ClockSweepTick > reduces to > pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1) % NBuffers > and when we want to know how many cycles have been completed, we just > divide the counter by NBuffers. Actually ... we could probably use this design with a uint32 counter as well, on machines where the 64-bit operations would be slow. In that case, integer overflow of nextVictimBuffer would happen from time to time, resulting in 1. The next actual victim buffer index would jump strangely. This doesn't seem like it'd matter at all, as long as it was infrequent. 2. The computed completePasses value would go backwards. I bet that wouldn't matter too much either, or at least we could teach BgBufferSync to cope. (I notice the comments therein suggest that it is already designed to cope with completePasses wrapping around, so maybe nothing needs to be done.) If NBuffers was large enough to be a significant fraction of UINT_MAX, then these corner cases would happen often enough to perhaps be problematic. But I seriously doubt that'd be possible on hardware that wasn't capable of using the 64-bit code path. regards, tom lane
REL_13_STABLE Branch
Hi,

After conferring, the PostgreSQL 13 RMT[1] has decided that it is time
to create the REL_13_STABLE branch. Tom has volunteered to create the
branch this Sunday (2020-06-07).

Please let us know if you have any questions.

Thanks,

Alvaro, Peter, Jonathan

[1] https://wiki.postgresql.org/wiki/Release_Management_Team
Re: significant slowdown of HashAggregate between 9.6 and 10
Hi,

On 2020-06-03 13:26:43 -0700, Andres Freund wrote:
> On 2020-06-03 21:31:01 +0200, Tomas Vondra wrote:
> > So there seems to be +40% between 9.6 and 10, and further +25% between
> > 10 and master. However, plain hashagg, measured e.g. like this:

As far as I can tell the 10->master difference comes largely from the
difference in the number of buckets in the hashtable. In 10 it is:

Breakpoint 1, tuplehash_create (ctx=0x5628251775c8, nelements=75,
    private_data=0x5628251952f0)

and in master it is:

Breakpoint 1, tuplehash_create (ctx=0x5628293a0a70, nelements=256,
    private_data=0x5628293a0b90)

As far as I can tell the timing difference simply is the cost of
iterating 500k times over a hashtable with fairly few entries. Which is,
unsurprisingly, more expensive if the hashtable is larger.

The reason the hashtable got bigger in 12 is

commit 1f39bce021540fde00990af55b4432c55ef4b3c7
Author: Jeff Davis
Date:   2020-03-18 15:42:02 -0700

    Disk-based Hash Aggregation.

which introduced

+/* minimum number of initial hash table buckets */
+#define HASHAGG_MIN_BUCKETS 256

I don't really see much explanation for that part in the commit, perhaps
Jeff can chime in?

I think optimizing for the gazillion hash table scans isn't particularly
important. Rarely is a query going to have 500k scans of unchanging
aggregated data. So I'm not too concerned about the 13 regression - but
I also see very little reason to just always use 256 buckets? It's
pretty darn common to end up with 1-2 groups, what's the point of this?

I'll look into 9.6->10 after buying groceries... But I'd wish there were
a relevant benchmark, I don't think it's worth optimizing for this case.

Greetings,

Andres Freund
Re: Regarding TZ conversion
Thanks for the clarification. Is it advisable to modify the Default
file? Will it be overwritten when we apply a patch or upgrade the DB?
What about creating a new file like the one below and updating
postgresql.conf with the new name?

# New tz offset
@INCLUDE Default
@OVERRIDE
IST    19800

*Regards,*
*Rajin *

On Thu, Jun 4, 2020 at 7:23 PM Tom Lane wrote:
> Rajin Raj writes:
> > Option 1: AT TIME ZONE 'IST'
> > Option 2: AT TIME ZONE 'Asia/Kolkata'
> > In the first option, I get +2:00:00 offset (when *timezone_abbreviations =
> > 'Default'*) and for option 2, +5:30 offset.
>
> > I can see multiple entries for IST in pg_timezone_names with
> > different utc_offset, but in pg_timezone_abbrev there is one entry. I
> > guess the AT TIME ZONE function is using the offset shown in
> > pg_timezone_abbrev.
>
> No. If you use an abbreviation rather than a spelled-out zone name,
> you get whatever the timezone_abbreviations file says, which by default
> is
>
> $ grep IST .../postgresql/share/timezonesets/Default
> # CONFLICT! IST is not unique
> # - IST: Irish Standard Time (Europe)
> # - IST: Indian Standard Time (Asia)
> IST    7200    # Israel Standard Time
>
> If that's not what you want, change it. See
>
> https://www.postgresql.org/docs/current/datetime-config-files.html
>
> and also
>
> https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-TIMEZONES
>
> regards, tom lane
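Per the datetime-config-files documentation linked above, the usual answer to the "will it be overwritten" concern is exactly the approach sketched in the mail: leave Default alone and create a new abbreviations file next to it (the file name "India" below is just an example):

```
# share/timezonesets/India
@INCLUDE Default
@OVERRIDE
IST    19800    # Indian Standard Time (UTC+5:30)
```

then point the server at it with `timezone_abbreviations = 'India'` in postgresql.conf (the parameter can also be changed per-session with SET). Minor-version patches don't touch a custom file in share/timezonesets; a major-version upgrade installs a fresh share tree, so the custom file would need to be copied over — details depend on the packaging/installation layout.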
Re: Removal of currtid()/currtid2() and some table AM cleanup
Hi, On 2020-06-03 11:14:48 +0900, Michael Paquier wrote: > I would like to remove those two functions and the surrounding code > for v14, leading to some cleanup: > 6 files changed, 326 deletions(-) +1 > While on it, I have noticed that heap_get_latest_tid() is still > located within heapam.c, but we can just move it within > heapam_handler.c. What's the point of that change? I think the differentiation between heapam_handler.c and heapam.c could be clearer, but if anything, I'd argue that heap_get_latest_tid is sufficiently low-level that it'd belong in heapam.c. Greetings, Andres Freund
Re: Atomic operations within spinlocks
Andres Freund writes: > On 2020-06-03 14:19:45 -0400, Tom Lane wrote: >> This seems to me to be very bad code. > I think straight out prohibiting use of atomics inside a spinlock will > lead to more complicated and slower code, rather than than improving > anything. I'm to blame for the ClockSweepTick() case, and I don't really > see how we could avoid the atomic-while-spinlocked without relying on > 64bit atomics, which are emulated on more platforms, and without having > a more complicated retry loop. Well, if you don't want to touch freelist.c then there is no point worrying about what pg_stat_get_wal_receiver is doing. But having now studied that freelist.c code, I don't like it one bit. It's overly complicated and not very accurately commented, making it *really* hard to convince oneself that it's not buggy. I think a good case could be made for ripping out what's there now and just redefining nextVictimBuffer as a pg_atomic_uint64 that we never reset (ie, make its comment actually true). Then ClockSweepTick reduces to pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1) % NBuffers and when we want to know how many cycles have been completed, we just divide the counter by NBuffers. Now admittedly, this is relying on both pg_atomic_fetch_add_u64 being fast and int64 division being fast ... but to throw your own argument back at you, do we really care anymore about performance on machines where those things aren't true? Furthermore, since all this is happening in a code path that's going to lead to I/O, I'm not exactly convinced that a few nanoseconds matter anyway. regards, tom lane
Re: should INSERT SELECT use a BulkInsertState?
Hi,

On 2020-05-08 02:25:45 -0500, Justin Pryzby wrote:
> Seems to me it should, at least conditionally. At least if there's a
> function scan or a relation or ..

Well, the problem is that this can cause very very significant
regressions. As in 10x slower or more. The ringbuffer can cause constant
XLogFlush() calls (due to the lsn interlock), and the eviction from
shared_buffers (regardless of actual availability) will mean future
vacuums etc will be much slower. I think this is likely to cause pretty
widespread regressions on upgrades.

Now, it sucks that we have this problem in the general facility that's
supposed to be used for this kind of bulk operation. But I don't really
see it as realistic to expand the use of bulk insert strategies unless
we have some more fundamental fixes.

Regards,

Andres
Re: Expand the use of check_canonical_path() for more GUCs
Peter Eisentraut writes:
> This (and some other messages in this thread) appears to assume that
> canonicalize_path() turns relative paths into absolute paths, but
> AFAICT, it does not do that.

Ah, fair point --- I'd been assuming that we were applying
canonicalize_path as cleanup for an absolute-ification operation, but
you are right that check_canonical_path does not do that.

Digging around, though, I notice a different motivation. In
assign_pgstat_temp_directory we have

	/* check_canonical_path already canonicalized newval for us */
	...
	tname = guc_malloc(ERROR, strlen(newval) + 12);	/* /global.tmp */
	sprintf(tname, "%s/global.tmp", newval);
	fname = guc_malloc(ERROR, strlen(newval) + 13);	/* /global.stat */
	sprintf(fname, "%s/global.stat", newval);

and I believe what the comment is on about is that these path derivation
operations are unreliable if newval isn't in canonical form. I seem to
remember for example that in some Windows configurations, mixing slashes
and backslashes doesn't work. So the real point here is that we could
use the user's string unmodified as long as we only use it exactly
as-is, but cases where we derive other pathnames from it require more
work.

Of course, we could leave the GUC string alone and only canonicalize
while forming derived paths, but that seems mighty error-prone. In any
case, just ripping out the check_canonical_path usages without any other
mop-up will break things.

regards, tom lane
Re: Expand the use of check_canonical_path() for more GUCs
On 2020-06-03 20:45, Tom Lane wrote:
> Robert Haas writes:
>> On Tue, Jun 2, 2020 at 5:04 AM Peter Eisentraut wrote:
>>> The archeology reveals that these calls were originally added to
>>> canonicalize the data_directory and config_file settings (7b0f060d54),
>>> but that was then moved out of guc.c to be done early during postmaster
>>> startup (337ffcddba). The remaining calls of check_canonical_path() in
>>> guc.c appear to be leftovers from a previous regime.
>> Thanks for looking into it. Sounds like it can just be ripped out, then,
>> unless someone knows of a reason to do otherwise.
> In the abstract, I agree with Peter's point that we shouldn't alter
> user-given strings without need. However, I think there's strong reason
> for canonicalizing the data directory and config file locations.

It is not proposed to change that. It is only debated whether the same canonicalization should be applied to other GUCs that represent paths.

> We access those both before and after chdir'ing into the datadir, so
> we'd better have absolute paths to them --- and at least for the datadir,
> it's documented that you can initially give it as a path relative to
> wherever you started the postmaster from. If the other files are only
> accessed after the chdir happens then we could likely do without
> canonicalizing them. But ... do we know which directory the user (thought
> he) specified them with reference to? Forced canonicalization does have
> the advantage that it's clear to all onlookers how we are interpreting
> the paths.

This (and some other messages in this thread) appears to assume that canonicalize_path() turns relative paths into absolute paths, but AFAICT, it does not do that.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: libpq copy error handling busted
I wrote: > * As for control-C not getting out of it: there is > if (CancelRequested) > break; > in pgbench's loop, but this does nothing in this scenario because > fe-utils/cancel.c only sets that flag when it successfully sends a > Cancel ... which it certainly cannot if the postmaster is gone. > I suspect that this may be relatively recent breakage. It doesn't look > too well thought out, in any case. The places that are testing this > flag look like they'd rather not be bothered with the fine point of > whether the cancel request actually went anywhere. On closer inspection, it seems that scripts_parallel.c does have a dependency on the cancel request having been sent, because it insists on collecting a query result off the active connection after detecting CancelRequested. This seems dangerously overoptimistic to me; it will lock up if for any reason the server doesn't honor the cancel request. It's also pointless, because all the calling apps are just going to close their connections and exit(1) afterwards, so there's no use in trying to resynchronize the connection state. (Plus, disconnectDatabase will issue cancels on any busy connections; which would be necessary anyway in a parallel operation, since cancel.c could only have signaled one of them.) So the attached patch just removes the useless consumeQueryResult call, and then simplifies select_loop's API a bit. With that change, I don't see any place that wants the existing definition of CancelRequested rather than the simpler meaning of "SIGINT was received", so I just changed it to mean that. We could certainly also have a variable tracking whether a cancel request was sent, but I see no point in one right now. It's easiest to test this *without* the other patch -- just run the pgbench scenario Andres demonstrated, and see whether control-C gets pgbench to quit cleanly. 
			regards, tom lane

diff --git a/src/bin/scripts/scripts_parallel.c b/src/bin/scripts/scripts_parallel.c
index 45c69b8d19..c3384d452a 100644
--- a/src/bin/scripts/scripts_parallel.c
+++ b/src/bin/scripts/scripts_parallel.c
@@ -28,7 +28,7 @@
 #include "scripts_parallel.h"
 
 static void init_slot(ParallelSlot *slot, PGconn *conn);
-static int	select_loop(int maxFd, fd_set *workerset, bool *aborting);
+static int	select_loop(int maxFd, fd_set *workerset);
 
 static void
 init_slot(ParallelSlot *slot, PGconn *conn)
@@ -41,23 +41,16 @@ init_slot(ParallelSlot *slot, PGconn *conn)
 /*
  * Loop on select() until a descriptor from the given set becomes readable.
  *
- * If we get a cancel request while we're waiting, we forego all further
- * processing and set the *aborting flag to true.  The return value must be
- * ignored in this case.  Otherwise, *aborting is set to false.
+ * Returns -1 on failure (including getting a cancel request).
  */
 static int
-select_loop(int maxFd, fd_set *workerset, bool *aborting)
+select_loop(int maxFd, fd_set *workerset)
 {
 	int			i;
 	fd_set		saveSet = *workerset;
 
 	if (CancelRequested)
-	{
-		*aborting = true;
 		return -1;
-	}
-	else
-		*aborting = false;
 
 	for (;;)
 	{
@@ -90,7 +83,7 @@
 		if (i < 0 && errno == EINTR)
 			continue;			/* ignore this */
 		if (i < 0 || CancelRequested)
-			*aborting = true;	/* but not this */
+			return -1;			/* but not this */
 		if (i == 0)
 			continue;			/* timeout (Win32 only) */
 		break;
@@ -135,7 +128,6 @@ ParallelSlotsGetIdle(ParallelSlot *slots, int numslots)
 {
 	fd_set		slotset;
 	int			maxFd = 0;
-	bool		aborting;
 
 	/* We must reconstruct the fd_set for each call to select_loop */
 	FD_ZERO(&slotset);
@@ -157,19 +149,12 @@ ParallelSlotsGetIdle(ParallelSlot *slots, int numslots)
 	}
 
 	SetCancelConn(slots->connection);
-	i = select_loop(maxFd, &slotset, &aborting);
+	i = select_loop(maxFd, &slotset);
 	ResetCancelConn();
 
-	if (aborting)
-	{
-		/*
-		 * We set the cancel-receiving connection to the one in the zeroth
-		 * slot above, so fetch the error from there.
-		 */
-		consumeQueryResult(slots->connection);
+	/* failure? */
+	if (i < 0)
 		return NULL;
-	}
-	Assert(i != 0);
 
 	for (i = 0; i < numslots; i++)
 	{
diff --git a/src/fe_utils/cancel.c b/src/fe_utils/cancel.c
index eb4056a9a6..51fb67d384 100644
--- a/src/fe_utils/cancel.c
+++ b/src/fe_utils/cancel.c
@@ -43,11 +43,11 @@ static PGcancel *volatile cancelConn = NULL;
 
 /*
- * CancelRequested tracks if a cancellation request has completed after
- * a signal interruption.  Note that if cancelConn is not set, in short
- * if SetCancelConn() was never called or if ResetCancelConn() freed
- * the cancellation object, then CancelRequested is switched to true after
- * all cancellation attempts.
+ * CancelRequested is set when we receive SIGINT (or local equivalent).
+ * There is no provision in this module for resetting it; but applications
+ * might choose to clear it after successfully
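The simplified meaning of CancelRequested that the patch settles on ("SIGINT was received", nothing more) follows the classic async-signal-safe flag pattern. The following is a hedged, self-contained sketch of that pattern, not the actual fe_utils/cancel.c code:

```c
#include <assert.h>
#include <signal.h>

/*
 * Sketch only: CancelRequested records that SIGINT arrived.  It says
 * nothing about whether a cancel request was actually delivered to any
 * server; callers that care about that must track it separately.
 */
static volatile sig_atomic_t CancelRequested = 0;

static void
handle_sigint(int signo)
{
	(void) signo;
	CancelRequested = 1;		/* async-signal-safe: only set the flag */
}

static void
install_cancel_handler(void)
{
	signal(SIGINT, handle_sigint);
}
```

Loops like the one in pgbench can then test the flag and break out even when no server is reachable, which is exactly the scenario where the old "cancel was sent" semantics got stuck.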
Re: Just for fun: Postgres 20?
On Tue, Jun 2, 2020 at 2:45 PM Robert Haas wrote:
> On Mon, Jun 1, 2020 at 3:20 PM Tom Lane wrote:
> > Robert Haas writes:
> > > As has already been pointed out, it could definitely happen, but we
> > > could solve that by just using a longer version number, say, including
> > > the month and, in case we ever do multiple major releases in the same
> > > month, also the day. In fact, we might as well take it one step
> > > further and use the same format for the release number that we use for
> > > CATALOG_VERSION_NO: YYYYMMDDN. So this fall, piggybacking on the
> > > success of PostgreSQL 10, 11, and 12, we could then release
> > > PostgreSQL 202009241 or so.
> >
> > But then where do you put the minor number for maintenance releases?
>
> Oh, well that's easy. The first maintenance release would just be
> 202009241.1.
>
> Unless, of course, we want to simplify things by using the same format
> for both parts of the version number. Then, supposing the first
> maintenance release follows the major release by a month or so, it
> would be PostgreSQL 202009241.202010291 or something of this sort.

Since there is a proposal to have a 64-bit transaction ID, we could rather have a 64-bit random number, which could solve all of these problems. :P And then, if I ask my customer what Postgres version he/she is using, it could be a Postgres fun ride.

> It's hard to agree on anything around here but I suspect we can come
> to near-unanimous agreement on the topic of how much merit this
> proposal has.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

-- 
Regards,
Avinash Vallarapu
Re: what can go in root.crt ?
On 06/04/20 11:04, Laurenz Albe wrote: > I was referring to the wish to *not* use a self-signed CA certificate, > but an intermediate certificate as the ultimate authority, based on > a distrust of the certification authority that your organization says > you should trust. Are you aware of any principled reason it should be impossible to include an end-entity certificate in the trust store used by a client? Are you aware of any principled reason it should be impossible to include a certificate that has the CA:TRUE and Certificate Sign bits in the trust store used by a client, whether it is its own signer or has been signed by another CA? Regards, -Chap
Re: Possible bug on Postgres 12 (CASE THEN evaluated prematurely) - Change of behaviour compared to 11, 10, 9
Thanks Tom! I was just hoping somebody could point out whether this kind of issue has already been reported, before I spend 2 days fabricating a simpler self-contained example.

Best,
Juan

> On 4 Jun 2020, at 16:26, Tom Lane wrote:
>
> Juan Fuentes writes:
>> As you could see the query includes castings, we noticed testing with
>> Postgres 12 that the castings of the CASE THEN statement (commented out
>> below) where failing in some cases, of course if you do the INNER JOIN and
>> CASE WHEN first our expectation is that the value can be casted.
>
> You're unlikely to get any useful comments on this if you don't provide
> a self-contained example. The query by itself lacks too many details.
> As an example, one way "t7.value::numeric = 1" could fail despite being
> inside a CASE is if t7 is a view whose "value" column is actually a
> constant. Flattening of the view would replace "t7.value" with that
> constant, and then constant-folding would cause the failure, and neither
> of those things are prevented by a CASE. I kind of doubt that that's
> the specific issue here, but I'm not going to guess at what is in your
> thirty-some input tables.
>
> regards, tom lane
Re: what can go in root.crt ?
On Thu, 2020-06-04 at 08:25 -0400, Chapman Flack wrote: > > I feel bad about bending the basic idea of certificates and trust to suit > > some misbegotten bureaucratic constraints on good security. > > Can you elaborate on what, in the email message you replied to here, > represented a bending of the basic idea of certificates and trust? > > I didn't notice any. I was referring to the wish to *not* use a self-signed CA certificate, but an intermediate certificate as the ultimate authority, based on a distrust of the certification authority that your organization says you should trust. Yours, Laurenz Albe
Re: Wrong width of UNION statement
Hello,

Thank you for your quick response, and sorry for my late reply.

> (I suppose you're using UTF8 encoding...)

That is right. As you said, the encoding of my database is UTF8.

> There's room for improvement there, but this is all bound up in the legacy
> mess that we have in prepunion.c.

At first, I thought it would be easy to fix, by calculating the width in the same way as UNION ALL does. But now I understand it is not so easy. I'll report back if I find a reason strong enough to justify throwing away and rewriting prepunion.c.

Thank you.

Regards,
Kenichiro Tanaka

2020年6月2日(火) 0:04 Tom Lane :
>
> Kenichiro Tanaka writes:
> > I think table column width of UNION statement should be equal one of UNION
> > ALL.
>
> I don't buy that argument, because there could be type coercions involved,
> so that the result width isn't necessarily equal to any one of the inputs.
>
> Having said that, the example you give shows that we make use of
> pg_statistic.stawidth values when estimating the width of immediate
> relation outputs, but that data isn't available by the time we get to
> a UNION output. So we fall back to get_typavgwidth, which in this
> case is going to produce something involving the typmod times the
> maximum encoding length. (I suppose you're using UTF8 encoding...)
>
> There's room for improvement there, but this is all bound up in the legacy
> mess that we have in prepunion.c. For example, because we don't have
> RelOptInfo nodes associated with individual set-operation outputs, it's
> difficult to figure out where we might store data about the widths of such
> outputs. Nor could we easily access the data if we had it, since the
> associated Vars don't have valid RTE indexes. So to my mind, that code
> needs to be thrown away and rewritten, using actual relations to represent
> the different setop results and Paths to represent possible computations.
> In the meantime, it's hard to get excited about layering some additional
> hacks on top of what's there now.
> > regards, tom lane
Re: Possible bug on Postgres 12 (CASE THEN evaluated prematurely) - Change of behaviour compared to 11, 10, 9
Juan Fuentes writes: > As you could see the query includes castings, we noticed testing with > Postgres 12 that the castings of the CASE THEN statement (commented out > below) where failing in some cases, of course if you do the INNER JOIN and > CASE WHEN first our expectation is that the value can be casted. You're unlikely to get any useful comments on this if you don't provide a self-contained example. The query by itself lacks too many details. As an example, one way "t7.value::numeric = 1" could fail despite being inside a CASE is if t7 is a view whose "value" column is actually a constant. Flattening of the view would replace "t7.value" with that constant, and then constant-folding would cause the failure, and neither of those things are prevented by a CASE. I kind of doubt that that's the specific issue here, but I'm not going to guess at what is in your thirty-some input tables. regards, tom lane
Re: Regarding TZ conversion
Rajin Raj writes:
> Option 1: AT TIME ZONE 'IST'
> Option 2: AT TIME ZONE 'Asia/Kolkata'
> In the first option, I get a +02:00:00 offset (when *timezone_abbreviations =
> 'Default'*) and for option 2, a +05:30 offset.
> I can see multiple entries for IST in pg_timezone_names with
> different utc_offset, but in pg_timezone_abbrevs there is one entry. I guess
> the AT TIME ZONE function is using the offset shown in pg_timezone_abbrevs.

No. If you use an abbreviation rather than a spelled-out zone name, you get whatever the timezone_abbreviations file says, which by default is

$ grep IST .../postgresql/share/timezonesets/Default
# CONFLICT! IST is not unique
# - IST: Irish Standard Time (Europe)
# - IST: Indian Standard Time (Asia)
IST     7200    # Israel Standard Time

If that's not what you want, change it. See
https://www.postgresql.org/docs/current/datetime-config-files.html
and also
https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-TIMEZONES

			regards, tom lane
Re: question regarding copyData containers
Hello,

thank you for your feedback. I agree that modifying the COPY subprotocols is hard to do because it would have an impact on the client ecosystem. My understanding (which seems to be confirmed by what Tom Lane said) is that the server discards the framing and manages to make sense of the underlying data.

> the expectation is that clients can send CopyData messages that are
> split up however they choose; the message boundaries needn't correspond
> to any semantic boundaries in the data stream.

So I thought that a client could decide to have the same behavior and could start parsing the payload of a copyData message without assembling it first. It works perfectly with COPY TO, but I hit a roadblock with copyBoth during logical replication with test_decoding because the subprotocol doesn't have any framing.

> Right now all 'w' messages should be contained in one CopyData/'d' that
> doesn't contain anything but the XLogData/'w'.

The current format of the XLogData/'w' message is

  w lsn lsn time byte[n]

and even if it is maybe too late now, I was wondering why it was not decided to be

  w lsn lsn time n byte[n]

because it seems to me that the missing n ties the XLogData to the copyData framing.

> The input data exists in a linear
> buffer already, so you're not going to reduce peak memory usage by
> sending smaller CopyData chunks.

That is very surprising to me. Do you mean that on the server, in COPY TO mode, a full row is prepared in a linear buffer in memory before being sent as a copyData/'d'? I found the code around
https://github.com/postgres/postgres/blob/master/src/backend/commands/copy.c#L2153
and indeed the whole row seems to be buffered in memory. Good thing or bad thing, users tend to use bigger fields (text, jsonb, bytea) and that can be very memory hungry. Do you know of a case in postgres (other than large objects, I suppose) where the server can flush data from a field without buffering it in memory?

And then, as you noted, there is the multiplexing of events.
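To make the framing point concrete, here is a small sketch of parsing an XLogData/'w' message out of one complete CopyData payload, following the message layout in the protocol documentation (one leading 'w' byte, then three big-endian int64 fields). Note how `data_len` can only be computed from the CopyData envelope length, since the 'w' header itself carries no length field; the struct and function names are illustrative, not from the PostgreSQL sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a big-endian (network-order) int64, as the wire protocol uses. */
static int64_t
read_be64(const uint8_t *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return (int64_t) v;
}

typedef struct XLogDataMsg
{
	int64_t		wal_start;		/* LSN of the payload's first byte */
	int64_t		wal_end;		/* current end of WAL on the server */
	int64_t		send_time;		/* server send timestamp */
	const uint8_t *data;		/* payload, runs to end of the CopyData */
	size_t		data_len;
} XLogDataMsg;

/*
 * Parse an XLogData ('w') message, given the payload of one complete
 * CopyData message.  Returns 0 on success, -1 if it isn't a 'w' message.
 * data_len is derived purely from the CopyData framing (len - 25): this
 * is the "missing n" discussed above.
 */
static int
parse_xlogdata(const uint8_t *buf, size_t len, XLogDataMsg *out)
{
	if (len < 25 || buf[0] != 'w')
		return -1;
	out->wal_start = read_be64(buf + 1);
	out->wal_end = read_be64(buf + 9);
	out->send_time = read_be64(buf + 17);
	out->data = buf + 25;
	out->data_len = len - 25;
	return 0;
}
```

A streaming client that wants to process 'w' payloads incrementally therefore has to track the CopyData envelope itself, because nothing inside the 'w' message tells it where the payload ends.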
A very long copyData makes communication between the client and the server impossible during the transfer. I briefly looked at
https://github.com/postgres/postgres/blob/master/src/backend/replication/walsender.c
and I found

/*
 * Maximum data payload in a WAL data message.  Must be >= XLOG_BLCKSZ.
 *
 * We don't have a good idea of what a good value would be; there's some
 * overhead per message in both walsender and walreceiver, but on the other
 * hand sending large batches makes walsender less responsive to signals
 * because signals are checked only between messages.  128kB (with
 * default 8k blocks) seems like a reasonable guess for now.
 */
#define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)

so I thought that the maximum copyData/'d' I would receive during logical replication was MAX_SEND_SIZE, but it seems that this is not used for logical decoding. The whole output of the output plugin seems to be prepared in memory, so for an insert like

  insert into mytable (col) values (repeat('-', pow(2, 27)::int));

a 128MB linear buffer will be created on the server and sent as one copyData over many network chunks.

So I understand that in the long term copyData framing should not carry any semantics, to be able to keep messages small enough to allow multiplexing, but that there are many steps to climb before that. Would it make sense one day, in some way, to try and do streaming at the sub-field level? I guess that is a huge undertaking since most of the field unit interfaces are probably based on a buffer/field one-to-one mapping.

Greetings,
Jérôme

On Thu, Jun 4, 2020 at 12:08 AM Andres Freund wrote:
> Hi,
>
> On 2020-06-03 19:28:12 +0200, Jerome Wagner wrote:
> > I have been working on a node.js streaming client for different COPY
> > scenarios.
> > usually, during CopyOut, clients tend to buffer network chunks until they
> > have gathered a full copyData message and pass that to the user.
> >
> > In some cases, this can lead to very large copyData messages. when there
> > are very long text fields or bytea fields it will require a lot of memory
> > to be handled (up to 1GB I think in the worst case scenario)
> >
> > In COPY TO, I managed to relax that requirement, considering that copyData
> > is simply a transparent container. For each network chunk, the relevant
> > message content is forwarded, which makes for 64KB chunks at most.
>
> Uhm.
>
> > We lose the semantics of the "row" that copyData has according to the
> > documentation
> > https://www.postgresql.org/docs/10/protocol-flow.html#PROTOCOL-COPY
> >
> > > The backend sends a CopyOutResponse message to the frontend, followed by
> > > zero or more CopyData messages (**always one per row**), followed by
> > > CopyDone
> >
> > but it is not a problem because the raw bytes are still parsable (rows +
> > fields) in text mode (tsv) and in binary mode
>
> This seems like an extremely bad idea to me. Are we really going to ask
> clients to incur the overhead (both in complexity and runtime) to parse
> incoming
[PATCH] pg_dump: Add example and link for --encoding option
To let users know what kinds of character sets can be used, this adds an example and a link to the documentation of the --encoding option.

Thanks,
Dong wook

0001-pg_dump-Add-example-and-link-for-encoding-option.patch
Description: Binary data
Re: what can go in root.crt ?
On 06/04/20 02:07, Laurenz Albe wrote: > I feel bad about bending the basic idea of certificates and trust to suit > some misbegotten bureaucratic constraints on good security. Can you elaborate on what, in the email message you replied to here, represented a bending of the basic idea of certificates and trust? I didn't notice any. Regards, -Chap
Re: Read access for pg_monitor to pg_replication_origin_status view
Hi Kyotaro-san, > Sorry for not mentioning it at that time, but about the following diff: > > +GRANT SELECT ON pg_replication_origin_status TO pg_read_all_stats; > > system_views.sql already has a REVOKE command on the view. We should > put the above just below the REVOKE command. > > I'm not sure where to put the GRANT on > pg_show_replication_origin_status(), but maybe it also should be at > the same place. Yes, I agree that it makes the revoking/granting easier to read if it's grouped by objects, or groups of objects. Done. > In the previous comment, one point I meant is that the "to the > superuser" should be "to superusers", because a PostgreSQL server > (cluster) can define multiple superusers. Another is that "permitted > to other users by using the GRANT command." might be obscure for > users. In this regard I found a more specific description in the same > file: OK, now I understand what you were saying. :-) > Computes the total disk space used by the database with the specified > name or OID. To use this function, you must > have CONNECT privilege on the specified database > (which is granted by default) or be a member of > the pg_read_all_stats role. > > So, as the result it would be like the following: (Note that, as you > know, I'm not good at this kind of task..) > > Use of functions for replication origin is restricted to superusers. > Use for these functions may be permitted to other users by granting > EXECUTE privilege on the functions. > > And in regard to the view, granting privileges on both the view and > function to individual user is not practical so we should mention only > granting pg_read_all_stats to users, like the attached patch. I did some re-writing of the doc, which is pretty close to what you proposed above. > By the way, the attachements of your mail are out-of-order. I'm not > sure that that does something bad, though. That's likely Gmail giving them random order when you attach multiple files all at once. New patches attached. 
Regards,

-- 
Martín Marqués            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From 87c00782c691f2c6c13768cd6d96b19de69cab16 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mart=C3=ADn=20Marqu=C3=A9s?=
Date: Tue, 2 Jun 2020 10:44:42 -0300
Subject: [PATCH v4 1/4] Access to pg_replication_origin_status view has
 restricted access only for superusers due to it using
 pg_show_replication_origin_status(), and all replication origin functions
 requiring superuser rights.

This patch removes the check for superuser priviledges when executing
replication origin functions, which is hardcoded, and instead rely on
ACL checks.

This patch will also revoke execution of such functions from PUBLIC
---
 src/backend/catalog/system_views.sql     | 16 ++++++++++++++++
 src/backend/replication/logical/origin.c |  5 -----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 56420bbc9d6..6658f0c2eb2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1469,6 +1469,22 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+--
+-- Permision to execute Replication Origin functions should be revoked from public
+--
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_advance(text, pg_lsn) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_create(text) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_drop(text) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_oid(text) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_progress(text, boolean) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_is_setup() FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_progress(boolean) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_reset() FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_setup(text) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_xact_reset() FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_xact_setup(pg_lsn, timestamp with time zone) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_show_replication_origin_status() FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 981d60f135d..1f223daf21f 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -182,11 +182,6 @@ static ReplicationState *session_replication_state = NULL;
 static void
 replorigin_check_prerequisites(bool check_slots, bool recoveryOK)
 {
-	if (!superuser())
-		ereport(ERROR,
-				(errcode(ERRCODE_INSUFFICIENT_PRIVILE
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, 29 May 2020 at 15:52, Amit Kapila wrote:
>
> On Wed, May 27, 2020 at 5:19 PM Mahendra Singh Thalor wrote:
>>
>> On Tue, 26 May 2020 at 16:46, Amit Kapila wrote:
>>
>> Hi all,
>> On top of the v16 patch set [1], I did some testing of DDLs and DMLs to measure WAL size and performance. Below is the testing summary.
>>
>> Test parameters:
>>   wal_level = 'logical'
>>   max_connections = '150'
>>   wal_receiver_timeout = '600s'
>>   max_wal_size = '2GB'
>>   min_wal_size = '2GB'
>>   autovacuum = 'off'
>>   checkpoint_timeout = '1d'
>>
>> Test results (LSN diff in bytes, time in seconds, % = LSN change):
>>
>>                                   CREATE INDEX                     Add col int(date)                Add col text
>> SN. operation                     LSN diff  time       % LSN       LSN diff  time      % LSN        LSN diff  time      % LSN
>> 1   1 DDL           without       17728     0.89116    1.624548    976       0.764393  11.475409    33904     0.80044   2.80792
>>                     with patch    18016     0.804868               1088      0.763602               34856     0.787108
>> 2   2 DDL           without       19872     0.860348   2.73752     1632      0.763199  13.7254902   34560     0.806086  3.078703
>>                     with patch    20416     0.839065               1856      0.733147               35624     0.829281
>> 3   3 DDL           without       22016     0.894891   3.63372093  2288      0.776871  14.685314    35216     0.803493  3.339391186
>>                     with patch    22816     0.828028               2624      0.737177               36392     0.800194
>> 4   4 DDL           without       24160     0.901686   4.4701986   2944      0.768445  15.217391    35872     0.77489   3.590544
>>                     with patch    25240     0.887143               3392      0.768382               37160     0.82777
>> 5   5 DDL           without       26328     0.901686   4.9832877   3600      0.751879  15.55        36528     0.817928  3.832676
>>                     with patch    27640     0.914078               4160      0.74709                37928     0.820621
>> 6   6 DDL           without       28472     0.936385   5.5071649   4256      0.745179  15.78947368  37184     0.797043  4.066265
>>                     with patch    30040     0.958226               4928      0.725321               38696     0.814535
>> 7   8 DDL           without       32760     1.0022203  6.422466    5568      0.757468  16.091954    38496     0.83207   4.509559
>>                     with patch    34864     0.966777               6464      0.769072               40232     0.903604
>> 8   11 DDL          without       50296     1.0022203  5.662478    7536      0.748332  16.66        40464     0.822266  5.179913
>>                     with patch    53144     0.966777               8792      0.750553               42560     0.797133
>> 9   15 DDL          without       58896     1.267253   5.662478    10184     0.776875  16.496465    43112     0.821916  5.84524
>>                     with patch    62768     1.27234                11864     0.746844               45632     0.812567
>> 10  1 DDL & 3 DML   without       18240     0.812551   1.6228      1192      0.771993  10.067114    34120     0.849467  2.8113599
>>                     with patch    18536     0.819089               1312      0.785117               35080     0.855456
>> 11  3 DDL & 5 DML   without       23656     0.926616   3.4832606   2656      0.758029  13.55421687  35584     0.829377  3.372302
>>                     with patch    24480     0.915517               3016      0.797206               36784     0.839176
>> 12  10 DDL & 5 DML  without       52760     1.101005   4.958301744 7288      0.763065  16.02634468  40216     0.837843  4.993037
>>                     with patch    55376     1.105241               8456      0.779257               42224     0.835206
>> 13  10 DML          without       1008      0.791091   6.349206    1008      0.81105   6.349206     1008      0.78817   6.349206
>>                     with patch    1072      0.807875               1072      0.771113               1072      0.759789
>>
>> To see all operations, please see [2] test_results
>>

> Why are you seeing any additional WAL in case-13 (10 DML) where there is no DDL? I think it is because you have used savepoints in that case, which will add some additional WAL. You seem to have 9 savepoints in that test, which should ideally generate 36 bytes of additional WAL (4 bytes per transaction id for each subtransaction). Also, in the other cases where you took data for DDL and DML, you have also used savepoints in those tests. I suggest for savepoints let's do separate tests as you have done in case-13, but with 3, 5, 7, 10 savepoints, and probably each transaction can update a row of 200 bytes or so.

Thanks, Amit, for reviewing the results. Yes, you are correct: I used savepoints in the DML tests, so they were showing additional WAL. As suggested above, I did testing for DMLs, DDLs and savepoints. Below are the test results:

*Test results:* (LSN diff in bytes, time in seconds, % = LSN change)

                                  CREATE INDEX                     Add col int(date)                Add col text
SN. operation                     LSN diff  time       % LSN       LSN diff  time      % LSN        LSN diff  time      % LSN
1   1 DDL           without       17728     0.89116    1.624548    976       0.764393  11.475409    33904     0.80044   2.80792
                    with patch    18016     0.804868               1088      0.763602               34856     0.787108
2   2 DDL           without       19872     0.860348   2.73752     1632      0.763199  13.7254902   34560     0.806086  3.078703
                    with patch    20416     0.839065               1856      0.733147               35624     0.829281
3   3 DDL           without       22016     0.894891   3.63372093  2288      0.776871  14.685314    35216     0.803493  3.339391186
                    with patch    22816     0.828028               2624      0.737177               36392     0.800194
4   4 DDL           without       24160     0.901686   4.4701986   2944      0.768445  15.217391    35872     0.77489   3.590544
                    with patch    25240     0.887143               3392      0.768382               37160     0.82777
5   5 DDL           without       26328     0.901686   4.9832877   3600      0.751879  15.55        36528     0.817928  3.832676
                    with patch    27640     0.914078               4160      0.74709                37928     0.820621
6   6 DDL           without       28472     0.936385   5.5071649   4
Re: proposal - function string_to_table
+{ oid => '2228', descr => 'split delimited text',
+  proname => 'string_to_table', prorows => '1000', proretset => 't',
+  prorettype => 'text', proargtypes => 'text text',
+  prosrc => 'text_to_table' },
+{ oid => '2282', descr => 'split delimited text with null string',
+  proname => 'string_to_table', prorows => '1000', proretset => 't',
+  prorettype => 'text', proargtypes => 'text text text',
+  prosrc => 'text_to_table_null' },

I went through the patch, and everything looks good to me. But I do not know why it needs a separate 'text_to_table_null()'; I think it would be fine to use 'text_to_table' there as well.

Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
Regarding TZ conversion
Hi,

What is the right approach for using the AT TIME ZONE function?

Option 1: AT TIME ZONE 'IST'
Option 2: AT TIME ZONE 'Asia/Kolkata'

In the first option, I get a +02:00:00 offset (when *timezone_abbreviations = 'Default'*) and for option 2, a +05:30 offset. I can see multiple entries for IST in pg_timezone_names with different utc_offset values, but in pg_timezone_abbrevs there is one entry. I guess the AT TIME ZONE function is using the offset shown in pg_timezone_abbrevs.

ovdb=> select * from pg_timezone_names where abbrev = 'IST';
        name         | abbrev | utc_offset | is_dst
---------------------+--------+------------+--------
 Asia/Calcutta       | IST    | 05:30:00   | f
 Asia/Kolkata        | IST    | 05:30:00   | f
 Europe/Dublin       | IST    | 01:00:00   | t
 posix/Asia/Calcutta | IST    | 05:30:00   | f
 posix/Asia/Kolkata  | IST    | 05:30:00   | f
 posix/Europe/Dublin | IST    | 01:00:00   | t
 posix/Eire          | IST    | 01:00:00   | t
 Eire                | IST    | 01:00:00   | t

ovdb=> select * from pg_timezone_abbrevs where abbrev = 'IST';
 abbrev | utc_offset | is_dst
--------+------------+--------
 IST    | 02:00:00   | f

In my system, we receive TZ in abbreviated format (3 characters, like EST, PST, ...). I tried changing timezone_abbreviations = 'India', and then it worked fine (IST gives a +05:30 offset).

So, what is recommended: use names instead of abbreviations in the TZ conversion function, or change timezone_abbreviations to 'India'?

*Regards,*
*Rajin*
Re: libpq copy error handling busted
On Thu, Jun 4, 2020 at 6:22 PM Oleksandr Shulgin wrote: > On Thu, Jun 4, 2020 at 5:37 AM Thomas Munro wrote: >> Here's what I tested. First, I put this into pgdata/postgresql.conf: > Would it be feasible to capture this in a sort of a regression (TAP?) test? If I'm remembering correctly, it wouldn't work on Windows. After you've had an error sending to a socket, you can't receive (even if there was something sent to you earlier). At least that's how it seemed from the experiments on that other thread. The other problem is that it requires inserting a sleep into the code...
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Jun 3, 2020 at 2:43 PM Amit Kapila wrote:
> On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar wrote:
> > On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila wrote:
> > > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar wrote:
> > > >
> > > > I think for our use case BufFileCreateShared is more suitable. I think
> > > > we need to make some modifications so that we can use these APIs without
> > > > a SharedFileSet. Otherwise, we would unnecessarily need to create a
> > > > SharedFileSet for each transaction and also maintain it in an xid
> > > > array or xid hash until transaction commit/abort. So I suggest the
> > > > following modifications to the shared file set so that we can
> > > > conveniently use it:
> > > > 1. ChooseTablespace(const SharedFileSet *fileset, const char *name)
> > > >    If fileset is NULL then select DEFAULTTABLESPACE_OID.
> > > > 2. SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace)
> > > >    If fileset is NULL then in the directory path we can use MyProcPid or
> > > >    something instead of fileset->creator_pid.
> > >
> > > Hmm, I find these modifications a bit ad-hoc. So, not sure if it is
> > > better than having the patch maintain the sharedfileset information.
> >
> > I think we might do something better here, maybe by supplying a function
> > pointer or so, but maintaining a sharedfileset which contains a
> > tablespace/mutex that we don't need at all for our purpose also
> > doesn't sound very appealing.
>
> I think we can say something similar for the Relation (rel cache entry as
> well) maintained in LogicalRepRelMapEntry. I think we only need a
> pointer to that information.

Yeah, I see.

> > Let me see if I can come up with some clean way of avoiding the need
> > for a shared fileset; otherwise maybe we can go with the shared fileset idea.
>
> Fair enough.
While evaluating it further, I feel there are a few more problems to solve if we are using BufFile. The first is that in the subxact file we maintain the information of each xid and its offset in the changes file. So now we will also have to store the 'fileno', but we can find that using BufFileTell. Yet another problem is that currently we don't have a truncate operation for BufFile, but we need one if a sub-transaction gets aborted. I think we can implement an extra interface for BufFile, and it should not be very hard as we already know the fileno and the offset. I will evaluate this part further and let you know about the same.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Is it useful to record whether plans are generic or custom?
On Mon, May 25, 2020 at 10:54 AM Atsushi Torikoshi wrote:
> On Thu, May 21, 2020 at 5:10 PM Kyotaro Horiguchi wrote:
>
>> Cost numbers would look better if it is cooked a bit. Is it worth
>> being in core?
>
> I didn't come up with ideas about how to use them.
> IMHO they might not be so necessary..
>
>> Perhaps plan_generation is not needed there.
>
> +1.
> Now 'calls' is sufficient and 'plan_generation' may confuse users.
>
> BTW, considering that 'calls' in pg_stat_statements is the number of times
> statements were EXECUTED and 'plans' is the number of times
> statements were PLANNED, how about substituting 'calls' with 'plans'?

I've modified the above points and also exposed the counts of generic plans and custom plans. I'm now a little bit worried about the change below, which removed the overflow check for num_custom_plans introduced in 0001-Expose-counters-of-plancache-to-pg_prepared_statement.patch, but I've left the check out because the maximum of int64 seems large enough for counters. Other counters, such as those in pg_stat_user_tables, don't seem to care about overflow either.

```
-       /* Accumulate total costs of custom plans, but 'ware overflow */
-       if (plansource->num_custom_plans < INT_MAX)
-       {
-               plansource->total_custom_cost += cached_plan_cost(plan, true);
-               plansource->num_custom_plans++;
-       }
```

Regards,
Atsushi Torikoshi

> 0003-Expose-counters-of-plancache-to-pg_prepared_statement.patch
Description: Binary data
Possible bug on Postgres 12 (CASE THEN evaluated prematurely) - Change of behaviour compared to 11, 10, 9
Greetings,

Our system uses an EAV-like database and generates queries like the example below. As you can see, the query includes casts. Testing with Postgres 12, we noticed that the casts in the CASE THEN branches (commented out below) were failing in some cases; of course, if the INNER JOIN and CASE WHEN are applied first, our expectation is that the value can be cast. Changing INNER JOIN to LEFT JOIN solved the issue in Postgres 12; with earlier versions of Postgres, INNER JOIN worked perfectly. Has somebody already reported anything like this? Maybe an issue with some optimisation?

Best,
Juan

Example query:

SELECT DISTINCT t0.id FROM samples t0
INNER JOIN sample_properties t1 ON t0.id = t1.samp_id
INNER JOIN sample_type_property_types t2 ON t1.stpt_id = t2.id
INNER JOIN property_types t3 ON t2.prty_id = t3.id
INNER JOIN data_types t4 ON t3.daty_id = t4.id
LEFT JOIN controlled_vocabulary_terms t5 ON t1.cvte_id = t5.id
LEFT JOIN materials t6 ON t1.mate_prop_id = t6.id
INNER JOIN sample_properties t7 ON t0.id = t7.samp_id
INNER JOIN sample_type_property_types t8 ON t7.stpt_id = t8.id
INNER JOIN property_types t9 ON t8.prty_id = t9.id
INNER JOIN data_types t10 ON t9.daty_id = t10.id
LEFT JOIN controlled_vocabulary_terms t11 ON t7.cvte_id = t11.id
LEFT JOIN materials t12 ON t7.mate_prop_id = t12.id
INNER JOIN sample_properties t13 ON t0.id = t13.samp_id
INNER JOIN sample_type_property_types t14 ON t13.stpt_id = t14.id
INNER JOIN property_types t15 ON t14.prty_id = t15.id
INNER JOIN data_types t16 ON t15.daty_id = t16.id
LEFT JOIN controlled_vocabulary_terms t17 ON t13.cvte_id = t17.id
LEFT JOIN materials t18 ON t13.mate_prop_id = t18.id
INNER JOIN sample_properties t19 ON t0.id = t19.samp_id
INNER JOIN sample_type_property_types t20 ON t19.stpt_id = t20.id
INNER JOIN property_types t21 ON t20.prty_id = t21.id
INNER JOIN data_types t22 ON t21.daty_id = t22.id
LEFT JOIN controlled_vocabulary_terms t23 ON t19.cvte_id = t23.id
LEFT JOIN materials t24 ON
t19.mate_prop_id = t24.id INNER JOIN sample_properties t25 ON t0.id = t25.samp_id INNER JOIN sample_type_property_types t26 ON t25.stpt_id = t26.id INNER JOIN property_types t27 ON t26.prty_id = t27.id INNER JOIN data_types t28 ON t27.daty_id = t28.id LEFT JOIN controlled_vocabulary_terms t29 ON t25.cvte_id = t29.id LEFT JOIN materials t30 ON t25.mate_prop_id = t30.id WHERE t0.saty_id IN (SELECT unnest(ARRAY[5])) AND t3.is_internal_namespace = true AND t3.code = 'STORAGE_POSITION.STORAGE_CODE' AND (lower(t1.value) = 'default_storage' OR lower(t5.code) = 'default_storage' OR lower(t6.code) = 'default_storage') AND t7.stpt_id = (SELECT id FROM sample_type_property_types WHERE saty_id = 5 AND prty_id = (SELECT id FROM property_types WHERE is_internal_namespace = true AND code = 'STORAGE_POSITION.STORAGE_RACK_ROW')) AND t7.value::numeric = 1 -- AND --CASE WHEN t9.is_internal_namespace = true -- AND t9.code = 'STORAGE_POSITION.STORAGE_RACK_ROW' -- AND (t10.code = 'INTEGER' OR t10.code = 'REAL') --THEN t7.value::numeric = 1 --ELSE false --END AND t13.stpt_id = (SELECT id FROM sample_type_property_types WHERE saty_id = 5 AND prty_id = (SELECT id FROM property_types WHERE is_internal_namespace = true AND code = 'STORAGE_POSITION.STORAGE_RACK_COLUMN')) AND t13.value::numeric = 2 -- AND --CASE WHEN t15.is_internal_namespace = true -- AND t15.code = 'STORAGE_POSITION.STORAGE_RACK_COLUMN' -- AND (t16.code = 'INTEGER' OR t16.code = 'REAL') --THEN t13.value::numeric = 2 --ELSE false --END AND t21.is_internal_namespace = true AND t21.code = 'STORAGE_POSITION.STORAGE_BOX_NAME' AND (lower(t19.value) = 'box2' OR lower(t23.code) = 'box2' OR lower(t24.code) = 'box2') AND t27.is_internal_namespace = true AND t27.code = 'STORAGE_POSITION.STORAGE_BOX_POSITION' AND (t25.value ILIKE '%a3%' OR t29.code ILIKE '%a3%' OR t30.code ILIKE '%a3%');
Re: Read access for pg_monitor to pg_replication_origin_status view
Hi, Martin.

At Wed, 3 Jun 2020 13:32:28 -0300, Martín Marqués wrote in
> Hi Kyotaro-san,
>
> Thank you for taking the time to review my patches. Would you like to
> set yourself as a reviewer in the commitfest entry here?
> https://commitfest.postgresql.org/28/2577/

Done.

> > 0002:
> >
> > It forgets to grant pg_read_all_stats the right to execute
> > pg_show_replication_origin_status. As a result, pg_read_all_stats
> > gets an error on executing the function, not on doing a select on the view.
>
> Seems I was testing on a cluster where I had already been digging, so
> pg_read_all_stats had execute privileges on
> pg_show_replication_origin_status (I had manually granted that) and I
> didn't notice because I forgot to drop the cluster and initialize it
> again.
>
> Thanks for the pointer here!

Sorry for not mentioning it at that time, but about the following diff:

+GRANT SELECT ON pg_replication_origin_status TO pg_read_all_stats;

system_views.sql already has a REVOKE command on the view. We should put the above just below that REVOKE command. I'm not sure where to put the GRANT on pg_show_replication_origin_status(), but maybe it also should be at the same place.

> > 0003:
> >
> > It seems to be modeled on adminpack's documentation, but a
> > superuser is not the only privileged role in PostgreSQL. Something like the
> > description in 27.2.2 Viewing Statistics looks more suitable:
> >
> > > Superusers and members of the built-in role pg_read_all_stats (see
> > > also Section 21.5) can see all the information about all sessions.
> >
> > Section 21.5 already says the following:
> >
> > > pg_monitor
> > > Read/execute various monitoring views and functions. This role is a
> > > member of pg_read_all_settings, pg_read_all_stats and
> > > pg_stat_scan_tables.
>
> I'm not sure if I got this right, but I added some more text to point
> out that the pg_read_all_stats role can also access one specific
> function.
I personally think it's a bit too detailed, and if we wanted
> to add details it should be formatted differently, which would require
> a more invasive patch (I would prefer leaving that out, as it might even
> mean moving parts which are not part of this patch).
>
> In any case, I hope the change fits what you've kindly pointed out.

I forgot to mention it at that time, but pg_show_replication_origin_status is a function that backs a system view, like pg_stat_get_activity(), pg_show_all_file_settings() and so on. Such functions are not documented since users don't need to call them directly; pg_show_replication_origin_status is not documented for the same reason. Thus we don't need to mention the function.

In the previous comment, one point I meant is that "to the superuser" should be "to superusers", because a PostgreSQL server (cluster) can define multiple superusers. Another is that "permitted to other users by using the GRANT command." might be obscure for users. In this regard I found a more specific description in the same file:

    Computes the total disk space used by the database with the specified
    name or OID. To use this function, you must have CONNECT privilege on
    the specified database (which is granted by default) or be a member of
    the pg_read_all_stats role.

So, as a result, it would be like the following (note that, as you know, I'm not good at this kind of task..):

    Use of functions for replication origins is restricted to superusers.
    Use of these functions may be permitted to other users by granting
    EXECUTE privilege on the functions.

And in regard to the view, granting privileges on both the view and the function to individual users is not practical, so we should mention only granting pg_read_all_stats to users, as in the attached patch.

> > 0004:
> >
> > Looks fine by me.
>
> Here goes v3 of the patch

By the way, the attachments of your mail are out of order. I'm not sure whether that causes anything bad, though.
--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 700271fd40..66679e8045 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -11009,8 +11009,10 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
    The pg_replication_origin_status view
    contains information about how far replay for a certain origin has
-   progressed.  For more on replication origins
-   see .
+   progressed.  Use of this view is restricted to superusers and
+   pg_read_all_stats role. To use this view, you must be a member of
+   the pg_read_all_stats role.  For more on replication
+   origins see .
Re: Atomic operations within spinlocks
On Thu, Jun 04, 2020 at 09:40:31AM +1200, Thomas Munro wrote:
> Yeah. It'd be fine to move that after the spinlock release. Although
> it's really just for informational purposes only, not for any data
> integrity purpose, reading it before the spinlock acquisition would
> theoretically allow it to appear to be (reportedly) behind
> flushedUpto, which would be silly.

Indeed. This could just be done after the spinlock section. Sorry about that.
--
Michael