Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
Hi

> The default should always be to shrink, unless either the VACUUM
> option or the reloption turn that off. (So it doesn't make sense to set
> either the VACUUM option or the reloption to "on").

I think the VACUUM option can be set to "on" by hand in order to override
the reloption only for this VACUUM call.

regards, Sergei
Re: BUG #15623: Inconsistent use of default for updatable view
On 2019/02/27 18:37, Dean Rasheed wrote:
> On Tue, 12 Feb 2019 at 10:33, Dean Rasheed wrote:
>> Here's an updated patch ...
>
> So I pushed that. However, ...
>
> Playing around with it some more, I realised that whilst this appeared
> to fix the reported problem, it exposes another issue which is down to
> the interaction between rewriteTargetListIU() and rewriteValuesRTE()
> --- for an INSERT with a VALUES RTE, rewriteTargetListIU() computes a
> list of target attribute numbers (attrno_list) from the targetlist in
> its original order, which rewriteValuesRTE() then relies on being the
> same length (and in the same order) as each of the lists in the VALUES
> RTE. That's OK for the initial invocation of those functions, but it
> breaks down when they're recursively invoked for auto-updatable views.
>
> So actually, I think the right way to fix this is to give up trying to
> compute attrno_list in rewriteTargetListIU(), and have
> rewriteValuesRTE() work out the attribute assignments itself for each
> column of the VALUES RTE by examining the rewritten targetlist. That
> looks to be quite straightforward, and results in a cleaner separation
> of logic between the 2 functions, per the attached patch.

+1. Only rewriteValuesRTE() needs to use that information, so it's better
to teach it to figure it out by itself.

> I think that rewriteValuesRTE() should only ever see DEFAULT items for
> columns that are simple assignments to columns of the target relation,
> so it only needs to work out the target attribute numbers for TLEs
> that contain simple Vars referring to the VALUES RTE. Any DEFAULT seen
> in a column not matching that would be an error, but actually I think
> such a thing ought to be a "can't happen" error because of the prior
> checks during parse analysis, so I've used elog() to report if this
> does happen.

+        if (attrno == 0)
+            elog(ERROR, "Cannot set value in column %d to DEFAULT", i);

Maybe: s/Cannot/cannot/g

+        Assert(list_length(sublist) == numattrs);

Isn't this Assert useless, given that we're setting numattrs to
list_length() and transformInsertStmt ensures that all sublists are the
same length?

Thanks, Amit
Re: Drop type "smgr"?
On Thu, Feb 28, 2019 at 5:37 PM Tom Lane wrote:
> Thomas Munro writes:
> > On Thu, Feb 28, 2019 at 7:08 PM Tom Lane wrote:
> >> I agree that smgrtype as it stands is pretty pointless, but what
> >> will we be using instead to get to those other implementations?
>
> > Our current thinking is that smgropen() should know how to map a small
> > number of special database OIDs to different smgr implementations
>
> Hmm. Maybe mapping based on tablespaces would be a better idea?

Thanks for bringing up this idea of multiple smgr implementations. I have
also thought of implementing our own smgr implementation, based on
tablespace mapping, to support transparent data encryption on disk.

Regards, Haribabu Kommi
Fujitsu Australia
Re: Drop type "smgr"?
On Thu, Feb 28, 2019 at 7:37 PM Tom Lane wrote:
> Thomas Munro writes:
> > On Thu, Feb 28, 2019 at 7:08 PM Tom Lane wrote:
> >> I agree that smgrtype as it stands is pretty pointless, but what
> >> will we be using instead to get to those other implementations?
>
> > Our current thinking is that smgropen() should know how to map a small
> > number of special database OIDs to different smgr implementations
>
> Hmm. Maybe mapping based on tablespaces would be a better idea?

In the undo log proposal (about which more soon) we are using tablespaces
for their real purpose, so we need that OID. If you SET undo_tablespaces =
foo then future undo data created by your session will be written there,
which might be useful for putting that IO on different storage. We also
use the relation OID to chop up the undo address space into separate
numbered undo logs, so that different sessions get their own space to
insert data without trampling on each other's buffer locks. That leaves
only the database OID to mess with (unless we widen buffer tags and the
smgropen() interface).

> > Unlike the ancestral code, it wouldn't need to appear in
> > catalogs or ever be seen or typed in by users so there still wouldn't
> > be a use for this type.
>
> Yeah, the $64 problem at this level is that you don't really want
> to depend on catalog contents, because you cannot do a lookup to
> find out what to do.

Right. It would be a circular dependency if you had to read a catalog
before you could consult low level SLRUs like (say) the clog* via shared
buffers, since you need the clog to read the catalog, or (to peer a long
way down the road and around several corners) if the catalogs themselves
could optionally be stored in an undo-aware table access manager like
zheap, which might require reading undo. I think this block storage stuff
exists at a lower level than relations and transactions and therefore also
catalogs, and there will be a small fixed number of them and it makes
sense to hard-code the knowledge of them.

* I'm at least a little bit aware of the history here: your 2001 commit
2589735d moved clog out of shared buffers! That enabled you to develop
the system of segment files truncated from the front. That's sort of what
this new smgr work is about; putting things in shared buffers, but just
mapping the blocks to paths differently than md.c, as appropriate.

-- Thomas Munro
https://enterprisedb.com
Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
Alvaro Herrera wrote:
> I think we should have a VACUUM option and a reloption, but no
> GUC. The default should always be to shrink, unless either the VACUUM
> option or the reloption turn that off. (So it doesn't make sense to set
> either the VACUUM option or the reloption to "on").

+1

Yours, Laurenz Albe
Re: bgwriter_lru_maxpages limits in PG 10 sample conf
Hello

postgresql.conf.sample was changed recently in REL_10_STABLE (commit
ab1d9f066aee4f9b81abde6136771debe0191ae8), so the config file will be
changed in the next minor release anyway. Do we have any other reason not
to fix the bgwriter_lru_maxpages comment?

regards, Sergei
Re: Drop type "smgr"?
Thomas Munro writes:
> On Thu, Feb 28, 2019 at 7:08 PM Tom Lane wrote:
>> I agree that smgrtype as it stands is pretty pointless, but what
>> will we be using instead to get to those other implementations?

> Our current thinking is that smgropen() should know how to map a small
> number of special database OIDs to different smgr implementations

Hmm. Maybe mapping based on tablespaces would be a better idea?

> Unlike the ancestral code, it wouldn't need to appear in
> catalogs or ever be seen or typed in by users so there still wouldn't
> be a use for this type.

Yeah, the $64 problem at this level is that you don't really want
to depend on catalog contents, because you cannot do a lookup to
find out what to do. So I agree that we're pretty unlikely to resurrect
an smgr type per se. But I'd been expecting an answer mentioning pg_am
OIDs, and was wondering how that'd work exactly. Probably, it would
still be down to some C code having hard-wired knowledge about some
OIDs ...

regards, tom lane
Re: POC: converting Lists into arrays
Andrew Gierth writes:
> To get a reliable measurement of timing changes less than around 3%,
> what you have to do is this: pick some irrelevant function and add
> something like an asm directive that inserts a variable number of NOPs,
> and do a series of test runs with different values.

Good point. If you're looking at a microbenchmark that only exercises a
small amount of code, it can be way worse than that. I was reminded of
this the other day while fooling with the problem discussed in

https://www.postgresql.org/message-id/flat/6970.1545327...@sss.pgh.pa.us

in which we were spending huge amounts of time in a tight loop in
match_eclasses_to_foreign_key_col. I normally run with --enable-cassert
unless I'm trying to collect performance data; so I rebuilt with
--disable-cassert, and was bemused to find out that that test case ran
circa 20% *slower* in the non-debug build. This is silly on its face, and
even more so when you notice that match_eclasses_to_foreign_key_col
itself contains no Asserts and so its machine code is unchanged by the
switch. (I went to the extent of comparing .s files to verify this.) So
that had to have been down to alignment/cacheline issues triggered by
moving said function around. I doubt the case would be exactly
reproducible on different hardware or toolchain, but another platform
would likely show similar issues on some case or other.

tl;dr: even a 20% difference might be nothing more than an artifact.

regards, tom lane
Re: Drop type "smgr"?
On Thu, Feb 28, 2019 at 7:08 PM Tom Lane wrote:
> Thomas Munro writes:
> > Motivation: A couple of projects propose to add new smgr
> > implementations alongside md.c in order to use bufmgr.c for more kinds
> > of files, but it seems entirely bogus to extend the unused smgr type
> > to cover those.
>
> I agree that smgrtype as it stands is pretty pointless, but what
> will we be using instead to get to those other implementations?

Our current thinking is that smgropen() should know how to map a small
number of special database OIDs to different smgr implementations (where
currently it hard-codes smgr_which = 0). For example there would be a
pseudo-database for undo logs, and another for the SLRUs that Shawn
Debnath and others have been proposing to move into shared buffers.
Another idea would be to widen the buffer tag to include the selector.
Unlike the ancestral code, it wouldn't need to appear in catalogs or ever
be seen or typed in by users so there still wouldn't be a use for this
type.

-- Thomas Munro
https://enterprisedb.com
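To make the dispatch idea above concrete, here is a minimal sketch of the
kind of hard-coded mapping being described, written against the existing
RelFileNode type. The pseudo-database OID constants and the smgr slot
numbers are invented for illustration only; they are not taken from any
posted patch.

    /*
     * Sketch only: how smgropen() could pick an smgr implementation from
     * a small set of reserved pseudo-database OIDs instead of always
     * using smgr_which = 0 (md.c).
     */
    #include "postgres.h"
    #include "storage/relfilenode.h"

    #define UNDO_PSEUDO_DB_OID  9000    /* hypothetical reserved OID */
    #define SLRU_PSEUDO_DB_OID  9001    /* hypothetical reserved OID */

    static int
    smgr_which_for(RelFileNode rnode)
    {
        switch (rnode.dbNode)
        {
            case UNDO_PSEUDO_DB_OID:
                return 1;               /* undo log smgr */
            case SLRU_PSEUDO_DB_OID:
                return 2;               /* SLRU smgr */
            default:
                return 0;               /* md.c, as today */
        }
    }

Because the selection depends only on fields already present in the buffer
tag, no catalog lookup (and hence no circular dependency) is needed.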
RE: Prevent extension creation in temporary schemas
Dear Michael, Chris and Tom,

> Adding special cases to extensions strikes me as adding more
> funny corners to the behavior of the db in this regard.

I understand your arguments and the utility of this behavior.

> For most of extensions, this can randomly finish with strange error
> messages, say that:
> =# create extension file_fdw with schema pg_temp_3;
> ERROR: 42883: function file_fdw_handler() does not exist
> LOCATION: LookupFuncName, parse_func.c:2088

I found that this strange error appears after creating a temporary table:

test=> CREATE TEMPORARY TABLE temp (id int);
CREATE TABLE
test=> CREATE EXTENSION file_fdw WITH SCHEMA pg_temp_3;
ERROR: function file_fdw_handler() does not exist

I will try to understand this problem, both for the community and for my
own experience.

Best Regards, Hayato Kuroda
Fujitsu LIMITED
Re: Drop type "smgr"?
Thomas Munro writes:
> Motivation: A couple of projects propose to add new smgr
> implementations alongside md.c in order to use bufmgr.c for more kinds
> of files, but it seems entirely bogus to extend the unused smgr type
> to cover those.

I agree that smgrtype as it stands is pretty pointless, but what
will we be using instead to get to those other implementations?

regards, tom lane
Drop type "smgr"?
Hello hackers,

The type smgr has only one value 'magnetic disk'. ~15 years ago it also
had a value 'main memory', and in Berkeley POSTGRES 4.2 there was a third
value 'sony jukebox'. Back then, all tables had an associated block
storage manager, and it was recorded as an attribute relsmgr of pg_class
(or pg_relation as it was known further back). This was the type of that
attribute, removed by Bruce in 3fa2bb31 (1997). Nothing seems to break if
you remove it (except for some tests using it in an incidental way). See
attached.

Motivation: A couple of projects propose to add new smgr implementations
alongside md.c in order to use bufmgr.c for more kinds of files, but it
seems entirely bogus to extend the unused smgr type to cover those.

-- Thomas Munro
https://enterprisedb.com

0001-Remove-the-vestigial-smgr-type.patch
Description: Binary data
Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
Alvaro Herrera writes:
> Hopefully we'll get Tom's patch that addresses the failure-to-truncate
> issues in pg12.

Hm, are you speaking of the handwaving I did in

https://www.postgresql.org/message-id/2348.1544474...@sss.pgh.pa.us

? I wasn't really working on that for v12 --- I figured it was way too
late in the cycle to be starting on such a significant change. Still, if
we did manage to make that work, it would remove the need for
user-visible kluges like the one discussed in this thread.

regards, tom lane
Re: POC: converting Lists into arrays
> "David" == David Rowley writes: David> I went and had a few adventures with this patch to see if I David> could figure out why the small ~1% regression exists. Just changing the number of instructions (even in a completely unrelated place that's not called during the test) can generate performance variations of this size, even when there's no real difference. To get a reliable measurement of timing changes less than around 3%, what you have to do is this: pick some irrelevant function and add something like an asm directive that inserts a variable number of NOPs, and do a series of test runs with different values. See http://tinyurl.com/op9qg8a for an example of the kind of variation that one can get; this plot records timing runs where each different padding size was tested 3 times (non-consecutively, so you can see how repeatable the test result is for each size), each timing is actually the average of the last 10 of 11 consecutive runs of the test. To establish a 1% performance benefit or regression you need to show that there's still a difference _AFTER_ taking this kind of spooky-action-at-a-distance into account. For example, in the test shown at the link, if a substantive change to the code moved the upper and lower bounds of the output from (6091,6289) to (6030,6236) then one would be justified in claiming it as a 1% improvement. Such is the reality of modern CPUs. -- Andrew (irc:RhodiumToad)
Re: POC: converting Lists into arrays
David Rowley writes:
> I went and had a few adventures with this patch to see if I could
> figure out why the small ~1% regression exists.

Thanks for poking around!

> ... I had
> suspected it was the lcons() calls being expensive because they need
> to push the elements up one place each time, not something that'll
> scale well with larger lists.

I just did some looking around at lcons() calls, and it's hard to
identify any that seem like they would be performance-critical. I did
notice a number of places that think that lcons'ing an item onto a list,
and later stripping it off with list_delete_first, is efficient. With the
new implementation it's far cheaper to lappend and then list_truncate
instead, at least if the list is long. If the list order matters then
that's not an option, but I found some places where it doesn't matter so
we could get an easy win. Still, it wasn't obvious that this would move
the needle at all.

> I then tried hacking at the foreach() macro after wondering if the
> lnext() call was somehow making things difficult for the compiler to
> predict what cell would come next.

Yeah, my gut feeling right now is that foreach() is producing crummy
code, though it's not obvious why it would need to. Maybe some
micro-optimization is called for. But I've not had time to pursue it.

> The only thing that I did to manage to speed the patch up was to ditch
> the additional NULL test in lnext(). I don't see why that's required
> since lnext(NULL) would have crashed with the old implementation.

Hmmm ...

> Perhaps if we're not going to see gains from the patch alone then
> we'll need to tag on some of the additional stuff that will take
> advantage of list_nth() being fast and test the performance of it all
> again.

Yeah, evaluating this patch in complete isolation is a bit unfair.
Still, it'd be nice to hold the line in advance of any follow-on
improvements.

regards, tom lane
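A minimal sketch of the rewrite being suggested, using the existing
pg_list.h API; the worklist variable and process_item() are invented for
illustration, and the transformation is only valid where, as noted above,
the processing order genuinely doesn't matter.

    #include "postgres.h"
    #include "nodes/pg_list.h"

    extern void process_item(void *item);   /* invented for the example */

    /*
     * Before: push with lcons() and pop with list_delete_first(), both of
     * which shift every element in the proposed array-based lists.
     * After: push with lappend() and pop from the tail with
     * list_truncate(), which stay cheap regardless of list length.
     */
    static void
    drain_worklist(List *worklist)
    {
        while (list_length(worklist) > 0)
        {
            void   *item = llast(worklist);     /* was: linitial(worklist) */

            process_item(item);

            /* was: worklist = list_delete_first(worklist); */
            worklist = list_truncate(worklist, list_length(worklist) - 1);
        }
    }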
Re: partitioned tables referenced by FKs
Hi Alvaro,

I looked at the latest patch and most of the issues/bugs that I was going
to report based on the late January version of the patch seem to have
been taken care of; sorry that I couldn't post sooner, which would've
saved you some time.

The patch needs to be rebased on top of ff11e7f4b9 which touched
ri_triggers.c.

In the following case:

create table pk (a int primary key) partition by list (a);
create table pk1 (a int primary key);
create table fk (a int references pk1);

-- adds FK referencing pk1 via ATAddForeignKeyConstraint recursion
alter table pk attach partition pk1 for values in (1);
alter table fk add foreign key (a) references pk;

or:

-- adds FK referencing pk1 via CloneForeignKeyConstraints
alter table fk add foreign key (a) references pk;
alter table pk attach partition pk1 for values in (1);

\d fk
                 Table "public.fk"
 Column │  Type   │ Collation │ Nullable │ Default
────────┼─────────┼───────────┼──────────┼─────────
 a      │ integer │           │          │
Foreign-key constraints:
    "fk_a_fkey" FOREIGN KEY (a) REFERENCES pk1(a)
    "fk_a_fkey1" FOREIGN KEY (a) REFERENCES pk(a)
    "fk_a_fkey2" FOREIGN KEY (a) REFERENCES pk1(a)

Could we have avoided creating fk_a_fkey2, which must be the same as
fk_a_fkey as far as the user-visible behavior is concerned?

* Regarding 0003

I'd like to hear your thoughts on some suggestions to alter the structure
of the reorganized code around foreign key addition/cloning. With this
patch adding support for foreign keys to reference partitioned tables,
the code now has to consider various cases due to the possibility of
having partitioned tables on both sides of a foreign key, which is
reflected in the complexity of the new code.

The following functions have been introduced in the foreign key DDL code
in the course of adding support for partitioning:

addFkRecurseReferenced
addFkRecurseReferencing
CloneForeignKeyConstraints
CloneFkReferencing
CloneFkReferenced
tryAttachPartitionForeignKey

We have addFkRecurseReferencing and CloneForeignKeyConstraints calling
CloneFkReferencing to recursively add a foreign key constraint being
defined on (or being cloned to) a partitioned table to its partitions.
The latter invokes tryAttachPartitionForeignKey to see if an existing
foreign key constraint is functionally the same as the one being cloned,
to avoid creating a duplicate constraint.

However, we have CloneFkReferenced() calling addFkRecurseReferenced, not
the other way around. Neither of these functions attempts to reuse an
existing, functionally equivalent, foreign key constraint referencing a
given partition that's being recursed to. So we get the behavior in the
above example, which again, I'm not sure is intentional.

Also, it seems a bit confusing that there is a CreateConstraintEntry call
in addFkRecurseReferenced() which is creating a constraint on the
*referencing* relation, which I think should be in
ATAddForeignKeyConstraint, the caller. I know we need to create a copy of
the constraint to reference the partitions of the referenced table, but
we might as well put it in CloneFkReferenced and reverse who calls who --
make addFkRecurseReferenced() call CloneFkReferenced and have the code to
create the cloned constraint and action triggers in the latter. That will
make the code to handle the different sides of a foreign key look
similar, and imho, easier to follow.

+/*
+ * addFkRecurseReferenced
+ *		recursive subroutine for ATAddForeignKeyConstraint, referenced side

How about: Subroutine for ATAddForeignKeyConstraint to add an action
trigger on the referenced relation, recursing if partitioned, in which
case both the constraint referencing a given partition and the action
trigger are cloned in all partitions

+/*
+ * addFkRecurseReferencing
+ *		recursive subroutine for ATAddForeignKeyConstraint, referencing side

How about: Subroutine for ATAddForeignKeyConstraint to add a check
trigger on the referencing relation, recursing if partitioned, in which
case both the constraint and the check trigger are cloned in all
partitions

I noticed however that this function itself is not recursively called;
CloneFkReferencing handles the recursion.

/*
 * CloneForeignKeyConstraints
 *		Clone foreign keys from a partitioned table to a newly acquired
 *		partition.

Maybe: Clone foreign key of (or referencing) a partitioned table to a
newly acquired partition

* In 0002,

/*
+ * Return the OID of the index, in the given partition, that is a child of the
+ * given index or InvalidOid if there isn't one.
+ */
+Oid
+index_get_partition(Relation partition, Oid indexId)

Maybe add a comma between "...given index" and "or InvalidOid", or maybe
rewrite the sentence as: Return the OID of the index of the given
partition that is a child of the given index, or InvalidOid if there
isn't one.

* Unrelated to this thread, but while reviewing, I noticed this code in
ATExecAttachPartitionIdx:

/* no deadlock risk: RangeVarGetRelidExtended already acquire
Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
On 2019-Feb-28, Tsunakawa, Takayuki wrote:
> From: Michael Paquier [mailto:mich...@paquier.xyz]
> > So could you consider adding an option for the VACUUM command as well
> > as vacuumdb? The interaction with the current patch is that you need to
> > define the behavior at the beginning of vacuum for a given heap, instead
> > of reading the parameter at the time the truncation happens, and give
>
> I'm not confident whether this is the same as the above, I imagined this:
>
> * Add a new USERSET GUC vacuum_shrink_table = {on | off}, on by default.
> This follows the naming style "verb_object" like log_connections and
> enable_xxx. We may want to use enable_vacuum_shrink or something like
> that, but enable_xxx seems to be used solely for planner control. Plus,
> vacuum-related parameters seem to start with vacuum_.

Robert used the phrase "attractive nuisance", which maybe sounds like a
good thing to have to a non-native speaker, but it actually isn't -- he
was saying we should avoid a GUC at all, and I can see the reason for
that.

I think we should have a VACUUM option and a reloption, but no GUC. The
default should always be to shrink, unless either the VACUUM option or
the reloption turn that off. (So it doesn't make sense to set either the
VACUUM option or the reloption to "on").

Disclaimer: I did write roughly the same patch using both a GUC and a
VACUUM option, though I named my GUC truncate_on_vacuum and the VACUUM
option "truncate_anyway" (so you can turn truncation off globally, then
enable it selectively at manual vacuum execution time, but not
autovacuum). However, the reasons for doing this were concerns about
robustness caused by data corruption induced by failing to truncate pages
containing removed tuples ... not performance improvement, as in your
patch. So they wanted to turn off truncation for *all* tables, not just a
select few.

Hopefully we'll get Tom's patch that addresses the failure-to-truncate
issues in pg12.

-- Álvaro Herrera            https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: POC: converting Lists into arrays
On Tue, 26 Feb 2019 at 18:34, Tom Lane wrote:
>
> David Rowley writes:
> > Using the attached patch (as text file so as not to upset the CFbot),
> > which basically just measures and logs the time taken to run
> > pg_plan_query. ...
> > Surprisingly it took 1.13% longer. I did these tests on an AWS
> > md5.large instance.
>
> Interesting. Seems to suggest that maybe the cases I discounted
> as being infrequent aren't so infrequent? Another possibility
> is that the new coding adds more cycles to foreach() loops than
> I'd hoped for.

I went and had a few adventures with this patch to see if I could figure
out why the small ~1% regression exists. Profiling did not prove very
useful as I saw none of the list functions come up. I had suspected it
was the lcons() calls being expensive because they need to push the
elements up one place each time, not something that'll scale well with
larger lists. After changing things so that a new "first" element index
in the List would allow new_head_cell() to just move everything to the
end of the list and mark the start of the actual data... I discovered
that slowed things down further... Likely due to all the additional
arithmetic work required to find the first element.

I then tried hacking at the foreach() macro after wondering if the
lnext() call was somehow making things difficult for the compiler to
predict what cell would come next. I experimented with the following
monstrosity:

for ((cell) = list_head(l);
     ((cell) &&
      (cell) < &((List *) l)->elements[((List *) l)->first + ((List *) l)->length]) ||
     (cell = NULL) != NULL;
     cell++)

it made things worse again... It ended up much more ugly than I thought
it would have, as I had to account for an empty list being NIL and the
fact that we need to make cell NULL after the loop is over.

I tried a few other things... I didn't agree with your memmove() in
list_concat(). I think memcpy() is fine, even when the list pointers are
the same, since we never overwrite any live cell values. Strangely I
found memcpy slower than memmove... ?

The only thing that I did to manage to speed the patch up was to ditch
the additional NULL test in lnext(). I don't see why that's required
since lnext(NULL) would have crashed with the old implementation.
Removing this changed the 1.13% regression into a ~0.8% regression, which
at least does show that the foreach() implementation can have an effect
on performance.

> Anyway, it's just a POC; the main point at this stage is to be
> able to make such comparisons at all. If it turns out that we
> *can't* make this into a win, then all that bellyaching about
> how inefficient Lists are was misinformed ...

My primary concern is how much we bend over backwards because list_nth()
performance is not O(1). I know from my work on partitioning that
ExecInitRangeTable()'s building of es_range_table_array has a huge impact
for PREPAREd plans for simple PK lookup SELECT queries to partitioned
tables with a large number of partitions, of which only 1 will survive
run-time pruning. I could get the execution speed of such a query with
300 partitions to within 94% of the non-partitioned version if the rtable
could be looked up O(1) in the executor natively (that, plus some changes
to ExecCheckRTPerms() to have it skip rtable entries that don't require
permission checks).

Perhaps if we're not going to see gains from the patch alone then we'll
need to tag on some of the additional stuff that will take advantage of
list_nth() being fast and test the performance of it all again.

Attached is the (mostly worthless) series of hacks I made to your patch.
It might save someone some time if they happened to wonder the same thing
as I did.

-- David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

list_hacks.patch
Description: Binary data
Re: pgsql: Avoid creation of the free space map for small heap relations, t
On Thu, Feb 28, 2019 at 10:25 AM Amit Kapila wrote:
>
> Here's an updated patch based on comments by you. I will proceed with
> this unless you have any more comments.

Looks good to me. I would just adjust the grammar in the comment, from
"This prevents us to use the map", to "This prevents us from using the
map". Perhaps also a comma after "first".

-- John Naylor
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: POC: converting Lists into arrays
David Rowley writes:
> On Thu, 28 Feb 2019 at 09:26, Tom Lane wrote:
>> 0002 below does this. I'm having a hard time deciding whether this
>> part is a good idea or just code churn. It might be more readable
>> (especially to newbies) but I can't evaluate that very objectively.

> I'm less decided on this. Having this now means you'll need to break
> the signature of the macro the same way as you'll need to break
> lnext(). It's perhaps easier to explain in the release notes about
> lnext() having changed so that extension authors can go fix their code
> (probably they'll know already from compile failures, but ). On
> the other hand, if the list_cell_is_last() is new, then there will be
> no calls to that in extensions anyway. Maybe it's better to do it at
> the same time as the List reimplementation to ensure nobody needs to
> change anything twice?

Yeah, I was considering the idea of setting up the macro as
"list_cell_is_last(list, cell)" from the get-go, with the first argument
just going unused for the moment. That would be a good way to back-patch
it if we go through with this. On the other hand, if we end up not
pulling the trigger on the main patch, that'd look pretty silly ...

regards, tom lane
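A minimal sketch of what such a two-argument macro could look like
against the current linked-list implementation, with the List argument
deliberately unused so the signature survives a change of representation;
the exact names and the not-last variant are assumptions for
illustration, not the committed form.

    /*
     * Sketch only: list_cell_is_last() with a currently-unused List
     * argument.  Against today's linked-list cells it just checks
     * lnext(); an array-based implementation could later redefine it in
     * terms of the list's length without touching any callers.
     */
    #define list_cell_is_last(l, c)      ((void) (l), lnext(c) == NULL)
    #define list_cell_is_not_last(l, c)  ((void) (l), lnext(c) != NULL)

With that shape, code like "if (lnext(cell) == NULL)" becomes
"if (list_cell_is_last(list, cell))" and would not need a second round of
churn when the representation changes.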
Re: POC: converting Lists into arrays
On Thu, 28 Feb 2019 at 09:26, Tom Lane wrote:
>
> I wrote:
> > I did find a number of places where getting rid of explicit lnext()
> > calls led to just plain cleaner code. Most of these were places that
> > could be using forboth() or forthree() and just weren't. There's
> > also several places that are crying out for a forfour() macro, so
> > I'm not sure why we've stubbornly refused to provide it. I'm a bit
> > inclined to just fix those things in the name of code readability,
> > independent of this patch.
>
> 0001 below does this. I found a couple of places that could use
> forfive(), as well. I think this is a clear legibility and
> error-proofing win, and we should just push it.

I've looked over this and I agree that it's a good idea. Reducing the
number of lnext() usages seems like a good idea in order to reduce the
footprint of the main patch.

The only thing of interest that I saw during the review was the fact that
you've chosen to assign colexpr and coldefexpr before the continue in
get_tablefunc(). We may not end up using those values if we find an
ordinality column. I'm pretty sure it's not worth breaking the mould for
that case though, but just noting it anyway.

> > I also noticed that there's quite a lot of places that are testing
> > lnext(cell) for being NULL or not. What that really means is whether
> > this cell is last in the list or not, so maybe readability would be
> > improved by defining a macro along the lines of list_cell_is_last().
> > Any thoughts about that?
>
> 0002 below does this. I'm having a hard time deciding whether this
> part is a good idea or just code churn. It might be more readable
> (especially to newbies) but I can't evaluate that very objectively.
> I'm particularly unsure about whether we need two macros; though the
> way I initially tried it with just list_cell_is_last() seemed kind of
> double-negatively confusing in the places where the test needs to be
> not-last. Also, are these macro names too long, and if so what would
> be better?

I'm less decided on this. Having this now means you'll need to break the
signature of the macro the same way as you'll need to break lnext(). It's
perhaps easier to explain in the release notes about lnext() having
changed so that extension authors can go fix their code (probably they'll
know already from compile failures, but ). On the other hand, if the
list_cell_is_last() is new, then there will be no calls to that in
extensions anyway. Maybe it's better to do it at the same time as the
List reimplementation to ensure nobody needs to change anything twice?

> Also: if we accept either or both of these, should we back-patch the
> macro additions, so that these new macros will be available for use
> in back-patched code? I'm not sure that forfour/forfive have enough
> use-cases to worry about that; but the is-last macros would have a
> better case for that, I think.

I see no reason not to put forfour() and forfive() in the back branches.

-- David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Why don't we have a small reserved OID range for patch revisions?
Peter Eisentraut writes:
> On 2019-02-27 22:27, Tom Lane wrote:
>>> OID collision doesn't seem to be a significant problem (for me).
>> Um, I beg to differ. It's not at all unusual for pending patches to
>> bit-rot for no reason other than suddenly getting an OID conflict.
>> I don't have to look far for a current example:

> I'm not saying it doesn't happen, but that it's not a significant
> problem overall.

I do not think you are correct. It may not be a big problem across all
our incoming patches, but that's only because most of them don't have
anything to do with hand-assigned OIDs. Among those that do, I think
there is a significant problem.

To try to quantify this a bit, I looked through v12-cycle and pending
patches that touch the catalog data.

We've only committed 12 patches adding new hand-assigned OIDs since v11
was branched off. (I suspect that's lower than in a typical cycle, but
have not attempted to quantify things further back.) Of those, only two
seem to have needed OID adjustments after initial posting, but that's
mostly because most of them were committer-originated patches that got
pushed within a week or two. That's certainly not the typical wait time
for a patch submitted by anybody else. Also, a lot of these patches
recycled OIDs that'd recently been freed by patches such as the
abstime-ectomy, which means that the amount of OID conflict created for
pending patches is probably *really* low in this cycle-so-far, compared
to our historical norms.

Of what's in the queue to be reviewed right now, there are just 20 (out
of 150-plus) patches that touch the catalog/*.dat files. I got this
number by groveling through the cfbot's reports of patch applications,
to see which patches touched those files. It might omit some patches
that the cfbot failed to make sense of. Also, I'm pretty sure that a few
of these patches don't actually assign any new OIDs but just change
existing entries, or create only entries with auto-assigned OIDs. I did
not try to separate those out, however, since the point here is to
estimate for how many patches a committer would even need to think about
this.

Of those twenty patches, three have unresolved OID conflicts right now:

    multivariate MCV lists and histograms
    commontype polymorphics
    log10/hyper functions

Another one has recently had to resolve an OID conflict:

    SortSupport implementation on inet/cdir

which is notable considering that that thread is less than three weeks
old. (The log10 and commontype threads aren't really ancient either.)

I spent some extra effort looking at the patches that both create more
than a few new OIDs and have been around for awhile:

    Generic type subscripting
    KNN for btree
    Custom compression methods
    SQL/JSON: functions
    SQL/JSON: jsonpath
    Generated columns
    BRIN bloom indexes

The first four of those have all had to reassign OIDs during their
lifetime. jsonpath has avoided doing so by choosing fairly high-numbered
OIDs (6K range) to begin with; which I trust you will agree is a solution
that doesn't scale for long. I'm not entirely sure that the last two
haven't had to renumber OIDs; I ran out of energy before poking through
their history in detail.

In short, this situation may look fine from the perspective of a
committer with a relatively short timeline to commit, but it's pretty
darn awful for everybody else. The only way to avoid a ~ 50% failure
rate is to choose OIDs above 6K, and once everybody starts doing it like
that, things are going to get very unpleasant very quickly.

regards, tom lane
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Michael Paquier [mailto:mich...@paquier.xyz]
> So could you consider adding an option for the VACUUM command as well
> as vacuumdb? The interaction with the current patch is that you need to
> define the behavior at the beginning of vacuum for a given heap, instead
> of reading the parameter at the time the truncation happens, and give

I'm not confident whether this is the same as the above; I imagined this:

* Add a new USERSET GUC vacuum_shrink_table = {on | off}, on by default.
This follows the naming style "verb_object" like log_connections and
enable_xxx. We may want to use enable_vacuum_shrink or something like
that, but enable_xxx seems to be used solely for planner control. Plus,
vacuum-related parameters seem to start with vacuum_.

* Give priority to the reloption, because it's targeted at a particular
table. If the reloption is not set, the GUC takes effect.

* As a consequence, the user can change the behavior of the VACUUM
command by SETting the GUC in the same session in advance, when the
reloption is not set. If the reloption is set, the user can ALTER TABLE
SET, VACUUM, and ALTER TABLE again to restore the table's setting. But I
don't think this use case (changing whether to shrink per VACUUM command
execution) is necessary; it is merely possible.

Regards
Takayuki Tsunakawa
Re: psql display of foreign keys
On Wed, Feb 27, 2019 at 03:37:23PM -0300, Alvaro Herrera wrote:
> It should have listed t2 too, but it doesn't. Since these functions
> aren't supposed to work on legacy inheritance anyway, I think the right
> action is to return the empty set. In the current version I just do
> what pg_partition_tree does, but I think we should adjust that behavior.
> I'll start a new thread about that.

Yes, that's not good. The internal wrapper for ancestors should be
reworked. The results of pg_partition_tree are what I would expect them
to be though? Taking your example, t111 gets listed if listing the trees
from t1 or t2. This seems natural to me. I am wondering about the amount
of work that it would take to actually have the function return both
relations in this case..

> +pg_partition_ancestors(PG_FUNCTION_ARGS)
> +{
> +	Oid			relid = PG_GETARG_OID(0);
> +	FuncCallContext *funcctx;
> +	ListCell  **next;
> +
> +	if (!check_rel_can_be_partition(relid))
> +		PG_RETURN_NULL();

Not returning an empty set here? ;)

I would have added tests with pg_partition_ancestors(NULL) and
pg_partition_ancestors(0) for consistency with the rest.

Except that and the ancestor tracking for inheritance, the shape of the
patch looks good to me.

-- Michael

signature.asc
Description: PGP signature
Re: pgsql: Avoid creation of the free space map for small heap relations, t
On Thu, Feb 28, 2019 at 8:10 AM Amit Kapila wrote:
>
> On Thu, Feb 28, 2019 at 7:45 AM John Naylor wrote:
> >
> > On Thu, Feb 28, 2019 at 7:24 AM Amit Kapila wrote:
> > > > The flaw in my thinking was treating extension too similarly to
> > > > finding an existing block. With your patch clearing the local map in
> > > > the correct place, it seems the call at hio.c:682 is now superfluous?
> > >
> > > What if we get some valid block from the first call to
> > > GetPageWithFreeSpace via the local map which has the required space?
> > > I think in that case we will need the call at hio.c:682. Am I missing
> > > something?
> >
> > Are you referring to the call at line 393? Then the map will be
> > cleared on line 507 before we return that buffer.
>
> That's correct, I hadn't looked at the line number very carefully and
> assumed that you were saying to remove the call at 507. I also think
> that the call at line 682 doesn't seem to be required, but I would
> prefer to keep it for the sake of consistency in this function and for
> safety purposes. However, I think we should add a comment on the lines
> suggested by you. I will send the patch after making these
> modifications.

Here's an updated patch based on comments by you. I will proceed with
this unless you have any more comments.

-- With Regards, Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

fix_loc_map_clear_3.patch
Description: Binary data
Re: A note about recent ecpg buildfarm failures
On Tue, Feb 26, 2019 at 01:25:29PM -0500, Tom Lane wrote:
> Since my commits 9e138a401 et al on Saturday, buildfarm members
> blobfish, brotula, and wunderpus have been showing core dumps
> in the ecpg preprocessor. This seemed inexplicable given what
> the commits changed, and even more so seeing that only HEAD is
> failing, while the change was back-patched into all branches.
>
> Mark Wong and I poked into this off-list, and what we find is that
> this seems to be a compiler bug. Those animals are all running
> nearly the same version of clang (3.8.x / ppc64le). Looking into
> the assembly code for preproc.y, the crash is occurring at a branch
> that is supposed to jump forward exactly 32768 bytes, but according
> to gdb's disassembler it's jumping backwards exactly -32768 bytes,
> into invalid memory. It will come as no surprise to hear that the
> branch displacement field in PPC conditional branches is 16 bits
> wide, so that positive 32768 doesn't fit but negative 32768 does.
> Evidently what is happening is that either the compiler or the
> assembler is failing to detect the edge-case field overflow and
> switch to different coding. So the apparent dependency on 9e138a401
> is because that happened to insert exactly the right number of
> instructions in-between to trigger this scenario. It's pure luck we
> didn't trip over it before, although none of those buildfarm animals
> have been around for all that long.
>
> Moral: don't use clang 3.8.x on ppc64. I think Mark is going
> to upgrade those animals to some more recent compiler version.

I've tried clang 3.9 and 4.0 by hand and they seem to be ok. These were
the other two readily available versions on Debian stretch. I'll stop
those other clang-3.8 animals...

Regards, Mark

-- Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/
RE: Timeout parameters
Hi,

I rewrote the two TCP_USER_TIMEOUT patches. I changed the third argument
of setsockopt() from 18 to TCP_USER_TIMEOUT. This revision has the
following two merits:

* It improves the readability of the source.
* Even if the definition of TCP_USER_TIMEOUT changes, the code is not
affected. E.g., TCP_USER_TIMEOUT is currently defined in Linux, and even
if its value were changed to 19, the code would still work.

Best regards,
- Ryohei Nagaura

TCP_backend_v8.patch
Description: TCP_backend_v8.patch

TCP_interface_v8.patch
Description: TCP_interface_v8.patch

socket_timeout_v7.patch
Description: socket_timeout_v7.patch
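For reference, the change described above amounts to something like the
following; this is a sketch rather than the patch itself, and the helper
and variable names are invented for illustration.

    /*
     * Sketch of using the symbolic constant from <netinet/tcp.h> instead
     * of the raw value 18 (Linux-specific option; timeout in
     * milliseconds).
     */
    #include <sys/socket.h>
    #include <netinet/tcp.h>

    static int
    set_tcp_user_timeout(int sock, unsigned int timeout_ms)
    {
        /* was: setsockopt(sock, IPPROTO_TCP, 18, ...) */
        return setsockopt(sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
                          (char *) &timeout_ms, sizeof(timeout_ms));
    }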
RE: proposal: pg_restore --convert-to-text
Hi,

On Tue, Feb 19, 2019 at 8:20 PM, Euler Taveira wrote:
> Em seg, 18 de fev de 2019 às 19:21, Tom Lane escreveu:
> >
> > Euler Taveira writes:
> > > Since no one has stepped up, I took a stab at it. It will prohibit
> > > standard output unless '-f -' be specified. -l option also has the
> > > same restriction.
> >
> > Hm, don't really see the need to break -l usage here.
> >
> After thinking about it, revert it.
>
> > Pls add to next CF, if you didn't already.
> >
> Done.

I looked at the patch. Is there no need to rewrite the Description in the
docs to state that we should specify either the -d or the -f option?
(It might also be better to mention that if the -l option is given,
neither -d nor -f is necessarily needed.)

I also have a simple question about the code. I thought the if-else
condition below

+	if (filename && strcmp(filename, "-") == 0)
+		fn = fileno(stdout);
+	else if (filename)
 		fn = -1;
 	else if (AH->FH)

can also be written in the form below:

	if (filename)
	{
		if (strcmp(filename, "-") == 0)
			fn = fileno(stdout);
		else
			fn = -1;
	}
	else if (AH->FH)

I think the former one looks pretty, but which one is preferred?

-- Yoshikazu Imai
Re: pgsql: Avoid creation of the free space map for small heap relations, t
On Thu, Feb 28, 2019 at 7:45 AM John Naylor wrote:
>
> On Thu, Feb 28, 2019 at 7:24 AM Amit Kapila wrote:
> > > The flaw in my thinking was treating extension too similarly to
> > > finding an existing block. With your patch clearing the local map in
> > > the correct place, it seems the call at hio.c:682 is now superfluous?
> >
> > What if we get some valid block from the first call to
> > GetPageWithFreeSpace via the local map which has the required space?
> > I think in that case we will need the call at hio.c:682. Am I missing
> > something?
>
> Are you referring to the call at line 393? Then the map will be
> cleared on line 507 before we return that buffer.

That's correct, I hadn't looked at the line number very carefully and
assumed that you were saying to remove the call at 507. I also think
that the call at line 682 doesn't seem to be required, but I would prefer
to keep it for the sake of consistency in this function and for safety
purposes. However, I think we should add a comment on the lines suggested
by you. I will send the patch after making these modifications.

-- With Regards, Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Allowing extensions to supply operator-/function-specific info
Paul Ramsey writes:
> I added three indexes to my test table:
> CREATE INDEX foo_g_gist_x ON foo USING GIST (g);
> CREATE INDEX foo_g_gist_nd_x ON foo USING GIST (g gist_geometry_ops);
> CREATE INDEX foo_g_spgist_x ON foo USING SPGIST (g);
> They all support the overlaps (&&) operator.
> So, SupportRequestIndexCondition happens three times, and each time I
> say "yep, sure, you can construct an index condition by putting the &&
> operator between left_arg and right_arg".

Sounds right.

> How does the planner end up deciding on which index to *actually* use?

It's whichever has the cheapest cost estimate. In case of an exact tie,
I believe it'll choose the index with lowest OID (or maybe highest OID,
not sure).

> The selectivity is the same, the operator is the same. I found that I
> got the ND GIST one first, then the SPGIST and finally the 2d GIST,
> which is unfortunate, because the 2D and SPGIST are almost certainly
> faster than the ND GIST.

Given that it'll be the same selectivity, the cost preference is likely
to go to whichever index is physically smallest, at least for indexes of
the same type. When they're not the same type there might be an issue
with the index AM cost estimators not being lined up very well as to what
they account for and how.

I don't doubt that there's plenty of work to be done in making the cost
estimates better in cases like this --- in particular, I don't think we
have any way of accounting for the idea that one index opclass might be
smarter than another one for the same query, unless that shakes out as a
smaller index. But you'd have had the same issues with the old approach.

regards, tom lane
Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
On Thu, Feb 28, 2019 at 01:05:07AM +, Tsunakawa, Takayuki wrote:
> From: Robert Haas [mailto:robertmh...@gmail.com]
>> I don't think that a VACUUM option would be out of place, but a GUC
>> sounds like an attractive nuisance to me. It will encourage people to
>> just flip it blindly instead of considering the particular cases where
>> they need that behavior, and I think chances are good that most people
>> who do that will end up being sad.

I won't disagree with you on that. I hear enough about people
disappointed that VACUUM does not clean up their garbage enough and that
tables are bloated.. And making autovacuum too aggressive is no good
either.

> Ouch, I sent my previous mail before reading this. I can understand
> it may be cumbersome to identify and specify each table, so I tend
> to agree the parameter in postgresql, which is USERSET to allow
> ALTER DATABASE/USER SET to tune specific databases and applications.
> But should the vacuuming of system catalogs also follow this
> setting?

So could you consider adding an option for the VACUUM command as well as
vacuumdb? The interaction with the current patch is that you need to
define the behavior at the beginning of vacuum for a given heap, instead
of reading the parameter at the time the truncation happens, and give
priority to the command-level option.

-- Michael

signature.asc
Description: PGP signature
Re: pg_partition_tree crashes for a non-defined relation
Hi,

On 2019/02/28 10:45, Michael Paquier wrote:
> On Wed, Feb 27, 2019 at 03:48:08PM -0300, Alvaro Herrera wrote:
>> I just happened to come across the result of this rationale in
>> pg_partition_tree() (an SRF) while developing a new related function,
>> pg_partition_ancestors(), and find the resulting behavior rather absurd
>> -- it returns one row with all NULL columns, rather than no rows. I
>> think the sensible behavior would be to do SRF_RETURN_DONE() before
>> stashing any rows to the output, so that we get an empty result set
>> instead.
>
> Hmm. Going through the thread again NULL was decided to make the
> whole experience consistent, now by returning nothing we would get
> a behavior as consistent as when NULL is used in input, so point taken
> to tune the behavior for unsupported relkinds and undefined objects.
>
> Does the attached look fine to you?

Reading the discussion, we just don't want to throw an "ERROR:
unsupported relkind" error, to avoid, for example, aborting a query
that's iteratively calling pg_partition_tree on all relations in a given
set. Returning no rows at all seems like a better way of ignoring such
relations.

with rels as (select relname from pg_class
              where relnamespace = 'public'::regnamespace)
select pg_partition_tree(relname::regclass) from rels;

           pg_partition_tree
----------------------------------------
 (pk1,pk,t,0)
 (pk,,f,0)
 (pk1,pk,t,1)
 (pk1_pkey,pk_pkey,t,0)
 (pk_pkey,,f,0)
 (pk1_pkey,pk_pkey,t,1)
 (another_pk1,another_pk,t,0)
 (another_pk,,f,0)
 (another_pk1,another_pk,t,1)
 (another_pk1_pkey,another_pk_pkey,t,0)
 (another_pk_pkey,,f,0)
 (another_pk1_pkey,another_pk_pkey,t,1)
(13 rows)

Now that Alvaro mentions it, I too think that returning no rows at all is
better than 1 row filled with all NULLs, so +1.

Thanks, Amit
Re: pgsql: Avoid creation of the free space map for small heap relations, t
On Thu, Feb 28, 2019 at 7:24 AM Amit Kapila wrote:
> > The flaw in my thinking was treating extension too similarly to
> > finding an existing block. With your patch clearing the local map in
> > the correct place, it seems the call at hio.c:682 is now superfluous?
>
> What if we get some valid block from the first call to
> GetPageWithFreeSpace via the local map which has the required space?
> I think in that case we will need the call at hio.c:682. Am I missing
> something?

Are you referring to the call at line 393? Then the map will be
cleared on line 507 before we return that buffer.

-- John Naylor
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: readdir is incorrectly implemented at Windows
On Mon, Feb 25, 2019 at 06:38:16PM +0300, Konstantin Knizhnik wrote:
> Small issue with the readdir implementation for Windows.
> Right now it returns ENOENT in case of any error returned by
> FindFirstFile. So all places in Postgres where opendir/readdir are used
> will assume that the directory is empty and do nothing without
> reporting any error. It is not so good if the directory is actually not
> empty but there are not enough permissions for accessing the directory
> and FindFirstFile returns ERROR_ACCESS_DENIED:

Yeah, I think that you are right here. We should have a clean error if
permissions are incorrect, and your patch does what is mentioned in the
Windows docs:
https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-findfirstfilea

Could you add it to the next commitfest as a bug fix please? I think
that I will be able to look at it in detail soon, but if not it would be
better to not lose track of your fix.

-- Michael

signature.asc
Description: PGP signature
RE: Protect syscache from bloating with negative cache entries
From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
> I measured the memory context accounting overhead using Tomas's tool
> palloc_bench, which he made a while ago in the similar discussion.
> https://www.postgresql.org/message-id/53f7e83c.3020...@fuzzy.cz
>
> This tool is a little bit outdated so I fixed it, but basically I
> followed him. Things I did:
> - make one MemoryContext
> - run both palloc() and pfree() for a 32kB area 1,000,000 times
> - and measure this time
>
> The result shows that master is 30 times faster than the patched one.
> So as Andres mentioned in the upper thread it seems it has overhead.
>
> [master (without v15 patch)]
> 61.52 ms
> 60.96 ms
> 61.40 ms
> 61.42 ms
> 61.14 ms
>
> [with v15 patch]
> 1838.02 ms
> 1754.84 ms
> 1755.83 ms
> 1789.69 ms
> 1789.44 ms

I'm afraid the measurement is not correct. First, the older discussion
below shows that the accounting overhead is much, much smaller, even with
a more complex accounting scheme:

9.5: Better memory accounting, towards memory-bounded HashAgg
https://www.postgresql.org/message-id/flat/1407012053.15301.53.camel%40jeff-desktop

Second, allocation/free of memory > 8 KB calls malloc()/free(). I guess
the accounting overhead will be more likely to be hidden under the
overhead of malloc() and free(). What we'd like to know is the overhead
when malloc() and free() are not called.

And are you sure you didn't enable assert checking?

Regards
Takayuki Tsunakawa
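For reference, the kind of loop being discussed looks roughly like the
following. This is a sketch written against the standard memory-context
API, not Tomas's actual palloc_bench code, and it illustrates why a 32 kB
request size mostly measures malloc()/free() rather than the accounting
itself.

    #include "postgres.h"
    #include "utils/memutils.h"

    /*
     * Sketch of the benchmark loop: one private context, 1,000,000
     * rounds of palloc/pfree of 32 kB.  Chunks larger than the
     * allocation-set chunk limit (8 kB) are passed straight to
     * malloc()/free() by aset.c, so per-chunk accounting cost is largely
     * hidden behind the libc allocator.
     */
    static void
    palloc_bench_sketch(void)
    {
        MemoryContext cxt = AllocSetContextCreate(CurrentMemoryContext,
                                                  "palloc_bench",
                                                  ALLOCSET_DEFAULT_SIZES);
        MemoryContext old = MemoryContextSwitchTo(cxt);

        for (int i = 0; i < 1000000; i++)
        {
            char   *p = palloc(32 * 1024);

            pfree(p);
        }

        MemoryContextSwitchTo(old);
        MemoryContextDelete(cxt);
    }

Measuring with a request size below 8 kB would keep the allocations inside
the context's own blocks and expose the accounting cost directly.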
Re: Segfault when restoring -Fd dump on current HEAD
On Wed, Feb 27, 2019 at 12:02:43PM -0500, Tom Lane wrote:
> Alvaro Herrera writes:
>> I think we should save such a patch for whenever we next update the
>> archive version number, which could take a couple of years given past
>> history. I'm inclined to add a comment near K_VERS_SELF to remind
>> whoever next patches it.
>
> +1. This isn't an unreasonable cleanup idea, but being only a cleanup
> idea, it doesn't seem worth creating compatibility issues for. Let's
> wait till there is some more-pressing reason to change the archive
> format, and then fix this in the same release cycle.

+1. Having a comment as a reminder would be really nice.

-- Michael

signature.asc
Description: PGP signature
RE: Libpq support to connect to standby server as priority
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com]
> Attached are the updated patches.

Thanks, all look fixed.

> The target_server_type option is yet to be implemented.

Let me review once more and proceed to testing when the above is added,
to make sure the final code looks good. I'd like to see how complex the
if conditions in multiple places would be after adding
target_server_type, and consider whether we can simplify them together
with you. Even now, the if conditions seem complicated to me... that's
probably due to the existence of read_write_host_index.

Regards
Takayuki Tsunakawa
Re: pg_partition_tree crashes for a non-defined relation
On Wed, Feb 27, 2019 at 03:48:08PM -0300, Alvaro Herrera wrote:
> I just happened to come across the result of this rationale in
> pg_partition_tree() (an SRF) while developing a new related function,
> pg_partition_ancestors(), and find the resulting behavior rather absurd
> -- it returns one row with all NULL columns, rather than no rows. I
> think the sensible behavior would be to do SRF_RETURN_DONE() before
> stashing any rows to the output, so that we get an empty result set
> instead.

Hmm. Going through the thread again, NULL was decided on to make the
whole experience consistent; now by returning nothing we would get a
behavior as consistent as when NULL is used in input, so point taken to
tune the behavior for unsupported relkinds and undefined objects.

Does the attached look fine to you?

-- Michael

diff --git a/src/backend/utils/adt/partitionfuncs.c b/src/backend/utils/adt/partitionfuncs.c
index ffd66b6439..36d9f69cbc 100644
--- a/src/backend/utils/adt/partitionfuncs.c
+++ b/src/backend/utils/adt/partitionfuncs.c
@@ -69,9 +69,6 @@ pg_partition_tree(PG_FUNCTION_ARGS)
 	FuncCallContext *funcctx;
 	ListCell  **next;
 
-	if (!check_rel_can_be_partition(rootrelid))
-		PG_RETURN_NULL();
-
 	/* stuff done only on the first call of the function */
 	if (SRF_IS_FIRSTCALL())
 	{
@@ -82,6 +79,9 @@ pg_partition_tree(PG_FUNCTION_ARGS)
 		/* create a function context for cross-call persistence */
 		funcctx = SRF_FIRSTCALL_INIT();
 
+		if (!check_rel_can_be_partition(rootrelid))
+			SRF_RETURN_DONE(funcctx);
+
 		/* switch to memory context appropriate for multiple function calls */
 		oldcxt = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
diff --git a/src/test/regress/expected/partition_info.out b/src/test/regress/expected/partition_info.out
index a884df976f..73269ffd09 100644
--- a/src/test/regress/expected/partition_info.out
+++ b/src/test/regress/expected/partition_info.out
@@ -9,8 +9,7 @@ SELECT * FROM pg_partition_tree(NULL);
 SELECT * FROM pg_partition_tree(0);
  relid | parentrelid | isleaf | level
 -------+-------------+--------+-------
-       |             |        |
-(1 row)
+(0 rows)
 
 SELECT pg_partition_root(NULL);
  pg_partition_root
@@ -163,14 +162,12 @@ CREATE MATERIALIZED VIEW ptif_test_matview AS SELECT 1;
 SELECT * FROM pg_partition_tree('ptif_test_view');
  relid | parentrelid | isleaf | level
 -------+-------------+--------+-------
-       |             |        |
-(1 row)
+(0 rows)
 
 SELECT * FROM pg_partition_tree('ptif_test_matview');
  relid | parentrelid | isleaf | level
 -------+-------------+--------+-------
-       |             |        |
-(1 row)
+(0 rows)
 
 SELECT pg_partition_root('ptif_test_view');
  pg_partition_root

signature.asc
Description: PGP signature
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Robert Haas [mailto:robertmh...@gmail.com] > I don't think that a VACUUM option would be out of place, but a GUC > sounds like an attractive nuisance to me. It will encourage people to > just flip it blindly instead of considering the particular cases where > they need that behavior, and I think chances are good that most people > who do that will end up being sad. Ouch, I sent my previous mail before reading this. I can understand it may be cumbersome to identify and specify each table, so I tend to agree with a parameter in postgresql.conf, which would be USERSET so that ALTER DATABASE/USER ... SET can tune specific databases and applications. But should the vacuuming of system catalogs also follow this setting? Regards Takayuki Tsunakawa
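For illustration only: if such a USERSET parameter existed, the per-database and per-application tuning mentioned above would use the ordinary ALTER ... SET forms. The parameter name below is hypothetical, not something agreed in this thread.

ALTER DATABASE appdb SET vacuum_shrink_enabled = off;    -- hypothetical GUC name
ALTER ROLE batch_writer SET vacuum_shrink_enabled = off;
SET vacuum_shrink_enabled = on;                          -- a session could still override it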
Re: BUG #15293: Stored Procedure Triggered by Logical Replication is Unable to use Notification Events
If you are trying to get around the issue for now, what my team did was cron an insert statement on the database server. We have a queue table that has some of these triggers set up, so it was easy to write a no-op row to the queue. This had the side effect of flushing the notification queue. We haven't had any issues with this approach. I can also volunteer to help test out patches. Thanks, -mdj On Fri, Feb 22, 2019 at 11:37 AM Robert Welin wrote: > I have reproduced this bug on PostgreSQL version 10.7 and 11.2 using the > steps described in Michael Powers' original report. The issue also still > seems to be present even with the patch provided by Sergei Kornilov. > > Are there plans to address this issue any time soon or is there some way > I can assist in fixing it? It would be great to have notifications from > logical replication. > > Regards, > Robert Welin
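A minimal sketch of that workaround, assuming a queue table that already has a NOTIFY trigger attached (all names here are made up); the periodic dummy write is what ends up flushing the notification queue:

-- run periodically from cron against the subscriber database
INSERT INTO work_queue (payload, is_noop) VALUES ('{}', true);
DELETE FROM work_queue WHERE is_noop;   -- optional: clean out the dummy rows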
Re: get_controlfile() can leak fds in the backend
On Wed, Feb 27, 2019 at 07:45:11PM -0500, Joe Conway wrote: > It seems to me that OpenTransientFile() is more appropriate. Patch done > that way attached. Works for me, thanks for sending a patch! While at it, could you clean up the comment on top of get_controlfile()? "char" is mentioned instead of "const char" for DataDir, which is incorrect. I would remove the complete set of arguments from the description and just keep the routine name. -- Michael
Re: get_controlfile() can leak fds in the backend
On Wed, Feb 27, 2019 at 11:50:17AM +0100, Fabien COELHO wrote: >> Shouldn't be necessary - the control file fits into a single page, and >> writes of that size ought to always be atomic. And I also think >> introducing flock usage for this would be quite disproportional. There are static assertions to make sure that the size of the control file data never gets higher than 512 bytes for this purpose. > Note that my concern is not about the page size, but rather that as more > commands may change the cluster status by editing the control file, it would > be better that a postmaster does not start while a pg_rewind or enable > checksum or whatever is in progress, and currently there is a possible race > condition between the read and write that can induce an issue, at least > theoretically. Something that I think we could live with instead is a special flag in the control file to mark the postmaster as being in maintenance mode. This would be useful to prevent the postmaster from starting if it sees this flag in the control file, as well as to find out that a host has crashed in the middle of a maintenance operation. We don't give this insurance now when running pg_rewind, which is bad. That's also separate from the checksum-related patches and pg_rewind. flock() can be hard to live with for cross-platform compatibility, for example Windows (LockFileEx) or fancier platforms. And note that we don't use it yet in the tree. And flock() would help in the first case I am mentioning, but not in the second. -- Michael
Re: get_controlfile() can leak fds in the backend
Hi, On 2019-02-27 11:50:17 +0100, Fabien COELHO wrote: > Note that my concern is not about the page size, but rather that as more > commands may change the cluster status by editing the control file, it would > be better that a postmaster does not start while a pg_rewind or enable > checksum or whatever is in progress, and currently there is a possible race > condition between the read and write that can induce an issue, at least > theoretically. Seems odd to bring this up in this thread, it really has nothing to do with the topic. If we were to want to do more here, ISTM the right approach would use the postmaster pid file, not the control file. Greetings, Andres Freund
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Michael Paquier [mailto:mich...@paquier.xyz] > This makes the test page-size sensitive. While we don't ensure that tests > can be run with different page sizes, we should make a maximum effort to > keep the tests compatible if that's easy enough. In this case you could > just use > 0 as base comparison. I can fix that by myself, so no need to > send a new version. Good idea. Done. > Should we also document that the parameter is effective for autovacuum? > The name can lead to confusion regarding that. I'm not sure about the need, because autovacuum is just an automatic execution of vacuum, and existing vacuum_xxx parameters also apply to autovacuum. But being specific is good anyway, so I added a reference to autovacuum in the description. > Also, shouldn't the relopt check happen in should_attempt_truncation()? > It seems to me that if we use this routine somewhere else then it should > be controlled by the option. That makes sense. Done. > At the same time, we also have REL_TRUNCATE_FRACTION and > REL_TRUNCATE_MINIMUM which could be made equally user-tunnable. > That's more difficult to control, still why don't we also consider this > part? I thought of it, too. But I didn't have a good idea on how to explain those parameters, so I'd like to keep that separate. > Another thing that seems worth thinking about is a system-level GUC, and > an option in the VACUUM command to control if truncation should happen or > not. We have a lot of infrastructure to control such options between vacuum > and autovacuum, so it could be a waste to not consider potential synergies. Being able to specify this parameter in postgresql.conf and SET (especially ALTER DATABASE/USER to target specific databases/applications) might be useful, but I'm not sure... I'm less confident about whether the VACUUM command should be able to specify this, because this is a property of the table under a specific workload, not a changeable property of each VACUUM action. Anyway, I expect it won't be difficult to add that configurability without contradicting the design, so I'm inclined to keep it separate. From: Masahiko Sawada [mailto:sawada.m...@gmail.com] > Yeah, that would work. Or it's kind of hackie but the rolling back the > insertion instead of INSERT and DELETE might also work. That's good, because it allows us to keep running the reloptions test in parallel with other tests. Done. Regards Takayuki Tsunakawa disable-vacuum-truncation_v4.patch
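To make the "relopt check in should_attempt_truncation()" point concrete, the check could simply gate the existing truncation heuristics. This is only a sketch; the reloption field name and the extra Relation argument are assumptions, not necessarily what the attached patch does.

/* Sketch: bail out early if the (hypothetical) reloption disables truncation */
static bool
should_attempt_truncation(Relation onerel, LVRelStats *vacrelstats)
{
    BlockNumber possibly_freeable;

    if (onerel->rd_options != NULL &&
        !((StdRdOptions *) onerel->rd_options)->shrink_enabled)
        return false;

    possibly_freeable = vacrelstats->rel_pages - vacrelstats->nonempty_pages;
    return (possibly_freeable > 0 &&
            (possibly_freeable >= REL_TRUNCATE_MINIMUM ||
             possibly_freeable >= vacrelstats->rel_pages / REL_TRUNCATE_FRACTION) &&
            old_snapshot_threshold < 0);
}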
Re: get_controlfile() can leak fds in the backend
On 2/27/19 10:26 AM, Joe Conway wrote: > On 2/27/19 2:47 AM, Michael Paquier wrote: >> Hi all, >> (CC-ing Joe as of dc7d70e) > According to that comment BasicOpenFile does not seem to solve the issue > you are pointing out (leaking of file descriptor on ERROR). Perhaps > OpenTransientFile() is more appropriate? I am on the road at the moment > so have not looked very deeply at this yet though. It seems to me that OpenTransientFile() is more appropriate. Patch done that way attached. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c index abfe706..c3b1934 100644 *** a/src/common/controldata_utils.c --- b/src/common/controldata_utils.c *** *** 27,32 --- 27,35 #include "catalog/pg_control.h" #include "common/controldata_utils.h" #include "port/pg_crc32c.h" + #ifndef FRONTEND + #include "storage/fd.h" + #endif /* * get_controlfile(char *DataDir, const char *progname, bool *crc_ok_p) *** get_controlfile(const char *DataDir, con *** 51,63 ControlFile = palloc(sizeof(ControlFileData)); snprintf(ControlFilePath, MAXPGPATH, "%s/global/pg_control", DataDir); - if ((fd = open(ControlFilePath, O_RDONLY | PG_BINARY, 0)) == -1) #ifndef FRONTEND ereport(ERROR, (errcode_for_file_access(), errmsg("could not open file \"%s\" for reading: %m", ControlFilePath))); #else { fprintf(stderr, _("%s: could not open file \"%s\" for reading: %s\n"), progname, ControlFilePath, strerror(errno)); --- 54,67 ControlFile = palloc(sizeof(ControlFileData)); snprintf(ControlFilePath, MAXPGPATH, "%s/global/pg_control", DataDir); #ifndef FRONTEND + if ((fd = OpenTransientFile(ControlFilePath, O_RDONLY | PG_BINARY)) == -1) ereport(ERROR, (errcode_for_file_access(), errmsg("could not open file \"%s\" for reading: %m", ControlFilePath))); #else + if ((fd = open(ControlFilePath, O_RDONLY | PG_BINARY, 0)) == -1) { fprintf(stderr, _("%s: could not open file \"%s\" for reading: %s\n"), progname, ControlFilePath, strerror(errno)); *** get_controlfile(const char *DataDir, con *** 95,101 --- 99,109 #endif } + #ifndef FRONTEND + CloseTransientFile(fd); + #else close(fd); + #endif /* Check the CRC. */ INIT_CRC32C(crc); signature.asc Description: OpenPGP digital signature
Re: pgsql: Avoid creation of the free space map for small heap relations, t
On Wed, Feb 27, 2019 at 11:07 AM John Naylor wrote: > > On Wed, Feb 27, 2019 at 5:06 AM Amit Kapila wrote: > > I have tried this test many times (more than 1000 times) by varying > > thread count, but couldn't reproduce it. My colleague, Kuntal has > > tried a similar test overnight, but the issue didn't reproduce which > > is not surprising to me seeing the nature of the problem. As I could > > reproduce it via the debugger, I think we can go ahead with the fix. > > I have improved the comments in the attached patch and I have also > > tried to address Tom's concern w.r.t comments by adding additional > > comments atop of data-structure used to maintain the local map. > > The flaw in my thinking was treating extension too similarly too > finding an existing block. With your patch clearing the local map in > the correct place, it seems the call at hio.c:682 is now superfluous? > What if get some valid block from the first call to GetPageWithFreeSpace via local map which has required space? I think in that case we will need the call at hio.c:682. Am I missing something? > It wasn't sufficient before, so should this call be removed so as not > to mislead? Or maybe keep it to be safe, with a comment like "This > should already be cleared by now, but make sure it is." > > + * To maintain good performance, we only mark every other page as available > + * in this map. > > I think this implementation detail is not needed to comment on the > struct. I think the pointer to the README is sufficient. > Fair enough, I will remove that part of a comment once we agree on the above point. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Allowing extensions to supply operator-/function-specific info
> On Feb 27, 2019, at 3:40 PM, Tom Lane wrote: > >> Variable SupportRequestCost is very exciting, but given that variable cost >> is usually driven by the complexity of arguments, what kind of argument is >> the SupportRequestCost call fed during the planning stage? Constant >> arguments are pretty straight forward, but what gets sent in when a column >> is one (or all) of the arguments? > > You'll see whatever is in the post-constant-folding parse tree. If it's a > Const, you can look at the value. If it's a Var, you could perhaps look > at the pg_statistic info for that column, though whether that would give > you much of a leg up for cost estimation is hard to say. For any sort of > expression, you're probably going to be reduced to using a default > estimate. The core code generally doesn't try to be intelligent about > anything beyond the Const and Var cases. Actually, this is interesting, maybe there’s something to be done looking at the vertex density of the area under consideration… would require gathering extra stats, but could be useful (maybe, at some point feeding costs into plans has to degenerate into wankery…) Another question: I added three indexes to my test table: CREATE INDEX foo_g_gist_x ON foo USING GIST (g); CREATE INDEX foo_g_gist_nd_x ON foo USING GIST (g gist_geometry_ops); CREATE INDEX foo_g_spgist_x ON foo USING SPGIST (g); They all support the overlaps (&&) operator. So, SupportRequestIndexCondition happens three times, and each time I say “yep, sure, you can construct an index condition by putting the && operator between left_arg and right_arg”. How does the planner end up deciding on which index to *actually* use? The selectivity is the same, the operator is the same. I found that I got the ND GIST one first, then the SPGIST and finally the 2d GIST, which is unfortunate, because the 2D and SPGIST are almost certainly faster than the ND GIST. In practice, most people will just have one spatial index at a time, but I still wonder? P
Re: Allowing extensions to supply operator-/function-specific info
Paul Ramsey writes: > The documentation says that a support function should have a signature > "supportfn(internal) returns internal”, but doesn’t say which (if any) > annotations should be provided. IMMUTABLE? PARALLEL SAFE? STRICT? None? All? It doesn't matter much given that these things aren't callable from SQL. The builtin ones are marked immutable/safe/strict, but that's mostly because that's the default state for builtin functions. The only one I'd get excited about is marking it strict if you're not going to check for a null argument ... and even that is neatnik-ism, not something that will have any practical effect. > Variable SupportRequestCost is very exciting, but given that variable cost is > usually driven by the complexity of arguments, what kind of argument is the > SupportRequestCost call fed during the planning stage? Constant arguments are > pretty straight forward, but what gets sent in when a column is one (or all) > of the arguments? You'll see whatever is in the post-constant-folding parse tree. If it's a Const, you can look at the value. If it's a Var, you could perhaps look at the pg_statistic info for that column, though whether that would give you much of a leg up for cost estimation is hard to say. For any sort of expression, you're probably going to be reduced to using a default estimate. The core code generally doesn't try to be intelligent about anything beyond the Const and Var cases. regards, tom lane
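To make that concrete, the SupportRequestCost branch of a support function might look roughly like this; a sketch against the SupportRequestCost fields in supportnodes.h, with placeholder cost numbers rather than anything PostGIS-specific:

if (IsA(rawreq, SupportRequestCost))
{
    SupportRequestCost *req = (SupportRequestCost *) rawreq;
    FuncExpr   *fexpr = (FuncExpr *) req->node;

    /* default estimate when we cannot tell anything about the input */
    req->startup = 0;
    req->per_tuple = 100 * cpu_operator_cost;

    if (fexpr && IsA(fexpr, FuncExpr) && fexpr->args != NIL)
    {
        Node   *arg = (Node *) linitial(fexpr->args);

        if (IsA(arg, Const) && !((Const *) arg)->constisnull)
        {
            /* constant argument: inspect the datum and refine the estimate */
            req->per_tuple = 10 * cpu_operator_cost;    /* placeholder */
        }
        /* a Var could consult pg_statistic here; other expressions keep the default */
    }
    PG_RETURN_POINTER(req);
}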
Re: some hints to understand the plsql cursor.
Thanks Kumar. Actually I was asking what the cursor does on the server side. Looking at the code, it seems the server caches the previous Portal under the cursor's name; whenever we run a FETCH from that portal, it restores the previous Portal and runs it. But your minimized and interactive example definitely makes debugging much quicker. Thank you very much! On Wed, Feb 27, 2019 at 11:35 PM Dilip Kumar wrote: > On Wed, Feb 27, 2019 at 4:42 PM Andy Fan wrote: > > > > actually I'm hacking pg for a function like : > > 1. define a select query. > > 2. client ask for some data. and server reply some data. server will do > NOTHING if client doesn't ask any more.. > > 3. client ask some data more data with a batch and SERVER reply some > data then. then do NOTHING. > > > > currently the simple "select * from t", the server will try to send the > data to client at one time which is not something I want. > > > > by looking into the plsql, looks it has some api like: > > > > fetch 10 from cursor_1; > > fetch 10 from cursor_1; > > > > I'm lacking of the experience to hack plsql. so my question are: > > 1. Does pg has some codes which act like the "ask -> reply -> ask > again -> reply again" on the server code?currently I'm not sure if the > above "fetch" really work like this. > > 2. any resources or hint or suggestion to understand the "fetch" > statement? > > I guess you are looking for these syntax? > > postgres=# BEGIN; > BEGIN > postgres=# DECLARE cur CURSOR FOR SELECT * FROM t; > DECLARE CURSOR > postgres=# FETCH NEXT cur; > a > --- > 1 > (1 row) > > postgres=# FETCH 10 cur; > a > --- > 2 > 3 > 4 > 5 > 1 > 2 > 3 > 4 > 5 > 6 > (10 rows) > > postgres=# FETCH NEXT cur; > a > --- > 7 > (1 row) > > postgres=# CLOSE cur; > CLOSE CURSOR > > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com >
Re: patch to allow disable of WAL recycling
On 2019-Feb-05, Jerry Jelinek wrote: > First, since last fall, we have found another performance problem related > to initializing WAL files. I've described this issue in more detail below, > but in order to handle this new problem, I decided to generalize the patch > so the tunable refers to running on a Copy-On-Write filesystem instead of > just being specific to WAL recycling. Specifically, I renamed the GUC > tunable from 'wal_recycle' to 'wal_cow_fs'. Hopefully this will make it > more obvious what is being tuned and will also be more flexible if there > are other problems in the future which are related to running on a COW > filesystem. I'm happy to choose a different name for the tunable if people > don't like 'wal_cow_fs'. I think the idea of it being a generic tunable for assorted behavior changes, rather than specific to WAL recycling, is a good one. I'm unsure about your proposed name -- maybe "wal_cow_filesystem" is better? I'm rewording your doc addition a little bit. Here's my proposal: This parameter should only be set to on when the WAL resides on a Copy-On-Write (COW) filesystem. Enabling this option adjusts behavior to take advantage of the filesystem characteristics (for example, recycling WAL files and zero-filling new WAL files are disabled). This part sounds good enough to me -- further suggestions welcome. I'm less sure about this phrase: This setting is only appropriate for filesystems which allocate new disk blocks on every write. Is "... which allocate new disk blocks on every write" a technique distinct from CoW itself? I'm confused as to what it means, or how can the user tell whether they are on such a filesystem. Obviously you're thinking that ZFS is such a filesystem and everybody who has pg_wal on ZFS should enable this option. What about, say, Btrfs -- should they turn this option on? Browsing the wikipedia, I find that Windows has this ReFS thing that apparently is also CoW, but NTFS isn't. I don't think either Btrfs or ReFS are realistic options to put pg_wal on, so let's just list the common filesystems for which users are supposed to enable this option ... which I think nowadays is just ZFS. All in all, I would replace this phrase with something like: "This setting should be enabled when pg_wal resides on a ZFS filesystem or similar." That should be weasely enough that it's clear that we expect users to do the homework when on unusual systems, while actively pointing out the most common use case. > Finally, the patch now includes bypassing the zero-fill for new WAL files > when wal_cow_fs is true. That makes sense. I think all these benchmarks Tomas Vondra run are not valid anymore ... The attached v2 has assorted cosmetic cleanups. If you can validate it, I would appreciate it. 
-- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From a4ce02f5cc8ad983c34712083f9cba7fda6d5b38 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Wed, 27 Feb 2019 19:41:05 -0300 Subject: [PATCH v2] pg_wal on COW fs --- doc/src/sgml/config.sgml | 20 src/backend/access/transam/xlog.c | 101 -- src/backend/utils/misc/guc.c | 13 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 5 files changed, 102 insertions(+), 34 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 8bd57f376b2..60a873273aa 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2959,6 +2959,26 @@ include_dir 'conf.d' + + wal_cow_filesystem (boolean) + + wal_cow_filesystem configuration parameter + + + + +This parameter should only be set to on when the WAL +resides on a Copy-On-Write (COW) +filesystem. +Enabling this option adjusts some behavior to take advantage of the +filesystem characteristics (for example, recycling WAL files and +zero-filling new WAL files are disabled). +This setting is only appropriate for filesystems which +allocate new disk blocks on every write. + + + + diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ecd12fc53ae..1acce1c70d6 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -94,6 +94,7 @@ bool wal_log_hints = false; bool wal_compression = false; char *wal_consistency_checking_string = NULL; bool *wal_consistency_checking = NULL; +bool wal_cow_filesystem = false; bool log_checkpoints = false; int sync_method = DEFAULT_SYNC_METHOD; int wal_level = WAL_LEVEL_MINIMAL; @@ -3216,6 +3217,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock) XLogSegNo max_segno;
Re: Allowing extensions to supply operator-/function-specific info
A few more questions… The documentation says that a support function should have a signature "supportfn(internal) returns internal”, but doesn’t say which (if any) annotations should be provided. IMMUTABLE? PARALLEL SAFE? STRICT? None? All? Variable SupportRequestCost is very exciting, but given that variable cost is usually driven by the complexity of arguments, what kind of argument is the SupportRequestCost call fed during the planning stage? Constant arguments are pretty straight forward, but what gets sent in when a column is one (or all) of the arguments? Thanks, P
Re: Row Level Security − leakproof-ness and performance implications
On 2/20/19 11:24 AM, Tom Lane wrote: > Pierre Ducroquet writes: >> For simple functions like enum_eq/enum_ne, marking them leakproof is an >> obvious fix (patch attached to this email, including also textin/textout). > > This is not nearly as "obvious" as you think. See prior discussions, > notably > https://www.postgresql.org/message-id/flat/31042.1546194242%40sss.pgh.pa.us > > Up to now we've taken a very strict definition of what leakproofness > means; as Noah stated, if a function can throw errors for some inputs, > it's not considered leakproof, even if those inputs should never be > encountered in practice. Most of the things we've been willing to > mark leakproof are straight-line code that demonstrably can't throw > any errors at all. > > The previous thread seemed to have consensus that the hazards in > text_cmp and friends were narrow enough that nobody had a practical > problem with marking them leakproof --- but we couldn't define an > objective policy statement that would allow making such decisions, > so nothing's been changed as yet. I think it is important to have > an articulable policy about this, not just a seat-of-the-pants > conclusion about the risk level in a particular function. What if we provided an option to redact all client messages (leaving logged messages as-is). Separately we could provide a GUC to force all functions to be resolved as leakproof. Depending on your requirements, having both options turned on could be perfectly acceptable. Patch for discussion attached. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c index e88c45d..8c32d6c 100644 *** a/src/backend/utils/cache/lsyscache.c --- b/src/backend/utils/cache/lsyscache.c *** *** 46,51 --- 46,54 #include "utils/syscache.h" #include "utils/typcache.h" + /* GUC to force functions to be treated as leakproof */ + extern bool force_leakproof; + /* Hook for plugins to get control in get_attavgwidth() */ get_attavgwidth_hook_type get_attavgwidth_hook = NULL; *** get_func_leakproof(Oid funcid) *** 1595,1600 --- 1598,1610 HeapTuple tp; bool result; + /* + * If client message suppression was requested, + * treat all functions as leakproof + */ + if (force_leakproof) + return true; + tp = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcid)); if (!HeapTupleIsValid(tp)) elog(ERROR, "cache lookup failed for function %u", funcid); diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 8b4720e..9941756 100644 *** a/src/backend/utils/error/elog.c --- b/src/backend/utils/error/elog.c *** int Log_destination = LOG_DESTINATION_ *** 107,112 --- 107,113 char *Log_destination_string = NULL; bool syslog_sequence_numbers = true; bool syslog_split_messages = true; + bool suppress_client_messages = false; #ifdef HAVE_SYSLOG *** send_message_to_frontend(ErrorData *edat *** 3166,3189 /* M field is required per protocol, so always send something */ pq_sendbyte(&msgbuf, PG_DIAG_MESSAGE_PRIMARY); - if (edata->message) - err_sendstring(&msgbuf, edata->message); - else - err_sendstring(&msgbuf, _("missing error text")); ! if (edata->detail) { ! pq_sendbyte(&msgbuf, PG_DIAG_MESSAGE_DETAIL); ! err_sendstring(&msgbuf, edata->detail); ! } ! /* detail_log is intentionally not used here */ ! if (edata->hint) ! { ! pq_sendbyte(&msgbuf, PG_DIAG_MESSAGE_HINT); ! 
err_sendstring(&msgbuf, edata->hint); } if (edata->context) { --- 3167,3197 /* M field is required per protocol, so always send something */ pq_sendbyte(&msgbuf, PG_DIAG_MESSAGE_PRIMARY); ! /* skip sending if requested */ ! if (!suppress_client_messages) { ! if (edata->message) ! err_sendstring(&msgbuf, edata->message); ! else ! err_sendstring(&msgbuf, _("missing error text")); ! if (edata->detail) ! { ! pq_sendbyte(&msgbuf, PG_DIAG_MESSAGE_DETAIL); ! err_sendstring(&msgbuf, edata->detail); ! } ! /* detail_log is intentionally not used here */ ! ! if (edata->hint) ! { ! pq_sendbyte(&msgbuf, PG_DIAG_MESSAGE_HINT); ! err_sendstring(&msgbuf, edata->hint); ! } } + else + err_sendstring(&msgbuf, _("redacted message text")); if (edata->context) { *** send_message_to_frontend(ErrorData *edat *** 3274,3283 if (edata->show_funcname && edata->funcname) appendStringInfo(&buf, "%s: ", edata->funcname); ! if (edata->message) ! appendStringInfoString(&buf, edata->message); else ! appendStringInfoString(&buf, _("missing error text")); if (edata->cursorpos > 0) appendStringInfo(&buf, _(" at character %d"), --- 3282,3297 if (edat
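If something along these lines went in, the two knobs in the draft patch (force_leakproof and suppress_client_messages, as named by the variables above; the actual GUC registration is not shown in the excerpt) would presumably be exercised like this:

SET force_leakproof = on;          -- planner treats every function as leakproof
SET suppress_client_messages = on; -- clients see only "redacted message text"; server logs keep the full detail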
Re: Why don't we have a small reserved OID range for patch revisions?
Peter Geoghegan writes: > On Wed, Feb 27, 2019 at 2:44 PM Peter Eisentraut > wrote: >> If this is the problem (although I think we'd find that OID collisions >> are rather rare compared to other gratuitous cfbot failures), why not >> have the cfbot build with a flag that ignores OID collisions? > How would that work? It could work for conflicting OIDs in different system catalogs (so that the "conflict" is an artifact of our assignment rules rather than an intrinsic problem). But I think the majority of new hand-assigned OIDs are in pg_proc, so that this kind of hack would not help as much as one might wish. regards, tom lane
Re: Why don't we have a small reserved OID range for patch revisions?
On Wed, Feb 27, 2019 at 2:44 PM Peter Eisentraut wrote: > If this is the problem (although I think we'd find that OID collisions > are rather rare compared to other gratuitous cfbot failures), why not > have the cfbot build with a flag that ignores OID collisions? How would that work? -- Peter Geoghegan
Re: Why don't we have a small reserved OID range for patch revisions?
On 2019-02-27 22:50, Peter Geoghegan wrote: > However, the continuous > integration stuff has created an expectation that your patch shouldn't > be left to bitrot for long. Silly mechanical bitrot now seems like a > much bigger problem than it was before these developments. It unfairly > puts reviewers off engaging. If this is the problem (although I think we'd find that OID collisions are rather rare compared to other gratuitous cfbot failures), why not have the cfbot build with a flag that ignores OID collisions? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Why don't we have a small reserved OID range for patch revisions?
On Wed, Feb 27, 2019 at 2:39 PM Peter Eisentraut wrote: > The changes of a patch (a) allocating a new OID, (b) a second patch > allocating a new OID, (c) both being in flight at the same time, (d) > actually picking the same OID, are small. But...they are. Most patches don't create new system catalog entries at all. Of those that do, the conventions around assigning new OIDs make it fairly likely that problems will emerge. > I guess the overall time lost > to this issue is perhaps 2 hours per year. On the other hand, with > about 2000 commits to master per year, if this renumbering business only > adds 2 seconds of overhead to committing, we're coming out behind. The time spent on the final commit is not the cost we're concerned about, though. It isn't necessary to do that more than once, whereas all but the most trivial of patches receive multiple rounds of review and revision. -- Peter Geoghegan
Re: Why don't we have a small reserved OID range for patch revisions?
On 2019-02-27 22:27, Tom Lane wrote: >> OID collision doesn't seem to be a significant problem (for me). > > Um, I beg to differ. It's not at all unusual for pending patches to > bit-rot for no reason other than suddenly getting an OID conflict. > I don't have to look far for a current example: I'm not saying it doesn't happen, but that it's not a significant problem overall. The chances of a patch (a) allocating a new OID, (b) a second patch allocating a new OID, (c) both being in flight at the same time, (d) actually picking the same OID, are small. I guess the overall time lost to this issue is perhaps 2 hours per year. On the other hand, with about 2000 commits to master per year, if this renumbering business only adds 2 seconds of overhead to committing, we're coming out behind. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Why don't we have a small reserved OID range for patch revisions?
Peter Geoghegan writes: > On Wed, Feb 27, 2019 at 1:27 PM Tom Lane wrote: >>> OID collision doesn't seem to be a significant problem (for me). >> Um, I beg to differ. It's not at all unusual for pending patches to >> bit-rot for no reason other than suddenly getting an OID conflict. >> I don't have to look far for a current example: >> https://travis-ci.org/postgresql-cfbot/postgresql/builds/498955351 > Patch authors shouldn't be left with any excuse for leaving their > patch to bitrot for long. And, more casual patch reviewers shouldn't > have any excuse for not downloading a patch and applying it locally, > so that they can spend a spare 10 minutes kicking the tires. Yeah, that latter point is really the killer argument. We don't want to make people spend valuable review time on fixing uninteresting OID conflicts. It's even more annoying that several people might have to duplicate the same work, if they're testing a patch independently. Given a convention that under-development patches use OIDs in the 9K range, the only time anybody would have to resolve OID conflicts for testing would be if they were trying to test the combination of two or more patches. Even then, an OID-renumbering script would make it pretty painless: apply patch 1, renumber its OIDs to someplace else, apply patch 2, repeat as needed. > Why not have unused_oids reference the convention as a "tip"? Hmm, could be helpful. regards, tom lane
Re: Why don't we have a small reserved OID range for patch revisions?
I wrote: > We do need a couple of pieces of new infrastructure to make this idea > conveniently workable. One is a tool to allow automatic OID renumbering > instead of having to do it by hand; Naylor has a draft for that upthread. Oh: arguably, something else we'd need to do to ensure that OID renumbering is trouble-free is to institute a strict rule that OID references in the *.dat files must be symbolic. We had not bothered to convert every single reference type before, reasoning that some of them were too little-used to be worth the trouble; but someday that'll rise up to bite us, if semi-automated renumbering becomes a thing. It looks to me like the following OID columns remain unconverted: pg_class.reltype pg_database.dattablespace pg_ts_config.cfgparser pg_ts_config_map.mapcfg, mapdict pg_ts_dict.dicttemplate pg_type.typcollation pg_type.typrelid regards, tom lane
Re: Why don't we have a small reserved OID range for patch revisions?
On Wed, Feb 27, 2019 at 1:27 PM Tom Lane wrote: > > OID collision doesn't seem to be a significant problem (for me). > > Um, I beg to differ. It's not at all unusual for pending patches to > bit-rot for no reason other than suddenly getting an OID conflict. > I don't have to look far for a current example: > https://travis-ci.org/postgresql-cfbot/postgresql/builds/498955351 OID conflicts are not that big of a deal when building a patch locally, because almost everyone knows what the exact problem is immediately, and because you probably have more than a passing interest in the patch to even do that much. However, the continuous integration stuff has created an expectation that your patch shouldn't be left to bitrot for long. Silly mechanical bitrot now seems like a much bigger problem than it was before these developments. It unfairly puts reviewers off engaging. Patch authors shouldn't be left with any excuse for leaving their patch to bitrot for long. And, more casual patch reviewers shouldn't have any excuse for not downloading a patch and applying it locally, so that they can spend a spare 10 minutes kicking the tires. > Perhaps it'd be useful for genbki.pl to spit out a warning (NOT an > error) if it sees OIDs in the reserved range. I'm not sure that that'd > really be worth the trouble though, since one could easily forget > about it while reviewing/testing just before commit, and it'd just be > useless noise up until it was time to commit. My sense is that we should err on the side of being informative. > Another issue, as Robert pointed out, is that this does need to be > a formal convention not something undocumented. Naylor's patch adds > a mention of it in bki.sgml, but I wonder if anyplace else should > talk about it. Why not have unused_oids reference the convention as a "tip"? > I concede your point that a prudent committer would do a rebuild and > retest rather than just trusting the tool. But really, how much > extra work is that? If you've spent any time optimizing your workflow, > a full rebuild and check-world should be under five minutes on any > hardware anyone would be using for development today. If you use the "check-world parallel" recipe on the committing checklist Wiki page, and if you use ccache, ~2 minutes is attainable for optimized builds (though the recipe doesn't work on all release branches). I don't think that a committer should be a committing anything if they're not willing to do this much. It's not just prudent -- it is the *bare minimum* when committing a patch that creates system catalog entries. -- Peter Geoghegan
Re: Refactoring the checkpointer's fsync request queue
On Thu, Feb 28, 2019 at 10:27 AM Shawn Debnath wrote: > We had a quick offline discussion to get on the same page and we agreed > to move forward with Andres' approach above. Attached is patch v10. > Here's the overview of the patch: Thanks. I will review, and try to rebase my undo patches on top of this and see what problems I crash into. > Ran make check-world and repeated the tests described in [1]. The > numbers show a 12% drop in total time for single run of 1000 clients and > ~62% drop in total time for 10 parallel runs with 200 clients: Hmm, good but unexpected. Will poke at this here. -- Thomas Munro https://enterprisedb.com
Re: POC: converting Lists into arrays
On 2019-Feb-27, Tom Lane wrote: > I'm particularly unsure about whether we need two macros; though the > way I initially tried it with just list_cell_is_last() seemed kind of > double-negatively confusing in the places where the test needs to be > not-last. Also, are these macro names too long, and if so what would > be better? I think "!list_cell_is_last()" is just as readable, if not more, than the "is_not_last" locution: appendStringInfoChar(&buf, '\''); if (!list_cell_is_last(l)) appendStringInfoString(&buf, ", "); I'd go with a single macro. +1 for backpatching the new macros, too. I suspect extension authors are going to need to provide compatibility versions anyway, to be compilable against older minors. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
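For reference, with the current cons-cell List representation the proposed macro can be little more than a wrapper around lnext(); a sketch:

/* true if 'cell' is the last cell of its List */
#define list_cell_is_last(cell)  (lnext(cell) == NULL)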
Re: Why don't we have a small reserved OID range for patch revisions?
Peter Eisentraut writes: > On 2019-02-08 19:14, Tom Lane wrote: >> Quite a few people have used OIDs up around 8000 or 9000 for this purpose; >> I doubt we need a formally reserved range for it. The main problem with >> doing it is the hazard that the patch'll get committed just like that, >> suddenly breaking things for everyone else doing likewise. > For that reason, I'm not in favor of this. Forgetting to update the > catversion is already common enough (for me). Adding another step > between having a seemingly final patch and being able to actually commit > it doesn't seem attractive. Moreover, these "final adjustments" would > tend to require a full rebuild and retest, adding even more overhead. > OID collision doesn't seem to be a significant problem (for me). Um, I beg to differ. It's not at all unusual for pending patches to bit-rot for no reason other than suddenly getting an OID conflict. I don't have to look far for a current example: https://travis-ci.org/postgresql-cfbot/postgresql/builds/498955351 We do need a couple of pieces of new infrastructure to make this idea conveniently workable. One is a tool to allow automatic OID renumbering instead of having to do it by hand; Naylor has a draft for that upthread. Perhaps it'd be useful for genbki.pl to spit out a warning (NOT an error) if it sees OIDs in the reserved range. I'm not sure that that'd really be worth the trouble though, since one could easily forget about it while reviewing/testing just before commit, and it'd just be useless noise up until it was time to commit. Another issue, as Robert pointed out, is that this does need to be a formal convention not something undocumented. Naylor's patch adds a mention of it in bki.sgml, but I wonder if anyplace else should talk about it. I concede your point that a prudent committer would do a rebuild and retest rather than just trusting the tool. But really, how much extra work is that? If you've spent any time optimizing your workflow, a full rebuild and check-world should be under five minutes on any hardware anyone would be using for development today. And, yeah, we probably will make mistakes like this, just like we sometimes forget the catversion bump. As long as we have a tool for OID renumbering, I don't think that's the end of the world. Fixing it after the fact isn't going to be a big deal, any more than it is for catversion. regards, tom lane
Re: POC: converting Lists into arrays
Robert Haas writes: > On Wed, Feb 27, 2019 at 3:27 PM Tom Lane wrote: >> 0001 below does this. I found a couple of places that could use >> forfive(), as well. I think this is a clear legibility and >> error-proofing win, and we should just push it. > It sounds like some of these places might need a bigger restructuring > - i.e. to iterate over a list/vector of structs with 5 members instead > of iterating over five lists in parallel. Meh. Most of them are iterating over parsetree substructure, eg the components of a RowCompareExpr. So we could not change them without pretty extensive infrastructure changes including a catversion bump. Also, while the separated substructure is indeed a pain in the rear in some places, it's actually better for other uses. Two examples related to RowCompareExpr: * match_rowcompare_to_indexcol can check whether all the left-hand or right-hand expressions are nonvolatile with one easy call to contain_volatile_functions on the respective list. To do the same with a single list of sub-structs, it'd need bespoke code for each case to run through the list and consider only the correct subexpression of each sub-struct. * expand_indexqual_rowcompare can deal with commuted-clause cases just by swapping the list pointers at the start, it doesn't have to think about it over again for each pair of elements. So I'm not that excited about refactoring the data representation for these. I'm content (for now) with getting these places in line with the coding convention we use elsewhere for similar cases. regards, tom lane
Re: POC: converting Lists into arrays
On Wed, Feb 27, 2019 at 3:27 PM Tom Lane wrote: > 0001 below does this. I found a couple of places that could use > forfive(), as well. I think this is a clear legibility and > error-proofing win, and we should just push it. It sounds like some of these places might need a bigger restructuring - i.e. to iterate over a list/vector of structs with 5 members instead of iterating over five lists in parallel. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
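For readers following the patch downthread, forfour() (and forfive()) is just the obvious extension of the existing forboth()/forthree() macros over cons-cell Lists; roughly, as a sketch:

#define forfour(cell1, list1, cell2, list2, cell3, list3, cell4, list4) \
    for (cell1 = list_head(list1), cell2 = list_head(list2), \
         cell3 = list_head(list3), cell4 = list_head(list4); \
         cell1 != NULL && cell2 != NULL && \
         cell3 != NULL && cell4 != NULL; \
         cell1 = lnext(cell1), cell2 = lnext(cell2), \
         cell3 = lnext(cell3), cell4 = lnext(cell4))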
Re: [HACKERS] EvalPlanQual behaves oddly for FDW queries involving system columns
Hi, On 2015-05-12 14:24:34 -0400, Tom Lane wrote: > I did a very basic update of your postgres_fdw patch to test this with, > and attach that so that you don't have to repeat the effort. I'm not sure > whether we want to try to convert that into something committable. I'm > afraid that the extra round trips involved in doing row locking this way > will be so expensive that no one really wants it for postgres_fdw. It's > more credible that FDWs operating against local storage would have use > for it. Fujita-san, do you know of any FDWs that use this? I'm currently converting the EPQ machinery to slots, and in course of that I (with Horiguchi-san's help), converted RefetchForeignRow to return a slot. But there's currently no in-core user of this facility... I guess I can rebase the preliminary postgres_fdw patch here, but it bitrotted significantly. I also feel like there should be some test coverage for an API in a nontrivial part of the code... Greetings, Andres Freund
Re: POC: converting Lists into arrays
I wrote: > I did find a number of places where getting rid of explicit lnext() > calls led to just plain cleaner code. Most of these were places that > could be using forboth() or forthree() and just weren't. There's > also several places that are crying out for a forfour() macro, so > I'm not sure why we've stubbornly refused to provide it. I'm a bit > inclined to just fix those things in the name of code readability, > independent of this patch. 0001 below does this. I found a couple of places that could use forfive(), as well. I think this is a clear legibility and error-proofing win, and we should just push it. > I also noticed that there's quite a lot of places that are testing > lnext(cell) for being NULL or not. What that really means is whether > this cell is last in the list or not, so maybe readability would be > improved by defining a macro along the lines of list_cell_is_last(). > Any thoughts about that? 0002 below does this. I'm having a hard time deciding whether this part is a good idea or just code churn. It might be more readable (especially to newbies) but I can't evaluate that very objectively. I'm particularly unsure about whether we need two macros; though the way I initially tried it with just list_cell_is_last() seemed kind of double-negatively confusing in the places where the test needs to be not-last. Also, are these macro names too long, and if so what would be better? Also: if we accept either or both of these, should we back-patch the macro additions, so that these new macros will be available for use in back-patched code? I'm not sure that forfour/forfive have enough use-cases to worry about that; but the is-last macros would have a better case for that, I think. regards, tom lane diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c index 47e80ae..832c3e9 100644 --- a/src/backend/access/common/tupdesc.c +++ b/src/backend/access/common/tupdesc.c @@ -902,23 +902,12 @@ BuildDescFromLists(List *names, List *types, List *typmods, List *collations) desc = CreateTemplateTupleDesc(natts); attnum = 0; - - l2 = list_head(types); - l3 = list_head(typmods); - l4 = list_head(collations); - foreach(l1, names) + forfour(l1, names, l2, types, l3, typmods, l4, collations) { char *attname = strVal(lfirst(l1)); - Oid atttypid; - int32 atttypmod; - Oid attcollation; - - atttypid = lfirst_oid(l2); - l2 = lnext(l2); - atttypmod = lfirst_int(l3); - l3 = lnext(l3); - attcollation = lfirst_oid(l4); - l4 = lnext(l4); + Oid atttypid = lfirst_oid(l2); + int32 atttypmod = lfirst_int(l3); + Oid attcollation = lfirst_oid(l4); attnum++; diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c index db3777d..7cbf9d3 100644 --- a/src/backend/executor/execExpr.c +++ b/src/backend/executor/execExpr.c @@ -1683,7 +1683,6 @@ ExecInitExprRec(Expr *node, ExprState *state, *l_opfamily, *l_inputcollid; ListCell *lc; -int off; /* * Iterate over each field, prepare comparisons. 
To handle @@ -1695,20 +1694,11 @@ ExecInitExprRec(Expr *node, ExprState *state, Assert(list_length(rcexpr->opfamilies) == nopers); Assert(list_length(rcexpr->inputcollids) == nopers); -off = 0; -for (off = 0, - l_left_expr = list_head(rcexpr->largs), - l_right_expr = list_head(rcexpr->rargs), - l_opno = list_head(rcexpr->opnos), - l_opfamily = list_head(rcexpr->opfamilies), - l_inputcollid = list_head(rcexpr->inputcollids); - off < nopers; - off++, - l_left_expr = lnext(l_left_expr), - l_right_expr = lnext(l_right_expr), - l_opno = lnext(l_opno), - l_opfamily = lnext(l_opfamily), - l_inputcollid = lnext(l_inputcollid)) +forfive(l_left_expr, rcexpr->largs, + l_right_expr, rcexpr->rargs, + l_opno, rcexpr->opnos, + l_opfamily, rcexpr->opfamilies, + l_inputcollid, rcexpr->inputcollids) { Expr *left_expr = (Expr *) lfirst(l_left_expr); Expr *right_expr = (Expr *) lfirst(l_right_expr); diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c index 324356e..8b29437 100644 --- a/src/backend/executor/nodeIndexscan.c +++ b/src/backend/executor/nodeIndexscan.c @@ -1332,12 +1332,12 @@ ExecIndexBuildScanKeys(PlanState *planstate, Relation index, { /* (indexkey, indexkey, ...) op (expression, expression, ...) */ RowCompareExpr *rc = (RowCompareExpr *) clause; - ListCell *largs_cell = list_head(rc->largs); - ListCell *rargs_cell = list_head(rc->rargs); - ListCell *opnos_cell = list_head(rc->opnos); - ListCell *collids_cell = list_head(rc->inputcollids); ScanKey first_sub_key; int n_sub_key; + ListCell *largs_cell; + ListCell *rargs_cell; + ListCell *opnos_cell; + ListCell *collids_cell; Assert(!isorderby); @@ -1346,19 +1346,22 @@ ExecIndexBuildScanKeys(PlanState
Re: Index INCLUDE vs. Bitmap Index Scan
Tomas Vondra writes: > On 2/27/19 6:36 AM, Markus Winand wrote: > On 27.02.2019, at 00:22, Tom Lane wrote: >>> For example, given a filter condition like "1.0/c > 0.1", people >>> would complain if it still got zero-divide failures even after they'd >>> deleted all rows with c=0 from their table. >> Ok, but I don’t see how this case different for key columns vs. INCLUDE >> columns. > Yeah, I'm a bit puzzled by this difference too - why would it be safe > for keys and not the other included columns? It's not about the column, it's about the operator. We assume that operators appearing in opclasses are safe to execute even for index entries that correspond to dead rows. INCLUDE columns don't have any associated opclass, hence no assumed-usable operators. > I do recall a discussion about executing expressions on index tuples > during IOS (before/without inspecting the heap tuple). I'm too lazy to > search for the thread now, but I recall it was somehow related to > leak-proof-ness. And AFAICS none of the "div" procs are marked as > leak-proof, so perhaps that's one of the reasons? Leak-proof-ness is kind of related, perhaps, but it's not quite the property we're after here --- at least not to my mind. It might be an interesting discussion exactly what the relationship is. Meanwhile, we don't have any separate concept of functions that are safe to apply to any index entry; opclass membership is it. You could probably argue that any clause containing only indexed variables and operators that are members of *some* opclass could be used as a filter in advance of heap liveness checking. But we lack any infrastructure for that, in either the planner or executor. regards, tom lane
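A concrete version of the scenario under discussion, for illustration:

CREATE TABLE t (a int, c numeric);
CREATE INDEX t_a_incl_c ON t (a) INCLUDE (c);
DELETE FROM t WHERE c = 0;
-- Dead index entries with c = 0 may still be present, so evaluating the
-- filter against index tuples before the heap liveness check could still
-- raise a division-by-zero error:
SELECT a FROM t WHERE a = 42 AND 1.0/c > 0.1;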
Re: psql display of foreign keys
On 2019-Feb-27, Michael Paquier wrote: > On Tue, Feb 26, 2019 at 07:27:57PM -0300, Alvaro Herrera wrote: > > Thanks for committing pg_partition_root ... but it turns out to be > > useless for this purpose. > > Well, what's done is done. The thing is useful by itself in my > opinion. Eh, of course -- note that the psql query I added does use pg_partition_root, it's just that it is not useful *all by itself*. > In the second patch, pg_partition_ancestors always sets include_self > to true. What's the use case you have in mind to set it to false? In > the other existing functions we always include the argument itself, so > we may want to keep things consistent. Hmm, true. > I think that you should make the function return a record of regclass > elements instead of OIDs to be consistent. This could save casts for > the callers. Yeah, done. > Adding the self-member at the beginning of the record set is more > consistent with the order of the results returned by > get_partition_ancestors(). You're right, done. > It would be nice to add some tests in partition_info.sql for tables > and indexes (both work). Well. I tried this scenario create table t1 (a int); create table t11 () inherits (t1); create table t2 (b int); create table t111() inherits (t1, t2); and the result I get from my new function is not good: alvherre=# select * from pg_partition_ancestors('t111'); relid --- t111 t1 (2 filas) it should have listed t2 too, but it doesn't. Since these functions aren't supposed to work on legacy inheritance anyway, I think the right action is to return the empty set. In the current version I just do what pg_partition_tree does, but I think we should adjust that behavior. I'll start a new thread about that. > For the meaning of using pg_partition_ancestors, I see... Not only do > you want to show the foreign keys defined in the top-most parent, but > also these defined in intermediate layers. That makes sense. Using > only pg_partition_root would have been enough to show FKs in the > top-most parent, but the intermediate ones would be missed (using only > pg_partition_root() would miss the FKs fk_partitioned_fk_5_a_fkey1 and > fk_partitioned_fk_5_a_fkey when doing "\d fk_partitioned_fk_5_1" based > on the test set). Exactly -- that's the whole point. We need to list all FKs that are applicable to the partition, indicating which relation is the one where the FK generates, and without polluting the output with countless "internal" pg_constraint rows. The output psql presents for the PK-side relation when it's partitioned, with my patch to support that, is quite ugly when there are many partitions. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From 9e5cc595ea952763fdcd879ec19f3cbff82edae6 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 26 Feb 2019 17:05:24 -0300 Subject: [PATCH v4] pg_partition_ancestors --- doc/src/sgml/func.sgml | 10 +++ src/backend/utils/adt/partitionfuncs.c | 49 + src/include/catalog/pg_proc.dat | 5 ++ src/test/regress/expected/partition_info.out | 72 +++- src/test/regress/sql/partition_info.sql | 18 - 5 files changed, 151 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 86ff4e5c9e2..ee3962a6c92 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -20274,6 +20274,16 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup()); their partitions, and so on. 
+ + +pg_partition_ancestors +pg_partition_ancestors(regclass) + + setof regclass + +List the ancestor relations of the given partition. + + pg_partition_root diff --git a/src/backend/utils/adt/partitionfuncs.c b/src/backend/utils/adt/partitionfuncs.c index ffd66b64394..2e7bf01106b 100644 --- a/src/backend/utils/adt/partitionfuncs.c +++ b/src/backend/utils/adt/partitionfuncs.c @@ -208,3 +208,52 @@ pg_partition_root(PG_FUNCTION_ARGS) Assert(OidIsValid(rootrelid)); PG_RETURN_OID(rootrelid); } + +/* + * pg_partition_ancestors + * + * Produces a view with one row per ancestor of the given partition, + * including the input relation itself. + */ +Datum +pg_partition_ancestors(PG_FUNCTION_ARGS) +{ + Oid relid = PG_GETARG_OID(0); + FuncCallContext *funcctx; + ListCell **next; + + if (!check_rel_can_be_partition(relid)) + PG_RETURN_NULL(); + + if (SRF_IS_FIRSTCALL()) + { + MemoryContext oldcxt; + List *ancestors; + + funcctx = SRF_FIRSTCALL_INIT(); + + oldcxt = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx); + + ancestors = get_partition_ancestors(relid); + ancestors = lcons_oid(relid, ancestors); + + next = (ListCell **) palloc(sizeof(ListCell *)); + *next = list_head(ancestors); + funcctx->user_fctx = (void *) next; + +
Re: pg_partition_tree crashes for a non-defined relation
On 2018-Dec-09, Tom Lane wrote: > Stephen Frost writes: > > * Tom Lane (t...@sss.pgh.pa.us) wrote: > >> ... especially in code that's highly unlikely to break once written. > > > I don't entirely buy off on the argument that it's code that's 'highly > > unlikely to break once written' though- we do add new relkinds from time > > to time, for example. Perhaps we could have these functions run just > > once per relkind. > > Well, the relevant code is likely to be "if relkind is not x, y, or z, > then PG_RETURN_NULL". If we add a new relkind and forget to consider the > function, the outcome is a NULL result that perhaps should not have been > NULL ... but a test like this won't help us notice that. I just happened to come across the result of this rationale in pg_partition_tree() (an SRF) while developing a new related function, pg_partition_ancestors(), and find the resulting behavior rather absurd -- it returns one row with all NULL columns, rather than no rows. I think the sensible behavior would be to do SRF_RETURN_DONE() before stashing any rows to the output, so that we get an empty result set instead. alvherre=# select * from pg_partition_tree('information_schema.sequences'); relid | parentrelid | isleaf | level ---+-++--- | || (1 fila) -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Remove Deprecated Exclusive Backup Mode
> On Feb 26, 2019, at 11:38, Magnus Hagander wrote: > That should not be a wiki page, really, that should be part of the main > documentation. I was just suggesting using a wiki page to draft it before we drop it into the main documentation. I'm open to other suggestions, of course! -- -- Christophe Pettus x...@thebuild.com
Re: query logging of prepared statements
On Fri, Feb 15, 2019 at 08:57:04AM -0600, Justin Pryzby wrote: > I propose that the prepared statement associated with an EXECUTE should be > included in log "DETAIL" only when log_error_verbosity=VERBOSE, for both SQL > EXECUTE and PQexecPrepared (if at all). I'd like to be able to continue > logging DETAIL (including bind parameters), so want like to avoid setting > "TERSE" just to avoid redundant messages. (It occurs to me that the GUC > should > probably stick to its existing documented behavior rather than be extended, > which suggests that the duplicitive portions of the logs should simply be > removed, rather than conditionalized. Let me know what you think). I'm attaching a v2 patch which avoids repeated logging of PREPARE, rather than making such logs conditional on log_error_verbosity=VERBOSE, since log_error_verbosity is documented to control whether these are output: DETAIL/HINT/QUERY/CONTEXT/SQLSTATE. For SQL EXECUTE, excluding "detail" seems reasonable (perhaps for log_error_verbosityhttps://www.postgresql.org/docs/current/runtime-config-logging.html |Controls the amount of detail written in the server log for each message that |is logged. Valid values are TERSE, DEFAULT, and VERBOSE, each adding more |fields to displayed messages. TERSE excludes the logging of DETAIL, HINT, |QUERY, and CONTEXT error information. VERBOSE output includes the SQLSTATE |error code (see also Appendix A) and the source code file name, function name, |and line number that generated the error. Only superusers can change this |setting. As I mentioned in my original message, it seems odd that for SQL EXECUTE, the PREPARE is shown in "detail", but for the library call, it's shown in "message". This patch resolves that inconsistency by showing it in neither. >From d60957d9cd1108f389dde0125fadee71a96b4229 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Sat, 9 Feb 2019 19:20:43 -0500 Subject: [PATCH v2] Avoid repetitive log of PREPARE during EXECUTE of prepared statements --- src/backend/tcop/postgres.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 8b4d94c9a1..dac5362c81 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -1041,8 +1041,7 @@ exec_simple_query(const char *query_string) { ereport(LOG, (errmsg("statement: %s", query_string), - errhidestmt(true), - errdetail_execute(parsetree_list))); + errhidestmt(true))); was_logged = true; } @@ -2062,14 +2061,13 @@ exec_execute_message(const char *portal_name, long max_rows) if (check_log_statement(portal->stmts)) { ereport(LOG, -(errmsg("%s %s%s%s: %s", +(errmsg("%s %s%s%s", execute_is_fetch ? _("execute fetch from") : _("execute"), prepStmtName, *portal_name ? "/" : "", - *portal_name ? portal_name : "", - sourceText), + *portal_name ? portal_name : ""), errhidestmt(true), errdetail_params(portalParams))); was_logged = true; @@ -2150,15 +2148,14 @@ exec_execute_message(const char *portal_name, long max_rows) break; case 2: ereport(LOG, - (errmsg("duration: %s ms %s %s%s%s: %s", + (errmsg("duration: %s ms %s %s%s%s", msec_str, execute_is_fetch ? _("execute fetch from") : _("execute"), prepStmtName, *portal_name ? "/" : "", - *portal_name ? portal_name : "", - sourceText), + *portal_name ? portal_name : ""), errhidestmt(true), errdetail_params(portalParams))); break; -- 2.12.2
Re: New vacuum option to do only freezing
On 2/27/19, 2:08 AM, "Masahiko Sawada" wrote: >> + if (skip_index_vacuum) >> + appendStringInfo(&buf, ngettext("%.0f tuple is left as >> dead.\n", >> + >>"%.0f tuples are left as dead.\n", >> + >>nleft), >> +nleft); >> >> I think we could emit this metric for all cases, not only when >> DISABLE_INDEX_CLEANUP is used. > > I think that tups_vacuumed shows total number of vacuumed tuples and > is already shown in the log message. The 'nleft' counts the total > number of recorded dead tuple but not counts tuples are removed during > HOT-pruning. Is this a valuable for users in non-DISABLE_INDEX_CLEANUP > case? I think it is valuable. When DISABLE_INDEX_CLEANUP is not used or it is used for a relation with no indexes, it makes it clear that no tuples were left marked as dead. Also, it looks like all of the other information here is provided regardless of the options used. IMO it is good to list all of the stats so that users have the full picture of what VACUUM did. Nathan
Re: [HACKERS] Block level parallel vacuum
On Thu, Feb 14, 2019 at 5:17 AM Masahiko Sawada wrote: > Thank you. Attached the rebased patch. Here are some review comments. + started by a single utility command. Currently, the parallel + utility commands that support the use of parallel workers are + CREATE INDEX and VACUUM + without FULL option, and only when building + a B-tree index. Parallel workers are taken from the pool of That sentence is garbled. The end part about b-tree indexes applies only to CREATE INDEX, not to VACUUM, since VACUUM does build indexes. + Vacuum index and cleanup index in parallel + N background workers (for the detail + of each vacuum phases, please refer to . If the I have two problems with this. One is that I can't understand the English very well. I think you mean something like: "Perform the 'vacuum index' and 'cleanup index' phases of VACUUM in parallel using N background workers," but I'm not entirely sure. The other is that if that is what you mean, I don't think it's a sufficient description. Users need to understand whether, for example, only one worker can be used per index, or whether the work for a single index can be split across workers. + parallel degree N is omitted, + then VACUUM decides the number of workers based on + number of indexes on the relation which further limited by + . Also if this option Now this makes it sound like it's one worker per index, but you could be more explicit about it. + is specified multile times, the last parallel degree + N is considered into the account. Typo, but I'd just delete this sentence altogether; the behavior if the option is multiply specified seems like a triviality that need not be documented. +Setting a value for parallel_workers via + also controls how many parallel +worker processes will be requested by a VACUUM +against the table. This setting is overwritten by setting +N of PARALLEL +option. I wonder if we really want this behavior. Should a setting that controls the degree of parallelism when scanning the table also affect VACUUM? I tend to think that we probably don't ever want VACUUM of a table to be parallel by default, but rather something that the user must explicitly request. Happy to hear other opinions. If we do want this behavior, I think this should be written differently, something like this: The PARALLEL N option to VACUUM takes precedence over this option. + * parallel mode nor destories the parallel context. For updating the index Spelling. +/* DSM keys for parallel lazy vacuum */ +#define PARALLEL_VACUUM_KEY_SHARED UINT64CONST(0xFFF1) +#define PARALLEL_VACUUM_KEY_DEAD_TUPLES UINT64CONST(0xFFF2) +#define PARALLEL_VACUUM_KEY_QUERY_TEXT UINT64CONST(0xFFF3) Any special reason not to use just 1, 2, 3 here? The general infrastructure stuff uses high numbers to avoid conflicting with plan_node_id values, but end clients of the parallel infrastructure can generally just use small integers. + bool updated; /* is the stats updated? */ is -> are + * LVDeadTuples controls the dead tuple TIDs collected during heap scan. what do you mean by "controls", exactly? stores? + * This is allocated in a dynamic shared memory segment when parallel + * lazy vacuum mode, or allocated in a local memory. If this is in DSM, then max_tuples is a wart, I think. We can't grow the segment at that point. I'm suspicious that we need a better design here. It looks like you gather all of the dead tuples in backend-local memory and then allocate an equal amount of DSM to copy them. But that means that we are using twice as much memory, which seems pretty bad. 
You'd have to do that at least momentarily no matter what, but it's not obvious that the backend-local copy is ever freed. There's another patch kicking around to allocate memory for vacuum in chunks rather than preallocating the whole slab of memory at once; we might want to think about getting that committed first and then having this build on top of it. At least we need something smarter than this. -heap_vacuum_rel(Relation onerel, int options, VacuumParams *params, +heap_vacuum_rel(Relation onerel, VacuumOptions options, VacuumParams *params, We generally avoid passing a struct by value; copying the struct can be expensive and having multiple shallow copies of the same data sometimes leads to surprising results. I think it might be a good idea to propose a preliminary refactoring patch that invents VacuumOptions and gives it just a single 'int' member and refactors everything to use it, and then that can be committed first. It should pass a pointer, though, not the actual struct. + LVState*lvstate; It's not clear to me why we need this new LVState thing. What's the motivation for that? If it's a good idea, could it be done as a separate, preparatory patch? It seems to be responsible for a lot of code churn in this patch. It
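For what it's worth, here is a minimal, self-contained sketch of the preliminary refactoring suggested above (a VacuumOptions struct with a single int member, passed by pointer); the type, flag names, and bit values are illustrative stand-ins, not the actual PostgreSQL definitions:

#include <stdio.h>

/* Stand-in flag bits; the real VACOPT_* values live in vacuum.h. */
#define VACOPT_VERBOSE  (1 << 0)
#define VACOPT_FREEZE   (1 << 1)

/* Wrap the existing bitmask in a struct so later patches can add members. */
typedef struct VacuumOptions
{
    int         flags;          /* bitmask of VACOPT_* values */
} VacuumOptions;

/* Callees take a pointer, avoiding struct copies at every call site. */
static void
vacuum_rel_sketch(const VacuumOptions *options)
{
    if (options->flags & VACOPT_VERBOSE)
        printf("verbose requested\n");
    if (options->flags & VACOPT_FREEZE)
        printf("freeze requested\n");
}

int
main(void)
{
    VacuumOptions opts = {VACOPT_VERBOSE | VACOPT_FREEZE};

    vacuum_rel_sketch(&opts);
    return 0;
}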
Re: Index INCLUDE vs. Bitmap Index Scan
On 2/27/19 6:36 AM, Markus Winand wrote: > > >> On 27.02.2019, at 00:22, Tom Lane wrote: >> >> Markus Winand writes: >>> I think Bitmap Index Scan should take advantage of B-tree INCLUDE columns, >>> which it doesn’t at the moment (tested on master as of yesterday). >> >> Regular index scans don't do what you're imagining either (i.e., check >> filter conditions in advance of visiting the heap). There's a roadblock >> to implementing such behavior, which is that we might end up applying >> filter expressions to dead rows. That could make users unhappy. >> For example, given a filter condition like "1.0/c > 0.1", people >> would complain if it still got zero-divide failures even after they'd >> deleted all rows with c=0 from their table. > > Ok, but I don’t see how this case different for key columns vs. INCLUDE > columns. > Yeah, I'm a bit puzzled by this difference too - why would it be safe for keys and not the other included columns? > When I test this with the (a, b, c) index (no INCLUDE), different > plans are produced for "c=1" (my original example) vs. "1.0/c > 0.1”. I do recall a discussion about executing expressions on index tuples during IOS (before/without inspecting the heap tuple). I'm too lazy to search for the thread now, but I recall it was somehow related to leak-proof-ness. And AFAICS none of the "div" procs are marked as leak-proof, so perhaps that's one of the reasons? Of course, this does not explain why equality conditions and such (which are generally leak-proof) couldn't be moved to the bitmap index scan. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [RFC] [PATCH] Flexible "partition pruning" hook
On 2019-02-26 19:06, Mike Palmiotto wrote: > The desired effect would be to have `SELECT * from test.partpar;` > return check only the partitions where username can see any row in the > table based on column b. This is applicable, for instance, when a > partition of test.partpar (say test.partpar_b2) is given a label with > `SECURITY LABEL on TABLE test.partpar_b2 IS 'foo';` which is exactly > the same as the b column for every row in said partition. Using this > hook, we can simply check the table label and kick the entire > partition out early on. This should greatly improve performance for > the case where you can enforce that the partition SECURITY LABEL and > the b column are the same. To rephrase this: You have a partitioned table, and you have a RLS policy that hides certain rows, and you know based on your business logic that under certain circumstances entire partitions will be hidden, so they don't need to be scanned. So you want a planner hook that would allow you to prune those partitions manually. That sounds pretty hackish to me. We should give the planner and executor the appropriate information to make these decisions, like an additional partition constraint. If this information is hidden in user-defined functions in a way that cannot be reasoned about, who is enforcing these constraints and how do we know they are actually correct? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Pluggable Storage - Andres's take
Hi, On 2019-02-27 18:00:12 +0800, Heikki Linnakangas wrote: > I haven't been following this thread closely, but I looked briefly at some > of the patches posted here: Thanks! > On 21/01/2019 11:01, Andres Freund wrote: > > The patchset is now pretty granularly split into individual pieces. > > Wow, 42 patches, very granular indeed! That's nice for reviewing, but are > you planning to squash them before committing? Seems a bit excessive for the > git history. I've pushed a number of the preliminary patches since you replied. We're down to 23 in my local count... I do plan / did squash some, but not actually that many. I find that patches after a certain size are just too hard to do the necessary final polish to, especially if they do several independent things. Keeping things granular also allows to push incrementally, even when later patches aren't quite ready - imo pretty important for a project this size. > Patches 1-4: > > * v12-0001-WIP-Introduce-access-table.h-access-relation.h.patch > * v12-0002-Replace-heapam.h-includes-with-relation.h-table..patch > * v12-0003-Replace-uses-of-heap_open-et-al-with-table_open-.patch > * v12-0004-Remove-superfluous-tqual.h-includes.patch > > These look good to me. I think it would make sense to squash these together, > and commit now. I've pushed these a while ago. > Patches 20 and 21: > * v12-0020-WIP-Slotified-triggers.patch > * v12-0021-WIP-Slotify-EPQ.patch > > I like this slotification of trigger and EPQ code. It seems like a nice > thing to do, independently of the other patches. You said you wanted to > polish that up to committable state, and commit separately: +1 on > that. I pushed the trigger patch yesterday evening. Working to finalize the EPQ patch now - I've polished it a fair bit since the version posted on the list, but it still needs a bit more. Once the EPQ patch (and two other simple preliminary ones) is pushed, I plan to post a new rebased version to this thread. That's then really only the core table AM work. > > --- a/src/include/commands/trigger.h > > +++ b/src/include/commands/trigger.h > > @@ -35,8 +35,8 @@ typedef struct TriggerData > > HeapTuple tg_trigtuple; > > HeapTuple tg_newtuple; > > Trigger*tg_trigger; > > - Buffer tg_trigtuplebuf; > > - Buffer tg_newtuplebuf; > > + TupleTableSlot *tg_trigslot; > > + TupleTableSlot *tg_newslot; > > Tuplestorestate *tg_oldtable; > > Tuplestorestate *tg_newtable; > > } TriggerData; > > Do we still need tg_trigtuple and tg_newtuple? Can't we always use the > corresponding slots instead? Is it for backwards-compatibility with > user-defined triggers written in C? Yes, the external trigger interface currently relies on those being there. I think we probably ought to revise that, at the very least because it'll otherwise be noticably less efficient to have triggers on !heap tables, but also because it's just cleaner. But I feel like I don't want more significantly sized patches on my plate right now, so my current goal is to just put that on the todo after the pluggable storage work. Kinda wonder if we don't want to do that earlier in a release cycle too, so we can do other breaking changes to the trigger interface without breaking external code multiple times. There's probably also an argument for just not breaking the interface. > (Documentation also needs to be updated for the changes in this > struct) Ah, nice catch, will do that next. Greetings, Andres Freund
Re: Segfault when restoring -Fd dump on current HEAD
> On Wed, Feb 27, 2019 at 1:32 PM Alvaro Herrera > wrote: > > > > I think it would be better to just put back the .defn = "" (etc) to the > > > ArchiveEntry calls. > > > > Then we should do this not only for defn, but for owner and dropStmt too. > > Yeah, absolutely. Done, please find the attached patch. > > I can > > update the fix patch I've sent before, if it's preferrable approach in this > > particular situation. > > I'm not sure we need those changes, since we're forced to update all > callsites anyway. I guess we can keep the part about removing null checks before using strlen, since it's going to be useless. > On Wed, Feb 27, 2019 at 10:36 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Tue, Feb 26, 2019 at 11:53 PM Alvaro Herrera > > wrote: > > > > On 2019-Feb-26, Dmitry Dolgov wrote: > > > > > Yes, it should be rather simple, we can e.g. return to the old less > > > consistent > > > NULL handling approach something (like in the attached patch), or replace > > > a NULL > > > value with an empty string in WriteToc. Give me a moment, I'll check it > > > out. At > > > the same time I would suggest to keep replace_line_endings -> > > > sanitize_line, > > > since it doesn't break compatibility. > > > > Hmm, shouldn't we modify sanitize_line so that it returns strdup(hyphen) > > when input is empty and want_hyphen, too? > > Yes, you're right. I've looked closer, and looks like I was mistaken. In the only place where it matters we anyway pass NULL after verifying noOwner: sanitized_owner = sanitize_line(ropt->noOwner ? NULL : te->owner, true); So I haven't change sanitize_line yet, but can update it if there is a strong opinion about this function. 0001-ArchiveEntry-null-handling.patch Description: Binary data
Re: Segfault when restoring -Fd dump on current HEAD
Alvaro Herrera writes: > On 2019-Feb-27, Dmitry Dolgov wrote: >> But I hope there are no objections if I'll then submit the original >> changes with more consistent null handling separately to make decision >> about them more consciously. > I think we should save such a patch for whenever we next update the > archive version number, which could take a couple of years given past > history. I'm inclined to add a comment near K_VERS_SELF to remind > whoever next patches it. +1. This isn't an unreasonable cleanup idea, but being only a cleanup idea, it doesn't seem worth creating compatibility issues for. Let's wait till there is some more-pressing reason to change the archive format, and then fix this in the same release cycle. I'd also note that given what we've seen so far, there are going to be some slow-to-flush-out null pointer dereferencing bugs from this. I'm not really eager to introduce that towards the tail end of a devel cycle. regards, tom lane
Re: Index INCLUDE vs. Bitmap Index Scan
Markus Winand writes: >> On 27.02.2019, at 02:00, Justin Pryzby wrote: >> On Tue, Feb 26, 2019 at 09:07:01PM +0100, Markus Winand wrote: >>> (As a side node: I also dislike it how Bitmap Index Scan mixes search >>> conditions and filters in “Index Cond”) >> I don't think it's mixing them; it's using index scan on leading *and* >> nonleading column. That's possible even if frequently not efficient. > The distinction leading / non-leading is very important for performance. > Other database products use different names in the execution plan so that it > is immediately visible (without knowing the index definition). Other database products don't have the wide range of index types that we do. The concepts you propose using are pretty much meaningless for non-btree indexes. EXPLAIN doesn't really know which of the index conditions will usefully cut down the index search space for the particular type, so it just lists everything that has the right form to be passed to the index AM. Note that passing a condition to the AM, rather than executing it as a filter, is generally a win when possible even if it fails to cut the portion of the index searched at all. That's because it can save visits to the heap (tying back to the original point in this thread, that we test index conditions, then heap liveness check, then filter conditions). So the planner is aggressive about pushing things into that category when it can. It might help to point out that to be an index condition, a WHERE clause has to meet tighter conditions than just whether it mentions an index column. It generally has to be of the form "index_column indexable_operator pseudo_constant" (though some index types support some other cases like "index_column IS NULL" as index conditions too). Clauses mentioning INCLUDE columns fail this test a priori, because there are no indexable operators associated with an INCLUDE column. regards, tom lane
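A small example may make the distinction concrete; the table and index are made up, and the plan shapes described in the comments are what I'd expect from the rules above, not captured EXPLAIN output:

-- Hypothetical setup: c is a non-key (INCLUDE) column.
CREATE TABLE t (a int, b int, c int);
CREATE INDEX t_a_idx ON t (a) INCLUDE (c);

-- "a = 1" has the required form (index_column indexable_operator
-- pseudo_constant), so it can be passed to the index AM as an Index Cond.
-- "c = 2" references only an INCLUDE column, which has no indexable
-- operators, so it can only ever be applied as a Filter.
EXPLAIN SELECT a, c FROM t WHERE a = 1 AND c = 2;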
Re: [HACKERS] Can ICU be used for a database's default sort order?
Hello Andrey, we would like to see ICU collations become the default for entire databases as well, so we would like to review the patch too. Unfortunately, your patch from late October does not apply to the current master. Besides that, I noticed that the patch does apply to the October master, but it results in errors when compiling without "--with-icu" and executing "make check-world": > libpq.so: Warning: undefined reference to »get_collprovider_name« > libpq.so: Warning: undefined reference to »is_valid_nondefault_collprovider« > libpq.so: Warning: undefined reference to »get_collprovider« This may be caused by your last modification: > I've added LDFLAGS_INTERNAL += $(ICU_LIBS) in libpq, but I'm not > entirely sure this is correct way to deal with complaints on ICU > functions from libpq linking. Best regards, Marius Timmer -- Westfälische Wilhelms-Universität Münster (WWU) Zentrum für Informationsverarbeitung (ZIV) Röntgenstraße 7-13 48149 Münster +49 251 83 31158 marius.tim...@uni-muenster.de https://www.uni-muenster.de/ZIV signature.asc Description: OpenPGP digital signature
Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
On Mon, Feb 25, 2019 at 4:25 AM Michael Paquier wrote: > Another thing that seems worth thinking about is a system-level GUC, > and an option in the VACUUM command to control if truncation should > happen or not. We have a lot of infrastructure to control such > options between vacuum and autovacuum, so it could be a waste to not > consider potential synergies. I don't think that a VACUUM option would be out of place, but a GUC sounds like an attractive nuisance to me. It will encourage people to just flip it blindly instead of considering the particular cases where they need that behavior, and I think chances are good that most people who do that will end up being sad. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Oddity with parallel safety test for scan/join target in grouping_planner
On Tue, Feb 26, 2019 at 7:26 AM Etsuro Fujita wrote: > The parallel safety of the final scan/join target is determined from the > grouping target, not that target, which [ is wrong ] OOPS. That's pretty embarrassing. Your patch looks right to me. I will now go look for a bag to put over my head. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Unneeded parallel safety tests in grouping_planner
On Wed, Feb 27, 2019 at 7:46 AM Etsuro Fujita wrote: > Yet another thing I noticed while working on [1] is this in > grouping_planner: > > /* > * If the input rel is marked consider_parallel and there's nothing > that's > * not parallel-safe in the LIMIT clause, then the final_rel can be > marked > * consider_parallel as well. Note that if the query has rowMarks or is > * not a SELECT, consider_parallel will be false for every relation > in the > * query. > */ > if (current_rel->consider_parallel && > is_parallel_safe(root, parse->limitOffset) && > is_parallel_safe(root, parse->limitCount)) > final_rel->consider_parallel = true; > > If there is a need to add a LIMIT node, we don't consider generating > partial paths for the final relation below (see commit > 0927d2f46ddd4cf7d6bf2cc84b3be923e0aedc52), so it seems unnecessary > anymore to assess the parallel-safety of the LIMIT and OFFSET clauses. > To save cycles, why not remove those tests from that function like the > attached? Because in the future we might want to consider generating partial_paths in cases where we don't do so today. I repeatedly made the mistake of believing that I could not bother setting consider_parallel entirely correctly for one reason or another, and I've gone through multiple iterations of fixing cases where I did that and it turned out to cause problems. I now believe that we should try to get it right in every case, whether or not we currently think it's possible for it to matter. Sometimes it matters in ways that aren't obvious, and it complicates further development. I don't think we'd save much by changing this test anyway. Those is_parallel_safe() tests aren't entirely free, of course, but they should be very cheap. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Parallel query vs smart shutdown and Postmaster death
On Tue, Feb 26, 2019 at 5:44 PM Thomas Munro wrote: > Then perhaps we could do some more involved surgery on master that > achieves smart shutdown's stated goal here, and lets parallel queries > actually run? Better ideas welcome. I have noticed before that the smart shutdown code does not disclose to the rest of the system that a shutdown is in progress: no signals are sent, and no shared memory state is updated. That makes it a bit challenging for any other part of the system to respond to the smart shutdown in a way that is, well, smart. But I guess that's not really the problem in this case. It seems to me that we could fix pmdie() to skip parallel workers; I think that the postmaster could notice that they are flagged as BGWORKER_CLASS_PARALLEL. But we'd also have to fix things so that new parallel workers could be launched during a smart shutdown, which looks not quite so simple. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: some hints to understand the plsql cursor.
On Wed, Feb 27, 2019 at 4:42 PM Andy Fan wrote: > > actually I'm hacking pg for a function like : > 1. define a select query. > 2. client ask for some data. and server reply some data. server will do > NOTHING if client doesn't ask any more.. > 3. client ask some data more data with a batch and SERVER reply some data > then. then do NOTHING. > > currently the simple "select * from t", the server will try to send the data > to client at one time which is not something I want. > > by looking into the plsql, looks it has some api like: > > fetch 10 from cursor_1; > fetch 10 from cursor_1; > > I'm lacking of the experience to hack plsql. so my question are: > 1. Does pg has some codes which act like the "ask -> reply -> ask again -> > reply again" on the server code?currently I'm not sure if the above > "fetch" really work like this. > 2. any resources or hint or suggestion to understand the "fetch" statement? I guess you are looking for these syntax? postgres=# BEGIN; BEGIN postgres=# DECLARE cur CURSOR FOR SELECT * FROM t; DECLARE CURSOR postgres=# FETCH NEXT cur; a --- 1 (1 row) postgres=# FETCH 10 cur; a --- 2 3 4 5 1 2 3 4 5 6 (10 rows) postgres=# FETCH NEXT cur; a --- 7 (1 row) postgres=# CLOSE cur; CLOSE CURSOR -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
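In case it helps, here is a rough libpq-level sketch of the same "ask -> reply -> ask again" pattern from the client side, built on DECLARE/FETCH as above; the connection string and table name are placeholders and error handling is abbreviated:

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    PQclear(PQexec(conn, "BEGIN"));
    PQclear(PQexec(conn, "DECLARE cur CURSOR FOR SELECT * FROM t"));

    for (;;)
    {
        /* Each FETCH asks the server for only the next batch of 10 rows. */
        PGresult   *res = PQexec(conn, "FETCH 10 FROM cur");
        int         ntuples;

        if (PQresultStatus(res) != PGRES_TUPLES_OK)
        {
            fprintf(stderr, "FETCH failed: %s", PQerrorMessage(conn));
            PQclear(res);
            break;
        }
        ntuples = PQntuples(res);
        for (int i = 0; i < ntuples; i++)
            printf("%s\n", PQgetvalue(res, i, 0));
        PQclear(res);

        if (ntuples == 0)       /* no more rows on the server side */
            break;
    }

    PQclear(PQexec(conn, "CLOSE cur"));
    PQclear(PQexec(conn, "COMMIT"));
    PQfinish(conn);
    return 0;
}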
Re: get_controlfile() can leak fds in the backend
On 2/27/19 2:47 AM, Michael Paquier wrote: > Hi all, > (CC-ing Joe as of dc7d70e) > > I was just looking at the offline checksum patch, and noticed some > sloppy coding in controldata_utils.c. The control file gets opened in > get_controlfile(), and if it generates an error then the file > descriptor remains open. As basic code rule in the backend we should > only use BasicOpenFile() when opening files, so I think that the issue > should be fixed as attached, even if this requires including fd.h for > the backend compilation which is kind of ugly. > > Another possibility would be to just close the file descriptor before > any error, saving appropriately errno before issuing any %m portion, > but that still does not respect the backend policy regarding open(). In fd.c I see: 8< * AllocateFile, AllocateDir, OpenPipeStream and OpenTransientFile are * wrappers around fopen(3), opendir(3), popen(3) and open(2), * respectively. They behave like the corresponding native functions, * except that the handle is registered with the current subtransaction, * and will be automatically closed at abort. These are intended mainly * for short operations like reading a configuration file; there is a * limit on the number of files that can be opened using these functions * at any one time. * * Finally, BasicOpenFile is just a thin wrapper around open() that can * release file descriptors in use by the virtual file descriptors if * necessary. There is no automatic cleanup of file descriptors returned * by BasicOpenFile, it is solely the caller's responsibility to close * the file descriptor by calling close(2). 8< According to that comment BasicOpenFile does not seem to solve the issue you are pointing out (leaking of file descriptor on ERROR). Perhaps OpenTransientFile() is more appropriate? I am on the road at the moment so have not looked very deeply at this yet though. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development signature.asc Description: OpenPGP digital signature
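To illustrate the OpenTransientFile() suggestion, here is a hedged sketch of what the backend branch of get_controlfile() could look like; the variable names are assumptions, it presumes the two-argument OpenTransientFile()/CloseTransientFile() interface, and it glosses over the fact that controldata_utils.c is also built for frontend programs, which would still need plain open()/close():

/* Backend-only fragment, not a drop-in patch: the fd returned by
 * OpenTransientFile() is registered with the current resource owner, so it
 * is closed automatically if ereport(ERROR) is raised before we get here. */
fd = OpenTransientFile(ControlFilePath, O_RDONLY | PG_BINARY);
if (fd < 0)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not open file \"%s\" for reading: %m",
                    ControlFilePath)));

if (read(fd, ControlFile, sizeof(ControlFileData)) != sizeof(ControlFileData))
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not read file \"%s\": %m", ControlFilePath)));

CloseTransientFile(fd);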
Re: [RFC] [PATCH] Flexible "partition pruning" hook
On Tue, Feb 26, 2019 at 1:06 PM Mike Palmiotto wrote: > > On Tue, Feb 26, 2019 at 1:55 AM Tsunakawa, Takayuki > wrote: > > > > From: Mike Palmiotto [mailto:mike.palmio...@crunchydata.com] > > > Attached is a patch which attempts to solve a few problems: Updated patch attached. > > > > > What concrete problems would you expect this patch to solve? What kind of > > extensions do you imagine? I'd like to hear about the examples. For > > example, "PostgreSQL 12 will not be able to filter out enough partitions > > when planning/executing SELECT ... WHERE ... statement. But an extension > > like this can extract just one partition early." > > My only application of the patch thus far has been to apply an RLS > policy driven by the extension's results. For example: > > CREATE TABLE test.partpar > ( > a int, > b text DEFAULT (extension_default_bfield('test.partpar'::regclass::oid)) > ) PARTITION BY LIST (extension_translate_bfield(b)); > > CREATE POLICY filter_select on test.partpar for SELECT > USING (extension_filter_by_bfield(b)); > > CREATE POLICY filter_select on test.partpar for INSERT > WITH CHECK (extension_generate_insert_bfield('test.partpar'::regclass::oid) > = b); > > CREATE POLICY filter_update on test.partpar for UPDATE > USING (extension_filter_by_bfield(b)) > WITH CHECK (extension_filter_by_bfield(b)); > > CREATE POLICY filter_delete on test.partpar for DELETE > USING (extension_filter_by_bfield(b)); > > The function would filter based on some external criteria relating to > the username and the contents of the b column. > > The desired effect would be to have `SELECT * from test.partpar;` > return check only the partitions where username can see any row in the > table based on column b. This is applicable, for instance, when a > partition of test.partpar (say test.partpar_b2) is given a label with > `SECURITY LABEL on TABLE test.partpar_b2 IS 'foo';` which is exactly > the same as the b column for every row in said partition. Using this > hook, we can simply check the table label and kick the entire > partition out early on. This should greatly improve performance for > the case where you can enforce that the partition SECURITY LABEL and > the b column are the same. Is this explanation suitable, or is more information required? 
Thanks, -- Mike Palmiotto Software Engineer Crunchy Data Solutions https://crunchydata.com diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c index 00f7957b3d..45c5de990f 100644 --- a/src/backend/catalog/pg_inherits.c +++ b/src/backend/catalog/pg_inherits.c @@ -24,7 +24,9 @@ #include "access/table.h" #include "catalog/indexing.h" #include "catalog/pg_inherits.h" +#include "optimizer/cost.h" #include "parser/parse_type.h" +#include "partitioning/partprune.h" #include "storage/lmgr.h" #include "utils/builtins.h" #include "utils/fmgroids.h" @@ -92,11 +94,16 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode) while ((inheritsTuple = systable_getnext(scan)) != NULL) { inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid; + + if (enable_partition_pruning && inhrelid && !partitionChildAccess_hook(inhrelid)) + continue; + if (numoids >= maxoids) { maxoids *= 2; oidarr = (Oid *) repalloc(oidarr, maxoids * sizeof(Oid)); } + oidarr[numoids++] = inhrelid; } diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index 0debac75c6..185bfc10eb 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -1067,6 +1067,13 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel, continue; } + if (enable_partition_pruning && !partitionChildAccess_hook(childRTE->relid)) + { + /* Implement custom partition pruning filter*/ + set_dummy_rel_pathlist(childrel); + continue; + } + /* * CE failed, so finish copying/modifying targetlist and join quals. * diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h index 2f75717ffb..968b0ccbe0 100644 --- a/src/include/partitioning/partprune.h +++ b/src/include/partitioning/partprune.h @@ -74,6 +74,12 @@ typedef struct PartitionPruneContext #define PruneCxtStateIdx(partnatts, step_id, keyno) \ ((partnatts) * (step_id) + (keyno)) +/* Custom partition child access hook. Provides further partition pruning given + * child OID. + */ +typedef bool (*partitionChildAccess_hook_type) (Oid childOID); +PGDLLIMPORT partitionChildAccess_hook_type partitionChildAccess_hook; + extern PartitionPruneInfo *make_partition_pruneinfo(struct PlannerInfo *root, struct RelOptInfo *parentrel, List *subpaths,
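For readers following along, a rough sketch of what an extension using the proposed hook might look like; this is written against the patch above, not a committed API, and the filtering policy is only a placeholder:

#include "postgres.h"
#include "fmgr.h"
#include "partitioning/partprune.h"

PG_MODULE_MAGIC;

void        _PG_init(void);

static partitionChildAccess_hook_type prev_hook = NULL;

/* Return false to have the planner skip (prune) this child partition. */
static bool
my_partition_child_access(Oid childOID)
{
    if (prev_hook && !prev_hook(childOID))
        return false;

    /*
     * Placeholder policy: a real extension would, e.g., look up the
     * partition's SECURITY LABEL here and compare it with the current
     * user's attributes, returning false when the whole partition is
     * known to be invisible.
     */
    return true;
}

void
_PG_init(void)
{
    prev_hook = partitionChildAccess_hook;
    partitionChildAccess_hook = my_partition_child_access;
}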
Re: readdir is incorrectly implemented at Windows
Originally bug was reported by Yuri Kurenkov: https://github.com/postgrespro/pg_probackup/issues/48 As pg_probackup rely on readdir() for listing files to backup, wrong permissions could lead to a broken backup. On 02/25/2019 06:38 PM, Konstantin Knizhnik wrote: Hi hackers, Small issue with readir implementation for Windows. Right now it returns ENOENT in case of any error returned by FindFirstFile. So all places in Postgres where opendir/listdir are used will assume that directory is empty and do nothing without reporting any error. It is not so good if directory is actually not empty but there are not enough permissions for accessing the directory and FindFirstFile returns ERROR_ACCESS_DENIED: struct dirent * readdir(DIR *d) { WIN32_FIND_DATA fd; if (d->handle == INVALID_HANDLE_VALUE) { d->handle = FindFirstFile(d->dirname, &fd); if (d->handle == INVALID_HANDLE_VALUE) { errno = ENOENT; return NULL; } } Attached please find small patch fixing the problem. -- Grigory Smolkin Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
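For what it's worth, one possible shape of the fix (similar in spirit to the attached patch, which isn't reproduced here) is to map the real Windows error into errno and treat only "no files found" as an empty directory; this is a hedged fragment of the error branch, assuming the existing _dosmaperr() helper from src/port/win32error.c:

/* Error branch only, not the whole function. */
d->handle = FindFirstFile(d->dirname, &fd);
if (d->handle == INVALID_HANDLE_VALUE)
{
    if (GetLastError() == ERROR_FILE_NOT_FOUND)
    {
        /* Genuinely empty directory: end of stream, not an error. */
        errno = 0;
        return NULL;
    }
    _dosmaperr(GetLastError());     /* e.g. EACCES for ERROR_ACCESS_DENIED */
    return NULL;
}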
pg_dump/pg_restore fail for TAR_DUMP and CUSTOM_DUMP from v94/v95/v96 to v11/master.
Hi, I got a failure in pg_dump/pg_restore as below: pg_dump/pg_restore fails with 'ERROR: schema "public" already exists' for TAR_DUMP and CUSTOM_DUMP from v94/v95/v96 to v11/master. -- Take pg_dump in v94/v95/v96: [prabhat@localhost bin]$ ./pg_dump -f /tmp/tar_dump_PG94.tar -Ft postgres -p 9000 [prabhat@localhost bin]$ ./pg_dump -f /tmp/custom_dump_PG94.sql -Fc postgres -p 9000 -- Try to restore the above dump into v11/master: [prabhat@localhost bin]$ ./pg_restore -F t -U prabhat -d db3 -p 9001 /tmp/tar_dump_PG94.tar pg_restore: [archiver (db)] Error while PROCESSING TOC: pg_restore: [archiver (db)] Error from TOC entry 6; 2615 2200 SCHEMA public prabhat pg_restore: [archiver (db)] could not execute query: ERROR: schema "public" already exists Command was: CREATE SCHEMA public; WARNING: errors ignored on restore: 1 [prabhat@localhost bin]$ ./pg_restore -F c -U prabhat -d db4 -p 9001 /tmp/custom_dump_PG94.sql pg_restore: [archiver (db)] Error while PROCESSING TOC: pg_restore: [archiver (db)] Error from TOC entry 6; 2615 2200 SCHEMA public prabhat pg_restore: [archiver (db)] could not execute query: ERROR: schema "public" already exists Command was: CREATE SCHEMA public; WARNING: errors ignored on restore: 1 Note: I am able to perform "Plain dump/restore" across the branches. -- With Regards, Prabhat Kumar Sahu Skype ID: prabhat.sahu1984 EnterpriseDB Corporation The Postgres Database Company
Unneeded parallel safety tests in grouping_planner
Hi, Yet another thing I noticed while working on [1] is this in grouping_planner: /* * If the input rel is marked consider_parallel and there's nothing that's * not parallel-safe in the LIMIT clause, then the final_rel can be marked * consider_parallel as well. Note that if the query has rowMarks or is * not a SELECT, consider_parallel will be false for every relation in the * query. */ if (current_rel->consider_parallel && is_parallel_safe(root, parse->limitOffset) && is_parallel_safe(root, parse->limitCount)) final_rel->consider_parallel = true; If there is a need to add a LIMIT node, we don't consider generating partial paths for the final relation below (see commit 0927d2f46ddd4cf7d6bf2cc84b3be923e0aedc52), so it seems unnecessary anymore to assess the parallel-safety of the LIMIT and OFFSET clauses. To save cycles, why not remove those tests from that function like the attached? Best regards, Etsuro Fujita [1] https://commitfest.postgresql.org/22/1950/ *** a/src/backend/optimizer/plan/planner.c --- b/src/backend/optimizer/plan/planner.c *** *** 2141,2155 grouping_planner(PlannerInfo *root, bool inheritance_update, final_rel = fetch_upper_rel(root, UPPERREL_FINAL, NULL); /* ! * If the input rel is marked consider_parallel and there's nothing that's ! * not parallel-safe in the LIMIT clause, then the final_rel can be marked ! * consider_parallel as well. Note that if the query has rowMarks or is ! * not a SELECT, consider_parallel will be false for every relation in the ! * query. */ ! if (current_rel->consider_parallel && ! is_parallel_safe(root, parse->limitOffset) && ! is_parallel_safe(root, parse->limitCount)) final_rel->consider_parallel = true; /* --- 2141,2152 final_rel = fetch_upper_rel(root, UPPERREL_FINAL, NULL); /* ! * If the input rel is marked consider_parallel and there's no need to add ! * a LIMIT node, then the final_rel can be marked consider_parallel as ! * well. Note that if the query has rowMarks or is not a SELECT, ! * consider_parallel will be false for every relation in the query. */ ! if (current_rel->consider_parallel && !limit_needed(parse)) final_rel->consider_parallel = true; /* *** *** 2263,2272 grouping_planner(PlannerInfo *root, bool inheritance_update, * Generate partial paths for final_rel, too, if outer query levels might * be able to make use of them. */ ! if (final_rel->consider_parallel && root->query_level > 1 && ! !limit_needed(parse)) { ! Assert(!parse->rowMarks && parse->commandType == CMD_SELECT); foreach(lc, current_rel->partial_pathlist) { Path *partial_path = (Path *) lfirst(lc); --- 2260,2269 * Generate partial paths for final_rel, too, if outer query levels might * be able to make use of them. */ ! if (final_rel->consider_parallel && root->query_level > 1) { ! Assert(!parse->rowMarks && !limit_needed(parse) && ! parse->commandType == CMD_SELECT); foreach(lc, current_rel->partial_pathlist) { Path *partial_path = (Path *) lfirst(lc);
Re: Segfault when restoring -Fd dump on current HEAD
On 2019-Feb-27, Dmitry Dolgov wrote: > > On Tue, Feb 26, 2019 at 11:53 PM Alvaro Herrera > > wrote: > > > > I think it would be better to just put back the .defn = "" (etc) to the > > ArchiveEntry calls. > > Then we should do this not only for defn, but for owner and dropStmt too. Yeah, absolutely. > I can > update the fix patch I've sent before, if it's preferrable approach in this > particular situation. I'm not sure we need those changes, since we're forced to update all callsites anyway. > But I hope there are no objections if I'll then submit the original > changes with more consistent null handling separately to make decision > about them more consciously. I think we should save such a patch for whenever we next update the archive version number, which could take a couple of years given past history. I'm inclined to add a comment near K_VERS_SELF to remind whoever next patches it. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
some hints to understand the plsql cursor.
Actually, I'm hacking pg to get behavior like this: 1. Define a select query. 2. The client asks for some data and the server replies with some data; the server will do NOTHING more until the client asks again. 3. The client asks for more data in a batch and the SERVER replies with some data, then again does nothing. Currently, with the simple "select * from t", the server will try to send all the data to the client at one time, which is not what I want. By looking into the plsql, it looks like it has some API like: fetch 10 from cursor_1; fetch 10 from cursor_1; I'm lacking the experience to hack plsql, so my questions are: 1. Does pg have some code which acts like this "ask -> reply -> ask again -> reply again" pattern in the server code? Currently I'm not sure if the above "fetch" really works like this. 2. Any resources, hints, or suggestions to understand the "fetch" statement? Thanks
Re: When is the MessageContext released?
Thank you, Andres, for your time! This context is freed with AllocSetReset rather than AllocSetDelete, which is why my breakpoint didn't catch it. On Wed, Feb 27, 2019 at 2:15 PM Andres Freund wrote: > On 2019-02-27 14:08:47 +0800, Andy Fan wrote: > > Hi : > > I run a query like "select * from t" and set the break like this: > > > > break exec_simple_query > > break MemoryContextDelete > > commands > >p context->name > >c > > end > > > > I can see most of the MemoryContext is relased, but never MessageContext, > > when will it be released? > > It's released above exec_simple_query, as it actually contains the data > that lead us to call exec_simple_query(). See the main for loop in > PostgresMain(): > > /* > * Release storage left over from prior query cycle, and > create a new > * query input buffer in the cleared MessageContext. > */ > MemoryContextSwitchTo(MessageContext); > MemoryContextResetAndDeleteChildren(MessageContext); > > Greetings, > > Andres Freund >
Re: get_controlfile() can leak fds in the backend
However, while at it, there is also the question of whether the control file should be locked when updated, eg with flock(2) to avoid race conditions between concurrent commands. ISTM that there is currently not such thing in the code, but that it would be desirable. Shouldn't be necessary - the control file fits into a single page, and writes of that size ought to always be atomic. And I also think introducing flock usage for this would be quite disproportional. Ok, fine. Note that my concern is not about the page size, but rather that as more commands may change the cluster status by editing the control file, it would be better that a postmaster does not start while a pg_rewind or enable checksum or whatever is in progress, and currently there is a possible race condition between the read and write that can induce an issue, at least theoretically. -- Fabien.
Re: New vacuum option to do only freezing
On Wed, Feb 27, 2019 at 10:02 AM Bossart, Nathan wrote: > > Sorry for the delay. I finally got a chance to look through the > latest patches. > > On 2/3/19, 1:48 PM, "Masahiko Sawada" wrote: > > On Fri, Feb 1, 2019 at 11:43 PM Bossart, Nathan wrote: > >> > >> + if (skip_index_vacuum) > >> + ereport(elevel, > >> + (errmsg("\"%s\": marked %.0f row > >> versions as dead in %u pages", > >> + > >> RelationGetRelationName(onerel), > >> + tups_vacuumed, > >> vacuumed_pages))); > >> > >> IIUC tups_vacuumed will include tuples removed during HOT pruning, so > >> it could be inaccurate to say all of these tuples have only been > >> marked "dead." Perhaps we could keep a separate count of tuples > >> removed via HOT pruning in case we are using DISABLE_INDEX_CLEANUP. > >> There might be similar problems with the values stored in vacrelstats > >> that are used at the end of heap_vacuum_rel() (e.g. tuples_deleted). > > > > Yeah, tups_vacuumed include such tuples so the message is inaccurate. > > So I think that we should not change the message but we can report the > > dead item pointers we left in errdetail. That's more accurate and > > would help user. > > > > postgres(1:1130)=# vacuum (verbose, disable_index_cleanup) test; > > INFO: vacuuming "public.test" > > INFO: "test": removed 49 row versions in 1 pages > > INFO: "test": found 49 removable, 51 nonremovable row versions in 1 > > out of 1 pages > > DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 509 > > There were 0 unused item pointers. > > Skipped 0 pages due to buffer pins, 0 frozen pages. > > 0 pages are entirely empty. > > 49 tuples are left as dead. > > CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. > > VACUUM > > This seems reasonable to me. > > The current version of the patches builds cleanly, passes 'make > check-world', and seems to work well in my own manual tests. I have a > number of small suggestions, but I think this will be ready-for- > committer soon. Thank you for reviewing the patch! > > + Assert(!skip_index_vacuum); > > There are two places in lazy_scan_heap() that I see this without any > comment. Can we add a short comment explaining why this should be > true at these points? Agreed. Will add a comment. > > + if (skip_index_vacuum) > + appendStringInfo(&buf, ngettext("%.0f tuple is left as > dead.\n", > + > "%.0f tuples are left as dead.\n", > + > nleft), > +nleft); > > I think we could emit this metric for all cases, not only when > DISABLE_INDEX_CLEANUP is used. I think that tups_vacuumed shows total number of vacuumed tuples and is already shown in the log message. The 'nleft' counts the total number of recorded dead tuple but not counts tuples are removed during HOT-pruning. Is this a valuable for users in non-DISABLE_INDEX_CLEANUP case? > > + /* > +* Remove tuples from heap if the table has no index. > If the table > +* has index but index vacuum is disabled, we don't > vacuum and forget > +* them. The vacrelstats->dead_tuples could have > tuples which became > +* dead after checked at HOT-pruning time which are > handled by > +* lazy_vacuum_page() but we don't worry about > handling those because > +* it's a very rare condition and these would not be > a large number. > +*/ > > Based on this, it sounds like nleft could be inaccurate. Do you think > it is worth adjusting the log message to reflect that, or is this > discrepancy something we should just live with? 
I think adding > something like "at least N tuples left marked dead" is arguably a bit > misleading, since the only time the number is actually higher is when > this "very rare condition" occurs. Hmm, I think it's true that we leave 'nleft' dead tuples because it includes both tuples marked as dead and tuples not marked yet. So I think the log message works. Maybe the comment leads misreading. Will fix it. > > + /* > +* Skip index vacuum if it's requested for table with indexes. In this > +* case we use the one-pass strategy and don't remove tuple storage. > +*/ > + skip_index_vacuum = > + (options & VACOPT_DISABLE_INDEX_CLEANUP) != 0 && > vacrelstats->hasindex; > > AFAICT we don't actually need to adjust this based on > vacrelstats->hasindex because we are already checking for indexes > everywhere we check for this option. What do you think about leaving > that part out? >
Re: Set fallback_application_name for a walreceiver to cluster_name
On 2019-02-21 01:36, Euler Taveira wrote: >> By default, the fallback_application_name for a physical walreceiver is >> "walreceiver". This means that multiple standbys cannot be >> distinguished easily on a primary, for example in pg_stat_activity or >> synchronous_standby_names. >> > Although standby identification could be made by client_addr in > pg_stat_activity, it could be useful in multiple clusters in the same > host. Since it is a fallback application name, it can be overridden by > an application_name in primary_conninfo parameter. Yeah, plus that doesn't help with synchronous_standby_names. >> I propose, if cluster_name is set, use that for >> fallback_application_name in the walreceiver. (If it's not set, it >> remains "walreceiver".) If someone set cluster_name to identify their >> instance, we might as well use that by default to identify the node >> remotely as well. It's still possible to specify another >> application_name in primary_conninfo explicitly. >> > I tested it and both patches work as described. Passes all tests. Doc > describes the proposed feature. Doc builds without errors. Committed, thanks! -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
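As a usage note, with this committed something along these lines becomes possible; 'standby1' is just an example name, and cluster_name still requires a restart to take effect:

-- On the standby: give the instance a name (restart required).
ALTER SYSTEM SET cluster_name = 'standby1';

-- On the primary, the walreceiver now identifies itself as 'standby1'
-- unless primary_conninfo sets application_name explicitly:
SELECT application_name, state FROM pg_stat_replication;

-- ...which also makes the name usable for synchronous replication:
ALTER SYSTEM SET synchronous_standby_names = 'standby1';
SELECT pg_reload_conf();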
Re: Pluggable Storage - Andres's take
I haven't been following this thread closely, but I looked briefly at some of the patches posted here: On 21/01/2019 11:01, Andres Freund wrote: The patchset is now pretty granularly split into individual pieces. Wow, 42 patches, very granular indeed! That's nice for reviewing, but are you planning to squash them before committing? Seems a bit excessive for the git history. Patches 1-4: * v12-0001-WIP-Introduce-access-table.h-access-relation.h.patch * v12-0002-Replace-heapam.h-includes-with-relation.h-table..patch * v12-0003-Replace-uses-of-heap_open-et-al-with-table_open-.patch * v12-0004-Remove-superfluous-tqual.h-includes.patch These look good to me. I think it would make sense to squash these together, and commit now. Patches 20 and 21: * v12-0020-WIP-Slotified-triggers.patch * v12-0021-WIP-Slotify-EPQ.patch I like this slotification of trigger and EPQ code. It seems like a nice thing to do, independently of the other patches. You said you wanted to polish that up to committable state, and commit separately: +1 on that. Perhaps do that even before patches 1-4. --- a/src/include/commands/trigger.h +++ b/src/include/commands/trigger.h @@ -35,8 +35,8 @@ typedef struct TriggerData HeapTuple tg_trigtuple; HeapTuple tg_newtuple; Trigger*tg_trigger; - Buffer tg_trigtuplebuf; - Buffer tg_newtuplebuf; + TupleTableSlot *tg_trigslot; + TupleTableSlot *tg_newslot; Tuplestorestate *tg_oldtable; Tuplestorestate *tg_newtable; } TriggerData; Do we still need tg_trigtuple and tg_newtuple? Can't we always use the corresponding slots instead? Is it for backwards-compatibility with user-defined triggers written in C? (Documentation also needs to be updated for the changes in this struct) I didn't look a the rest of the patches yet... - Heikki
Re: Why don't we have a small reserved OID range for patch revisions?
On 2019-02-08 19:14, Tom Lane wrote: > Quite a few people have used OIDs up around 8000 or 9000 for this purpose; > I doubt we need a formally reserved range for it. The main problem with > doing it is the hazard that the patch'll get committed just like that, > suddenly breaking things for everyone else doing likewise. For that reason, I'm not in favor of this. Forgetting to update the catversion is already common enough (for me). Adding another step between having a seemingly final patch and being able to actually commit it doesn't seem attractive. Moreover, these "final adjustments" would tend to require a full rebuild and retest, adding even more overhead. OID collision doesn't seem to be a significant problem (for me). -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Oddity with parallel safety test for scan/join target in grouping_planner
(2019/02/26 21:25), Etsuro Fujita wrote: > While working on [1], I noticed $Subject, that is: > > /* > * If we have grouping or aggregation to do, the topmost scan/join > * plan node must emit what the grouping step wants; otherwise, it > * should emit grouping_target. > */ > have_grouping = (parse->groupClause || parse->groupingSets || > parse->hasAggs || root->hasHavingQual); > if (have_grouping) > { > scanjoin_target = make_group_input_target(root, final_target); > --> scanjoin_target_parallel_safe = > is_parallel_safe(root, (Node *) grouping_target->exprs); > } > else > { > scanjoin_target = grouping_target; > scanjoin_target_parallel_safe = grouping_target_parallel_safe; > } > > The parallel safety of the final scan/join target is determined from the > grouping target, not that target, which would cause to generate inferior > parallel plans as shown below: > > pgbench=# explain verbose select aid+bid, sum(abalance), random() from > pgbench_accounts group by aid+bid; > QUERY PLAN > > GroupAggregate (cost=137403.01..159903.01 rows=100 width=20) > Output: ((aid + bid)), sum(abalance), random() > Group Key: ((pgbench_accounts.aid + pgbench_accounts.bid)) > -> Sort (cost=137403.01..139903.01 rows=100 width=8) > Output: ((aid + bid)), abalance > Sort Key: ((pgbench_accounts.aid + pgbench_accounts.bid)) > -> Gather (cost=10.00..24070.67 rows=100 width=8) > Output: (aid + bid), abalance > Workers Planned: 2 > -> Parallel Seq Scan on public.pgbench_accounts > (cost=0.00..20560.67 rows=416667 width=12) > Output: aid, bid, abalance > (11 rows) > > The final scan/join target {(aid + bid), abalance} is definitely > parallel safe, but the target evaluation isn't parallelized across > workers, which is not good. Attached is a patch for fixing this. I added this to the upcoming commitfest. Best regards, Etsuro Fujita